Translation memory mining from the Europarl corpus

We came up with a little challenge over Christmas, while ruminating. The challenge was to devise a process that automatically identifies noun phrases that are actually used in one language and correlates them with their counterparts in another. This seems simple enough for a human but is a bit challenging for a computer.

To set the stage: we already have a system (the NeuroCollective) that contains a sizable English-German dictionary, with mechanisms to map all morphological variants to the proper senses and translations. What we are looking for is the next step: to identify 2-grams in one language and correlate them with actually used 2-grams in the other language, without adding nonsensical garbage.

As the translation source we use the readily available Europarl parallel corpus in English and German.

The files we downloaded contain about 345,000 distinct German words (in all kinds of morphological variations, plus a little garbage and some misspellings) and about 111,000 English words (also with some garbage).

The first task was to sort out the noun variants in the German word list and reduce the list to the corresponding stems (not really stems in the linguistic sense, but close enough). The NeuroCollective was able to convert the raw German word list into a list of German-English stem pairs whenever the word could fill a noun-type syntactic role.

The next step was to select every occurrence of the noun paired with a preceding word. The pattern used was: egrep -o "[A-Za-z]* $noun ", in case anyone is interested in following this process. We do this on the German corpus with the German noun and on the English corpus with the corresponding English noun. We end up with two lists of word pairs: L1 (German) and L2 (English).
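For anyone who prefers to do this step in code rather than with egrep, here is a minimal Python sketch of the same extraction. The function name and sample text are our own for illustration; the regex mirrors the egrep pattern above.

```python
import re

def preceding_word_pairs(corpus_text, noun):
    """Collect (word, noun) 2-grams, mirroring: egrep -o "[A-Za-z]* $noun "."""
    # A run of letters, a space, the noun itself, and a trailing space,
    # just like the egrep pattern above.
    pattern = re.compile(r"\b([A-Za-z]+) " + re.escape(noun) + r" ")
    return [(m.group(1), noun) for m in pattern.finditer(corpus_text)]

pairs = preceding_word_pairs("the green union supports a green union today", "union")
# pairs == [("green", "union"), ("green", "union")]
```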

The two lists of word pairs are then fed back into a small program utilizing the NeuroCollective, where the syntactic role of the first word of each L1 phrase is analyzed; if it is an adjective, the pair is stored in a hash.
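The L1 filtering step can be sketched as follows. The NeuroCollective's syntactic-role lookup is proprietary, so `pos_of()` below is a hypothetical stub standing in for it; the toy entries are ours.

```python
# Hypothetical stand-in for the NeuroCollective's syntactic-role lookup.
POS = {"grüne": "adjective", "europäische": "adjective", "die": "article"}

def pos_of(word):
    return POS.get(word.lower(), "unknown")

def build_l1_hash(l1_pairs):
    """Keep only L1 pairs whose first word is an adjective, keyed by (adj, noun)."""
    kept = {}
    for first, noun in l1_pairs:
        if pos_of(first) == "adjective":
            kept[(first, noun)] = True
    return kept

l1_hash = build_l1_hash([("grüne", "Union"), ("die", "Union")])
# l1_hash == {("grüne", "Union"): True}
```

The article pair ("die", "Union") is discarded; only the adjectival pair survives into the hash.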

The second list, L2, is processed in a similar manner. If an adjective is detected, it is translated using the known data of the NeuroCollective. We then check whether the translation (L1') is in our prior L1 hash. If so, we have identified an English adjectival noun phrase and a corresponding German adjectival noun phrase that is actually used.
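The L2 pass and the final check against the L1 hash can be sketched like this. Again, `translate_adj()` is a hypothetical stand-in for the NeuroCollective's translation data, with a toy dictionary of our own.

```python
# Hypothetical stand-in for the NeuroCollective's adjective translations.
ADJ_EN_TO_DE = {"green": "grüne", "european": "europäische"}

def translate_adj(adj_en):
    return ADJ_EN_TO_DE.get(adj_en.lower())

def match_phrases(l2_pairs, l1_hash, noun_de):
    """Return (English phrase, German phrase) pairs confirmed by the L1 hash."""
    matches = []
    for adj_en, noun_en in l2_pairs:
        adj_de = translate_adj(adj_en)              # the L1' candidate
        if adj_de and (adj_de, noun_de) in l1_hash:
            matches.append(((adj_en, noun_en), (adj_de, noun_de)))
    return matches

found = match_phrases([("green", "union"), ("strong", "union")],
                      {("grüne", "Union"): True}, "Union")
# found == [(("green", "union"), ("grüne", "Union"))]
```

Only "green union" survives: its translated adjective lands on a pair that was actually observed in the German corpus, while "strong union" has no confirmed German counterpart in the hash.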

Now, to be clear, we are not claiming that each of these phrase pairs is necessarily used in corresponding sentences in the Europarl corpus. But rather than suggesting the individual translations of the adjective and the noun and then letting the translator sort it out, these phrases represent small pieces of accepted translation memory that can be recommended by a translation system. The NeuroCollective stores them as semantic links and recommends them when needed.

Possible expansions: anchors other than nouns (adverb/adjective, noun/noun?), and going from 2-grams to 3-grams, and beyond.


This started out using the Europarl parallel corpus, but there are plenty of other opportunities. We are currently running an extract from the German Wikipedia dump against the Google 2-grams (because we had already downloaded them). Note that this by no means generates full sentence translations. It only identifies corresponding noun-phrase translations in both languages. Which of these phrases get used in a larger translation is still to be determined.
