
Apertium 2 cent tip: how to add analysis and generation of unknown words, and *why you shouldn't*

Jimmy O'Regan [joregan at gmail.com]


Thu, 1 Jan 2009 15:27:30 +0000

In my article about Apertium, I promised to follow it up with another article of a more 'HOWTO' nature. And I've been writing it. And constantly rewriting it, every time somebody asks how to do something that I think is moronic, to explain why they shouldn't do that... But I need to accept that people will always want to do stupid things, and I should just write a HOWTO.

Anyway... recently, someone asked how to implement generation of unknown words. There are only two reasons I can think of why someone would want this: either they have words in the bilingual dictionary that they don't have in the monolingual dictionary, or they want to use it in conjunction with morphological guessing.

In general, the usual method used in Apertium's translators is, if we don't know the word, we don't try to translate it -- we're honest about it, essentially. Apertium has an option to mark unknown words, which we generally recommend that people use. It doesn't cover 'hidden' unknown words, where the same word an be two different parts of speech--we're looking into how to attempt that. One result of this is that, before a release, we specifically remove some words from the monolingual dictionary if we can't add a translation.
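(For the curious: with the marks switched on, an unknown word simply passes through with an asterisk in front of it, so the 'owdzie' of the example below would come out in the English output as '*owdzie'. Ugly, but honest.)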

Anyway, in the first case, we generally write scripts to automate adding those words to the bidix. One plus of this is that it can be manually checked afterwards, and fixed. Another is that, by adding the word to the monolingual dictionary, we can also analyse it: we generally try to make bidirectional translators, but sometimes we can only make a single-direction translator--though we still have the option of adding the other direction later. And, as our translators are open source, it increases the amount of freely available linguistic data, so it's a win all round.
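To give a feel for what those scripts produce, here's a sketch of a couple of generated bidix entries; the Polish-English word pairs and the noun tag are invented for illustration, and a real script would take the part of speech from whatever word list it was fed:

<!-- sketch of script-generated entries; the word pairs are invented -->
<e><p><l>szczypiorek<s n="n"/></l><r>chive<s n="n"/></r></p></e>
<e><p><l>ogórecznik<s n="n"/></l><r>borage<s n="n"/></r></p></e>

Because that's plain XML, anyone can scan the generated file afterwards and fix a bad pairing before it's merged -- that's the manual check mentioned above.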

The latter case, of also using a morphological guesser, is one source of some of the worst translations out there. For example, at the moment, I'm translating a short story by Adam Mickiewicz, which contains the phrase 'tu i owdzie', which is either a misspelling of 'tu i ówdzie' ('here and there'), an old form, or a typesetting error[1]; in any case, the word 'owdzie' does not exist in the modern Polish language.

Translatica, the leading Polish-English translator, gave: "here and he is owdzying"

Now, if I knew nothing of Polish, that would send me scrambling to the English dictionary to search for the non-existent verb 'to owdzy'.

(Google gave: "here said". SMT is a great idea in theory, but in practice[2] it has the potential to give translations that bear no resemblance to the original meaning of the source text. Google's own method of 'augmenting' SMT by extracting correlating phrase pairs based on a pivot language also leads to extra ambiguities[3].)

Anyway. The tip, for anyone who still wants to try it:

Apertium's dictionaries support a limited subset of regular expressions; these can be used by someone who wishes to have both analysis and generation of unknown words. The <re> tag can be placed before the <par> tag, so the entry:

<e>
  <re>[a-z]*</re>
  <par n="accept__vblex"/>
</e>

will accept, and generate, any otherwise unknown word with the set of endings represented by the paradigm for the verb 'accept': -s, -ed, -ing, -0, etc. That gets more complicated when you want to do the same with verbs like 'live' or 'plug', but judicious use of regexes should get around that (see the sketch below). It's still a bad idea, though, and if anyone tries this and has poor results and for some reason feels compelled to tell me about it, expect only 'I told you so' :)
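For completeness, here's a minimal sketch of that 'judicious use': narrow the character class so that each regex feeds a matching paradigm. The classes here are guesses, not a worked-out solution, and liv/e__vblex is assumed to be defined in the monodix alongside accept__vblex:

<e>
  <re>[a-z]*[^e]</re>        <!-- guess: stems not ending in 'e' -->
  <par n="accept__vblex"/>
</e>
<e>
  <re>[a-z]*[^aeiou]</re>    <!-- guess: e-dropping stems, liv+e, liv+ing -->
  <par n="liv/e__vblex"/>
</e>

Note that the two classes still overlap: 'liv' matches both, so 'living' can come out analysed as a form of the non-word 'liv', which is exactly the kind of garbage I'm warning against.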

[1] I OCR'd and proofread the text for Project Gutenberg; that's what appears in the original text.

[2] Word reordering, case restoration, punctuation restoration, etc. are typically handled in an SMT system in a way that is functionally similar to the translation process: the phrases generated by these stages are scored against a statistical model. This can lead to words being replaced: a correct translation swapped for an incorrect one that happens to have better punctuation, and so on.

[3] The French 'Je viens de manger' ('I have just eaten') translated to 'Ja po prostu zjeść' ('I simply to eat'; 'po prostu zjedz!' is the equivalent of 'just eat it!') in Polish, because of the ambiguity of 'just' in English, which doesn't exist between French and Polish. (That's today's translation; earlier it said 'mam tylko jeść', 'I have only to eat', mixing in another ambiguity of English 'just', as in 'I have just five eggs'.)




Jimmy O'Regan [joregan at gmail.com]


Thu, 1 Jan 2009 15:35:57 +0000

2009/1/1 Jimmy O'Regan <joregan@gmail.com>:

> In general, the usual method used in Apertium's translators is, if we
> don't know the word, we don't try to translate it -- we're honest
> about it, essentially. Apertium has an option to mark unknown words,
> which we generally recommend that people use. It doesn't cover
> 'hidden' unknown words, where the same word an be two different parts

can be... I can only imagine how poorly that would translate :)

> Anyway, in the first case, we generally write scripts to automate
> adding those words to the bidix. One plus of this is that it can be

adding those words from the bidix, to the monodix.




Ben Okopnik [ben at linuxgazette.net]


Thu, 1 Jan 2009 11:06:39 -0500

On Thu, Jan 01, 2009 at 03:35:57PM +0000, Jimmy O'Regan wrote:

> 2009/1/1 Jimmy O'Regan <joregan@gmail.com>:
> > In general, the usual method used in Apertium's translators is, if we
> > don't know the word, we don't try to translate it -- we're honest
> > about it, essentially. Apertium has an option to mark unknown words,
> > which we generally recommend that people use. It doesn't cover
> > 'hidden' unknown words, where the same word an be two different parts
> 
> can be... I can only imagine how poorly that would translate :)

That would be the major downfall of machine translation: the underlying assumption (which pretty much has to be that way) is that the input makes sense in the first place. Misspellings, of course, void that: the above is an instant - you might even say automatic and thus invisible - correction for a human, but an insoluble problem for a machine.

Until someone comes up with systems that can handle context, on a fairly broad scale, mechanical translation must perforce remain limited. And even then...

"Prostitutes appeal to pope"
"Queen Mary having bottom scraped"
"Milk drinkers are turning to powder"
"I saw the Alps flying to Romania"
"The horse raced past the barn fell"
"Time flies" "You can't; they move too fast"
"Cheney hunts quail; companions duck"
"Drunk gets nine months in violin case"

:)

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jimmy O'Regan [joregan at gmail.com]


Thu, 1 Jan 2009 19:46:17 +0000

2009/1/1 Ben Okopnik <ben@linuxgazette.net>:

> On Thu, Jan 01, 2009 at 03:35:57PM +0000, Jimmy O'Regan wrote:
>> 2009/1/1 Jimmy O'Regan <joregan@gmail.com>:
>> > In general, the usual method used in Apertium's translators is, if we
>> > don't know the word, we don't try to translate it -- we're honest
>> > about it, essentially. Apertium has an option to mark unknown words,
>> > which we generally recommend that people use. It doesn't cover
>> > 'hidden' unknown words, where the same word an be two different parts
>>
>> can be... I can only imagine how poorly that would translate :)
>
> That would be the major downfall of machine translation: the underlying
> assumption (which pretty much has to be that way) is that the input
> makes sense in the first place. Misspellings, of course, void that: the
> above is an instant - you might even say automatic and thus invisible -
> correction for a human, but an insoluble problem for a machine.
>

Misspellings, orthographic variations in different regions (our Spanish-English translator still has a curious mix of American and British spellings), false derivations (we had an example of that here, recently :), archaisms, the list goes on and on. Even the presence or absence of punctuation can be significant.

> Until someone comes up with systems that can handle context, on a fairly
> broad scale, mechanical translation must perforce remain limited.

Nice phrase that, 'mechanical translation': it equally covers machine translation and, say, the collected works of Jeremiah Curtin[1] and his ilk :)

Semantic-based translation seemed to have been more or less abandoned, but I see signs of it making a comeback: a paper I read recently more or less said that the reason attempts to plug systems like WordNet into machine translators haven't yielded significantly better results is that the problem was not approached in the correct manner (all of the prior research in the area was wrong :); GramTrans make heavy use of semantic knowledge in their translators[2].

Apertium has a module called 'lextor' that uses statistically collected co-occurrences to perform lexical selection, but I don't like trusting to statistics anything that can be manually specified (our part-of-speech tagger is also statistically based, but it also accepts rules). I'm writing a new module that's strictly rule-based -- because I'm primarily interested in trying to properly translate prepositions in relation to verbs, and lextor specifically ignores prepositions (they would really screw up the statistics :) -- but it also requires changes to the main rule engine, and possibly extending the stream format, which I'd prefer to avoid.

The most promising development in SMT is the Berkeley aligner (http://code.google.com/p/berkeleyaligner/), which is open source. Instead of blindly trying to align n-grams, it aligns elements of parse trees. (Google have done some work in trying to do something similar, but they've had some difficulty in retrofitting parse trees to the n-grams they already have.)

> And
> even then...
>
> "Prostitutes appeal to pope"
> "Queen Mary having bottom scraped"
> "Milk drinkers are turning to powder"
> "I saw the Alps flying to Romania"
> "The horse raced past the barn fell"
> "Time flies" "You can't; they move too fast"
> "Cheney hunts quail; companions duck"
> "Drunk gets nine months in violin case"
>
> :)
>

Those all remind me that there's one thing a human translator can do that a computer program never can: add a footnote :)

If you'll forgive my choice of example: 'te przeklęte Moskale' -- 'those cursed Muscovites'. That's an easy, word-for-word translation, but in Polish a distinction is made in the plural between human males and everything else, so the correct form should be 'ci przeklęci Moskali'. Using the incorrect form is possibly intended either to show that the speaker has been poorly educated, or to intensify the insult by speaking of the Muscovites as 'non-men'. I've been assured that in the time the story was set[3], that form would have been grammatically correct, but the rest of the text contradicts that.

[1] Douglas Hyde wrote of him: "Mr. Curtin tells us that he has taken his tales from the old Gaelic-speaking men; but he must have done so through the awkward medium of an interpreter, for his ignorance of the commonest Irish words is as startling as Lady Wilde's." Curtin is more famous for his bad translations of Polish stories, though.

[2] http://gramtrans.com/ Their data is proprietary, but their semantic engine, CG, is open source, and used in a few translation modules in Apertium - we don't use it to its full extent yet, but we have an experimental translator that does (between two dialects of the Sami language). The main developer of our Esperanto-English translator is friends with their main developer, who has been quite helpful.

[3] The Battle of Stoczek, the first major battle of the November Uprising of 1830.

