Multi-language: work in progress

From: Michele Andreoli (m.andreoli@tin.it)
Date: Thu Apr 05 2001 - 19:13:12 CEST


I managed to put a rudimentary form of fuzzy logic into the "tell" command.
This is a fascinating question, and I'm enjoying the subject.

Fuzzy (i.e. approximate) pattern matching is being introduced in muLinux
in the standard muLinux way, i.e. the rustic way :-)

You know the subject: "tell" consults a DB to find the translation of
a sentence. The DB is simply a list of pairs (x,y) in a file, where
x="rustic English sentence" and y="translated sentence". Sentences
may span several lines:

Example: a segment from it.db

=======================================================================
^B

        You may want to give the PASSWORD and/or an USER_NAME
        required by the server. Otherwise, leave blank.

^A
        Potresti voler specificare la PASSWORD e/o l'utente
        (USER_NAME) richieste dal server. Altrimenti, lascia bianco.
=======================================================================
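
A minimal sketch of reading such a file, written in Python rather than
the AWK that "tell" itself uses. It assumes that a line holding only ^B
opens an "x" sentence, that a line holding only ^A opens its "y"
translation, that the markers are stored as the two literal characters
shown in the segment, and that line breaks and indentation inside a
sentence are not significant. The name parse_db is hypothetical.

```python
def parse_db(text):
    """Return a list of (x, y) pairs from DB text.

    Assumed layout: a line holding only "^B" starts an English
    sentence (x); a line holding only "^A" starts its translation (y).
    """
    pairs = []
    x_words, y_words = [], []
    target = None  # list being filled, or None before the first marker
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "^B":
            # A new record begins: store the previous one, if any.
            if x_words or y_words:
                pairs.append((" ".join(x_words), " ".join(y_words)))
            x_words, y_words = [], []
            target = x_words
        elif stripped == "^A":
            target = y_words
        elif target is not None:
            # Collapse line breaks and indentation into single spaces.
            target.extend(stripped.split())
    if x_words or y_words:
        pairs.append((" ".join(x_words), " ".join(y_words)))
    return pairs


SAMPLE = """\
^B

        You may want to give the PASSWORD and/or an USER_NAME
        required by the server. Otherwise, leave blank.

^A
        Potresti voler specificare la PASSWORD e/o l'utente
        (USER_NAME) richieste dal server. Altrimenti, lascia bianco.
"""

for x, y in parse_db(SAMPLE):
    print(x)
    print(y)
```

If the real file stores the markers as control characters instead, only
the two string comparisons would need to change.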

The problem is this: if someone modifies "y", no problem, because "tell"
uses the "x" field to look up a translation.
But what if I modify "x" in the Setup scripts? "Tell" is then unable to
find the right "y", because sentences aren't numbered.

So I introduced approximate pattern matching in AWK: if the changes
to "x" are minor, the right "y" is still chosen.
I also introduced a "threshold": if score < threshold, the "tell"
program simply outputs "x", not "y".
"Tell" is "change-tolerant".

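The threshold rule can be sketched in Python as follows. The scoring
used here is a stand-in: it compares word lists with
difflib.SequenceMatcher from the Python standard library, not with the
AWK matching that "tell" uses. The names tell and similarity and the
threshold value 0.6 are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0, 1]: word-by-word similarity of two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def tell(sentence, pairs, threshold=0.6):
    """Return the y of the best-matching x, or the sentence itself."""
    best_score, best_y = 0.0, None
    for x, y in pairs:
        s = similarity(sentence, x)
        if s > best_score:
            best_score, best_y = s, y
    # Below the threshold, fall back to the untranslated input.
    if best_y is None or best_score < threshold:
        return sentence
    return best_y


DB = [("Otherwise, leave blank.", "Altrimenti, lascia bianco.")]
print(tell("Otherwise, leave it blank.", DB))  # small change to x
print(tell("my name is Bond", DB))             # nothing similar in the DB
```

With these inputs the first call should still find the translation,
while the second falls back to printing its own argument.
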
How does it work? (still experimental)
--------------
When you write:

                        tell "my name is Bond"

the program scans the DB and compiles a dictionary for every "x"
sentence. Every word is simply replaced with a single character A, B, C ...
The original sentence is rewritten using the new dictionary, as
a single string like ACDE, ABFH, etc.

Example: if the program finds "is Bond from England?", this sentence
is converted to "ABCD" and "my name is Bond" is converted to "CD",
because "my" and "name" are not in the dictionary.
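
A minimal Python sketch of this encoding step (the original is AWK). It
assumes that letters are handed out in order of first appearance, that
case and punctuation are ignored, and that a sentence has at most 26
distinct words; the names words, build_dictionary and encode are
hypothetical. Under these assumptions the query comes out as "AB" rather
than the "CD" quoted in the message; which letters are assigned does not
change the length of the match.

```python
import re
import string


def words(sentence):
    """Split into lowercase words, dropping punctuation (an assumption)."""
    return re.findall(r"[a-z0-9_']+", sentence.lower())


def build_dictionary(db_sentence):
    """Map each distinct word of a DB sentence to one letter: A, B, C, ..."""
    dictionary = {}
    for w in words(db_sentence):
        if w not in dictionary:
            # This sketch supports at most 26 distinct words per sentence.
            dictionary[w] = string.ascii_uppercase[len(dictionary)]
    return dictionary


def encode(sentence, dictionary):
    """Rewrite a sentence as letters, skipping words not in the dictionary."""
    return "".join(dictionary[w] for w in words(sentence) if w in dictionary)


d = build_dictionary("is Bond from England?")
print(encode("is Bond from England?", d))  # ABCD
print(encode("my name is Bond", d))        # AB
```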

At this point, I can use the fast built-in AWK functions that handle
regular expressions and wildcards, such as match(), etc.

                     match("ABCD","[CD]+")

mAWK also returns the "length of the match" (in RLENGTH), in this case 2,
and the score is set to 50%.
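
The same computation, sketched in Python with re.search standing in for
AWK's match(). The message does not say what the match length is divided
by; this sketch assumes the length of the encoded DB sentence, which
gives 2/4 = 50% for the example. The function name score is
hypothetical.

```python
import re


def score(db_code, query_code):
    """Fraction of db_code covered by the first run of query letters."""
    if not db_code or not query_code:
        return 0.0
    # Character class made of the query's letters, e.g. "[CD]+".
    pattern = "[" + re.escape(query_code) + "]+"
    # Like AWK's match(), this finds the leftmost run, which is not
    # necessarily the longest run in the string.
    m = re.search(pattern, db_code)
    if m is None:
        return 0.0
    return (m.end() - m.start()) / len(db_code)


print(score("ABCD", "CD"))  # 0.5
```

Because a character class ignores order, "DC" would score the same as
"CD" here.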

Mah!

Michele

-- 
In summing up, I wish I had some kind of affirmative message to leave
you with, I don't. Would you take two negative messages? - Woody Allen