...making Linux just a little more fun!

Followup: [Apertium-stuff] Par iaith newydd: apertium-cy-en / New language pair: apertium-cy-en

Jimmy O'Regan [joregan at gmail.com]


Fri, 1 Aug 2008 18:26:29 +0100

---------- Forwarded message ----------

From: Francis Tyers <ftyers@prompsit.com>
Date: 2008/8/1
Subject: [Apertium-stuff] Par iaith newydd: apertium-cy-en / New
language pair: apertium-cy-en
To: apertium-stuff@lists.sourceforge.net
Cc: Dafydd Jones <dafyddj@gmail.com>, "John D. Phillips"
<john@john.hmt.yamaguchi-u.ac.jp>

(Saesneg isod / English below)

Rydem newydd ryddhau par iaith newydd ar gyfer Cymraeg i Saesneg, apertium-cy-en. Y bwriadau penodol ar gyfer y fersiwn yma oedd:

* I alluogi i ddysgwyr canfod beth yw testun newyddion cyffredinol.
* I alluogi canfod pwy ddywedodd be wrth bwy.
* I alluogi gwahaniaethu a yw eitem benodol yn ddigon diddorol i gael ei chyfieithu'n iawn.
* Dylai brawddegau o tua 5 o eiriau cael ei gyfieithu'n weddol dda o Gymraeg i Saesneg.

Mi rydem yn meddwl ein bod wedi rhagori'r bwriadau yma cryn lawer ac yr ydym yn eitha hapus efo'r canlyniadau. Mae Cymraeg i Saesneg yn par iaith gymhleth gan nad yw'r ieithoedd yn perthyn yn agos, felly tra nad yw'r canlyniadau ddim beth mae pobl yn ei ddisgwyl gan barau iaith Apertium, rydem yn meddwl ein bod yn curo'r gystadleuaeth ac wedi gwneud rhywbeth a fydd pobl yn weld yn ddefnyddiol.

Hwn yw'r par iaith gyntaf i ddibynnu ar ddefodaeth Cyfyngiad Gramadeg VISL ar gyfer rhannol-diamwys o destun a ddadansoddir yn forffolegol. Gellir cael y ffynhonnell ar gyfer hyn yma: http://beta.visl.sdu.dk/download/vislcg3/ mi rydwyf hefyd wedi paratoi pecyn Debian ar gyfer hwn yma:

http://xixona.dlsi.ua.es/~fran/debian/vislcg3/

Mae pecyn Debian ar gyfer y par iaith a'r fersiwn newydd o lttoolbox ac Apertium hefyd ar gael yma:

http://xixona.dlsi.ua.es/~fran/debian/apertium-cy-en/
http://xixona.dlsi.ua.es/~fran/debian/lttoolbox/
http://xixona.dlsi.ua.es/~fran/debian/apertium/

Mi fyddai'n cael rhain i Debian mor gynted a sydd bosibl ar ôl yr arhosiad.

Derbynnir unrhyw ymatebion, profi, cwestiynnau, a sylwadau. Gwnawn ddatganiad i'r wasg hwyrach ymlaen ond ar y foment dyma ychydig o ystadegau isod:

Fran

==Ystadegau==

;Ymdruniaeth:

Wicipedia Cymraeg[1] (615,238 o eiriau): 84.8%
PNAW[2] (11,338,509 o eiriau): 95.7%
Newyddion BBC[3] (127,948): 91.2%

;Geiriau:

Dadansoddydd Cymraeg: 10,497 lemata
Geiriadur dwyieithog: 11,083 gohebyddion

;Rheolau:

Cam 1 (chunk): 72
Cam 2 (inter-chunk): 31
Cam 3 (post-chunk): 9

;Nodiadau

1. http://cy.wikipedia.org/
2. http://xixona.dlsi.ua.es/corpora/UAGT-PNAW/
3. http://news.bbc.co.uk/welsh/

**********************************************************************

We've just released a new language pair, for Welsh to English, apertium-cy-en. The stated release goals for this version were:

* For a non-native speaker to be able to discern the topic of a general news item.
* To be able to identify who said what to whom.
* To be able to distinguish whether a particular item is interesting enough to be translated properly.
* Sentences of up to 5 words should be translated reasonably well from Welsh to English.

We think we've surpassed these goals quite considerably and are quite happy with the results. Welsh to English is a difficult language pair as the languages are not closely related, so while the results might not be what people are used to with Apertium language pairs, we think we beat the competition and have made something that people will find useful.

This is the first language pair to depend on the VISL Constraint Grammar formalism for partial-disambiguation of morphologically analysed text. The source for this can be found here: http://beta.visl.sdu.dk/download/vislcg3/ and I've also prepared a Debian package which you can find here:

http://xixona.dlsi.ua.es/~fran/debian/vislcg3/

A Debian package of the language pair and the new versions of lttoolbox and apertium are also available here:

http://xixona.dlsi.ua.es/~fran/debian/apertium-cy-en/
http://xixona.dlsi.ua.es/~fran/debian/lttoolbox/
http://xixona.dlsi.ua.es/~fran/debian/apertium/

I'll get these into Debian as soon as is practical after the freeze.

Testing, questions, and comments would be well received. We'll do a full press release later, but for the moment, there are some statistics below:

Fran

==Statistics==

;Coverage:

Welsh Wikipedia[1] (615,238 words): 84.8%
PNAW[2] (11,338,509 words): 95.7%
BBC Newyddion[3] (127,948 words): 91.2%

;Lexis:

Welsh analyser: 10,497 lemmata
Bilingual dictionary: 11,083 correspondences

;Rules:

Stage 1 (chunk): 72
Stage 2 (inter-chunk): 31
Stage 3 (post-chunk): 9

;Notes

1. http://cy.wikipedia.org/
2. http://xixona.dlsi.ua.es/corpora/UAGT-PNAW/
3. http://news.bbc.co.uk/welsh/
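
To put the coverage percentages above in perspective, here is a minimal sketch that turns them into rough counts of unrecognised words. It assumes that "coverage" means the share of corpus tokens for which the analyser returns at least one analysis; the announcement does not define the term, so treat that as an assumption. The function and variable names are invented for this illustration.

```python
# Figures copied from the statistics above: (corpus, total words, coverage %).
CORPORA = [
    ("Welsh Wikipedia", 615_238, 84.8),
    ("PNAW", 11_338_509, 95.7),
    ("BBC Newyddion", 127_948, 91.2),
]


def unknown_tokens(total_words: int, coverage_percent: float) -> int:
    """Estimate how many words are left unrecognised at a given coverage.

    Assumes coverage = recognised tokens / total tokens (an assumption,
    not something stated in the announcement).
    """
    return round(total_words * (100.0 - coverage_percent) / 100.0)


def report() -> list[tuple[str, int]]:
    """Return (corpus name, estimated unknown tokens) for each corpus."""
    return [(name, unknown_tokens(total, cov)) for name, total, cov in CORPORA]


if __name__ == "__main__":
    for name, unknown in report():
        print(f"{name}: roughly {unknown:,} unknown words")
```

Under that assumption, the Wikipedia figure corresponds to roughly one unknown word in every six or seven, and the PNAW figure to fewer than one in twenty.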





Jimmy O'Regan [joregan at gmail.com]


Fri, 1 Aug 2008 19:12:14 +0100

2008/8/1 Francis Tyers <ftyers@prompsit.com>:

> We've just released a new language pair, for Welsh to English,
> apertium-cy-en. The stated release goals for this version were:
>
> * For a non-native speaker to be able to discern the topic of a general
>  news item.
> * To be able to identify who said what to who.
> * To be able to distinguish is a particular item is interesting enough
>  to be translated properly.
> * Sentences of up to 5 words should be translated reasonably well from
>  Welsh to English.
>

From the included acknowledgements:

Many thanks to Kevin Donnelly for helping to integrate his work with Eurfa and Klebran into the project, and for designing the bulk of the system of transfer rules. See http://www.eurfa.org.uk for more information on these tools.

The data for English come from the Apertium English to Catalan language pair, which was funded by the Generalitat de Catalunya (Government of Catalonia).

Many thanks to Mark Nodine for allowing his extensive Welsh--English bilingual dictionary to be used in the project. The original can be found here: http://www.cs.cf.ac.uk/fun/welsh/LexiconWE.html

Thanks also to: Prompsit Language Engineering for technical assistance, Tino Didriksen for help with VISL CG, and Dafydd Francis, Liam Tomkins, Telsa Gwynne, Thomas Thurman and others for answering so many questions.

To which I would add: thanks to Francis Tyers, who did the bulk of the implementation of this language pair, integrated VISL CG3 (which this package needs for pretagging), and extended lttoolbox to allow abbreviated, postfixed words to be processed. (More below.)

> We think we've surpassed these goals quite considerably and are quite
> happy with the results. Welsh to English is a difficult language pair as
> the languages are not closely related, so while the results might not be
> what people are used to with Apertium language pairs, we think we beat
> the competition and have made something that people will find useful.
>

In addition, this is the first release of linguistic data for Apertium that does not include a Romance language. It is also (I think) the first community-developed package, i.e., one developed in a 'bazaar'-like fashion from volunteer contributions. Hopefully, it is the first of many!

> This is the first language pair to depend on the VISL Constraint Grammar
> formalism for partial-disambiguation of morphologically analysed text.
> The source for this can be found here:
> http://beta.visl.sdu.dk/download/vislcg3/ and I've also prepared a
> Debian package which you can find here:
>
> http://xixona.dlsi.ua.es/~fran/debian/vislcg3/
>
> A Debian package of the language pair and the new versions of lttoolbox
> and apertium are also available here:
>
> http://xixona.dlsi.ua.es/~fran/debian/apertium-cy-en/
> http://xixona.dlsi.ua.es/~fran/debian/lttoolbox/
> http://xixona.dlsi.ua.es/~fran/debian/apertium/
>
> I'll get these into Debian as soon as is practical after the freeze.

The new version of lttoolbox introduces support for 'preblank' sections. The existing 'postblank' section type allows us to process, for example, the French "j'ai" as two separate words with a space inserted between them; 'preblank' does the same for the Welsh "i'r" (where "'r" is a contracted form of "yr") by prefixing a space to the contracted form.

The cy-en pair can be made to work with older versions of lttoolbox, with some degradation in translation quality, but this is not supported.
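
As a rough picture of the postblank/preblank idea described above, here is a toy sketch. It is not lttoolbox's implementation or API: the two lookup sets and the function name are hypothetical, and real dictionaries declare these forms in sections of the dictionary rather than in code. It only shows where the space ends up in each case.

```python
# Toy illustration only: these sets and this function are hypothetical and
# do not reflect lttoolbox's actual data structures or API.
POSTBLANK = {"j'"}  # forms followed by an inserted space, e.g. French "j'"
PREBLANK = {"'r"}   # forms preceded by an inserted space, e.g. Welsh "'r"


def split_contraction(token: str) -> str:
    """Insert a space so a contracted form is handled as a separate word."""
    lowered = token.lower()
    # Postblank: the contracted form comes first, the space goes after it.
    for prefix in POSTBLANK:
        if lowered.startswith(prefix) and len(token) > len(prefix):
            return token[:len(prefix)] + " " + token[len(prefix):]
    # Preblank: the contracted form comes last, the space goes before it.
    for suffix in PREBLANK:
        if lowered.endswith(suffix) and len(token) > len(suffix):
            return token[:-len(suffix)] + " " + token[-len(suffix):]
    return token


if __name__ == "__main__":
    for word in ("j'ai", "i'r", "Cymru"):
        print(repr(word), "->", repr(split_contraction(word)))
```

In this sketch "j'ai" is split after "j'" and "i'r" is split before "'r", while a word containing neither form passes through unchanged.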




Jimmy O'Regan [joregan at gmail.com]


Fri, 1 Aug 2008 19:22:09 +0100

2008/8/1 Jimmy O'Regan <joregan@gmail.com>:

> 2008/8/1 Francis Tyers <ftyers@prompsit.com>:
> The data for English come from the Apertium English to Catalan language pair,
> which was funded by the Generalitat de Catalunya (Government of Catalonia).
>

That really should have read:

The data for English come from the Apertium English to Catalan language pair, which was developed by the Transducens group at Universitat d'Alacant, and funded by the Generalitat de Catalunya (Government of Catalonia).

It's too late to fix that in this release; I hope the en-ca developers will accept our apologies and our assurance that this will be fixed in the next one.

