procmail

Re: Auto-identifying languages - samples needed.

1997-10-07 01:40:00
On Monday 6 October 97, at 15 h 24, the keyboard of 
wwgrol(_at_)sparc01(_dot_)fw(_dot_)hac(_dot_)com (W. Wesley Groleau x4923) 
wrote:

I would like to test a method to differentiate English, Spanish, German,
and French.  Anyone who is able to provide a good-sized "typical"
collection of messages in Spanish, French, or German in

There are plenty of complete dictionaries, in many languages, in every 
Crack archive :-)

For French, if you want a sample of real texts, you can use the ABU 
library:

ftp://ftp.cnam.fr/pub/ABU

But since these are only public-domain texts, they are more 
representative of the written French of a century ago than of the 
"spoken" French of a typical email message.

You can also check the archives of a daily newspaper such as l'Humanité:

ftp://ftp.internatif.org/humanite/archives/

The method is based on letter frequencies.  It is very inefficient, but

Why not use an algorithm based on word frequencies?
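
To make the word-frequency suggestion concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method discussed in the quoted message: the COMMON_WORDS lists are hypothetical, hand-picked function words, and the name guess_language is invented for this example. A real filter would derive its word lists from a corpus such as the ones mentioned above.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lists of very common function words.
# These are illustrative assumptions; a real system would build
# them from a large sample of text in each language.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "french": {"le", "la", "les", "et", "de", "des", "un", "une", "est", "que"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit"},
    "spanish": {"el", "la", "los", "las", "y", "de", "que", "es", "un", "una"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into lowercase runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # Score each language by the total occurrences of its common words.
    scores = {
        lang: sum(counts[w] for w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No common word matched at all: refuse to guess.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Le chat est dans le jardin et il mange une pomme"))
```

Counting a few dozen function words needs only one pass over the message, and such words tend to appear even in short texts. Some words are shared between languages ("la", "de", "que"), so very short messages can still be ambiguous.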


