Re: Auto-identifying languages

On Monday 6 October 97, at 15 h 24, the keyboard of 
wwgrol(_at_)sparc01(_dot_)fw(_dot_)hac(_dot_)com (W. Wesley Groleau x4923) 
wrote:

I would like to test a method to differentiate English, Spanish, German,
and French.  Anyone who is able to provide a good-sized "typical"
collection of messages in Spanish, French, or German in


There are a lot of complete dictionaries in many languages in every Crack 
archive :-)

For French, if you want a sample of real texts, you can use the ABU 
library:

ftp://ftp.cnam.fr/pub/ABU

But, since they are only public-domain texts, they are more 
representative of the written French of one century ago than of the 
"spoken" French of the typical email message.

You can also check the archives of a daily newspaper like l'Humanité:

ftp://ftp.internatif.org/humanite/archives/

The method is based on letter frequencies.  It is very inefficient, but


Why not use an algorithm based on word frequencies?
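
A minimal sketch of one word-frequency approach, in Python: count how many 
words of a message appear in a short list of very common words for each 
language, and pick the language with the most hits. The word lists and the 
function name guess_language are illustrative assumptions, not something 
taken from this thread; a real test would derive the lists from corpora such 
as the archives mentioned above.

```python
import re

# Hypothetical, hand-picked lists of very common words per language.
# They are deliberately tiny; lists built from real corpora would be
# longer and weighted by frequency.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is", "that", "it", "with", "for", "was"},
    "french": {"le", "la", "les", "et", "des", "est", "une", "dans", "que", "pour"},
    "spanish": {"el", "la", "los", "las", "y", "es", "una", "por", "con", "para"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "den", "von"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into lowercase runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in common)
        for lang, common in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No common word seen at all: refuse to guess.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Le chat est dans la maison et les enfants jouent"))
```

Very short messages, or messages mixing languages, will often be 
misclassified with lists this small; the sketch only shows the shape of the 
idea.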

Auto-identifying languages - samples needed., W. Wesley Groleau x4923
Re: Auto-identifying languages - samples needed., Stephane Bortzmeyer <=
Re: Auto-identifying languages - samples needed., Paul Castro
