procmail

Auto-identifying languages - samples needed.

1997-10-06 13:32:10

I would like to test a method to differentiate English, Spanish, German,
and French.  Anyone who is able to provide a good-sized "typical"
collection of messages in Spanish, French, or German in
compressed/gzipped/uuencoded format, please contact me.  (Don't send the
collection without contacting me or it will bounce.)

The method is based on letter frequencies.  It is very inefficient, but
the algorithm is basically:

if    T > N then  English
elsif O > N then  Spanish
elsif I > A then  German
else              French

where T, N, I, O, A are the number of times the corresponding LOWER-CASE
letter (t, n, i, o, a) appears in the text.
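
For anyone who wants to try the rule on their own mail, here is a minimal
Python sketch of it.  The function name guess_language is made up for
illustration, and the sketch assumes the message body has already been
extracted as a plain string.  Only lower-case t, n, o, i, a are counted, and
ties fall through to the next test, exactly as the strict ">" comparisons
above imply.

```python
from collections import Counter


def guess_language(text: str) -> str:
    """Apply the four-way letter-frequency rule to a block of text.

    Hypothetical helper: counts only the lower-case letters t, n, o, i, a
    and runs the comparisons in the same order as the pseudocode.
    """
    counts = Counter(ch for ch in text if ch in "tnoia")
    t, n, o, i, a = (counts[c] for c in "tnoia")

    if t > n:
        return "English"
    elif o > n:
        return "Spanish"
    elif i > a:
        return "German"
    else:
        # Also reached for empty text or text with none of the five letters.
        return "French"


if __name__ == "__main__":
    print(guess_language("the cat sat on the mat"))
```

Note that very short inputs are not a fair test; the rule only has a chance
on a good-sized body of text, which is why samples are needed.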

Not only is it inefficient, it will NOT work on a message which is
bilingual.  Nevertheless, I am interested in testing it just to study
its reliability.

Also, if you happen to know this has already been tried, I'd like to hear
about it.

Although I have no intention of "snooping," I may glance at some of the
messages, and in Spanish or English, a glance is all it takes for me to
understand a hundred words or more.  Keep that in mind before sending
anything "private."
