ietf-822

Re: internationalization of mail

2004-08-27 05:24:12

Laird Breyer writes:
I understand your point, and I agree with you. What I did have in mind was a system which uses sequences of several bytes to try to characterize languages (e.g. the sequence "qu" is more common in French than in English). I once saw such a library on the web, but I can't remember what it was called, or I would have linked to it :-(.

http://odur.let.rug.nl/~vannoord/TextCat/ is probably what you're thinking of. At least I've never seen anything better.

From memory, the idea was to scan a few lines of text trying to match those sequences, with language dependent weightings. This would give the "most probable" language, etc.

Right. That page seems to link to various papers on the subject. Also, Asmus Freytag spoke about it at IUC14, as I recall. He used character pair frequencies and wrote code that was fairly reliable after about 18 characters. His code lost accuracy again after about 44 characters, for some reason. (I think I have a copy of the talk at home, can look at it.)
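
To make the character-pair idea concrete, here is a minimal sketch in Python. It is not TextCat's method and not Freytag's code; it is just one plausible way to score text against per-language pair frequencies. The names (bigrams, train, score, guess_language), the two tiny training samples and the smoothing constant are all assumptions made up for illustration. A real system would train on far larger corpora.

```python
import math
from collections import Counter

def bigrams(text):
    """Return the overlapping character pairs of text, lowercased."""
    t = text.lower()
    return [t[i:i + 2] for i in range(len(t) - 1)]

def train(sample):
    """Build a (pair counts, total pairs) profile from a sample string."""
    counts = Counter(bigrams(sample))
    return counts, sum(counts.values())

def score(text, profile, vocab_size=10000):
    """Sum of add-one smoothed log-probabilities of the text's pairs.

    vocab_size is an assumed bound on the number of distinct pairs.
    """
    counts, total = profile
    return sum(math.log((counts[bg] + 1) / (total + vocab_size))
               for bg in bigrams(text))

def guess_language(text, profiles):
    """Return the language whose profile gives the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the children went to the house with their mother",
    "fr": "le renard brun saute par dessus le chien paresseux et "
          "les enfants vont dans la maison avec leur tante",
}
PROFILES = {lang: train(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for line in ("the house", "le chien"):
        print(line, "->", guess_language(line, PROFILES))
```

With profiles this small the guess is only as good as the overlap between the input and the samples, which is also why very short inputs are unreliable.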

Possibly you know all this already, but your post seemed to me to concentrate on single character issues.

Yes, I did, because that's where I've run into problems. Identifying the language isn't so hard, but pretty soon a name like Danuta Hübner will crop up, and what do you do then? (She's a Polish politician, and ü is not a letter one often sees in Polish.)

Arnt

