ietf-822

Re: internationalization of mail

2004-08-27 17:03:06

On Aug 27 2004, Arnt Gulbrandsen wrote:

http://odur.let.rug.nl/~vannoord/TextCat/ is probably what you're 
thinking of. At least I've never seen anything better.

This looks like it, although I had remembered it as a C library, not Perl.
But I must have been mistaken. The above web page also links to its
competitors, so it's a good place to refer to.

Asmus Freytag spoke about it at IUC14, as I recall. He used character 
pair frequencies and wrote code that was fairly reliable after about 18 
characters. His code lost accuracy again after about 44 characters, for 

There's a common phenomenon with machine learning algorithms: as you
learn from more and more data, generalization accuracy first improves,
but if you keep learning forever, generalization accuracy worsens
again.

(An analogy would be learning about cats: first you learn that they
have four legs and so you are able to distinguish cats from
birds. Success! Then you learn that cats have a long tail, and you can
distinguish them from pigs. Success! Then you learn that they have
fur, and when somebody shows you a hairless cat, you fail to recognize
it).

Yes, I did, because that's where I've run into problems. Identifying the 
language's not so hard, but pretty soon a name like Danuta Hübner will 
crop up and what do you do then? (She's a Polish politician, and ü is 
not a letter one often sees in Polish.)

Good example. 
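
For concreteness, here is a minimal sketch of the character-pair
frequency idea discussed above. It is not TextCat's code and not the
code presented at IUC14; the function names, the add-one smoothing, and
the two tiny sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter

def bigrams(text):
    """Return the character pairs of text, lowercased and space-padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(samples):
    """Build one character-pair frequency profile per language."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def score(profile, text, vocab_size):
    """Log-likelihood of text under a profile, with add-one smoothing."""
    total = sum(profile.values())
    return sum(
        math.log((profile[pair] + 1) / (total + vocab_size))
        for pair in bigrams(text)
    )

def identify(profiles, text):
    """Return the language whose profile gives text the highest score."""
    # Count every pair seen in any profile, plus one slot for unseen pairs.
    vocab_size = len(set().union(*profiles.values())) + 1
    return max(profiles, key=lambda lang: score(profiles[lang], text, vocab_size))

# Toy training texts (hypothetical; a real profile needs far more data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the cat sat on the mat with the other cats",
    "pl": "nie wszystko złoto co się świeci a kto rano wstaje "
          "temu pan bóg daje",
}

profiles = train(SAMPLES)
print(identify(profiles, "the dog and the cat"))
print(identify(profiles, "kto rano wstaje"))
```

With profiles this small, a short string such as a personal name can
easily land in the wrong language, which is the problem raised above.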

-- 
Laird Breyer.

