On Aug 27 2004, Arnt Gulbrandsen wrote:
http://odur.let.rug.nl/~vannoord/TextCat/ is probably what you're
thinking of. At least I've never seen anything better.
This looks like it, although I had remembered it as a C library, not Perl.
But I must have been mistaken. The above web page also links to
competitors, so it's a good place to start.
Asmus Freytag spoke about it at IUC14, as I recall. He used character
pair frequencies and wrote code that was fairly reliable after about 18
characters. His code lost accuracy again after about 44 characters, for
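To make the character-pair idea concrete, here is a minimal sketch in Python. It is not Asmus Freytag's code and not TextCat's algorithm. The two sample sentences, the function names and the add-one smoothing are all assumptions made for illustration, and a real profile would be built from far more text.

```python
import math
from collections import Counter

# Toy training text, assumed for illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the cat sat on the mat",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze sitzt auf der matte",
}


def bigrams(text):
    """Return the adjacent character pairs of the lowercased text."""
    t = text.lower()
    return [t[i:i + 2] for i in range(len(t) - 1)]


def train(samples):
    """Count character pairs per language from {language: sample text}."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def score(text, counts, vocab_size):
    """Sum of log-probabilities of the text's pairs, add-one smoothed."""
    total = sum(counts.values())
    return sum(
        math.log((counts[pair] + 1) / (total + vocab_size))
        for pair in bigrams(text)
    )


def guess_language(text, profiles):
    """Pick the language whose pair frequencies best explain the text."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen pairs
    return max(profiles, key=lambda lang: score(text, profiles[lang], vocab_size))


profiles = train(SAMPLES)
print(guess_language("the dog and the cat", profiles))
print(guess_language("der hund und die katze", profiles))
```

With profiles this small the guess is only trustworthy on text that resembles the samples. Very short inputs, or names borrowed from another language, can easily tip the score the wrong way.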
There's a common phenomenon with machine learning algorithms
(overfitting): as the algorithm keeps learning from the data,
generalization accuracy first improves, but if it keeps learning
forever, generalization accuracy worsens again.
(An analogy would be learning about cats: first you learn that they
have four legs, and so you are able to distinguish cats from
birds. Success! Then you learn that cats have a long tail, and you can
distinguish them from pigs. Success! Then you learn that they have
fur, and when somebody shows you a hairless cat, you fail to recognize
it.)
Yes, I did, because that's where I've run into problems. Identifying the
language isn't so hard, but pretty soon a name like Danuta Hübner will
crop up, and what do you do then? (She's a Polish politician, and ü is
not a letter one often sees in Polish.)
Good example.
--
Laird Breyer.