ietf-822

Re: internationalization of mail

2004-08-27 04:47:56

On Aug 27 2004, Arnt Gulbrandsen wrote:

There are. I've worked on two, myself. The problem with them all is that 
many character sets are so similar. The values that are lower-case in 
one is often lowercase in another, the ones that are illegal in one is 
illegal in another.

 <snip>

I understand your point, and I agree with you. What I had in mind
was a system that uses sequences of several bytes to characterize
languages (e.g. the sequence "qu" is more common in French than in
English). I once saw such a library on the web, but I can't remember
what it was called, or I would have linked to it :-(  From memory, the
idea was to scan a few lines of text looking for those sequences,
applying language-dependent weightings to each match. This would give
the "most probable" language.
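
A rough sketch of the idea, not the library mentioned above: it counts
two-character sequences and sums a per-language weight for each one.
The function names, the log-probability weighting, and the penalty
value `floor` for unseen sequences are all assumptions made for
illustration.

```python
from collections import Counter
import math


def bigrams(text):
    """Count adjacent character pairs, keeping only letters and spaces."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def train(sample):
    """Build a weight table (log relative frequency) from a sample text."""
    counts = bigrams(sample)
    total = sum(counts.values())
    return {pair: math.log(n / total) for pair, n in counts.items()}


def score(text, profile, floor=-10.0):
    """Sum the weights of every pair in text; unseen pairs get `floor`."""
    return sum(n * profile.get(pair, floor) for pair, n in bigrams(text).items())


def guess_language(text, profiles):
    """Return the key of the profile with the highest score for text."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))
```

In use, one would call `train()` on a reasonably large sample of each
language and pass the resulting tables to `guess_language()` as a dict
keyed by language name. The weights only mean something if the samples
are much larger than the text being classified.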

Possibly you know all this already, but your post seemed to me to
concentrate on single character issues.

But you are right overall: this approach won't give perfect results,
and if the input text is shorter than a few lines, there may not be
enough data to run meaningful statistics at all.

-- 
Laird Breyer.