perl-unicode

Re: PS (Malformed UTF-8 character)

2003-10-29 17:30:06
On Sun, 26 Oct 2003, Marco Baroni wrote:

> So, I would like to ask you or anybody else: is there some kind of tool
> (e.g., a text editor) that I could use to discover which encoding is
> being used? (I tried with emacs but failed).

The only way I have successfully coped with massive amounts of data in
unknown encodings and/or languages is n-gram analysis. I used it to
determine the encoding and language of Vietnamese/English web pages a few
years ago. To do it, though, you first need good data sets in known
encodings and known languages. Then you statistically find the 'closest
match' between each unknown text and those data sets to determine its
encoding and language.
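
To make the 'closest match' idea concrete, here is a minimal sketch in
Python (not the code I used back then). It builds a ranked profile of byte
trigrams for each reference text and picks the reference whose ranking is
nearest to the unknown text's, using a simple rank-difference score. The
function names, the trigram size, the profile length of 300 and the tiny
French training sentence are all assumptions made for illustration; real
use needs large reference corpora for every encoding/language pair you
expect to see.

```python
from collections import Counter
from typing import Dict, List, Tuple

Label = Tuple[str, str]  # (language, encoding); hypothetical labelling scheme


def ngram_profile(data: bytes, n: int = 3, top: int = 300) -> List[bytes]:
    """Return the `top` most frequent byte n-grams, most frequent first."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # Sort by descending count, then by the n-gram itself, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(sample: List[bytes], reference: List[bytes]) -> int:
    """Sum of rank differences; n-grams missing from the reference cost the maximum."""
    ref_rank = {gram: rank for rank, gram in enumerate(reference)}
    max_penalty = len(reference)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else max_penalty
        for rank, gram in enumerate(sample)
    )


def guess_label(data: bytes, references: Dict[Label, List[bytes]]) -> Label:
    """Return the label of the reference profile closest to `data`."""
    sample = ngram_profile(data)
    return min(references, key=lambda label: out_of_place(sample, references[label]))


# Toy reference data: the same French sentence in two known encodings.
training_text = "Le café était déjà fermé à l'heure où nous sommes arrivés."
references = {
    ("fr", "utf-8"): ngram_profile(training_text.encode("utf-8")),
    ("fr", "iso-8859-1"): ngram_profile(training_text.encode("iso-8859-1")),
}

# Bytes of "unknown" origin (here we secretly know they are ISO-8859-1).
unknown = "Il était fermé à l'heure où il est arrivé.".encode("iso-8859-1")
print(guess_label(unknown, references))
```

Working on raw bytes rather than decoded characters is the point here:
the same text yields different byte n-grams in each encoding, so the
profiles separate encodings as well as languages. With so little training
text the guess is not reliable; treat the output as a demonstration only.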

Here are some URLs that you might find useful.

http://lists.w3.org/Archives/Public/www-international/2001JulSep/0188.html

http://www.basistech.com/products/rli.html

http://odur.let.rug.nl/~vannoord/TextCat/

http://www.dougb.com/ident.html

-- 
Benjamin Franz

Gauss's law is always true, but it is not always useful.
    -- David J. Griffiths, "Introduction to Electrodynamics"

