On Sun, 26 Oct 2003, Marco Baroni wrote:
> So, I would like to ask you or anybody else: is there some kind of tool
> (e.g., a text editor) that I could use to discover which encoding is
> being used? (I tried with emacs but failed).
The only way I have successfully coped with massive amounts of data in
unknown encodings and/or languages is n-gram analysis. I used it to
determine the encoding and language of Vietnamese/English web pages a few
years ago. To do it, though, you first need good data sets in known
encodings and known languages. Then you do a statistical 'closest match'
against those known profiles to determine a text's encoding and language.
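The closest-match idea can be sketched in a few lines of Python: build byte
n-gram frequency profiles from samples in known encodings, then score an
unknown text against each profile and pick the best match. This is a minimal
illustration, not the tool I used; the training strings and labels here are
made up for the example, and a real system would use much larger reference
corpora.

```python
from collections import Counter

def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in a sample."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify(sample: bytes, known: dict) -> str:
    """Return the label of the known profile closest to the sample."""
    profile = ngram_profile(sample)
    return max(known, key=lambda label: similarity(profile, known[label]))

# Hypothetical reference data: short phrases in two different encodings,
# repeated to stand in for a real training corpus.
known = {
    "utf-8":   ngram_profile("Tiếng Việt rất hay".encode("utf-8") * 50),
    "latin-1": ngram_profile("voilà, déjà vu, naïveté".encode("latin-1") * 50),
}
print(identify("Việt Nam".encode("utf-8"), known))
```

With real data you would train one profile per (language, encoding) pair, so
the same pass identifies both at once.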
Here are some URLs that you might find useful.
http://lists.w3.org/Archives/Public/www-international/2001JulSep/0188.html
http://www.basistech.com/products/rli.html
http://odur.let.rug.nl/~vannoord/TextCat/
http://www.dougb.com/ident.html
--
Benjamin Franz
Gauss's law is always true, but it is not always useful.
-- David J. Griffiths, "Introduction to Electrodynamics"