Re: PS (Malformed UTF-8 character)

On Sun, 26 Oct 2003, Marco Baroni wrote:

> So, I would like to ask you or anybody else: is there some kind of tool
> (e.g., a text editor) that I could use to discover which encoding is
> being used? (I tried with emacs but failed).


The only approach that has worked for me on massive amounts of data in
unknown encodings and/or languages is n-gram analysis. I used it a few
years ago to determine the encoding and language of Vietnamese/English
web pages. To do it, though, you first need good data sets in known
encodings and known languages. You then statistically find the 'closest
match' between each unknown text and those reference sets to determine
its encoding and language.
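
As a rough illustration of the 'closest match' idea, here is a minimal
Python sketch. It is not the code I used back then. It assumes byte
trigrams as the n-grams and cosine similarity as the statistic (other
measures, such as rank-order distances, are also common), and the
function names and toy reference strings are invented for this example.
Real use needs far larger reference corpora for every encoding/language
pair you want to recognise.

```python
from collections import Counter
from math import sqrt


def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in raw, undecoded data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def closest_match(unknown: bytes, references: dict, n: int = 3):
    """Return (label, score) for the reference profile most similar to `unknown`."""
    profile = ngram_profile(unknown, n)
    scores = {label: cosine(profile, ref) for label, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy reference sets (hypothetical); each label is an (encoding, language) pair.
vi_text = "Tiếng Việt là ngôn ngữ chính thức của Việt Nam. " * 20
en_text = "The quick brown fox jumps over the lazy dog. " * 20
references = {
    ("utf-8", "vi"): ngram_profile(vi_text.encode("utf-8")),
    ("utf-16-le", "vi"): ngram_profile(vi_text.encode("utf-16-le")),
    ("ascii", "en"): ngram_profile(en_text.encode("ascii")),
}

# Bytes whose encoding and language we pretend not to know.
unknown = "Việt Nam là một quốc gia ở Đông Nam Á.".encode("utf-8")
label, score = closest_match(unknown, references)
print(label)
```

Working on raw bytes rather than decoded characters is what lets the same
comparison pick out the encoding as well as the language: the same text
in two encodings produces quite different byte n-grams.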

Here are some URLs that you might find useful.

http://lists.w3.org/Archives/Public/www-international/2001JulSep/0188.html

http://www.basistech.com/products/rli.html

http://odur.let.rug.nl/~vannoord/TextCat/

http://www.dougb.com/ident.html

-- 
Benjamin Franz

Gauss's law is always true, but it is not always useful.
    -- David J. Griffiths, "Introduction to Electrodynamics"
