perl-unicode

Website encoding

2004-11-17 17:30:07
Please forgive this going to both lists but I'm not sure where things
are going wrong...

I have many websites around the world that I need to index. They're
static HTML pages rather than Perl-served, so the HTTP headers give the
Content-Type as 'text/html' without mentioning the encoding.

The source of said pages has a meta-tag that sets the charset:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
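
For illustration, grabbing the charset from such a tag could be sketched
roughly as below. This is only a minimal sketch: the helper name
charset_from_meta and the regex are assumptions made for the example, not
the code actually in use, and a real HTML parser would be more robust
than a regex.

```perl
use strict;
use warnings;

# Hypothetical helper (name and regex are illustrative assumptions):
# return the charset named in a <meta ... charset=...> tag, or an
# empty list / undef if none is found.
sub charset_from_meta {
    my ($html) = @_;
    if ($html =~ /<meta\s[^>]*?charset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }
    return;
}

my $html = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
my $charset = charset_from_meta($html);
print defined $charset ? "$charset\n" : "no charset declared\n";
# prints "ISO-8859-1"
```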

The page then contains text in the language of its author.

I have several problems (or really one problem with multiple questions).

The task is to retrieve the title, meta description and meta keywords,
store them in a MySQL (4.1) database and then later retrieve them and
put them all on one page.

My thought process is to convert them into utf8 and store that in the
database. Then it's just a case of retrieving them later and outputting
them all on one page marked as utf8.

That being the case, I grab the charset and use Encode's decode function
to turn the page text into 'Perl's internal format', which in 5.8.5 is
utf8, right? I then store that in the db.
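
A minimal sketch of that decode step, assuming the page bytes really are
in the declared charset. The sample string and variable names are
illustrative only. The explicit encode to UTF-8 at the end is an addition
for the example and is not something described above; it just shows the
difference between a decoded character string and UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from the page:
# "Gr\xFC\xDFe" is "Gruesse" with u-umlaut and sharp s in ISO-8859-1.
my $raw_bytes = "Gr\xFC\xDFe";
my $charset   = 'ISO-8859-1';   # assumed to come from the meta tag

# Bytes in the declared charset -> Perl character string.
my $text = decode($charset, $raw_bytes);

# Character string -> UTF-8 encoded bytes (illustrative extra step).
my $utf8_bytes = encode('UTF-8', $text);

printf "%d characters, %d UTF-8 bytes\n", length($text), length($utf8_bytes);
# prints "5 characters, 7 UTF-8 bytes"
```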

However it's not working.

Does that mean that the encoding of the actual characters on the page is
not in the charset in the meta tag? Or am I missing some piece of the
puzzle?

A random example page would be 
http://www.reitsport-schill.de/index1053542873.html

This page is in German and *says* the charset is ISO-8859-1. However the
characters with umlauts are displayed as unknown characters in a page
tagged as utf8.

(You can see the result at
http://www.santu.com/search.pl?q=Das+mit+Kompetenz+l%3Ade )

What am I doing wrong? Someone please help me!

Cheers!
Rick Measham



