perl-unicode

Re: Perl and unicode file names

2005-02-24 07:23:15
Hi,

sorry, my original reply (see below) went to the sender, not to the list.

Peter Gordon wrote:
I am using ActiveState Perl 5.008006.

I am trying on Hebrew filenames at the moment, but the program will need
to run on all languages.

The language does not matter; the charset does. Hebrew can be encoded as Unicode/UTF-8, ISO-8859-8, or cp-whatever. You really have to find out which charset your file system uses.

I tried "use bytes" and still get back question marks.

What is "back" and what are the "question marks"? Do you see "back" (the output of your script) in your terminal window/DOS box or in an output file? And are there really question marks or are they not displayed correctly?

Does your script throw warnings? Do you "use warnings"?
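
One way to tell real question marks from a display problem is to look at the raw bytes instead of the rendered text. The following is only a minimal sketch: hex_bytes is a hypothetical helper name, and the directory defaults to "." if no argument is given. A byte value of 3f is a literal "?", so a name that consists only of 3f bytes was already converted before Perl printed it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: render the bytes of a string as space-separated hex.
sub hex_bytes {
    my ($name) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $name;
}

# Directory to inspect; defaults to the current directory.
my $dir = @ARGV ? $ARGV[0] : '.';

opendir my $dh, $dir or die "opendir $dir: $!";
for my $name (readdir $dh) {
    # One line per directory entry, bytes only, so the terminal's
    # own charset cannot distort what you see.
    print hex_bytes($name), "\n";
}
closedir $dh;
```

Bytes of 80 and above mean that the name still carries non-ASCII data, whatever the terminal makes of it.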

That's all the information that I have.

The information about the charset used in your input data is required. A simple way to find that out goes like this:

#! /usr/bin/perl

use strict;
use warnings;
use bytes;

opendir DIR, "/path/to/dir" or die "opendir: $!";
my @files = readdir DIR;

open HANDLE, ">filelist.html" or die "open filelist.html: $!";
print HANDLE "<html><body><ul>\n";
foreach (@files) {
        print HANDLE "<li>$_</li>\n";
}
print HANDLE "</ul></body></html>\n";
__END__

Provided that you have changed the path argument to opendir in line
7, this will create a "filelist.html" in the current directory. Open that file in a browser and then change the encoding to some Western European charset like iso-8859-1 or windows-1252. In Mozilla this is View->Character Encoding->...

If you see question marks here, they are real, i.e. something (readdir, the OS?) has converted the input to question marks. Otherwise you should see accented Western European characters instead of Hebrew.

Now change the encoding to utf-8/Unicode. Question marks? Then it is _not_ Unicode.

Change it to some Hebrew character set. You see Hebrew? Then you have an 8-bit Hebrew character set, probably IBM-862 or ISO-8859-8.

Both utf-8 and 8-bit character sets only show question marks or empty boxes? Then your font probably lacks the Hebrew glyphs.
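
The browser test above can be roughly approximated in Perl itself with the Encode module that ships with Perl 5.8. This is a heuristic sketch, not a reliable detector: guess_charset is a hypothetical helper, and it can only say whether the bytes are plain ASCII, well-formed UTF-8, or some 8-bit charset. It cannot tell cp862 from iso-8859-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: classify the raw bytes of one file name.
sub guess_charset {
    my ($bytes) = @_;

    # No byte above 0x7f: plain ASCII (this includes real question marks).
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    return 'utf-8' if eval { decode('UTF-8', $copy, FB_CROAK); 1 };

    # High bytes that are not well-formed UTF-8: some 8-bit charset,
    # for example cp862 or iso-8859-8.
    return '8-bit';
}

# Directory to inspect; defaults to the current directory.
my $dir = @ARGV ? $ARGV[0] : '.';

opendir my $dh, $dir or die "opendir $dir: $!";
for my $name (readdir $dh) {
    print guess_charset($name), "\t", $name, "\n";
}
closedir $dh;
```

Short 8-bit strings can by chance be valid UTF-8, so treat a "utf-8" verdict on a single name as a hint and look at several names.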

You can make the test again with "use utf8" and compare the results.

What is your script supposed to do? If you just want to pass data from here to there, you have no problem. But if you want to process it together with data from other languages, you have to make sure that all data is converted to Unicode internally.
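
Converting to Unicode internally could look like the following minimal sketch. It assumes that the file names turned out to be cp862. The variable $fs_charset and the helper to_unicode are hypothetical names, and the sample bytes are made up for illustration. The idea is to decode once on input, work with character strings, and encode again only on output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the file names are cp862. Replace this with the charset
# that you actually found.
my $fs_charset = 'cp862';

# Hypothetical helper: turn raw file name bytes into a Perl character string.
sub to_unicode {
    my ($bytes) = @_;
    return decode($fs_charset, $bytes);
}

# Sample input: the cp862 bytes for alef, bet, gimel.
my $name = to_unicode("\x80\x81\x82");

# Internally, length() now counts characters, not bytes.
print "characters: ", length($name), "\n";

# Encode explicitly when the data leaves the program, here as UTF-8.
print encode('UTF-8', $name), "\n";
```

In a real script the argument to to_unicode would be each entry returned by readdir.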

Guido

My original reply below:

The problem is, that filenames, when using opendir, are returned as
question marks. In the DOS box I have set the codepage to 862. So DIR
returns accented characters, but Perl still returns question marks. I
have also set "use utf8", but that didn't help either.

Are the filenames really in UTF-8? If not, you would need "use bytes" instead of "use utf8". If that does not help, you should give more detailed information: Which Perl version? Which character sets are actually used in the filenames?


So the problem I have is how to proceed. Should I give up with Perl and
use Java or C? Any suggestions gratefully received.

Do you want to blackmail us? ;-)

Regards,
Guido


--
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.net/

