perl-unicode

Re: Perl and unicode file names

2005-02-24 07:45:48
I am working on XP. If I leave the active code page as default, when I
do dir, I get question marks for the file name. If I change the code
page to 862, for example, I get accented Latin characters.

However, no matter what I do in Perl, I get real question marks back. I
know that because I dumped the values with ord(). Each one is ASCII 63.
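
A check of this kind can be sketched roughly as follows. The helper name
dump_ords and the "." directory are placeholders chosen for illustration,
not the script that was actually run:

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a name as its decimal
# code, so a literal "?" (63) cannot be confused with a character that
# the console merely fails to display.
sub dump_ords {
    my ($name) = @_;
    return join ' ', map { ord($_) } split //, $name;
}

# "." stands in for the directory that holds the Hebrew file names.
opendir my $dir, "." or die "opendir: $!";
for my $name (readdir $dir) {
    print dump_ords($name), "\n";
}
closedir $dir;
```

If every character of a Hebrew name comes out as 63, the substitution
happened before Perl ever saw the name.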

Peter

On Thu, 2005-02-24 at 15:23 +0100, Guido Flohr wrote:
Hi,

sorry, my original reply (see below) went to the sender, not to the list.

Peter Gordon wrote:
I am using ActiveState Perl 5.008006.

I am trying on Hebrew filenames at the moment, but the program will need
to run on all languages.

The language does not matter; what matters is the charset.  Hebrew can
be encoded in Unicode/UTF-8, ISO-8859-8 or cp-whatever.  You really have
to find out which charset your file system uses.
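
To illustrate the point, here is a small sketch that uses the core
Encode module to show how one Hebrew letter turns into different bytes
under different charsets. The helper name bytes_in and the selection of
charsets are only examples:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the bytes $charset uses for $text.
sub bytes_in {
    my ($charset, $text) = @_;
    my $bytes = encode($charset, $text);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $alef = "\x{05D0}";    # HEBREW LETTER ALEF

# The same character, encoded four different ways.
for my $charset (qw(UTF-8 iso-8859-8 cp862 cp1255)) {
    printf "%-10s %s\n", $charset, bytes_in($charset, $alef);
}
```

The letter is two bytes in UTF-8 and a single byte in the 8-bit
charsets, and cp862 does not even use the same byte value as
ISO-8859-8, so guessing is not an option.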

I tried "use bytes" and still get back question marks. 

What is "back" and what are the "question marks"? Do you see "back" (the 
output of your script) in your terminal window/DOS box or in an output 
file? And are there really question marks or are they not displayed 
correctly?

Does your script throw warnings? Do you "use warnings"?

That's all the information that I have.

The information about the charset used in your input data is required. A 
simple way to find that out goes like this:

#! /usr/bin/perl

use strict;
use warnings;
use bytes;

opendir my $dir, "/path/to/dir" or die "opendir: $!";
my @files = readdir $dir;
closedir $dir;

# Write the raw, undecoded file names into a minimal HTML page.
open my $out, ">", "filelist.html" or die "open filelist.html: $!";
print $out "<html><body><ul>\n";
foreach (@files) {
      print $out "<li>$_</li>\n";
}
print $out "</ul></body></html>\n";
close $out or die "close filelist.html: $!";
__END__

Provided that you have changed the path argument to opendir in line
7, this will create a "filelist.html" in the current directory.  Open
that file in a browser and then change the encoding to some Western
European charset like iso-8859-1 or windows-1252.  In Mozilla this is
View->Character Encoding->...

If you see question marks here, then they are real, i.e. something
(readdir, the OS?) has converted the input to question marks.  Otherwise
you should see accented Western European characters instead of Hebrew.

Now change the encoding to utf-8/Unicode.  Question marks? Then it is 
_not_ Unicode.
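
The same check can be made without a browser. This is only a sketch
under the assumption that the raw bytes of a file name are available in
a scalar; looks_like_utf8 is a made-up helper name, while decode and
FB_CROAK come from the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 1 if $bytes decode cleanly as UTF-8,
# 0 if the decoder rejects them as malformed.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

print looks_like_utf8("\xD7\x90") ? "valid" : "malformed", "\n";
print looks_like_utf8("\xE0\xE1") ? "valid" : "malformed", "\n";
```

Keep in mind that plain ASCII names also pass this test, so it only
tells you something for names that contain bytes above 127.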

Change it to some Hebrew character set.  You see Hebrew? Then you have
an 8-bit Hebrew character set, probably IBM-862 or ISO-8859-8.

Both utf-8 and the 8-bit character sets only show question marks or
empty boxes? Then your font probably lacks the Hebrew glyphs.

You can make the test again with "use utf8" and compare the results.

What is your script supposed to do? If you just want to pass data from 
here to there, you have no problem.  But if you want to process it 
together with data from other languages, you have to make sure that all 
data is converted to Unicode internally.
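
A minimal sketch of that conversion with the core Encode module follows.
The choice of cp862 as input charset is purely an assumption for the
example (use whatever the test above reveals), and to_internal and
to_output are made-up helper names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the names arrive as cp862 bytes.
my $input_charset = 'cp862';

# Decode incoming bytes into Perl's internal Unicode strings ...
sub to_internal {
    my ($bytes) = @_;
    return decode($input_charset, $bytes);
}

# ... and encode again only when the data leaves the program.
sub to_output {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $raw  = "\x80\x81\x82";    # alef, bet, gimel as cp862 bytes
my $text = to_internal($raw);
printf "%d characters, first is U+%04X\n", length($text), ord($text);
```

Everything between the two calls then works on characters rather than
bytes, no matter which charset each piece of input came from.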

Guido

My original reply below:

The problem is that filenames, when read with opendir, are returned as
question marks. In the DOS box I have set the codepage to 862. So DIR
returns accented characters, but Perl still returns question marks. I
have also set "use utf8", but that didn't help either.

Are the filenames really in UTF-8? If not, you would need "use bytes"
instead of "use utf8".  If that does not help, you should give more
detailed information: Which Perl version? Which character sets are
actually used in the filenames?


So the problem I have is how to proceed. Should I give up with Perl and
use Java or C? Any suggestions gratefully received.

Do you want to blackmail us? ;-)

Regards,
Guido


-- 
Peter Gordon
Phone: +972 544 438029
Email: peter(_at_)pg-consultants(_dot_)com
Web: www.pg-consultants.com

