xsl-list

Re: [xsl] Need to remove unusual character in source

2006-09-26 18:52:07
Figured it out a little while ago.

Off Topic

I'm using Ant:
<replaceregexp match="[\cX]" replace="" flags="g">
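For anyone wondering why that pattern matches the offending character: assuming Ant is running on its default java.util.regex engine, \cX denotes "control-X", the character whose code is that of the letter X with bit 0x40 flipped, which is 0x18 (CAN). A minimal Ruby sketch of that arithmetic (control_code is a hypothetical helper name, not part of Ant or Ruby):

```ruby
# Hypothetical helper: code of the control character written as \c<letter>.
# Control-<letter> is the letter's ASCII code with bit 0x40 flipped.
def control_code(letter)
  letter.upcase.ord ^ 0x40
end

# Show the code behind \cX in hexadecimal.
puts format('\\cX = 0x%02X', control_code('X'))
```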

Thanks to Abel and Michael for your input.

Mario

Quoting Abel Braaksma <abel(_dot_)online(_at_)xs4all(_dot_)nl>:

Mario Madunic wrote:
the character is a control character:

0x18 CAN
  

Unfortunately, that says it all. Most control characters (everything 
below x20 except tab x09, line feed x0A and carriage return x0D) are not 
allowed in XML 1.0, whatever the encoding, so a document that contains 
one is not well-formed.

the error message I receive is
SXXP0003: Error reported by XML parser: Illegal XML character:  &#x18;.
  

This is indeed illegal. The other day I accidentally used &#x08;, which 
is also illegal (I had mistaken it for a tab character, x09, which *is* 
legal).
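To make the x08 versus x09 distinction concrete, here is a small Ruby sketch that tests a code point against the Char production of XML 1.0. The constant and method names are made up for illustration; only the ranges come from the XML 1.0 recommendation.

```ruby
# Code point ranges of the XML 1.0 "Char" production.
XML10_CHAR_RANGES = [
  0x9..0xA, 0xD..0xD, 0x20..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
].freeze

# Hypothetical helper: true when the code point may appear in XML 1.0.
def xml10_char?(codepoint)
  XML10_CHAR_RANGES.any? { |range| range.cover?(codepoint) }
end

# Report on the three characters mentioned in this thread.
[0x08, 0x09, 0x18].each do |cp|
  puts format('x%02X legal: %s', cp, xml10_char?(cp))
end
```

Note that in XML 1.0 a character reference such as &#x18; must also resolve to a legal character, which is why escaping does not help.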

I've tried using Ant to clean it out, but with no luck using
native2ascii or escapeunicode
  

That won't help either: escaping these characters still leaves them 
illegal. But you are on the right track: use a filter to remove this 
character, or replace it with something useful. I use such a filter to 
clean up Microsoft *.msg files, which contain some useful lines, while 
the rest is control characters and other illegal data. Here's what it 
might look like if you resort to Ruby (you can call it from Ant if you 
like), see www.ruby-lang.org.

(spoiler warning: this is off-topic and only marginally related to xslt)


# create the working dir if it does not exist yet
Dir.mkdir('trimmed') unless File.directory?('trimmed')

Dir.entries('.').each do |fn|
  # only process files with the wanted extension
  next unless fn =~ /\.yourextension\z/

  # read the complete file in binary mode, so that no newline or
  # encoding translation takes place
  data = File.binread(fn)

  # write a copy with every x18 (CAN) byte removed; nothing is put in
  # its place (puts would add a newline after each chunk)
  File.binwrite("trimmed/#{fn}.txt", data.delete("\x18"))
end


Just replace "yourextension" with the extension of your file and replace 
"trimmed" with an output dirname of your choice. Replace ".txt" with 
whatever extension you would like. It runs through the current 
directory and copies all matching files to the "trimmed" directory, with 
one change: the x18 character is removed.
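If the data turns out to contain other illegal control characters besides x18, the same idea extends to all of them. The following is only a sketch under assumptions: strip_xml_controls is a hypothetical helper name, and it presumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, where bytes below x20 never occur inside a multi-byte character (it would corrupt UTF-16 input).

```ruby
# Hypothetical helper: drop every control byte that XML 1.0 forbids
# (x00-x08, x0B, x0C, x0E-x1F), keeping tab, line feed and carriage return.
# Assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1.
def strip_xml_controls(bytes)
  bytes.delete("\x00-\x08\x0B\x0C\x0E-\x1F")
end

# Small in-memory sample with two illegal characters (x18 and x08).
sample = "good\x18 data\x08\tkept\n".b
cleaned = strip_xml_controls(sample)
puts cleaned.inspect
```

The helper could take the place of the single-character filter in the script above, so that each file is cleaned in one pass.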

Of course, you can use Perl, a DOS Batch file (takes some practice), 
Bash, VBScript, PHP, Grep, Awk or any other tool you'd prefer.

HTH,

Cheers,
Abel Braaksma
http://abelleba.metacarpus.com



Can this be done, or do I need to ask the client to remove it from
their data, which might not be an option?

Any help or insight would be greatly appreciated.

Marijan Madunic

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



  

