perl-unicode

Re: In-Band Information Considered Harmful

1998-10-22 20:48:21
Chip Salzenberg writes:

Consider: Why should it be that "<b>Hello</b> there!" no longer
matches the pattern /hello there/i ?  Wouldn't it be nice to keep the
metadata off to the side?  Then you have a much easier time of pattern
matching, and as a bonus you're no longer limited to one set of
metadata.

No, I draw exactly the opposite conclusion:

    a) the RE engine is broken wrt Unicode/whatever support.
    b) HTML is broken since the markup data is *not* inband, but
       "mixed" with the string data.

Suppose that utf8.pm knows about the screen width of chars (whatever this
means; for me, widths 0 and 1 are enough).  Suppose that <b> and </b>
above denote 0-screen-width inband data (say, encoded as utf chars
above 1<<32 which are known to be 0-width, or Unicode inband
"Language" chars).  Then

  use re_ignore_zerowidth_char;

or

  use utf8_screen_width;

(or whatever) makes /hello there/i match "<b>Hello</b> there!".  
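
To make the idea concrete, here is a rough sketch that emulates it in
plain Perl.  Neither pragma above exists; the helper
match_ignoring_zero_width() is a hypothetical name, and the two
code points standing in for <b> and </b> are an assumption chosen only
because they are zero-width format characters (general category Cf).

```perl
use strict;
use warnings;

# Assumed stand-ins for <b> and </b>: two zero-width format characters.
# These particular code points are illustrative, not part of any proposal.
my $BOLD_ON  = "\x{E0062}";   # TAG LATIN SMALL LETTER B
my $BOLD_OFF = "\x{E007F}";   # CANCEL TAG

my $text = "${BOLD_ON}Hello${BOLD_OFF} there!";

# Hypothetical helper: emulate "ignore zero-width chars" by matching
# against a copy of the string with all \p{Cf} characters removed.
sub match_ignoring_zero_width {
    my ($string, $pattern) = @_;
    (my $visible = $string) =~ s/\p{Cf}+//g;
    return $visible =~ $pattern ? 1 : 0;
}

my $ignoring = match_ignoring_zero_width($text, qr/hello there/i);
my $plain    = ($text =~ /hello there/i) ? 1 : 0;

print "ignoring zero-width: $ignoring, plain: $plain\n";
```

The plain match fails because the closing marker sits between "Hello"
and the space, while the stripped copy matches.  Stripping a copy is
only an approximation: match offsets refer to the copy, not the
original, which is one reason support inside the RE engine itself
would be preferable.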

I already raised this question here on p5p in slightly different terms,
without reference to screen width.  Now I think it pairs well with
screen-width support.

At the Conference, I was pleased to speak at length with Ted Nelson on
many subjects, and he made the point to me that one of the Xanadu
system's best features was its total separation of markup
(i.e. formatting and hyperlinks) from content.  It would have allowed
(e.g.) me to use one set of markup and (e.g.)  you to use another set,
all without duplicating or mangling the original content.  None of
this is at all feasible with HTML/XML, which would require either
duplicating content or commingling various independent sets of markup
data.

Can you provide more context/details?

Ilya