perl-unicode

Re: In-Band Information Considered Harmful

1998-10-26 21:47:33
Chip Salzenberg writes:
1. Imagine there's a new "meta(VALUE, NAME)" operator that extracts from
   VALUE the first metadata with the given NAME.  (Additional optional
   parameters could specify position and length.  But we'll get back to
   that later.)

Did you ever get back to the 4 argument meta()?

  @a = ($page =~ /(\m{a}.+)(?{ meta($1,'a')->{href} =~ /perl\.org/i })/);

This can be written without the neat (?{}) feature:

  @a = grep { meta($_, 'a')->{href} =~ /perl\.org/i }
            ($page =~ /(?:\m{a}.)+/g);

If =~ returned a stream then efficiency should be about the same. <grin>
Anyway, I think it's useful to simplify the example so that we're not
talking about so many things at once.

I didn't see the definition of \m{} either, so here's the one I assume:

  \m{ATTR} -- zero width match that requires the current position to
              have the metadata attribute ATTR defined

A zero width match is consistent with having utf8_with_width turned on
and certainly makes sense with out-of-band attributes.  (Width is not
length!)

This could be implemented using a four argument form of meta():

  meta(STR, ATTR, OFFSET, LEN)
  meta(STR, ATTR, OFFSET)
  meta(STR, ATTR)

    Find the first metadata object matching ATTR that applies to STR
    between OFFSET and OFFSET+LEN.  If LEN is omitted, search between
    OFFSET and length(STR); if OFFSET is also omitted, search between
    0 and length(STR).

    ATTR is a set expression or predicate function.

    The returned metadata object is shared with STR.  meta() may be
    assigned to.
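
To make the lookup rule concrete, here's a rough sketch in today's Perl.
Since no scalar actually carries metadata magic, it assumes a stand-in
hash holding the text plus a list of out-of-band spans.  The name
meta_find and the span layout (name, start, end, attrs) are invented
for illustration, and ATTR is simplified to a plain name or a predicate
sub rather than a full set expression.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a scalar with metadata magic: the text plus
# a list of out-of-band spans covering [start, end).
my $page = {
    text  => 'see foo and bar',
    spans => [
        { name => 'a', start => 4,  end => 7,  attrs => { href => 'http://www.perl.org/' } },
        { name => 'b', start => 8,  end => 11, attrs => {} },
        { name => 'a', start => 12, end => 15, attrs => { href => 'http://example.com/' } },
    ],
};

# meta_find(STR, ATTR [, OFFSET [, LEN]]): first span matching ATTR that
# overlaps [OFFSET, OFFSET+LEN).  Defaults follow the description above.
sub meta_find {
    my ($str, $attr, $offset, $len) = @_;
    $offset = 0 unless defined $offset;
    my $limit = defined $len ? $offset + $len : length($str->{text});
    # ATTR is either a predicate sub or a plain span name (an assumption).
    my $want = ref($attr) eq 'CODE' ? $attr : sub { $_[0]{name} eq $attr };
    for my $span (sort { $a->{start} <=> $b->{start} } @{ $str->{spans} }) {
        next if $span->{end} <= $offset or $span->{start} >= $limit;
        return $span if $want->($span);    # shared with STR, not a copy
    }
    return;
}

my $first  = meta_find($page, 'a');          # two-argument form
my $second = meta_find($page, 'a', 8);       # three-argument form
my $none   = meta_find($page, 'a', 8, 3);    # four-argument form, only "and"
my $perl   = meta_find($page, sub { ($_[0]{attrs}{href} || '') =~ /perl\.org/i });
```

Because the returned span is the same reference stored in the stand-in,
changing $first->{attrs}{href} changes the metadata seen by later
lookups, which is roughly what "shared with STR" would mean here.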

This forms an iteration technique too:

  $offset = 0;
  while ($a = meta($str, 'a', $offset)) {
    $offset = $a->end;
  }
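
Here's the same loop as something runnable, again only a sketch: the
@spans table and next_meta() are made-up stand-ins for the
three-argument meta(), and the spans are assumed not to overlap.

```perl
use strict;
use warnings;

# Hypothetical out-of-band spans for one string, sorted by start offset.
my @spans = (
    { name => 'a', start => 4,  end => 7,  href => 'http://www.perl.org/' },
    { name => 'a', start => 12, end => 15, href => 'http://example.com/' },
);

# Stand-in for meta(STR, ATTR, OFFSET): first NAME span ending after OFFSET.
sub next_meta {
    my ($name, $offset) = @_;
    for my $span (@spans) {
        return $span if $span->{name} eq $name and $span->{end} > $offset;
    }
    return;
}

my @hrefs;
my $offset = 0;
while (my $anchor = next_meta('a', $offset)) {
    push @hrefs, $anchor->{href};
    $offset = $anchor->{end};    # resume the search after this span
}
```

Every span returned ends after the current offset, so the offset always
moves forward and the loop terminates.  A span nested entirely inside
an earlier one would be skipped, though.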

Overlapping metadata is a problem though.  Maybe meta() returns a metadata
stream?  If metadata streams automatically forward unhandled methods to the
head of the stream, most people could just write code that assumes the
stream is a metadata object.  (It would even be correct in some cases --
like the anchor searching example above.)
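
The forwarding idea can be sketched with AUTOLOAD.  The MetaObject and
MetaStream classes below are invented for illustration and are not part
of any proposal; the point is only that a caller who treats the stream
as a single metadata object gets the head's answer.

```perl
use strict;
use warnings;

package MetaObject;    # hypothetical metadata object
sub new  { my ($class, %f) = @_; return bless {%f}, $class }
sub href { $_[0]{href} }
sub end  { $_[0]{end} }

package MetaStream;    # hypothetical stream of overlapping metadata objects
our $AUTOLOAD;
sub new  { my ($class, @objs) = @_; return bless { objs => [@objs] }, $class }
sub head { $_[0]{objs}[0] }
sub tail {
    my $self = shift;
    my @rest = @{ $self->{objs} };
    shift @rest;
    return MetaStream->new(@rest);
}

# Forward any method the stream does not define to the head object.
sub AUTOLOAD {
    my $self = shift;
    (my $name = $AUTOLOAD) =~ s/.*:://;
    return if $name eq 'DESTROY';    # never forward destruction
    my $head = $self->head or die "empty metadata stream";
    return $head->$name(@_);
}

package main;

my $stream = MetaStream->new(
    MetaObject->new(href => 'http://www.perl.org/', end => 7),
    MetaObject->new(href => 'http://example.com/',  end => 9),
);
my $head_href = $stream->href;          # forwarded to the head object
my $next_href = $stream->tail->href;    # explicit stream handling
```

Code that cares about overlap can walk the stream with tail(); code
that doesn't never needs to know the stream is there.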

The implementation of meta() in the perl core could be really easy -- just
delegate to metadata magic.  This would let metadata implementations using
zero-width characters co-exist with out-of-band metadata.  The same scalar
could even be blessed with metadata several times.  (Although then ATTR
test operations get tricky -- even though a single metadata object doesn't
satisfy the set expression, a combination of metadata objects could.)

Scalars with metadata magic will (hopefully) have a specialized type that
implements basic (abstract?) operations.  The cost of handling metadata
shouldn't be paid by a regular scalar.

The question is: what methods does the core need in order to talk to
metadata magic?  How about these for a start:

  meta -- support regex searches with metadata qualifiers
  after_change -- sync metadata with data changes

A system that uses inline markup is going to have to cache attribute state
during a search, otherwise repeated checking of meta() is going to *really*
hurt.  Even an out-of-band markup system will probably want some sort of
cache.  None of this has to be in the core though -- a dynamically loaded
C++ extension should be fine.
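
As a rough picture of what such a cache might look like for inline
markup, here's a sketch that scans the markup once and records which
tags are open at every plain-text offset.  The single-word tag syntax
and the names build_cache and has_meta are assumptions made up for the
example.  A real cache would store run boundaries rather than one hash
per character.

```perl
use strict;
use warnings;

# Hypothetical inline markup: simple tags like <a>...</a>.
my $marked = 'see <a>foo</a> and <b>bar</b>';

# One pass over the markup builds the plain text and, for every
# plain-text offset, the set of tags open at that position.
sub build_cache {
    my ($src) = @_;
    my $plain = '';
    my (@state, %open);
    for my $tok (split /(<\/?\w+>)/, $src) {
        if ($tok =~ /^<(\w+)>$/) {
            $open{$1}++;
        }
        elsif ($tok =~ /^<\/(\w+)>$/) {
            delete $open{$1} unless --$open{$1} > 0;
        }
        else {
            push @state, +{ %open } for 1 .. length $tok;
            $plain .= $tok;
        }
    }
    return ($plain, \@state);
}

my ($text, $state) = build_cache($marked);

# Cached equivalent of a \m{ATTR} test at one offset: a hash lookup
# instead of re-scanning the markup from the start of the string.
sub has_meta {
    my ($attr, $pos) = @_;
    return exists $state->[$pos]{$attr};
}
```

After any edit to the string the cache is stale, which is where a hook
like after_change would come in.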

Maybe these too?

  before_change -- let metadata control access to data
  serialize -- let metadata be saved to disk or sent across network

BTW, I was wondering about the semantics of '\m{a}.+' in Chip's
example:

  @a = ($page =~ /(\m{a}.+)(?{ meta($1,'a')->{href} =~ /perl\.org/i })/);

What happens with:

  <a ...>foo</a><a ...>bar</a>

Is it best to merge them, as in $1 = 'foobar', and have two metadata
objects on $1?  Or should a '\m{a}' match fail unless the matched
metadata object is the same as the previous match?
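
To make the two readings concrete, here's a sketch over a made-up span
table (plain text 'foobar' with two adjacent 'a' spans).  Neither
function is a proposal for how \m{a} should behave; they only show what
each policy would hand back.

```perl
use strict;
use warnings;

# Plain text 'foobar' with two adjacent hypothetical 'a' spans.
my $text  = 'foobar';
my @spans = (
    { name => 'a', start => 0, end => 3, href => 'http://www.perl.org/' },
    { name => 'a', start => 3, end => 6, href => 'http://example.com/' },
);

# Policy 1: \m{a} only asks "does some 'a' span cover this position?",
# so adjacent spans merge into one run carrying both metadata objects.
sub merged_runs {
    my @runs;
    for my $s (sort { $a->{start} <=> $b->{start} } @spans) {
        if (@runs and $runs[-1]{end} >= $s->{start}) {
            $runs[-1]{end} = $s->{end} if $s->{end} > $runs[-1]{end};
            push @{ $runs[-1]{meta} }, $s;
        }
        else {
            push @runs, { start => $s->{start}, end => $s->{end}, meta => [$s] };
        }
    }
    my @out;
    for my $r (@runs) {
        my $str = substr($text, $r->{start}, $r->{end} - $r->{start});
        push @out, { text => $str, count => scalar @{ $r->{meta} } };
    }
    return @out;
}

# Policy 2: a run may not cross from one metadata object to another.
sub split_runs {
    my @out;
    for my $s (sort { $a->{start} <=> $b->{start} } @spans) {
        push @out, substr($text, $s->{start}, $s->{end} - $s->{start});
    }
    return @out;
}

my @merged = merged_runs();
my @split  = split_runs();
```

Under the first policy $1 would be 'foobar' with two metadata objects
attached, so meta($1, 'a') only sees the first href.  Under the second
there are two separate matches, 'foo' and 'bar'.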

- Ken

-- 
Ken Fox, kfox(_at_)ford(_dot_)com, (313)59-44794
------------------------------------------------------------------------
Ford Motor Company, Powertrain           | "Is this some sort of trick
Analytical Powertrain Methods Department |  question or what?" -- Calvin
C3P Implementation Section               |