Re: \C, UTF-8, and regular expressions

On Thu, Aug 03, 2000 at 02:49:11AM -0400, Owen Taylor wrote:


The output of -Dr makes it pretty clear what is going on:

  Compiling REx `^\C\C(c)'
  size 10 first at 2
  rarest char c at 0
     1: BOL(2)
     2: SANY(3)
     3: SANY(4)
     4: OPEN1(6)
     6:   EXACT <c>(8)
     8: CLOSE1(10)
    10: END(0)
  anchored `c' at 2 (checking anchored) anchored(BOL) minlen 3 
             
  [...]

  Guessing start of match, REx `^\C\C(c)' against `Ã?cole'...
  String not equal...
  Match rejected by optimizer

For regexes compiled with 'use utf8' the anchor position
is in chars, not bytes, and the re optimizer (study_chunk)
thinks that \C counts as one char.
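
To make the mismatch concrete, here is a rough sketch of the offset
arithmetic only. It is written in Python rather than Perl and does not
model the regex engine or study_chunk; the variable names are made up
for illustration, and the input is assumed to be the UTF-8 string
"École" from the trace above.

```python
# Sketch only: models the byte/char offset arithmetic, not Perl's regex engine.
text = "\u00c9cole"            # "École"; the first character is 2 bytes in UTF-8
data = text.encode("utf-8")    # b'\xc3\x89cole'

# Each \C consumes one byte, so after ^\C\C the literal 'c' sits at byte offset 2.
byte_offset_of_c = data.index(b"c")

# Read as a character count instead, offset 2 lands on a different character.
char_offset_of_c = text.index("c")
char_at_offset_2 = text[2]

print(byte_offset_of_c, char_offset_of_c, repr(char_at_offset_2))
```

Under that assumption, the 'c' is at byte offset 2 but at char offset 1,
so a check for an anchored 'c' at char offset 2 finds 'o' instead, which
would be consistent with the "String not equal" line above.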

Fixing this looks decidedly unfun.


I have now submitted a perlbug report on this, so that the bug
(which unfortunately still seems to be there) won't be forgotten.

Regards,
                                        Owen


-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen