ietf-openpgp

Re: [openpgp] Possible ambiguity in description of regular expressions: [^][]

2021-01-07 18:29:59
On 2021-01-05 at 17:11 +0000, Andrew Gallagher wrote:
Is there anything to be said for referring out to an external regex 
definition instead of reinventing the wheel? :-)

The problem is that there is no single regex specification. Still, it
would be beneficial to be able to change the regex definition in some
minor ways.

On this same topic, I had started the following reply last weekend:

On 2020-12-23 at 22:58 +0100, Neal H. Walfield wrote:
Wait... that can also be parsed as match anything (2) followed by
match nothing (1)!


Perhaps I'm misreading the standard.  I'd appreciate confirmation or
any help clarifying my mistake.

Thanks,

:) Neal


I think it's kinda implicit in the sentence
To include a literal ']' in the sequence, make it the first character
(following a possible '^').

that a range cannot be "[]" or "[^]", since it is specified that in such
a case the "]" will be a literal character.

The place-]-at-the-start trick (along with make-'-'-first-or-last) is
well known from old regex flavors which don't support escapes inside
ranges. I would say the whole section makes sense. But there's room for
improvement.
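
For illustration only, here is how that trick behaves in Python's re
module. This is an assumption-laden stand-in: Python's flavor is not the
RFC 4880 one, it merely happens to follow the same convention for a
leading ']' and a trailing '-'. The variable names are made up for this
sketch.

```python
import re

# ']' placed first (after an optional '^') is a literal member of the set.
CLOSE_OR_A = re.compile(r"[]a]")
NOT_CLOSE_OR_A = re.compile(r"[^]a]")

# '-' placed last is a literal hyphen rather than a range operator.
A_OR_HYPHEN = re.compile(r"[a-]")

for name, rx, text in [
    ("[]a] vs ']'", CLOSE_OR_A, "]"),
    ("[^]a] vs ']'", NOT_CLOSE_OR_A, "]"),
    ("[^]a] vs 'b'", NOT_CLOSE_OR_A, "b"),
    ("[a-] vs '-'", A_OR_HYPHEN, "-"),
]:
    print(name, "->", rx.fullmatch(text) is not None)
```

Read this way, "[]" and "[^]" can never be complete ranges, because the
"]" is consumed as a member and the range is still open.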


A trickier case would be a regular expression such as:
Werner Koch (dist.*

This could be taken as a valid regular expression, with the "(" «a
single character with no other significance (matching that character)»,
or a syntax error, since parentheses are 'special' per «An atom is a
regular expression in parentheses». Exactly the same case applies to
"[foo". Although RFC 4880 makes no reference to invalid regular
expressions, I think that's how these should be categorised (another
example would be a regular expression beginning with a quantifier).

And since regular expressions are used in trust signature packets,
5.2.3.15 should probably state that if a regular expression is invalid,
or the implementation cannot support it for whatever reason (e.g.
implementations _will_ place a recursion limit), then trust MUST NOT be
extended.
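
A minimal sketch of that fail-closed rule, assuming Python's re module as
a stand-in engine (it is not the RFC 4880 flavor, although it does reject
the three malformed examples above). The function name trust_extends and
the limit MAX_REGEX_LENGTH are hypothetical, chosen only for this
illustration.

```python
import re

# Hypothetical resource limit; a real implementation would choose its own.
MAX_REGEX_LENGTH = 256


def trust_extends(regex: str, user_id: str) -> bool:
    """Return True only if the regex is usable AND matches the User ID."""
    if len(regex) > MAX_REGEX_LENGTH:
        return False  # unsupported by this implementation: no trust
    try:
        compiled = re.compile(regex)
    except (re.error, RecursionError, OverflowError):
        return False  # invalid or too complex: no trust
    return compiled.search(user_id) is not None


for rx in ["Werner Koch (dist.*", "[foo", "*abc"]:
    print(repr(rx), "->", trust_extends(rx, "Werner Koch (dist sig)"))
```

The point of the design is that every failure path returns the same
answer as "no match", so a broken regular expression can never widen
trust.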


There's a second definition of the regular expressions, which is
The regular expression uses the same syntax as the Henry Spencer's
"almost public domain" regular expression [REGEX] package.

with
   [REGEX]          Jeffrey Friedl, "Mastering Regular Expressions,"
                    O'Reilly, ISBN 0-596-00289-0.


However, someone who turns to that book will find that the latest
edition (the 3rd edition from August 2006, now 14.5 years old), which is
the one readily available, does not describe the Henry Spencer regex
flavor. It mentions it as historically relevant, and notes that Perl 2
used an enhanced version of it, but the flavor is neither described by
itself nor included in the tables comparing different flavors (I guess
more details about it might have been removed in the rewrite that went
into the second edition).

It is possible to dig out the original code[1] and actually test how it
performs (spoiler: it does reject the above constructs), but one should
not need to rely on how that code works.


If I had to define this from scratch now, I would probably define it
as a POSIX Extended Regular Expression (ERE)[2] (or a subset of
that). EREs are relatively similar to the existing definition, are
well-known and well-defined, and such a definition would make it
possible to simply use existing libraries conforming to it (including
regexec on a POSIX libc). An OpenPGP client shouldn't really need to
care much about creating a regular expression engine. It is a complex
component for the tiny usage it would get.
In fact, the easiest way to implement it would probably be to parse the
4880 regex just enough to convert it into an ERE, and then use an
existing facility to execute that.

The main differences are:

Curly brackets { } are special for EREs (used for the range quantifier)
but not for 4880 regex, where they would be literals.

An empty regex alternation (a | at the beginning or end of an ERE, or
of a group inside parentheses) is undefined in an ERE. A 4880 regex
supports it with the expected meaning. An equivalent regex using ? can
be used instead.

4880 regex doesn't support collating symbols or equivalence classes
inside bracket expressions.

4880 regex allows escaping any character with a backslash. In an ERE you
can only escape special characters; an ordinary character preceded by a
backslash is undefined (and often used for extensions, e.g. \w).
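
The differences above could be handled by a small translator along these
lines. This is only a sketch under stated assumptions, not a vetted
implementation: the name rfc4880_to_ere is hypothetical, ranges are
copied verbatim on the assumption that neither flavor supports escapes
inside them, empty alternatives are rejected rather than rewritten with
?, and parenthesis balance is left for the ERE engine to check.

```python
# Characters that POSIX ERE allows after a backslash; a backslash before
# anything else is undefined there.
ERE_QUOTABLE = set("^.[$()|*+?{\\")


def rfc4880_to_ere(pattern: str) -> str:
    """Translate a 4880-style regex to an ERE, or raise ValueError."""
    out = []
    i, n = 0, len(pattern)
    branch_start = True  # at the start, or just after '(' or '|'
    while i < n:
        c = pattern[i]
        if c == "\\":
            if i + 1 == n:
                raise ValueError("trailing backslash")
            nxt = pattern[i + 1]
            # Keep the backslash only where ERE defines the escape.
            out.append("\\" + nxt if nxt in ERE_QUOTABLE else nxt)
            i += 2
            branch_start = False
            continue
        if c == "[":
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1  # a leading ']' is a literal member of the range
            end = pattern.find("]", j)
            if end == -1:
                raise ValueError("unterminated range")
            inner = pattern[i + 1:end]
            if any(s in inner for s in ("[.", "[=", "[:")):
                raise ValueError("range would be read as an ERE class")
            out.append(pattern[i:end + 1])
            i = end + 1
            branch_start = False
            continue
        if branch_start and c in "*+?":
            raise ValueError("quantifier with nothing to repeat")
        if branch_start and c in "|)":
            raise ValueError("empty alternative or group")
        # '{' is an ordinary character in 4880 but special in ERE.
        out.append("\\{" if c == "{" else c)
        branch_start = c in "(|"
        i += 1
    if n and branch_start:
        raise ValueError("pattern ends with '(' or '|'")
    return "".join(out)
```

Under these assumptions, a{1} would become a\{1}, \w would become a
plain w, and a "natural" expression such as <[^>]+[@.]example\.com>$
would pass through unchanged. Any ValueError would be treated as "trust
MUST NOT be extended".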


Regular expressions are a little-used feature, and the "natural" way to
write them would conform to both of those specifications. It is
unlikely that someone would have restricted a trust value based on the
presence of curly brackets in a User ID (they are legal in the local
part of email addresses, even unquoted, but it would be very rare to
find one). Equally, it would be strange to needlessly escape
characters.
So it _may_ be possible to change the definition without adversely
affecting existing usage. For full compatibility, changing the regex
would need to wait for V5 signatures or, preferably, use a new
subpacket type.


Happy New Year to all!

Ángel González



[1] There is a nice copy preserved at https://github.com/garyhouston/regexp.old,
    see https://garyhouston.github.io/regex/
[2] https://pubs.opengroup.org/onlinepubs/007908799/xbd/re.html#tag_007_004


