WG15 Defect Report Ref: 9945-2-43.1
Topic: Regular Expressions - imprecise specification

This is an approved interpretation of 9945-2:1993.


Last update: 1997-05-20



	Topic:			Regular Expressions - imprecise specification
	Relevant Sections:	2.8
	Classification:		Q12: Ambiguous.
				Q13: Defect.

Defect Report:

(from Andrew Hume Doug McIlroy)

Issue D

     These questions address areas  of	imprecision  in	 the
specifications	of REs.	With regard to question	[13], 9945-2
uses various  terms roughly synonymously  (leftmost,  first,
earliest)   (subpattern,  subexpression)  for  describing  a
left-to-right order across a line of text.  It would be	well
if the text (and rationale) used one term consistently.

 [12] Can a duplicated subexpression match the null  string?
     If	 so,  will  the	 duplication  be  repeated until the
     expression	does match the null string?

Proposed Solution:

     A subexpression that can match a null string shall	 not
be duplicated.


     Although adjacent duplication symbols are illegal	(for
no  apparent  reason,  particularly for	EREs),	\(x*\)*	is a
legal expression, in which the	*s are	not  adjacent,	that
raises	the  question:	how  many  times  is the null string

     Suppose that  \(x*\)* were	allowed.  Does	matching  it
to  xxx	yield  \1 =  xxx or  \1	= null?	 The latter alterna-
tive is	consistent with	the  rule  that	 a  null  string  is
longer	than no	match.	By extending  xxx with a null string
instead	of nothing, one	gets  a	 longer	 match.	  More	null
strings	make even longer matches.

     To	avert metaphysical questions about the last  element
of an infinite sequence, or an element following an infinite
sequence, one could forbid null	iterations.  But this has an
unsatisfying corollary that  \(x*\)* wouldn't match the	null
string.	 Compromise positions might forbid  null  iterations
after the first	iteration or after the first null iteration.
Such special pleading and the resultant	implementation	com-
plexity	is not worth the return.

     Lord and McIlroy  have  implemented  a  no-null-unless-
first  policy;	Spencer	 has  implemented repeat-until-null.
Spencer's interpretation seems perhaps less strained,  until
you realize that it makes references ( \1,  \2,	etc) to	con-
tents of duplications useless because they  will  always  be
null or	undefined.

     The proposed solution also	outlaws	EREs like  (a?|b)*,
(a?)\{2,2\},  and  .   ([a-z]*(|^))* The set of	regular	lan-
guages is unchanged, however.  There would  still  be  legal
equivalents to these and all other outlawed expressions.

     An	alternative, to	 leave	the  meaning  undefined,  is
unacceptable.	Users will be unaware of the exact bounds of
portability, and their biggest	intellectual  investments  -
the  hardest  expressions and sed scripts - are	liable to be

 [13] What is the meaning of the BRE  \(\)?  []

Proposed Solution:

     Forbid this expression.

     It	would be also acceptable, but  rather  less  so,  to
define that the	pattern	 \(\) matches the null string.


     The pattern has no	analog in EREs,	no utility,  and  to
our  knowledge	no  basis  in  history.	  It  should thus be
banned.	Otherwise, at least define what	it means.

WG15 response for 9945-2:1993 


        The standard does not require the successful
        matching of a null string after a successful match on a non null
        string for duplication symbols.  However, the
        standard does not clearly say that you are not allowed to match
        null strings after a successful match.  Therefore the standard
        in ambiguous.

        The definition of a back reference expression in section
        does not specify which match of a subexpression followed by a
        duplication symbol is to be returned.  We note however that the
        definition regcomp() defined in section B5.1 pg 728 344-346
        indicate that the back reference expression will match the last
        string matched by the sub expression.

        The standard is unclear on this issue, and no conformance
        distinction can be made between alternative implementations
        based on this.  This is being referred to the sponsor.    


        Page 89 line 3263 requires this form to be accepted.  But, the
        text 2.8.3 does not describe its meaning.

        Concerns about the wording of this part of the standard have
        been forwarded to the sponsor.

Rationale for Interpretation: