WG15 Defect Report Ref: 9945-2-40
Topic: I18N issues

This is an approved interpretation of 9945-2:1993.


Last update: 1997-05-20



	Topic:			I18N issues
	Relevant Sections:
	Classification:	 ambiguous

Defect Report:

(from Andrew Hume and Doug McIlroy)

I18N Issues

Issue A

     POSIX has defined a mechanism for talking about multi-character
sequences as a single unit, namely as collating elements (CEs).
Although CEs are motivated by sorting issues, they appear in REs.
This obviously leads to the question of how to parse text into CEs.
There are many possible answers, and furthermore, the parsing might
be affected by context. For example, given the usual alphabet
augmented by the collating element <ij> defined as <i><j>, can the
string ij ever be parsed as two collating elements?

 [1] says in the context of sorting, ``strings are
     first broken up into a series of collating elements''
     (line 1668). Does this apply to pattern matching? And
     if so, how exactly is this done (for sorting or pattern

Proposed Solution:

     Add the following text somewhere; this text should be
referred to by line 1668 and by the general RE introduction

     ``When a string is interpreted as a sequence of
     CEs, the sequence shall be as found by the following
     process: starting at the first character of
     the string, determine the longest prefix of the
     string that matches a CE, add that CE to the
     sequence and continue this process with the character
     after that prefix until the string is

     Note that this applies even if a sort key indicates
that a piece of the text is processed in backwards (right-to-left)
order; that is, the right-to-left processing applies to the CEs
found by a left-to-right lexical scan.
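
     As an illustration only, the longest-prefix process in the
proposed text might be sketched as below. The function name
split_into_ces and the parameter multi_char_ces are hypothetical and
do not come from the standard; the sketch assumes that every single
character is itself a collating element and that the locale's
multi-character CEs are supplied as plain strings.

```python
def split_into_ces(text, multi_char_ces):
    """Split text into collating elements by a greedy left-to-right scan.

    multi_char_ces is an assumed input: an iterable of the
    multi-character collating elements defined by the locale.
    """
    # Try longer candidates first so the longest matching prefix wins.
    # Empty strings are dropped because they would never advance the scan.
    candidates = sorted((ce for ce in multi_char_ces if ce),
                        key=len, reverse=True)
    result = []
    pos = 0
    while pos < len(text):
        for ce in candidates:
            if text.startswith(ce, pos):
                result.append(ce)
                pos += len(ce)
                break
        else:
            # No multi-character CE starts here, so the single
            # character is taken as a collating element by itself.
            result.append(text[pos])
            pos += 1
    return result


if __name__ == "__main__":
    print(split_into_ces("ij", {"ij"}))
    print(split_into_ces("iji", {"ij"}))
    # Backwards processing would reverse the CEs found by the
    # left-to-right scan, not rescan the text from the right.
    print(list(reversed(split_into_ces("iji", {"ij"}))))
```

     Under these assumptions the scan never backtracks: each step
commits to the longest CE at the current position and moves on.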


     This is the greedy algorithm normally done in lexical
analysis. Any other choice would require backtracking with
potentially exponential runtime. It implies that, when
<i><j> is a collating element, under no circumstances can a
bracket expression match the i alone in the string ij. In
particular, neither [[.i.][.ij.]]j nor [[.i.]]j matches
ij. By contrast, i[[.j.]] does match ij, because in this
regular expression i denotes a character and is unaffected
by concerns about collating elements.
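
     The three examples above can be checked with a toy matcher,
given here as one reading of the proposal rather than as any
implementation's behavior. It is a sketch under stated assumptions:
a pattern is modeled as a list of ("char", c) and ("bracket", set)
items, the whole string must match, and a bracket item must match
the longest CE found at the current position. The names
longest_ce_at and matches are hypothetical.

```python
def longest_ce_at(text, pos, multi_char_ces):
    """Return the longest collating element that starts at text[pos]."""
    best = text[pos]  # a single character is assumed to be a CE
    for ce in multi_char_ces:
        if len(ce) > len(best) and text.startswith(ce, pos):
            best = ce
    return best


def matches(pattern, text, multi_char_ces):
    """Match the whole of text against a toy pattern.

    pattern is a list of ("char", c) or ("bracket", set_of_ces) items.
    """
    pos = 0
    for kind, value in pattern:
        if pos >= len(text):
            return False
        if kind == "char":
            # A literal denotes one character, regardless of CEs.
            if text[pos] != value:
                return False
            pos += 1
        else:
            # A bracket expression sees only the greedy CE at pos.
            ce = longest_ce_at(text, pos, multi_char_ces)
            if ce not in value:
                return False
            pos += len(ce)
    return pos == len(text)


if __name__ == "__main__":
    ces = {"ij"}
    # [[.i.][.ij.]]j : the bracket consumes <ij>, leaving nothing for j
    print(matches([("bracket", {"i", "ij"}), ("char", "j")], "ij", ces))
    # [[.i.]]j : the CE at position 0 is <ij>, which is not in the set
    print(matches([("bracket", {"i"}), ("char", "j")], "ij", ces))
    # i[[.j.]] : the literal i takes one character, then <j> matches
    print(matches([("char", "i"), ("bracket", {"j"})], "ij", ces))
```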

WG15 response for 9945-2:1993 
The standard is unclear on this issue, and no conformance
distinction can be made between alternative implementations
based on this.  This is being referred to the sponsor.
Rationale for Interpretation: