SC22/WG20 N946


From: Kenneth Whistler
Sent: Friday, September 01, 2000 7:35 PM


 Subject: Some technical TR 10176


6. Identifiers


An issue that WG20 has had to deal with fairly recently is the list of recommended characters for identifiers, in Annex A of TR 10176, "Guidelines for the preparation of programming language standards". Because the list of recommended characters for identifiers is based on the repertoire of ISO 10646, this is another area where repeated maintenance into the future can be foreseen, as the repertoire of 10646 continues to expand.


Once again, because of the location of character expertise regarding all the characters added to 10646, the logical source for recommendations about how to extend the list in Annex A in the future is SC2. This is supported by the additional fact that determination of which characters are and are not appropriate in identifiers implicitly depends on specification of a constellation of properties for those characters -- again an area in which the expertise is located in SC2.
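To illustrate what "a constellation of properties" means in practice, here is a minimal Python sketch that decides identifier membership from the General Category property rather than from a hand-maintained list. The category sets are an assumption loosely modelled on the Unicode Standard's identifier guidelines; they are not the Annex A table of TR 10176, and the function name is hypothetical.

```python
import unicodedata

# Assumed category sets (illustrative only, not the TR 10176 Annex A list).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}        # letters, letter numbers
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}  # + marks, digits, connectors


def is_identifier(text: str) -> bool:
    """Return True if text looks like an identifier under the assumed property rules."""
    if not text:
        return False
    # The first character must come from the "start" categories.
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    # Every later character must come from the "continue" categories.
    return all(unicodedata.category(ch) in CONTINUE_CATEGORIES for ch in text[1:])
```

Because the answer is derived from character properties, it tracks the repertoire automatically whenever the property data is updated, which is the maintenance argument made above.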


However, there is something of a conundrum here, since the remainder of the content of TR 10176 is clearly in the domain of SC22, and the TR as a whole is inappropriate for maintenance in SC2. Perhaps some kind of understanding could be arranged between the SCs to guarantee that modifications to Annex A of TR 10176 are made only with timely, coequal input from SC2.


A better solution, in the long run, would be to sever the contents of the exact table in Annex A, which has to track character repertoires and properties that are (or should be) the responsibility of SC2, from TR 10176 per se, and instead insert a reference there to a standard list maintained by SC2, either in the context of 10646 itself or in some associated TR to be developed by WG2 for this purpose. That would more appropriately divide the responsibilities for the part of TR 10176 associated with formal language syntax and design and the part which is attempting to track the universal character encoding repertoire as it expands over time.


Another reason for moving in this direction is the particular interest that the Unicode Technical Committee has in the identifier content problem. The Unicode Standard has detailed recommendations regarding identifiers, and the Unicode Technical Committee is currently working on even more detailed specifications regarding identifiers and identifier-like constructs for use in various contexts on the World Wide Web and the Internet. It is in JTC1's interest to keep this particular technical issue active in a venue, namely SC2/WG2, where the character encoding expertise is available and the working relation with the UTC is strong. Even though on the surface it might seem that programming identifier syntax clearly belongs to SC22, the real issue is not the syntax per se (which is quite simple), nor the concept of an identifier and its relation to other programming language constructs (which the UTC and SC2 have little interest in and consider to be long ago fixed and decided by the SC22 standards). No, the *real* issue that remains open and problematic is how to classify and distribute all the thousands of additional characters in 10646, and how to deal with the complex ramifications of including various compatibility characters which may or may not change under various kinds of identifier normalization processes. That is where the UTC and WG2 expertise would be most helpful, and where joint development of Unicode and ISO standards would be most likely to minimize interoperability problems for identifiers in different programming languages and Internet and Web protocols.
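The compatibility-character problem can be made concrete with a short Python sketch. It uses the standard library's unicodedata.normalize with the NFKC form as one example of an "identifier normalization process"; the choice of NFKC and the helper name are assumptions for illustration, not something prescribed by TR 10176.

```python
import unicodedata


def nfkc_changes(identifier: str) -> bool:
    """Return True if NFKC normalization would alter the identifier (hypothetical helper)."""
    return unicodedata.normalize("NFKC", identifier) != identifier


ligature = "\ufb01le"   # U+FB01 LATIN SMALL LIGATURE FI followed by "le"
fullwidth = "\uff21b"   # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A followed by "b"
plain = "file"

# The ligature form and the plain form are different code point sequences...
print(ligature == plain)                                  # False
# ...but they become the same identifier once NFKC is applied.
print(unicodedata.normalize("NFKC", ligature) == plain)   # True
print(nfkc_changes(fullwidth), nfkc_changes(plain))       # True False
```

Whether two such spellings name the same identifier therefore depends entirely on which normalization, if any, a language specifies, and that in turn depends on the character property data.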


This entire issue is, by the way, also of intense interest to the database standards arena, where it is of direct relevance to the SQL standard, for example. So the SC22 working groups are not the only JTC1 groups with an interest in standard, interoperable results in this area for 10646 characters.



7. Case Mapping and Case Folding


WG20 has not spent much time dealing with case mapping and case folding issues, although those clearly have an internationalization angle, because of local differences in case mapping preferences.


The one point where this has been dealt with by WG20 is in the LC_CTYPE specification in DTR 14652. This is because LC_CTYPE is the location of the information used by the tolower() and toupper() case mapping transforms for C (and by extension, other languages). As a result, DTR 14652 includes tables of case pairs for all of the 10646 characters that have case pairs.


However, the explicit inclusion of these case mappings in the "i18n" LC_CTYPE definition in DTR 14652 has been controversial in the committee, in part because of a small number of unexplained inconsistencies between those tables and the case mappings provided by the Unicode Consortium on its website. The Unicode case mappings are very widely implemented in many products, and are being treated by the industry as a de facto standard. So it is problematic for DTR 14652, as a standards document, to propose slightly different case mappings that contradict widespread practice.


This is once again an area where the JTC1 standards arena would be better served by using references to de facto practice, rather than trying to reinvent the wheel with long lists in other standards or TRs, which are subject to error or drift that can introduce interoperability problems. Perhaps here the SC22 language working groups could work with SC2/WG2 to find a way to make the de facto Unicode tables referenceable through an SC2 TR of some sort, to avoid the synchronization issues of trying to maintain two (huge) lists separately.


The area of case folding is related to case mapping, but is subtly different. WG20 has not dealt with this issue, but it is clear that SC22 language working groups need to deal with it. In particular, COBOL, Pascal, and other languages that have case-insensitive identifiers need to be able to do reliable case folding during their parsing/lexing phases of program text interpretation. For that, they need reliable definitions of case folding as applied to 10646 characters for the domain of characters allowed inside identifiers for each language.
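The subtle difference between case mapping and case folding can be seen in a small Python sketch. Python's built-in str.lower() is a case mapping and str.casefold() is a case folding; Python is used here only as a convenient stand-in, and the comparison function name is hypothetical. A COBOL or Pascal processor would need equivalent tables of its own.

```python
def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers case-insensitively using case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


lower_name = "Stra\u00dfe"   # "Straße", containing U+00DF LATIN SMALL LETTER SHARP S
upper_name = "STRASSE"

# Lowercase mapping leaves U+00DF alone, so the two spellings do not match.
print(lower_name.lower() == upper_name.lower())   # False
# Case folding maps U+00DF to "ss", so the two spellings do match.
print(same_identifier(lower_name, upper_name))    # True
```

A lexer that matched case-insensitive identifiers by lowercasing alone would treat these two spellings as distinct names, which is the kind of unreliability a proper case folding definition is meant to remove.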


While WG20 has not touched on this issue and the SC22 working groups are starting to search for an answer, the Unicode Technical Committee and the IETF have moved ahead, creating de facto solutions that will see widespread implementation in the near future.


The Unicode Technical Committee has already published CaseFolding.txt, a machine-readable file with recommendations on exactly how to do case-folding for all Unicode 3.0 characters (i.e. 10646-1:2000 characters). The SC22 committees should be reviewing that file, and the associated case mapping information available in UnicodeData.txt and in SpecialCasing.txt -- also available on the Unicode website -- before concluding that new standardization efforts need to be initiated in SC22 (whether in WG20 or in other working groups), to repeat the work involved in creating those files, which are already freely available to all implementers.
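As an indication of how little work is needed to consume such a machine-readable file, the following Python sketch parses a few lines laid out like CaseFolding.txt and applies them. The semicolon-separated field layout and the status letters (C for common, F for full, T for Turkic) are assumptions taken from recent releases of the file and may differ from the Unicode 3.0 version; the sample is a tiny hand-copied excerpt, not the real file, and the function names are hypothetical.

```python
# Assumed layout: code; status; mapping; # name
SAMPLE = """\
# Tiny excerpt in the layout of recent CaseFolding.txt releases (assumption)
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
"""


def parse_case_folding(text: str, statuses=("C", "F")) -> dict:
    """Build a folding table from CaseFolding-style text, keeping only the given statuses."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        if status not in statuses:
            continue                            # e.g. skip locale-specific "T" entries
        source = chr(int(code, 16))
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table[source] = target
    return table


def fold(text: str, table: dict) -> str:
    """Apply the folding table character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


table = parse_case_folding(SAMPLE)
print(fold("IA\u00df", table))   # iass
```

An implementer would read the published file in place of SAMPLE; the point is only that the data is directly usable without a separately maintained list.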


The UTC and the IETF are currently working on the even thornier problem of determining how best to define identifiers in a context (such as internationalized domain names) where certain characters are disallowed (such as punctuation that has other reserved uses in URL syntax), where case folding is required, where normalization of data is also required (disallowing equivalent sequences that might otherwise appear identical), and where even visual look-alikes of otherwise different characters are to be avoided if possible because of the confusion they can pose for user entry and the possibility of spoofing. This is an area where intimate knowledge of all the characters in 10646, and of the interaction of their properties and appearances, is required. Yet again, it would behoove the SC22 working groups to participate in the joint UTC/IETF effort in this area through review and feedback, rather than trying to reinvent the wheel in a committee context where less relevant expertise would be available to start with.
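The requirements listed above (disallowed characters, case folding, normalization) can be sketched as a small Python pipeline. This is purely illustrative and is not the algorithm under development by the UTC and IETF: the use of NFKC, the rule of rejecting every punctuation, symbol, separator, and control character except the hyphen, and the function name are all assumptions.

```python
import unicodedata


def prepare_label(label: str) -> str:
    """Hypothetical preparation step: normalize, case-fold, then reject disallowed characters."""
    # Normalize and fold; normalize again because folding can undo normalization.
    folded = unicodedata.normalize("NFKC", label).casefold()
    prepared = unicodedata.normalize("NFKC", folded)
    for ch in prepared:
        major = unicodedata.category(ch)[0]
        # Assumed rule: no punctuation (P), symbols (S), separators (Z) or controls (C),
        # with the hyphen allowed as an exception.
        if major in "PSZC" and ch != "-":
            raise ValueError(f"disallowed character U+{ord(ch):04X}")
    return prepared


print(prepare_label("\uff25xample"))   # fullwidth E is normalized and folded: example

# Look-alikes survive normalization and folding: Latin "a" vs Cyrillic U+0430.
latin = prepare_label("paypal")
cyrillic = prepare_label("p\u0430ypal")
print(latin == cyrillic)               # False, although the two render almost identically
```

The last comparison shows why look-alikes are a separate problem: neither normalization nor case folding unifies them, so detecting them requires additional knowledge about character appearance.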


End of document