From: keld@dkuug.dk (Keld Jørn Simonsen)
Date: Tue, 19 May 1998 14:18:59 +0200
To: iso14766
Subject: 14766 WD 3

I have now completed my editing round on the comments from the Dallas meeting, so here is the resulting draft (WD3). Please submit comments to me or the list before 1998-06-26.

The draft is also available in WordPerfect 5.1 format (which can be read by Microsoft Word and other packages) in the file http://www.dkuug.dk/jtc1/sc22/wg15/iso14766/gnp3.wp
The WordPerfect file adds very little information, so you should be able to comment from the plain ASCII file alone. However, the WordPerfect file follows the ISO formatting rules quite closely.

One outstanding item is that I should include text from TR 10000-2 clause 6.3. I studied this and found that there was no text to include; the clause is only a description of some OSI profiles. So maybe another document was meant, such as TR 10000-1?

Regards
keld

---

Reference number of working document: ISO/IEC JTC1/SC22/WG15 N_____
Date: 1998-05-18
Reference number of document: ISO/IEC WD3 14766
Committee identification: ISO/IEC JTC1/SC22
Secretariat: ANSI

Information technology - Guidelines for POSIX National Profiles and National Locales

Technologies de l'information - Guide de profiles nationales et locales nationales de POSIX

FOREWORD

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization.
National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.

In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.

The main task of a technical committee is to prepare International Standards, but in exceptional circumstances the publication of a Technical Report of one of the following types may be proposed:

- type 1, when the required support cannot be obtained for the publication of an International Standard, despite repeated efforts;
- type 2, when the subject is still under technical development or where for any other reason there is the future but not immediate possibility of an agreement on an International Standard;
- type 3, when a technical committee has collected data of a different kind from that which is normally published as an International Standard ("state of the art", for example).

Technical Reports of types 1 and 2 are subject to review within three years of publication, to decide whether they can be transformed into International Standards. Technical Reports of type 3 do not necessarily have to be reviewed until the data they provide are considered to be no longer valid or useful.

Technical Report ISO/IEC 14766 was prepared by Joint Technical Committee ISO/IEC JTC 1, "Information Technology", subcommittee 22, "Programming languages, their environments and system software interfaces". This Technical Report was developed in cooperation with the Institute of Electrical and Electronics Engineers, Inc (IEEE).

Suggestions and comments for improvement of this document are welcome.
They should be sent to:

Keld Simonsen
Sankt Jørgens Alle 8
DK-1615 Copenhagen V
Denmark
Email: keld@dkuug.dk

CONTENTS

1 Scope
2 Conformance and testing
3 References
4 Definitions and abbreviations
4.1 Definitions
4.2 Abbreviations
5 Purpose of National Profiles and National Locales
5.1 Purpose of National Profiles
5.2 Purpose of National Locales
6 Concept of National Profiles
6.1 The relationship to base standards
6.2 The relationship to Registration Authority
6.3 Principles of National Profile Content
6.3.1 General Principles
6.3.2 Principles of National Profile Content
6.3.3 Main elements of a National Profile Definition
6.4 The meaning of conformance to a National Profile
6.5 Conformance requirements of POSIX National Profiles
6.6 Implementation Conformance
6.6.1 General
6.6.2 Requirements
6.7 POSIX Application Conformance for National Profiles
6.7.1 Conforming POSIX Application
6.7.2 Conforming POSIX Application Using Extensions
7 Contents of National Profile
8 Concept of National Locale
9 Contents of National Locale
9.1 Contents of character classification and transformation
9.2 Contents of numeric format
9.3 Contents of monetary format
9.4 Contents of time format
9.5 Contents of collating sequence
9.6 Contents of messages
10 Using locale templates
10.1 internationalization data collections
10.2 reorder-after technique
11 Concept of Charmap
12 Contents of Charmap
Annex A. POSIX locale extract
Annex B. Symbolic character names
Annex C. Convenient tools for producing National Locale
Annex D. Use of ISO/IEC 10646 in POSIX standards
Annex E. Registry data
Annex F. Examples of National Profile - Japan
Annex G. Examples of National Locale - Denmark
Bibliography
Index

Information technology - Guidelines for POSIX National Profiles and National Locales

1 Scope

This Technical Report provides a guideline for ISO Member Bodies in the process of making National Profiles and National Locales for the ISO/IEC 9945 POSIX series of standards.

- National Profiles provide requirements for making POSIX suitable for a given culture, by specifying the options needed from the POSIX standards and the national standards to be applied. Implementers can then comply with the POSIX National Profile to make their product suited for the market, and ISO member bodies can facilitate procurement by making National Profiles which are national standards. Users can obtain products which are suited to their needs and which behave consistently across applications and platforms. A National Profile may include National Locale specifications.

- National Locales specify options to POSIX standards in POSIX locale format, for data that varies culturally. Applications can be written in an internationally portable way by removing hard-coded culturally dependent data or functions, and using the POSIX National Locale data instead. Using the National Locales, implementers can be relieved from specifying the often very complex internationalization data themselves and can instead rely on a credible source such as the ISO member bodies. Users can benefit from products that are suited to their cultural needs and obtain consistent behaviour across applications and platforms. ISO member bodies can facilitate this process and provide procurement specifications via national standards on National Locales.

Note: Throughout the rest of this document, for simplicity of wording, the term National Profile is used as a synonym for POSIX National Profile, unless otherwise stated.

2 Conformance and testing

As this specification is a Technical Report, no conformance to it can be claimed.
Editor's note: Take something from .23, David Blackwood will provide this.

For testing a POSIX National Profile with its National Locale it is often a good idea to provide test data for some functionality, especially the collating specification. This could be done by providing an unsorted file and a correctly sorted file. Providing a test suite for all of the standards referenced by a National Profile will probably be unmanageable.

3 References

The following normative documents contain provisions which, through reference in this text, constitute provisions of this Technical Report. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on this Technical Report are encouraged to investigate the possibility of applying the most recent editions of the normative documents indicated below. For undated references, the latest edition of the normative document referred to applies. Members of ISO and IEC maintain registers of currently valid International Standards.

ISO/IEC 9945-1:1996, "Information technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) [C Language]".

ISO/IEC 9945-2:1993, "Information technology - Portable Operating System Interface (POSIX) - Part 2: Shell and Utilities".

ISO/IEC 646:1991, "Information technology - ISO 7-bit coded character set for information interchange".

ISO/IEC 2022:1994, "Information technology - Character code structure and extension techniques".

ISO 4217:1995, "Codes for the representation of currencies and funds".

ISO 8601:1988, "Data elements and interchange formats - Information interchange - Representation of dates and times".

ISO/IEC 10646:1997, "Information technology - Universal Multiple-Octet Coded Character Set (UCS), including Cor.1 and AMD 1-9".
ISO/IEC FCD 14651, "Information technology - International string ordering - Method for comparing character strings and description of a default tailorable ordering".

ISO/IEC 8859, "Information technology - 8-bit single-byte coded graphic character sets - Parts 1 to 10, 13 and 14".

ISO/IEC Directives:1997, "Procedures for the technical work of ISO/IEC JTC 1 on Information Technology".

ISO/IEC Directives Part 2, "Methodology for the development of International Standards".

ISO/IEC Directives Part 3, "Drafting and presentation of International Standards".

ISO/IEC 9899:1993, "Information Technology - Programming language - C" (including AM 1:1993, Multibyte Support Extensions).

ISO/IEC TR 14262:1995, "Information Technology - Guide to the POSIX Open Systems Environment".

IEEE P1003.18/D13 (September 1996), "Information Technology - POSIX Profile". (Editor's note: there is apparently no ISO/IEC document for this project.)

ISO/IEC TR 10000-1:1995, "Information technology - Framework and taxonomy of International Standardized Profiles - Part 1: Framework".

ISO/IEC TR 10000-2:1995, "Information technology - Framework and taxonomy of International Standardized Profiles - Part 2: Principles and Taxonomy for OSI Profiles".

ISO/IEC FCD 14652, "Information technology - Specifications for cultural conventions".

ISO/IEC DIS 15897/ENV 12005:1996, "Information technology - Procedures for European registration of cultural elements".

ISO/IEC TR 11017:1998, "Information Technology - Framework for Internationalization".

4. Definitions and abbreviations

For the purpose of this Technical Report the following definitions and abbreviations apply.

4.1 POSIX Profile: A profile for an International Standard is a set of specifications covering parameter values, selections of optional items and recommendations on implementation-related matters. A POSIX Profile corresponds to the Profile concept for the POSIX International Standard.
4.2 POSIX National Profile: A National Profile is a subset of a POSIX Profile which is strongly related to the culture-dependent aspects of POSIX. It also contains the definitions and recommendations for the usage of national/regional standards which support the handling of nation-specific and/or area-specific aspects (e.g. the use of coded character sets).

4.3 POSIX National Locale: A National Locale is a subset of a National Profile, which gives profile options in the POSIX localedef format.

4.4 Conformance to a POSIX National Profile: The degree to which the specifications of a realized POSIX system coincide with the POSIX National Profile. Since the POSIX National Profile is not necessarily included in the POSIX Profile, systems which conform to the POSIX National Profile may not meet the POSIX Conformance requirements.

4.5 National Standards Profile: A National Standards Profile (NSP) is a profile of an international standard or set of international standards, possibly together with other specifications, that is adopted by an ISO member body as a national standard.

4.6 Internationalization (I18N): A process of producing an application platform or application which is easily capable of being localized for (almost) any cultural environment. (Note, therefore, that an internationalized information system does not have a dependency on any specific culture, unless it is localized to that selected culture.) (11017)

4.7 Localization (L10N): A process of adapting an internationalized application platform or application to a specific cultural environment. In localization, the same semantics are preserved, while the syntax may be changed. (11017)

4.8 Portability: The ability of an application to perform unchanged, with the same results, on different application platforms. (KS)

4.9 Locale: The definition of the subset of the environment of a user that depends on language and cultural conventions. (9945-2)
4.10 Charmap: A character set description file, for use with a locale. (9945-2, KS)

4.11 International Standardized Profile: A profile that has been ratified by ISO and IEC. (KS)

4.12 ISP: an abbreviation for International Standardized Profile

4.13 NSP: an abbreviation for National Standards Profile

4.14 I18N: an abbreviation for internationalization

4.15 L10N: an abbreviation for localization

5. Purpose of National Profile and National Locale

5.1 Purpose of National Profiles

National Profiles for POSIX based international standards define culture- and language-dependent adaptation and interpretation of POSIX for the following purposes:

- A National Profile identifies the base international and national/regional standards and clarifies the relationships among them.

- A National Profile identifies the base standards, together with appropriate culture- and language-specific classes, subsets, options and parameters, which are necessary to assure a higher degree of portability.

- A National Profile gives a detailed description of locale-dependent functions that are outside the scope of the base International Standard. The base standard provides frameworks for internationalization so that national bodies can define appropriate language- and culture-dependent adaptation and interpretation based on it.

- A National Profile provides reference systems on top of which culture- and language-dependent applications can be built, to promote POSIX based standards among users and vendors.

- A National Profile promotes the development of conformance tests that produce consistent results for systems compliant with POSIX and a given National Profile.

Various bodies throughout the world are undertaking work on the definition of National Profiles for POSIX based international standards.
This Guideline for POSIX National Profile Writers has been developed by SC22/WG15 to make National Profiles consistent and to make their harmonization easier. It does so by defining the following:

- the style, documentation scope and classification scheme for National Profiles;
- those items that should be written in National Profiles;
- those items that should not be written in National Profiles.

5.2 The purpose of the National locale

The purpose of the national locale is to specify a POSIX locale for a given culture, identified by country and language and specified by an ISO member body, so that users can refer to this locale and obtain consistent behaviour across the hardware and software platforms conforming to it. It is expected that many national standardisation organisations will make national standards on their locales, which can then also be used for procurement. The national locale will in most cases build on already existing national standards, for example on formatting and collation, but will sometimes reflect customary practice; for date and time, for example, an adequate national standard often does not exist.

6. Concept of National Profiles

POSIX is a platform of the Open System Environment (OSE), and an AEP (Application Environment Profile) is a set of parameters and option selections for the base standards included in the OSE, supporting the execution of application programs for a given application field. It includes the parameters and option selections for the relevant base standards, such as platform standards like POSIX and application-specific standards like GKS, SQL and so on.

A National Profile for a specific cultural region or a nation is a set of parameters and option selections for several base standards like POSIX. These standards may be National Standards like JIS X0208, and they may be extensions of international standards.
A National Profile cannot avoid such non-international standards because it should specify the local cultural aspects.

National Profiles cannot be considered International Standardized Profiles (ISPs) in the sense of ISO/IEC TR 10000-2, as they are not international (but national) in nature. Thus International Standardized Profiles cannot reference POSIX National Profiles, while the referencing from National Profiles to ISPs is possible.

Application Environment Profiles and National Profiles may be based on National Standards, and therefore it is necessary to coordinate these by defining the parameters and option selections from the view point of international harmonization to support international application portability and interoperability.

Granting this fact, there are several levels of conformance both for a given POSIX application environment profile and a given POSIX National Profile as follows:

For Application Environment Profile:

1 Strictly Conforming POSIX Application for POSIX AEP
An application that can be executed for any parameters and options for POSIX

2 ISO/IEC Conforming POSIX Application for POSIX AEP
An application that requires only specific POSIX related parameters and options.

3 ISO/IEC Conforming POSIX Application using Extensions for POSIX AEP
An application that requires not only specific POSIX related parameters and options but also other ISO/IEC standards and their international profiles.

For POSIX National Profile:

1 National Body Conforming POSIX Application for POSIX NP
An application that requires only the POSIX related parameters and options defined in POSIX National Profile.

2 National Body Conforming POSIX Application using Extensions for POSIX NP
An application that requires POSIX related parameters and options defined in POSIX National Profile, national profiles for other ISO/IEC standards, and national body standards.
6.1 The relationship to base standards

Base standards specify procedures and formats that facilitate the development of internationally portable applications across many countries/regions. They may provide mechanisms for supporting language- and culture-dependent (locale-specific) aspects, as far as possible in a locale-independent way.

National profiles promote the applicability of the base standards to specific countries/regions by defining how to use the mechanisms specified in the base standards for a specific country/region, with an appropriate choice and setting of options and parameter values. National profiles may also specify additional standards which are required to support locale-specific features.

National profiles shall not contradict base standards but shall make specific choices where options and ranges of values are available. The choice of base standard options should be restricted so as to maximize application portability across National profiles, consistent with achieving the objectives of the National profiles.

6.2 The relationship to Registration Authority

Some objects specified in a National Profile may be administered and registered, to preserve their identification and to avoid conflicts between the values or names adopted by different countries. The administration and registration of such objects may be performed by Registration Authorities, authorized by ISO/IEC JTC1, following an internationally recognized and agreed procedure.

The ISO/IEC DIS 15897 registration standard provides registration mechanisms for POSIX profiles, POSIX locales, POSIX charmaps and lists of symbolic character names ("repertoiremaps").

Note: contents of the ISO/IEC DIS 15897/ENV 12005 registration are available at http://www.dkuug.dk/cultreg/

The following locale objects specified in a National Profile should be registered and maintained by Registration Authorities:
(a) locale definitions and their names
(b) symbolic character names
(c) coded character sets and their names
(d) character class names

6.3 Principles of National Profile Content

6.3.1 General Principles

The general principles for a profile specified in ISO/IEC TR 10000-1, subclause 6.3, apply to a POSIX National Profile.

6.3.2 Principles of National Profile Content

A National Profile sets out requirements that help to maximize application portability for a specific country/region. It does not specify all of the functionality of a system, but only the part relevant to locale-specific operation.

The content of a National Profile shall be specified in a way that is independent of the coded character set, wherever possible.

When some requirements are recognized to be locale-specific but a National Profile cannot give a clear indication, it may include informative guidance to implementors.

6.3.3 Main elements of a National Profile Definition

The definition of a National Profile shall comprise the following elements:

(a) a definition of the scope of the countries/regions for which the National Profile is defined, and of its purpose;

(b) normative reference to base standards, including precise identification of the actual texts of the base standards being used and of any approved amendments and technical corrigenda (errata), conformance to which is identified as potentially having an impact on achieving portability using the National Profile;

(c) normative and informative reference to any other relevant source documents, including National Body standards;

(d) specification of the application or the function of each referenced base standard, covering recommendations on the choice of classes or subsets, and on the selection of options, ranges of parameter values, etc.;

(e) specification of the locale information of each referenced base standard;

(f) a statement defining the requirements to be observed by systems claiming conformance
to the National Profile.

6.4 The meaning of conformance to a National Profile

The concepts of Implementation Conformance and Application Conformance are incorporated in the concept of National Profiles. The conformance requirements defined in a National Profile apply only to an application platform, for interoperability and for portability of applications and data. A real system is said to exhibit conformance if it complies with the requirements of the applicable POSIX standards.

A National Profile shall address the following two topics:

(a) Implementation Conformance requirements (details as given in 6.6);
(b) Application Conformance requirements (details as given in 6.7).

These requirements are stated in a POSIX National Profile.

In order to conform to a National Profile, a system shall perform correctly all the capabilities defined as mandatory in POSIX and also any POSIX options which it claims to include. Conformance to a base standard in this context is conformance to a particular identified publication of a referenced base standard.

A National Profile shall be defined in such a way that testing of its implementation can be carried out as completely as possible, given the available testing methodologies.

6.5 Conformance requirements of POSIX National Profiles

The concepts of Implementation Conformance and Application Conformance are incorporated in the concept of National Profiles. The conformance requirements defined in a National Profile apply only to an application platform, for interoperability and for portability of applications and data. A real system is said to exhibit conformance if it complies with the requirements of the applicable POSIX standards.

A POSIX National Profile shall address the following two topics:

(a) Implementation Conformance requirements (details as given in 6.6);
(b) Application Conformance requirements (details as given in 6.7).

These requirements are stated in a POSIX National Profile.
In order to conform to a POSIX National Profile, a system shall perform correctly all the capabilities defined in the POSIX standards as mandatory and also any options of the POSIX standards which it claims to include. Conformance to a base standard in this context is conformance to a particular identified publication of a referenced base standard.

A POSIX National Profile shall be defined in such a way that testing of its implementation can be carried out as completely as possible, given the available testing methodologies.

6.6 Implementation Conformance

6.6.1 General

The choices of interfaces and functional behaviour made in a National Profile's implementation conformance requirements are specific to that National Profile and provide added facilities to the base standards. The choices are therefore not arbitrary, but need to be consistent with the purpose of the National Profile and consistent across the base standards referenced by it.

In order to avoid ambiguity between the National Profiles and the base standards, the implementation conformance requirements of a National Profile shall be specified, where possible, by reference to the conformance requirements of the referenced base standards.

6.6.2 Requirements

All systems claiming conformance to a National Profile shall support the required interfaces and functionality defined in the National Profile. The system may provide additional functions or facilities not required by the National Profile.
6.7 POSIX Application Conformance for National Profiles

All POSIX applications claiming conformance to the National Profile shall use only language-dependent services for one or more of the Language Options defined in the National Profile and the facilities provided by the National Profile and referenced base standards, and shall fall within one of the following categories:

6.7.1 Conforming POSIX Application

A Conforming POSIX Application requires only the parameters and options defined in the POSIX National Profile of the National Body concerned. Such an application shall include a statement of conformance that documents all options and limit dependencies, and all other standards used.

6.7.2 Conforming POSIX Application Using Extensions

A Conforming POSIX Application Using Extensions is an application that requires not only the parameters and options defined in the POSIX National Profile, but also other ISO/IEC standards and their National Profiles, and National Standards of the National Body concerned. The national extensions shall only be with respect to cultural services. Such an application shall fully document its requirements for these extended facilities, in addition to the documentation required of a Conforming POSIX Application.

7. Contents of National Profile

A POSIX National Profile shall have the following structure:

1. General

1.1 Scope
The scope of the National Profile shall be described. Provision of this section is mandatory.

1.2 Normative Reference
The standards which are referred to by the National Profile shall be listed. Provision of this section is mandatory.

1.3 Objectives
The objectives of the National Profile shall be described. Provision of this section is mandatory.

1.4 Conformance

1.4.1 Levels of conformance
If the National body enacts levels of conformance, the levels shall be specified. Provision of this section is mandatory.

1.4.2 System conformance
The requirements for a National body conforming implementation shall be specified.
Provision of this section is mandatory.

1.4.3 Application conformance
The requirements for a National body conforming application shall be specified. Provision of this section is mandatory.

2. Registry
The names which must not conflict with those of other National Profiles shall be listed. The names described here shall be registered with ISO when an official registration mechanism is established. Provision of this section is mandatory.

2.1 Locale names
The names of the locales which are specified in the National Profile shall be listed. Provision of this section is mandatory.

2.2 Symbolic name of characters
The list of symbolic names of extended characters, or the naming conventions for symbolic names of extended characters, shall be specified. Provision of this section is mandatory.

2.3 Name of coded character sets
The names of the coded character sets which are referred to by the National Profile shall be listed. The names may also be used for code conversion utilities/functions. Provision of this section is mandatory.

2.4 Character classes
If the National body specifies extra character classes in the LC_CTYPE category, their names and descriptions shall be specified. This section is optional.

2.5 Environment variables
If the National body specifies environment variables which are not specified in the POSIX standard, the names of the environment variables and their descriptions shall be specified. This section is optional.

2.6 Others

3. Parameters

3.1 POSIX
The range of POSIX related parameters which are allowed by the National Profile shall be specified. Provision of this section is mandatory.

3.1.1 Charmap
The contents of Charmaps shall be specified. Provision of this section is mandatory.

3.1.2 Locale definition
The contents of locale definitions shall be specified. Provision of this section is mandatory.

3.1.3 System parameter
The range of values of system parameters, e.g. _POSIX_NO_TRUNC and NAME_MAX, shall be specified. Provision of this section is mandatory.
3.2 C Language
The range of C Language related parameters which are allowed by the National Profile shall be specified, e.g. CHAR_BIT. Provision of this section is mandatory.

4. Options
Options which are required to be implemented shall be specified.

4.1 POSIX
The required optional facilities which are related to the POSIX standard shall be listed, e.g. the charmap option of the localedef utility. Provision of this section is mandatory.

4.2 Programming Language support
The facilities required with respect to programming language support shall be specified, e.g. programming language C as defined in ISO/IEC 9899 (including addendum 1 and technical corrigenda).

5. Error/exception handling
If the National body specifies the error/exception handling of some functions, the methods shall be specified. This section is optional.

6. Extensions

6.1 POSIX Extension
If the National body requires implementation of any enhanced facility, e.g. an additional environment variable, function, utility or utility option parameter, the enhanced facilities shall be specified. Provision of this section is mandatory.

6.2 Other Standards
If the National body requires National body conforming systems to implement any standards other than the POSIX standard, the standards shall be listed. Provision of this section is mandatory.

7. Data exchange
If the National body specifies any formats and mechanisms, or requires implementation of standards, the facilities shall be specified. This section is optional.

7.1 Archive file format
The format of archive files, e.g. tar and cpio, shall be specified.

7.2 Identification of coded character set
The mechanism to identify coded character sets in a file shall be specified.

7.3 Protocols
Communication protocols which a National body conforming implementation must implement shall be listed.

7.4 Profile for OSI
The profile which the National body has specified for OSI shall be referenced.
7.5 Media

If the National body has requirements on the media used for data exchange, the requirements shall be specified.

Annex A Informative reference

If the National body has any recommended parameters, options and extensions which are not required for profile conformance, these features should be listed in this section. This section is optional.

Annex B Notes and Rationale

8 Concept of POSIX national locale

The POSIX national locale provides information that an application can use to adapt its behaviour to national and cultural preferences. In this way the same binary application can be used according to the cultural expectations of users in different cultural environments. Locales thus enable binary portability of applications to diverse cultural environments. The POSIX national locale is logically a part of the POSIX National Profile. The benefits of a national locale are exemplified by the Danish example locale included in ISO/IEC 9945-2.

9 Contents of national locale

In creating a national locale, many things must be considered. Some data may be easier to determine than others. For each locale category some recommendations on its contents are given below.

9.1 Contents of character classification and transformation category

The character classification section of the locale is normally straightforward; an "A" is considered a letter in nearly all languages and is mapped to an "a" when the corresponding lower case letter is requested. Normally the LC_CTYPE definition in POSIX.2 Annex G or the POSIX equivalent of the "i18n" FDCC-set of ISO/IEC 14652 can be used without change.

9.2 Contents of numeric category

The data here is normally easy to determine for a given language and culture. The ISO standard uses the comma as decimal punctuation, and the period as the thousands delimiter.

9.3 Contents of monetary category

The monetary formats may be a bit difficult to specify.
The ISO 4217 currency code must be specified for the international format. The local specification may be a matter of choice, but there may be guidelines in national orthography specifications. Some countries may have obligations to display an amount in more than one currency, for example European countries using the Euro currency and a national currency. This cannot currently be done in an internationalized, portable way with the current POSIX standards. It is recommended to make a comment in the locale if this is the case. The current POSIX standards specify that the position of the international and domestic currency symbol in relation to the monetary amount must be the same. It is recommended to make a comment in the locale if this is not in line with the national practice.

9.4 Contents of time category

There may be problems with specifying the date format, including time zone names, which may not be well defined. A number of official sources can be consulted, including orthography definitions and numeric rendering standards. One thing to watch out for is whether the day and month names are written with an initial small letter; many languages do this, while some proprietary sources say that the names are spelled with an initial capital letter.

9.5 Contents of collating category

Defining the collating sequence is a major task. There are a number of versions of collation algorithms, each meeting specific requirements. One example is the telephone directory version, with "Mc" treated the same as "Mac", numbers spelled out, certain words like "the" ignored or moved to the end, the same entry entered several times at different places, and so on. Another level is the phonetic version, soundex, which is a little less complicated. A third version uses transcribed characters, as some librarians do when they see a Greek alpha and order it as a Latin "a". The version that is recommended for POSIX.2 locales is the systems interface level.
The collating order should be usable in POSIX systems tools like ls and sort. A requirement has been that it is deterministic; if two strings are different they will also differ when compared. Another issue has been efficiency. This is also called the dictionary version. The problem of pronunciation and transliteration has not been addressed. Instead it has been considered adequate just to look at the characters themselves, considering only characters at the systems level and not sounds. The level provided by the example locale in the POSIX.2 standard is a service for comparing strings, intended as a replacement for the standard strcmp() etc. routines, just a little more intelligent and adhering to what is expected to be culturally acceptable.

As an example, the Danish collating sequence nevertheless has as much intelligence built in as possible. The two letters are sorted as the single letter (A WITH RING), but the single letter is before in homonyms. The 4 level scheme of the Canadian-French sorting is being used, with the four levels being letter, accent, case and special character. For the sake of harmonization it was decided to use reverse sorting for the accents as the Canadians do; the natural choice might have been forward sorting here too, but as most of these words would be of French origin anyway, it was decided to follow the French rules. was implemented with the German rule, as seen in several German dictionaries. is ordered as but before it in homonyms.

As an example of specifying the collating sequence for accents, there were some rules indicated in the Danish sorting standard and in the official Danish orthography dictionary, but they were far from complete. The accent sequence in several ISO standards was then used where there was no clear Danish rule. About 25 accents have been ordered. For the non-Latin scripts it is recommended not to transcribe.
This allows the native collation order to be used for these scripts, like alpha, beta, gamma for Greek and a, be, ve, ghe for Cyrillic. Accented Greek and Cyrillic letters and ligatures should be put into the right places. It is recommended to take the sequence of the scripts from the ISO/IEC 14651 draft. That should settle the question of which scripts should come before others. A national specification may then choose this order, or may choose to let the native script or scripts come first, followed by the rest of the scripts in the order of the 14651 draft.

9.6 Contents of messages category

The messages category is a hook to provide real message service in the applications, and only yes/no is considered by the POSIX standard. For the yes/no answers it is recommended that only the first letter of the answer in the natural language be required, and also to allow the English forms "Yes"/"No" and the more culturally neutral 1/0 as answers. In Greek, the affirmative answer is "ne" written in the Greek script, so allowing "n" for negative answers could cause confusion for Greek language users.

10. Using locale templates

The POSIX.2 standard introduced a copy command for all sections of the locale. This is convenient for many purposes, and it ensures that two locales are equivalent for the copied category. A further step in building on previous art is described here. The collating sequences may vary a bit from country to country, but in many cases much of the collating sequence is the same. For instance the Danish sequence is very similar to the German, English or French, but it differs for about a dozen letters. The same can be said for Swedish or Spanish: generally the Latin collating sequence is the same, but a few characters collate differently. With the advent of quite general coded character set independent locales like the example Danish locale in POSIX.2 annex G, it would be convenient if the few differences could be specified just as changes to an existing locale.
The specification job could then be reduced by orders of magnitude, from say about 300 Latin letters (or 30,000 characters of IS 10646) to about 10 to 30. This would also improve the overview of what the changes really are. Therefore it is recommended to use the reorder-after construct (see 10.2) in the LC_COLLATE section of the locale file format for producing new national locales.

10.1 Internationalization data collections

ISO/IEC JTC1/SC22/WG15, the ISO POSIX working group, has been collecting POSIX locales for a number of years, and about 60 locales and 150 charmaps are available now.

Note 1: The electronic data is freely available at the address http://www.dkuug.dk/i18n/WG15-collection

A formal registry has been established in ISO/IEC 15897 and CEN ENV 12005, with entries encompassing a range of internationalization related data, including POSIX national profiles, POSIX locales, POSIX charmaps and lists of symbolic character names ("repertoiremaps").

Note 2: The electronic data is freely available at the address http://www.dkuug.dk/cultreg.

10.2 Reorder-after technique

See the description in the CEN ENV 12005 registration standard. A tool to implement the "reorder-after" construct is given in annex C.

11. Concept of charmap

The charmap is a file describing a coded character set. It is used together with a locale file by the localedef utility to produce a binary locale. The charmap describes the mapping between symbolic character names, as used by the locale, and the binary encoding of the characters. One locale can be written to support a number of coded character sets or encodings by using symbolic character names, which are then mapped to the actual binary encoding via a charmap for each of the coded character sets employed, thus giving a binary locale for each of the encodings. The charmaps may also be used together with different locales, when these use the same symbolic character names.
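The relationship between one locale and several charmaps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the localedef implementation: the symbolic names <ae> and <a-ring>, the two charmap fragments and the four-symbol collating sequence are made up for the example, and only the simple "symbolic name, encoding" line form is handled.

```python
import re

# Two hypothetical charmap fragments for the same symbolic names.
LATIN1 = r"""
CHARMAP
<a>       \x61
<z>       \x7a
<ae>      \xe6
<a-ring>  \xe5
END CHARMAP
"""
UTF8 = r"""
CHARMAP
<a>       \x61
<z>       \x7a
<ae>      \xc3\xa6
<a-ring>  \xc3\xa5
END CHARMAP
"""

def parse_charmap(text, escape="\\"):
    """Return {symbolic name: encoded octets} for the CHARMAP section."""
    mapping, inside = {}, False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CHARMAP":
            inside = True
        elif fields[:2] == ["END", "CHARMAP"]:
            inside = False
        elif inside and len(fields) >= 2:
            pairs = re.findall(re.escape(escape) + r"x([0-9A-Fa-f]{2})", fields[1])
            mapping[fields[0]] = bytes(int(h, 16) for h in pairs)
    return mapping

def encode(names, charmap):
    """Encode a sequence of symbolic names with the given charmap."""
    return b"".join(charmap[name] for name in names)

def decode(raw, charmap):
    """Split encoded octets back into symbolic names (longest match first)."""
    reverse = {octets: name for name, octets in charmap.items()}
    longest = max(len(octets) for octets in reverse)
    names, pos = [], 0
    while pos < len(raw):
        for size in range(longest, 0, -1):
            chunk = raw[pos:pos + size]
            if chunk in reverse:
                names.append(reverse[chunk])
                pos += len(chunk)
                break
        else:
            raise ValueError("octets not covered by the charmap")
    return names

# One collating sequence, written once in symbolic names (Danish-like order).
COLLATION = ["<a>", "<z>", "<ae>", "<a-ring>"]
WEIGHT = {name: rank for rank, name in enumerate(COLLATION)}

def sort_key(raw, charmap):
    """Collation key that depends on the symbolic names, not on the octets."""
    return [WEIGHT[name] for name in decode(raw, charmap)]
```

Sorting data encoded with either charmap by sort_key gives the same sequence of symbolic names, although the octets differ and a plain octet comparison would place <a-ring> before <ae> in the first charmap.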
WG15, the ISO/IEC POSIX working group, has collected about 150 charmaps, which can be readily applied by the localedef utility to a locale. The collection comprises almost all of the ISO/IEC 2375 coded character set registry, and some 60 vendor specific character sets.

Note: see clause 10.1 for availability of this data.

Thus with just one specification of a national locale, uniform collating for many character sets is defined: the characters will always come in the same sequence regardless of which character set is employed. Likewise only one definition of the date format and the other cultural items needs to be written, and that specification is then valid for many character sets.

12. Contents of charmap

The contents of a charmap file are described in ISO/IEC 9945-2 clause 2.4.1. A number of characters need to be present, see table 2-4, and see table 2-5 for the optional inclusion of control characters. This is almost the same as the repertoire of ISO/IEC 646 IRV.

In the charmap file a number of keywords may optionally be specified. The <escape_char> and <comment_char> keywords may specify alternate characters for the escape character and comment character, respectively. Common replacements for the default \ and # characters are / and %, which may lead to better portability, as \ and # are known to change representation when transmitted in certain email environments. The <code_set_name> keyword gives the name of the character encoding, using graphic characters from ISO/IEC 646 IRV. <mb_cur_max> and <mb_cur_min> give the maximum and minimum number of bytes in an encoding, respectively. They default to 1 and to the value of <mb_cur_max>, respectively.

Each of the lines defining the mapping between a symbolic name and an encoding may take a third argument, namely a comment. There is no need to specify a comment character before the comment, but it does no harm. Giving for example the ISO/IEC 10646 short identifier and the long name may enhance the readability of the charmap considerably.

Annex A. Locale related descriptions in POSIX

We have an extract in source form from the POSIX editor, with permission to reproduce it. It is not reproduced here out of consideration for the rain forests, as it is about 70 pages. It is an extract of POSIX.2 covering the first sections, including 2.5 locales, and the 4.13 date format.

Annex B. Symbolic character names

As in POSIX.2 annex G and ISO/IEC FCD 14652 clause 6. As it is about 40 pages, it is not reproduced in this draft of the TR.

Annex C. Convenient tools for producing national locale

A script has been written in the "awk" language defined in POSIX.2 to implement the "reorder-after" construct.

    # The collation lines are kept as a doubly linked list: follow[] and
    # back[] hold the next and previous line numbers, cont[] the text of
    # each line, and symb[] the line number of each collating symbol.
    BEGIN { comment = "%"; back[0] = follow[0] = 0 }

    /LC_COLLATE/     { coll = 1 }
    /END LC_COLLATE/ {
        coll = 0
        # write out the (possibly reordered) lines by walking the list
        for (lnr = 1; lnr; lnr = follow[lnr])
            print cont[lnr]
    }

    {
        if (coll == 0)
            print $0
        else {
            if ($1 == "copy") {
                # load the LC_COLLATE section of the copied locale
                file = $2
                while ((getline < file) > 0)
                    if ($1 == "LC_COLLATE")
                        copy_lc = 1
                    else if ($1 == "END" && $2 == "LC_COLLATE")
                        copy_lc = 0
                    else if (copy_lc) {
                        lnr++
                        follow[lnr - 1] = lnr
                        back[lnr] = lnr - 1
                        cont[lnr] = $0
                        symb[$1] = lnr
                    }
                close(file)
            } else if ($1 == "reorder-after") {
                ra = 1
                after = symb[$2]
            } else if ($1 == "reorder-end")
                ra = 0
            else {
                # link the new line in after the current position
                lnr++
                if (ra) follow[lnr] = follow[after]
                if (ra) back[follow[after]] = lnr
                follow[after] = lnr
                back[lnr] = after
                cont[lnr] = $0
                if (ra && $1 != comment && $1 != "") {
                    # unlink the earlier entry for the same symbol
                    old = symb[$1]
                    follow[back[old]] = follow[old]
                    back[follow[old]] = back[old]
                    symb[$1] = lnr
                }
                after = lnr
            }
        }
    }

Annex D. Use of ISO/IEC 10646 in POSIX standards

D.1 Introduction and scope

To serve the widest possible audience, POSIX standards should be able to handle the most encompassing character set, and the best candidate for this is the ISO/IEC 10646-1:1993 standard. The following gives guidance on how to accomplish this goal.
The field of application is seen to cover many areas, such as global organisations interested in just one character set organisation-wide, European government institutions, eastern Asia and many other places.

ISO/IEC 10646-1:1993, the Universal Multiple-Octet Coded Character Set (UCS), provides the capability to encode multi-script text within a single coded character set. However, because UCS is designed to use all code points available, null bytes and the code values of the other ISO/IEC 646:1991 IRV (also known as ASCII) characters, including the code value of the ISO 646 solidus ("/") character, are not protected. This makes the UCS character encoding incompatible with many existing ISO 646 based POSIX operating system implementations. The fact that UCS also uses code points that are used for ISO 6429 control characters introduces further problems for communication and application software. From these problems it was clear that a POSIX internal encoding was required for the ISO/IEC 10646 coded character set.

In the following, a survey is first given of the possible coded representation forms of UCS and UCS transformation formats and their respective characteristics. Then each of the handling areas (data storage, file names, internal processing, communications, interprocess communication) of the POSIX operation is analyzed. Finally guidelines are given for POSIX standards. A revised TR 10176 with guidelines for support of IS 10646 has been published, and there may be further recommendations in this work of relevance to POSIX.

D.2 UCS coded representation forms and UCS transformation formats

D.2.1 POSIX internal encoding

For the POSIX internal encoding UTF-8 was considered suitable. The objective of UTF-8 is to provide a UCS transformation format which also meets the requirement of being usable on historical POSIX operating system file systems in a non-disruptive manner.
The UTF-8 transformation format represents both UCS-2 and UCS-4 in a compatible format using multiple-octet coded characters of lengths 1, 2, 3, 4, 5, and 6 octets:

    Octets  Bits  Hex Min   Hex Max   Byte Sequence in Binary
    1        7    00000000  0000007F  0vvvvvvv
    2       11    00000080  000007FF  110vvvvv 10vvvvvv
    3       16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
    4       21    00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    5       26    00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    6       31    04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The UCS value is the concatenation of the v-bits in the multiple-octet encoding, where the v-bits are the 0's and 1's that constitute the UCS value. Thus UTF-8 has the capability of handling existing ISO 646 files without change, and all octets in the ISO 646 range (having a value in the range 0-127) can be safely assumed to represent the normal ISO 646 character.

D.2.2 Other forms of IS 10646

IS 10646 has two forms: UCS-2 and UCS-4, a 16-bit and a 31-bit coded representation of the character set, respectively. IS 10646 is planned to have more characters than can be represented in 64 k code positions, so the general case of UCS-4 needs to be considered.

ISO/IEC 10646-1:1993 had a transformation format UTF-1, which was informative, and it has now been removed from the standard by the amendment ISO/IEC 10646-1 AM4:1996. UTF-8 is aimed at the same purpose, and has more capability. UTF-8 has been approved as part of UCS via the amendment ISO/IEC 10646-1 AM2:1996.

Another transformation format of IS 10646, UTF-16, has also been approved, as ISO/IEC 10646-1 AM1:1996, but it cannot accommodate all of IS 10646 (it accommodates about 1 million characters). It employs techniques like those in UTF-8, with ranges indicating how many octets are required to form one character, but without the added functionality of being backwards compatible with ISO/IEC 646 and ISO/IEC 2022 encodings (which is a functionality of UTF-8).
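The UTF-8 table in D.2.1 can be read directly as an encoding procedure. The following Python function is a minimal sketch of that reading; the function name and the table layout in the code are illustrative choices, not part of any standard. Note that Python's built-in "utf-8" codec follows the later restriction of UTF-8 to four octets, so only the first rows can be cross-checked against it.

```python
def utf8_encode(ucs):
    """Encode one UCS value (0..0x7FFFFFFF) following the table in D.2.1."""
    if not 0 <= ucs <= 0x7FFFFFFF:
        raise ValueError("UCS value outside the 31-bit range")
    if ucs <= 0x7F:
        return bytes([ucs])                  # 1 octet: 0vvvvvvv

    # (Hex Max, number of octets, marker bits of the first octet), rows 2..6
    rows = [
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ]
    for limit, length, marker in rows:
        if ucs <= limit:
            break

    octets = []
    for _ in range(length - 1):
        octets.append(0x80 | (ucs & 0x3F))   # trailing octet: 10vvvvvv
        ucs >>= 6
    octets.append(marker | ucs)              # first octet carries the top bits
    return bytes(reversed(octets))
```

For example, utf8_encode(0xE5) gives the two octets C3 A5 for LATIN SMALL LETTER A WITH RING ABOVE, and any value up to 7F is returned as the single ISO 646 octet.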
The most general of the above encodings of IS 10646 is UCS-4. It has the property of being constant-width, which may be easier to handle than the multiple-octet UTF-8. As a file and interchange code it has several problematic properties: it uses codes in conflict with ISO/IEC 646, ISO/IEC 2022 and ISO/IEC 6429, it depends on the byte ordering (little-endian vs big-endian) of the hosting machine architecture, and it uses 4 octets per character. Here UTF-8 is clearly superior for POSIX internal encoding. UCS-4 may have advantages as an internal processing code, and as an inter-process encoding, for C language widechar-like encodings, but with the ISO/IEC C language amendment (AM1) with full support for multibyte coded character sets, that advantage may be diminishing. UTF-8 is equally well defined and capable of representing all IS 10646 characters, and given its strengths in other areas it may well be chosen also for internal processing and inter-process communication. Internal processing is not in the scope of POSIX interfaces, anyway.

D.2.3 UCS levelling

IS 10646 has 3 levels of support: level 1 without combining characters, level 2 with combining characters in some scripts, and level 3 with unrestricted use of combining characters. SC22 has by resolutions from the 1993 Paris plenary recommended that all SC22 standards be enabled for level 3 data, but that the semantics of combining characters not be addressed currently. Thus there is no specific SC22 request for further support of levels 2 and 3, but eventually there could be a need for support of these levels. SC22 also recommended use of IS 10646 terminology throughout SC22 standards, and this may require an alignment of current POSIX work, though it is believed that current POSIX work is already well aligned with IS 10646 with respect to terminology.
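What "enabled for combining characters, but without addressing their semantics" means for data can be illustrated with a small Python sketch. The two strings and the helper name are chosen for the example only: the same letter is written once as a single precomposed character and once as a base letter followed by a combining character, and the two are simply treated as different coded sequences.

```python
precomposed = "\u00e5"    # LATIN SMALL LETTER A WITH RING ABOVE, one character
combining = "a\u030a"     # LATIN SMALL LETTER A followed by COMBINING RING ABOVE

def octets(text):
    """Return the UTF-8 octets of a string as a hexadecimal string."""
    return text.encode("utf-8").hex()

# Without any special handling the two forms are different strings with
# different octet sequences; no equivalence between them is assumed.
same_string = precomposed == combining
```

A system that passes such data through unchanged is enabled for it in the sense used above, even though it does not treat the two forms as equivalent.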
D.3 Problems in POSIX handling of UCS

There are several challenges presented by UCS which must be dealt with by present implementations of the POSIX operating system.

D.3.1 Data storage

The most significant of these challenges is the encoding scheme used by UCS. More precisely, the challenge is the marrying of the UCS standard with existing programming languages and existing operating systems. Prominent among the operating system UCS handling concerns is the representation of the contents of data in files. An underlying assumption is that there is an absolute requirement to maintain the existing operating system software investments while at the same time taking advantage of the large number of characters provided by UCS.

For UTF-8 the representation of ISO 646 data is exactly the same, and for the parts of ISO/IEC 8859, right hand side characters will need two octets for representation. For ideographic characters in the BMP, the representation will be three octets. This does not dramatically change the amount of data storage consumed.

D.3.2 File names and internal processing

The UTF-8 transformation format was originally conceived as a file system safe transformation format of UCS, to allow historically ISO 646 based POSIX operating systems to cope with the representation and handling in file names of the large number of characters that can be encoded by UCS. In addition, from an internal operating system (kernel) viewpoint the handling of a large character set is only a problem for file names, which are analyzed only for the solidus ("/") delimiter in order to parse a name into filename components. As UTF-8 can represent the full encoding of IS 10646 and is backwards compatible with ISO 646, UTF-8 handling is sufficient for POSIX internal encoding.
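The storage figures in D.3.1 and the file name argument in D.3.2 can be checked with a short Python sketch. The sample characters, the path name and the function names are illustrative assumptions; Python's built-in "utf-8" codec is used for the encoding.

```python
# One sample character for each of the three cases named in D.3.1.
samples = {
    "ISO 646 letter A": "A",
    "ISO 8859-1 right hand side letter a with ring": "\u00e5",
    "BMP ideograph U+4E2D": "\u4e2d",
}
sizes = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

def split_path(raw):
    """Split an encoded path name on the solidus octet only, without
    decoding it, which is all the kernel needs to do."""
    return raw.split(b"/")

def multi_octet_is_safe(ucs):
    """True if no octet of the UTF-8 encoding of this value lies in the
    ISO 646 range, so it can never be mistaken for NUL or solidus."""
    return all(octet >= 0x80 for octet in chr(ucs).encode("utf-8"))
```

Because every octet of a multiple-octet sequence has its high bit set, splitting the raw octets on 2F hex never cuts a character in two, and the components can be handled as opaque octet strings.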
D.3.3 Communications

Current ISO POSIX standards do not address communication, but as ISO 6429 control characters are often used in communication, and the UTF-1 transformation format was originally created to avoid control character problems in communication, UTF-1 could have been the choice. As UTF-1 is being removed from UCS and UTF-8 introduced, having the same capabilities with respect to control character problem solving, UTF-8 is the recommended choice in POSIX communication interfaces.

D.3.4 Interprocess communication

Communication between POSIX processes would probably use internal data formats; for example, integers would be transferred in binary form. As it could be recommended that programs internally use a C language widechar style encoding of characters, a UCS-2 or UCS-4 format could be recommended. On the other hand interprocess communication is often across networks and between heterogeneous systems; since UCS-2 and UCS-4 are dependent on machine architecture, UTF-8 may therefore be the preferred candidate. UTF-8 would in many cases also be less space-consuming, which may be a significant advantage when using low-capacity network lines.

D.4 Recommendation

According to the above analysis, UTF-8 is the best candidate for POSIX internal encoding of UCS in the areas of data storage, file names and internal operating system (kernel) processing, and communication, where otherwise UCS-2 or UCS-4 would have been used for coded data. Furthermore UTF-8 is a good candidate for UCS representation in interprocess communication. It is thus recommended to use the UTF-8 transformation format whenever UCS is used in POSIX interfaces. As POSIX interfaces in principle should be coded character set independent, there is no general need to require the use of UTF-8 in POSIX standards, but guidance could be given in rationales.
A specific recommendation is that the portable archive exchange utility "pax" be revised to be able to specifically use UTF-8 for file names, and that the use of UTF-8 be clearly identified.

D.5 Consequences

X/Open has raised a number of problems with the use of ISO/IEC 10646 in POSIX in the document WG15 N621. With the preceding recommendation the problems can be addressed as follows:

- In UTF-8 the repertoire of ASCII is encoded as ASCII (ISO/IEC 646 IRV).
- We know of no codesets with control characters encoded in the full single octet range 0 thru 7F, but many use 0 thru 1F hex and 7F, and some the range 80 thru 9F. UTF-8 has reserved these octet ranges for control characters.
- Zero value octets and octets equating '/' only appear in UTF-8 as representations of the NUL and '/' character respectively.
- "Combining characters" need not have special processing as per SC22 resolutions, except for possibly a width specification in a locale.
- According to the ISO/IEC 10646 standard there are no equivalences prescribed between sequences of characters with combining characters and some "precomposed" characters, and the SC22 plenary recommendation is that there need not be special handling of this.
- It should not be necessary to process composite sequences in a special way.

Annex E. Registry data

The following schema is needed for registration with DIS 15897/ENV 12005:

Application form for a Cultural Specification

Please specify all data relevant for the Cultural Specification type, indicating non-available data by "not available". Please fill out one form for each Cultural Specification submitted. When completed, please send it to the Registration Authority as listed in clause 4.

1. Cultural Specification type number: ______________________________
2. Organization name of Sponsoring Authority: ________________________
3. Organization postal address: _____________________________________
   __________________________________________________________________
4. Name of contact person: _________________________________
5. Electronic mail address of contact person: ______________________
6. Telephone number for contact person: + ___ ______________________
7. Fax number for contact person: + ___ ______________________

For Narrative Cultural Specifications and POSIX Locales (type 1 and 2):
8. Natural language, as specified in ISO 639: ______
9. Territory, as two-letter form of ISO 3166: ______

For POSIX Charmaps and POSIX Repertoiremaps (type 3 and 4):
10. The proposed POSIX Charmap or POSIX Repertoiremap name: ________________

For all 4 types:
11. If not for general use, an intended user audience, e.g. librarians: _______
12. If for use of a special application, the short application name: ___________
13. Short name for Sponsoring Authority, used in token identifier: ______________
14. Version number with zero or more dots: __________
15. Revision date in ISO 8601 format: ____________

The Cultural Specification identified above, and of which we hold copyright, is allowed for free distribution.

Date: ______________ Authorized signature: __________________________

Annex F. Examples of National Profile - Japan

[An example of a Japanese National Profile is ready for inclusion here. Since the text is so large, the example is intentionally omitted from this review version of the document. Please contact the Japanese National Body for the details of the Japanese National Profile.]

Annex G. Examples of National Locale - Denmark

[An example of a Danish National Locale will be provided here.]

Bibliography

Index