# Make char16_t/char32_t string literals be UTF-16/32

Document Number: P1041R0  
Date: 2018-04-24  
Audience: Evolution Working Group  
Reply-to: cpp@rmf.io

## Introduction

C++11 introduced character types suitable for code units of the UTF-16 and
UTF-32 encoding forms, namely `char16_t` and `char32_t`. Along with this, it
also introduced new string literals whose types are arrays of those two
character types, prefixed with `u` and `U`, respectively. And last but not
least, it also introduced *UTF-8 string literals*, prefixed with `u8`, whose
types are arrays of `const char`. Of these three new kinds of string literal, only one
has a guarantee about the values that the elements of the array have; in other
words, only one has a guaranteed encoding form, the *UTF-8 string literals*.

The standard text hints that the `char16_t` and `char32_t` string literals are
intended to be encoded as UTF-16 and UTF-32, respectively, but, unlike for
*UTF-8 string literals*, it never explicitly states such a requirement.

## Motivation

In defining `char16_t` string literals ([lex.string]/10), the standard
mentions "surrogate pairs":

> A string-literal that begins with `u`, such as `u"asdf"`, is a `char16_t`
> string literal.  A `char16_t` string literal has type “array of *n* `const
> char16_t`”, where *n* is the size of the string as defined below; it is
> initialized with the given characters. A single *c-char* may produce more
> than one `char16_t` character in the form of surrogate pairs.

Further down, when defining the size of `char16_t` string literals
([lex.string]/15), there is another mention of "surrogate pairs":

> The size of a `char16_t` string literal is the total number of escape
> sequences, *universal-character-names*, and other characters, plus one for
> each character requiring a surrogate pair, plus one for the terminating
> `u'\0'`.  [*Note:* The size of a `char16_t` string literal is the number of
> code units, not the number of characters. — *end note*]
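
The size rule quoted above can be checked at compile time. The following is a minimal sketch; the array names are illustrative and the code points were picked arbitrarily, one inside the basic multilingual plane and one outside it.

```cpp
// U+00E9 is in the basic multilingual plane: one code unit.
constexpr char16_t bmp16[] = u"\u00E9";
// U+1F600 is outside it and requires a surrogate pair: two code units.
constexpr char16_t astral16[] = u"\U0001F600";

// One universal-character-name, plus one for the terminating u'\0'.
static_assert(sizeof(bmp16) / sizeof(char16_t) == 2,
              "1 code unit + terminator");
// One universal-character-name, plus one for the surrogate pair, plus one
// for the terminating u'\0'.
static_assert(sizeof(astral16) / sizeof(char16_t) == 3,
              "2 code units + terminator");
```

Note that these assertions only constrain the number of elements; nothing in the quoted wording says which values those elements hold.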

For `char32_t` string literals, the definition of their size ([lex.string]/15)
essentially limits the encoding form used to one that doesn't have more than
one code unit per character:

> The size of a `char32_t` or wide string literal is the total number of escape
> sequences, *universal-character-names*, and other characters, plus one for
> the terminating `U'\0'` or `L'\0'`.
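
As an illustration of this rule, the sketch below (with illustrative names and arbitrarily chosen code points) shows that a `char32_t` string literal always has exactly one element per character, regardless of whether the character lies in the basic multilingual plane.

```cpp
constexpr char32_t bmp32[]    = U"\u00E9";      // U+00E9, inside the BMP
constexpr char32_t astral32[] = U"\U0001F600";  // U+1F600, outside the BMP

// In both cases: one universal-character-name, plus one for the
// terminating U'\0'.
static_assert(sizeof(bmp32) / sizeof(char32_t) == 2,
              "1 character + terminator");
static_assert(sizeof(astral32) / sizeof(char32_t) == 2,
              "1 character + terminator");
```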

Additionally, the standard constrains the range of *universal-character-names*
to the range that is supported by all of the UTF encoding forms discussed here:

> Within `char32_t` and `char16_t` string literals, any
> *universal-character-names* shall be within the range `0x0` to `0x10FFFF`.

All of these requirements, while never explicitly naming the UTF-16 or UTF-32
encoding forms, strongly imply that these are the encoding forms intended.
Furthermore, it would be questionable for an implementation to pick any other
encoding forms for these string literals: there is no well-known encoding form
that uses a concept named "surrogate pair" other than UTF-16, and there is no
well-known encoding form that encodes each character as a single 32-bit code
unit other than UTF-32.

In practice, all implementations use UTF-16 and UTF-32 for these string
literals. C++ should standardize this practice and make these requirements
explicit instead of just hinting at them.

## Proposal

This proposal renames "`char16_t` string literals" and "`char32_t` string
literals" to "UTF-16 string literals" and "UTF-32 string literals", to match
the existing "UTF-8 string literals", and explicitly requires the object
representations of those literals to be the values that correspond to the
UTF-16 and UTF-32 (respectively) encodings of the given characters.
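
To illustrate what this requirement means for user code, here is a minimal sketch. The helper functions `high_surrogate` and `low_surrogate` are hypothetical names introduced only for this example; they implement the usual UTF-16 surrogate pair arithmetic for code points above U+FFFF. The assertions are expected to hold on existing implementations already; under this proposal they would be guaranteed.

```cpp
// Hypothetical helpers, not part of the proposal: UTF-16 surrogate pair
// arithmetic for a code point in the range U+10000 to U+10FFFF.
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}
constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

constexpr char16_t utf16[] = u"a\U0001F600";
constexpr char32_t utf32[] = U"a\U0001F600";

// UTF-16: 'a' is one code unit, U+1F600 is a surrogate pair.
static_assert(utf16[0] == 0x0061, "LATIN SMALL LETTER A");
static_assert(utf16[1] == high_surrogate(0x1F600), "high surrogate");
static_assert(utf16[2] == low_surrogate(0x1F600), "low surrogate");
static_assert(utf16[3] == 0, "terminator");

// UTF-32: every code unit is the code point value itself.
static_assert(utf32[0] == 0x0061, "LATIN SMALL LETTER A");
static_assert(utf32[1] == 0x1F600, "U+1F600 as a single code unit");
static_assert(utf32[2] == 0, "terminator");
```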

## Technical Specifications

- Change [lex.string]/10:

    > A *string-literal* that begins with `u`, such as `u"asdf"`, is a
    > <del>`char16_t` string literal</del><ins>*UTF-16 string literal*</ins>. A
    > <del>`char16_t` string literal</del><ins>UTF-16 string literal</ins> has
    > type “array of *n* `const char16_t`”, where *n* is the size of the string
    > as defined below; it is initialized with the given characters. A single
    > *c-char* may produce more than one `char16_t` character in the form of
    > surrogate pairs.

- Change [lex.string]/11:

    > A *string-literal* that begins with `U`, such as `U"asdf"`, is a
    > <del>`char32_t` string literal</del><ins>*UTF-32 string literal*</ins>.
    > A <del>`char32_t` string literal</del><ins>UTF-32 string literal</ins>
    > has type “array of *n* `const char32_t`”, where *n* is the size of the
    > string as defined below; it is initialized with the given characters.

- Insert a paragraph between [lex.string]/10 and /11:

    > <ins>For a UTF-16 string literal, each successive element of the object
    > representation has the value of the corresponding code unit of the UTF-16
    > encoding of the string.</ins>

- Insert a paragraph between [lex.string]/11 and /12:

    > <ins>For a UTF-32 string literal, each successive element of the object
    > representation has the value of the corresponding code unit of the UTF-32
    > encoding of the string.</ins>

- Change [lex.ccon]/4:

    > A character literal that begins with the letter `u`, such as `u'x'`, is a
    > character literal of type `char16_t`<ins>, known as a *UTF-16 character
    > literal*</ins>. The value of a <del>`char16_t`</del><ins>UTF-16</ins>
    > character literal containing a single *c-char* is equal to its ISO 10646
    > code point value, provided that the code point value is representable
    > with a single 16-bit code unit (that is, provided it is in the basic
    > multi-lingual plane). If the value is not representable with a single
    > 16-bit code unit, the program is ill-formed. A
    > <del>`char16_t`</del><ins>UTF-16</ins> character literal containing
    > multiple *c-char*s is ill-formed.

- Change [lex.ccon]/5:

    > A character literal that begins with the letter `U`, such as `U'y'`, is a
    > character literal of type `char32_t`<ins>, known as a *UTF-32 character
    > literal*</ins>. The value of a
    > <del>`char32_t`</del><ins>UTF-32</ins> character literal containing a
    > single *c-char* is equal to its ISO 10646 code point value. A
    > <del>`char32_t`</del><ins>UTF-32</ins> character literal containing
    > multiple *c-char*s is ill-formed.
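
The character literal rules in [lex.ccon]/4 and /5 can be illustrated with the following sketch. The variable names are illustrative only. The wording changes above rename these literals but are not intended to change the behaviour shown here.

```cpp
#include <type_traits>

constexpr char16_t e_acute16 = u'\u00E9';      // U+00E9, inside the BMP
constexpr char32_t e_acute32 = U'\u00E9';
constexpr char32_t astral32c = U'\U0001F600';  // U+1F600, outside the BMP

// The prefix determines the type of the literal.
static_assert(std::is_same<decltype(u'x'), char16_t>::value, "u prefix");
static_assert(std::is_same<decltype(U'y'), char32_t>::value, "U prefix");

// The value is the ISO 10646 code point value.
static_assert(e_acute16 == 0x00E9, "code point value, one 16-bit code unit");
static_assert(e_acute32 == 0x00E9, "code point value");
static_assert(astral32c == 0x1F600, "code point value");

// Ill-formed: U+1F600 is not representable with a single 16-bit code unit.
// constexpr char16_t bad = u'\U0001F600';
```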

## Interaction with other papers

Currently, the standard lacks a normative reference to UTF-16 and UTF-32;
however, it also lacks such a reference for UTF-8. This paper assumes that
this problem will be fixed for all three encodings in another paper,
potentially
[D1025R0](https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md)
(*Update The Reference To The Unicode Standard*).

This paper was also written so as to not conflict with
[P0482R2](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html)
(*char8_t: A type for UTF-8 characters and strings (Revision 2)*).