ISO 2022 wchar_t encoding

by Markus Kuhn

This proposal suggests a standard encoding to be used for representing characters in the ISO C wchar_t data type. The encoding is upwards compatible with ISO 10646, while at the same time it can be used to faithfully preserve all information present in an ISO 2022 character stream. The main aim of this proposal is to demonstrate to sceptics of the ISO C 99 option "__STDC_ISO_10646__" that using UCS in wchar_t and full ISO 2022 round-trip compatibility are not mutually exclusive and can be achieved in a simple and elegant way. This proposal is not encouraging the use of ISO 2022 in Unix locales. On the contrary, this text explains why using ISO 2022 in Unix multi-byte encodings would be bad, even dangerous, engineering, and why ISO 2022 support should really be removed from those implementations that have unfortunately decided to provide it.

The proposal is at this stage EXPERIMENTAL and should not yet be fielded in widely distributed implementations without consulting the author first. This proposal is discussed publicly on linux-utf8.

Introduction

When the ISO C 90 standard introduced the wchar_t type, no standard encoding was specified. As a result, implementations have emerged that use locale-dependent encodings in wchar_t, often simply a concatenation of the bytes representing the same character in the current multi-byte encoding. This is problematic because it threatens binary compatibility. With these implementations, the meaning of an L"..." wchar_t string literal stored in an executable changes as the locale changes. This results in binaries that can be executed safely only under a single locale, which makes these implementations unsuitable for international use. At the very least, it results in a programming environment in which it is simply not safe to use wchar_t string literals.

Since 1993, we have finally had available a single unified well-documented standard encoding of all characters that satisfies the needs of every language equally, namely ISO 10646 = UCS = Unicode. UCS is clearly the obvious choice today for a standard wide character data type. Most programming languages or communication standards introduced or revised after 1993 have taken that path (Java, Ada95, TCL, Perl, Python, C#, XML, HTML). Unfortunately, due to existing implementation practice, the ISO C committee was not able to mandate in the new second edition of the C standard (ISO C 99) that wchar_t always be UCS encoded. Instead, UCS encoding merely became a recommended practice, and implementors who follow it can signal this guarantee to applications in a standardised way by defining a standard macro:

__STDC_ISO_10646__
An integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month. [ISO/IEC 9899:1999(E), §6.10.8]

The GNU C library (glibc 2.2), used for instance under Linux, is one example of a major platform that already implements the "__STDC_ISO_10646__" guarantee. In all its locales (with a large range of different multi-byte encodings), wchar_t (a signed 32-bit integer type) always represents characters using their ISO 10646 code. Other major POSIX platforms are expected to follow.
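As a minimal sketch of how an application might test for this guarantee at compile time, the following C fragment checks the standard macro. The helper names ucs_version and wchar_is_ucs are hypothetical and not part of any library.

```c
#include <wchar.h>

/* Hypothetical helper: returns the yyyymmL value of __STDC_ISO_10646__,
 * or 0L if the implementation does not make the UCS guarantee. */
static long ucs_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Hypothetical helper: 1 if wchar_t values are ISO 10646 codes, else 0. */
static int wchar_is_ucs(void)
{
    return ucs_version() != 0L;
}
```

Code that stores UCS values directly in wchar_t could fall back to explicit conversion when wchar_is_ucs() returns 0.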

However, there is a group of (almost exclusively Japanese) "__STDC_ISO_10646__" sceptics who oppose the notion of treating all POSIX multi-byte encodings as UCS subsets. Their argument is that "__STDC_ISO_10646__" is incompatible with the ability to represent various ISO 2022 based email encodings such as ISO-2022-JP2 without loss of information.

Even though every single encoding used in ISO-2022-JP2 is perfectly round-trip compatible with ISO 10646, the information about which encoding a character came from is lost after a conversion to UCS. For instance, both ISO 8859-7 and JIS X 0208 contain Greek letters, and these map to the same Greek letter in UCS, so when we convert back from UCS to ISO-2022-JP2, we don't know whether to encode a Greek character in ISO 8859-7 or JIS X 0208. One might wonder what purpose this distinction has. A Greek character is after all just a Greek character, no matter whether it was encoded in ISO 8859-7 or JIS X 0208. The problem is that in Japan, it has become common practice to abuse the encoding as a style indicator. By convention, JIS X 0208 characters are displayed with square (double-width) fonts, while ISO 8859-7 characters are displayed single width. Because of this widespread confusion between encoding and style, Japanese Unix geeks are disappointed if the encoding information is lost after a Unicode round trip.

There are several answers to this argument. The first is obviously that it was a bad idea to start with, to piggyback style information (e.g., display width) onto an encoding, and the entire appreciation of this abuse of ISO 2022 should be reconsidered. The second is that ISO 2022 based encodings are not at all suitable for use as Unix locales. People who advocate ISO 2022 support in POSIX locales simply haven't understood the mandatory requirements that an encoding has to fulfil to be usable in a POSIX locale:

ASCII compatibility: In any C and POSIX multi-byte encoding, ASCII characters must be represented with bytes in the range 0x00 to 0x7f only. No other character may contain in its representation any byte in the range 0x00 to 0x7f. The reason for this requirement is that many ASCII bytes have special meanings in the POSIX environment, such as '/' as the path component separator in file names and '%' or '\' as escape symbols in C API functions such as printf or scanf, to name just a few obvious examples of the many problems that non-ASCII compatibility causes.
Statelessness: In any C and POSIX multi-byte encoding, a sequence of bytes must always represent the same sequence of characters, independent of its location relative to other byte sequences. This forbids the use of shift sequences and other ISO 2022 mechanisms. The reason for this requirement is that the POSIX environment provides numerous sophisticated substring matching functions (such as the shell's filename globbing and sed's regular expression matching) that were fundamentally built on the assumption that the byte sequences representing characters can be interpreted independent of the context in which they appear.
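The following C sketch illustrates both requirements with a short ISO-2022-JP byte string. It assumes that ESC $ B designates JIS X 0208 and that the two-byte code 0x25 0x22 is KATAKANA LETTER A in that set; the names iso2022jp_sample and contains_byte are hypothetical. A byte-oriented search, of the kind used throughout POSIX APIs, finds a '%' and a '"' in the string even though the text contains neither character.

```c
#include <string.h>

/* ESC $ B switches G0 to JIS X 0208, ESC ( B switches back to ASCII.
 * In between, the bytes 0x25 0x22 together encode one katakana letter
 * (assumption stated in the text above), not the ASCII pair '%' '"'. */
static const char iso2022jp_sample[] = "\x1b$B\x25\x22\x1b(B";

/* Naive byte-oriented search that knows nothing about shift state. */
static int contains_byte(const char *s, char c)
{
    return strchr(s, c) != NULL;
}
```

Passing such a string to printf as a format string, or to a path-handling function, would misinterpret the bytes in the same way.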

Even though some ISO 2022 based encodings have become popular transfer syntaxes for sending RFC 822 email messages in East Asia, this does not make any of these encodings suitable for use in POSIX locales. If a POSIX locale is selected, this means that the encoding can be used everywhere where ASCII is used, and this includes filenames and source code.

There are only 19 encodings currently used worldwide as legitimate POSIX multi-byte locale encodings:

UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII, WINDOWS-1251, WINDOWS-1256

Each of these is fully round-trip compatible with ISO 10646, therefore all these locales can be represented nicely in wchar_t as the equivalent UCS values. The above names and the corresponding defining documents are listed in the IANA charset registry.

Full ISO 2022 round-trip compatibility

After all these health warnings, now for the actual proposal:

If, against all common sense, someone seriously wanted to use something like ISO-2022-JP2 or any other ISO 2022 based 7-bit encoding in a POSIX locale, then the following scheme demonstrates an elegant way of keeping wchar_t in UCS while maintaining full round-trip compatibility with ISO 2022.

In this scheme the value of a wchar_t variable can either be a UCS value (supported are only UCS values up to U-07FFFFFF, but that is still far more than UTF-16 can handle), or it can be a UCS value (up to U-0003FFFF) annotated with a compact form of the ISO 2022 designator sequence that specified the encoding of this character in the original ISO 2022 representation.

A wchar_t value with ISO 2022 annotation has the form N * 0x08000000 + I * 0x02000000 + F * 0x00040000 + U, where N identifies the first half of the designating ESC sequence, I indicates an optional additional intermediate byte, F is the 7-bit final byte of the ESC sequence, and U is the UCS code of the represented character.

A bit representation of the layout of a wchar_t value in an ISO 2022 locale looks like this:

Layout of wchar_t
`0`	`0000`	`uu`	`uuuuuuu`	`uuuuuuuuuuuuuuuuuu`
`0`	`nnnn`	`ii`	`fffffff`	`uuuuuuuuuuuuuuuuuu`
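As a sketch of the layout above, the fields can be packed and extracted with shifts and masks. The function names below are hypothetical, and uint32_t is used instead of wchar_t only to keep the example independent of the platform's wchar_t width.

```c
#include <stdint.h>

/* Field positions taken from the layout: U = 18 bits, F = 7 bits,
 * I = 2 bits, N = 4 bits, top bit always 0. */
static uint32_t iso2022_wc_pack(unsigned n, unsigned i, unsigned f, uint32_t u)
{
    return ((uint32_t)(n & 0xFu) << 27) |   /* N * 0x08000000 */
           ((uint32_t)(i & 0x3u) << 25) |   /* I * 0x02000000 */
           ((uint32_t)(f & 0x7Fu) << 18) |  /* F * 0x00040000 */
           (u & 0x3FFFFu);                  /* U, at most U-0003FFFF */
}

static unsigned iso2022_wc_n(uint32_t wc) { return (wc >> 27) & 0xFu; }
static unsigned iso2022_wc_i(uint32_t wc) { return (wc >> 25) & 0x3u; }
static unsigned iso2022_wc_f(uint32_t wc) { return (wc >> 18) & 0x7Fu; }
static uint32_t iso2022_wc_u(uint32_t wc) { return wc & 0x3FFFFu; }
```

For example, N = 8, I = 0, F = 0x42 and U = 0x3042 pack to 0x41083042.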

For N=0, no ISO 2022 information is explicitly present and the remaining 27 bits contain just a UCS value. The other possible values of N denote different types of ISO 2022 designator sequences and are given in the following table:

nnnn	Acronym	Designator function	ESC sequence
0001	GZD4	G0-DESIGNATE 94-SET	ESC 02/08
0010	G1D4	G1-DESIGNATE 94-SET	ESC 02/09
0011	G2D4	G2-DESIGNATE 94-SET	ESC 02/10
0100	G3D4	G3-DESIGNATE 94-SET	ESC 02/11
0101	G1D6	G1-DESIGNATE 96-SET	ESC 02/13
0110	G2D6	G2-DESIGNATE 96-SET	ESC 02/14
0111	G3D6	G3-DESIGNATE 96-SET	ESC 02/15
1000	GZDM4	G0-DESIGNATE MULTIBYTE 94-SET	ESC 02/04 (02/08)
1001	G1DM4	G1-DESIGNATE MULTIBYTE 94-SET	ESC 02/04 02/09
1010	G2DM4	G2-DESIGNATE MULTIBYTE 94-SET	ESC 02/04 02/10
1011	G3DM4	G3-DESIGNATE MULTIBYTE 94-SET	ESC 02/04 02/11
1100	G1DM6	G1-DESIGNATE MULTIBYTE 96-SET	ESC 02/04 02/13
1101	G2DM6	G2-DESIGNATE MULTIBYTE 96-SET	ESC 02/04 02/14
1110	G3DM6	G3-DESIGNATE MULTIBYTE 96-SET	ESC 02/04 02/15
1111	DOCS	DESIGNATE OTHER CODING SYSTEM	ESC 02/05

[Note that for N=8 (GZDM4) and F=0x40..0x42, the second intermediate byte 02/08 is not used, as explained in ECMA-35, section 14.3.2.]

In addition to the designator type, we also have to encode the final byte, and, if present, any additional intermediate byte that could be used to extend the range of final bytes. This information is encoded in fields F and I. The values of F (the final byte of the designator sequence) can be in the range 0x30 to 0x7e. The possible values of I (the additional intermediate byte) are given in the following table:

ii	Additional byte
00	none
01	02/01
10	02/02
11	02/03
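The two tables above can be turned back into a designator sequence roughly as follows. This is a sketch under stated assumptions: the function name iso2022_designator is hypothetical, the additional intermediate byte is placed directly before the final byte (my reading of ECMA-35), and the 02/08 omission for GZDM4 is applied only when no additional intermediate byte is present.

```c
#include <stddef.h>

/* Writes the ESC sequence for fields n, i, f into out (room for 5 bytes).
 * Returns the number of bytes written, or 0 for N=0 or invalid fields. */
static size_t iso2022_designator(unsigned n, unsigned i, unsigned f,
                                 unsigned char out[5])
{
    /* Designating intermediate byte for N = 1..15, from the table above. */
    static const unsigned char designate[15] = {
        0x28, 0x29, 0x2A, 0x2B,   /* GZD4, G1D4, G2D4, G3D4     */
        0x2D, 0x2E, 0x2F,         /* G1D6, G2D6, G3D6           */
        0x28, 0x29, 0x2A, 0x2B,   /* GZDM4, G1DM4, G2DM4, G3DM4 */
        0x2D, 0x2E, 0x2F,         /* G1DM6, G2DM6, G3DM6        */
        0x25                      /* DOCS                       */
    };
    size_t len = 0;

    if (n < 1 || n > 15 || i > 3 || f < 0x30 || f > 0x7E)
        return 0;
    out[len++] = 0x1B;                        /* ESC */
    if (n >= 8 && n <= 14)
        out[len++] = 0x24;                    /* 02/04: multibyte set */
    /* Assumption: GZDM4 with F = 0x40..0x42 and no extra byte omits 02/08. */
    if (!(n == 8 && i == 0 && f >= 0x40 && f <= 0x42))
        out[len++] = designate[n - 1];
    if (i != 0)
        out[len++] = (unsigned char)(0x20 + i);  /* 02/01 .. 02/03 */
    out[len++] = (unsigned char)f;            /* final byte */
    return len;
}
```

With these assumptions, N=8 and F=0x42 yield ESC $ B, while N=1 and F=0x42 yield ESC ( B.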

Note that with this encoding, it is quite easy to extend UCS functions for character classification, case mapping, sorting, etc. Just check whether the wchar_t value is at least 0x08000000 and, if so, mask off the upper bits so that only the least significant 18 bits remain, and forward these to the UCS function. This allows the implementation of ISO 2022 locales without a need for any extra processing tables beyond those needed anyway for UCS locales.
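A minimal sketch of such a wrapper is shown here for iswalpha. The names iso2022_wc_to_ucs and iso2022_iswalpha are hypothetical, and the sketch assumes that the platform's wide character functions operate on UCS values.

```c
#include <stdint.h>
#include <wctype.h>

/* Strip the ISO 2022 annotation, if any, leaving the plain UCS value. */
static uint32_t iso2022_wc_to_ucs(uint32_t wc)
{
    /* Any value with N != 0 is at least 0x08000000. */
    return (wc >= 0x08000000u) ? (wc & 0x3FFFFu) : wc;
}

/* Classification simply forwards the UCS part to the existing function. */
static int iso2022_iswalpha(uint32_t wc)
{
    return iswalpha((wint_t)iso2022_wc_to_ucs(wc));
}
```

The same pattern would apply to towupper, towlower and the other classification functions.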

In the interest of better compatibility, it can be beneficial to encode at least some characters (perhaps as many characters as possible) in an ISO 2022 locale without extra ISO 2022 information, such that true UCS values are used. This definitely should be done for ASCII, but could be extended to some of the other encodings, as long as there is no overlap.

The above proposal has the advantage that the full encoding information is preserved and, when the character sequence is converted back to ISO 2022, the exact same byte sequences, G0/G1/G2/G3 positions and designator sequences will be used for representing the characters. Only the position of designator sequences and shift functions, and the nature of the shift functions, are not necessarily preserved exactly, but the results will be semantically identical.

Practical ISO 2022 round-trip compatibility

If exact preservation of the ISO 2022 representation is not the actual requirement, and full compatibility with Unicode/UCS and UTF-16 is more important, there is a second solution. During ISO 2022 to UCS conversion, insert the Unicode 3.1 Language Tags in order to preserve the information about which ideographic unification group or encoding standard a character came from:

Plane 14 tag	IRG Source	Countries	Standards
zh	G	China, Hong Kong, Singapore	GB2312, GB12345, GB7589, GB7590, GB8565, GB16500
zh-tw	T	Taiwan	CNS 11643
ja	J	Japan	JIS X 0208, JIS X 0212
ko	K	Korea	KS C 5601, KS C 5657, PKS C 5700
vi	V	Vietnam	TCVN 5773, TCVN 6056
ru	n/a	Russia	ISO 8859-5
el	n/a	Greece	ISO 8859-7

This is the information carried in ISO 2022 that some East Asian users are actually interested in when they are concerned about information loss during conversion to UCS. It allows the selection of font styles and display widths such that the exact display conventions of ISO 2022 software such as MULE or kterm can be preserved in UCS plaintext. The downside of this encoding is that it introduces state again, with all its associated problems, but that problem existed already with ISO 2022 as well.
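A converter following this second approach would emit a Plane 14 tag whenever the source encoding changes. The sketch below builds such a tag from one of the language codes in the table: U+E0001 LANGUAGE TAG followed by one tag character (U+E0000 plus the ASCII value) per letter. The function name plane14_language_tag is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes U+E0001 followed by the tag characters for lang (e.g. "ja")
 * into out. Returns the number of code points written, or 0 if out is
 * too small or lang contains a byte outside printable ASCII. */
static size_t plane14_language_tag(const char *lang, uint32_t *out, size_t cap)
{
    size_t len = 0;

    if (cap == 0)
        return 0;
    out[len++] = 0xE0001u;                /* LANGUAGE TAG */
    for (; *lang != '\0'; lang++) {
        unsigned char c = (unsigned char)*lang;
        if (c < 0x20 || c > 0x7E || len >= cap)
            return 0;
        out[len++] = 0xE0000u + c;        /* tag character for c */
    }
    return len;
}
```

For "ja" this produces the three code points U+E0001, U+E006A, U+E0061, which the converter would insert before the first character taken from JIS X 0208 or JIS X 0212.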

Markus Kuhn

created 2001-06-16 – last modified 2001-06-26 – http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html