From: Keld Jørn Simonsen
URL: http://std.dkuug.dk/jtc1/sc22/wg15/iso14766/gnp3.wp

ISO TR 14766 - Guidelines for POSIX National Profiles and National Locales

Annex D. Use of ISO/IEC 10646 in POSIX standards

D.1 Introduction and scope

To serve the widest possible audience, POSIX standards should be able to handle the most encompassing character set, and the best candidate for this is the ISO/IEC 10646-1:1993 standard. The following gives guidance on how to accomplish this goal. The field of application covers many areas, such as global organisations interested in a single organisation-wide character set, European government institutions, eastern Asia and many other places.

ISO/IEC 10646-1:1993, the Universal Multiple-Octet Coded Character Set (UCS), provides the capability to encode multi-script text within a single coded character set. However, because UCS is designed to use all code points available, null bytes and the code values of the other ISO/IEC 646:1991 IRV (also known as ASCII) characters, including the code value of the ISO 646 solidus ("/") character, are not protected. This makes the UCS character encoding incompatible with many existing ISO 646 based POSIX operating system implementations. The fact that UCS also uses code points that are used for ISO 6429 control characters introduces further problems for communication and application software. Because of these problems, a POSIX internal encoding is required for the ISO/IEC 10646 coded character set.

In the following, a survey is first given of the possible coded representation forms of UCS and the UCS transformation formats, with their respective characteristics. Then each of the handling areas of POSIX operation (data storage, file names, internal processing, communications, interprocess communication) is analyzed. Finally, guidelines are given for POSIX standards.
A revised TR 10176 with guidelines for support of IS 10646 has been published, and there may be further recommendations in that work of relevance to POSIX.

D.2 UCS coded representation forms and UCS transformation format

D.2.1 POSIX internal encoding

For the POSIX internal encoding, UTF-8 was considered suitable. The objective of UTF-8 is to provide a UCS transformation format that can also be used on historical POSIX operating system file systems in a non-disruptive manner. The UTF-8 transformation format represents both UCS-2 and UCS-4 in a compatible format using multiple-octet coded characters of lengths 1, 2, 3, 4, 5, and 6 octets:

    Octets  Bits  Hex Min   Hex Max   Byte Sequence in Binary
    1        7    00000000  0000007F  0vvvvvvv
    2       11    00000080  000007FF  110vvvvv 10vvvvvv
    3       16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
    4       21    00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    5       26    00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    6       31    04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The UCS value is the concatenation of the v-bits in the multiple-octet encoding, where the v-bits are the 0's and 1's that constitute the UCS value. Thus UTF-8 can handle existing ISO 646 files without change, and every octet in the ISO 646 range (an octet value in the range 0-127) can safely be assumed to represent the normal ISO 646 character.

D.2.2 Other forms of IS 10646

IS 10646 has two forms: UCS-2 and UCS-4, a 16-bit and a 31-bit coded representation of the character set, respectively. IS 10646 is planned to have more characters than can be represented in 64 k code positions, so the general case of UCS-4 needs to be considered.

ISO/IEC 10646-1:1993 had an informative transformation format, UTF-1, which has now been removed from the standard by the amendment ISO/IEC 10646-1 AM4:1996. UTF-8 is aimed at the same purpose, and has more capability.
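The UTF-8 table in D.2.1 can be sketched in code as follows. This is only an illustrative sketch in Python, not part of the TR; the names UTF8_ROWS, utf8_encode and utf8_decode are hypothetical, and the decoder assumes well-formed input apart from the few checks shown (in particular, it does not reject overlong forms).

```python
# Rows of the UTF-8 table: (Hex Max, number of octets, marker bits of the first octet).
UTF8_ROWS = [
    (0x0000007F, 1, 0x00),  # 0vvvvvvv
    (0x000007FF, 2, 0xC0),  # 110vvvvv
    (0x0000FFFF, 3, 0xE0),  # 1110vvvv
    (0x001FFFFF, 4, 0xF0),  # 11110vvv
    (0x03FFFFFF, 5, 0xF8),  # 111110vv
    (0x7FFFFFFF, 6, 0xFC),  # 1111110v
]


def utf8_encode(ucs):
    """Encode one UCS value (0 .. 0x7FFFFFFF) as a UTF-8 octet sequence."""
    if not 0 <= ucs <= 0x7FFFFFFF:
        raise ValueError("UCS value out of the 31-bit range")
    # Pick the first table row whose Hex Max covers the value.
    for hex_max, length, marker in UTF8_ROWS:
        if ucs <= hex_max:
            break
    # Fill the trailing octets (10vvvvvv) from the low-order end, 6 v-bits each.
    trailing = []
    for _ in range(length - 1):
        trailing.append(0x80 | (ucs & 0x3F))
        ucs >>= 6
    # The remaining high-order v-bits go into the first octet.
    return bytes([marker | ucs] + trailing[::-1])


def utf8_decode(octets):
    """Decode one UTF-8 octet sequence back to its UCS value (no overlong check)."""
    lead = octets[0]
    if lead < 0x80:
        length, value = 1, lead
    else:
        # Count the leading 1-bits of the first octet to get the sequence length.
        length, mask = 0, 0x80
        while lead & mask:
            length += 1
            mask >>= 1
        if not 2 <= length <= 6:
            raise ValueError("invalid first octet")
        value = lead & (mask - 1)
    if len(octets) != length:
        raise ValueError("wrong number of octets")
    # Concatenate the v-bits of the trailing octets.
    for octet in octets[1:]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("trailing octet is not of the form 10vvvvvv")
        value = (value << 6) | (octet & 0x3F)
    return value
```

For example, utf8_encode(0x41) yields the single octet 41 hex, unchanged from ISO 646, while utf8_encode(0xE6) yields the two octets C3 A6 hex.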
UTF-8 has been approved as part of UCS via the amendment ISO/IEC 10646-1 AM2:1996. Another transformation format of IS 10646, UTF-16, has also been approved, as ISO/IEC 10646-1 AM1:1996, but it cannot accommodate all of IS 10646 (it accommodates about 1 million characters). It employs techniques like those in UTF-8, with ranges indicating how many octets are required to form one character, but without the added functionality of being backwards compatible with ISO/IEC 646 and ISO/IEC 2022 encodings (which is a functionality of UTF-8).

The most general of the above encodings of IS 10646 is UCS-4. It has the property of being constant-width, which may be easier to handle than the multiple-octet UTF-8. As a file and interchange code it has the problematic properties of using codes in conflict with ISO/IEC 646, ISO/IEC 2022 and ISO/IEC 6429, of depending on the byte ordering (little-endian vs big-endian) of the hosting machine architecture, and of using 4 octets per character. Here UTF-8 is clearly superior for POSIX internal encoding.

UCS-4 may have advantages as an internal processing code, and as an inter-process encoding, for C language widechar-like encodings, but with the ISO/IEC C language amendment (AM1), which gives full support for multibyte coded character sets, that advantage may be diminishing. UTF-8 is equally well defined and capable of representing all IS 10646 characters, and given its strengths in other areas it may well be chosen for internal processing and inter-process communication as well. Internal processing is, in any case, not in the scope of POSIX interfaces.

D.2.3 UCS levelling

IS 10646 has 3 levels of support: level 1 without combining characters, level 2 with combining characters in some scripts, and level 3 with unrestricted use of combining characters. SC22 has, by resolutions from the 1993 Paris plenary, recommended that all SC22 standards be enabled for level 3 data, but that the semantics of combining characters not be addressed currently.
Thus there is no specific SC22 request for further support of levels 2 and 3, but eventually there could be a need for support of these levels. SC22 also recommended use of IS 10646 terminology throughout SC22 standards, and this may require an alignment of current POSIX work, though it is believed that current POSIX work is already well aligned with IS 10646 with respect to terminology.

D.3 Problems in POSIX handling of UCS

There are several challenges presented by UCS which must be dealt with by present implementations of the POSIX operating system.

D.3.1 Data storage

The most significant of these challenges is the encoding scheme used by UCS. More precisely, the challenge is the marrying of the UCS standard with existing programming languages and existing operating systems. Prominent among the operating system UCS handling concerns is the representation of the contents of data in files. An underlying assumption is that there is an absolute requirement to maintain the existing operating system software investments while at the same time taking advantage of the large number of characters provided by UCS.

For UTF-8 the representation of ISO 646 data is exactly the same, and the right-hand side characters of the ISO/IEC 8859 parts will need two octets for representation. For ideographic characters in the BMP, the representation will be three octets. This does not dramatically change the data storage requirements compared with what is consumed today.

D.3.2 File names and internal processing

The UTF-8 transformation format was originally conceived as a file system safe transformation format of UCS, to allow historically ISO 646 based POSIX operating systems to cope with the representation and handling in file names of the large number of characters that can be encoded by UCS.
In addition, from an internal operating system (kernel) viewpoint, this handling of a large character set is only a problem for file names, which are only analyzed for the solidus ("/") delimiter in order to parse a name into filename components. As UTF-8 can represent the full encoding of IS 10646 and is backwards compatible with ISO 646, UTF-8 handling is sufficient for POSIX internal encoding.

D.3.3 Communications

Current ISO POSIX standards do not address communication, but as ISO 6429 control characters are often used in communication, and the UTF-1 transformation format was originally created to avoid control character problems in communication, UTF-1 could have been the choice. As UTF-1 is being removed from UCS and UTF-8 introduced, having the same capabilities with respect to control character problem solving, UTF-8 is the recommended choice in POSIX communication interfaces.

D.3.4 Interprocess communication

Communication between POSIX processes would probably use internal data formats; for example, integers would be transferred in binary form. As it could be recommended that programs internally use a C language widechar style encoding of characters, a UCS-2 or UCS-4 format could be recommended. On the other hand, interprocess communication is often across networks and between heterogeneous systems, and since UCS-2 and UCS-4 are dependent on machine architecture, UTF-8 may be the preferred candidate. UTF-8 would in many cases also be less space-consuming, which may be a significant advantage when using low-capacity network lines.

D.4 Recommendation

According to the above analysis, UTF-8 is the best candidate for POSIX internal encoding of UCS in the areas of data storage, file names and internal operating system (kernel) processing, and communication, where otherwise UCS-2 or UCS-4 would have been used for coded data. Furthermore, UTF-8 is a good candidate for UCS representation in interprocess communication.
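The storage and byte-ordering points of the analysis can be illustrated with a small sketch. It is only an illustration, not part of the TR: the helper name octet_counts is hypothetical, and Python's "utf-16-be" and "utf-32-be"/"utf-32-le" codecs are used as stand-ins for UCS-2 and UCS-4, which is an assumption that holds for the BMP characters used here.

```python
def octet_counts(ch):
    """Return the number of octets one character needs in each representation."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UCS-2": len(ch.encode("utf-16-be")),  # stand-in for UCS-2 (BMP only)
        "UCS-4": len(ch.encode("utf-32-be")),  # stand-in for UCS-4
    }


# One ISO 646 letter, one ISO/IEC 8859-1 right-hand side letter, one BMP ideograph.
iso646_counts = octet_counts("A")
latin1_counts = octet_counts("\u00e6")
ideograph_counts = octet_counts("\u4e2d")

# The UCS-4 octet stream depends on the byte ordering of the writer,
# whereas UTF-8 has a single octet order on every architecture.
ucs4_big_endian = "\u4e2d".encode("utf-32-be")
ucs4_little_endian = "\u4e2d".encode("utf-32-le")
byte_order_matters = ucs4_big_endian != ucs4_little_endian
```

Under these assumptions the ISO 646 letter needs 1 octet in UTF-8 against 2 and 4 in UCS-2 and UCS-4, the ISO/IEC 8859-1 letter needs 2, and the ideograph needs 3.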
It is thus the recommendation to use the UTF-8 transformation format whenever UCS is used in POSIX interfaces. As POSIX interfaces should in principle be coded character set independent, there is no general need to require the use of UTF-8 in POSIX standards, but guidance could be given in rationales.

A specific recommendation is that the portable archive exchange utility "pax" be revised to be able to specifically use UTF-8 for file names, and that the use of UTF-8 be clearly identified.

D.5 Consequences

X/Open has raised a number of problems with the use of ISO/IEC 10646 in POSIX in the document WG15 N621. With the preceding recommendation the problems can be addressed as follows:

- In UTF-8 the repertoire of ASCII is encoded as ASCII (ISO/IEC 646 IRV).
- We know of no codesets with control characters encoded in the full single octet range 0 thru 7F, but many use 0 thru 1F hex and 7F, and some the range 80 thru 9F. UTF-8 has reserved these octet ranges for control characters.
- Zero value octets and octets equating '/' only appear in UTF-8 as representations of the NUL and '/' characters respectively.
- "Combining characters" need not have special processing as per SC22 resolutions, except possibly for a width specification in a locale.
- According to the ISO/IEC 10646 standard there are no equivalences prescribed between sequences of characters with combining characters and some "precomposed" characters, and the SC22 plenary recommendation is that there need not be special handling of this.
- It should not be necessary to process composite sequences in a special way.
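The file name point above (zero value octets and the '/' octet appear only as NUL and '/') can be illustrated with a short sketch. It is not part of the TR: split_path and multi_octet_is_safe are hypothetical helpers, the sample path is made up, and Python's "utf-16-be" codec is used as a stand-in for UCS-2.

```python
def split_path(raw):
    """Split an encoded path name on the solidus octet (2F hex), octet by octet,
    without any knowledge of the character encoding."""
    return [part for part in raw.split(b"/") if part]


def multi_octet_is_safe(ch):
    """True if no octet of a non-ISO 646 character's UTF-8 form is below 80 hex,
    so it can never be mistaken for NUL, '/' or any other ISO 646 character."""
    return all(octet >= 0x80 for octet in ch.encode("utf-8"))


# A made-up path name with Danish and ideographic components.
path = "/home/bl\u00e5b\u00e6r/\u4e2d\u6587.txt"
components = [part.decode("utf-8") for part in split_path(path.encode("utf-8"))]

# Counter-example with UCS-2: U+012F contains the octet 2F hex, and every
# ISO 646 character contains a zero octet, so octet-wise parsing would break.
ucs2_has_solidus_octet = b"/" in "\u012f".encode("utf-16-be")
ucs2_has_zero_octet = b"\x00" in "A".encode("utf-16-be")
```

The design point is that split_path never decodes anything: an ISO 646 based kernel can keep scanning for the 2F hex octet and still split UTF-8 file names correctly.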