From: Keld Jørn Simonsen <keld@dkuug.dk>
URL:  http://std.dkuug.dk/jtc1/sc22/wg15/iso14766/gnp3.wp

ISO TR 14766 - Guidelines for POSIX National Profiles and National
Locales

Annex D. Use of ISO/IEC 10646 in POSIX standards

D.1 Introduction and scope

To serve the widest possible audience, POSIX standards should be
able to handle the most encompassing character set, and the best
candidate for this is the ISO/IEC 10646-1:1993 standard. The following
gives guidance on how to accomplish this goal.

The field of application spans many areas, such as global
organisations that want a single character set organisation-wide,
European government institutions, eastern Asia and many other
places.

ISO/IEC 10646-1:1993, the Universal Multiple-Octet Coded Character Set
(UCS), provides the capability to encode multi-script text within a
single coded character set.

However, because UCS is designed to use all code points available,
null bytes and the code values of the other ISO/IEC 646:1991 IRV (also
known as ASCII) characters, including the code value of the ISO 646
solidus ("/") character, are not protected. This makes the UCS
character encoding incompatible with many existing ISO 646 based POSIX
operating system implementations. UCS also uses code points that
coincide with ISO 6429 control characters, which introduces further
problems for communication and application software. Because of these
problems, a POSIX internal encoding is required for
the ISO/IEC 10646 coded character set.

In the following, a survey of the possible coded representation
forms of UCS and the UCS transformation formats, with their respective
characteristics, is given first. Then each of the handling areas (data
storage, file names, internal processing, communications, interprocess
communication) of POSIX operation is analyzed. Finally, guidelines
are given for POSIX standards.

A revised TR 10176 with guidelines for support of IS 10646 has been
published, and there may be further recommendations in this work of
relevance to POSIX.

D.2 UCS coded representation forms and UCS transformation format

D.2.1 POSIX internal encoding

For the POSIX internal encoding UTF-8 was considered suitable.

The objective of UTF-8 is to provide a UCS transformation format
which also meets the requirement of being usable on historical POSIX
operating system file systems in a non-disruptive manner.

The UTF-8 transformation format represents both UCS-2 and UCS-4 in a
compatible format using multiple-octet coded characters of lengths 1,
2, 3, 4, 5, and 6 octets:

Octets Bits Hex Min  Hex Max  Byte Sequence in Binary
1       7   00000000 0000007F 0vvvvvvv
2      11   00000080 000007FF 110vvvvv 10vvvvvv
3      16   00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
4      21   00010000 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
5      26   00200000 03FFFFFF 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
6      31   04000000 7FFFFFFF 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The UCS value is the concatenation of the v-bits in the multiple-octet
encoding, where the v-bits are the 0's and 1's that constitute the UCS
value.
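
The table can be turned into a short encoder. The following Python
sketch is an illustration only and not part of the TR; the function
name utf8_encode is hypothetical. It covers the full 31-bit range of
the table, whereas Python's built-in "utf-8" codec stops at 10FFFF
hex and is therefore only useful as a cross-check for smaller values.

```python
def utf8_encode(ucs):
    """Encode one UCS value (0..0x7FFFFFFF) as UTF-8 octets.

    Hypothetical helper that follows the 1 to 6 octet table directly.
    """
    if not 0 <= ucs <= 0x7FFFFFFF:
        raise ValueError("UCS value outside the 31-bit range")
    if ucs <= 0x7F:
        return bytes([ucs])                  # 0vvvvvvv
    # (largest value, number of octets, lead-octet prefix) for rows 2..6
    rows = [
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ]
    for limit, length, prefix in rows:
        if ucs <= limit:
            break
    octets = []
    for _ in range(length - 1):
        octets.append(0x80 | (ucs & 0x3F))   # trailing octet 10vvvvvv
        ucs >>= 6
    octets.append(prefix | ucs)              # lead octet with remaining v-bits
    return bytes(reversed(octets))


print(utf8_encode(0x41).hex())        # ISO 646 letter A
print(utf8_encode(0xE6).hex())        # LATIN SMALL LETTER AE
print(utf8_encode(0x4E2D).hex())      # a BMP ideograph
print(utf8_encode(0x7FFFFFFF).hex())  # largest value in the table
```

The trailing octets are produced from the low-order end, six v-bits
at a time, and the lead octet receives whatever v-bits remain.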

Thus UTF-8 has the capability of handling existing ISO 646 files
without change, and all codes in the ISO 646 range (having an octet
value in the range 0-127) can be safely assumed to be representing the
normal ISO 646 character.

D.2.2 Other forms of IS 10646

IS 10646 has two forms: UCS-2 and UCS-4, a 16-bit and a 31-bit coded
representation of the character set, respectively. IS 10646 is planned
to have more characters than the 64K that can be represented in 16
bits, so the general case of UCS-4 needs to be considered.

ISO/IEC 10646-1:1993 had a transformation format UTF-1, which was
informative, and it has now been removed from the standard by the
amendment ISO/IEC 10646-1 AM4:1996. UTF-8 is aimed at the same
purpose, and has more capability. UTF-8 has been approved as part of
UCS via the amendment ISO/IEC 10646-1 AM2:1996.

Another transformation format of IS 10646, UTF-16, has also been
approved, as ISO/IEC 10646-1 AM1:1996, but it cannot accommodate all
of IS 10646 (it accommodates about 1 million characters). Like UTF-8,
it uses reserved ranges of code values to indicate how many units are
required to form one character, but without the added functionality of
being backwards compatible with ISO/IEC 646 and ISO/IEC 2022 encodings
(which is a functionality of UTF-8).
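
The figure of about 1 million characters follows from the way UTF-16
pairs two reserved ranges of 1024 code values each. The Python sketch
below illustrates that arithmetic; it is not taken from the TR, the
name utf16_units is hypothetical, and the range boundaries (D800 to
DBFF and DC00 to DFFF hex) come from the UTF-16 definition rather
than from this annex.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00   # two reserved ranges of 1024 values
PAIR_CAPACITY = 1024 * 1024              # values reachable through pairs


def utf16_units(ucs):
    """Return the 16-bit units that represent one UCS value in UTF-16."""
    if 0xD800 <= ucs <= 0xDFFF:
        raise ValueError("value lies inside the reserved pairing ranges")
    if ucs < 0x10000:
        return [ucs]                     # BMP value: one 16-bit unit
    offset = ucs - 0x10000
    if offset >= PAIR_CAPACITY:
        raise ValueError("value cannot be represented in UTF-16")
    # 10 high-order bits select the first unit, 10 low-order bits the second
    return [HIGH_START | (offset >> 10), LOW_START | (offset & 0x3FF)]


print(PAIR_CAPACITY)
print([hex(u) for u in utf16_units(0x10000)])
print([hex(u) for u in utf16_units(0x10FFFF)])
```

Any UCS-4 value above 10FFFF hex is rejected, which is the sense in
which UTF-16 cannot accommodate all of IS 10646.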

The most general of the above encodings of IS 10646 is UCS-4. It
has the property of being constant-width, which may be easier to
handle than the multiple-octet UTF-8. As a file and interchange
code, however, it has several problematic properties: it uses codes
in conflict with ISO/IEC 646, ISO/IEC 2022 and ISO/IEC 6429, it
depends on the byte ordering (little-endian vs big-endian) of the
hosting machine architecture, and it uses 4 octets per character. Here
UTF-8 is clearly superior for POSIX internal encoding. UCS-4 may have
advantages as an internal processing code, and as an inter-process
encoding, for C language widechar-like encodings, but with the ISO/IEC
C language amendment (AM1) and its full support for multibyte coded
character sets, that advantage may be diminishing. UTF-8 is also well
defined and capable of representing all IS 10646 characters, and given
its strengths in other areas it may well be chosen also for
internal processing and inter-process communication. Internal
processing is not in the scope of POSIX interfaces, anyway.
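
The byte-ordering dependency and the conflict with ISO/IEC 646 code
values can be made concrete with a few octet dumps. This Python
sketch is illustrative only; the helper name ucs4_octets and the two
sample characters are choices made for the example.

```python
def ucs4_octets(ucs, byteorder):
    """Return the 4 octets of one UCS-4 value in the given byte order."""
    return ucs.to_bytes(4, byteorder)


SOLIDUS = 0x2F          # octet value of "/" in ISO 646
LATIN_A = 0x0041        # LATIN CAPITAL LETTER A
I_OGONEK = 0x012F       # LATIN SMALL LETTER I WITH OGONEK

# The same character gives different octet streams on different machines,
# and both streams contain zero-value octets.
print(ucs4_octets(LATIN_A, "big").hex())
print(ucs4_octets(LATIN_A, "little").hex())

# A character that is not the solidus still contains the solidus octet.
print(SOLIDUS in ucs4_octets(I_OGONEK, "big"))

# In UTF-8 the same character uses only octets of 80 hex and above.
print(chr(I_OGONEK).encode("utf-8").hex())
```

Software that scans octets for NUL or "/" would misread both UCS-4
streams, while the UTF-8 form passes through such software unharmed.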

D.2.3 UCS levelling

IS 10646 has 3 levels of support: level 1 without combining
characters, level 2 with combining characters in some scripts, and
level 3 with unrestricted use of combining characters. SC22 has by
resolutions from the 1993 Paris plenary recommended that all SC22
standards be enabled for level 3 data, but that the semantics of
combining characters not be addressed currently. Thus there is no
specific SC22 request for further support of levels 2 and 3, but
eventually there could be a need for support of these levels. SC22
also recommended use of IS 10646 terminology throughout SC22 standards,
and this may require an alignment of current POSIX work, though it is
believed that current POSIX work is already well aligned with IS 10646
with respect to terminology.

D.3 Problems in POSIX handling of UCS

There are several challenges presented by UCS which must be dealt with
by present implementations of the POSIX operating system.

D.3.1 Data storage

The most significant of these challenges is the encoding scheme used
by UCS. More precisely, the challenge is the marrying of the UCS
standard with existing programming languages and existing operating
systems. Prominent among the operating system UCS handling concerns is
the representation of contents of data in files. An underlying
assumption is that there is an absolute requirement to maintain the
existing operating system software investments while at the same time
taking advantage of the large number of characters provided by
UCS.

For UTF-8 the representation of ISO 646 data is exactly the same, and
for ISO/IEC 8859 parts, right hand side characters will need two
octets for representation. For ideographic characters in the BMP, the
representation will be three octets. Data storage requirements
therefore do not change dramatically compared with current usage.
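
The octet counts above can be checked with a small measurement. The
Python sketch below is illustrative only; the helper storage_octets
and the sample strings are hypothetical, and Python's "utf-16-be" and
"utf-32-be" codecs are used as stand-ins for UCS-2 and UCS-4, which
is valid for the BMP characters used here.

```python
def storage_octets(text):
    """Return the octets needed for text in UTF-8, UCS-2 and UCS-4.

    Assumption: text holds BMP characters only, so the "utf-16-be"
    codec produces the same octets as UCS-2.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UCS-2": len(text.encode("utf-16-be")),
        "UCS-4": len(text.encode("utf-32-be")),
    }


samples = [
    ("ISO 646 letter", "A"),
    ("ISO 8859-1 right hand side letter", "\u00e6"),
    ("BMP ideograph", "\u4e2d"),
    ("mixed Danish text", "r\u00f8dgr\u00f8d med fl\u00f8de"),
]
for label, text in samples:
    print(label, storage_octets(text))
```

For text that is mostly ISO 646 with occasional ISO/IEC 8859 right
hand side letters, UTF-8 stays close to one octet per character.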

D.3.2 File names and internal processing

The UTF-8 transformation format was originally conceived as a file
system safe transformation format of UCS, allowing historically ISO 646
based POSIX operating systems to represent and handle in file names
the large number of characters that UCS can encode. In addition, from
an internal operating system (kernel) viewpoint, handling a large
character set is only a problem for file names, which are analyzed
only for the solidus ("/") delimiter to parse a name into filename
components. As UTF-8 can represent the full encoding of IS 10646 and
is backwards compatible with ISO 646, UTF-8 handling is sufficient
for POSIX internal encoding.
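
A minimal sketch of such octet-level pathname parsing follows. It is
written in Python for illustration, is not how any particular kernel
is implemented, and uses a hypothetical helper name and a made-up
pathname. The point is that the parser never decodes the name; it
only looks for the solidus octet.

```python
def split_pathname(pathname):
    """Split a pathname, given as octets, into filename components.

    Only the solidus octet (2F hex) is inspected; every other octet is
    passed through untouched, as an ISO 646 based parser would do.
    """
    return [component for component in pathname.split(b"/") if component]


# Hypothetical pathname with Danish and ideographic filename components.
name = "/home/j\u00f8rn/\u6587\u66f8/\u5831\u544a.txt"
components = split_pathname(name.encode("utf-8"))

for component in components:
    print(component, component.decode("utf-8"))
```

Each component decodes cleanly because no octet of a multi-octet
UTF-8 sequence can be mistaken for the delimiter.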

D.3.3 Communications

Current ISO POSIX standards do not address communication. However,
ISO 6429 control characters are often used in communication, and the
UTF-1 transformation format was originally created to avoid control
character problems in communication, so UTF-1 could have been the
choice. As UTF-1 has been removed from UCS and UTF-8 introduced,
having the same capabilities with respect to control character
problem solving, UTF-8 is the recommended choice in POSIX
communication interfaces.

D.3.4 Interprocess communication

Communication between POSIX processes would probably use internal data
formats; for example, integers would be transferred in binary form. As
programs could be recommended to use a C language widechar style
encoding of characters internally, a UCS-2 or UCS-4 format could
be recommended.

On the other hand, interprocess communication is often across networks
and between heterogeneous systems. Since UCS-2 and UCS-4 are
dependent on machine architecture, UTF-8 may be the preferred
candidate. UTF-8 would in many cases also be less space-consuming,
which may be a significant plus when using low-capacity network lines.

D.4 Recommendation

According to the above analysis, UTF-8 is the best candidate for POSIX
internal encoding of UCS in the areas of data storage, file names and
internal operating system (kernel) processing, and communication,
where otherwise UCS-2 or UCS-4 would have been used for coded data.
Furthermore UTF-8 is a good candidate for UCS representation in
interprocess communication.

It is thus the recommendation to use the UTF-8 transformation format
whenever UCS is used in POSIX interfaces.

As POSIX interfaces in principle should be coded character set
independent, there is no general need to require the use of UTF-8 in
POSIX standards, but guidance could be given in rationales.

A specific recommendation is that the portable archive exchange
utility "pax" be revised to be able to specifically use UTF-8 for file
names, and the use of UTF-8 should be clearly identified.


D.5 Consequences

X/Open has raised a number of problems with use of ISO/IEC 10646 in
POSIX in the document WG15 N621. With the preceding recommendation the
problems can be addressed as follows:

 - In UTF-8 the repertoire of ASCII is encoded as ASCII (ISO/IEC 646 IRV).

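 - We know no codesets with control characters encoded in the full
single octet range 0 through 7F, but many use 0 through 1F hex and 7F,
and some the range 80 through 9F. UTF-8 uses the octets 0 through 1F
hex and 7F only for control characters; octets 80 through 9F occur
only within multi-octet sequences, never as single-octet control
characters.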

 - Zero-value octets and octets equating '/' only appear in UTF-8 as
representations of the NUL and '/' characters respectively.

 - "combining characters" need not have special processing as per SC22
resolutions, except for possibly a width specification in a
locale.

 - According to the ISO/IEC 10646 standard there are no equivalences
prescribed between sequences of characters with combining characters
and some "precomposed" characters, and the SC22 plenary recommendation
is that there need not be special handling of this.

 - It should not be needed to process composite sequences in a special
way.
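
The point above about zero-value octets and '/' can be checked
mechanically. The Python sketch below is an illustration, not part of
the TR; the function name is hypothetical, and the check is limited
to the values Python's "utf-8" codec accepts (up to 10FFFF hex,
skipping D800 to DFFF, which that codec refuses to encode).

```python
def unprotected_code_points(limit=0x110000):
    """Return UCS values whose multi-octet UTF-8 form has an octet below 80 hex.

    An empty result means that NUL, '/' and every other ISO 646 octet
    can only appear as the one-octet encoding of that very character.
    """
    offenders = []
    for ucs in range(0x80, limit):
        if 0xD800 <= ucs <= 0xDFFF:
            continue                      # not encodable by Python's codec
        encoded = chr(ucs).encode("utf-8")
        if any(octet < 0x80 for octet in encoded):
            offenders.append(ucs)
    return offenders


print(len(unprotected_code_points()))
```

The same property holds for the 5 and 6 octet forms, since every
octet of a multi-octet sequence has its high-order bit set.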