Set Character Width proposal (version 3)

by Markus Kuhn

This proposal adds a new control sequence to those defined in the ISO 6429 (= ECMA-48) standard, to allow applications to specify exactly which ISO 10646 character sequences shall be displayed as non-spacing, single-width or double-width characters or ligatures (as needed for ideographic languages).

Note: This proposal is still EXPERIMENTAL. Potential users may want to consult the author before implementing it. This proposal is discussed publicly on the linux-utf8 mailing list. An alternative proposal to add width-selector characters to ISO 10646 can be found at the end.

Introduction

International standard ISO 6429 (or equivalently ECMA-48) specifies the semantics of control functions as they are widely implemented in DEC VT100 compatible terminals and emulators. With the introduction of ISO 10646 and UTF-8 support in these products, the question arises of which UCS character shall advance the cursor by how many character-cell widths. The ECMA-48 standard clarifies in Annex D item 12 that the width of a character, and whether it is fixed or variable, is left implementation dependent by that standard.

However, for interoperability of terminal applications, it is essential that the host can predict or control what cursor motion the output of any string will result in. There is currently no widely deployed standard UTF-8 terminal that could serve as an implementation reference. Implementation practice is just emerging. The experimental UTF-8 support in XFree86 xterm has led to repeated controversy on the best split of Unicode characters into single-width and double-width characters, especially between Asian and European users. The Unicode Consortium acknowledges the problem in the form of Unicode Standard Annex (UAX) #11: East Asian Width, but it does not specify exact terminal semantics either. UAX #11 just assigns each character to one of five width classes to document the ambiguous usage in legacy encodings.

Even though Asian standards such as JIS X 0208 do not specify anything with regard to the width of these characters, it has become a long-standing implementation tradition in Asian terminals that characters from single-byte sets such as US-ASCII, ISO 8859 and JIS X 0201 (modified ASCII plus half-width Katakana) are displayed in a single cell, whereas all characters from double-byte character sets such as JIS X 0208 or GB 2312 are displayed in a pair of cells. This way, at least for some simple encodings, the byte count of a string remains identical to the number of cell columns through which the cursor advances, which no doubt significantly simplifies the adaptation of old ASCII software to handle encodings such as EUC-JP. The downside is that non-ideographic double-byte characters, such as the Greek and Cyrillic alphabets present in JIS X 0208, will be displayed as double-width characters, which even in the eyes of Japanese users is typographically inappropriate, though customary.

The proposed wcwidth standard convention, implemented experimentally in XFree86 xterm and GNU glibc 2.2.2 for UTF-8 locales, is based on the European practice of single-width display of all characters and uses a double-width font only for those characters (Han ideographs, Hiragana, Katakana, ideographic punctuation, full-width ASCII, etc.) for which UAX #11 defines the East Asian Full-width (F) or East Asian Wide (W) category. These characters are either explicit full-width duplicates of characters that are already encoded elsewhere in Unicode as normal characters (F) or they are characters that occur only in the context of East Asian typography where they are natural candidates for square bounding boxes (W). This convention works well for European users who want to include readable Asian text in their documents. It also works reasonably well for Asian users who are not bound by backwards compatibility to existing implementations (such as not only kterm and other EUC terminal emulators, but also numerous embedded devices such as mobile phones and PDAs). However, this convention was greeted with great skepticism by, in particular, Japanese users, who look back on a huge body of existing formatted plaintext, application software and typographic usage habits, in which almost anything is double width.

In this light, it seems inevitable that users will need UTF-8 terminal emulators with tightly controllable width behaviour. Furthermore, there are even some European users who favour the use of double-width glyphs for some selected European characters such as EM DASH or TRIPLE INTEGRAL, which cannot be represented adequately in a single cell shaped for a Latin letter. Similarly, some characters, in particular ligatures for the Arabic and Indic scripts, are likely to be far better represented by double-width glyphs. The current proposed wcwidth standard is a good start, but might not be the final solution. It is therefore essential that UTF-8 terminals allow users to override the default width of a glyph. We need a mechanism that allows tty filters to patch the character width of any UTF-8 fixed-width terminal after the fact, by inserting overriding width-control sequences.

The current XFree86 xterm and GNU glibc implementations are rather flexible and could easily accommodate future character width conventions or trends. For the X11 charcell fonts, the choice of which character is single or double width has been separated from the font. Both a single- and a double-width font are loaded and then the X client software can decide which glyph it picks for each character. The C library offers a wcwidth() function to determine the width of a character, but leaves the actual value defined in a locale specification file, which could be fairly easily changed by the user.

The C library is a useful arbiter of character width only if the terminal emulator uses a compatible C library and the same locale environment-variable settings as the application. With remote terminals, some form of in-band signaling is required to override any default character width conventions that the terminal has built in. This proposal suggests such an ESC sequence.

Specification

We add a single new control function to those already specified in ECMA-48:

SCW - SET CHARACTER WIDTH

Notation: (Pn;Ps)
Representation: CSI Pn;Ps 07/07

Parameter default value: Pn = infinity; Ps = 2

SCW is used to specify the width (number of cursor positions) of the following n graphical characters, where n equals the value of Pn if Pn is not empty, otherwise n equals infinity. The display width of these characters depends on the value of Ps:

0
This parameter value causes the next n graphical characters to occupy zero cursor positions. The implementation can accomplish this by combining each of these characters with the one preceding it in writing direction into a ligature of the same width as that in which the preceding character was originally written. The implementation can also accomplish this by applying the character as a combining character to the preceding character or, if that is not supported by the font either, by ignoring the character entirely.
1
This parameter value causes the next n graphical characters to occupy one single cursor position each. If no suitable single-width glyph is available, the cursor still has to be advanced by one position per received character, and some default glyph should be written instead.
2
This parameter value causes the next n graphical characters to occupy two cursor positions each. If no suitable double-width glyph is available, the cursor still has to be advanced by two positions per received character, and some default glyph should be written instead.
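
As a minimal sketch of the representation above, the following Python helper builds an SCW sequence from its two parameters. The function name scw and the choice of the 7-bit form ESC [ for CSI are assumptions made for illustration and are not part of this proposal. Omitted parameters are sent as empty strings, so that the receiving terminal applies the defaults (Pn = infinity, Ps = 2).

```python
CSI = "\x1b["  # assumed 7-bit representation of CSI (ESC followed by "[")


def scw(n=None, width=None):
    """Return the string "CSI Pn;Ps w" for the given parameters.

    n     -- number of following graphical characters; None leaves Pn empty
             (the terminal then treats it as infinity)
    width -- 0, 1 or 2 cursor positions; None leaves Ps empty
             (the terminal then treats it as 2)
    """
    if n is not None and n < 0:
        raise ValueError("Pn must be a non-negative integer")
    if width is not None and width not in (0, 1, 2):
        raise ValueError("Ps must be 0, 1 or 2")
    pn = "" if n is None else str(n)
    if width is None:
        params = pn                # e.g. "CSI w" or "CSI 1w"
    else:
        params = f"{pn};{width}"   # e.g. "CSI 1;0w" or "CSI ;1w"
    return f"{CSI}{params}w"
```

With this helper, scw(1, 0) + "i" would ask the terminal to render the next character "i" with zero width, while scw(0) yields CSI 0w.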

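The effect of an SCW control function lasts only until n graphical characters have been received or until any other control function, including another SCW, is received, whichever happens first.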

Note: The definition implies that CSI 03/00 07/07 can be used to terminate the effect of any previous SCW.

SCW shall not affect the joining or directional behaviour of characters in any way.

Note: The cursor position after reception of a malformed UTF-8 sequence is not defined by this proposal. Treating either any malformed UTF-8 sequence or each byte of a malformed UTF-8 sequence like a U+FFFD character seems to be commonly acceptable practice.

The following default behaviour is suggested for UTF-8 terminal emulators when no SCW control function is in effect: Non-spacing and enclosing combining characters (general category code Mn or Me in the Unicode database) have a column width of 0. Other format characters (general category code Cf in the Unicode database) and ZERO WIDTH SPACE (U+200B) have a column width of 0. Hangul Jamo medial vowels and final consonants (U+1160-U+11FF) have a column width of 0. Spacing characters in the East Asian Wide (W) or East Asian Full-width (F) category as defined in Unicode Standard Annex #11 have a column width of 2. All remaining characters (including all printable ISO 8859-1 and WGL4 characters, Unicode control characters, etc.) have a column width of 1.
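
The default rules above can be sketched in a few lines of Python using the standard unicodedata module. This is only an illustration under stated assumptions: the names default_width and default_string_width are made up for this example, and the module reflects whichever Unicode version ships with the interpreter, so individual characters may be classified differently than in the tables current when this proposal was written.

```python
import unicodedata


def default_width(ch):
    """Suggested column width of one character when no SCW is in effect."""
    cp = ord(ch)
    # Non-spacing and enclosing combining marks, format characters,
    # and ZERO WIDTH SPACE occupy no column.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf") or cp == 0x200B:
        return 0
    # Hangul Jamo medial vowels and final consonants.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    # East Asian Wide (W) and Full-width (F) characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else, including control characters, counts as one column.
    return 1


def default_string_width(text):
    """Sum of the default widths of all characters in text."""
    return sum(default_width(ch) for ch in text)
```

The order of the tests matters: combining marks that also carry an East Asian Wide property are caught by the first rule and therefore stay at width 0, which matches the restriction of the fourth rule to spacing characters.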

Note: The control sequence "CSI ... w" was already used by DEC for DECSHORP in some long obsolete DEC printers (LA100, LN03, DEClaser2100), as were all other private/experimental sequences without intermediate byte, so one could argue that we should settle for "CSI ... !w" (CSI ... 02/01 07/07) instead, which seems still unused elsewhere. But on the other hand, it is highly unlikely that someone will want to use SCW and DECSHORP in the same implementation, and without a registration procedure, collisions of private sequences can never be avoided completely.

Note: The original VT100 did not distinguish between zero and default parameters, but that is not part of ECMA-48 and following that practice would restrict SCW in unpleasant ways. The control sequence parsers for a few existing terminal emulators might have to be slightly revised for better ECMA-48 compliance before SCW can be implemented as described, but that is not necessarily a bad thing.
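
To illustrate the distinction between an empty and a zero parameter, here is a hedged Python sketch of a parameter parser for SCW. It assumes that the caller has already isolated the parameter string between CSI and the final byte "w"; the names parse_scw_params and _field are hypothetical and not taken from any existing terminal emulator.

```python
import math


def _field(text, default):
    """Convert one parameter field, keeping empty distinct from "0"."""
    if text == "":
        return default
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"invalid parameter: {text!r}")
    return int(text)


def parse_scw_params(params):
    """Parse the parameter string of an SCW sequence into (n, width).

    An empty Pn means infinity and an empty Ps means 2. An explicit "0"
    stays zero, unlike on the original VT100.
    """
    fields = params.split(";") if params else []
    if len(fields) > 2:
        raise ValueError("SCW takes at most two parameters")
    fields += [""] * (2 - len(fields))   # pad missing fields with empty ones
    n = _field(fields[0], math.inf)
    width = _field(fields[1], 2)
    if width not in (0, 1, 2):
        raise ValueError("Ps must be 0, 1 or 2")
    return n, width
```

A parser that folded "0" into the default would read CSI 0w as "double width for all following characters" rather than as a request to end the effect of a previous SCW, which is the restriction the note above refers to.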

Examples

The sequence

  CSI 1w ABCD

will force the following character "A" to occupy two cursor positions, but it will not affect any further characters. The sequences

  CSI 2w ABCD
  CSI w AB CSI 0w CD

will both force the "A" and "B" to be double width and any subsequent characters remain unaffected. The sequences

  CSI 1w f CSI 1;0w i
  CSI 1;1w f CSI 1;0w l

cause "fi" to be printed as a single double-width ligature and "fl" as a single single-width ligature, if available. The sequence

  CSI w ABCDEF LF GHIJKL

forces "ABCDEF" to be double width, whereas "GHIJKL" keeps its default width.
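
The examples above can be checked with a small Python simulation of the width bookkeeping in a terminal. This is a sketch under several assumptions that go beyond the text: CSI is taken in its 7-bit form ESC [, the effect of SCW is assumed to end after n graphical characters or when any control character arrives (as the LF example suggests), other CSI sequences are not parsed, and every character is given a default width of 1 unless the caller supplies another default_width function. The name cell_widths is made up for this example. The spaces in the examples above are treated as notation only and are left out of the input strings.

```python
import math
import re

# SCW in the assumed 7-bit form: ESC [ Pn ; Ps w, both parameters optional.
SCW_RE = re.compile(r"\x1b\[([0-9]*)(?:;([0-9]*))?w")


def cell_widths(stream, default_width=lambda ch: 1):
    """Return [(char, width), ...] for the graphical characters in stream."""
    out = []
    remaining, width = 0, None      # no SCW in effect initially
    i = 0
    while i < len(stream):
        m = SCW_RE.match(stream, i)
        if m:
            pn, ps = m.group(1), m.group(2) or ""
            remaining = math.inf if pn == "" else int(pn)
            width = 2 if ps == "" else int(ps)
            if width not in (0, 1, 2):
                raise ValueError("Ps must be 0, 1 or 2")
            i = m.end()
            continue
        ch = stream[i]
        i += 1
        if ord(ch) < 0x20 or ord(ch) == 0x7F:
            remaining = 0           # assumed: a control character ends SCW
            continue
        if remaining > 0:
            out.append((ch, width))
            remaining -= 1
        else:
            out.append((ch, default_width(ch)))
    return out
```

For instance, cell_widths("\x1b[wAB\x1b[0wCD") reports width 2 for "A" and "B" and width 1 for "C" and "D", in line with the second example.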

Alternative proposal: Width selector characters in UCS

Unicode 3.2 introduced 16 new Variation Selector characters (U+FE00 to U+FE0F), which are used like combining characters, but instead of placing an accent onto the previous character, they select an alternative glyph shape of what is still the same character.

The same problem that the Set Character Width (SCW) control function proposed above aims to solve could also be solved elegantly with the introduction of three new special variation-selector characters into Unicode. These would explicitly choose a zero-width, single-width, or double-width form of the preceding Unicode character, in order to remove ambiguity in the communication with character-cell terminal emulators.

These new variation selectors could be called something like

  ZERO-WIDTH SELECTOR
  SINGLE-WIDTH SELECTOR
  DOUBLE-WIDTH SELECTOR

and would be used particularly when converting from CJK legacy encodings to Unicode, in order to preserve exactly the width information of characters in the EastAsianWidth Ambiguous class. The Unicode Character Database already uses an equivalent of these width selector characters in the form of the <narrow> and <wide> compatibility formatting tags.

Beyond that, they can also be used by applications to control the width of other characters where it is not entirely obvious which spacing a terminal implementor might have chosen. This is in particular the case for characters for which it is difficult to find an adequate single-width glyph, such as VOLUME INTEGRAL or EM DASH.

For example EM DASH followed by DOUBLE-WIDTH SELECTOR would lead to a typographically far more adequate width of this character, or HAIR SPACE followed by ZERO-WIDTH SELECTOR could be used to ensure that this character is definitely not advancing the cursor, while still preserving it for cut-and-paste operations.
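
The two examples above can be sketched in Python as follows. No code points exist for these selectors, so the sketch borrows three Private Use Area characters (U+E000 to U+E002) purely as hypothetical stand-ins. The function name widths_with_selectors and the default width of 1 for every character are likewise assumptions made for illustration.

```python
# Hypothetical stand-ins only: the proposal assigns no code points, so three
# Private Use Area characters play the role of the width selectors here.
ZERO_WIDTH_SELECTOR = "\ue000"
SINGLE_WIDTH_SELECTOR = "\ue001"
DOUBLE_WIDTH_SELECTOR = "\ue002"

SELECTOR_WIDTH = {
    ZERO_WIDTH_SELECTOR: 0,
    SINGLE_WIDTH_SELECTOR: 1,
    DOUBLE_WIDTH_SELECTOR: 2,
}


def widths_with_selectors(text, default_width=lambda ch: 1):
    """Return [(char, width), ...], letting a selector override the width
    of the character that precedes it."""
    out = []
    for ch in text:
        if ch in SELECTOR_WIDTH:
            if out:
                base, _ = out[-1]
                out[-1] = (base, SELECTOR_WIDTH[ch])
            # A selector with no preceding character is ignored.
        else:
            out.append((ch, default_width(ch)))
    return out
```

Because the selector follows its base character, the width of the last entry is only final once the next character has been seen. The sketch handles this by rewriting the last list element, much as a terminal would have to redraw an already displayed cell.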

The width selector characters are specifically intended for output devices with monospaced or biwidth fonts and can be ignored by implementations that use proportional fonts.

Note: The idea of a ZERO-WIDTH SELECTOR is perhaps somewhat less important and optional, but at least a code position for it could be kept free. The generic ability to make characters invisible might raise security concerns, so its use could be restricted to a selected set of characters such as for example thin/hair space and unsupported control characters, which a terminal emulator legitimately might or might not choose to display in a character cell of its own. Applications could achieve almost the same effect as ZERO-WIDTH SELECTOR by simply not sending the character in question. The difference would only become apparent with non-display terminal operations, such as copy-and-paste.

Note: VT100-style terminals display each individual character as soon as it is received, and arbitrarily large delays can occur between the reception of characters. Therefore, terminal emulation authors might prefer it if variation and width selectors preceded the base character that they modify, rather than followed it, as this would remove the need to overwrite an already displayed character when the following variation selector finally arrives over the communication link. However, Unicode terminals already have to deal with the same problem, in a less severe form, for combining characters.

Special thanks to Tomohiro Kubota, Juliusz Chroboczek, Robert de Bath, and Paul Williams for valuable discussion.

Markus Kuhn

created 2001-05-16 – last modified 2002-02-04 – http://www.cl.cam.ac.uk/~mgk25/ucs/scw-proposal.html