Small European Character Sets
-----------------------------

I have recently spent quite some time working out a proposal for two
Unicode/ISO 10646 subsets that are small enough that I hope they will
become widely implemented in Europe and America.

Both are specifically designed for systems where characters are
represented in low-resolution fixed-width fonts. This includes, for
instance, your xterm and Emacs windows under Unix (or, more generally,
VT100 emulators and source code editors), but also applications such
as portable LCD devices (pagers, mobile phones), where it only makes
sense to implement a small subset of Unicode and where no single
8-bit set can cover a reasonable number of languages. These subsets
are not really intended for applications such as the publishing
industry, where these display restrictions do not exist and larger
Unicode subsets or even full implementations might be adequate.

The two subsets are:

  - Very Simple European Character Set (VSECS)

    345 characters, basically the superset of Latin 1-4,9,10,15 and
    CP1252 plus a very few ISO 6937 characters

     Rows   Positions (Cells)

      00    20-7E A0-FF
      01    00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
      02    C6-C7 D8-DD
      20    13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
      21    22 26 5B-5E 90-93
      26    6A
      FF    FD

  - Simple European Character Set (SECS)

    683 characters; in addition to VSECS, it covers Cyrillic, Greek,
    MS-DOS block graphics, and a moderate set of mathematical
    characters that are likely to be used in academic email and
    source code comments.

     Rows   Positions (Cells)

      00    20-7E A0-FF
      01    00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
      02    BC-BD C6-C7 D8-DD
      03    84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
      04    01-0C 0E-4F 51-5C 5E-5F 90-91
      20    13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
      21    02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
      22    00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
      22    6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
      23    00 08-0B 10 15 20-21 29-2A
      25    00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
      25    BA BC C4 CB
      26    10-12 3A-3C 40 42 6A-6B 6D-6F
      27    13 17
      FF    FD

VSECS is somewhat similar to ISO 6937 with some bugs fixed (e.g., the
Euro symbol is included, as are the directed quotation marks).

SECS is somewhat similar to Microsoft/Adobe WGL4. I think SECS is much
better than WGL4, because WGL4 contains many letters for which I could
not find out where they are used (for at least three, I am sure they
never existed). SECS contains the following 91 characters that are not
part of WGL4:

     Rows   Positions (Cells)

      02    BC-BD
      03    D1 D5-D6 F1
      20    34 70 80-83
      21    02 15 1A 1D 24 A4-A7 D0-D5
      22    00-01 03-05 07-09 0B-0C 13 18 1D 24-28 2A 3C 43 45 49 58 5F 62
      22    6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
      23    00 08-0B 15 29-2A
      26    10-12 6D-6F
      27    13 17
      FF    FD

Almost all of these are basic mathematical characters that most high
school students should be familiar with. They are very useful to have
available in academic email discussions and source code comments. It
would be nice if the authors of WGL4 seriously considered extending
their Unicode subset by those few dozen elementary math symbols. Then
SECS would become a subset of WGL4. VSECS is already a subset of WGL4
except for U+FFFD.

The mathematical symbols of SECS will hopefully give US developers who
do not specialize in i18n issues some motivation to get interested in
16-bit character sets, as these symbols are more relevant for their
personal use than the accented characters of crazy Europeans.

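For readers who want to check the row/cell tables above without
installing anything, here is a minimal Python sketch. It is not the
uniset Perl script and makes no claim to match its behaviour; the names
parse_rows, VSECS, SECS and NOT_IN_WGL4 are my own illustrative choices.
It only assumes that each table line is a hexadecimal row number (the
high byte) followed by hexadecimal cells or cell ranges (the low byte).

```python
def parse_rows(table):
    """Expand "row  cell cell-cell ..." lines into a set of code points."""
    code_points = set()
    for line in table.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        row, cells = fields[0], fields[1:]
        base = int(row, 16) << 8          # the row is the high byte
        for cell in cells:
            first, _, last = cell.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo   # single cell or range
            code_points.update(base + c for c in range(lo, hi + 1))
    return code_points


# Tables transcribed from the text above.
VSECS = parse_rows("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 C6-C7 D8-DD
20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
21 22 26 5B-5E 90-93
26 6A
FF FD
""")

SECS = parse_rows("""
00 20-7E A0-FF
01 00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
02 BC-BD C6-C7 D8-DD
03 84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
04 01-0C 0E-4F 51-5C 5E-5F 90-91
20 13-15 17-1A 1C-1E 20-22 26 30 32-34 39-3A 70 7F-83 A7 AC
21 02 15-16 1A 1D 22 24 26 5B-5E 90-95 A4-A7 D0-D5
22 00-09 0B-0C 12-13 18-1A 1D-1E 24-2A 3C 43 45 48-49 58 5F-62 64-65
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 10 15 20-21 29-2A
25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 B2
25 BA BC C4 CB
26 10-12 3A-3C 40 42 6A-6B 6D-6F
27 13 17
FF FD
""")

NOT_IN_WGL4 = parse_rows("""
02 BC-BD
03 D1 D5-D6 F1
20 34 70 80-83
21 02 15 1A 1D 24 A4-A7 D0-D5
22 00-01 03-05 07-09 0B-0C 13 18 1D 24-28 2A 3C 43 45 49 58 5F 62
22 6A-6B 82-8B 95 97 A4-A7 C2-C3 C5
23 00 08-0B 15 29-2A
26 10-12 6D-6F
27 13 17
FF FD
""")

print(len(VSECS), len(SECS), len(NOT_IN_WGL4))   # repertoire sizes
print(VSECS <= SECS)                             # is VSECS a subset of SECS?
print(NOT_IN_WGL4 <= SECS)                       # is the WGL4 gap list within SECS?
print(sorted(f"U+{cp:04X}" for cp in VSECS & NOT_IN_WGL4))
```

If the tables are transcribed correctly, the three sizes should come
out as 345, 683 and 91, both subset tests should hold, and the only
VSECS character in the not-in-WGL4 list should be U+FFFD, which matches
the statements made above.
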
My dream is that something like SECS soon becomes the common minimum
repertoire in Unix X11 fonts and printer fonts. VSECS is intended as an
intermediate step for applications where the size of the character set
is critical and only Latin script support is required.

I do not think SECS contains any useless symbol. I know for each
letter and symbol why it is in there and in which languages or fields
it is used. Just ask.

Much more information on the two sets is available from

  http://www.cl.cam.ac.uk/~mgk25/ucs/vsecs.html
  http://www.cl.cam.ac.uk/~mgk25/ucs/secs.html

Much better than just looking at these web pages is to download the
database (Perl needed) that generated them from

  http://www.cl.cam.ac.uk/~mgk25/ucs/secs.tar.gz

Then you can play around with them and easily test the subset
properties with regard to other sets yourself. If you want to see
example glyphs in the HTML output of this script, then you will also
need

  http://www.cl.cam.ac.uk/~mgk25/ucs/glyphs.zip

The uniset Perl script allows you to comfortably build up your own
database of character collections, to merge and subtract them, to
generate Unicode subsets, and to study their relations with other
subsets. The mapping files from the Unicode Consortium can be used
directly as input.

Please let me know what you think about SECS and VSECS and whether
this is something you would like to see widely implemented.

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain