
Uniset - a Unicode subset management package
--------------------------------------------

Markus Kuhn <mkuhn@acm.org> -- 1998-08-26


The latest uniset package is available from

  <http://www.cl.cam.ac.uk/~mgk25/ucs/secs.zip>

The files in this package helped me to construct and experiment with
the (Very) Simple European Character Set (VSECS/SECS) subset. This
package contains the Perl script "uniset" which allows you to add and
subtract Unicode subsets, and a Makefile that constructs SECS using
other existing subsets. It also contains a comprehensive set of files
for generating and scrutinizing Unicode subsets in a very comfortable
way.

This package is intended for Unix systems on which the "perl" and
preferably also the "make" commands are installed. I expect, however,
that uniset is also easily usable on any other platform on which Perl
is available, such as Win32 and Mac.

If you do not have a Perl interpreter installed yet, then get it from

  <http://www.perl.com/latest.html>

Whenever you want to regenerate the files that I produced using
uniset, just enter "make" under Unix and the Makefile will be
executed to generate, for instance, the secs.html web page. You
don't have to do this, though: for your convenience, everything is
also included precompiled in the distribution.

The generated HTML pages can contain references to a large number of
GIF images with character glyphs on the Unicode Inc. Web site.
Downloading these glyphs can take quite some time. If you want to
work seriously with uniset using the glyph images in the HTML output,
then I suggest that you download the required images from the Unicode
web server onto your own hard disk.

You can easily get all MES-4 glyph images at once by downloading

  <http://www.cl.cam.ac.uk/~mgk25/ucs/glyphs.zip>

[You could also automatically download all images in the MES-4 subset
(warning: these are almost 3000 files) yourself from the original
Unicode server by entering "./uniset +MES-3 loadimages". The
download function requires the "webcopy" Perl script to be
installed, which is available from
<ftp://ftp.inf.utfsm.cl/pub/utfsm/perl/webcopy.tgz>.]


Let's have a look at a few examples of how you can use uniset:

This package comes with a large number of character subset files. Many
of these are just the mapping files provided by Unicode; others I
generated myself from various available documents. The
filenames usually tell you all about the content, for instance
8859-14.TXT is the ISO/IEC 8859-14 repertoire. See the comments in the
files for more details.

Uniset allows you to add and subtract these repertoires and output
them in various formats. For example:

  ./uniset +MES-1 -SECS table
  0132 # LATIN CAPITAL LIGATURE IJ
  0133 # LATIN SMALL LIGATURE IJ
  013F # LATIN CAPITAL LETTER L WITH MIDDLE DOT
  0140 # LATIN SMALL LETTER L WITH MIDDLE DOT
  0149 # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

Here, uniset takes the MES-1 collection and removes the SECS
collection (both are filenames). The resulting set is printed as a
table with the "table" instruction. Note that on non-Unix systems you
might have to type "perl uniset +MES-1 -SECS table" instead of
"./uniset ..." in case your operating system does not interpret the
first line of the uniset file as an instruction to activate the perl
interpreter.
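
The subtraction above is ordinary set difference on code points. The
following Python sketch is not part of uniset and does not use its
code; it only illustrates the idea under stated assumptions. The
function name parse_table and the two miniature repertoires are
hypothetical, and the parser understands only the "table" style lines
shown above (a hexadecimal code point followed by a "#" comment), not
the other file formats shipped in the package.

```python
import re


def parse_table(text):
    """Return the set of code points listed in 'table' style text."""
    points = set()
    for line in text.splitlines():
        # Keep only the part before the '#' comment.
        field = line.split("#", 1)[0].strip()
        if re.fullmatch(r"[0-9A-Fa-f]{4,6}", field):
            points.add(int(field, 16))
    return points


# Hypothetical miniature repertoires, NOT the real MES-1 or SECS files.
big = parse_table("""\
0041 # LATIN CAPITAL LETTER A
0132 # LATIN CAPITAL LIGATURE IJ
0133 # LATIN SMALL LIGATURE IJ
""")
small = parse_table("0041 # LATIN CAPITAL LETTER A\n")

# "+big -small" corresponds to the set difference big - small.
difference = sorted(big - small)
for cp in difference:
    print(f"{cp:04X}")
```

Adding a repertoire ("+NAME") would correspond to a set union in the
same way.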

Type

  perl uniset

or

  ./uniset

to get a help page displayed that documents all functions of uniset.

If you prefer to get HTML output, just add "html" as the first word to
the command line, or "img" if you want to have HTML with glyph images
in the table, as in:

  ./uniset img +MES-1 -SECS table >deprecated-mes1-chars.html

Uniset can also generate the compact table format of a Unicode subset
that is used in the work of the CEN/TC304 P10 committee. Use the
"compact" command to do this; the "nr" command adds a line that
tells you the size of the set. For example, to get SECS in this
format, use

  ./uniset +SECS compact nr

  # Plane 00
  # Rows  Positions (Cells)

    00    20-7E A0-FF
    01    00-13 16-2B 2E-31 34-3E 41-48 4A-4D 50-7E 92
    02    BC-BD C6-C7 D8-DD
    03    84-86 88-8A 8C 8E-A1 A3-CE D1 D5-D6 F1
  ...
    25    BA BC C4 CB
    26    10-12 3A-3C 40 42 6A-6B 6D-6F
    27    13 17
    FF    FD

  # Number of characters in above table: 705
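
To make the layout of the compact table concrete: each line names a
row (the high byte of the code point) followed by the occupied cells
(the low byte), with runs of consecutive cells collapsed into ranges.
The Python sketch below reproduces that grouping for a handful of
code points. It is an illustration only, not uniset's implementation;
the function name compact_rows is hypothetical, and it assumes all
code points lie in plane 00 (U+0000 to U+FFFF).

```python
from itertools import groupby


def compact_rows(code_points):
    """Map each row (high byte) to a list of cell ranges like '20-7E'."""
    rows = {}
    for cp in sorted(set(code_points)):
        rows.setdefault(cp >> 8, []).append(cp & 0xFF)

    result = {}
    for row, cells in rows.items():
        ranges = []
        # Within a run of consecutive cells, (cell - index) is constant.
        for _, group in groupby(enumerate(cells), key=lambda p: p[1] - p[0]):
            run = [cell for _, cell in group]
            if len(run) == 1:
                ranges.append(f"{run[0]:02X}")
            else:
                ranges.append(f"{run[0]:02X}-{run[-1]:02X}")
        result[row] = ranges
    return result


# A few code points taken from rows 02, 27 and FF of the table above.
sample = [0x02BC, 0x02BD, 0x02C6, 0x02C7, *range(0x02D8, 0x02DE),
          0x2713, 0x2717, 0xFFFD]
table = compact_rows(sample)
for row, ranges in table.items():
    print(f"  {row:02X}    {' '.join(ranges)}")
print(f"# Number of characters in above table: {len(set(sample))}")
```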

All non-HTML output of uniset is suitable to be read in again as a
repertoire file. You can therefore use uniset to comfortably build up
and manipulate your own database of character set collections.
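
As a rough illustration of why the compact format can be read back
in, here is a small Python sketch that turns such a table into a set
of code points again. It is not taken from uniset; the function name
parse_compact is hypothetical, and the sketch assumes plane 00 only,
treats everything after "#" as a comment, and skips lines that do not
contain a row followed by at least one cell entry.

```python
def parse_compact(text):
    """Return the set of code points described by a compact table."""
    points = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or an elision such as "..."
        row = int(fields[0], 16)
        for item in fields[1:]:
            first, _, last = item.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo
            points.update((row << 8) | cell for cell in range(lo, hi + 1))
    return points


# Excerpt in the style of the compact output shown earlier.
excerpt = """\
# Plane 00
# Rows  Positions (Cells)

  02    BC-BD C6-C7 D8-DD
  27    13 17
  FF    FD
"""
points = parse_compact(excerpt)
print(len(points))
```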

*** Have fun ***

Markus

-- 
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>
