
Uniset - a Unicode subset management package
--------------------------------------------

Markus Kuhn -- https://www.cl.cam.ac.uk/~mgk25/


The latest uniset package is available from

  https://www.cl.cam.ac.uk/~mgk25/download/uniset.tar.gz
  https://github.com/mgkuhn/uniset

This package contains the Perl script "uniset", which allows you to
join and subtract Unicode subsets and present the result in various
forms. It also contains a comprehensive set of files for generating
and scrutinizing Unicode subsets.

This package is intended for Unix systems on which Perl 5.14 or newer,
and preferably also the "make" command, is installed. I'd expect it to
also be usable on any other platform where Perl is installed.

You can use uniset simply in the directory where you unpacked the
tarball, or you can install it with

  make install

which puts it under /usr/local/bin, or you can provide your own prefix, e.g.

  make install DESTDIR=$HOME/local


Let's have a look at a few examples of how you can use uniset.

This package comes with a large number of character subset files. Many
of these are just the mapping files provided by Unicode, others have
been generated by myself from various available documents. The
filenames usually tell you all about the content, for instance
8859-14.TXT is the ISO/IEC 8859-14 repertoire. See the comments in the
files for more details.

Uniset allows you to add and subtract these repertoires and output
them in various formats. For example, if you want to know which
characters are in ISO 8859-15 but not in ISO 8859-1, then just
add 8859-15, subtract 8859-1 from it, and display the result as an
ASCII table:

  ./uniset +8859-15.TXT -8859-1.TXT table
  0152 # LATIN CAPITAL LIGATURE OE
  0153 # LATIN SMALL LIGATURE OE
  0160 # LATIN CAPITAL LETTER S WITH CARON
  0161 # LATIN SMALL LETTER S WITH CARON
  0178 # LATIN CAPITAL LETTER Y WITH DIAERESIS
  017D # LATIN CAPITAL LETTER Z WITH CARON
  017E # LATIN SMALL LETTER Z WITH CARON
  20AC # EURO SIGN
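
The same difference can be cross-checked without the mapping files.
The following Python sketch is not part of uniset. It assumes that
Python's built-in "iso8859_15" and "latin_1" codecs agree with
8859-15.TXT and 8859-1.TXT, and the helper name repertoire() is made
up for this illustration.

```python
import unicodedata

def repertoire(codec):
    """Set of characters that a single-byte codec assigns to bytes 00-FF."""
    return set(bytes(range(256)).decode(codec))

# Characters in ISO 8859-15 that are not in ISO 8859-1 (plain set difference)
extra = sorted(repertoire("iso8859_15") - repertoire("latin_1"))

# Same "code point # name" shape as the uniset table output
table = [f"{ord(ch):04X} # {unicodedata.name(ch)}" for ch in extra]
print("\n".join(table))
```

If the assumption holds, this prints the same eight lines as the
uniset command above.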

Note that on non-Unix systems you might have to type "perl uniset
+8859-15.TXT -8859-1.TXT table" instead of "./uniset ..." in case your
operating system does not interpret the first line of the uniset file
as an instruction to activate the perl interpreter.

Type

  perl uniset

or

  ./uniset

to get a help page displayed that documents all functions of uniset.

If you prefer to get HTML output, just add "html" as the first word to
the command line, as in:

  ./uniset html +8859-15.TXT -8859-1.TXT table >new-8859-15-chars.html

Uniset can also generate the compact table format of a Unicode subset
that is used in the work of the CEN/TC304 P10 committee. Use the
"compact" command to do this; the "nr" command adds a line that
tells you the size of the set. For example, to get MES-2 in this
format, use

  ./uniset +MES-2 compact nr

  # Plane 00
  # Rows  Positions (Cells)

    00    20-7E A0-FF
    01    00-7F 8F 92 B7 DE-EF FA-FF
    02    18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D6 D8-DD DF EE
    ...
    26    3A-3C 40 42 60 63 65-66 6A-6B
    FB    01-02
    FF    FD

  # Number of characters in above table: 1062
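
The row/cell layout is easy to reproduce by hand. The Python sketch
below is an illustration only, not uniset's own code: the function
name compact_rows() is made up, it handles plane 00 only, and the
exact spacing is an assumption modelled on the output shown above.

```python
from itertools import groupby

def compact_rows(code_points):
    """Return one 'RR    cells' line per row (high byte) of plane 00."""
    lines = []
    bmp = sorted(cp for cp in set(code_points) if cp <= 0xFFFF)
    for row, members in groupby(bmp, key=lambda cp: cp >> 8):
        cells = [cp & 0xFF for cp in members]
        # Collapse consecutive cells into (first, last) runs
        runs = []
        start = prev = cells[0]
        for cell in cells[1:]:
            if cell != prev + 1:
                runs.append((start, prev))
                start = cell
            prev = cell
        runs.append((start, prev))
        text = " ".join(
            f"{a:02X}" if a == b else f"{a:02X}-{b:02X}" for a, b in runs
        )
        lines.append(f"{row:02X}    {text}")
    return lines

# The eight characters from the ISO 8859-15 example earlier in this file
extra = [0x0152, 0x0153, 0x0160, 0x0161, 0x0178, 0x017D, 0x017E, 0x20AC]
rows = compact_rows(extra)
print("\n".join(rows))
print(f"# Number of characters: {len(set(extra))}")
```

For these eight code points the sketch produces one line for row 01
and one for row 20.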

All non-HTML output of uniset is suitable to be read in again as a
repertoire file. You can therefore use uniset to comfortably build up
and manipulate your own database of character set collections.
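
Because the output is plain text, it is also easy to process with
other tools. As a rough sketch (not uniset's parser, which accepts
more formats than this), the following Python function reads back
only the "table" shape shown earlier, one hexadecimal code point per
line followed by a comment. The name read_table() and the sample
data are made up for this illustration.

```python
import re

def read_table(text):
    """Return the set of code points listed one per line as 'XXXX # NAME'."""
    code_points = set()
    for line in text.splitlines():
        match = re.match(r"\s*([0-9A-Fa-f]{4,6})\b", line)
        if match:  # comment lines and blank lines do not match and are skipped
            code_points.add(int(match.group(1), 16))
    return code_points

# Hypothetical saved table output
saved = """\
0152 # LATIN CAPITAL LIGATURE OE
0153 # LATIN SMALL LIGATURE OE
20AC # EURO SIGN
# a comment line is ignored
"""
collection = read_table(saved)
print(sorted(f"{cp:04X}" for cp in collection))
```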

Uniset can also read BDF files (X11 fonts) as subsets, which makes it
easy to verify which characters are available in a font. For example,
to find out which and how many WGL4 glyphs are still missing in
6x13.bdf, use

  ./uniset +WGL4 -6x13.bdf table nr

and to get a UTF-8 list of all characters in the 6x13 font, use

  ./uniset +6x13.bdf utf8-list >6x13.repertoire-utf8
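
To illustrate what reading a BDF file as a subset involves: every
glyph in a BDF file carries an ENCODING line, and in a font with an
ISO10646-1 encoding that number is the Unicode code point. The Python
sketch below is not how uniset is implemented. The function name
bdf_code_points(), the stripped-down font fragment and the "wanted"
set are all made up for this illustration.

```python
def bdf_code_points(bdf_text):
    """Collect the values of all ENCODING lines in a BDF font source."""
    code_points = set()
    for line in bdf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ENCODING":
            value = int(fields[1])
            if value >= 0:  # ENCODING -1 marks a glyph without a standard encoding
                code_points.add(value)
    return code_points

# Hypothetical fragment, reduced to the lines that matter here
sample = """\
STARTFONT 2.1
STARTCHAR space
ENCODING 32
ENDCHAR
STARTCHAR Euro
ENCODING 8364
ENDCHAR
STARTCHAR unencoded
ENCODING -1 5
ENDCHAR
ENDFONT
"""

# Which of the wanted characters does the font lack?
wanted = {0x0020, 0x0152, 0x20AC}
missing = sorted(wanted - bdf_code_points(sample))
print([f"{cp:04X}" for cp in missing])
```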

Uniset can also produce UTF-8 output and generate repertoire
descriptions of all characters found in a UTF-8 plain text file. See
the built-in help text for details.
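
As a minimal sketch of the idea (again in Python, not uniset's own
code), the repertoire of a UTF-8 text is simply the set of distinct
characters it contains. The function name text_repertoire() and the
sample string are made up for this illustration.

```python
import unicodedata

def text_repertoire(utf8_bytes):
    """Return the distinct characters of a UTF-8 text, sorted by code point."""
    return sorted(set(utf8_bytes.decode("utf-8")))

# Hypothetical sample text: "Zoe" with a diaeresis, a space, and a euro sign
sample_text = "Zo\u00eb \u20ac".encode("utf-8")
listing = [
    f"{ord(ch):04X} # {unicodedata.name(ch, '(no name)')}"
    for ch in text_repertoire(sample_text)
]
print("\n".join(listing))
```

Control characters such as line feeds have no Unicode name, which is
why the sketch passes a fallback string to unicodedata.name().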

*** Have fun ***

Markus

-- 
Markus Kuhn, Department of Computer Science and Technology
https://www.cl.cam.ac.uk/~mgk25/ | University of Cambridge
