Notes by Nicholas Cutler

regex
-----

The regex program is a simple command-line utility which demonstrates the use of the PCRE (Perl Compatible Regular Expression) library. It accepts a regular expression and reads a text file line by line, reporting each instance where the expression matches some text in the line.

It is invoked from the command line, so first ensure that the Filer can find the regex program. You can do this by putting it in the library directory, usually !Boot.Library; by adding the directory containing the executable to Run$Path; or by changing to the directory where regex is located. Now press F12, or open a task window and enter (for example):

   *regex -i test

You will be prompted to enter a regular expression. Once you have done this, regex reads the file test, reporting any matches as it goes.

The general command-line syntax is

   *regex [options] <filename>

where the options can be any of the following, each separated by spaces:

  -c Report any captured strings from the regular expression.
  -g Find matching strings globally; that is, don't stop looking after the first
     successful match.
  -i Prompt the user to input the pattern from the keyboard.
  -p <pattern> Specify the pattern on the command line.
  -s Match in a case-insensitive way.
  -t Report the execution time of the regular expression for each line.
  -u Treat the pattern and text strings as Unicode strings in UTF-8 encoding.

If you do not provide at least a filename on the command line, an error is given. A message is also printed for each unknown option, but such options are otherwise ignored.
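The option handling described above could be sketched in C roughly as follows. This is an illustration only, not the code shipped with regex: the regex_opts structure, the parse_args function and the wording of the messages are all hypothetical, and it assumes that the last non-option argument is the filename and that -p without a following pattern is reported like an unknown option.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical settings record; every name here is illustrative. */
typedef struct {
    int captures;         /* -c */
    int global;           /* -g */
    int interactive;      /* -i */
    int caseless;         /* -s */
    int timing;           /* -t */
    int unicode;          /* -u */
    const char *pattern;  /* argument of -p, or NULL */
    const char *filename; /* last non-option argument, or NULL */
} regex_opts;

/* Fill in *o from argv. Returns 0 on success, -1 if no filename was given. */
int parse_args(int argc, char **argv, regex_opts *o)
{
    static const regex_opts defaults = {0, 0, 0, 0, 0, 0, NULL, NULL};
    int i;

    assert(o != NULL);
    *o = defaults;

    for (i = 1; i < argc; i++) {
        const char *a = argv[i];

        if (a[0] != '-')
            o->filename = a;            /* assumption: last one wins */
        else if (strcmp(a, "-c") == 0)
            o->captures = 1;
        else if (strcmp(a, "-g") == 0)
            o->global = 1;
        else if (strcmp(a, "-i") == 0)
            o->interactive = 1;
        else if (strcmp(a, "-s") == 0)
            o->caseless = 1;
        else if (strcmp(a, "-t") == 0)
            o->timing = 1;
        else if (strcmp(a, "-u") == 0)
            o->unicode = 1;
        else if (strcmp(a, "-p") == 0 && i + 1 < argc)
            o->pattern = argv[++i];     /* consume the pattern argument */
        else
            fprintf(stderr, "Unknown option %s ignored\n", a);
    }

    if (o->filename == NULL) {
        fprintf(stderr, "No filename given\n");
        return -1;
    }
    return 0;
}
```

Unknown options produce a message but do not stop parsing, matching the behaviour described above; only a missing filename is treated as an error.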

In use, regex reads strings from the input file and tries to match the regular expression against each of them. It reports the position of any successful matches and, optionally, any captured substrings. Each string in the input file should end with a newline character (i.e. the normal line ending convention on RISC OS).

If you enable Unicode support, it is assumed that the subject strings, but not the pattern string, are already in UTF-8 encoding. RISC OS cannot display most Unicode characters, so any character not in the default character set is displayed as a hexadecimal code in angle brackets, e.g. <U+023A>.

A subject string can be split over more than one physical line by ending every line except the last with a slash '/'. Each such slash is replaced with a newline, which can therefore be matched with \n in the regular expression.
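The way continuation lines are joined into one subject string can be sketched in C as below. This is a minimal illustration under stated assumptions, not the code from the regex source: the function name append_physical_line and its buffer-based interface are hypothetical, and it assumes that only a slash immediately before the line ending counts as a continuation marker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append one physical line to the subject string being assembled in buf.
 * *len is the current length of the subject and is updated on success.
 * Returns 1 if the line ended with '/' (more lines follow), 0 if the
 * subject is now complete, and -1 if buf is too small.
 */
int append_physical_line(char *buf, size_t cap, size_t *len, const char *line)
{
    size_t n;
    int more = 0;

    assert(buf != NULL && len != NULL && line != NULL);

    n = strlen(line);
    if (n > 0 && line[n - 1] == '\n')   /* drop the line terminator */
        n--;
    if (n > 0 && line[n - 1] == '/')    /* trailing slash: continuation */
        more = 1;

    if (*len + n + 1 > cap)             /* need room for the text and NUL */
        return -1;

    memcpy(buf + *len, line, n);
    if (more)
        buf[*len + n - 1] = '\n';       /* the slash becomes a newline */
    *len += n;
    buf[*len] = '\0';
    return more;
}
```

A caller would typically set the length to zero, then pass in successive lines obtained from fgets until the function returns 0, at which point the buffer holds one complete subject string ready for matching.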

Note that if you specify neither -p nor -i on the command line, the first line of the text file is assumed to contain the pattern. If you are using Unicode, you will also need the Iconv module (http://www.netsurf-browser.org/projects/iconv/) to be installed before you invoke regex; it is used to convert the pattern string to UTF-8 before matching starts. The -t option also requires Timermod (http://www.armclub.org.uk/free/) to be present, so that times can be measured to a resolution of 1 microsecond.

I have tested regex and found that it should be satisfactory for trying the examples in the article that it accompanies in Archive magazine (issue 23:9). However, it is not intended to be a general text-processing tool, and it has not been extensively tested for this purpose. The C source code is included with this release.
 
