Uuseg_string (uuseg.Uuseg

Segment

type 'a folder = 'a -> string -> 'a

The type for segment folders. The function takes an accumulator and a segment. Segments are the UTF-X encoded characters delimited by two `Boundary occurrences. If the segmenter has no initial or final `Boundary, the folding function inserts an implicit one. Empty segments – which by definition do not happen with the default segmenters – are not reported.

val fold_utf_8 : [< Uuseg.boundary ] -> 'a folder -> 'a -> string -> 'a

fold_utf_8 b f acc s folds over the b UTF-8 encoded segments of the UTF-8 encoded string s using f and acc.
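As an illustration, the following minimal sketch collects the segments of a string into a list. It assumes the uuseg library is installed and linked; the helper name segments is hypothetical and not part of the library.

```ocaml
(* Assumption: the uuseg library is installed and linked. *)

(* [segments b s] is the list of [b] segments of the UTF-8 encoded
   string [s], in order. The folder prepends each segment to the
   accumulator, so the list is reversed once the fold is done. *)
let segments b s =
  List.rev (Uuseg_string.fold_utf_8 b (fun acc seg -> seg :: acc) [] s)

(* Usage:
   segments `Grapheme_cluster "e\u{0301}a" keeps "e" and the combining
   acute accent U+0301 together in a single segment.
   segments `Word "hello world" reports the space as its own segment. *)
```

Accumulating in reverse and calling List.rev once keeps the fold linear in the number of segments.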

val fold_utf_16be : [< Uuseg.boundary ] -> 'a folder -> 'a -> string -> 'a

fold_utf_16be is like fold_utf_8 but on UTF-16BE encoded strings.

val fold_utf_16le : [< Uuseg.boundary ] -> 'a folder -> 'a -> string -> 'a

fold_utf_16le is like fold_utf_8 but on UTF-16LE encoded strings.
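The UTF-16 variants are used the same way as fold_utf_8. The sketch below counts grapheme clusters in a UTF-16LE encoded string; the function name count_graphemes_utf_16le is hypothetical, and the uuseg library is assumed to be linked.

```ocaml
(* Assumption: the uuseg library is installed and linked. *)

(* Count the grapheme clusters of the UTF-16LE encoded string [s].
   The folder ignores the segment itself and only increments the
   counter. *)
let count_graphemes_utf_16le s =
  Uuseg_string.fold_utf_16le `Grapheme_cluster (fun n _ -> n + 1) 0 s

(* Usage: "a\x00b\x00" is "ab" encoded as UTF-16LE. *)
let two = count_graphemes_utf_16le "a\x00b\x00"
```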

Pretty-printers

val pp_utf_8 : Stdlib.Format.formatter -> string -> unit

pp_utf_8 ppf s prints the UTF-8 encoded string s. Each grapheme cluster is considered as taking a length of 1.
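A minimal usage sketch with the %a conversion follows. The helper name render is hypothetical and the uuseg library is assumed to be linked. The point of using this printer rather than %s is that Format then counts one column per grapheme cluster instead of one per byte when it computes layout.

```ocaml
(* Assumption: the uuseg library is installed and linked. *)

(* Render [s] to a string through Format. The bytes written are those
   of [s]; only the width that Format accounts for differs from what
   [Format.pp_print_string] would assume. *)
let render s = Format.asprintf "%a" Uuseg_string.pp_utf_8 s

(* Usage inside a larger format string. *)
let greet name = Format.asprintf "@[Hello %a!@]" Uuseg_string.pp_utf_8 name
```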

val pp_utf_8_text : Stdlib.Format.formatter -> string -> unit

pp_utf_8_text ppf s prints the UTF-8 encoded string s. Each grapheme cluster is considered as taking a length of 1. Each line break opportunity is hinted with Format.pp_print_break and mandatory line breaks issue a Format.pp_force_newline call.

Take into account the following points:

Any white space Unicode character occurring before a break opportunity will be translated to a space (U+0020) in output if no break occurs.
The sequence CR LF (U+000D, U+000A) and all kinds of mandatory line breaks are translated to whatever line separator is output by Format.pp_force_newline. See pp_utf_8_lines for the list of characters treated as mandatory line breaks.
Soft hyphens are handled but due to limitations in Format are not replaced by hard ones on breaks.
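The behaviour described above can be tried with a small sketch that wraps text to a given margin. The function name wrap and its ~margin parameter are hypothetical; the sketch assumes the uuseg library is linked and relies only on standard Format functions for the rest.

```ocaml
(* Assumption: the uuseg library is installed and linked. *)

(* Lay out the UTF-8 encoded text [s] in a box, letting Format break
   lines at the break opportunities hinted by pp_utf_8_text. The
   result is returned as a string. *)
let wrap ~margin s =
  let b = Buffer.create 256 in
  let ppf = Format.formatter_of_buffer b in
  Format.pp_set_margin ppf margin;
  (* "@[" opens a box in which break hints may become newlines,
     "@?" flushes the formatter so that the buffer is complete. *)
  Format.fprintf ppf "@[%a@]@?" Uuseg_string.pp_utf_8_text s;
  Buffer.contents b

let wrapped = wrap ~margin:20 "The quick brown fox jumps over the lazy dog"
```

Where the lines actually break depends on the margin and on the enclosing boxes of the formatter, so no particular output is claimed here.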

val pp_utf_8_lines : Stdlib.Format.formatter -> string -> unit

pp_utf_8_lines ppf s prints the UTF-8 encoded string s. Each grapheme cluster is considered as taking a length of 1. Each mandatory line break (including the sequence CR LF (U+000D, U+000A)) issues a Format.pp_force_newline and is translated to whatever line separator this function outputs.

This function correctly handles all kinds of line ends present in Unicode. As of Unicode 7.0.0 these are FORM FEED (U+000C), LINE TABULATION (U+000B), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029), NEXT LINE (U+0085), LINE FEED (U+000A), CARRIAGE RETURN (U+000D), and the sequence CR LF (U+000D, U+000A).
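One possible use of pp_utf_8_lines is to normalize mixed line ends in a string. The sketch below assumes the uuseg library is linked; the name normalize_lines is hypothetical, and the choice of a vertical box around the text is an assumption rather than a requirement stated by the documentation.

```ocaml
(* Assumption: the uuseg library is installed and linked. *)

(* Print [s] through pp_utf_8_lines into a string. Every mandatory
   line break of [s], whatever its original encoding, is replaced by
   the line separator that Format.pp_force_newline emits on this
   formatter. *)
let normalize_lines s =
  Format.asprintf "@[<v>%a@]" Uuseg_string.pp_utf_8_lines s

(* Usage: the input mixes CR LF and LINE SEPARATOR (U+2028). *)
let normalized = normalize_lines "one\r\ntwo\u{2028}three"
```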