Streams
Streams of elements of type 'a.
In simple usage, when using only this module Markup, the additional type parameter 's is always sync, and there is no need to consider it further.
However, if you are using Markup_lwt, you may create some async streams. The difference between the two is that next on a sync stream retrieves an element before next "returns," while next on an async stream might not retrieve an element until later. As a result, it is not safe to pass an async stream where a sync stream is required. The phantom types are used to make the type checker catch such errors at compile time.
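The phantom-type idea described above can be sketched with a toy module. This is a minimal illustration only, assuming nothing about Markup.ml's internals: the module `Toy` and its `of_list` function are hypothetical names invented for this sketch.

```ocaml
(* Hypothetical toy module: NOT Markup.ml's implementation, only the idea. *)
module Toy : sig
  type sync
  type async
  type ('a, 's) stream

  val of_list : 'a list -> ('a, sync) stream

  (* Only accepts sync streams. Passing an (_, async) stream here
     would be rejected by the type checker at compile time. *)
  val next : ('a, sync) stream -> 'a option
end = struct
  type sync
  type async

  (* 's is a phantom parameter: it does not appear in the representation. *)
  type ('a, 's) stream = 'a list ref

  let of_list items = ref items

  let next s =
    match !s with
    | [] -> None
    | x :: rest -> s := rest; Some x
end

let s = Toy.of_list [1; 2]
let first = Toy.next s
let second = Toy.next s
let third = Toy.next s
```

Because `sync` and `async` are distinct abstract types, the marker costs nothing at run time and exists only for the type checker.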
Errors
module Error : sig ... end
Error type and to_string function.
Encodings
module Encoding : sig ... end
Common Internet encodings such as UTF-8 and UTF-16; also includes some less popular encodings that are sometimes used for XML.
Signals
Representation of an XML declaration, i.e. <?xml version="1.0" encoding="utf-8"?>.
type doctype = {
  doctype_name : string option;
  public_identifier : string option;
  system_identifier : string option;
  raw_text : string option;
  force_quirks : bool;
}
Representation of a document type declaration. The HTML parser fills in all fields besides raw_text. The XML parser reads declarations roughly, and fills only the raw_text field with the text found in the declaration.
type signal = [
  | `Start_element of name * (name * string) list
  | `End_element
  | `Text of string list
  | `Doctype of doctype
  | `Xml of xml_declaration
  | `PI of string * string
  | `Comment of string
]
Parsing signals. The parsers emit them according to the following grammar:
doc ::= `Xml? misc* `Doctype? misc* element misc*
misc ::= `PI | `Comment
element ::= `Start_element content* `End_element
content ::= `Text | element | `PI | `Comment
As a result, emitted `Start_element and `End_element signals are always balanced, and, if there is an XML declaration, it is the first signal.
If parsing with ~context:`Document, the signal sequence will match the doc production until the first error. If parsing with ~context:`Fragment, it will match content*. If ~context is not specified, the parser will pick one of the two by examining the input.
As an example, if the XML parser is parsing
<?xml version="1.0"?><root>text<nested>more text</nested></root>
it will emit the signal sequence
`Xml {version = "1.0"; encoding = None; standalone = None}
`Start_element (("", "root"), [])
`Text ["text"]
`Start_element (("", "nested"), [])
`Text ["more text"]
`End_element
`End_element
The `Text signal carries a string list instead of a single string because on 32-bit platforms, OCaml strings cannot be larger than about 16 MB. If the parsers encounter a very long run of text, one whose length exceeds about Sys.max_string_length / 2, they emit a `Text signal with several strings.
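Since `Text carries a string list, code that wants the text as one string has to join the pieces. The following is a small sketch, assuming the `markup` package is installed and exposes the module as `Markup`; the helper name `text_of_xml` is made up for this example.

```ocaml
(* Collect all text content of an XML document into a single string.
   Each `Text signal may carry several strings, so join them. *)
let text_of_xml (xml : string) : string =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.fold
       (fun acc signal ->
         match signal with
         | `Text ss -> acc ^ String.concat "" ss
         | _ -> acc)
       ""

let example =
  text_of_xml
    "<?xml version=\"1.0\"?><root>text<nested>more text</nested></root>"
```

Repeated `^` is quadratic on large inputs; a Buffer.t would be the better accumulator for real documents.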
val signal_to_string : [< signal ] -> string
Provides a human-readable representation of signals for debugging.
Parsers
An 's parser is a thin wrapper around a (signal, 's) stream that supports access to additional information that is not carried directly in the stream, such as source locations.
Evaluates to the location of the last signal emitted on the parser's signal stream. If no signals have yet been emitted, evaluates to (1, 1).
XML
val parse_xml : ?report:(location -> Error.t -> unit) -> ?encoding:Encoding.t -> ?namespace:(string -> string option) -> ?entity:(string -> string option) -> ?context:[< `Document | `Fragment ] -> (char, 's) stream -> 's parser
Creates a parser that converts an XML byte stream to a signal stream.
For simple usage, string "foo" |> parse_xml |> signals.
If ~report is provided, report is called for every error encountered. You may raise an exception in report, and it will propagate to the code reading the signal stream.
If ~encoding is not specified, the parser detects the input encoding automatically. Otherwise, the given encoding is used.
~namespace is called when the parser is unable to resolve a namespace prefix. If it evaluates to Some s, the parser maps the prefix to s. Otherwise, the parser reports `Bad_namespace.
~entity is called when the parser is unable to resolve an entity reference. If it evaluates to Some s, the parser inserts s into the text or attribute being parsed without any further parsing of s. s is assumed to be encoded in UTF-8. If entity evaluates to None instead, the parser reports `Bad_token. See xhtml_entity if you are parsing XHTML.
The meaning of ~context is described at signal, above.
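As an illustration of ~report and of the balancing guarantee, the sketch below counts reported errors for a malformed input and checks that start and end signals still pair up. It assumes the `markup` package is installed; the names `error_count`, `count`, `starts`, and `ends` are local to the example.

```ocaml
(* Count the errors the XML parser reports for malformed input. *)
let error_count = ref 0

let signals =
  Markup.string "<root><unclosed></root>"
  |> Markup.parse_xml ~report:(fun _location _error -> incr error_count)
  |> Markup.signals
  (* fold is eager, so this drives the parser to the end of input. *)
  |> Markup.fold (fun acc signal -> signal :: acc) []
  |> List.rev

(* Number of collected signals satisfying predicate p. *)
let count p = List.length (List.filter p signals)
let starts = count (function `Start_element _ -> true | _ -> false)
let ends = count (function `End_element -> true | _ -> false)
```

The report callback here only counts. Raising an exception from it instead would abort parsing at the first error.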
val write_xml : ?report:((signal * int) -> Error.t -> unit) -> ?prefix:(string -> string option) -> ([< signal ], 's) stream -> (char, 's) stream
Converts an XML signal stream to a byte stream.
If ~report is provided, it is called for every error encountered. The first argument is a pair of the signal causing the error and its index in the signal stream. You may raise an exception in report, and it will propagate to the code reading the byte stream.
~prefix is called when the writer is unable to find a prefix in scope for a namespace URI. If it evaluates to Some s, the writer uses s for the URI. Otherwise, the writer reports `Bad_namespace.
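A round trip through parse_xml and write_xml might look like the sketch below. It assumes the `markup` package is installed; `collect`, `original`, and `rewritten` are names chosen for the example. Rather than relying on the exact bytes the writer produces, the check compares the signals obtained by parsing the output again.

```ocaml
(* Parse a string and collect the resulting signals into a list. *)
let collect xml =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.fold (fun acc signal -> signal :: acc) []
  |> List.rev

let original = "<root a=\"1\">text<nested>more text</nested></root>"

(* Parse, then serialize the signal stream back to a string. *)
let rewritten =
  Markup.string original
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.write_xml
  |> Markup.to_string
```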
HTML
val parse_html : ?report:(location -> Error.t -> unit) -> ?encoding:Encoding.t -> ?context:[< `Document | `Fragment of string ] -> (char, 's) stream -> 's parser
Similar to parse_xml, but parses HTML with embedded SVG and MathML, never emits `Xml or `PI signals, and the `Fragment tag of ~context carries a string argument.
For HTML fragments, you should specify the enclosing element, e.g. `Fragment "body". This is because, when parsing HTML, error recovery and the interpretation of text depend on the current element. For example, the text
foo</bar>
parses differently in title elements than in p elements. In the former, it is parsed as foo</bar>, while in the latter, it is foo followed by a parse error due to unmatched tag </bar>. To get these behaviors, set ~context to `Fragment "title" and `Fragment "p", respectively.
If you use `Fragment "svg", the fragment is assumed to be SVG markup. Likewise, `Fragment "math" causes the parser to parse MathML markup.
If ~context is omitted, the parser guesses it from the input stream. For example, if the first signal would be `Doctype, the context is set to `Document, but if the first signal would be `Start_element "td", the context is set to `Fragment "tr". If the first signal would be `Start_element "g", the context is set to `Fragment "svg".
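The title versus p difference described above can be observed with a short sketch. It assumes the `markup` package is installed; `text_in` is a helper name invented for this example.

```ocaml
(* Parse the same bytes as a fragment of the given enclosing element,
   and return only the text that survives parsing. *)
let text_in (enclosing : string) : string =
  Markup.string "foo</bar>"
  |> Markup.parse_html ~context:(`Fragment enclosing)
  |> Markup.signals
  |> Markup.text
  |> Markup.to_string

let in_title = text_in "title"
let in_p = text_in "p"
```

In the title context the stray end tag is kept as text, while in the p context it is dropped as a parse error, so only foo remains.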
val write_html : ?escape_attribute:(string -> string) -> ?escape_text:(string -> string) -> ([< signal ], 's) stream -> (char, 's) stream
Similar to write_xml, but emits HTML5 instead of XML. If ~escape_attribute and/or ~escape_text are provided, they are used instead of default escaping functions.
Input sources
Evaluates to a stream that retrieves successive bytes from the given string.
val buffer : Stdlib.Buffer.t -> (char, sync) stream
Evaluates to a stream that retrieves successive bytes from the given buffer. Be careful of changing the buffer while it is being iterated by the stream.
val channel : Stdlib.in_channel -> (char, sync) stream
Evaluates to a stream that retrieves bytes from the given channel. If the channel cannot be read, the next read of the stream results in raising Sys_error.
Note that this input source is synchronous because Stdlib.in_channel reads are blocking. For non-blocking channels, see Markup_lwt_unix.
file path opens the file at path, then evaluates to a pair s, close, where reading from stream s retrieves successive bytes from the file, and calling close () closes the file.
The file is closed automatically if s is read to completion, or if reading s raises an exception. It is not necessary to call close () in these cases.
If the file cannot be opened, raises Sys_error immediately. If the file cannot be read, reading the stream raises Sys_error.
fn f is a stream that retrieves bytes by calling f (). If the call results in Some c, the stream emits c. If the call results in None, the stream is considered to have ended.
This is actually an alias for stream, restricted to type char.
Output destinations
Eagerly retrieves bytes from the given stream and assembles a string.
val to_buffer : (char, sync) stream -> Stdlib.Buffer.t
Eagerly retrieves bytes from the given stream and places them into a buffer.
val to_channel : Stdlib.out_channel -> (char, sync) stream -> unit
Eagerly retrieves bytes from the given stream and writes them to the given channel. If writing fails, raises Sys_error.
Eagerly retrieves bytes from the given stream and writes them to the given file. If writing fails, or the file cannot be opened, raises Sys_error. Note that the file is truncated (cleared) before writing. If you wish to append to file, open it with the appropriate flags and use to_channel on the resulting channel.
Stream operations
stream f creates a stream that repeatedly calls f (). Each time f () evaluates to Some v, the next item in the stream is v. The first time f () evaluates to None, the stream ends.
Retrieves the next item in the stream, if any, and removes it from the stream.
Retrieves the next item in the stream, if any, but does not remove the item from the stream.
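The three functions described so far (stream, next, and peek) can be tried together. This is a minimal sketch assuming the `markup` package is installed; `counter_stream` is a name made up for the example.

```ocaml
(* A stream that yields 1, 2, ..., limit and then ends. *)
let counter_stream (limit : int) =
  let n = ref 0 in
  Markup.stream (fun () ->
    if !n >= limit then None
    else begin
      incr n;
      Some !n
    end)

let s = counter_stream 3
let peeked = Markup.peek s   (* looks at the first item, keeps it *)
let a = Markup.next s        (* removes the first item *)
let b = Markup.next s
let c = Markup.next s
let d = Markup.next s        (* the stream has ended *)
```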
transform f init s lazily creates a stream by repeatedly applying f acc v, where acc is an accumulator whose initial value is init, and v is consecutive values of s. Each time, f acc v evaluates to a pair (vs, maybe_acc'). The values vs are added to the result stream. If maybe_acc' is Some acc', the accumulator is set to acc'. Otherwise, if maybe_acc' is None, the result stream ends.
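To make the accumulator protocol concrete, here is a sketch that emits running totals and ends the result stream once the total would exceed 10. It assumes the `markup` package is installed; `naturals` and `running_totals` are hypothetical helper names.

```ocaml
(* An endless stream 1, 2, 3, ... built with Markup.stream. *)
let naturals () =
  let n = ref 0 in
  Markup.stream (fun () -> incr n; Some !n)

(* f acc v returns (items to emit, Some new_acc) to continue,
   or (items to emit, None) to end the result stream. *)
let running_totals s =
  Markup.transform
    (fun sum v ->
      let sum' = sum + v in
      if sum' > 10 then ([], None) else ([sum'], Some sum'))
    0 s

let totals =
  naturals ()
  |> running_totals
  |> Markup.fold (fun acc v -> v :: acc) []
  |> List.rev
```

Because transform is lazy, the infinite source is only read as far as needed: the fold stops as soon as the callback returns None.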
fold f init s eagerly folds over the items v, v', v'', ... of s, i.e. evaluates f (f (f init v) v') v''...
map f s lazily applies f to each item of s, and produces the resulting stream.
filter f s is s without the items for which f evaluates to false. filter is lazy.
filter_map f s lazily applies f to each item v of s. If f v evaluates to Some v', the result stream has v'. If f v evaluates to None, no item corresponding to v appears in the result stream.
iter f s eagerly applies f to each item of s, i.e. evaluates f v; f v'; f v''...
drain s eagerly consumes s. This is useful for observing side effects, such as parsing errors, when you don't care about the parsing signals themselves. It is equivalent to iter ignore s.
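The lazy and eager operations compose into pipelines. As a sketch, assuming the `markup` package is installed, the hypothetical helper `count_elements` below counts the elements of an XML document with filter (lazy) followed by fold (eager).

```ocaml
(* Count elements by keeping only `Start_element signals. *)
let count_elements (xml : string) : int =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.filter (function `Start_element _ -> true | _ -> false)
  |> Markup.fold (fun n _ -> n + 1) 0

let n = count_elements "<a><b/><c>t</c></a>"
```

Nothing is parsed until fold starts pulling items, so the document is processed in a single streaming pass.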
Utility
val content : ([< signal ], 's) stream -> (content_signal, 's) stream
Converts a signal stream into a content_signal stream by filtering out all signals besides `Start_element, `End_element, and `Text.
val tree : ?text:(string list -> 'a) -> ?element:(name -> (name * string) list -> 'a list -> 'a) -> ?comment:(string -> 'a) -> ?pi:(string -> string -> 'a) -> ?xml:(xml_declaration -> 'a) -> ?doctype:(doctype -> 'a) -> ([< signal ], sync) stream -> 'a option
This function's type signature may look intimidating, but it is actually easy to use. It is best introduced by example:
type my_dom = Text of string | Element of string * my_dom list

"<p>HTML5 is <em>easy</em> to parse"
|> string
|> parse_html
|> signals
|> tree
  ~text:(fun ss -> Text (String.concat "" ss))
  ~element:(fun (_, name) _ children -> Element (name, children))
results in the structure (wrapped in Some)
Element ("p", [
  Text "HTML5 is ";
  Element ("em", [Text "easy"]);
  Text " to parse"])
Formally, tree assembles a tree data structure of type 'a from a signal stream. The stream is parsed according to the following grammar:
stream ::= node*
node ::= element | `Text | `Comment | `PI | `Xml | `Doctype
element ::= `Start_element node* `End_element
Each time tree matches a production of node, it calls the corresponding function to convert the node into your tree type 'a. For example, when tree matches `Text ss, it calls ~text ss, if ~text is supplied. Similarly, when tree matches element, it calls ~element name attributes children, if ~element is supplied.
See trees if the input stream might have multiple top-level trees. This function tree only retrieves the first one.
val trees : ?text:(string list -> 'a) -> ?element:(name -> (name * string) list -> 'a list -> 'a) -> ?comment:(string -> 'a) -> ?pi:(string -> string -> 'a) -> ?xml:(xml_declaration -> 'a) -> ?doctype:(doctype -> 'a) -> ([< signal ], 's) stream -> ('a, 's) stream
Like tree, but converts all top-level trees, not only the first one. The trees are emitted on the resulting stream, in the sequence that they appear in the input.
type 'a node = [
  | `Element of name * (name * string) list * 'a list
  | `Text of string
  | `Doctype of doctype
  | `Xml of xml_declaration
  | `PI of string * string
  | `Comment of string
]
See from_tree below.
Deconstructs tree data structures of type 'a into signal streams. The function argument is applied to each data structure node. For example,
type my_dom = Text of string | Element of string * my_dom list
let dom =
Element ("p", [
Text "HTML5 is ";
Element ("em", [Text "easy"]);
Text " to parse"])
dom |> from_tree (function
| Text s -> `Text s
  | Element (name, children) -> `Element (("", name), [], children))
results in the signal stream
`Start_element (("", "p"), [])
`Text ["HTML5 is "]
`Start_element (("", "em"), [])
`Text ["easy"]
`End_element
`Text [" to parse"]
`End_element
val elements : (name -> (name * string) list -> bool) -> ([< signal ] as 'a, 's) stream -> (('a, 's) stream, 's) stream
elements f s scans the signal stream s for `Start_element (name, attributes) signals that satisfy f name attributes. Each such matching signal is the beginning of a substream that ends with the corresponding `End_element signal. The result of elements f s is the stream of these substreams.
Matches don't nest. If there is a matching element contained in another matching element, only the top one results in a substream.
Code using elements does not have to read each substream to completion, or at all. However, once the using code has tried to get the next substream, it should not try to read a previous one.
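A typical use of elements is to pull out every element with a given local name and process each one as its own substream. The sketch below assumes the `markup` package is installed; `item_texts` is a name invented for the example.

```ocaml
(* Return the text content of every element whose local name is "item". *)
let item_texts (xml : string) : string list =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.elements (fun (_namespace, local) _attributes -> local = "item")
  (* Each substream is read to completion before the next is requested. *)
  |> Markup.map (fun sub -> sub |> Markup.text |> Markup.to_string)
  |> Markup.fold (fun acc t -> t :: acc) []
  |> List.rev

let texts =
  item_texts "<list><item>a</item><other>x</other><item>b</item></list>"
```

Reading each substream inside map, before fold asks for the next one, respects the rule above about not returning to an earlier substream.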
Extracts all the text in a signal stream by discarding all markup. For each `Text ss signal, the result stream has the bytes of the strings ss, and all other signals are ignored.
val trim : ([> content_signal ] as 'a, 's) stream -> ('a, 's) stream
Trims insignificant whitespace in an HTML signal stream. Whitespace around flow ("block") content does not matter, but whitespace in phrasing ("inline") content does. So, if the input stream is
<div>
<p>
<em>foo</em> bar
</p>
</div>
passing it through Markup.trim will result in
<div><p><em>foo</em> bar</p></div>
Note that whitespace around the </em> tag was preserved.
Concatenates adjacent `Text signals, then eliminates all empty strings, then all `Text [] signals. Signals besides `Text are unaffected. Note that signal streams emitted by the parsers already have normalized text. This function is useful when you are inserting text into a signal stream after parsing, or generating streams from scratch, and would like to clean up the `Text signals.
val pretty_print : ([> content_signal ] as 'a, 's) stream -> ('a, 's) stream
Adjusts the whitespace in the `Text signals in the given stream so that the output appears nicely-indented when the stream is converted to bytes and written.
This function is aware of the significance of whitespace in HTML, so it avoids changing the whitespace in phrasing ("inline") content. For example, pretty printing
<div><p><em>foo</em>bar</p></div>
results in
<div>
 <p>
  <em>foo</em>bar
 </p>
</div>
Note that no whitespace was inserted around <em> and </em>, because doing so would create a word break that wasn't present in the original stream.
Converts a signal stream into an HTML5 signal stream by stripping any document type declarations, XML declarations, and processing instructions, and prefixing the HTML5 doctype declaration. This is useful when converting between XHTML and HTML.
val xhtml : ?dtd:[< `Strict_1_0 | `Transitional_1_0 | `Frameset_1_0 | `Strict_1_1 ] -> ([< signal ], 's) stream -> (signal, 's) stream
Similar to html5, but does not strip processing instructions, and prefixes an XHTML document type declaration and an XML declaration. The ~dtd argument specifies which DTD to refer to in the doctype declaration. The default is `Strict_1_1.
Translates XHTML entities. This function is for use with the ~entity argument of parse_xml when parsing XHTML.
strings_to_bytes s is the stream of all the bytes of all strings in s.
Orders locations according to their appearance in an input stream, i.e. first by line, and then, for locations on the same line, by column.
Namespaces
module Ns : sig ... end
Common namespace URIs.
Asynchronous interface
module type ASYNCHRONOUS = sig ... end
Markup.ml interface for monadic I/O libraries such as Lwt and Async.