Html5rw.Entities
HTML entity decoding.
HTML uses character references to represent characters that are hard to type or have special meaning:
- &amp;amp; (ampersand), &amp;lt; (less than), &amp;nbsp; (non-breaking space)
- &amp;#60; (less than as decimal 60)
- &amp;#x3C; (less than as hex 3C)
This module decodes all 2,231 named character references defined in the WHATWG specification, plus numeric references.
HTML5 Named Character Reference Decoding
This module provides functions for decoding HTML5 named character references (entities) and numeric character references. It includes the complete table of 2,231 named character references defined in the WHATWG HTML5 specification.
HTML5 supports three types of character references:
- &amp;amp;, &amp;lt;, &amp;gt;, &amp;NotNestedLessLess; (named)
- &amp;#123; (decimal codepoint)
- &amp;#x7B; or &amp;#X7B; (hexadecimal codepoint)
Some named entities are "legacy": they were supported without a trailing semicolon in older browsers (e.g., &amp;amp instead of &amp;amp;). The parser handles these according to the WHATWG specification.
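The three reference forms listed above can be told apart by their syntax alone. The following is a minimal, standalone sketch of that classification; the `reference` type and the `parse_digits` and `classify` functions are hypothetical names introduced for illustration and are not part of Html5rw.Entities. It only handles semicolon-terminated references and does not check names against the entity table.

```ocaml
(* Hypothetical type for illustration; not part of Html5rw.Entities. *)
type reference =
  | Named of string   (* the name between '&' and ';' *)
  | Decimal of int    (* codepoint written in base 10 *)
  | Hex of int        (* codepoint written in base 16 *)
  | Not_a_reference

(* Parse a non-empty run of digits in the given base (10 or 16).
   The result saturates at 0x110000 so that very long digit runs
   cannot overflow. *)
let parse_digits ~base (s : string) : int option =
  let digit c =
    match c with
    | '0' .. '9' -> Some (Char.code c - Char.code '0')
    | 'a' .. 'f' -> Some (Char.code c - Char.code 'a' + 10)
    | 'A' .. 'F' -> Some (Char.code c - Char.code 'A' + 10)
    | _ -> None
  in
  if s = "" then None
  else
    Seq.fold_left
      (fun acc c ->
        match acc, digit c with
        | Some n, Some d when d < base -> Some (min (n * base + d) 0x110000)
        | _ -> None)
      (Some 0) (String.to_seq s)

(* Classify a complete, semicolon-terminated reference such as
   "&amp;" or "&#x7B;". *)
let classify (s : string) : reference =
  let n = String.length s in
  if n < 3 || s.[0] <> '&' || s.[n - 1] <> ';' then Not_a_reference
  else
    let body = String.sub s 1 (n - 2) in
    let m = String.length body in
    if body.[0] <> '#' then Named body
    else if m >= 2 && (body.[1] = 'x' || body.[1] = 'X') then
      (match parse_digits ~base:16 (String.sub body 2 (m - 2)) with
       | Some cp -> Hex cp
       | None -> Not_a_reference)
    else
      (match parse_digits ~base:10 (String.sub body 1 (m - 1)) with
       | Some cp -> Decimal cp
       | None -> Not_a_reference)
```

For instance, `classify "&#123;"` and `classify "&#x7B;"` both carry codepoint 123 (the left curly brace), one tagged `Decimal` and the other `Hex`.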
Decode all character references in a text string.
Processes the string and replaces all valid character references (named and numeric) with their decoded UTF-8 equivalents.
decode "Hello &amp;amp; goodbye"
(* Returns: "Hello & goodbye" *)
decode "&amp;lt;script&amp;gt;"
(* Returns: "<script>" *)
Decode a numeric character reference.
Note: Some codepoints are replaced according to the HTML5 specification (e.g., control characters in the 0x80-0x9F range are mapped to Windows-1252 equivalents).
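To make the replacement rule concrete, here is a small standalone sketch of how a numeric codepoint might be sanitized and then encoded as UTF-8. The names `c1_replacements`, `sanitize`, and `utf8_of_codepoint` are hypothetical and do not come from this module, and the table is only an excerpt of the WHATWG replacement rows for the 0x80-0x9F range. The sketch also assumes the usual WHATWG handling of zero, surrogates, and out-of-range values (all replaced by U+FFFD).

```ocaml
(* Excerpt of the WHATWG replacement table for 0x80-0x9F.
   The specification defines more rows than are listed here. *)
let c1_replacements =
  [ 0x80, 0x20AC;  (* euro sign *)
    0x85, 0x2026;  (* horizontal ellipsis *)
    0x91, 0x2018;  (* left single quotation mark *)
    0x92, 0x2019;  (* right single quotation mark *)
    0x93, 0x201C;  (* left double quotation mark *)
    0x94, 0x201D;  (* right double quotation mark *)
    0x96, 0x2013;  (* en dash *)
    0x97, 0x2014;  (* em dash *)
    0x99, 0x2122 ] (* trade mark sign *)

(* Map a raw numeric reference value to the codepoint to emit. *)
let sanitize (cp : int) : int =
  if cp <= 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then 0xFFFD
  else
    match List.assoc_opt cp c1_replacements with
    | Some replacement -> replacement
    | None -> cp

(* Encode the sanitized codepoint as a UTF-8 string. *)
let utf8_of_codepoint (cp : int) : string =
  let buf = Buffer.create 4 in
  Buffer.add_utf_8_uchar buf (Uchar.of_int (sanitize cp));
  Buffer.contents buf
```

Under these assumptions, `utf8_of_codepoint 0x80` yields the three-byte UTF-8 encoding of the euro sign rather than a C1 control character.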
Look up a named character reference.
lookup "amp" (* Some [0x26] *)
lookup "nbsp" (* Some [0xA0] *)
lookup "bogus" (* None *)
Check if an entity is a legacy entity.
Legacy entities are those that were historically recognized without a trailing semicolon. The parser handles these specially to maintain browser compatibility.
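As an illustration of what recognizing a reference without a trailing semicolon involves, the sketch below matches the text following an ampersand against a short list of legacy names. The `legacy_names` list is a small hand-picked excerpt, and `starts_with` and `match_legacy` are hypothetical helpers; none of this is the library's actual data or implementation, and attribute-specific rules from the specification are left out.

```ocaml
(* Small excerpt of names accepted without a trailing semicolon.
   Hypothetical table for illustration only. *)
let legacy_names = [ "amp"; "lt"; "gt"; "quot"; "nbsp"; "copy"; "reg" ]

(* True when [s] begins with [prefix]. *)
let starts_with ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Given the text immediately after '&', return the longest legacy
   name that the text starts with, if any. *)
let match_legacy (after_amp : string) : string option =
  List.fold_left
    (fun best name ->
      if starts_with ~prefix:name after_amp then
        match best with
        | Some b when String.length b >= String.length name -> best
        | _ -> Some name
      else best)
    None legacy_names
```

With this excerpt, `match_legacy "copy2024"` finds `"copy"` even though no semicolon follows, while a name outside the list such as `"hellip"` is not matched.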
is_legacy "amp" (* true - &amp;amp works without ; *)
is_legacy "nbsp" (* true *)
is_legacy "Aacute" (* false - requires semicolon *)
module Numeric_ref : sig ... end
Numeric character reference handling.