The UTF8-hul module recognizes and validates content streams encoded with the Unicode UTF-8 encoding.
The module is invoked with the command-line option:
jhove ... -m utf8-hul ...
- UTF-8 encoded content streams [Unicode]
The following criteria must be met by a UTF-8 content stream for JHOVE to consider it well-formed:
The stream consists of an optional three-octet encoded Byte Order Mark
(BOM) character, 0xEFBBBF, followed by an arbitrary number of
the following one- to four-octet sequences:
Single octet:  0xxxxxxx
Two octets:    110yyyyy 10xxxxxx
Three octets:  1110zzzz 10yyyyyy 10xxxxxx
Four octets:   11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
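The octet patterns above can be checked mechanically. The following Python sketch is an illustration only, not JHOVE's implementation: the names UTF8_BOM, sequence_length, and matches_octet_patterns are hypothetical, and the check covers only the lead-octet and continuation-octet bit patterns listed above. It does not reject overlong encodings or surrogate code points, which a complete UTF-8 validator would also need to handle.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # optional three-octet Byte Order Mark


def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead octet, or 0 if none matches."""
    if (lead & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (lead & 0b11100000) == 0b11000000:  # 110yyyyy
        return 2
    if (lead & 0b11110000) == 0b11100000:  # 1110zzzz
        return 3
    if (lead & 0b11111000) == 0b11110000:  # 11110uuu
        return 4
    return 0


def matches_octet_patterns(data: bytes) -> bool:
    """Check that data is an optional BOM followed by one- to four-octet sequences."""
    pos = len(UTF8_BOM) if data.startswith(UTF8_BOM) else 0
    while pos < len(data):
        length = sequence_length(data[pos])
        if length == 0:
            return False  # octet cannot start a sequence
        trail = data[pos + 1 : pos + length]
        if len(trail) != length - 1:
            return False  # stream ends in the middle of a sequence
        # Every trailing octet must have the form 10xxxxxx.
        if any((octet & 0b11000000) != 0b10000000 for octet in trail):
            return False
        pos += length
    return True
```

For example, matches_octet_patterns("héllo".encode("utf-8")) returns True, while a stream that ends after a lone lead octet such as b"\xc3" returns False.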
The presence of an initial Byte Order Mark (BOM) character in the form of
any of the following two- or four-octet sequences automatically taints the
content stream as non-well-formed UTF-8:
Two octets:
0xFEFF      UTF-16 big-endian encoding
0xFFFE      UTF-16 little-endian encoding
Four octets:
0x0000FEFF  UCS-4 big-endian encoding
0xFFFE0000  UCS-4 little-endian encoding
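A check for these foreign Byte Order Marks might look like the following Python sketch. The names FOREIGN_BOMS and foreign_bom are hypothetical and the code is not taken from JHOVE. The four-octet marks are tested first because the UCS-4 little-endian mark begins with the same two octets as the UTF-16 little-endian mark.

```python
from typing import Optional

# Byte Order Marks that indicate an encoding other than UTF-8.
# Four-octet marks come first so that 0xFFFE0000 is not reported as 0xFFFE.
FOREIGN_BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian"),
    (b"\xfe\xff", "UTF-16 big-endian"),
    (b"\xff\xfe", "UTF-16 little-endian"),
]


def foreign_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading non-UTF-8 BOM, or None."""
    for bom, name in FOREIGN_BOMS:
        if data.startswith(bom):
            return name
    return None
```

A UTF-16 little-endian stream whose first character after the mark is U+0000 would be reported here as UCS-4 little-endian. In both cases the stream is not well-formed UTF-8, so the distinction does not change the outcome.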
The following criteria must be met by a UTF-8 encoded file for JHOVE to consider it valid:
- The UTF-8 encoded file is well-formed
The MIME type is reported as: text/plain; charset=UTF-8
In addition to the standard JHOVE representation information, the module defines the following properties:
- Line endings: CR, CRLF, or LF
- Additional control characters
- Number of characters
- Unicode 6.0.0 code blocks [Unicode Code Blocks]
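As a rough illustration of the first and third properties, the Python sketch below collects the line-ending styles found in a stream and counts its characters. The function name text_properties and the returned dictionary keys are hypothetical, and two choices are assumptions rather than documented JHOVE behavior: a leading Byte Order Mark is not counted as a character, and line-ending octets are included in the count.

```python
import re

# Map each line-ending sequence to the label used in the property list.
ENDING_NAMES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def text_properties(data: bytes) -> dict:
    """Report line-ending styles and character count for a UTF-8 stream."""
    # "utf-8-sig" drops a leading BOM if present (assumption: BOM not counted).
    text = data.decode("utf-8-sig")
    endings = set()
    # The alternation tries CRLF first so a CR LF pair is not split in two.
    for match in re.finditer(r"\r\n|\r|\n", text):
        endings.add(ENDING_NAMES[match.group()])
    return {"line_endings": sorted(endings), "characters": len(text)}
```

The function raises UnicodeDecodeError for input that Python does not accept as UTF-8, so it is meant to be applied only to streams already found to be well-formed.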