UTF8-hul Module
1 Introduction
The UTF8-hul module recognizes and validates content streams encoded with the Unicode UTF-8 encoding.
The module is invoked with the -m command-line option:
jhove ... -m UTF8-hul ...
This module can be configured with the following parameters:
- withTextMD=true, which requests the output of a textMD block in the text technical properties.
Coverage
- UTF-8 encoded content streams [Unicode]
Well-Formedness
The following criteria must be met by a UTF-8 content stream for JHOVE to consider it well-formed:
- The stream consists of an optional three-octet encoded Byte Order Mark (BOM) character, 0xEFBBBF, followed by an arbitrary number of the following one- to four-octet sequences:
    Single octet: 0xxxxxxx
    Two octets: 110yyyyy 10xxxxxx
    Three octets: 1110zzzz 10yyyyyy 10xxxxxx
    Four octets: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
- The presence of an initial Byte Order Mark (BOM) character in the form of any of the following two- or four-octet sequences automatically marks the content stream as not well-formed UTF-8:
    Two octets:
      0xFEFF UTF-16 big-endian encoding
      0xFFFE UTF-16 little-endian encoding
    Four octets:
      0x0000FEFF UCS-4 big-endian encoding
      0xFFFE0000 UCS-4 little-endian encoding
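The two criteria above can be sketched in Python as follows. This is a minimal illustration, not JHOVE's implementation: the function name is_well_formed_utf8 and the constants are hypothetical, and the check covers only the BOM rules and octet bit patterns listed here. It does not reject overlong encodings or surrogate code points, which a strict UTF-8 decoder would.

```python
# Illustrative sketch only; names are hypothetical and not part of JHOVE.

UTF8_BOM = b"\xef\xbb\xbf"

# BOMs of other encodings that rule out well-formed UTF-8.
FOREIGN_BOMS = (
    b"\x00\x00\xfe\xff",  # UCS-4 big-endian
    b"\xff\xfe\x00\x00",  # UCS-4 little-endian
    b"\xfe\xff",          # UTF-16 big-endian
    b"\xff\xfe",          # UTF-16 little-endian
)


def is_well_formed_utf8(data: bytes) -> bool:
    """Apply the BOM and octet-pattern rules to a byte string."""
    if data.startswith(FOREIGN_BOMS):
        return False

    # Skip the optional UTF-8 BOM.
    i = len(UTF8_BOM) if data.startswith(UTF8_BOM) else 0
    n = len(data)

    while i < n:
        lead = data[i]
        if (lead & 0x80) == 0x00:      # 0xxxxxxx
            trailing = 0
        elif (lead & 0xE0) == 0xC0:    # 110yyyyy
            trailing = 1
        elif (lead & 0xF0) == 0xE0:    # 1110zzzz
            trailing = 2
        elif (lead & 0xF8) == 0xF0:    # 11110uuu
            trailing = 3
        else:
            return False               # stray continuation or invalid lead octet

        end = i + 1 + trailing
        if end > n:
            return False               # sequence is truncated
        # Every trailing octet must match 10xxxxxx.
        if any((b & 0xC0) != 0x80 for b in data[i + 1:end]):
            return False
        i = end

    return True
```

For example, is_well_formed_utf8(b"\xef\xbb\xbfabc") accepts a stream with a UTF-8 BOM, while is_well_formed_utf8(b"\xfe\xff\x00a") rejects a stream that opens with the UTF-16 big-endian BOM.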
Validity
The following criteria must be met by a UTF-8 encoded file for JHOVE to consider it valid:
- The UTF-8 encoded file is well-formed
Representation Information
The MIME type is reported as: text/plain; charset=UTF-8
In addition to the standard JHOVE representation information, the module defines the following properties:
- Property "UTF8Metadata" of type PROPERTY and arity LIST
- Property "Characters" of type LONG and arity SCALAR containing the number of characters
- Property "UnicodeBlocks" of type STRING and arity LIST containing Unicode 6.0.0 code blocks [Unicode Code Blocks]
- Property "LineEndings" of type STRING and arity LIST containing: CR, CRLF, or LF
- If withTextMD=true is set, Property "TextMDMetadata" of type TextMDMetadata and arity SCALAR
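To make the "Characters" and "LineEndings" properties concrete, the following Python sketch derives comparable values from a byte string. It is an illustration under stated assumptions, not JHOVE's code: the function name summarize_text is hypothetical, characters are counted as Unicode code points, and a leading UTF-8 BOM is excluded from the count (whether JHOVE counts it is not stated here).

```python
# Illustrative sketch only; summarize_text is a hypothetical helper,
# not part of JHOVE.

def summarize_text(data: bytes) -> dict:
    """Return a code-point count and the line-ending styles found."""
    # "utf-8-sig" drops a leading UTF-8 BOM (an assumption of this sketch)
    # and raises UnicodeDecodeError on malformed input.
    text = data.decode("utf-8-sig")

    endings = []  # distinct styles, in order of first appearance
    i = 0
    while i < len(text):
        kind = None
        if text[i] == "\r":
            if i + 1 < len(text) and text[i + 1] == "\n":
                kind = "CRLF"
                i += 1        # consume the LF that belongs to this CRLF
            else:
                kind = "CR"
        elif text[i] == "\n":
            kind = "LF"
        if kind is not None and kind not in endings:
            endings.append(kind)
        i += 1

    return {"Characters": len(text), "LineEndings": endings}
```

In this sketch a CRLF pair counts as two characters, since each is a separate code point.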