Tokenizer (JHOVE - JSTOR/Harvard Object Validation Environment 1.16.0 API)

java.lang.Object
- edu.harvard.hul.ois.jhove.module.pdf.Tokenizer

Direct Known Subclasses:

FileTokenizer, StreamTokenizer
```
public abstract class Tokenizer
extends Object
```
Tokenizer for PDF files. This is used in conjunction with the Parser, which assembles Tokens into higher-level constructs.

Field Summary

Fields
Modifier and Type	Field and Description
`protected int`	`_ch` Character code of current character.
`protected RandomAccessFile`	`_file` Source from which to read bytes.
`static char[]`	`PDFDOCENCODING` Mapping between PDFDocEncoding and Unicode code points.

Constructor Summary

Constructors
Constructor and Description
`Tokenizer()` Constructor.

Method Summary

Modifier and Type	Method and Description
`void`	`addLanguageCode(String langCode)` Add a string to the set of language codes.
`abstract void`	`backupChar()` Back up a byte so it will be read again.
`Set<String>`	`getLanguageCodes()` Return the set of language codes.
`Token`	`getNext()` Parses out and returns a token from the input file.
`Token`	`getNext(long max)` Parses out and returns a token from the input file.
`long`	`getOffset()` Return the current offset into the file.
`boolean`	`getPDFACompliant()` Returns the value of the pdfACompliant flag, which indicates that the tokenizer hasn't detected non-compliance.
`String`	`getWSString()` Returns the value of the last white space string read by the tokenizer.
`protected abstract void`	`initStream(Stream token)` Initialization code for Stream object.
`abstract int`	`readChar()` Get a character from the file or stream, using a buffer.
`int`	`readChar1(boolean utf16)` Read a character in one-byte or two-byte format, as requested.
`void`	`scanMode(boolean flag)` If true, do not attempt to parse non-whitespace delimited tokens, e.g., literal and hexadecimal strings.
`abstract void`	`seek(long offset)` Set the Tokenizer to a new position in the file.
`protected void`	`seekReset(long offset)` Reset after a seek.
`void`	`setEncrypted(boolean encrypted)` Tell this object that the file is or isn't encrypted.
`void`	`setPDFACompliant(boolean pdfACompliant)` Set the value of the pdfACompliant flag.
`protected abstract void`	`setStreamOffset(Stream token)` Sets the offset of a Stream to the current file position.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PDFDOCENCODING
```
public static char[] PDFDOCENCODING
```
    Mapping between PDFDocEncoding and Unicode code points.
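
    A table of this kind is normally indexed by the unsigned byte value. The sketch below only illustrates that lookup pattern and is not JHOVE's array: the class `PdfDocEncodingSketch`, its `TABLE` and its `decode` helper are hypothetical, only two entries are filled in (0x80 as the bullet and 0xA0 as the euro sign, as listed in the PDF specification's PDFDocEncoding table), and every other byte is mapped to itself as a simplification.

```java
final class PdfDocEncodingSketch {
    // Hypothetical, partial table: index = byte value, element = Unicode code unit.
    static final char[] TABLE = new char[256];

    static {
        for (int i = 0; i < 256; i++) {
            TABLE[i] = (char) i;      // simplification: identity by default
        }
        TABLE[0x80] = '\u2022';       // bullet
        TABLE[0xA0] = '\u20AC';       // euro sign
    }

    /** Decodes bytes through the table, one byte per output character. */
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(TABLE[b & 0xFF]); // mask so the index is unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] {'A', (byte) 0x80, 'B'}));
    }
}
```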
  - _file
```
protected RandomAccessFile _file
```
    Source from which to read bytes.
  - _ch
```
protected int _ch
```
    Character code of current character.
- Constructor Detail
  - Tokenizer
```
public Tokenizer()
```
    Constructor.
- Method Detail
  - getNext
```
public Token getNext()
              throws IOException,
                     PdfException
```
    Parses out and returns a token from the input file. If it hits the end of the file, returns null. Other parsing problems cause an exception to be thrown. When an exception is thrown, the state is changed to WHITESPACE, so the parser can get back in sync more easily.
    
    Throws:
    
    IOException
    
    PdfException
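
    The contract described above (null at end of input, an exception for a bad token, and a state reset so the next call can continue) suggests a caller loop of the following shape. This is a self-contained sketch, not JHOVE code: `MiniTokenizer` and `BadTokenException` are hypothetical stand-ins for `Tokenizer` and `PdfException`, tokens are plain strings rather than `Token` objects, and a token containing `?` is arbitrarily treated as malformed.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    /** Hypothetical stand-in for PdfException. */
    static class BadTokenException extends Exception {
        BadTokenException(String msg) { super(msg); }
    }

    /** Hypothetical whitespace-delimited tokenizer over an in-memory string. */
    static class MiniTokenizer {
        private final String src;
        private int pos = 0;

        MiniTokenizer(String src) { this.src = src; }

        /** Returns the next token, or null at end of input. */
        String getNext() throws BadTokenException {
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
                pos++;
            }
            if (pos >= src.length()) {
                return null;
            }
            int start = pos;
            while (pos < src.length() && !Character.isWhitespace(src.charAt(pos))) {
                pos++;
            }
            String tok = src.substring(start, pos);
            if (tok.indexOf('?') >= 0) {
                // pos already rests on whitespace, so the next call resumes cleanly.
                throw new BadTokenException("bad token at offset " + start);
            }
            return tok;
        }
    }

    /** Reads to end of input, recording an error marker and continuing on failure. */
    static List<String> readAll(MiniTokenizer t) {
        List<String> out = new ArrayList<>();
        while (true) {
            try {
                String tok = t.getNext();
                if (tok == null) {
                    break;              // end of input
                }
                out.add(tok);
            } catch (BadTokenException e) {
                out.add("<error>");     // tokenizer is back in sync; keep going
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [1, 0, obj, <error>, endobj]
        System.out.println(readAll(new MiniTokenizer("1 0 obj b?d endobj")));
    }
}
```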
  - getNext
```
public Token getNext(long max)
              throws IOException,
                     PdfException
```
    Parses out and returns a token from the input file. If it hits the end of the file, returns null. Other parsing problems cause an exception to be thrown. When an exception is thrown, the state is changed to WHITESPACE, so the parser can get back in sync more easily.
    
    Parameters:
    
    max - Maximum allowable size of the token
    
    Throws:
    
    IOException
    
    PdfException
  - getOffset
```
public long getOffset()
```
    Return the current offset into the file.
  - getLanguageCodes
```
public Set<String> getLanguageCodes()
```
    Return the set of language codes. Members of the set are Strings.
  - setEncrypted
```
public void setEncrypted(boolean encrypted)
```
    Tell this object that the file is or isn't encrypted.
  - getPDFACompliant
```
public boolean getPDFACompliant()
```
    Returns the value of the pdfACompliant flag, which indicates that the tokenizer hasn't detected non-compliance. A value of true is no guarantee that the file is compliant.
  - setPDFACompliant
```
public void setPDFACompliant(boolean pdfACompliant)
```
    Set the value of the pdfACompliant flag. This may be used to clear previous detection of noncompliance.
  - getWSString
```
public String getWSString()
```
    Returns the value of the last white space string read by the tokenizer. Repositioning clears the white space string.
  - seek
```
public abstract void seek(long offset)
                   throws IOException,
                          PdfException
```
    Set the Tokenizer to a new position in the file.
    
    Parameters:
    
    offset - The offset in bytes from the start of the file.
    
    Throws:
    
    IOException
    
    PdfException
  - seekReset
```
protected void seekReset(long offset)
```
    Reset after a seek.
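
    One plausible reading of how `seek`, `seekReset`, `getOffset` and a buffered `readChar` fit together over a `RandomAccessFile` is sketched below. It is an assumption-laden illustration rather than the JHOVE implementation: the class `SeekSketch`, its 8-byte buffer and the `demo` helper are hypothetical, and the only behaviour shown is that repositioning discards buffered bytes and updates the logical offset.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SeekSketch {
    private final RandomAccessFile file;
    private final byte[] buf = new byte[8];
    private int bufLen = 0;   // number of valid bytes in buf
    private int bufPos = 0;   // index of the next unread byte in buf
    private long offset = 0;  // logical offset of the next byte to be returned

    SeekSketch(RandomAccessFile file) { this.file = file; }

    void seek(long newOffset) throws IOException {
        file.seek(newOffset);
        seekReset(newOffset);
    }

    /** Forgets buffered bytes so the next read comes from the new position. */
    private void seekReset(long newOffset) {
        offset = newOffset;
        bufLen = 0;
        bufPos = 0;
    }

    /** Returns the next byte as 0..255, or -1 at end of file. */
    int readChar() throws IOException {
        if (bufPos >= bufLen) {
            bufLen = file.read(buf);
            bufPos = 0;
            if (bufLen <= 0) {
                bufLen = 0;
                return -1;
            }
        }
        offset++;
        return buf[bufPos++] & 0xFF;
    }

    long getOffset() { return offset; }

    /** Returns {first byte, byte read after seek(9), offset afterwards}. */
    static int[] demo() {
        try {
            File tmp = File.createTempFile("seek-sketch", ".bin");
            tmp.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                raf.write("%PDF-1.4 xref".getBytes(StandardCharsets.US_ASCII));
                SeekSketch t = new SeekSketch(raf);
                t.seek(0);
                int first = t.readChar();      // fills the buffer from offset 0
                t.seek(9);                     // buffered bytes are discarded
                int afterSeek = t.readChar();  // byte at offset 9
                return new int[] {first, afterSeek, (int) t.getOffset()};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```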
  - readChar
```
public abstract int readChar()
                      throws IOException
```
    Get a character from the file or stream, using a buffer.
    
    Throws:
    
    IOException
  - readChar1
```
public int readChar1(boolean utf16)
              throws IOException
```
    Read a character in one-byte or two-byte format, as requested.
    
    Throws:
    
    IOException
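
    The one-byte versus two-byte choice can be pictured as follows. This is only a sketch under stated assumptions: the class `TwoByteReadSketch` and its `decodeFirst` helper are hypothetical, the two-byte form is assumed to be big-endian (high byte first, as in UTF-16BE), and returning -1 at end of input is an assumption rather than documented behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class TwoByteReadSketch {
    /** Reads one byte, or two bytes combined high-byte-first when utf16 is true. */
    static int readChar1(InputStream in, boolean utf16) throws IOException {
        int hi = in.read();
        if (!utf16 || hi < 0) {
            return hi;              // single byte, or end of input
        }
        int lo = in.read();
        if (lo < 0) {
            return -1;              // truncated two-byte unit (assumed handling)
        }
        return (hi << 8) | lo;      // assumed big-endian combination
    }

    /** Convenience wrapper for in-memory data. */
    static int decodeFirst(byte[] data, boolean utf16) {
        try {
            return readChar1(new ByteArrayInputStream(data), utf16);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] euro = {0x20, (byte) 0xAC};
        System.out.println(Integer.toHexString(decodeFirst(euro, true)));
        System.out.println(Integer.toHexString(decodeFirst(euro, false)));
    }
}
```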
  - backupChar
```
public abstract void backupChar()
```
    Back up a byte so it will be read again.
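
    Backing up one byte is the usual way to stop at a delimiter without consuming it. The sketch below shows that pattern with a hypothetical in-memory class, `PushbackSketch`; its `readChar`, `backupChar` and `readInt` methods are illustrative and say nothing about how the JHOVE subclasses implement the abstract methods.

```java
import java.nio.charset.StandardCharsets;

final class PushbackSketch {
    private final byte[] data;
    private int pos = 0;

    PushbackSketch(byte[] data) { this.data = data; }

    /** Returns the next byte as 0..255, or -1 at end of input. */
    int readChar() {
        return pos < data.length ? data[pos++] & 0xFF : -1;
    }

    /** Steps back one byte so it will be read again. */
    void backupChar() {
        if (pos > 0) {
            pos--;
        }
    }

    /** Reads a run of decimal digits and leaves the following byte unread. */
    int readInt() {
        int value = 0;
        int ch;
        while ((ch = readChar()) >= '0' && ch <= '9') {
            value = value * 10 + (ch - '0');
        }
        if (ch >= 0) {
            backupChar();   // un-read the delimiter; nothing to un-read at end of input
        }
        return value;
    }

    public static void main(String[] args) {
        PushbackSketch t = new PushbackSketch("42/Name".getBytes(StandardCharsets.US_ASCII));
        System.out.println(t.readInt());
        System.out.println((char) t.readChar());
    }
}
```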
  - addLanguageCode
```
public void addLanguageCode(String langCode)
```
    Add a string to the set of language codes.
  - scanMode
```
public void scanMode(boolean flag)
```
    If true, do not attempt to parse non-whitespace delimited tokens, e.g., literal and hexadecimal strings.
    
    Parameters:
    
    flag - Scan mode flag
  - initStream
```
protected abstract void initStream(Stream token)
                            throws IOException,
                                   PdfException
```
    Initialization code for Stream object. This is meaningful only for the FileTokenizer subclass.
    
    Throws:
    
    IOException
    
    PdfException
  - setStreamOffset
```
protected abstract void setStreamOffset(Stream token)
                                 throws IOException,
                                        PdfException
```
    Sets the offset of a Stream to the current file position. Only the file-based tokenizer can do this, which is why the method is abstract here and overridden by the subclasses.
    
    Throws:
    
    IOException
    
    PdfException
