[Contents] [Previous] [Next] [Last]


Chapter 3
Reading and Writing HTML Tokens

This chapter describes using the Composer Plug-in API token interface to read and write to an HTML document.

A Composer Plug-in can process a document as a Unicode character stream or as a set of HTML Token objects. The approach you use depends upon the type of editing you want the plug-in to do. In most cases, tokens are more useful because they contain more semantic information than character streams and thus simplify the developer's job.

An HTML token is a sequence of characters that represents a single concept at the lexical level. There are five token types, each represented by an IO package class derived from the Token class. The examples in this chapter work with the Tag and Text token types.

To break up a character stream into tokens, you use the LexicalStream class.

Two of the samples included in the Composer Plug-in Kit, Colorize and DocInfo, process tokens. DocInfo finds images, so it looks for HTML image Tag tokens. If you would like to construct a simple example that uses tokens to read and write HTML tags, see the RedLetter tutorial.

The token interface is a tool that allows you to deal effectively with selected text. For methods that find the beginning and end of a selection, see "Selecting Text."

[Top]


Reading HTML Tokens


The LexicalStream class reads text and breaks it up into tokens. To create a lexical stream from a Unicode string or from a Reader object, use one of the two forms of the LexicalStream constructor (the LexicalStream.LexicalStream method). The form you choose is determined by whether you want it to take a String or a Reader object as its parameter.

In the first form, the in parameter is a String object; in the second form, it is a Reader object.
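The declarations below are a sketch of these two forms, reconstructed from the description above; check the IO package reference for the exact signatures. In both, the parameter is named in.

    /* Tokenize a Unicode string. */
    public LexicalStream(String in)

    /* Tokenize characters read from a Reader. */
    public LexicalStream(Reader in)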

This example shows how the Document.getInput method, the SelectedHTMLReader class, and the LexicalStream class work together to get and tokenize the document text.

       /* out is the PrintWriter created from document.getOutput()
          (see "Writing HTML Tokens" below). */
       LexicalStream in = new LexicalStream(
             new SelectedHTMLReader(document.getInput(), out));
       hue = 0;    /* field used by the Colorize sample, from which this excerpt comes */
       for(;;){
               ...
       }

In the Composer Plug-in API, tokenizing is the process of repeatedly calling LexicalStream.next to return tokens. This method returns the next token in an HTML input stream or null if the input stream has run out of tokens.

The following example uses next to get each token of the document. It keeps reading tokens until next returns null, that is, until the stream runs out of tokens, and then exits the loop.
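A minimal version of this loop, excerpted from the Colorize sample in "Sample Code," looks like this; the token-processing work goes where the last comment indicates.

        for(;;){

            /* Get the next token of the document. */
            Token token = in.next();

            /* Null means you've reached the end of the document. */
            if ( token == null ) break;

            /* ... examine and output the token here ... */
        }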

See the Sample Code for an extended example.

[Top]


Writing HTML Tokens


To write the text it has changed to the document page, a Composer Plug-in must:

1. Get the output stream with the Document.getOutput method.
2. Output the tokens to that stream.
3. Close the stream with the close method.

To write information into a document, use the Document.getOutput method. You should use this method if you plan to tokenize the raw HTML code of the document. If you want to perform string-based operations on the raw HTML code of the document, use Document.setText.
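If you choose the string-based approach, a minimal sketch might look like the following. It assumes that Document.setText takes the complete new HTML text of the document as a String; check the Document class reference for the exact signature.

        /* String-based alternative (sketch): replace the document's raw HTML in one call. */
        /* Assumes setText takes the complete new document text as a String. */
        document.setText("<HTML><BODY><P>Hello from my plug-in!</P></BODY></HTML>");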

This example gets the output stream to hold the new document text:
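This line comes from the Colorize sample in "Sample Code"; it wraps the stream returned by document.getOutput in a PrintWriter so that tokens can be printed to it.

        /* Get the output stream to hold the new document text. */
        PrintWriter out = new PrintWriter(document.getOutput());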

The second step is to output the tokens.
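In the Colorize sample, each token that should appear unchanged in the new document is printed to the output stream.

            /* Output the token. */
            out.print(token);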

The final step is to use the close method to close the stream. Without this step, the output is not written to the document.
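Again from the Colorize sample:

        /* Close output stream so the new text is written to the document. */
        out.close();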

See the Sample Code for an extended example.

[Top]


Sample Code


This perform method demonstrates how to read and write HTML tokens. It comes from the Colorize sample, which is included in the Composer Plug-in Kit. The method tokenizes the document text and checks whether each token is text or an HTML tag (a Tag object). If a token is text, it applies a color pattern with the sample's colorize method rather than outputting the original text. If a token is an HTML tag, it outputs the tag unchanged unless it is a FONT tag with a COLOR attribute, in which case it removes the COLOR attribute first.

    public boolean perform(Document document) throws IOException{

        /* Get the output stream to hold the new document text. */
        PrintWriter out = new PrintWriter(document.getOutput());

        /* Create a lexical stream to tokenize the old document text. */
        LexicalStream in = new LexicalStream(
                 new SelectedHTMLReader(document.getInput(), out));
        hue = 0;

        /* Tokenize the text. */
        for(;;){

            /* Get the next token of the document. */
            Token token = in.next();

            /* Null means you've reached the end of the document. */
            if ( token == null ) break;

            /* See if the token is text. */
            else if ( token instanceof Text ) {
                Text text = (Text) token;

                /* If text, change the color. */
                colorize(text.getText(), out);

                /* Do not output the original token. */
                continue;

            }
            /* See if the token is an HTML tag. */
            else if ( token instanceof Tag ) {
                Tag tag = (Tag) token;

                /* See if the tag is a FONT tag with COLOR attribute. */
                if ( tag.getName().equals("FONT")
                    && tag.containsAttribute("COLOR") ){

                    /* Strip out the color tag. */
                    tag.removeAttribute("COLOR");
                }
            }

            /* Output the token. */
            out.print(token);
        }

        /* Close output stream. */
        out.close();
        return true;
    }

For the complete text of the Colorize sample, see Colorize.java in the source directory of the Composer Plug-in Kit. For a simple example that uses tokens to read and write HTML tags, see the RedLetter tutorial.

[Top] [Contents] [Previous] [Next] [Last]



Copyright © 1997 Netscape Communications Corporation