[Contents] [Previous] [Next] [Last]


Chapter 3
Reading and Writing HTML Tokens

This chapter describes using the Composer Plug-in API token interface to read and write to an HTML document.

A Composer Plug-in can process a document as a Unicode character stream or as a set of HTML Token objects. The approach you use depends upon the type of editing you want the plug-in to do. In most cases, tokens are more useful because they contain more semantic information than character streams and thus simplify the developer's job.

An HTML token is a sequence of characters that represents a single concept at the lexical level. There are five token types, each represented by an IO package class derived from the Token class. The examples in this chapter work with the Tag and Text token types.

To break up a character stream into tokens, you use the LexicalStream class.

Two of the samples included in the Composer Plug-in Kit, Colorize and DocInfo, process tokens. DocInfo finds images, so it looks for HTML image Tag tokens. If you would like to construct a simple example that uses tokens to read and write HTML tags, see the RedLetter tutorial.

The token interface is a tool that allows you to deal effectively with selected text. For methods that find the beginning and end of a selection, see "Selecting Text."

[Top]


Reading HTML Tokens


The LexicalStream class reads text and breaks it up into tokens. To create a lexical stream from a Unicode string or from a Reader object, use one of the two forms of the LexicalStream constructor (the LexicalStream.LexicalStream method). The form you choose is determined by whether you want it to take a String or a Reader object as its parameter.

In the first form, the in parameter is a String object; in the second form, it is a Reader object.
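The declarations below are a sketch of these two forms, reconstructed from the description above; check the IO package reference for the exact signatures. In both, the parameter is named in.

    /* Tokenize a Unicode string. */
    public LexicalStream(String in)

    /* Tokenize characters read from a Reader. */
    public LexicalStream(Reader in)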

This example shows how the Document.getInput method, the SelectedHTMLReader class, and the LexicalStream class work together to get and tokenize the document text.

       /* out is the PrintWriter created from document.getOutput()
          (see "Writing HTML Tokens" below). */
       LexicalStream in = new LexicalStream(
             new SelectedHTMLReader(document.getInput(), out));
       hue = 0;    /* field used by the Colorize sample, from which this excerpt comes */
       for(;;){
               ...
       }

In the Composer Plug-in API, tokenizing is the process of repeatedly calling LexicalStream.next to return tokens. This method returns the next token in an HTML input stream or null if the input stream has run out of tokens.

The following example uses next to get each token of the document. It keeps reading tokens until next returns null, that is, until the stream runs out of tokens, and then exits the loop.
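A minimal version of this loop, excerpted from the Colorize sample in "Sample Code," looks like this; the token-processing work goes where the last comment indicates.

        for(;;){

            /* Get the next token of the document. */
            Token token = in.next();

            /* Null means you've reached the end of the document. */
            if ( token == null ) break;

            /* ... examine and output the token here ... */
        }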

See the Sample Code for an extended example.

[Top]


Writing HTML Tokens


To write the text it has changed to the document page, a Composer Plug-in must:

1. Get the output stream with the Document.getOutput method.
2. Output the tokens to that stream.
3. Close the stream with the close method.

To write information into a document, use the Document.getOutput method. You should use this method if you plan to tokenize the raw HTML code of the document. If you want to perform string-based operations on the raw HTML code of the document, use Document.setText.
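If you choose the string-based approach, a minimal sketch might look like the following. It assumes that Document.setText takes the complete new HTML text of the document as a String; check the Document class reference for the exact signature.

        /* String-based alternative (sketch): replace the document's raw HTML in one call. */
        /* Assumes setText takes the complete new document text as a String. */
        document.setText("<HTML><BODY><P>Hello from my plug-in!</P></BODY></HTML>");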

This example gets the output stream to hold the new document text:
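This line comes from the Colorize sample in "Sample Code"; it wraps the stream returned by document.getOutput in a PrintWriter so that tokens can be printed to it.

        /* Get the output stream to hold the new document text. */
        PrintWriter out = new PrintWriter(document.getOutput());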

The second step is to output the tokens.
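In the Colorize sample, each token that should appear unchanged in the new document is printed to the output stream.

            /* Output the token. */
            out.print(token);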

The final step is to use the close method to close the stream. Without this step, the output is not written to the document.
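Again from the Colorize sample:

        /* Close output stream so the new text is written to the document. */
        out.close();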

See the Sample Code for an extended example.

[Top]


Sample Code


This perform method demonstrates how to read and write HTML tokens. It comes from the Colorize sample, which is included in the Composer Plug-in Kit. The method tokenizes the document text and checks whether each token is text or an HTML tag (a Tag object). If a token is text, it applies a color pattern with the sample's colorize method rather than outputting the original text. If a token is an HTML tag, it outputs the tag unchanged unless it is a FONT tag with a COLOR attribute, in which case it removes the COLOR attribute first.

    public boolean perform(Document document) throws IOException{

        /* Get the output stream to hold the new document text. */
        PrintWriter out = new PrintWriter(document.getOutput());

        /* Create a lexical stream to tokenize the old document text. */
        LexicalStream in = new LexicalStream(
                 new SelectedHTMLReader(document.getInput(), out));
        hue = 0;

        /* Tokenize the text. */
        for(;;){

            /* Get the next token of the document. */
            Token token = in.next();

            /* Null means you've reached the end of the document. */
            if ( token == null ) break;

            /* See if the token is text. */
            else if ( token instanceof Text ) {
                Text text = (Text) token;

                /* If text, change the color. */
                colorize(text.getText(), out);

                /* Do not output the original token. */
                continue;

            }
            /* See if the token is an HTML tag. */
            else if ( token instanceof Tag ) {
                Tag tag = (Tag) token;

                /* See if the tag is a FONT tag with COLOR attribute. */
                if ( tag.getName().equals("FONT")
                    && tag.containsAttribute("COLOR") ){

                    /* Strip out the color tag. */
                    tag.removeAttribute("COLOR");
                }
            }

            /* Output the token. */
            out.print(token);
        }

        /* Close output stream. */
        out.close();
        return true;
    }

For the complete text of the Colorize sample, see Colorize.java in the source directory of the Composer Plug-in Kit. For a simple example that uses tokens to read and write HTML tags, see the RedLetter tutorial.

[Top] [Contents] [Previous] [Next] [Last]



Copyright © 1997 Netscape Communications Corporation