org.apache.lucene.ant

Class HtmlDocument

public class HtmlDocument extends Object

The HtmlDocument class creates a Lucene {@link org.apache.lucene.document.Document} from an HTML document.

It does this by using the JTidy package. It can take input from a {@link java.io.File} or an {@link java.io.InputStream}.

Author: Erik Hatcher
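
The idea in the class description (read HTML from a stream, then expose its title and body text) can be illustrated with a small standard-library sketch. This is not the real HtmlDocument and does not use JTidy or Lucene: the class name HtmlTextSketch and the helper fromString are hypothetical, and regular expressions stand in for a real HTML parser, so it only copes with simple, well-formed input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only. The real HtmlDocument parses
// with JTidy; this sketch swaps in regular expressions to stay dependency-free.
class HtmlTextSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private final String title;
    private final String body;

    // Mirrors the shape of the InputStream constructor: read everything, then parse.
    HtmlTextSketch(InputStream is) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = is.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Assumption: the input is UTF-8 encoded.
        String html = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        title = firstGroup(TITLE, html).trim();
        // Replace tags with spaces, then collapse runs of whitespace.
        body = TAG.matcher(firstGroup(BODY, html)).replaceAll(" ")
                .replaceAll("\\s+", " ").trim();
    }

    // Hypothetical convenience factory for in-memory HTML.
    static HtmlTextSketch fromString(String html) {
        try {
            return new HtmlTextSketch(
                    new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the first capture group of the first match, or "" if none.
    private static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    String getTitle() { return title; }

    String getBody() { return body; }

    public static void main(String[] args) {
        HtmlTextSketch doc = fromString(
                "<html><head><title>Hello</title></head>"
                + "<body><p>One</p> <p>two</p></body></html>");
        System.out.println("Title = " + doc.getTitle());
        System.out.println("Body  = " + doc.getBody());
    }
}
```

A JTidy-based parser, as used by the real class, tolerates malformed markup that this regex approach would mishandle; the sketch is only meant to show the title/body split that getTitle() and getBody() expose.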

Constructor Summary
HtmlDocument(File file)
Constructs an HtmlDocument from a {@link java.io.File}.
HtmlDocument(InputStream is)
Constructs an HtmlDocument from an {@link java.io.InputStream}.
Method Summary
static Document Document(File file)
Creates a Lucene Document from a {@link java.io.File}.
String getBody()
Gets the bodyText attribute of the HtmlDocument object.
static Document getDocument(InputStream is)
Creates a Lucene Document from an {@link java.io.InputStream}.
String getTitle()
Gets the title attribute of the HtmlDocument object.
static void main(String[] args)
Runs HtmlDocument on the files specified on the command line.

Constructor Detail

HtmlDocument

public HtmlDocument(File file)
Constructs an HtmlDocument from a {@link java.io.File}.

Parameters: file - the File containing the HTML to parse

Throws: IOException - if an I/O exception occurs

Since:

HtmlDocument

public HtmlDocument(InputStream is)
Constructs an HtmlDocument from an {@link java.io.InputStream}.

Parameters: is - the InputStream containing the HTML

Throws: IOException - if an I/O exception occurs

Since:

Method Detail

Document

public static Document Document(File file)
Creates a Lucene Document from a {@link java.io.File}.

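Parameters: file - the File containing the HTML to parse

Returns: a Lucene Document built from the parsed HTML

Throws: IOException - if an I/O exception occurs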

getBody

public String getBody()
Gets the bodyText attribute of the HtmlDocument object.

Returns: the bodyText value

getDocument

public static Document getDocument(InputStream is)
Creates a Lucene Document from an {@link java.io.InputStream}.

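Parameters: is - the InputStream containing the HTML

Returns: a Lucene Document built from the parsed HTML

Throws: IOException - if an I/O exception occurs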

getTitle

public String getTitle()
Gets the title attribute of the HtmlDocument object.

Returns: the title value

main

public static void main(String[] args)
Runs HtmlDocument on the files specified on the command line.

Parameters: args - the command-line arguments (the files to process)

Throws: Exception - if an error occurs while processing the files

Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.