org.apache.lucene.analysis.cjk

Class CJKTokenizer

public final class CJKTokenizer extends Tokenizer

CJKTokenizer was adapted from StopTokenizer, which does a decent job for most European languages. It tokenizes double-byte characters differently: it returns a token for every two adjacent characters, with consecutive tokens overlapping by one character.
Example: "java C1C2C3C4" is segmented into "java" "C1C2" "C2C3" "C3C4". Zero-length tokens ("") also need to be filtered out.
Digits: a digit, '+', or '#' is tokenized as a letter.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation, please search Google.
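
The overlapping two-character segmentation described above can be illustrated with a small standalone sketch. It is not the Lucene implementation and does not use the Lucene API. The class BigramSketch and its helpers isWordChar, isWideChar, and segment are hypothetical names introduced only for this illustration. As simplifying assumptions, any letter outside the Basic Latin block is treated as a double-byte character, a lone double-byte character is emitted as a single token, and surrogate pairs are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the Lucene CJKTokenizer source.
class BigramSketch {

    // Assumption: Basic Latin letters, digits, '+' and '#' form ordinary words.
    static boolean isWordChar(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN
                && (Character.isLetterOrDigit(c) || c == '+' || c == '#');
    }

    // Assumption: any letter outside Basic Latin counts as a double-byte character.
    static boolean isWideChar(char c) {
        return Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN
                && Character.isLetter(c);
    }

    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isWordChar(c)) {
                // Collect a whole run of word characters as one token.
                int start = i;
                while (i < text.length() && isWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isWideChar(c)) {
                // Collect the run, then emit overlapping two-character tokens.
                int start = i;
                while (i < text.length() && isWideChar(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(text.substring(start, i));
                } else {
                    for (int j = start; j + 2 <= i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                // Anything else (whitespace, punctuation) separates tokens,
                // so no zero-length token is ever produced.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters yield three overlapping two-character tokens.
        System.out.println(segment("java \u4e2d\u6587\u5206\u8bcd c++ c#"));
    }
}
```

In this sketch, separators are skipped rather than emitted, which is one way to avoid the zero-length tokens mentioned above.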

Author: Che, Dong

Constructor Summary
CJKTokenizer(Reader in)
Construct a token stream processing the given input.
Method Summary
Token next()
Returns the next token in the stream, or null at EOS.

Constructor Detail

CJKTokenizer

public CJKTokenizer(Reader in)
Construct a token stream processing the given input.

Parameters: in - I/O reader

Method Detail

next

public final Token next()
Returns the next token in the stream, or null at EOS. See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details.

Returns: Token

Throws: java.io.IOException - if a read error occurs in the input stream

Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.