/// CJKTokenizer was modified from StopTokenizer which does a decent job for
/// most European languages. and it perferm other token method for double-byte
/// chars: the token will return at each two charactors with overlap match.
/// Example: "java C1C2C3C4" will be segment to: "java" "C1C2" "C2C3" "C3C4" it
/// also need filter filter zero length token ""
/// for Digit: digit, '+', '#' will token as letter
/// for more info on Asia language(Chinese Japanese Korean) text segmentation:
/// please search google
///