LUCENE-5042: Fix the n-gram tokenizers and filters so that they handle supplementary characters correctly, and add the ability to pre-tokenize the stream in the tokenizers.
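For context, a sketch of the underlying issue: supplementary characters are stored as surrogate pairs in Java's UTF-16 strings, so n-gramming by `char` can split a pair in half and emit invalid tokens. The class and method below are illustrative only (not Lucene's actual implementation); iterating by code point, as the fix requires, avoids the problem:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointNGrams {
    // Build character n-grams by Unicode code point, so supplementary
    // characters (one code point, two chars in UTF-16) are never split.
    static List<String> ngrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            // new String(int[], offset, count) re-encodes the code points,
            // keeping surrogate pairs intact.
            out.add(new String(cps, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is supplementary: it is one
        // code point but occupies two chars as the pair \uD834\uDD1E.
        String s = "a\uD834\uDD1Eb";
        // Code-point bigrams: [a𝄞, 𝄞b] — no half-pair tokens.
        System.out.println(ngrams(s, 2));
    }
}
```

A char-based loop over the same input would produce four "characters" and emit a bigram ending in a lone high surrogate, which is exactly the breakage the fix addresses.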