Clean up tokenizer code and strengthen unit tests

Clean up the tokenizer to make it clearer what is going on, and
strengthen the unit tests to better cover handling of multi-byte
Unicode characters.