Apache C++ Standard Library User's Guide

40.5 Example 2: Defining a Multibyte Character Code Conversion (JIS <-> Unicode)

Let us consider the example of a state-dependent code conversion. As mentioned previously, this type of conversion would occur between JIS, which is a state-dependent multibyte encoding for Japanese characters, and Unicode, which is a wide-character encoding. As usual, we assume that the external device uses multibyte encoding, and the internal processing uses wide-character encoding.

Here is what you must do to implement and use a state-dependent code conversion facet:

1. Define a new conversion state type if necessary.
2. Define a new character traits type if necessary, or instantiate the character traits template with the new state type.
3. Define the code conversion facet.
4. Instantiate new stream types using the new character traits type.
5. Imbue a file stream's buffer with a locale that carries the new code conversion facet.

These steps are explained in detail in the following sections.

40.5.1 Define a New Conversion State Type

While parsing or creating a multibyte sequence in a state-dependent encoding, the code conversion facet has to maintain a conversion state. By default this state is of type std::mbstate_t, the implementation-defined state type declared in <cwchar>. If this type does not suffice to keep track of the conversion state, you must provide your own conversion state type, which must satisfy the CopyConstructible requirements:

class JISstate_t { /* ... */ };
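
What such a state type contains depends on the encoding. The following is only a minimal sketch, assuming an ISO-2022-JP style encoding in which three-byte escape sequences select the active character set. The members `current()` and `select()` and the helpers `apply_escape()` and `demo_state()` are hypothetical names introduced for illustration; they are not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical conversion state for an ISO-2022-JP style encoding.
// It records which character set the most recent escape sequence selected.
class JISstate_t
{
public:
    enum charset { ascii, jis_roman, jis_x0208 };

    JISstate_t() : set_(ascii) {}           // initial conversion state: ASCII
    // The implicitly generated copy constructor makes the type CopyConstructible.

    charset current() const   { return set_; }
    void    select(charset s) { set_ = s; }

private:
    charset set_;
};

// Recognizes one three-byte escape sequence at p (n bytes available) and
// updates the state. Returns the number of bytes consumed, or 0 if p does
// not start with a complete escape sequence known to this sketch.
inline std::size_t apply_escape(JISstate_t& st, const char* p, std::size_t n)
{
    if (n < 3 || p[0] != '\x1b')
        return 0;
    if (p[1] == '(' && p[2] == 'B') { st.select(JISstate_t::ascii);     return 3; }
    if (p[1] == '(' && p[2] == 'J') { st.select(JISstate_t::jis_roman); return 3; }
    if (p[1] == '$' && (p[2] == '@' || p[2] == 'B')) {
        st.select(JISstate_t::jis_x0208);
        return 3;
    }
    return 0;
}

// Usage: the state starts out as ASCII and follows the escape sequences seen.
inline void demo_state()
{
    JISstate_t st;
    assert(st.current() == JISstate_t::ascii);
    assert(apply_escape(st, "\x1b$B", 3) == 3);       // ESC $ B selects JIS X 0208
    assert(st.current() == JISstate_t::jis_x0208);
    JISstate_t copy(st);                              // a copy carries the same state
    assert(copy.current() == JISstate_t::jis_x0208);
}
```

A real state type may need to record more, for example a partially read double-byte character, if the facet is allowed to stop in the middle of one.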

40.5.2 Define a New Character Traits Type

The conversion state type is part of the character traits. Hence, with a new conversion state type, you need a new character traits type.

Instantiating the library's character traits template with the new state type relies on a nonstandard and thus non-portable feature of the library. If you do not want to rely on it, you must define a new character traits type and redefine the necessary types:

struct JIS_char_traits: std::char_traits<wchar_t> 
{
   typedef JISstate_t                state_type;
   typedef std::fpos<state_type>     pos_type;
   typedef std::streamoff            off_type;
};
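
A compile-time check can confirm that the derived traits expose the new state type while inheriting everything else from std::char_traits<wchar_t>. This is a sketch assuming C++11 (`static_assert`, `<type_traits>`); the one-member `JISstate_t` and the function `check_position_state()` are placeholders used only for this illustration.

```cpp
#include <cassert>
#include <ios>          // std::fpos, std::streamoff
#include <string>       // std::char_traits
#include <type_traits>  // std::is_same (C++11)

// Placeholder state type, for illustration only.
struct JISstate_t
{
    int shift;
    JISstate_t() : shift(0) {}
};

struct JIS_char_traits : std::char_traits<wchar_t>
{
    typedef JISstate_t             state_type;
    typedef std::fpos<state_type>  pos_type;
    typedef std::streamoff         off_type;
};

// The redefined types refer to the new conversion state ...
static_assert(std::is_same<JIS_char_traits::state_type, JISstate_t>::value,
              "state_type must be JISstate_t");
static_assert(std::is_same<JIS_char_traits::pos_type,
                           std::fpos<JISstate_t> >::value,
              "pos_type must carry a JISstate_t");
// ... while the character and integer types are inherited unchanged.
static_assert(std::is_same<JIS_char_traits::char_type, wchar_t>::value,
              "char_type is inherited");
static_assert(std::is_same<JIS_char_traits::int_type,
                           std::char_traits<wchar_t>::int_type>::value,
              "int_type is inherited");

// A stream position of the new traits type stores a JISstate_t.
inline void check_position_state()
{
    JISstate_t st;
    st.shift = 2;
    JIS_char_traits::pos_type pos;
    pos.state(st);                        // store the conversion state
    assert(pos.state().shift == 2);       // and read it back
}
```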

40.5.3 Define the Code Conversion Facet

Just as in the first example, you must define the actual code conversion facet. The steps are the same as before: define a new class template for the code conversion type and specialize it. The code would look like this:

template <class internT, class externT, class stateT>
class UnicodeJISConversion
    : public std::codecvt<internT, externT, stateT>
{};

template <>
class UnicodeJISConversion<wchar_t, char, JISstate_t>
    : public std::codecvt<wchar_t, char, JISstate_t>
{
protected:

 virtual std::codecvt_base::result
 do_in(JISstate_t&  state,
       const char*  from,
       const char*  from_end,
       const char*& from_next,
       wchar_t*     to,
       wchar_t*     to_limit,
       wchar_t*&    to_next) const;

 virtual std::codecvt_base::result
 do_out(JISstate_t&     state,
        const wchar_t*  from,
        const wchar_t*  from_end,
        const wchar_t*& from_next,
        char*           to,
        char*           to_limit, 
        char*&          to_next) const;

 virtual bool do_always_noconv() const throw(){
      return false;
 }

 virtual int do_encoding() const throw(){ 
       return -1;
 }
};

In this case, the member function do_encoding() has to return -1, which identifies the code conversion as state-dependent. Again, the member functions in() and out() must conform to the error indication policy explained under class codecvt in the Apache C++ Standard Library Reference Guide.
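
For contrast, the values reported by a state-independent facet can be inspected on the standard identity conversion std::codecvt<char, char, std::mbstate_t>. The sketch below queries that facet in the classic locale; the typedef `identity_cvt` and the helper functions are names chosen for this illustration.

```cpp
#include <cassert>
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::use_facet, std::locale

typedef std::codecvt<char, char, std::mbstate_t> identity_cvt;

// encoding() returns -1 for a state-dependent encoding, 0 for a
// variable-width one, and n > 0 if every internal character is
// produced from exactly n external characters.
inline int classic_encoding()
{
    return std::use_facet<identity_cvt>(std::locale::classic()).encoding();
}

// always_noconv() returns true if in() and out() never convert anything.
inline bool classic_always_noconv()
{
    return std::use_facet<identity_cvt>(std::locale::classic()).always_noconv();
}

inline void check_identity_facet()
{
    assert(classic_encoding() == 1);    // one external char per internal char
    assert(classic_always_noconv());    // the identity conversion is a no-op
}
```

A state-dependent facet such as the one above reports -1 and false instead.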

The distinguishing characteristic of a state-dependent conversion is that the conversion state argument to in() and out() is used for communication between the file stream buffer and the code conversion facet. The file stream buffer is responsible for creating, maintaining, and deleting the conversion state. At the beginning, the file stream buffer creates a conversion state object that represents the initial conversion state and hands it over to the code conversion facet. The facet modifies it according to the conversion it performs. The file stream buffer receives the modified state and stores it between two subsequent code conversions.
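
The hand-over of the conversion state can be sketched with a free function that has the same parameter list as do_in(). Everything here is an assumption made for illustration: the two-state `JISstate_t`, the function names `jis_in()` and `demo_two_chunks()`, the ISO-2022-JP escape sequences ESC $ B and ESC ( B, and above all the placeholder mapping of a double-byte code to `wchar_t`, which is not a real JIS-to-Unicode table.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>   // std::codecvt_base

// Hypothetical minimal state: are we inside a double-byte section?
struct JISstate_t
{
    bool double_byte;
    JISstate_t() : double_byte(false) {}
};

// Sketch of the logic a do_in() override might contain.
inline std::codecvt_base::result
jis_in(JISstate_t& state,
       const char* from, const char* from_end, const char*& from_next,
       wchar_t* to, wchar_t* to_limit, wchar_t*& to_next)
{
    from_next = from;
    to_next   = to;
    while (from_next != from_end) {
        const std::ptrdiff_t avail = from_end - from_next;
        if (*from_next == '\x1b') {                  // escape sequence
            if (avail < 3)
                return std::codecvt_base::partial;   // wait for more input
            if (from_next[1] == '$' && from_next[2] == 'B')
                state.double_byte = true;
            else if (from_next[1] == '(' && from_next[2] == 'B')
                state.double_byte = false;
            else
                return std::codecvt_base::error;
            from_next += 3;
            continue;
        }
        if (to_next == to_limit)
            return std::codecvt_base::partial;       // output buffer full
        if (state.double_byte) {
            if (avail < 2)
                return std::codecvt_base::partial;   // half a character
            const unsigned hi = static_cast<unsigned char>(from_next[0]);
            const unsigned lo = static_cast<unsigned char>(from_next[1]);
            *to_next++ = static_cast<wchar_t>(hi * 256 + lo);  // placeholder mapping
            from_next += 2;
        } else {
            *to_next++ = static_cast<wchar_t>(static_cast<unsigned char>(*from_next));
            from_next += 1;
        }
    }
    return std::codecvt_base::ok;
}

// The caller plays the role of the file stream buffer: it creates the state
// once and passes the same object to two consecutive conversions.
inline void demo_two_chunks()
{
    JISstate_t     state;
    wchar_t        out[8];
    const char*    from_next;
    wchar_t*       to_next;

    const char chunk1[] = "A\x1b$B";                 // 'A', then ESC $ B
    std::codecvt_base::result r =
        jis_in(state, chunk1, chunk1 + 4, from_next, out, out + 8, to_next);
    assert(r == std::codecvt_base::ok && to_next - out == 1 && out[0] == L'A');
    assert(state.double_byte);                       // remembered for the next call

    const char chunk2[] = "\x30\x21\x1b(Bz";         // one double-byte code, ESC ( B, 'z'
    r = jis_in(state, chunk2, chunk2 + 6, from_next, out, out + 8, to_next);
    assert(r == std::codecvt_base::ok && to_next - out == 2);
    assert(out[0] == static_cast<wchar_t>(0x3021) && out[1] == L'z');
    assert(!state.double_byte);
}
```

The first two bytes of the second chunk are read as one double-byte character only because the state left behind by the first call says so; with a freshly constructed state the same bytes would be read as two single-byte characters.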

40.5.4 Use the New Code Conversion Facet

Here is an example of how the new code conversion facet can be used:

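typedef std::basic_fstream<wchar_t,
                            JIS_char_traits> JIS_fstream;     //1
JIS_fstream inout("/tmp/fil");
// The locale takes ownership of a facet allocated with new.
std::locale cvtloc(std::locale(),
    new UnicodeJISConversion<wchar_t,char,JISstate_t>);
inout.rdbuf()->pubimbue(cvtloc);                              //2
JIS_fstream::int_type c;
while (!JIS_char_traits::eq_int_type(
           c = inout.rdbuf()->sbumpc(),
           JIS_char_traits::eof()))                           //3
    std::wcout.put(JIS_char_traits::to_char_type(c));

`//1`	Our Unicode-JIS code conversion needs a conversion state type different from the default type `std::mbstate_t`. Since the conversion state type is contained in the character traits, we must create a new file stream type.
`//2`	Here the stream buffer's locale is replaced by a copy of the global locale that has a Unicode-JIS code conversion facet.
`//3`	The content of the JIS encoded file `"/tmp/fil"` is read, automatically converted to Unicode, and written to `std::wcout`. The characters are copied one at a time because the stream buffer's traits type, `JIS_char_traits`, differs from that of `std::wcout`, so the stream buffer inserter of `std::wcout` does not accept this buffer.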