Issues (and their resolutions) when using gettext for message translation Contents ======== * Windows issues * Automatic characterset conversion * Translations on the client * No translations on the server Windows issues ============== On Windows, Subversion is linked against a modified version of GNU gettext. This resolves several issues: - Eliminated need to link against libiconv (which would be the second iconv library, since we already link against apr-iconv) - No automatic charset conversion (guaranteed UTF-8 strings returned by gettext() calls without performance penalties) More in the paragraphs below... Automatic characterset conversion ================================= Some gettext implementations automatically convert the strings in the message catalogue to the active system characterset. The source encoding is stored in the "" message id. The message string looks somewhat like a mime header and contains a "Content-Encoding" line. It's typically GNU's gettext which does this. Subversion uses UTF-8 to encode strings internally, which may not be the systems default character encoding. To prevent internal corruption, libsvn_subr:svn_cmdline_init2() explicitly tells gettext to return UTF-8 encoded strings if it has bind_textdomain_codeset(). Some gettext implementations don't contain automatic string recoding. In order to work with both recoding and non-recoding implementations, the source strings must be UTF-8 encoded. This is achieved by requiring .po files to be UTF-8 encoded. [Note: a pre-commit hook has been installed to ensure this.] On Windows Subversion links against a version of GNU gettext, which has been modified not to do character conversions. This eliminates the requirement to link against libiconv which would mean Subversion being linked against 2 iconv libraries (apr_iconv as well as libiconv). Translations on the client ========================== The translation effort is to translate most error messages generated on the system on which the user has invoked his subversion command (svnadmin, svnlook, svndumpfilter, svnversion or svn). This means that in all layers of the libraries strings have been marked for translation, either with _(), N_() or Q_(). Parameters are sprintf-ed straight into errorstrings at the time they are added to the error structure, so most strings are marked with _() and translated directly into the language for which the client was set up. [Note: The N_() macro marks strings for delayed translation.] Translations on the server ========================== On systems which define the LC_MESSAGES constant, setlocale() can be used to set string translation for all (error) strings even those outside the Subversion domain. Windows doesn't define LC_MESSAGES. Instead GNU gettext uses the environ- ment variables LANGUAGE, LC_ALL, LC_MESSAGES and LANG (in that order) to find out what language to translate to. If none of these are defined, the system and user default locales are queried. Though setting one of the aforementioned variables before starting the server will avoid localization by Subversion to the default locale, messages generated by the system itself are likely to still be in its default locale (they are on Windows). While systems which have the LC_MESSAGES flag (or setenv() - of which Windows has neither) allow languages to be switched at run time, this cannot be done portably. Any attempt to use setlocale() in an Apache environment may conflict with settings other modules expect to be setup (even when using a prefork MPM). On the svnserve side, having no portable way to change languages dynamically means that the environment has to be set up correctly from the start. Futhermore, the svnserve protocol doesn't yet support content negotiation. In other words, there is no way -- programmatically -- to ensure that messages are served in any specific language using a traditional gettext implementation. Current consensus is that gettext must be replaced on the server side with a more flexible implementation. Server requirement(s): - Language negotiation on a per-client session basis. For a stateless protocol like HTTP, this means per-request. For a stateful protocol like the one used by svnserve, this means per-connection. - Avoid contamination of environment used by other code (e.g. other Apache modules running in the same server as mod_dav_svn). - Allow for propagation of the language to use to hook scripts. - Continue to inter-op with generic HTTP/DAV clients, and stay compatible with SVN clients of various versions (as per existing compatibility rules). I18N module requirement(s): - Cross-platform. - Interoperable with gettext tools (e.g. for .po files). - Non-viral license which allows for any necessary modifications. - gettext-like API (needn't be an exact match). Implementation guidelines: - The L10N API will be uniform across all libraries, clients, and servers. Server-negotiated language will be recorded in either a context baton (e.g. apr_pool_t.userdata), or in thread-local storage (TLS). - Implemented on top of a new gettext-like module with per-struct or per-thread locale mutator functions and storage for name/value pairs (a glorified apr_hash_t). (See implementation from Nicolás Lichtmaier noted below.) - Language chosen by the server will be negotiated based on a ranked list of preferences provided by the client. - Language used by httpd/mod_dav_svn will be derived from the Accept-Language HTTP header, and setup by mod_negotiation (when available), or by mod_dav_svn on a per-request basis. - Language used by svnserve derived from additions to the protocol which allow for HTTP-style content negotiation on a per-connection basis. The protocol extension would use the same sort of q-value list found in the Accept-Language header to specify user language preferences. Investigation: A brief canvasing of developers (on IRC) indicated that no thorough investigation of existing solutions which might meet the above requirements has been done. This incomplete canvasing may not paint an accurate picture, however. A branch has been created to explore a solution to the above requirements. While the L10N module is important, how that module is applied to both the server-side and client-side is possibly even more so; an implementation which meets the requirements should not dramatically impact the solution used across the code base for the general L10N API, nor the necessary server-side machinations. Nicolás Lichtmaier wrote something along the lines of the module referenced in the "Possible implementation" section , which has been committed to the server-l10n branch. However, it depends upon the GNU gettext .mo format, and the GNU implementation may not be available on all platforms (unless re-implemented). This module will need to be enhanced or replaced, ideally completely obviating the need for linkage against a platform's own gettext implementation. Whether to use TLS or a context baton for the L10N API is under discussion. TLS can provide a more friendly API (albeit somewhat underhanded), while use of a context baton more resilient to change (e.g. if httpd someday allowed more than one thread to service a request). Here's a sample: - No localization: "A message to localize" - Localization w/ TLS or global preference: _("A message to localize") - Localization w/ a context baton: _("A message to localize", pool) Historical note: Original consensus indicated that messages from the server side should stay untranslated for transmission to the client. However, client side localization is not an option, because by then the parameter values have been inserted into the string, meaning that it can't be looked up in the messages catalogue anymore. So any localization must occur on the server, or significantly increase the complexity of marshalling messages from the server as unlocalized/unformatted data structures and localizing them on the client side using some additional wrapper APIs to handle the unmarshalling and message formatting. Additionally, client and server versions may not match up, meaning that message keys and format string values provided by the server may not correspond to what's available on the client. Paul Querna suggested a variation on this scheme involving requesting (once) and caching the localizations (to the local disk) for each server version, along with sending the message key (for lookup of localized text) and an already formatted text (to use as the default when no localization bundle is available). In addition to the complications mentioned previously, this has the downside of crippling the localization of server-generated messages when no write access to the local disk is available to the client.