24 July 2008

"I can quit smoking whenever I want to."

"...I just don't want to."

Have you heard that one before? I heard something similar last week from a director of engineering:

"All of our strings are embedded in source code. This is deliberate, and we planned it very carefully."

How would you have reacted?

At first, I figured he was pulling my leg ("taking the mickey," "having me on," etc.). Then he explained the process of localizing strings in the GNU gettext model, which can live peacefully without external resources.

A line of code reading

result = wx.MessageDialog(None, _("Welcome to my blog. Today is %s") % date.today())

uses the _ function in the English context as an identity function. In a localized context it loads the language pack built with the GNU gettext utilities and maps the English strings to their localized equivalents:

"Welcome to my blog. Today is %s" -> "Bienvenido a mi blog. Hoy es %s"

To redeem what seems like shortsightedness in allowing developers to embed strings in code, these utilities also contain scripts that can pull out all the English strings from source code and make localization packages, which translators can work on without danger of touching the code. Other scripts can push the localized strings back into place.

Like .properties files in Java and .rc files in C++, these localization packages isolate non-code elements for easy localization. However, a programmer's coding mistake could still result in strings going undetected by the scripts, so I still plan to perform pseudo-translation and internationalization testing on this software as soon as possible.
Just in case the director of engineering can't quit smoking as easily as he thinks he can.


29 November 2007

Keeping an eye on Catalyst

In localization, "Catalyst" is a tool from Alchemy Software. Among other things, it allows you to localize UI elements within software resource files, sometimes without the need to rebuild the software manually into binary format.

Since software binaries come from text files, part of Catalyst's value lies in straddling the divide between allowing the translator to change strings in these text files (say, from English to Japanese) and displaying them in the binary, run-time format in which the user will see them on screen.

Last month a vendor returned some resource files that we had had them localize from English into Japanese. I rebuilt the binaries (language-resource DLLs) and ran them. Unfortunately, a number of items were suddenly missing from the Japanese menus, so I had to troubleshoot the problem.

My first thought was that either a person or a tool (or a person using a tool) had modified something that should not be affected by the localization process. I had handed off a resource file containing these lines:

32777 MENU DISCARDABLE
BEGIN
    POPUP "&Tools"
    BEGIN
        MENUITEM "Serial P&ort Settings...", ID_TOOLS_SERIALPORTSETTINGS
        MENUITEM "&Network Settings...", ID_TOOLS_NETWORK
        MENUITEM "&Battery Settings...", ID_TOOLS_BATTERYSETTINGS
    END
END

32779 MENU DISCARDABLE
BEGIN
    POPUP "&File"
END


They returned to me a resource file containing these strings:

9 MENU DISCARDABLE
BEGIN
    POPUP "ツール(&T)"
    BEGIN
        MENUITEM "シリアルポートの設定(&O)...", ID_TOOLS_SERIALPORTSETTINGS
        MENUITEM "ネットワーク設定(&N)...", ID_TOOLS_NETWORK
        MENUITEM "バッテリの設定(&B)...", ID_TOOLS_BATTERYSETTINGS
    END
END

11 MENU DISCARDABLE
BEGIN
    POPUP "ファイル(&F)"
END

There was nothing wrong with the translation, and the string IDs were intact. The product has long been "double-byte clean," so I knew that the software was not gagging on the Japanese characters.

The problem lay in the menu ID numbers, which are 32777 and 32779 in the English, but which came back in the Japanese files as 9 and 11. The vendor believes that Catalyst changed them, since they had used it for resizing and QA.

Normally, this renumbering has no effect on how the binary functions. In this case, however, it had a profound effect, because code somewhere in the software looks for IDs 32777 and 32779, and when it doesn't find them, it cannot complete the menu. This is poor internationalization in the code base. I have raised it with Engineering, to no avail, so I need to police the resource files in each round of localization.
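That policing is easy to automate. Here is a minimal sketch (the regex and the inline file contents are illustrative, not a general .rc parser) that compares the menu IDs declared in the English and localized resource files and flags any that were renumbered or dropped:

```python
import re

# Match lines like "32777 MENU DISCARDABLE" and capture the numeric ID
MENU_ID = re.compile(r"^\s*(\d+)\s+MENU\b", re.MULTILINE)

def menu_ids(rc_text):
    """Return the numeric IDs of all MENU resources declared in .rc text."""
    return sorted(int(m.group(1)) for m in MENU_ID.finditer(rc_text))

english_rc = """32777 MENU DISCARDABLE
BEGIN
END

32779 MENU DISCARDABLE
BEGIN
END
"""

japanese_rc = """9 MENU DISCARDABLE
BEGIN
END

11 MENU DISCARDABLE
BEGIN
END
"""

renumbered = set(menu_ids(english_rc)) - set(menu_ids(japanese_rc))
if renumbered:
    print("Menu IDs missing from localized file:", sorted(renumbered))
# Menu IDs missing from localized file: [32777, 32779]
```

Run against real handoff files, a check like this catches the renumbering before the DLLs are ever rebuilt.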

How is Catalyst working for you? Have you seen similar problems?

Interested in this topic? You might enjoy another article I've written called "Localized Binaries - The Plot Thickens"


16 March 2007

How to pseudo-translate, Part II

You only speak one language, so maybe you'll never be a translator, but you still have a chance as a pseudo-translator.

Pseudo-translation is the process of replacing or adding characters in your software strings to try to break the software, or at least to uncover strings still embedded in the code that need to be externalized for proper localization. (Part I of this post describes why anybody would want to do such a thing.) Pseudo-translation is a big piece of internationalization (I18n) testing, which you should undertake before you bother handing anything off to the translators.

Here's an example of a few strings from a C resource file, with their respective, pseudo-translations:

IDS_TITLE_OPEN_SKIN "Select Device"
IDS_TITLE_OPEN_SKIN "日本Sイlイct Dイvウcイ本日"

IDS_MY_FOLDER "Directory:"
IDS_MY_FOLDER "日本Dウrイctエrユ:本本"

IDS_MY_OPEN "&Open"
IDS_MY_OPEN "日本&Opイn日"

IDS_WINDOW_NOT_ENOUGH_MEM
"Windows has not enough memory. You may lower the heap size specified in the configuration file."
IDS_WINDOW_NOT_ENOUGH_MEM
"日本Wウndエws hアs nエt イnエオgh mイmエrユ. Yエオ mアユ lエwイr thイ hイアp sウzイ spイcウfウイd ウn thイ cエnfウgオrアtウエn fウlイ.本日本日日本本本日日本日日本日本日本日本"

IDS_TARGET_INITIALIZATION_FAILED
"Failed to load or initialize the target."
IDS_TARGET_INITIALIZATION_FAILED
"日本Fアウlイd tエ lエアd エr ウnウtウアlウzイ thイ tアrgイt.日日本日本日本本"

In these strings, Japanese characters have been pushed in to replace the vowels in the English words. The goal of using Japanese characters is to ensure that, when compiled, the strings will look and behave as they should under Japanese Windows; it's important to pseudo-translate with the right result in mind.

Some observations:
  1. Each string begins with Japanese characters, since that will be the case in the real Japanese translation, and it's a situation worth testing.
  2. Each string contains enough English characters to allow the tester to "gist" the string from the context. This is helpful because pseudo-translation can often destroy the meaning of the string.
  3. Each string swells, with trailing characters adding about 20% to its length. This helps flush out fields and controls in which strings will be truncated.
Okapi Rainbow is an excellent (if somewhat inscrutable) text-manipulation utility for just this purpose. When run on all of the string files in the development project, the result is a set of resources which, when recompiled, will run as a pseudo-translated binary. With a testbench running the appropriate operating system, a tester can get a good idea of the I18n work in store for the developers.
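For illustration, the replacement scheme in the observations above can be sketched as a small Python function. The vowel-to-katakana mapping is inferred from the sample strings; only lowercase vowels are mapped, which keeps leading capitals and accelerator keys like "&O" legible:

```python
# Lowercase vowel -> katakana mapping, inferred from the samples above
VOWEL_MAP = {"a": "ア", "e": "イ", "i": "ウ", "o": "エ", "u": "オ", "y": "ユ"}

def pseudo_translate(s, swell=0.2):
    """Swap lowercase vowels for Japanese characters, lead the string with
    Japanese characters, and pad the end by ~20% of its original length."""
    mapped = "".join(VOWEL_MAP.get(c, c) for c in s)
    padding = ("本日" * len(s))[: max(1, round(len(s) * swell))]
    return "日本" + mapped + padding

print(pseudo_translate("Select Device"))
# 日本Sイlイct Dイvウcイ本日本
```

The surviving consonants let a tester still "gist" each string, while the leading and trailing Japanese characters exercise encoding and truncation at both ends.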

Rare is the product that passes pseudo-translation testing on the first try, whether because of strings left behind in the code, resizing issues, string truncation, buffer overflows, or just plain bad luck.

Even if your code isn't perfect, though, look on the bright side: You're now a pseudo-translator.


16 December 2006

Favorite Localization Tools

Here's a short list of Windows-based tools I use a great deal in managing localization projects:

Beyond Compare
- Clients constantly drill me about the differences between the last version of their product and this version, with an eye to the order of magnitude of localization expense they're in for. Beyond Compare is the best tool I've found for finding the files that have changed, then comparing older and newer versions of files in a specialized viewer. Good technical support as well.

EmEditor - As long as you have the font and OS support installed, you can view multi-byte characters in their appropriate applications under English-language Windows, but EmEditor allows you to change the encoding of a text file to display it better, or so that you can edit it. My standard text editor is UltraEdit, which has excellent search-and-replace capability, but it's not as deft as EmEditor for multibyte work on an English OS.

SDLX Glue - An obscure utility inside the SDLX suite, this will append up to I don't know how many hundred HTML files together. Translation vendors like it for work on big sites because it slashes the number of files being slung around. Naturally, it includes an unglue utility as well.

FAR - A technical writer introduced me to this utility, which includes a compiler system for HTML Help and MS Help. It will compile CHM files in any language such that, if you have a good HTML authoring tool, you don't need RoboHelp to build your CHMs. (Unfortunately, I've had problems when I've tried to use FAR on projects that have been created in RoboHelp, but there are some ways around them.)

Moreover, FAR stands for "Find And Replace", and this is hands down the best front end for regular expressions that I've ever found. The Holy Grail of search-and-replace is ignoring line breaks, and while regex supports that, not many utilities (that I've found) implement it. For instance, in the text

In a white room

with black curtains

at the station

if your goal was to find "room with black curtains at", most utilities would not be able to locate it because of the line breaks. FAR does find it, and even allows you to replace the text with line breaks. Top-flight technical support also.
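If you're at home in a scripting language, the same cross-line-break search is a short regex, treating any run of whitespace as a single word gap. A Python sketch (the sample text is from the lyric above):

```python
import re

text = """In a white room

with black curtains

at the station"""

# \s+ matches any whitespace run, blank lines included, so the pattern
# finds the phrase even though it spans several lines
pattern = re.compile(r"room\s+with\s+black\s+curtains\s+at")
match = pattern.search(text)
print(match.group(0))
```

A replacement via `pattern.sub()` can likewise insert or remove line breaks, which is the trick FAR exposes through its front end.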

Most of these are shareware, but they're well worth the US$25-$50.

