30 October 2008

Wordcount Woes - Part 3

How about those engineers who are certain that all of the strings have been externalized from the code?

I don't know about you, but I stopped believing them a long time ago.

Pseudo-translating the code is the definitive way to show them the strings they've missed. It requires a bit of time and, frankly, some cooperation from the very engineers you're about to embarrass, but nothing else so reliably turns up strings still embedded in the code.

Many engineers also overlook the installer. There is usually a script or name-value-pair file with custom strings, and it's easy to forget to externalize strings to that file. It's also easy to specify the wrong encoding for the file, so that all of the custom strings show up corrupted in the installer. We see that a lot with InstallShield projects.
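
If you want a quick way to catch this before the installer build, here's a minimal sketch in Python (the file name and candidate encodings are hypothetical; adjust them to whatever your installer toolkit expects):

from pathlib import Path

# Try the encodings the installer might plausibly expect, in order.
CANDIDATE_ENCODINGS = ["utf-16", "utf-8", "cp1252"]

def sniff_encoding(path):
    raw = Path(path).read_bytes()
    for enc in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        # U+FFFD in the decoded text means something upstream
        # already mangled the characters.
        if "\ufffd" not in text:
            return enc
    return None

enc = sniff_encoding("CustomStrings.txt")
print(enc or "No clean decode - expect corrupted strings in the installer.")

It won't prove the encoding is what the installer wants, but it will tell you when the file can't possibly be what the installer wants.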

Mind you, I'm never out to get the engineers - I need them too dearly - but they sometimes get to believing their own stuff and thinking that internationalization (I18n) is kalter Kaffee - cold coffee, as the Germans say, or yesterday's news. It is yesterday's news, but that doesn't mean it's unimportant.

Where do you find strings that engineers overlook?

Note: I've posted less frequently of late because I'm between projects (and setting up another blog). Once L10n activity resumes with a couple of clients, I'll have more war stories again.

21 August 2008

Localizing Code Snippets - Part II

Last week I posted on the dilemma of how to localize Code Snippets, the selected pieces of your documentation that you shoehorn into XML files so that Visual Studio can present them in tool-tip-like fashion to the user while s/he is writing code against the interfaces your documentation describes.

My goal was to ensure that the process of grabbing these bits of documentation (mostly one-sentence descriptions and usage tips) was internationalized, so that we could run it on translated documentation and save money. This has proved more difficult than anticipated.

Here is the lesson: If you think it's hard to get internal support for internationalizing your company's revenue-generating products, just try to get support for internationalizing the myriad hacks, scripts, macros and shortcuts your developers use to create those products.

In this client's case, it makes more sense to translate the documentation, then re-use that translation memory on all of the Code Snippet files derived from the documentation. It will cost more money (mostly for translation engineering and QA, rather than for new translation) in the short run, but less headache and delay in the long run. Not to mention fewer battles I need to fight.

Discretion is the better part of localization valor.

14 August 2008

Localizing Code Snippets

"Why would I localize code snippets?" you ask. (Go ahead; ask.)

Everybody knows you don't translate snippets of code. Even if you found a translator brave enough to take on something like int IBACKLIGHT_GetBacklightInfo(IBacklight *p, AEEBacklightInfo * pBacklightInfo), the compiler would just laugh and spit out error messages.

However, if you're a developer (say, of Windows applications) working in an integrated development environment (say, Microsoft Visual Studio), you may want to refer very quickly to the correct syntax and description of a feature without searching for it in the reference manual. The Code Snippet enhancement to Visual Studio makes this possible with a small popup box that contains thumbnail documentation on the particular interface the developer wants to use. It's similar in concept and appearance to the "What's This?" contextual help offered by right-clicking on options in many Windows applications.

How does the thumbnail documentation get in there? It's a tortuous path, but the enhancement pulls text from XML-formatted .snippet files. You can fill the .snippet files with the information yourself, or you can populate them from your main documentation source using Perl scripts and XSL transformation. So while you're not really translating code snippets, you're translating Code Snippets.

And therein lies the problem.


One of our clients is implementing Code Snippets, but the Perl scripts and XSL transformations they're using to extract the documentation don't support Unicode. I found this out because I pseudo-translated some of the source documentation and ran the scripts on it. Much of the text didn't survive the trip to the .snippet files, so we're on a quest to find the offending portions of the scripts and suggest internationalization changes.
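
The survival test boils down to something like the following sketch (the paths, the marker character and the .snippet encoding are assumptions; the real point is to seed the source documentation with a non-ASCII marker and then grep for it on the far side of the pipeline):

from pathlib import Path

MARKER = "\u65e5"  # 日, the character our pseudo-translation pushes in

def casualties(snippet_dir):
    lost = []
    for f in Path(snippet_dir).glob("*.snippet"):
        text = f.read_text(encoding="utf-8", errors="replace")
        # The marker vanished, or was replaced with U+FFFD garbage.
        if MARKER not in text or "\ufffd" in text:
            lost.append(f.name)
    return lost

for name in casualties("output/snippets"):
    print(f"Pseudo-translation did not survive in {name}")

Any file this flags points back at a script that is reading or writing text in the wrong encoding.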

We've determined that the translated documentation in the Code Snippets will display properly in Visual Studio; the perilous part of the journey is the process of extracting the desired subset of documentation and pouring it into the .snippet files. Don't expect your developers' scripts to handle this automatically; you'll probably have to persist, politely, to have it done right.

Alternatives:
  • Wait until all of your documentation has been translated, then translate the .snippet files. It's more time-consuming and it will cost you more, but working this far downstream may be easier than getting your developers to clean up their scripts.
  • Make your Japanese developers tolerate English documentation in the Code Snippets.
Neither one is really the Jedi way. Work with your developers on this.

15 May 2008

Doxygen and localization

Are you localizing any documentation projects that use Doxygen? It's an open-source tool for documenting source code.

If your documentation set includes things like an API reference or extensive details in programming code, Doxygen allows you to embed tags in the original code or header files, then automatically create entire help systems organized around the tagged text. Doxygen does not compile anything, but takes the tagged bits of source files, turns them into HTML pages, then links them for viewing in a browser.
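
To give a flavor of the tagging: in C or C++ you embed comment blocks like /** \brief ... */ above each declaration. Doxygen also parses Python, where a comment block starting with ## gets the same treatment; here is an invented example (the function and tags are mine, not from any real project):

## @brief Return backlight information for the given device.
#
#  @param device  An open device handle (invented for this example).
#  @return        A dict describing the backlight state.
#
#  Doxygen lifts this block into the generated HTML reference page
#  for the function and cross-links it from the module index.
def get_backlight_info(device):
    return {"enabled": True, "level": 7}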

Like most tools, it's a breath of fresh air when it works properly, but it can require a lot of re-plumbing and retrofitting.

As far as localization goes, it can be a life-saver. In theory, you can have the header files themselves localized, then run them through Doxygen as you would the original English files. Working this far upstream can be a big advantage.

Some months ago a client embarked on a conversion of a help system to Doxygen. While it was still in the proof-of-concept stage, we pseudo-translated some header files and tested the tool for global-readiness.

The good news is that the developers of Doxygen have enabled it for multiple languages. It encodes pages in UTF-8 (or other character sets), so translated text displays properly in the browser. It's possible to set the OUTPUT_LANGUAGE parameter to your target language (e.g., Japanese, in our test scenario) so that the datestamp and other text supplied by Doxygen display in Japanese, rather than in the default English.

There are some I18n problems with Doxygen, though.
  • Each header file page begins with "Collaboration diagram for" followed by the page title. When the page title contains double-byte characters, the Japanese translation of "Collaboration diagram for" is corrupted. It appears that Doxygen is not emitting this phrase as UTF-8, though it emits UTF-8 correctly elsewhere.
  • Some hyperlinked words in body text will require translation. If so, it will be important to ensure that they are translated the same everywhere. Note, however, that Doxygen will not generate the necessary file if the hyperlink has double-byte characters in it (not even on a Japanese OS).
  • Doxygen allows for generation of the .hhc, .hhp and .hhk files needed for Compiled HTML Help (CHM). It can also be configured to execute hhc.exe and compile the project. However, Doxygen outputs the .hhc file in UTF-8 format, which is incompatible with the table of contents pane in the Help viewer. To fix this, open the .hhc in Notepad (preferably on a Japanese OS) and save it back out as Shift-JIS ("ANSI" in Japanese Notepad). Then recompile the CHM by invoking hhc.exe from the command line, and the contents will show up properly (a scripted version of this fix appears below).
  • Searches using single- or double-byte characters do not work in the resulting CHM.
These strike me as rather large, empty boxes on the checklist of global-readiness. Still, the source code is available, so if your organization has already started down the Doxygen path, you can clean up problems like these for your worldwide versions.
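
If you'd rather not do the Notepad dance by hand, the re-encode-and-recompile step scripts easily. A minimal sketch, assuming Doxygen's default HTML Help output names and a typical HTML Help Workshop path (both are assumptions to verify on your machine):

import subprocess
from pathlib import Path

html_dir = Path("doxygen/html")  # your HTML_OUTPUT directory
hhc_exe = r"C:\Program Files (x86)\HTML Help Workshop\hhc.exe"

toc = html_dir / "index.hhc"
text = toc.read_text(encoding="utf-8-sig")  # tolerate a BOM if present
# cp932 is the Windows flavor of Shift-JIS ("ANSI" in Japanese Notepad).
toc.write_text(text, encoding="cp932")

# Recompile the CHM. Oddly, hhc.exe returns 1 on success,
# so don't pass check=True here.
subprocess.run([hhc_exe, str(html_dir / "index.hhp")])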

Interested in this topic? You might enjoy another article I've written called Localizing RoboHelp Projects.

01 June 2007

Market Requirements for Localization

What good is all the market research if your product doesn't support the locale, and Engineering can't get it there?

As product manager, you're pleased with your product's global reach. You've successfully localized for the low-hanging fruit (other languages with Latin-based character sets: Spanish, German, even the Nordic languages), and your product and Web site make customers happy all over the Western world. You have established robust processes for:
  • researching the needs of each foreign market
  • making those needs an integral part of the product requirements
  • working with Engineering on timetables for support of the needs
  • working with QA to ensure the engineering work can be adequately tested
  • releasing in foreign markets and enjoying success in them
Now talk turns to Asian markets, and to multibyte enabling of your product and Web presence. You meet with Engineering and, as they did for the European languages, they assure you that their code is, or will be, clean, and that you'll replicate in Asia the success you've had in Europe. Everybody nods; it sounds like the Euro-success story all over again.

But what if it isn't?

As product manager, you want to do your usual, excellent job of identifying market requirements and writing up the intelligence so that Engineering knows what the product needs to support. You'd better scratch a little harder, though.
  1. How is Engineering going to validate the product for multibyte? Peer review of code? Bring in an internationalization engineer? Pseudo-translation? You can't just take their word for it; you have too much at stake.
  2. Can your Web team create a staging environment and test cases close enough to what the production environment will be like?
  3. Has QA done a good job in flushing out bugs in your other localized products? Look back at the bug reports from German or Finnish; did they really find many problems? Did they all get fixed? Do you know for sure that they're testing under production-caliber conditions and on production-caliber testbenches?
  4. Do you really need to launch in Japanese, Korean and two versions of Chinese at the same time? Can you adopt a phased approach? Which market can give you the best support as you're enabling your product? (Hint: It's often Japan.)
This is the localization equivalent of getting your ducks in a row. After you've done all the work of finding out what the market requires, you'd better be sure that the product you want to sell them really will perform as you claim.

Engineering, this is not business as usual. This is Asia.

16 March 2007

How to pseudo-translate, Part II

You only speak one language, so maybe you'll never be a translator, but you have a chance as a pseudo-translator.

Pseudo-translation is the process of replacing or adding characters in your software strings to try to break the software, or at least to uncover strings that are still embedded in the code and need to be externalized for proper localization. (Part I of this post describes why anybody would want to do such a thing.) Pseudo-translation is a big piece of internationalization (I18n) testing, which you should undertake before you bother handing anything off to the translators.

Here's an example of a few strings from a C resource file, each followed by its pseudo-translation:

IDS_TITLE_OPEN_SKIN "Select Device"
IDS_TITLE_OPEN_SKIN "日本Sイlイct Dイvウcイ本日"

IDS_MY_FOLDER "Directory:"
IDS_MY_FOLDER "日本Dウrイctエrユ:本本"

IDS_MY_OPEN "&Open"
IDS_MY_OPEN "日本&Opイn日"

IDS_WINDOW_NOT_ENOUGH_MEM
"Windows has not enough memory. You may lower the heap size specified in the configuration file."
IDS_WINDOW_NOT_ENOUGH_MEM
"日本Wウndエws hアs nエt イnエオgh mイmエrユ. Yエオ mアユ lエwイr thイ hイアp sウzイ spイcウfウイd ウn thイ cエnfウgオrアtウエn fウlイ.本日本日日本本本日日本日日本日本日本日本"

IDS_TARGET_INITIALIZATION_FAILED
"Failed to load or initialize the target."
IDS_TARGET_INITIALIZATION_FAILED
"日本Fアウlイd tエ lエアd エr ウnウtウアlウzイ thイ tアrgイt.日日本日本日本本"

In these strings, Japanese characters have been pushed in to replace the vowels in all of the English words. The goal of using Japanese characters is to ensure that, when compiled, the strings will look and behave as they should under Japanese Windows; it's important to pseudo-translate with the right result in mind.

Some observations:
  1. Each string begins with Japanese characters, since that will be the case in the real Japanese translation, and it's a situation worth testing.
  2. Each string contains enough English characters to allow the tester to "gist" the string from the context. This is helpful because pseudo-translation can often destroy the meaning of the string.
  3. Each string is padded for swell, with trailing characters adding about 20% to its length. This helps flush out fields and controls in which strings will be truncated (the sketch below implements all three rules).
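Here's a minimal sketch of such a pass, written in Python against the quoted-string resource format shown above. It implements the three rules - leading Japanese characters, katakana swapped in for vowels so the strings still gist, and roughly 20% trailing swell - but a production pass also has to respect escape sequences, printf-style placeholders and & accelerators, which this one ignores:

import re

# Map lowercase vowels to their katakana counterparts (a i u e o, plus y).
KATAKANA = str.maketrans({"a": "ア", "e": "イ", "i": "ウ",
                          "o": "エ", "u": "オ", "y": "ユ"})

def pseudo(s):
    swell = "日本" * max(1, round(len(s) * 0.2 / 2))  # ~20% padding
    return "日本" + s.translate(KATAKANA) + swell

# Apply to lines shaped like the entries above: STRING_ID "text"
entry = re.compile(r'^(\S+\s+)"(.*)"\s*$')

def pseudo_line(line):
    m = entry.match(line)
    return f'{m.group(1)}"{pseudo(m.group(2))}"' if m else line

print(pseudo_line('IDS_TITLE_OPEN_SKIN "Select Device"'))
# -> IDS_TITLE_OPEN_SKIN "日本Sイlイct Dイvウcイ日本"
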
Okapi Rainbow is an excellent (if somewhat inscrutable) text-manipulation utility for just this purpose. When run on all of the string files in the development project, the result is a set of resources which, when recompiled, will run as a pseudo-translated binary. With a testbench running the appropriate operating system, a tester can get a good idea of the I18n work in store for the developers.

Rare is the product that passes pseudo-translation testing on the first try, whether because of strings left behind in the code, resizing issues, string truncation, buffer overflows, or just plain bad luck.

Even if your code isn't perfect, though, look on the bright side: You're now a pseudo-translator.

06 March 2007

How to pseudo-translate, Part I

Before you localize your software product, wouldn't you like to have an idea of what's going to break as a result?

If you've written it in English, it may surprise and alarm you to learn that that's no assurance it will work when the user interface (UI) is in Chinese or Arabic, or maybe even Spanish. The most conspicuous vulnerabilities are:
  • text swell, in which "prompt" becomes "Eingabeaufforderung" in German, for example, and the 40 pixels of width you've reserved in the English UI leave room for only a small part of the German;
  • corrupted characters, which will show up in the UI as question marks or little black boxes because characters such as à, ü, ¿, ß, Ø and 日本語 aren't in the code page or encoding under which your software is compiled;
  • illegible or invalid names of files and paths, which occur when installing your software on an operating system that will handle more kinds of characters than your product will;
  • crashes, which occur when your software mishandles the strange characters so badly that the program just giggles briefly and then dies;
  • ethnocentric business logic, which leads to ridiculous results when users select unanticipated countries or currencies;
  • hard-coded anything, whether currency symbols, standards of measurement (metric vs. English) or UI strings.
In the past, localization efforts have become stranded on these beaches late in the voyage, after the text has been translated and the binaries rebuilt. It needn't be that way.

Internationalization testing is the process of pushing alien characters and situations down your software's throat to see what breaks. The more complex the software, the more complex the testing - so much so that there are companies that specialize in internationalization as much as, if not more than, localization.

It's not rocket science, but it doesn't happen on its own, either. And, you don't want your customers worldwide doing any more of your internationalization testing than absolutely necessary, because they really don't appreciate buying the product and then testing it.

The process requires some cooperation between Engineering and QA, which should already be in place for the domestic product and can easily be extended to the international products as well. An upcoming post will explain some of the tools and techniques for proper internationalization testing.

23 August 2006

Fixing that small internationalization gaffe

The engineers resolved the internationalization problem. Sort of.

They've modified the logic so that it no longer depends on the hardcoded presence of "&Tools" to pull the resources in correctly from two separate DLLs. However, it still looks for the literal "&Edit" in each DLL. If it doesn't find it, the submenu items do not show up. I know, because I broke it again with a random pseudo-translation pass that rendered "&Edit" as "&ßéüdßéüt" in one resource file and "&ßéüñdßéüñt" in the other.

"Well, what do you expect?" asked the developer, when I explained this to him. "Get your pseudo-act together and you won't find problems like this."

I granted him that it was very unlikely that "&Edit" would be translated differently in two places - well, it could happen, but it should not happen - but that was not the point. It's just not good programming practice to depend on string literals like that, whether localization engineering is a concern or not. "Why don't you make the dependency on the string ID instead? Localization will never go near that."

"Submit a ticket on it and we'll see for next time," he replied. "I've got other dragons to slay right now."

So, I filed the request and the enhancement is in the great cosmic wash of the engineering team's Issue Review system.

20 August 2006

Bad internationalization practice

Unfortunately, there's been another architecture change besides the move to .NET: Engineering has split the resource DLL into two pieces.

This is not bad news in itself, but there is a tricky dimension to putting the two DLLs together at run time, and the engineers have handled it in a way that assumes a little too much.

The main menu contains the usual entries (File, Edit, View, Tools, Windows, Help), each of which contains a submenu. The localization hiccup is that some of the submenu items live in one DLL, and the others live in the other. What brings them together at run time? The software depends on the presence of the string "&Edit" in each one. What happens when "&Edit" gets translated? "Oh, well, I guess we didn't think of that..."

The pseudo-translated string reads "&ßéüdßéüt". The sets of submenu items don't find one another in the DLLs at run-time, so they simply don't show up in the menus. Another triumph for the farsightedness of internationalization testing, and back to the drawing board for the developers.
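
The safer design is to key the merge on the string ID, which localization never touches, rather than on the translated text. A toy model in Python (standing in for the real resource-handling code; the table names are invented):

# Two resource tables, merged at run time.
main_dll  = {"IDS_MENU_EDIT": "&ßéüdßéüt",   "IDS_EDIT_CUT":   "Cut"}
other_dll = {"IDS_MENU_EDIT": "&ßéüñdßéüñt", "IDS_EDIT_PASTE": "Paste"}

# Fragile: match on the translated literal. The two pseudo-translations
# differ, so the submenus never find each other.
print(main_dll["IDS_MENU_EDIT"] == other_dll["IDS_MENU_EDIT"])  # False

# Robust: match on the language-neutral string ID.
print("IDS_MENU_EDIT" in main_dll and "IDS_MENU_EDIT" in other_dll)  # True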

16 August 2006

Pseudo-translating the resource files

I probably shouldn't enjoy this stuff so much, but I'm a gearhead at heart, so I get a lot of gratification from climbing around inside resource files.

One of the unsung virtues of localization consulting is pseudo-translation and the QA that follows it. The goal is to replace the source (in this case, English) strings with well thought-out gibberish, in an effort to make the software barf. The barfing can take a number of forms, such as:
  • truncated strings
  • corrupted characters
  • hard-coded strings
  • expanses of blank space where strings should be
  • crashes (my favorite)
I'm not really all that happy that I've caused the software to crash, but at least it vindicates the function of localization project management in general and pseudo-translation in particular in a way that even the most jaded developer cannot ignore.
