24 July 2008

"I can quit smoking whenever I want to."

"...I just don't want to."

Have you heard that one before? I heard something similar last week from a director of engineering:

"All of our strings are embedded in source code. This is deliberate, and we planned it very carefully."

How would you have reacted?

At first, I figured he was pulling my leg ("taking the mickey," "having me on," etc.). Then he explained the process of localizing strings in the GNU gettext model, which can live peacefully without external resources.

A line of code reading

result = wx.MessageDialog(None, _("Welcome to my blog. Today is %s") % date.today())

uses the _ function as an identity function in the English context. In a localized context, it loads the language pack built with the GNU gettext utilities and maps the English strings to their localized equivalents:

"Welcome to my blog. Today is %s" -> "Bienvenido a mi blog. Hoy es %s"

To redeem what seems like shortsightedness in allowing developers to embed strings in code, these utilities also include scripts that can pull all of the English strings out of the source code and build localization packages, which translators can work on without danger of touching the code. Other scripts then compile the localized strings into the language packs that the application loads at run time.

Like .properties files in Java and .rc files in C++, these localization packages isolate non-code elements for easy localization. However, a programmer's coding mistake - a string literal that never gets wrapped in _(), for instance - could still leave strings undetected by the scripts, so I still plan to perform pseudo-translation and internationalization testing on this software as soon as possible.
Just in case the director of engineering can't quit smoking as easily as he thinks he can.


15 May 2008

Doxygen and localization

Are you localizing any documentation projects that use Doxygen? It's an open-source tool for documenting source code.

If your documentation set includes things like an API reference or extensive details in programming code, Doxygen allows you to embed tags in the original code or header files, then automatically create entire help systems organized around the tagged text. Doxygen does not compile anything, but takes the tagged bits of source files, turns them into HTML pages, then links them for viewing in a browser.
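To give a flavor of the tagging, here is a minimal, invented sketch - written in Python, which Doxygen also parses via its ## comment blocks, though the project in question used C/C++ headers; the file, function and parameter names are made up for illustration:

## @file greetings.py
#  @brief Invented example module showing Doxygen comment blocks in Python.

## @brief Build a greeting for the given user.
#
#  @param name       Display name of the user.
#  @param lang_code  Two-letter code of the target language (e.g., "ja").
#  @return           The greeting string in the requested language.
def greet(name, lang_code="en"):
    greetings = {"en": "Hello, %s", "ja": "こんにちは、%s"}
    return greetings.get(lang_code, greetings["en"]) % name

Run doxygen over files tagged like this and out come the linked HTML pages described above.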

Like most tools, it's a breath of fresh air when it works properly, but it can require a lot of re-plumbing and retrofitting.

As far as localization goes, it can be a life-saver. In theory, you can have the header files themselves localized, then run them through Doxygen as you would the original English files. Working this far upstream can be a big advantage.

Some months ago a client embarked on a conversion of a help system to Doxygen. While it was still in the proof-of-concept stage, we pseudo-translated some header files and tested the tool for global-readiness.

The good news is that the developers of Doxygen have enabled it for multiple languages. It encodes pages in UTF-8 (or other character sets), so translated text displays properly in the browser. It's possible to set the OUTPUT_LANGUAGE parameter to your target language (Japanese, in our test scenario) so that the datestamp and other text supplied by Doxygen appear in Japanese rather than in the default English.

There are some I18n problems with Doxygen, though.
  • Each header file page begins with "Collaboration diagram for" followed by the page title. When the page title contains double-byte characters, the Japanese characters for "Collaboration diagram for" are corrupted. It appears that Doxygen is not pushing UTF-8 characters for this phrase, though it pushes UTF-8 characters in other places.
  • Some hyperlinked words in body text will require translation. If so, it will be important to ensure that they are translated the same everywhere. Note, however, that Doxygen will not generate the necessary file if the hyperlink has double-byte characters in it (not even on a Japanese OS).
  • Doxygen allows for generation of the .hhc, .hhp and .hhk files needed for Compiled HTML Help (CHM). It can also be configured to execute hhc.exe and compile the project. However, Doxygen outputs the .hhc file in UTF-8, which is incompatible with the table of contents pane in the Help viewer. To fix this, open the .hhc in Notepad (preferably on a Japanese OS) and save it back out as Shift-JIS ("ANSI" in Japanese Notepad), then recompile the CHM by invoking hhc.exe from the command line; the contents will then show up properly. (A scripted version of this fix is sketched after this list.)
  • Searches using single- or double-byte characters do not work in the resulting CHM.
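If you'd rather not repeat the Notepad dance on every build, that re-encoding step can be scripted. Here is a minimal Python sketch, assuming the generated files live in html/ and hhc.exe is on the PATH (cp932 is Python's name for the Shift-JIS code page used by Japanese Windows):

import subprocess

hhc_file = "html/index.hhc"  # the UTF-8 contents file Doxygen generated; path is a placeholder

# Read the UTF-8 file, then write it back out as Shift-JIS,
# which is what the Help viewer's table of contents pane expects.
with open(hhc_file, encoding="utf-8") as f:
    contents = f.read()
with open(hhc_file, "w", encoding="cp932", errors="replace") as f:
    f.write(contents)

# Recompile the CHM, just as you would by hand from the command line.
subprocess.run(["hhc.exe", "html/index.hhp"])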
These strike me as rather large, empty boxes on the checklist of global-readiness. Still, the source code is available, so if your organization has already started down the Doxygen path, you can clean up problems like these for your worldwide versions.

Interested in this topic? You might enjoy another article I've written called Localizing Robohelp Projects.


27 March 2008

Windows Vista Multilingual User Interface

Have you started localization testing on Vista yet? They've taken the fun out of it.

Windows 2000 was easy. The OS is relatively small, and you could even maintain multiple native systems (Win 2000 Ja, Win 2000 Ko, Win 2000 Es) on separate partitions or virtual machines. If your testing wasn't that rigorous and you just wanted the localized UI, you could slap the Multilingual User Interface (MUI) onto an English version of Windows without much headache.

XP required a bit more work, not to mention disk space, but it was still manageable, especially with drives as large as they are.

Vista feels like another order of magnitude in complexity. Actually, the process of building the testbenches is not that formidable, but the space requirements are. Just to reduce overhead, one of my clients has asked that we plan to test on MUI instead of native OSes from now on. Not the optimal arrangement, but we'll work with it.

MUI is not air-tight. The font and locale support is good, but it leaks English all over the place. (Apparently you can install MUI only on top of an English base OS.) It's not great for screenshots because you cannot always rely on perfectly localized UI, but it's a fair gauge of how your app will work in the target market.

There's an intimidating procedure on Microsoft's localization Technet for configuring MUI under Vista. I couldn't believe it was really this difficult - it need not be - so I poked around some more and found a different, simpler way.

1) Upgrade to Vista Ultimate or Enterprise. Vista MUI won't work with Home or even Business Editions, so plan to spend the money on the upgrade. From Windows Anytime Upgrade (WAU) you can pay US$139-159 for a license key to Vista Ultimate. If your company manages such things, you can find out whether it already has a site license for either of these.
2) Depending on the media you received with your machine, you may need to order (US$8) the WAU DVD also. The MUI files are not on it, but the files needed to upgrade the OS are.
3) After installing the upgrade, be sure to activate it.

Congratulations! You now have Windows Ultimate (and several fewer GB of disk space).

4) Go to Windows Update, which is where the language packs live. Microsoft used to make these available via MSDN - perhaps they still do - but now they're available as optional Windows updates.
5) Your upgrade to Ultimate/Enterprise may entail a few required updates. Install those first and get them out of the way, or else your request for the language pack may not take.
6) Select the language packs you need, start the download and go on vacation. The Japanese language pack is about 800MB in size, so you may do well to install one, play with it for a few days, then install others.
7) Go to the Regional and Language Options control panel. On the Keyboards and Languages tab, choose the display language. Windows requires only a logoff (not a restart) and at login you'll see something like this:

[Screenshot: the Vista desktop after logging in with the Japanese display language.]

Explorer's menu is localized, but Wordpad's is not. Most of the Windows-specific titles are localized, but the date on the clock is not. It's a hodgepodge, and MUI may not meet all your needs.

Our resulting testbench comprises Vista Business, the upgrade to Ultimate, the Ja language pack and McAfee VirusScan. Call me a dinosaur - many other people have complained about Vista's disk usage more eloquently than I can - but I'm up to 12GB (2GB of which is the hibernation file that won't go away), which strikes me as quite a bit. So, if you plan to create several localized Vista testbenches, get a fat drive.

By the way, we experimented with localized virtual machines based on Vista. Even with 1GB of the 2GB of physical RAM devoted to it, the VM ran so poorly that it was just plain not worth it.

If you liked this article, you may enjoy another related article, "Localization Testbenches - Part I."


27 July 2007

Virtual machines as localization testbenches

What are you doing for localized testbenches? Are you still partitioning hard drives or, worse yet, dedicating entire machines to one language-platform? Lab getting a bit crowded and hot, is it? Consider using virtual machines, or VMs.

Microsoft's Virtual PC 2007 is free and uncrippled, so you can create, run and administer your own VMs. It's not bad software, and you can be sure that if there are any "special" tricks a VM should know about the Windows version hosting it, this product will know them. VMWare's product is not free, but it features a VM player, so if somebody in your department has the full product and can create VMs for you, you can use them as you would a normal drive.

A VM is just a huge file (around 1GB for Windows 2000, 3GB for Windows XP, 6GB for Vista) that you mount and run as a "guest" session in its own window. It's like having a computer inside a computer, although it takes away drive space, RAM and processor cycles that the "host" system - the one your computer normally runs - would otherwise use. You can start and stop the VMs as you need them, and you can install almost any OS or language to run as a guest inside the VM; however, you do need to procure a legal copy of that OS/language combination.

Tiring of so much ancient kit lying around the lab, we've begun to migrate to VMs. They're not a panacea, but they make things like remote testing a good deal easier, they require less hardware, and they make dual-boot configurations irrelevant. There's quite a performance hit, unfortunately, and we're finding that VMWare VMs are a bit more responsive than Microsoft's.
However, most localization testing is focused on UI and functionality rather than on performance, so this may not affect your lab unduly.

Creating one is not that difficult. Here's an example for Microsoft's Virtual PC 2007:
  1. Obtain a machine with about 2GB of RAM and at least 80GB of drive space, running, say, Windows XP English.
  2. Download and install MS Virtual PC 2007.
  3. Create a virtual machine, specifying Windows XP. For installation, give the VM 1GB of RAM; you can reduce that amount later if need be.
  4. Obtain the installation disk for the desired OS (e.g., Windows XP Japanese) and place it in the drive.
  5. Start the VM. As it opens in its new window, specify that you want to capture the CD drive (or the image file on the host machine, if you're mounting a .img).
  6. Installation will then take place as normal, with disk checking, file copying, and all the configuration and rebooting you would expect of an installation on a physical drive.
  7. A few GB later, you have a WinXP Japanese VM running as a guest on your Win XP English host. Install the Virtual Machine Additions to enable features like host-to-guest drag-and-drop.
Of note:
  1. Your VMs don't inherit domain information from the host machine, so if you want the VMs on the domain for things like advertised programs and SMS pushes, you'll need to arrange that separately with the network administrator.
  2. They just get bigger. There is a feature to "compact" the VMs, but the resulting file takes no less space on disk.
  3. Migrating from existing physical drives into a VM is a crapshoot. Our most wildly successful experimentation has resulted in Blue Screen of Death, so don't expect to take an existing testbench and copy it into a new VM, any more than you would expect it to work going from a desktop to a laptop physical machine.
  4. There's an emulation layer between the VM and the hardware, so peripherals (USB devices, dongles) may not run the same way as they do on a physical drive.
  5. VMs are portable, so if you can get one to run on your desktop, you should be able to copy it to other computers and use it on them. For that matter, VMWare VMs can be hosted on servers and run remotely.


06 April 2007

Localization Testbenches, Part II (Software)

What are you using to test your localized products? If you're handing them to your domestic QA team and expecting that they'll intuitively test them with correct language locale settings, you may be in for an unpleasant surprise.

1) Software
This will probably take you the most time to get right, because you need to go to greater pains to emulate your customers' real-world scenario. They've bought computers running Windows XP/Japanese or Linux/Russian or MacOS/Arabic. The hardware nowadays isn't different (except for the keyboards), so you don't need to outfit your lab with machines from all over the planet.

However, if you install your Korean product under US-English Windows XP, you'll probably be in for lots of corrupted characters on screen. In the legacy code pages, Korean characters (like Japanese and Chinese) take up two bytes apiece, whereas English and other Western languages use only one byte per character; Arabic and other non-Western scripts simply live in different code pages altogether. An English operating system tries to interpret those bytes according to its own code page, and the result is usually illegible.
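You can see the effect for yourself in a couple of lines of Python: take a Korean phrase ("hello," in this case), store it as bytes the way a legacy Korean code page would, then read those bytes back with a Western code page:

korean = "안녕하세요"                     # "hello"
raw = korean.encode("cp949")              # the bytes a legacy Korean code page stores
print(raw.decode("cp1252", errors="replace"))
# -> ¾È³çÇÏ¼¼¿ä : each double-byte Korean character read as two Western characters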

Modern operating systems include the fonts and locale support for these multi-byte languages, though it usually needs to be enabled. This is a good half-measure for testing your localized products, but it's still not exactly what your in-country customers will see, so you should consider native-language testbenches, onto which you freshly install the native operating system.

This can get clunky and hardware-intensive - even if you're partitioning the disk and dual-booting - so you may also consider virtualization products like VMWare and Virtual PC. You can host dozens of different native-language systems on a single hard drive, and run several of them at a time, if your machine is sufficiently endowed.

Of course, almost any solution will spook your testers, who will consult their job descriptions and inform you that they contain no mention of "putting up with weird languages." This is not an insurmountable problem, though it is a topic for another post.

Note: Believe it or not, some people think it's pretty slick to see MacOS in Portuguese or Russian RedHat. They are hypnotized by how similar the interface is, and struck by the differences. A neat show-stopper for your evangelization sessions.


30 March 2007

Localization Testbenches, Part I

What are you using to test your localized products? If you're handing them to your domestic QA team and expecting that they'll intuitively test them with correct language locale settings, you may be in for an unpleasant surprise.

Of course, your testers need to have some tolerance for the extraordinary circumstance of not being able to read what they're testing. Testers with this level of tolerance have not been that easy to hire in single-language countries - which is one explanation for the success of globalization - but they do not take quite so much umbrage at it now that the writing is on the wall and the tools are more handy.

Also, there are two levels of testing: linguistic and functional. You do not need (or want) your domestic QA team to review the Italian translation; you want the translators to review it, and by the time you're handing your product to your QA team, linguistic review should be long since ended. In most cases, your QA team will know how to perform functional testing much more efficiently than the translators will, even though the UI is foreign. Encourage them to overcome the "How can I test this when I can't read it?" obstacle, either with your own evangelization, or with gentle, paycheck-indexed prodding from above. They have more value to add to the localization QA process than they suspect.

In this series, you'll read about testing 1) Software; 2) Web sites; and 3) Help files.


16 March 2007

How to pseudo-translate, Part II

You only speak one language, so maybe you'll never be a translator, but you have a chance as a pseudo-translator.

Pseudo-translation is the process of replacing or adding characters to your software strings to try and break the software, or at least uncover strings that are still embedded in the code and need to be externalized for proper localization. (Part I of this post describes why anybody would want to do such a thing.) Pseudo-translation is a big piece of internationalization (I18n), which you should undertake before you bother handing anything off to the translators.

Here's an example of a few strings from a C resource file, with their respective pseudo-translations:

IDS_TITLE_OPEN_SKIN "Select Device"
IDS_TITLE_OPEN_SKIN "日本Sイlイct Dイvウcイ本日"

IDS_MY_FOLDER "Directory:"
IDS_MY_FOLDER "日本Dウrイctエrユ:本本"

IDS_MY_OPEN "&Open"
IDS_MY_OPEN "日本&Opイn日"

IDS_WINDOW_NOT_ENOUGH_MEM
"Windows has not enough memory. You may lower the heap size specified in the configuration file."
IDS_WINDOW_NOT_ENOUGH_MEM
"日本Wウndエws hアs nエt イnエオgh mイmエrユ. Yエオ mアユ lエwイr thイ hイアp sウzイ spイcウfウイd ウn thイ cエnfウgオrアtウエn fウlイ.本日本日日本本本日日本日日本日本日本日本"

IDS_TARGET_INITIALIZATION_FAILED
"Failed to load or initialize the target."
IDS_TARGET_INITIALIZATION_FAILED
"日本Fアウlイd tエ lエアd エr ウnウtウアlウzイ thイ tアrgイt.日日本日本日本本"

In these strings, Japanese characters have been pushed in to replace the vowels in all English words. The goal of using Ja characters is to ensure that, when compiled, the strings will look and behave as they should under Windows Japanese; it's important to pseudo-translate with the right result in mind.

Some observations:
  1. Each string begins with Ja characters, since that will be the case in the real Japanese translation, and it's a situation worth testing.
  2. Each string contains enough English characters to allow the tester to "gist" the string from the context. This is helpful because pseudo-translation can often destroy the meaning of the string.
  3. Each string is padded for swell, with trailing characters adding about 20% to its length. This helps flush out fields and controls in which strings will be truncated. (See the sketch below.)
Okapi Rainbow is an excellent (if somewhat inscrutable) text-manipulation utility for just this purpose. When run on all of the string files in the development project, the result is a set of resources which, when recompiled, will run as a pseudo-translated binary. With a testbench running the appropriate operating system, a tester can get a good idea of the I18n work in store for the developers.
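For illustration, here is a rough Python sketch of the three rules above. The katakana substitutions and the 20% padding mirror the example strings; a real tool such as Rainbow also handles escapes, placeholders and the various file formats, so treat this as a toy rather than an implementation:

# Katakana stand-ins for the lowercase English vowels, as in the strings above.
VOWEL_MAP = {"a": "ア", "e": "イ", "i": "ウ", "o": "エ", "u": "オ", "y": "ユ"}
PAD = "日本"

def pseudo_translate(text, swell=0.20):
    # Rule 1: lead with Ja characters.
    # Rule 2: leave the consonants alone so testers can still "gist" the string.
    body = "".join(VOWEL_MAP.get(ch, ch) for ch in text)
    # Rule 3: pad the end by roughly 20% to flush out truncation.
    padding = (PAD * len(text))[: max(1, int(len(text) * swell))]
    return PAD + body + padding

print(pseudo_translate("Select Device"))  # 日本Sイlイct Dイvウcイ日本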

Rare is the product that passes pseudo-translation testing on the first try, whether because of strings left behind in the code, resizing issues, string truncation, buffer overflows, or just plain bad luck.

Even if your code isn't perfect, though, look on the bright side: You're now a pseudo-translator.


06 March 2007

How to pseudo-translate, Part I

Before you localize your software product, wouldn't you like to have an idea of what's going to break as a result?

If you've written it in English, it will surprise and alarm you to learn that that's no assurance that it will work when the user interface (UI) is in Chinese or Arabic or maybe even Spanish. The most conspicuous vulnerabilities are:
  • text swell, in which "prompt" becomes "Eingabeaufforderung" in German, for example, and the 40 pixels of width you've reserved in the English UI result in only a small part of the German appearing;
  • corrupted characters, which will show up in the UI as question marks or little black boxes because characters such as à, ü, ¿, ß, Ø and 日本語 aren't in the code page or encoding under which your software is compiled;
  • illegible or invalid names of files and paths, which occur when installing your software on an operating system that will handle more kinds of characters than your product will;
  • crashes, which occur when your software mishandles the strange characters so badly that the program just giggles briefly and then dies;
  • ethnocentric business logic, which leads to ridiculous results when users select unanticipated countries or currencies;
  • hard-coded anything, whether currency symbols, standards of measurement (metric vs. English) or UI strings (a small example follows this list).
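On that last point, here is a small Python sketch contrasting a hard-coded currency string with locale-aware formatting; it assumes a German locale is installed on the machine:

import locale

amount = 1234.56

# Hard-coded: correct for exactly one market.
print("$%.2f" % amount)                        # $1234.56, everywhere

# Locale-aware: grouping, decimal separator and symbol follow the user's market.
# (Assumes the de_DE.UTF-8 locale is installed on the system.)
locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")
print(locale.currency(amount, grouping=True))  # 1.234,56 €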
In the past, localization efforts have become stranded on these beaches late in the voyage, after the text has been translated and the binaries rebuilt. It needn't be that way.

Internationalization testing is the process of pushing alien characters and situations down your software's throat to see what breaks. The more complex the software, the more complex the testing - so much so that there are companies that specialize in internationalization as much as, if not more than, localization.

It's not rocket science, but it doesn't happen on its own, either. And, you don't want your customers worldwide doing any more of your internationalization testing than absolutely necessary, because they really don't appreciate buying the product and then testing it.

The process requires some cooperation between Engineering and QA, which should already be in place for the domestic product and can easily be extended to the international products as well. An upcoming post will explain some of the tools and techniques for proper internationalization testing.
