25 August 2006

Internationalization and the smart installer

Have we been thankful enough for InstallShield? I think it's a royal headache for the release engineers who have to get used to it, but it's a dream for a localization project manager:
  • InstallShield does most of the hard work. Most of the strings are already translated into more languages than most companies know what to do with.
  • Customized strings live in a single, text-based value.shl file, which the release engineers peel off and hand me for translation.
  • By default it creates language-specific branches in source control, which prevents, say, your Russian release from getting pasted in as a mere revision to your original English release.
The value.shl file is very simple, and ours changes so infrequently that it's easiest for me to update it myself (version numbers, copyright dates, URLs), without needing to hand it off for translation.

Of course, it did drive the release engineers batty in the early days, especially when I wandered in asking for 3 Asian and 2 Western installers every few months. The hard part for them is seeing far enough down the road to build a maintainable structure in source control. It never occurred to them to start out with branches labeled /en/ or /0009-English/ because they never foresaw the need for other languages, so they painted themselves into corners but didn't realize it until Chinese came along one day.

People in this industry write about introducing worldwide consciousness to the overall mindset of the organization, and evangelizing the gospel of localization; that's the 50,000-foot/15,240-meter level. Must be nice. I spend most of my time crawling in a trench in source control somewhere, trying to soften periods into decimal separators without getting flamed.

23 August 2006

Fixing that small internationalization gaffe

The engineers resolved the internationalization problem. Sort of.

They've modified the logic so that it no longer depends on the hardcoded presence of "&Tools" to pull the resources in correctly from two separate DLLs. However, it still looks for the literal "&Edit" in each DLL. If it doesn't find it, the submenu items do not show up. I know, because I broke it again with a random pseudo-translation pass that rendered "&Edit" as "&ßéüdßéüt" in one resource file and "&ßéüñdßéüñt" in the other.

"Well, what do you expect?" asked the developer, when I explained this to him. "Get your pseudo-act together and you won't find problems like this."

I granted him that it was very unlikely that "&Edit" would be translated differently in two places - well, it could happen, but it should not happen - but that was not the point. It's just not good programming practice to depend on string literals like that, whether localization engineering is a concern or not. "Why don't you make the dependency on the string ID instead? Localization will never go near that."

"Submit a ticket on it and we'll see for next time," he replied. "I've got other dragons to slay right now."

So, I filed the request and the enhancement is in the great cosmic wash of the engineering team's Issue Review system.
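
The string-ID suggestion can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's code: the ID value, the dictionary layout standing in for the two DLLs, and both function names are made up for the example. The captions are the two pseudo-translations that broke the menus.

```python
# Hypothetical resource tables standing in for the two DLLs.
# Each top-level menu is keyed by a numeric ID; only the caption is translatable.
ID_MENU_EDIT = 101  # made-up ID, for illustration only

dll_a = {ID_MENU_EDIT: {"caption": "&ßéüdßéüt", "items": ["Cu&t", "&Copy"]}}
dll_b = {ID_MENU_EDIT: {"caption": "&ßéüñdßéüñt", "items": ["&Paste"]}}


def merge_by_caption(a, b):
    """Fragile: pairs menus only when their captions are identical."""
    b_by_caption = {menu["caption"]: menu for menu in b.values()}
    merged = {}
    for menu in a.values():
        other = b_by_caption.get(menu["caption"])
        if other is not None:
            merged[menu["caption"]] = menu["items"] + other["items"]
    return merged


def merge_by_id(a, b):
    """Robust: pairs menus on the ID, which localization never touches."""
    merged = {}
    for menu_id, menu in a.items():
        extra = b.get(menu_id, {"items": []})["items"]
        merged[menu_id] = menu["items"] + extra
    return merged
```

With these tables, `merge_by_caption(dll_a, dll_b)` comes back empty because the two captions differ, while `merge_by_id(dll_a, dll_b)` still brings all three submenu items together under ID 101.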

20 August 2006

Bad internationalization practice

Unfortunately, there's been another architecture change besides the move to .NET: Engineering has split the resource DLL into two pieces.

This is not bad news in itself, but there is a tricky dimension to putting the two DLLs together at run time, and the engineers have handled it in a way that assumes a little too much.

The main menu contains the usual entries (File, Edit, View, Tools, Windows, Help), each of which contains a submenu. The localization hiccup is that some of the submenu items live in one DLL, and the others live in the other DLL. What brings them together at run-time? The software depends on the presence of the string "&Edit" in each one. What happens when "&Edit" gets translated? "Oh, well, I guess we didn't think of that..."

The pseudo-translated string reads "&ßéüdßéüt". The sets of submenu items don't find one another in the DLLs at run-time, so they simply don't show up in the menus. Another triumph for the farsightedness of internationalization testing, and back to the drawing board for the developers.

17 August 2006

Software internationalization - the hidden suspects

This is pretty old software. We've been working on it for over five years, and it's not very complex to begin with: a few standalone Win32 applications in the normal .exe-calls-resource-DLL model.

The heavy lifting on localization practices and internationalization techniques is mostly behind us. Several years ago, at the behest of the vendor, we managed to convince the developers to modify the layout and controls in all of the UI so that no resizing would be necessary from one language to the next. That was the biggest thing to get out of the way, as it ate up precious hours between translation and successful build. The problem now has to do with the coming and going of developers who weren't there a few years ago, and who move things around and resize them in the English UI. Localization headaches ensue.

I'm not sure, but I think the move from Visual Studio 6 to Visual Studio .NET has played a disruptive role. There's no inherent reason for that architecture change to cause resizing of controls or changes in coordinates, but it is mucking about with other non-translatables, like

1 TEXTINCLUDE DISCARDABLE

changing to

1 TEXTINCLUDE

or

STYLE DS_MODALFRAME | WS_POPUP |

changing to

STYLE DS_SETFONT | DS_MODALFRAME | WS_POPUP

There's certainly nothing to translate here, but these count as changes in the source-language files that should be preserved in the target-language files. It's hard to do this sort of thing without some kind of Computer-Aided Translation (CAT) tool or at least text-manipulation utility.

So, I reckon we hand it off for localization engineering, if not translation. The engineers will jack up the translation, slide out the old architecture as if it were HTML or XML tags, slide in the new architecture, and push the translated strings back down into place.
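
For the curious, the "jack up and slide in" operation can be approximated with a small text-manipulation script. What follows is a rough sketch under loud assumptions, not what a localization engineer's tooling actually does: it assumes the old English and old translated files line up one line to one line, that every translatable string sits between double quotes, and it ignores the doubled-quote escapes that real .rc files allow. The function names are invented for the example.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')  # simplification: ignores "" escapes


def build_translation_map(old_source, old_target):
    """Pair quoted strings line by line from the old source and target files."""
    mapping = {}
    for src_line, tgt_line in zip(old_source.splitlines(), old_target.splitlines()):
        src_strings = QUOTED.findall(src_line)
        tgt_strings = QUOTED.findall(tgt_line)
        # Skip lines whose string counts disagree rather than guess a pairing.
        if len(src_strings) == len(tgt_strings):
            mapping.update(zip(src_strings, tgt_strings))
    return mapping


def apply_translations(new_source, mapping):
    """Keep the new file's structure; swap in known translations."""
    def swap(match):
        return '"' + mapping.get(match.group(1), match.group(1)) + '"'
    return QUOTED.sub(swap, new_source)


old_en = 'STYLE DS_MODALFRAME | WS_POPUP\nCAPTION "Options"\n'
old_de = 'STYLE DS_MODALFRAME | WS_POPUP\nCAPTION "Optionen"\n'
new_en = 'STYLE DS_SETFONT | DS_MODALFRAME | WS_POPUP\nCAPTION "Options"\n'

new_de = apply_translations(new_en, build_translation_map(old_en, old_de))
```

Here `new_de` picks up the new `DS_SETFONT` style from the English file and keeps the existing translation of the caption. One known weakness: if the same English string was translated two different ways in the old files, the last one wins, which is one reason a real CAT tool is the safer route.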

16 August 2006

Pseudo-translating the resource files

I probably shouldn't enjoy this stuff so much, but I'm a gearhead at heart, so I get a lot of gratification from climbing around inside resource files.

One of the unsung virtues of localization consulting is pseudo-translation and subsequent QA. The goal is to replace the source (in this case, English) strings with well thought-out gibberish, in an effort to make the software barf. This can take a number of forms, such as:
  • truncated strings
  • corrupted characters
  • hard-coded strings
  • expanses of blank space where strings should be; and
  • crashes (my favorite)
I'm not really all that happy that I've caused the software to crash, but at least it vindicates the function of localization project management in general and pseudo-translation in particular in a way that even the most jaded developer cannot ignore.
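
A pseudo-translation pass does not need to be elaborate. Here is a minimal sketch in Python; the vowel-swapping rule is inferred from the gibberish quoted elsewhere on this blog, and the function names and the placeholder pattern are assumptions for illustration rather than a description of any particular tool.

```python
import re

VOWELS = "aeiouAEIOU"
# printf-style placeholders (%s, %d, %1 ...) that must survive untouched
PLACEHOLDER = re.compile(r"%[0-9]*[a-zA-Z]|%[0-9]+")


def pseudo_translate(text, filler="ßéü"):
    """Replace every vowel with `filler`, leaving placeholders and '&' alone."""
    out = []
    pos = 0
    for match in PLACEHOLDER.finditer(text):
        out.append(_swap_vowels(text[pos:match.start()], filler))
        out.append(match.group(0))  # copy the placeholder verbatim
        pos = match.end()
    out.append(_swap_vowels(text[pos:], filler))
    return "".join(out)


def _swap_vowels(chunk, filler):
    """Swap each vowel in a plain-text chunk for the filler string."""
    return "".join(filler if ch in VOWELS else ch for ch in chunk)
```

Calling `pseudo_translate("&Edit")` returns `"&ßéüdßéüt"`. The longer strings flush out truncation, the accented characters flush out encoding problems, and anything that still shows up in plain English in the UI is a hard-coded string.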

13 August 2006

On the varieties of Translation Memory mismatches

A close look at two different versions of the API Reference and the TM was enlightening.

Whereas the early version of the API Reference had multiple instances of

"These are filled by

Interface A and can be examined in the callback function."

and the current version has

"These are filled by
Interface A and can be examined in the callback function."

TM has

"These are filled by Interface A and can be examined in the callback function."

Ordinarily, segmentation rules in the TM tool would ignore the whitespace and CRLFs and see the sentence as a single translation unit, but it appears that the tool cannot (or will not) ignore all of the whitespace. It could also be that ignoring it correctly here would have adverse effects elsewhere in the document. The tool is also more or less successful in ignoring "unimportant" differences inside HTML tags.
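
To see that the three variants really are the same sentence, it is enough to collapse the whitespace before comparing. This is a minimal sketch of that normalization in Python, not a description of how any TM tool segments text; the function name is made up, and the exact line endings in the sample strings are assumed.

```python
import re


def normalize_segment(text):
    """Collapse any run of whitespace (spaces, tabs, CR, LF) to one space."""
    return re.sub(r"\s+", " ", text).strip()


early = "These are filled by\r\n\r\nInterface A and can be examined in the callback function."
current = "These are filled by\r\nInterface A and can be examined in the callback function."
in_tm = "These are filled by Interface A and can be examined in the callback function."

all_equal = normalize_segment(early) == normalize_segment(current) == in_tm
```

After normalization all three strings compare equal. The same trick applied blindly would flatten preformatted text such as code samples, where line breaks matter, which may be the kind of adverse effect the tool is avoiding.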

In any event, because the engineers are constantly tweaking the Perl scripts, the source header files are still much more stable than the extracted HTML files. We should figure out a way to localize the former and generate the latter using scripts, as we do for English.

10 August 2006

Localization beyond the Translation Memory tools

Still reeling from the preliminary wordcount on the HTML pages in the API Ref...

While it's true that 3280 pages is not a phenomenal amount, it's too onerous a number for review and vetting by a single human, particularly when I would be that human.

I've dusted off BeyondCompare for some preliminary testing. I think I can use it to pour these pages into a few different buckets:

1) New, orphan pages - newly written content
2) Pages which have not changed at all since the last time I handed them off for localization
3) Pages which have changed immaterially (datestamp in footer, etc.) since last handoff
4) Pages which have changed for reasons that won't matter to translators (format changes, cleaned-up typos in English)
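
A scripted first pass at the bucketing might look like the sketch below. It is a rough approximation under stated assumptions, not what BeyondCompare does: it uses Python's standard `html.parser` to strip tags, treats a missing old page as new, and only separates markup-only changes from text changes. The class and function names are invented for the example.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def classify(old_html, new_html):
    """Return 'new', 'unchanged', 'markup-only' or 'text-changed'."""
    if old_html is None:
        return "new"
    if old_html == new_html:
        return "unchanged"
    if visible_text(old_html) == visible_text(new_html):
        return "markup-only"
    return "text-changed"
```

Datestamp changes and corrected typos both land in "text-changed" here, so buckets 3 and 4 would still need extra rules (or a human) to tell them apart from pages with genuinely new content.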

The question is: Will this lead to a higher or lower margin of error than I get when I simply throw all 3280 pages into Trados?

This is the stuff localization consulting is made of.

07 August 2006

Trados, you've got to be kidding!

I handed off the preliminary version of the 3280 HTML pages to the localization vendor for analysis against the existing TM. What came back was most ugly.

Out of 400,000 words, the analysis finds that over 220,000 are not exact matches, which is inaccurate at best and depressing at worst.

-Maybe the translation units in the TM are including the HTML tags, which have changed greatly over the years.

-Maybe the TM has not yet been updated from the last round of localization.

-Maybe the TM is corrupted.

-Maybe the engineers have made small, search-and-replace changes in thousands of places, modifying details that don't really matter in the translation. Suppose they changed "recieve" to "receive" in 500 segments, totaling 5,000 words. The segments would not be counted as exact matches - which costs me - but they would nonetheless go to the translators, who would see that there is nothing to change in the translation: "Money for old rope," as they say in the UK.
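
The last scenario is easy to demonstrate. The sketch below uses Python's `difflib` as a stand-in for a fuzzy-match score; it is not the algorithm Trados uses, and the sample sentence is invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(old_segment, new_segment):
    """Character-level similarity between 0.0 and 1.0 (1.0 = identical)."""
    return SequenceMatcher(None, old_segment, new_segment).ratio()


old = "The callback will recieve the data."
new = "The callback will receive the data."
score = similarity(old, new)
print(f"{score:.0%} similar, exact match: {old == new}")
```

The two segments are nearly identical, yet they fail the exact-match test, so the segment gets counted and billed as a fuzzy match even though the translation needs no change.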

This bears some more investigation, because I don't trust that TM wordcount.

01 August 2006

So the API Ref weighs in at 3280 HTML pages now, about 750 more than in the last release.

The trick will be in figuring out which of these zillions of pages have substantive changes (i.e., new translatable text, changed translatable text) and which have changed due to non-translatable issues (i.e., changes to the HTML code inside the tags). Translation memory tools are meant to ignore the latter, but I can't leave good translation inside outdated HTML; something is bound to break, or at least look bad, if we shuffle multiple generations of HTML code and tag conventions together and compile it.


I don't think the TM tools are going to rescue me from this. I should figure out a way to translate the source header files instead of the downstream HTML files.
