30 September 2006

The Localization Consultant Amid His Buckets

After a few hours hunched over Beyond Compare, I've sorted the deltas between version 3.9 and version 4.0 into several buckets:
  1. New Content, based on filenames appearing for the first time in this version - 718 files
  2. Content Unchanged, except for the datestamp at the bottom of the page - 727 files
  3. Content Changed, but with changes that do not require translation (HTML tags, formatting) - 1517 files
  4. Other, including content with translatable changes and anything else - 319 files
My hope is that the vendor can hand off to the Japanese translators only those pages in which there is real translation work, then take care of #2 and #3 internally with search-and-replace and other engineering techniques to bring the 3.9 pages into parity with the 4.0 pages. For that matter, I could probably do the engineering myself, except that 1) it's boring work, and 2) the vendor needs to update the translation memory with the results.
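
Out of curiosity, I reconstructed the triage logic as a little Python sketch. Treat it as a back-of-the-envelope check, not the vendor's process: the directory names and the datestamp pattern are my assumptions, and the Trados analysis remains the real arbiter.

import os
import re

# Hypothetical locations of the two English trees; the real comparison
# happened in Beyond Compare, so treat this as a rough cross-check.
OLD_DIR, NEW_DIR = "apiref-3.9", "apiref-4.0"
DATESTAMP_RE = re.compile(r"Last updated:.*")   # assumed footer format
TAG_RE = re.compile(r"<[^>]+>")

def normalized(path, strip_tags):
    # Read a page, drop the datestamp, optionally drop markup, and
    # collapse all whitespace so hard returns don't count as changes.
    with open(path, encoding="latin-1") as f:
        text = DATESTAMP_RE.sub("", f.read())
    if strip_tags:
        text = TAG_RE.sub("", text)
    return " ".join(text.split())

buckets = {"new": [], "datestamp only": [], "tags only": [], "other": []}
for name in sorted(os.listdir(NEW_DIR)):
    old, new = os.path.join(OLD_DIR, name), os.path.join(NEW_DIR, name)
    if not os.path.exists(old):
        buckets["new"].append(name)                # bucket 1
    elif normalized(old, False) == normalized(new, False):
        buckets["datestamp only"].append(name)     # bucket 2
    elif normalized(old, True) == normalized(new, True):
        buckets["tags only"].append(name)          # bucket 3: markup changed, text didn't
    else:
        buckets["other"].append(name)              # bucket 4

for label, files in buckets.items():
    print(f"{label}: {len(files)} files")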

We'll see how this goes. It doesn't help that the original English files contain a lot of formatting errors, or that bugs in the Perl scripts wipe out the content on several dozen pages, which land in the CHM blank.


25 September 2006

Doing the Localization Vendor's Work?

Sometimes I know too much about this process.

Or, maybe I'm just too nice a guy.

To make things easier for the vendor (and cheaper for me) I've resolved to carve the 3200 HTML files in the API Reference CHM into different buckets, depending on whether and how much they require translation vs. engineering. Naturally, the ultimate arbiter is the Trados or SDLX analysis that the vendor will perform, but I've already mentioned my concern about false positives and need write no more on the topic here.

My tool of choice is the extremely capable Beyond Compare, which, at US$30, is worth it just to see how well thought-out a software package can be. I compare the version 3.9 files against the version 4.0 files, tuning the comparison rules to groom the file buckets as accurately as possible.

The distribution is not perfect, if for no other reason than that its first level of triage is the filename rather than the file contents, but it's better than guessing, and it's much better than thousands of false positives.

Once I've gone through the files, I'll have a better idea of how to label the buckets in a way that meets both my needs and those of the vendor.

At least, I think I'm being a nice guy here. Maybe this is just a big pain for the vendor, and they're too polite to tell me.


20 September 2006

Segmentation and Translation Memory

To get the broken sentences in the new files to find their equivalents (or even just fuzzy matches) in translation memory, we have three options:

  1. Modify the Perl scripts that extract the text from the header files into the HTML, so that the scripts no longer introduce the hard returns.
  2. Massage the HTML files themselves and replace the hard returns with spaces.
  3. Tune the segmentation rules in Trados so that it ignores the hard returns (but only the ones we want it to ignore) and doesn't consider a segment finished until it reaches a full stop/period.
To go as far upstream as possible, I suppose we should opt for #1 and fix the problem at its source. This seems optimal, unless we subsequently break more things than we repair. Options #2 and #3 are neat hacks and good opportunities to exercise fun tools, but they burn up time and still don't fix the problem upstream.
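
For the record, option #2 needn't be fancy. Here's a minimal sketch in Python, assuming the fragments are separated by blank lines (the two hard returns) and that genuine paragraph breaks follow sentence-final punctuation or a closing tag; the filename is hypothetical.

import re

def rejoin_fragments(html):
    # Join across a double hard return only when the text before it does
    # not end a sentence (., !, ? or a closing tag), so genuine paragraph
    # breaks survive.
    return re.sub(r"(?<![.!?>])\r?\n\r?\n", " ", html)

with open("nsISSLListener.html", encoding="latin-1") as f:      # hypothetical page
    fixed = rejoin_fragments(f.read())
with open("nsISSLListener.fixed.html", "w", encoding="latin-1") as f:
    f.write(fixed)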

Also, I don't want the tail to wag the dog. The money spent in translating false positives may be less than the time and money spent in fixing the problem.


15 September 2006

Moving the Localization Carpet under the Source Text

Here's the mess I face.

The HTML files are filled with paragraphs formatted like this:

Currently, this function gets called for trust overrides and

client authentication. On client authentication, the supplied

interface contains the server's Certificate Authorities Distinguished

Names list (see references) and the negotiation handler

always gets called so as to give a chance to the client to supply

the correct client certificate based on the DN list.


At the end of each line are two hard returns. It wasn't always this way, so each complete sentence is sitting happily in translation memory. Unfortunately, Trados pulls in each of these six 80-character fragments and calls it low- or no-match because it can't find enough of a concordance. This is a classic case of false positives driving up translation costs.
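
Before committing to a fix, it's worth knowing how widespread the damage is. Here's a quick-and-dirty Python scan, assuming the 4.0 files sit in a directory I'll call apiref-4.0, and using a heuristic of my own rather than anything Trados does:

import os
import re

# Lowercase text on both sides of a blank line is a decent tell that a
# sentence was broken mid-stream; Trados's real segmentation is subtler.
BROKEN_RE = re.compile(r"[a-z,;]\n\n[a-z]")

count = 0
for name in os.listdir("apiref-4.0"):        # hypothetical directory
    if name.endswith(".html"):
        with open(os.path.join("apiref-4.0", name), encoding="latin-1") as f:
            if BROKEN_RE.search(f.read()):
                count += 1
print(f"{count} files contain mid-sentence hard returns")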

I'm still exploring options. Meanwhile, there's no sense in starting the translation work.


10 September 2006

Who's in trouble: the Localization Vendor or me?

The localization estimate has come back on the HTML files in the API Reference, and it's as ghastly high as I'd feared.

The vendor's project manager does something clever with these thousands of files: she uses SDLX Glue to glue them together into six or seven batches of several hundred files each. That way she avoids carpet-bombing the translator with jillions of files; it also keeps the translator in the translation business and out of the file-management business. After translation, the project manager un-glues the batches and hands them off internally for engineering, QA, etc.
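
SDLX Glue is their proprietary tool, not mine, but the idea is simple enough to sketch in Python. The marker format here is invented; whatever SDLX actually writes is surely different.

MARKER = "<!-- ===GLUE=== {} -->"

def glue(filenames, out_path):
    # Concatenate the batch, tagging each file with a marker the HTML
    # content itself should never contain.
    with open(out_path, "w", encoding="latin-1") as out:
        for name in filenames:
            out.write(MARKER.format(name) + "\n")
            with open(name, encoding="latin-1") as f:
                out.write(f.read() + "\n")

def unglue(glued_path):
    # Split the translated batch back into its original files.
    with open(glued_path, encoding="latin-1") as f:
        chunks = f.read().split("<!-- ===GLUE=== ")[1:]
    for chunk in chunks:
        name, _, body = chunk.partition(" -->\n")
        with open(name, "w", encoding="latin-1") as out:
            out.write(body)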

The downside to this technique is that the TM analysis covers only the six or seven glued files. I can't see down to the level of granularity I want unless I ask for a special analysis at the level of the individual file. They don't mind doing it for me, but it's not part of their regular workflow, so I have to wait for it.

Anyway, the count of unmatched words is preposterously high, and I'm pretty sure it's due to changes in the scripts that extract the HTML from the header files. Sentences and segments in version 4.0 don't match those in the previous version because of things like double line-feeds and mucked-up HTML tags.

I need to have a deeper look at the original English HTML files and bucket them for handoff. Beyond Compare shows me that the text in some files hasn't changed at all, and I'll need to spoon-feed these to the vendor.

Either that or get shot down when I take this estimate up to the third floor for approval...


05 September 2006

The Lonely Localization Manager

It's a bit strange, the way in which I get my work done.

Naturally, I play the role of localization manager and primary contact at developer meetings, and I manage budget and schedule for several projects at once. I've become the lightning rod for issues ranging from character sets to encodings to what-do-these-Chinese-characters-say. All in all, most localization issues are pretty well in hand because I've been able to manage them in a way that conforms to best practices, with a bit of experimentation thrown in to see how much better we can make things.

I suspect, though, that most localization managers who might read this have teams and staff and reporting structures. I would bet they also spend more time scrapping internally with development managers, QA and upper management, struggling to make localization conspicuous and wildly successful. I just make it work.

As localization manager for a software company in the early 1990s, I went through all of that. I get a lot more done this way, and I enjoy it more.

There was that exception last year. I had a client that drove me bonkers because their entire corporate culture was geared to nothing more than tolerating localization, and that with a clothespin on their nose. Wish I'd been blogging during that engagement; it was a wild ride.


01 September 2006

Localization-surfing InstallShield

I like the way in which InstallShield does what it does. It's too bad they won't give me my own copy of the software, though.

It took me several iterations, but I've successfully navigated the areas in the InstallShield UI that do and do not support wide characters. (I thought there was a time when I could put anything anywhere and have it display properly, but suddenly I was getting corrupted characters in some screens in Asian installers. So much for the ailing memory of the noble localization project manager...)

TITLE_MAIN=Glopware v3.0.1 Ko main (ASCII only)
TITLE_CAPTIONBAR=Glopware v3.0.1 설정 caption
COMPANY_NAME=GlopWeb Inc.
PRODUCT_NAME=Glopware v3.0.1 Ko prodname
PRODUCT_KEY=Glopstart.exe (ASCII only)
PRODUCT_VERSION=3.0.1.12
FOLDER_NAME=Glopware v3.0.1 한국어 folder
PRODUCT_DIR=GlopWeb 3.0.1 Ko proddir (ASCII only)

Your mileage may certainly vary, but I found that any time we used double-byte characters in the strings labeled "(ASCII only)", they became corrupted. My hunch is that some other variable or setting is required to support such characters in those places, but I figured I could give the release engineers only so much grief before I suffered backlash.
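
If I could hand the release engineers one thing, it might be a pre-flight check along these lines: a Python sketch that flags wide characters in the fields we learned must stay ASCII. The key names come from the table above; the filename, the key=value format, and the encoding are my assumptions.

# Fields that corrupted whenever we fed them double-byte characters.
ASCII_ONLY_KEYS = {"TITLE_MAIN", "PRODUCT_KEY", "PRODUCT_DIR"}

def check_string_table(path):
    problems = []
    # The real file's encoding may be a legacy codepage; UTF-8 is an assumption.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            key, sep, value = line.strip().partition("=")
            if sep and key in ASCII_ONLY_KEYS and not value.isascii():
                problems.append((lineno, key))
    return problems

for lineno, key in check_string_table("setup_ko.ini"):   # hypothetical filename
    print(f"line {lineno}: {key} contains non-ASCII characters")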

If you're a double-byte InstallShield expert, by all means reply with a better way to localize this.
