13 July 2007

Where Translation Memory Goes to Die

Have you ever heard that you're better off not going into the kitchen at your favorite restaurant? You're likely to see a number of things you'd rather not associate with a place and a group of people you like.

The same may apply to your translation memory databases. Unfortunately, you don't have the luxury of ignoring them, because things could be dying in there and costing you money.

Let's start with this sentence:

Some interfaces use "redial function/redial context" semantics instead of using IRedial to specify both.

Any TM tool could store this string and its translation without problems. Suppose, though, that the sentence (segment, in TM terms) only looks contiguous when displayed in an HTML browser, which is a very forgiving viewer, and that the source is actually broken into three pieces:

1. Some interfaces use "redial function/redial context" semantics instead of using
2.
to specify both.
3.[HTML tags] IRedial.htm [closing HTML tags] IRedial

The text comes from include files written by engineers for engineers, and no line is longer than 80 characters. The tags come from the well-intentioned Tech Pubs team, which struggles to introduce some organization, hyperlinking and search capability to the product. This is pretty bruising to TM, which relies on being able to fuzzily match new occurrences to old occurrences of similar text. When the full sentence comes through the TM tool, its correspondence to the three broken fragments in TM is sharply impaired, and you (or I, in this case) pay for it.

It gets worse. If an engineer pushes words from one line to the next between versions, or if the tags are modified, the impact on match-rates is similarly impaired.

I've huddled with engineers, Tech Pubs and the localization house on this matter several times, with little progress to show for it, but here's a new twist:

We've offshored one of these projects to a vendor in China. Their solution was to re-align ALL of the English-language HTML pages from the previous version to ALL of the translated HTML pages of the previous version, effectively re-creating TM. They report about 20% higher match rates after doing this. I think this is because they're embracing the broken, dead segments in TM and finding them in the source files for the new version.

This seems like a counterintuitive approach, but who can argue with the benefits?

Labels: , , , , ,

0 Comments:

Post a Comment

<< Home