26 February 2007

Translation non-savings, Part I

How far will you go to improve your localization process?

Because of how localization is viewed in many companies, the best improvements are the ones that lower cost. Low cost helps keep localization inconspicuous, which is how most companies want it.

But if a big improvement didn't save any obvious money, would your organization go for it?

Elsewhere in this blog I relate the saga of the compiled help file with 3500+ HTML pages in it. These pages come from a series of Perl scripts that we run on the header files to extract all of the information about the product's API and wrap it up in a single, indexed, searchable CHM. In a variety of experiments, we've sought to move the focus of translation from the final HTML files to a point further upstream, at or near the header files themselves. If the raw content were translated, we believe, all downstream changes in the Perl scripts, which get revised quite often, would be imposed automatically on the localized CHM.

One of the biggest cost items, we have suspected, comes from changes in line wrapping and other HTML variations that confuse TM tools into treating matches as fuzzier than they really are. The words in those downgraded segments get counted as untranslated when the files are analyzed, so the wordcounts rise, and not in our favor.
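To make that suspicion concrete, here is a minimal sketch of how a change in line wrapping alone can drag down a similarity score. It uses Python's standard difflib as a stand-in for a TM tool's fuzzy matching, which is an assumption made for illustration; commercial TM tools use their own algorithms. The sample sentence and the helper names are hypothetical.

```python
import difflib


def match_ratio(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0 (difflib, not a real TM engine)."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def normalize(text: str) -> str:
    """Collapse every run of whitespace, including line breaks, to one space."""
    return " ".join(text.split())


# Hypothetical segment as stored in the TM, and the same text after a
# script change re-wrapped and indented it in the generated HTML.
tm_segment = "Returns a handle to the newly created device context."
rewrapped = "Returns a handle to the\n    newly created device context."

raw = match_ratio(tm_segment, rewrapped)
clean = match_ratio(normalize(tm_segment), normalize(rewrapped))

print(f"as extracted:       {raw:.3f}")
print(f"whitespace-cleaned: {clean:.3f}")
```

The wording is identical in both strings; only the whitespace differs. Once the whitespace is normalized the two compare as equal, which is the effect we hoped to get by translating upstream of the HTML.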

"If we work with raw text, before HTML formatting," our thinking goes, "the match rate will rise."

Not.

I'll describe my experiment shortly.


2 Comments:

  • Hi John, This is what I would expect from TM. Most people assume that TM tools parse text as strings, line by line. But it's a little more complex than that. TM tools expect data to come with formatting tags etc. TM tools see segments, not sentences. So with code that is tagged, it's easy to find a complete sentence. With raw text the TM tools have less to work with to define a segment or sentence.

    Take a simple example like this:

    "At I.B.M. we always file I.R.S. returns on the 4th. day of the month."

    Now, based on raw text, where does the sentence end? Can you assume that a full stop is the end of a sentence? If you do, the TM will break this sentence into multiple segments. With tags in HTML or XML you can easily find the entire block of text.

    So that's why, when you just produce raw text, you don't always get better TM results.

    Damian Scattergood
    MD STAR-TS Ireland.
    http://www.star-ts.com

    By Anonymous, at 14:15

  • Damian:

    Right you are. See "Part II" for the results of the experiment.

    I was trying to get as close to the source of the documentation as possible (before HTML formatting), and I managed to do so, but the translation savings broke down for different, more frustrating reasons.

    Thanks for commenting.

    John

    By John White, Localization Guy, at 14:39
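
The segmentation problem raised in the first comment can be sketched in a few lines. This is only an illustration under stated assumptions: the full-stop rule below is deliberately naive, the `<p>` markup is hypothetical, and neither function reflects how any particular TM tool actually segments text.

```python
import re


def naive_segments(text: str) -> list[str]:
    """Split after every full stop that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def tagged_blocks(html: str) -> list[str]:
    """Return the text of each <p>...</p> block as one unit (hypothetical markup)."""
    return [b.strip() for b in re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)]


sentence = "At I.B.M. we always file I.R.S. returns on the 4th. day of the month."

# Raw text: the full-stop rule cuts the sentence at each abbreviation.
for segment in naive_segments(sentence):
    print("raw segment:", segment)

# Tagged text: the markup marks where the block of text begins and ends.
for block in tagged_blocks(f"<p>{sentence}</p>"):
    print("tagged block:", block)
```

With the full-stop rule the one sentence comes out as four fragments, none of which would match a TM entry for the whole sentence, while the tagged version yields the sentence intact.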
