02 October 2008

Wordcount Woes - Part 2

If you're working client-side, how many words have you paid for that translators didn't even need to touch?

I posted a couple of weeks ago on translatable words that vendors may miss in analyzing files. Alert reader arithmandar commented that slide decks can be even worse, if there is a lot of verbiage on the master slide that does not get easily captured (although Trados finds these words, according to him/her). Flash is another story altogether, and arithmandar's recommendation is that a Flash engineer should probably perform the analysis.

The other side of the coin is also unpleasant, but for the other party: Clients can hand off vast expanses of words that nobody will translate, artificially inflating the wordcount and estimate.
  • Code samples - If your documentation contains examples of code used in your product (e.g., in an API reference), there is no point in including them in the wordcount, because nobody translates code.
  • XML/HTML/DITA/Doxygen tags - I hope your vendor is parsing these files so that the text inside the tags (especially href values) is ignored. Otherwise, not only will you get back pages that won't work worth a darn, but you'll also be charged for the words. (A rough tag-stripping sanity check is sketched just after this list.)
  • Legal language - Some companies want their license agreements, trademark/copyright statements, and other legal pages left untranslated. (Usually these are American companies.)
  • Directives - Certain directives and warnings apply to certain countries only. The documentation for computer monitors and medical devices often contains a few pages of such directives, which appear in the language of the country requiring them. There is usually set language for these directives, so free translation is not appreciated; have your colleagues in Compliance obtain the language for you, paste it in yourself, and point it out to your vendor.
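
The tag-stripping sanity check mentioned above looks something like this - a minimal sketch in Perl, not a substitute for the vendor's analysis. The regexes are crude (a real HTML parser, or the vendor's own filter settings, will do better), and TM tools count words a bit differently, but it gives you a ballpark figure for how much of the handoff is real prose:

  #!/usr/bin/perl
  # Rough sanity check on billable words: strip the markup and count what's left.
  # Sketch only - the regexes are approximate and won't survive badly malformed pages.
  use strict;
  use warnings;

  my $total = 0;
  foreach my $file (@ARGV) {
      open my $fh, '<', $file or die "Can't open $file: $!";
      local $/;                                             # slurp the whole file
      my $html = <$fh>;
      close $fh;

      $html =~ s/<(script|style|pre|code)\b.*?<\/\1>//gis;  # drop code and script blocks
      $html =~ s/<[^>]+>/ /gs;                              # drop the remaining tags
      my @words = $html =~ /\S+/g;                          # whatever is left is prose
      $total += @words;
      printf "%-40s %6d words\n", $file, scalar @words;
  }
  print "Total: $total words\n";
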
Mind you, there are costs associated with finding and removing all of these words: Do you want to spend time extracting the words? Do you want to hire somebody to find and extract them? Will your savings offset those costs?

If the words to be ignored add up to enough money - as they often do for a couple of our clients - pull them all into a text file and send them to your vendor with instructions to align them against themselves for all languages in the translation memory database. That way, when the vendor analyzes your files, the untranslatable words will fall out as 100% matches.
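
What does "align them against themselves" look like in practice? One possibility - a sketch, assuming your vendor will accept a TMX import, and with header attributes and locale codes that you would adjust to whatever their TM tool expects - is to generate the self-aligned translation units yourself from a plain list of untranslatable strings, one per line:

  #!/usr/bin/perl
  # Sketch: turn a list of untranslatable strings (one per line) into a TMX file
  # in which source and target segments are identical, so the strings analyze
  # as 100% matches. Locale codes and header attributes are placeholders.
  use strict;
  use warnings;

  my @langs = qw(de-DE fr-FR ja-JP);     # hypothetical target locales
  my @strings;
  while (my $line = <>) {
      chomp $line;
      next unless $line =~ /\S/;
      for ($line) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }   # escape for XML
      push @strings, $line;
  }

  print qq{<?xml version="1.0" encoding="UTF-8"?>\n};
  print qq{<tmx version="1.4"><header srclang="en-US" segtype="sentence" };
  print qq{adminlang="en-US" datatype="plaintext" o-tmf="none" };
  print qq{creationtool="self-align-sketch" creationtoolversion="0.1"/>\n<body>\n};
  foreach my $seg (@strings) {
      foreach my $lang (@langs) {
          print qq{<tu>\n};
          print qq{  <tuv xml:lang="en-US"><seg>$seg</seg></tuv>\n};
          print qq{  <tuv xml:lang="$lang"><seg>$seg</seg></tuv>\n};
          print qq{</tu>\n};
      }
  }
  print qq{</body></tmx>\n};

Save it as, say, selfalign.pl, run "perl selfalign.pl untranslatable.txt > selfalign.tmx", and hand the TMX to the vendor to import before they run the analysis.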

Do you have ideas on how to handle such text?


13 July 2007

Where Translation Memory Goes to Die

Have you ever heard that you're better off not going into the kitchen at your favorite restaurant? You're likely to see a number of things you'd rather not associate with a place and a group of people you like.

The same may apply to your translation memory databases. Unfortunately, you don't have the luxury of ignoring them, because things could be dying in there and costing you money.

Let's start with this sentence:

Some interfaces use "redial function/redial context" semantics instead of using IRedial to specify both.

Any TM tool could store this string and its translation without problems. Suppose, though, that the sentence (segment, in TM terms) only looks contiguous when displayed in an HTML browser, which is a very forgiving viewer, and that the source is actually broken into three pieces:

1. Some interfaces use "redial function/redial context" semantics instead of using
2. to specify both.
3. [HTML tags] IRedial.htm [closing HTML tags] IRedial
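
In the raw HTML, that probably looks something like this - my reconstruction, not the actual file - with the anchor element sitting on its own line in the middle of the sentence:

  Some interfaces use "redial function/redial context" semantics instead of using
  <a href="IRedial.htm">IRedial</a>
  to specify both.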

The text comes from include files written by engineers for engineers, and no line is longer than 80 characters. The tags come from the well-intentioned Tech Pubs team, which struggles to introduce some organization, hyperlinking and search capability to the product. This is pretty bruising to TM, which relies on being able to fuzzily match new occurrences to old occurrences of similar text. When the full sentence comes through the TM tool, its correspondence to the three broken fragments in TM is sharply impaired, and you (or I, in this case) pay for it.

It gets worse: if an engineer pushes words from one line to the next between versions, or if the tags are modified, the match rates take the same hit.

I've huddled with engineers, Tech Pubs and the localization house on this matter several times, with little progress to show for it, but here's a new twist:

We've offshored one of these projects to a vendor in China. Their solution was to re-align ALL of the English-language HTML pages from the previous version to ALL of the translated HTML pages of the previous version, effectively re-creating TM. They report about 20% higher match rates after doing this. I think this is because they're embracing the broken, dead segments in TM and finding them in the source files for the new version.

This seems like a counterintuitive approach, but who can argue with the benefits?
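
For anyone tempted to try the same thing: the alignment itself is a job for an alignment tool (WinAlign, or whatever your vendor uses), but the grunt work of pairing up last version's English pages with their translated counterparts is scriptable. A sketch, with hypothetical directory names:

  #!/usr/bin/perl
  # Sketch: pair last version's English HTML pages with their translated
  # counterparts by relative path, as input for an alignment tool. The
  # directory names are hypothetical; the alignment itself is done by the
  # tool, not by this script.
  use strict;
  use warnings;
  use File::Find;

  my ($en_dir, $ja_dir) = ('v3.0/en', 'v3.0/ja');     # hypothetical layout
  my @pairs;
  find(sub {
      return unless /\.html?$/i;
      my $rel = $File::Find::name;
      $rel =~ s/^\Q$en_dir\E\/?//;
      push @pairs, [$File::Find::name, "$ja_dir/$rel"] if -e "$ja_dir/$rel";
  }, $en_dir);

  print "$_->[0]\t$_->[1]\n" for @pairs;              # feed the list to the aligner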


26 February 2007

Translation non-savings, Part I

How far will you go to improve your localization process?

Because of how localization is viewed in many companies, the best improvements are the ones that lower cost. Low cost helps keep localization inconspicuous, which is how most companies want it.

But if a big improvement didn't save any obvious money, would your organization go for it?

Elsewhere in this blog I relate the saga of the compiled help file with 3500+ HTML pages in it. These pages come from a series of Perl scripts that we run on the header files to extract all of the information about the product's API and wrap it up in a single, indexed, searchable CHM. In a variety of experiments, we've sought to move the focus of translation from the final HTML files to a point further upstream, at or near the header files themselves. If the raw content were translated, we believe, all downstream changes in the Perl scripts, which get revised quite often, would be imposed automatically on the localized CHM.
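
For readers who haven't lived with one of these pipelines, here is a toy illustration of the general shape - emphatically not our scripts, which do a great deal more (cross-references, the index, the CHM compile) - just enough to show that the scripts, not a writer, decide what the HTML looks like:

  #!/usr/bin/perl
  # Hypothetical sketch of the kind of extraction such a pipeline performs:
  # pull the documentation comments out of a header file and wrap them in
  # minimal HTML. The real scripts are far more elaborate; the point is that
  # any change to them reshapes every downstream page at once.
  use strict;
  use warnings;

  local $/;                                     # slurp the header file
  my $header = <>;
  my @blocks = $header =~ m{/\*\*(.*?)\*/}gs;   # grab /** ... */ comment blocks

  print "<html><body>\n";
  foreach my $block (@blocks) {
      $block =~ s/^\s*\*? ?//mg;                # strip the leading asterisks
      $block =~ s/&/&amp;/g;  $block =~ s/</&lt;/g;  $block =~ s/>/&gt;/g;
      print "<p>\n$block\n</p>\n";
  }
  print "</body></html>\n";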

One of the biggest cost items - we have suspected - is due to changes in line wrapping and other HTML variations that confuse TM tools into thinking that matches are fuzzier than they really are. The false positives look like untranslated words when analyzed, so the wordcounts rise, and not in our favor.

"If we work with raw text, before HTML formatting," our thinking goes, "the match rate will rise."

Not.

I'll describe my experiment shortly.


07 October 2006

Localization and the Perl Script

After some cajoling, I've prevailed on our tech-writer-who-doesn't-do-any-writing to modify his Perl scripts. The changes will remove the thousands of CRLFs (hard returns) in the 3700 extracted HTML files and result in better Trados matching between the new files and translation memory.

Then, of course, it will take a few hours' perusal to see what breaks as a result of that fix.

It seems to be an unsung inconvenience of localization that
a
sentence
put
together
with
these
words
and
looking
like
this

separated by hard returns in the raw HTML file (which you can see by viewing source in a browser) becomes

a sentence put together with these words and looking like this

when viewed in a browser. The translation memory tools, of course, see the hard returns and try in vain to match accordingly, which causes a fair bit of head-scratching for anyone who views the files only through a browser.
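
The fix itself - whether it goes into the scripts or runs as a post-pass over the generated files - can be surprisingly small. A sketch of the kind of thing, not the actual change: join any line that ends in a word character to a following line that starts with a lowercase letter. Run against the example above, it produces the joined sentence; run against real files, it would need tuning (to leave <pre> blocks and tag-only lines alone, and adjust the glob if the pages end in .html), and you would want to keep the .bak copies it makes:

  perl -0777 -pi.bak -e 's/(\w)\r?\n(?=[a-z])/$1 /g' *.htm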


20 September 2006

Segmentation and Translation Memory

To get the broken sentences in the new files to find their equivalents (or even just fuzzy matches) in translation memory, we have three options:

  1. Modify the Perl scripts that extract the text from the header files into the HTML, so that the scripts no longer introduce the hard returns.
  2. Massage the HTML files themselves and replace the hard returns with spaces.
  3. Tune the segmentation rules in Trados such that it ignores the hard returns (but only the ones we want it to ignore) and doesn't consider the segment finished until it gets to a hard stop/period.
To go as far upstream as possible, I suppose we should opt for #1 and fix the problem at its source. This seems optimal, unless we subsequently break more things than we repair. Options #2 and #3 are neat hacks and good opportunities to exercise fun tools, but they burn up time and still don't fix the problem upstream.

Also, I don't want the tail to wag the dog. The money spent in translating false positives may be less than the time and money spent in fixing the problem.


10 September 2006

Who's in trouble: the Localization Vendor or me?

The localization estimate has come back on the HTML files in the API Reference, and it's as ghastly high as I'd feared.

The vendor's project manager does something clever with these thousands of files: she uses SDLX Glue to glue them together into six or seven batches of several hundred files each. That way she avoids carpet-bombing the translator with jillions of files; this also keeps the translator in the translation business and out of the file management business. After translation, the project manager un-glues them using SDLX Glue and hands them off internally for engineering, QA, etc.

The downside to this technique is that the TM analysis covers only the six or seven glued files. I can't see down to the level of granularity I want unless I ask for a special analysis broken out by individual file. They don't mind doing it for me, but it's not in their regular workflow and I have to wait for it.

Anyway, the count of unmatched words is preposterously high, and I'm pretty sure it's due to changes in the scripts that extract the HTML from the header files. Sentences and segments in version 4.0 don't match those in the last version because of things like double line-feeds and mucked up HTML tags.

I need to have a deeper look at the original English HTML files and bucket them for handoff. BeyondCompare shows me that the text in some files hasn't changed at all, and I'll need to spoon-feed these to the vendor.
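
One way to do that bucketing without eyeballing every pair in BeyondCompare is to strip the markup from the old and new version of each page and compare only the visible text. A sketch, with hypothetical directory names and a deliberately crude tag strip:

  #!/usr/bin/perl
  # Sketch: bucket pages by whether their visible text changed between versions.
  # Tags are stripped crudely and whitespace is collapsed, so pure re-wrapping
  # and markup churn don't count as changes. Directory names are hypothetical.
  use strict;
  use warnings;

  my ($old_dir, $new_dir) = ('v3.0/html', 'v4.0/html');

  sub text_of {
      my ($path) = @_;
      open my $fh, '<', $path or return undef;
      local $/;
      my $html = <$fh>;
      close $fh;
      $html =~ s/<[^>]+>/ /gs;    # crude tag strip
      $html =~ s/\s+/ /g;         # ignore wrapping and whitespace differences
      return $html;
  }

  opendir my $dh, $new_dir or die "Can't read $new_dir: $!";
  foreach my $file (sort grep { /\.html?$/i } readdir $dh) {
      my $old = text_of("$old_dir/$file");
      my $new = text_of("$new_dir/$file");
      my $bucket = !defined $old        ? 'NEW FILE'
                 : $old eq $new         ? 'TEXT UNCHANGED'
                 :                        'TEXT CHANGED';
      printf "%-15s %s\n", $bucket, $file;
  }
  closedir $dh;

The TEXT UNCHANGED bucket is the set to spoon-feed to the vendor as known 100% matches, or to withhold from the handoff entirely.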

Either that or get shot down when I take this estimate up to the third floor for approval...


07 August 2006

Trados, you've got to be kidding!

I handed off the preliminary version of the 3280 HTML pages to the localization vendor for analysis against the existing TM. What came back was most ugly.

Out of 400,000 words, the analysis finds that over 220,000 are not exact matches, which is inaccurate at best and depressing at worst.

-Maybe the translation units in the TM are including the HTML tags, which have changed greatly over the years.

-Maybe the TM has not yet been updated from the last round of localization.

-Maybe the TM is corrupted.

-Maybe the engineers have made small, search-and-replace changes in thousands of places, modifying details that don't really matter in the translation. Suppose they changed "recieve" to "receive" in 500 segments, totaling 5,000 words. The segments would not be counted as exact matches - which costs me - but they would nonetheless go to the translators, who would see that there is nothing to change in the translation: "Money for old rope," as they say in the UK.

This bears some more investigation, because I don't trust that TM wordcount.
