11 September 2008

Wordcount Woes - Part 1

Do you spend much time fretting about wordcount?

My hunch is that translators worry about it more than agencies do, because it's often the only metric by which translators earn their daily bread. Agencies have project management, layout, graphics, consulting, rush charges and other line items to fall back on, but most translators have exactly one line item on their invoices: wordcount.

I suppose that we all live and die by it because everybody's calculations come down to wordcount - either source or target text - sooner or later. But no two tools define a "word" the same way, so counts can vary considerably from tool to tool.
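Here's a quick illustration of why the numbers drift. No TM tool counts exactly this way - their algorithms are proprietary - but this little Python sketch shows two perfectly defensible definitions of "word" disagreeing on the same sentence:

import re

text = 'Plug the AC/DC adapter into the wall outlet, then restart.'

# Rule 1: a word is anything separated by whitespace.
# "AC/DC" and "outlet," each count as one word.
print(len(text.split()))                       # 10

# Rule 2: a word is any unbroken run of letters or digits.
# "AC/DC" now counts as two words.
print(len(re.findall(r'[A-Za-z0-9]+', text)))  # 11

Multiply a one-word-in-ten disagreement across a few hundred thousand words and the quotes start to diverge in earnest.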

Still, the bigger issue with wordcount is "wordcount leakage." If you're working vendor-side, how many times have you quoted on a project, then realized that you had overlooked a chunk of text?

  • Graphics are the biggest culprit. The document contains charts and diagrams that require translation, but TM tools don't find those words. Many vendors wisely exclude such text from wordcount and cover it in an hourly or per-graphic charge. (Nobody can ever find the source files for the graphics so that you can localize them properly, but that's a whole other talk show.)
  • Bookmarked text is also slippery. It appears as text (sentences, paragraphs) in one place and is referenced in other places in the document. True, you only translate it in one place, but you need to deal with it - layout, formatting, page flow - everywhere else it appears.
  • Conditional text, a favorite of FrameMaker professionals, can also cause you trouble. If you don't calculate wordcount with the conditions set to expose all of the text, you may miss some of it. The author should arrange for this before handoff.
  • Embedded documents (spreadsheets, word processing files, HTML, presentations) are very sneaky. We just saw this the other day with an MS Word document that contained several embedded spreadsheets, visible only as 1cm-square icons on the page; double-clicking an icon opened the embedded file. TM tools don't see those words, but the client certainly would have if they had come back untranslated. Fortunately, we caught this in time. (A scripted check for this particular leak follows the moral below.)
The Moral: Two pairs of eyes should review every file before the TM analysis, NOT one pair of eyes and a TM software package.
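That said, at least one leak is scriptable. If your Word files are in the newer .docx format, each one is just a ZIP archive, and embedded objects sit under word/embeddings/ inside it. A minimal Python sketch for flagging them before handoff:

import sys
import zipfile

# A .docx file is a ZIP archive; embedded spreadsheets,
# presentations and other objects live under word/embeddings/.
def embedded_objects(docx_path):
    with zipfile.ZipFile(docx_path) as archive:
        return [name for name in archive.namelist()
                if name.startswith('word/embeddings/')]

for path in sys.argv[1:]:
    for name in embedded_objects(path):
        print(path, '->', name)

Anything this prints contains text your TM tool will never see. (Older binary .doc files need a different approach; there, the second pair of eyes is still your best defense.)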


16 November 2007

Where do your glossaries live?

The experienced project manager with your localization/translation vendor approaches a new client/project by asking you, "Has this ever been translated before?" Her big goal is to discover whether there's a translation memory database floating around, to help her translators do their work more quickly and keep your costs low, and her background goal is to find existing documents with key terms already translated and approved.

Smart companies maintain these key terms in a "glossary" or terminology list. Glossaries are far less comprehensive than translation memory because they serve a slightly different purpose: Instead of proposing a fuzzy-match translation for an entire sentence, they serve as a reference for the translators. Good translators know how to find translations for generally accepted terms like "closed-loop servomechanism" and "high-definition multimedia interface," but if the sales manager in your Shanghai office has already told you how he likes to see the word translated, everybody will be happier if that preference is observed.

So where do your glossaries live?

"Live" is the important word, because glossaries change and grow with time. Most glossaries I've seen are in a spreadsheet or word processing document. While that's better than nothing, it can suffer from decentralization, since updates don't always make it to everybody involved in the project, and some translators run the risk of using old terminology.

One of my more localization-savvy clients makes its glossary available on its partner portal, behind a login and password. The PHP-based application, which is actually hosted by a translation vendor, allows searching in multiple languages. My client deliberately does not make the glossary available for download or export; this ensures that everybody is using the same version, with all updates.
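Conceptually, the application is nothing more exotic than a term base with a language filter in front of it. Here's a toy Python sketch of the idea - the terms and translations are invented for illustration:

GLOSSARY = {
    'closed-loop servomechanism': {
        'de': 'Servomechanismus mit geschlossenem Regelkreis',
        'zh': '闭环伺服机构',
    },
}

def lookup(term, lang):
    # One authoritative copy, queried per language, never exported.
    return GLOSSARY.get(term.strip().lower(), {}).get(lang)

print(lookup('Closed-loop servomechanism', 'de'))

The real application adds authentication and a search interface, but the essential property is the same: there is exactly one copy of the data, and every query hits it.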

I like this model. The assets reside on the client/owner's site, and the terminology "lives" with the linguistic experts, who can easily modify it. It's a bit more work for the translators, who would rather have a flat-file document, but overall it serves linguistic interests well. (Terminology lookup itself is tried-and-true technology, built into most computer-aided translation tools.)

What are you doing with your glossaries?


13 July 2007

Where Translation Memory Goes to Die

Have you ever heard that you're better off not going into the kitchen at your favorite restaurant? You're likely to see a number of things you'd rather not associate with a place and a group of people you like.

The same may apply to your translation memory databases. Unfortunately, you don't have the luxury of ignoring them, because things could be dying in there and costing you money.

Let's start with this sentence:

Some interfaces use "redial function/redial context" semantics instead of using IRedial to specify both.

Any TM tool could store this string and its translation without problems. Suppose, though, that the sentence (segment, in TM terms) only looks contiguous when displayed in an HTML browser, which is a very forgiving viewer, and that the source is actually broken into three pieces:

1. Some interfaces use "redial function/redial context" semantics instead of using
2. to specify both.
3. [HTML tags] IRedial.htm [closing HTML tags] IRedial

The text comes from include files written by engineers for engineers, and no line is longer than 80 characters. The tags come from the well-intentioned Tech Pubs team, which struggles to introduce some organization, hyperlinking and search capability to the product. This is pretty bruising to TM, which relies on being able to fuzzily match new occurrences to old occurrences of similar text. When the full sentence comes through the TM tool, its correspondence to the three broken fragments in TM is sharply impaired, and you (or I, in this case) pay for it.
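You can put a rough number on the damage. Trados's match scoring is proprietary, so this Python sketch substitutes difflib, and the anchor tag is just my guess at the bracketed HTML above, but the shape of the problem is the same: no single unit in TM resembles the full sentence closely enough for a healthy fuzzy match.

from difflib import SequenceMatcher

full = ('Some interfaces use "redial function/redial context" '
        'semantics instead of using IRedial to specify both.')

fragments = [  # what actually sits in TM
    'Some interfaces use "redial function/redial context" '
    'semantics instead of using',
    'to specify both.',
    '<a href="IRedial.htm">IRedial</a>',
]

for frag in fragments:
    score = SequenceMatcher(None, full, frag).ratio()
    print(f'{score:.0%}  {frag[:40]}')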

It gets worse. If an engineer pushes words from one line to the next between versions, or if the tags are modified, match rates suffer in the same way.

I've huddled with engineers, Tech Pubs and the localization house on this matter several times, with little progress to show for it, but here's a new twist:

We've offshored one of these projects to a vendor in China. Their solution was to re-align ALL of the English-language HTML pages from the previous version to ALL of the translated HTML pages of the previous version, effectively re-creating TM. They report about 20% higher match rates after doing this. I think this is because they're embracing the broken, dead segments in TM and finding them in the source files for the new version.

This seems like a counterintuitive approach, but who can argue with the benefits?
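I don't know which alignment tool the vendor used, but the principle is simple: walk the two HTML trees in parallel, pair files by name, segment each file, and pair the segments. Here's a deliberately naive Python sketch that assumes identical file names and a one-to-one segment correspondence - real aligners score candidate pairs and handle language-specific punctuation:

import os
import re

def segments(path):
    # Naive: strip tags, collapse whitespace, split on
    # sentence-final punctuation.
    with open(path, encoding='utf-8') as f:
        text = re.sub(r'<[^>]+>', ' ', f.read())
    text = ' '.join(text.split())
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def realign(source_dir, target_dir):
    tm = []
    for name in sorted(os.listdir(source_dir)):
        partner = os.path.join(target_dir, name)
        if os.path.exists(partner):
            tm.extend(zip(segments(os.path.join(source_dir, name)),
                          segments(partner)))
    return tm

The payoff: whatever breakage the scripts inflicted on the old English pages lands identically in the re-created TM, so the new version's broken segments find their broken twins.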


27 April 2007

Getting the Writers to Care about Localized Documents

Do your technical writers go through the localized documents before handing them off to production?

I thought not.

It is, of course, just one more thing on a writer's already crowded to-do list. Add to that the appeal of going through a book in a language the writer probably doesn't know a word of, and you have a recipe for can't/won't/don't want to.

You can go through it yourself, localization manager that you are, and you'll probably find a few things wrong. But the writers are looking for very different things, and they have a talent for spotting them immediately. If you can get your writers past the inconvenience of the exercise, you'll find that they add real value. The round trip into and out of translation software can break things in a large document, and who better to detect such breakage - even with no more than a cursory overview - than the people who wrote the book in the first place?

I've seen writers go through translated versions of their documents and find:
  • unexplained typeface changes
  • broken or dead hyperlinks
  • missing callouts
  • untranslated text
  • incorrect document part numbers
  • corrupted graphics
The real showstopper, though, occurs at the end of a two-month translation cycle for a 300-page manual, when the writer spends ten minutes going through the book, then sends you e-mail that reads, "Nice job on the Chinese manual, but you got the wrong version translated."

Maybe not the optimal time to find this out, but once again: Who besides the writer would have caught this?


26 February 2007

Translation non-savings, Part I

How far will you go to improve your localization process?

Because of how localization is viewed in many companies, the best improvements are the ones that lower cost. Low cost helps keep localization inconspicuous, which is how most companies want it.

But if a big improvement didn't save any obvious money, would your organization go for it?

Elsewhere in this blog I relate the saga of the compiled help file with 3500+ HTML pages in it. These pages come from a series of Perl scripts that we run on the header files to extract all of the information about the product's API and wrap it up in a single, indexed, searchable CHM. In a variety of experiments, we've sought to move the focus of translation from the final HTML files to a point further upstream, at or near the header files themselves. If the raw content were translated, we believe, all downstream changes in the Perl scripts, which get revised quite often, would be imposed automatically on the localized CHM.

One of the biggest cost items - we have suspected - is due to changes in line wrapping and other HTML variations that confuse TM tools into thinking that matches are fuzzier than they really are. The false positives look like untranslated words when analyzed, so the wordcounts rise, and not in our favor.

"If we work with raw text, before HTML formatting," our thinking goes, "the match rate will rise."

Not.

I'll describe my experiment shortly.


08 February 2007

Rights to the Localized Product

You need to keep an eye out for TRM ("translation management rights," a term I've just coined), because hopeful thinking alone won't enforce them.

If you're creating software, documentation or a Web site in English, who owns it? Your company, of course. True, the words that went into the product came out of your brain and fingers, and not those of your company, but it's work you've done for hire, so your company owns it. This is probably spelled out in your employment agreement.

Most of the time, this will apply to the localized version of your product as well, but it's worth being clear about it with your vendor. True, the translated words came out of the brain and fingers of the translators, but you wouldn't want the translators to own it even if they were in your employ, let alone when they're outside contractors, which they usually are.

I hadn't thought about this for ages until a recent project took me to the site of the Association of Finnish Translation Companies - SKTOL. Since a prospective vendor referred to them in his estimate, I took the time to read their General Terms and Conditions and found the following among them:
7. COPYRIGHT

The company holds the copyright as referred to in the Copyright Act (404/1961) to the translated text unless otherwise agreed. The company assigns the right of use to the translated text in the extent and for the purpose required by the commission.

Unless otherwise agreed, the company holds all rights to the translation memories generated in conjunction with the work it carries out.

This struck me as odd, and either stunningly progressive or hopelessly out of date. Of course, they don't claim "exclusive copyright" or "exclusive rights to translation memories," so they wouldn't likely withhold either from the client, but these struck me as singular rights for a contractor to try and claim.

None of my clients would have any part of this. Would your company?


07 October 2006

Localization and the Perl Script

After some cajoling, I've prevailed on our tech-writer-who-doesn't-do-any-writing to modify his Perl scripts. The changes will remove the thousands of CRLFs (hard returns) in the 3700 extracted HTML files and result in better Trados matching between the new files and translation memory.

Then, of course, it will take a few hours' perusal to see what breaks as a result of that fix.

It seems to be an unsung inconvenience of localization that
a
sentence
put
together
with
these
words
and
looking
like
this

separated by hard returns in the raw HTML file (which you can see by viewing source in a browser) becomes

a sentence put together with these words and looking like this

when viewed in a browser. The translation memory tools, of course, see the hard returns and try in vain to match accordingly, which causes a fair bit of head-scratching for anyone who views the files only through a browser and never sees the breaks.


20 September 2006

Segmentation and Translation Memory

To get the broken sentences in the new files to find their equivalents (or even just fuzzy matches) in translation memory, we have three options:

  1. Modify the Perl scripts that extract the text from the header files into the HTML, so that the scripts no longer introduce the hard returns.
  2. Massage the HTML files themselves and replace the hard returns with spaces (a rough sketch of this follows below).
  3. Tune the segmentation rules in Trados such that it ignores the hard returns (but only the ones we want it to ignore) and doesn't consider the segment finished until it gets to a hard stop/period.
To go as far upstream as possible, I suppose we should opt for #1 and fix the problem at its source. This seems optimal, unless we subsequently break more things than we repair. Options #2 and #3 are neat hacks and good opportunities to exercise fun tools, but they burn up time and still don't fix the problem upstream.
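For the record, here is roughly what the #2 massage might look like in Python. It's crude, and it's exactly the kind of hack that can break more than it fixes - it will happily join lines across inline markup it shouldn't - which is part of why #1 remains the right answer:

import re

def unwrap(html):
    # Replace a line break with a single space when the preceding
    # character is neither sentence-final punctuation nor a closing
    # '>', and the break is followed by more text.
    return re.sub(r'(?<=[^\s.!?>])[ \t]*(?:\r?\n[ \t]*)+(?=\S)',
                  ' ', html)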

Also, I don't want the tail to wag the dog. The money spent in translating false positives may be less than the time and money spent in fixing the problem.


15 September 2006

Moving the Localization Carpet under the Source Text

Here's the mess I face.

The HTML files are filled with paragraphs formatted like this:

Currently, this function gets called for trust overrides and

client authentication. On client authentication, the supplied

interface contains the server's Certificate Authorities Distinguished

Names list (see references) and the negotiation handler

always gets called so as to give a chance to the client to supply

the correct client certificate based on the DN list.


At the end of each line are two hard returns. It wasn't always this way, so each complete sentence is sitting happily in translation memory. Unfortunately, Trados pulls in each of these six 80-character fragments and calls it low- or no-match because it can't find enough of a concordance. This is a classic case of false positives driving up translation costs.

I'm still exploring options. Meanwhile, there's no sense in starting the translation work.


13 August 2006

On the varieties of Translation Memory mismatches

A close look at two different versions of the API Reference and the TM was enlightening.

Whereas the early version of the API Reference had multiple instances of

"These are filled by

Interface A and can be examined in the callback function."

and the current version has

"These are filled by
Interface A and can be examined in the callback function."

TM has

"These are filled by Interface A and can be examined in the callback function."

Ordinarily, segmentation rules in the TM tools would ignore the whitespace and CRLFs and see the sentence as a single translation unit, but it appears that the TM tool cannot (or will not) ignore all of the whitespace. It could also be that ignoring it correctly here would have adverse effects elsewhere in the document. It's also more or less successful in ignoring "unimportant" differences inside HTML tags.
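The fix the tools ought to apply is simple in principle: collapse all whitespace before comparing. A tiny Python sketch shows that, once normalized, all three variants are the same translation unit:

def normalize(segment):
    # Collapse every run of spaces, tabs and CRLFs to one space.
    return ' '.join(segment.split())

early   = 'These are filled by\r\n\r\nInterface A and can be examined in the callback function.'
current = 'These are filled by\r\nInterface A and can be examined in the callback function.'
tm      = 'These are filled by Interface A and can be examined in the callback function.'

assert normalize(early) == normalize(current) == normalize(tm)

The catch, as noted above, is that a rule aggressive enough to fix this segment may do damage elsewhere in the document - inside preformatted code samples, for instance.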

In any event, because the engineers are constantly tweaking the Perl scripts, the source header files are still much more stable than the extracted HTML files. We should figure out a way to localize the former and generate the latter using scripts, as we do for English.


10 August 2006

Localization beyond the Translation Memory tools

Still reeling from the preliminary wordcount on the HTML pages in the API Ref...

While it's true that 3280 pages is not a phenomenal amount, it's too onerous a number for review and vetting by a single human, particularly when I would be that human.

I've dusted off Beyond Compare for some preliminary testing. I think I can use it to pour these pages into a few different buckets (a first-pass script is sketched after the list):

1) New, orphan pages - newly written content
2) Pages which have not changed at all since the last time I handed them off for localization
3) Pages which have changed immaterially (datestamp in footer, etc.) since last handoff
4) Pages which have changed for reasons that won't matter to translators (format changes, cleaned up typo's in English)
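Buckets 1 and 2 are easy to script; buckets 3 and 4 are where Beyond Compare's ignore-rules (or some normalization of datestamps and formatting) have to earn their keep. A rough Python sketch of the first pass, assuming the old and new handoffs sit in parallel directories:

import filecmp
import os

def first_pass(old_dir, new_dir):
    new, identical, changed = [], [], []
    for name in sorted(os.listdir(new_dir)):
        old_path = os.path.join(old_dir, name)
        if not os.path.exists(old_path):
            new.append(name)        # bucket 1: orphan pages
        elif filecmp.cmp(old_path, os.path.join(new_dir, name),
                         shallow=False):
            identical.append(name)  # bucket 2: unchanged
        else:
            changed.append(name)    # buckets 3 and 4: need a
                                    # content-aware second pass
    return new, identical, changed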

The question is: Will this lead to a higher or lower margin of error than I get when I simply throw all 3280 pages into Trados?

This is the stuff localization consulting is made of.


07 August 2006

Trados, you've got to be kidding!

I handed off the preliminary version of the 3280 HTML pages to the localization vendor for analysis against the existing TM. What came back was most ugly.

Out of 400,000 words, the analysis finds that over 220,000 are not exact matches, which is inaccurate at best and depressing at worst.

- Maybe the translation units in the TM include the HTML tags, which have changed greatly over the years.

- Maybe the TM has not yet been updated from the last round of localization.

- Maybe the TM is corrupted.

- Maybe the engineers have made small, search-and-replace changes in thousands of places, modifying details that don't really matter in the translation. Suppose they changed "recieve" to "receive" in 500 segments, totaling 5,000 words. The segments would not be counted as exact matches - which costs me - but they would nonetheless go to the translators, who would see that there is nothing to change in the translation: "Money for old rope," as they say in the UK.

This bears some more investigation, because I don't trust that TM wordcount.
