04 December 2008

Localizing DITA Projects

Have you seen DITA projects land in your inbox yet? The full promise of XML is about to become your next headache.

If you don't know what DITA is, here's the thumbnail from the Open Toolkit's User Guide:
"DITA (Darwin Information Typing Architecture) is an XML-based, end-to-end architecture for authoring, producing, and delivering information (often called content) as discrete, typed topics."
In short, the source content you hand off for localization lives in XML files. If you get to the party soon enough, you can help your own cause by asking the authors to use specific XML tags in their authoring to make it easy for you to find text you need to translate and to ignore text you don't need to translate. The authors will surely fall all over themselves to make you happy with this new technology, so take advantage of it while it's still novel.
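DITA's standard translate attribute is the usual handle for this. Here is a minimal sketch of what to ask the authors for (the topic content itself is invented for illustration):

```xml
<topic id="backlight" xml:lang="en-us">
  <title>Backlight settings</title>
  <body>
    <p>Call the following function to read the current state:</p>
    <!-- translate="no" tells downstream tools to leave this segment alone -->
    <codeblock translate="no">IBACKLIGHT_GetBacklightInfo(p, pInfo);</codeblock>
  </body>
</topic>
```

A vendor's filters can then skip everything flagged translate="no" automatically, instead of guessing which strings are code.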

The problem with XML is that it's ugly and nobody can use it as documentation in that format, so it needs to be transformed into HTML, PDF, CHM, XHTML, or some other gestalt that people will use. The DITA Open Toolkit is an open-source means for performing this transformation, using scripts and languages to shape the content.

Your problem as a localization professional is not in the XML; it's in the transformation.

How do you know that the scripts your writers use for the source language (let's say, English) will work when you have to run them on XML files translated into Korean or Hebrew or Russian? (Well, they will run; the question is whether the result is good or garbage.)

With a kit like the Open Toolkit, things run as advertised when used right out of the box. The open-source project even devotes a chapter of its user guide to "Localizing (translating) your DITA content," and they are kind enough to provide pre-translated text like "Parent Topic," "Previous," "Next," which you can hook with the xml:lang attribute. The tricky part lies in the customization.

One Tech Pubs team engaged a group of script programmers to customize the toolkit. They've introduced strings like "Copyright Statement" and "Enter keyword" and placed a "Last updated" datestamp on every page in the help project. They've also implemented a search function (gulp!) so users can locate content in the help files. There's nothing wrong with this customization work, except that nobody was thinking of other languages while doing it. Now we're sorting out the location of the custom strings, the way to get the toolkit to format dates according to locale, and how to convince the search function that characters can take up more than one byte.
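That last problem usually boils down to a tool counting bytes where it should be counting characters. A quick illustration (Python here purely for demonstration; the client's search code is in another language entirely):

```python
# A naive byte-oriented search index assumes one byte per character.
# In UTF-8, that assumption breaks as soon as non-ASCII text appears.
title = "バックライト設定"  # "Backlight settings" in Japanese

chars = len(title)                    # what the user sees: 8 characters
octets = len(title.encode("utf-8"))   # what a byte-counting index sees: 24 bytes

print(chars, octets)  # 8 24
```

An index built on the byte count will slice multi-byte characters in half, which is why searches return garbage or nothing at all.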

You will face the same problems. You'll need to internationalize your writers' customizations so that things work properly in your target language.

So when your writers tell you how much easier your life will be now that content is in XML, don't forget to look a bit further down the road at what they're using to transform that XML into something useful. That's where you'll put in the hours.


02 October 2008

Wordcount Woes - Part 2

If you're working client-side, how many words have you paid for that translators didn't even need to touch?

I posted a couple of weeks ago on translatable words that vendors may miss in analyzing files. Alert reader arithmandar commented that slide decks can be even worse, if there is a lot of verbiage on the master slide that does not get easily captured (although Trados finds these words, according to him/her). Flash is another story altogether, and arithmandar's recommendation is that a Flash engineer should probably perform the analysis.

The other side of the coin is also unpleasant, but for the other party: Clients can hand off vast expanses of words that nobody will translate, artificially inflating the wordcount and estimate.
  • Code samples - If your documentation contains examples of code used in your product (e.g., in an API reference), there is no point in including it in the wordcount, because nobody translates code.
  • XML/HTML/DITA/Doxygen tags - I hope your vendor is parsing these files to ignore text (especially href text) in the tags. Otherwise, not only will you get back pages that won't work worth a darn, but you'll also be charged for the words.
  • Legal language - Some companies want their license agreements, trademark/copyright statements, and other legal pages left untranslated. (Usually these are American companies.)
  • Directives - Certain directives and warnings apply to certain countries only. The documentation for computer monitors and medical devices often contains a few pages of such directives, which appear in the language of the country requiring them. There is usually set language for these directives, so free translation is not appreciated; have your colleagues in Compliance obtain the language for you, paste it in yourself, and point it out to your vendor.
Mind you, there are costs associated with finding and removing all of these words: Do you want to spend time extracting the words? Do you want to hire somebody to find and extract them? Will your savings offset those costs?

If the words to be ignored add up to enough money - as they often do for a couple of our clients - pull them all into a text file and send them to your vendor with instructions to align them against themselves for all languages in the translation memory database. That way, when the vendor analyzes your files, the untranslatable words will fall out at 100% matches.
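The tag-stripping part of the analysis can be sketched in a few lines. This is only an illustration of the idea (the regexes are deliberately crude; real files deserve a proper parser):

```python
import re

def countable_words(html: str) -> int:
    """Rough wordcount that skips markup and code samples.
    Illustrative only - production files need a real HTML/XML parser."""
    # Drop code samples wholesale - nobody translates code.
    text = re.sub(r"<pre>.*?</pre>", " ", html, flags=re.DOTALL)
    # Drop the tags themselves, href values and all.
    text = re.sub(r"<[^>]+>", " ", text)
    return len(text.split())

page = ('<p>Click <a href="settings.html">Settings</a> to continue.</p>'
        '<pre>int x = 0;</pre>')
print(countable_words(page))  # 4
```

A naive count of that page would bill you for the href, the markup and the code sample; the filtered count charges only for the four words a translator will actually touch.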

Do you have ideas on how to handle such text?


11 September 2008

Wordcount Woes - Part 1

Do you spend much time fretting about wordcount?

My hunch is that translators worry about it more than agencies do, because it's often the only metric by which translators earn their daily bread. Agencies have project management, layout, graphics, consulting, rush charges and other metrics to observe, but most translators have one line-item on their invoices: wordcount.

I suppose that we all live and die by it because everybody's calculations come down to wordcount - either source or target text - sooner or later. But no two tools define words the same way, so wordcount can vary considerably.
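Two hypothetical tools, one sentence, two answers - that's the whole dispute in miniature:

```python
import re

sentence = "Re-install the plug-in (see page 3-2)."

# Tool A counts whitespace-delimited tokens...
tokens = len(sentence.split())
# ...Tool B counts runs of letters and digits, so every hyphen splits a "word".
runs = len(re.findall(r"[A-Za-z0-9]+", sentence))

print(tokens, runs)  # 6 9
```

A 50% spread on one sentence from nothing more than hyphen handling. Multiply that across a manual and you can see why two quotes on the same file rarely agree.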

Still, the bigger issue with wordcount is "wordcount leakage." If you're working vendor-side, how many times have you quoted on a project, then realized that you had overlooked a chunk of text?

  • Graphics are the biggest culprit. The document contains charts and diagrams that require translation, but TM tools don't find those words. Many vendors wisely exclude such text from wordcount and cover it in an hourly or per-graphic charge. (Nobody can ever find the source files for the graphics so that you can localize them properly, but that's a whole other talk show.)
  • Bookmarked text is also slippery. It appears as text (sentences, paragraphs) in one place, and is referred into other places in the document. True, you only translate it in one place, but you need to deal with it - layout, formatting, page flow - in other places as well.
  • Conditional text, a favorite of FrameMaker professionals, can also cause you trouble. If you don't calculate wordcount with the conditions set to expose all of the text, you may miss it. The author should arrange for this before handoff.
  • Embedded documents (spreadsheets, word processing, HTML, presentations) are very sneaky. We just saw this the other day with an MS Word document that contained several embedded spreadsheets visible only as 1cm square icons on the page; double-clicking the icons opened up the embedded files. The TM tools don't see those words, but the client certainly would have if they had come back untranslated. Fortunately, we caught this in time.
The Moral: Two pairs of eyes should review every file before the TM analysis, NOT one pair of eyes and a TM software package.


21 August 2008

Localizing Code Snippets - Part II

Last week I posted on the dilemma of how to localize Code Snippets, the selected pieces of your documentation that you shoehorn into XML files so that Visual Studio can present them in tool-tip-like fashion to the user while s/he is writing code that depends on your documentation.

My goal was to ensure that the process of grabbing these bits of documentation (mostly one-sentence descriptions and usage tips) was internationalized, so that we could run it on translated documentation and save money. This has proved more difficult than anticipated.

Here is the lesson: If you think it's hard to get internal support for internationalizing your company's revenue-generating products, just try to get support for internationalizing the myriad hacks, scripts, macros and shortcuts your developers use to create those products.

In this client's case, it makes more sense to translate the documentation, then re-use that translation memory on all of the Code Snippet files derived from the documentation. It will cost more money (mostly for translation engineering and QA, rather than for new translation) in the short run, but less headache and delay in the long run. Not to mention fewer battles I need to fight.

Discretion is the better part of localization valor.


14 August 2008

Localizing Code Snippets

"Why would I localize code snippets?" you ask. (Go ahead; ask.)

Everybody knows you don't translate snippets of code. Even if you found a translator brave enough to take on something like int IBACKLIGHT_GetBacklightInfo(IBacklight *p, AEEBacklightInfo * pBacklightInfo), the compiler would just laugh and spit out error messages.

However, if you're a developer (say, of Windows applications) working in an integrated development environment (say, Microsoft Visual Studio), you may want to refer very quickly to the correct syntax and description of a feature without searching for it in the reference manual. The Code Snippet enhancement to Visual Studio makes this possible with a small popup box that contains thumbnail documentation on the particular interface the developer wants to use. It's similar in concept and appearance to the "What's This?" contextual help offered by right-clicking on options in many Windows applications.

How does the thumbnail documentation get in there? It's a tortuous path, but the enhancement pulls text from XML-formatted .snippet files. You can fill the .snippet files with the information yourself, or you can populate them from your main documentation source using Perl scripts and XSL transformation. So while you're not really translating code snippets, you're translating Code Snippets.

And therein lies the problem.


One of our clients is implementing Code Snippets, but the Perl scripts and XSL transformations they're using to extract the documentation don't support Unicode. I found this out because I pseudo-translated some of the source documentation and ran the scripts on it. Much of the text didn't survive to the .snippet files, so we're on a quest to find the offending portions of the scripts and suggest internationalization changes.
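Pseudo-translation, for the curious, is cheap to do yourself. The idea is to pad every string with non-ASCII characters so that anything the pipeline mangles jumps out in the output. A sketch (the marker characters are arbitrary):

```python
def pseudo(s: str) -> str:
    """Replace vowels with accented equivalents and bracket the string
    with non-ASCII markers, so lost or mangled text is easy to spot."""
    table = str.maketrans({"a": "á", "e": "é", "i": "í", "o": "ö", "u": "ü"})
    return "[ÀÁ " + s.translate(table) + " ÂÃ]"

print(pseudo("Returns the backlight state."))
# [ÀÁ Rétürns thé bácklíght státé. ÂÃ]
```

Run the pseudo-translated files through the whole toolchain; if the markers or accents come out as question marks or mojibake, you've found a non-Unicode link in the chain before paying for a single translated word.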

We've determined that the translated documentation in the Code Snippets will display properly in Visual Studio; the perilous part of the journey is the process of extracting the desired subset of documentation and pouring it into the .snippet files. Don't expect that your developers will automatically enable the code for this; you'll probably have to politely persist to have it done right.

Alternatives:
  • Wait until all of your documentation has been translated, then translate the .snippet files. It's more time-consuming and it will cost you more, but working this far downstream may be easier than getting your developers to clean up their scripts.
  • Make your Japanese developers tolerate English documentation in the Code Snippets.
Neither one is really the Jedi way. Work with your developers on this.


10 July 2008

Getting your Documentation Ready for Localization

Have you had to prepare your documentation for localization yet? My experience is that in almost all companies, writers have far too many other oppressive concerns gnawing at them to think about writing for localization.

A few days ago an industry colleague sent me a message asking, "Do you have experience making recommendations for how documentation can be authored for localization? I am looking to make our doc process more efficient to reduce costs."

I replied that, given his stature and tenure in the industry, there was not likely anything I could suggest that he hadn't already considered. Nevertheless, I sent him a list of ideas, in increasing order of difficulty:
  1. Make sure all the writers' computers are plugged in. (A bit of ironic humor I could not resist.)
  2. Is it easy to get from the authoring tool(s) into TM, and back out into publishable format? This is my current headache with an API reference manual we localize for one client, because moving from source language to the translator tools and back to target format is a colossal chore. If you have similar problems, devote some cycles at the format layer, even if it means writing an interface between your content management system and the translation tool.
  3. There are "authoring memory" tools that can suggest and re-use already-translated source text, so that writers don't say nearly the same thing multiple times and incur unnecessary TM penalties. Sajan has one, and SDLX contains one as well. I've never used either one, but I can imagine that success with the tools would require somebody with the documentation-familiarity of a technical writer and the global consciousness of a localization manager. Like you.
  4. I've presented on localization to a variety of audiences, and have consistently found tech writers to be the most interested in it, vastly more so than developers. When you show writers how the TM tools work, tell them how they can save money and re-use content, and let them know that you care about the impact of their work on international products, they will smell the coffee and engage. This takes a bit of evangelism, but it's worth it if the writers change their own practices.
  5. Convert everything to XML. Although Renato and Don of Common Sense Advisory joke that that will fix any L10n problem, it's nonetheless a good, long-term direction in which to move. It's easier to re-use text, and easier to mark text that should/should not be translated. That will save you money.
  6. Start a program of controlled language authoring (dumbing down the sentences, always writing in a structure that machine translation will recognize, etc.). I guess that GM and Caterpillar are poster children for this kind of thing, but it puts the writers (and you, in the bargain) through the change of life, which is why I mention it last.
What about you? Have you faced this in your organization? How have you made document localization easier for the company, without driving your writers crazy?

If you liked this post, have a look at Getting Writers to Care about Localized Documents.


29 May 2008

Localizing Robohelp Files - The Basics

We get a lot of search engine queries like "localize Robohelp file" and "translate help project." I'm pretty sure that most of them come from technical writers who have used Robohelp to create help projects (Compiled HTML Help Format), and who have suddenly received the assignment to get the projects localized.

The short answer
Find a localization company that can demonstrate to your satisfaction that it has done this before, and hand off the entire English version of your project - .hpj, .hhc, .hhk, .htm/.html and, of course, the .chm. Then go back to your regularly scheduled crisis. You should give the final version a quick smoke test before releasing it, for your own edification as well as to see whether anything is conspicuously missing or wrong.

The medium answer
Maybe you don't have the inclination or budget to have this done professionally, and you want to localize the CHM in house. Or perhaps you're the in-country partner of a company whose product needs localizing, and you've convinced yourself that it cannot be that much harder than translating a text file, so why not try it?

You're partially right: it's not impossible. In fact, it's even possible to decompile all of the HTML pages out of the binary CHM and start work from there. But your best bet is to obtain the entire help project mentioned above and then use translation memory software to simplify the process. Once you've finished translating, you'll need to compile the localized CHM using Robohelp or another help-authoring product (even hhc.exe).

The long answer
This is the medium answer with a bit more detail and several warnings.
  • There may be a way to translate inside the compiled help file, but I wouldn't trust it. Fundamentally, it's necessary to translate all of the HTML pages, then recompile the CHM; thus, it requires translation talent and some light engineering talent. If you don't have either one, then stop and go back to The Short Answer.
  • hhc.exe is the Microsoft HTML Help compiler, part of the HTML Help Workshop freely available from Microsoft. The workshop is not an authoring environment like Robohelp, but it offers the engineering muscle to create a CHM once you have created all of the HTML content. If you have to localize a CHM without recourse to the original project, you can use the HTML Help Workshop to decompile all of the HTML pages out of the CHM.
  • Robohelp combines an authoring environment for creating the HTML pages and the hooks to the HTML Help compiler. As such, it is the one-stop shopping solution for creating a CHM. However, it is known to introduce formatting and features that confuse the standard compiler, such that some Robohelp projects need to be compiled in Robohelp.
  • Robohelp was developed by BlueSky Software, which morphed into eHelp, which was acquired by Macromedia, which Adobe bought. Along the way it made some decisions about Asian languages that resulted in the need to compile Asian-language projects with the Asian-language version of Robohelp. This non-international approach was complicated by the fact that not every English version of Robohelp had an Asian-language counterpart. Perhaps Adobe has dealt with this by now, but if you're still authoring in early versions, be prepared for your localization vendor to tell you that it needs to use an even earlier Asian-language version.
  • Because the hierarchical table of contents is not HTML, you may find that you need to assign to it a different encoding from that of the HTML pages for everything to show up properly in the localized CHM, especially in double-byte languages.
  • The main value in a CHM lies in the links from one page to another. In a complex project, these links can get quite long. Translators should stay away from them, and the best way to accomplish that is with translation memory software such as Déjà Vu, SDL Trados, across or Wordfast. These tools insulate tags and other untouchable elements from even novice translators.
We've marveled at how many search engine queries there are about localizing these projects, and we think that Robohelp and the other authoring environments have done a poor job explaining what's involved.

If you liked this article have a look at "Localizing Robohelp Projects."


22 May 2008

If it isn't broken...break it!

What's the most effective way to bump up your translation costs unnecessarily?

Probably by localizing something that nobody will ever want in a foreign language, of course. But nobody would ever approve an expense like that, so it wouldn't have the opportunity to affect your translation costs.

There's a much sneakier, more pernicious way of wasting translation money: Tinkering with the original text (for example, English).

Suppose you localized your product or documentation from 2002 through 2007. You'd have five years' worth of translation memory (TM) economies and glossary entries going for you, with thousands of exactly matched words that incurred no translation cost from one version to the next. Then suppose that someone decided in 2008 to go in and "clean up" the original English text to make it more "readable" or "user-friendly."

What do you think would happen the next time you handed off this content for TM analysis? Suddenly, non-matches would pop up where exact matches used to be. Among the causes:
  • Combining short sentences
  • Breaking long sentences apart
  • Making stylistic changes to common terms (e.g., changing "phone" to "telephone" or "handset")
  • Standardizing disparate terms (e.g., selecting one of "Proceed as follows," "Perform the following steps," "Following is the required procedure" and propagating throughout the documentation)
  • Typographical or grammatical corrections
You might tolerate these modifications in the interest of improving your product in all languages - not just English - but the sad truth is that you may find that they make no difference in the localized products. You'd pay for words that the translator did not need to touch. This is an unfortunate artifact of the way in which translation jobs are estimated, but the analysis software cannot predict that the changes will make no difference to the translation; only the translator sees that.

Note that re-organizing content should not cost you additional translation money; as long as the sentence is the same (i.e., an exact match), it doesn't matter where it's located in the product.

So, are you better off leaving errors and other undesirables in your original-language content? No. It would be a mistake to let concern for translation cost impede your product improvement effort, like having the tail wag the dog. Still, to the extent you can control it, you should try to avoid purely stylistic changes that make no difference in how your customers use your product. A good editor can make a hundred such changes per hour, not realizing the ramifications on translation costs.

If you learned something from this post, you might like to read Improved Docs through Localization or Getting the Writers to Care about Localized Documents.


15 May 2008

Doxygen and localization

Are you localizing any documentation projects that use Doxygen? It's an open-source tool for documenting source code.

If your documentation set includes things like an API reference or extensive details in programming code, Doxygen allows you to embed tags in the original code or header files, then automatically create entire help systems organized around the tagged text. Doxygen does not compile anything, but takes the tagged bits of source files, turns them into HTML pages, then links them for viewing in a browser.

Like most tools, it's a breath of fresh air when it works properly, but it can require a lot of re-plumbing and retrofitting.

As far as localization goes, it can be a life-saver. In theory, you can have the header files themselves localized, then run them through Doxygen as you would the original English files. Working this far upstream can be a big advantage.

Some months ago a client embarked on a conversion of a help system to Doxygen. While it was still in the proof-of-concept stage, we pseudo-translated some header files and tested the tool for global-readiness.

The good news is that the developers of Doxygen have enabled it for multiple languages. It encodes pages in UTF-8 (or other character sets), so translated text displays properly in the browser. It's possible to set the OUTPUT_LANGUAGE parameter to your target language (e.g., Japanese, in our test scenario) so that the datestamp and other text supplied by Doxygen displays in Japanese, rather than in the default English.
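The relevant settings live in the Doxyfile. A fragment along these lines is what we used; the CHM-related values are illustrative, and the hhc.exe path will vary by machine:

```
# Doxyfile fragment - settings relevant to a Japanese build
OUTPUT_LANGUAGE   = Japanese
INPUT_ENCODING    = UTF-8
GENERATE_HTMLHELP = YES
CHM_FILE          = reference_ja.chm
HHC_LOCATION      = "C:/Program Files/HTML Help Workshop/hhc.exe"
```

With OUTPUT_LANGUAGE set, the boilerplate text Doxygen generates (datestamps, navigation labels) comes out in the target language rather than English.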

There are some I18n problems with Doxygen, though.
  • Each header file page begins with "Collaboration diagram for" followed by the page title. When the page title contains double-byte characters, the Japanese characters for "Collaboration diagram for" are corrupted. It appears that Doxygen is not pushing UTF-8 characters for this phrase, though it pushes UTF-8 characters in other places.
  • Some hyperlinked words in body text will require translation. If so, it will be important to ensure that they are translated the same everywhere. Note, however, that Doxygen will not generate the necessary file if the hyperlink has double-byte characters in it (not even on a Japanese OS).
  • Doxygen allows for generation of the .hhc, .hhp and .hhk files needed for Compiled HTML Help (CHM). It can also be configured to execute hhc.exe and compile the project. However, Doxygen outputs the .hhc file in UTF-8 format, which is incompatible with the table of contents pane in the Help viewer. To fix this, open the .hhc in Notepad (preferably on a Japanese OS) and save it back out as Shift-JIS ("ANSI" in Japanese Notepad). Then recompile the CHM by invoking hhc.exe from the command line and the contents will show up properly.
  • Searches using single- or double-byte characters do not work in the resulting CHM.
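If you'd rather not keep a Japanese copy of Notepad on standby, the .hhc re-encoding step above can be scripted. A sketch in Python (the file name is whatever your project generates):

```python
def reencode_toc(path: str) -> None:
    """Rewrite a UTF-8 .hhc file as Shift-JIS so the contents pane
    of the HTML Help viewer renders Japanese correctly."""
    with open(path, encoding="utf-8") as f:
        toc = f.read()
    with open(path, "w", encoding="shift_jis") as f:
        f.write(toc)
```

Run it on the generated .hhc, then invoke hhc.exe from the command line to recompile the CHM. Note that it will raise an error if the file contains characters outside the Shift-JIS repertoire, which is exactly the kind of surprise you want to catch before shipping.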
These strike me as rather large, empty boxes on the checklist of global-readiness. Still, the source code is available, so if your organization has already started down the Doxygen path, you can clean up problems like these for your worldwide versions.

Interested in this topic? You might enjoy another article I've written called Localizing Robohelp Projects.


17 April 2008

Putting More "Sim" in your "SimShip"

How are you doing on your simultaneous shipment ("simship")? This is a common term in the industry that refers to releasing your domestic and localized products at the same time. Is your organization getting closer to simship? It shouldn't be getting further from it.

What measures have you put in place to reduce your time to market for localized versions? It's never easy to pry finalized content from writers and engineers in time to have it translated, but that's the dragon that most of us have to slay, so we focus on it a lot. How can we peel off content and get the translation process started sooner?

In the same way that eating lightly 5 times a day keeps you from getting really hungry and eating voraciously 3 times a day, we've found that handing off smaller bits of content even before they're finished keeps us from having to panic when somebody calls for a localized version.

We manage projects for a client who has the advantages of lots of sub-releases (3.1.2, 3.1.3, 3.1.5) between main releases (3.1, 3.2), and few overseas customers who want the sub-releases. (They also have the disadvantage of lacking a content management system that would make this much easier.) Even if your situation is not an exact match, you'll find that some principles apply anyway.
  • The biggest nut in the product is a 3500-page API reference guide in HTML. (Most products have a big, fat component that dwarfs all of the others.)
  • One month before each release, we assume that any new pages are about 95% final, so we hand them off for translation.
  • By the release date, we know whether we need to release a localized version of the entire product or not. If so, we proceed to hand off all of the rest of the product for translation, knowing that there will be some re-work of the new pages handed off a month before; if not, we hand off only the changed pages.
  • Thus, we almost always have pages from the API reference guide in translation. If we need them for a release, we have a lot of momentum already; if we don't need them for a release, we put the translations into our back pocket and wait until it's time for the next localized version.
This costs more than normal because of the inevitable re-translation, not to mention the hours spent refreshing the localization kit and preparing files for translators. But this cost is acceptably low compared to the look of anguish on the international product manager's face when we have to say, "It will take about three months to finish the Korean version because of all of the change since we last localized it."

We also need to assume that, sooner or later, there will be a request for the product in certain languages. If business conditions change and the new translations never see release, then the effort has been wasted for those languages, but that's a normal business risk.


06 December 2007

We all [heart] PDFs!

Any good localization manager (vendor- or client-side) knows that there's very little you can do with a PDF as a source file. Yet time and again, we confront the best intentions of our customers and co-workers who say, "It's not a very large file, so it shouldn't cost much to translate. I'll send it to you." They send us a PDF.

This has happened to me with two new clients this week. We'd all like more translation business, and it's convenient that PDF exists as a lingua franca format for us, but it's something of a double-edged sword.

PDFs contain everything we need to view a file, but not everything we need to extract the text, formatting, callouts, frames, tags, etc. from it. Creating a localization estimate on a PDF is asking for trouble, because it smooths over a multitude of different issues that we'll encounter once we have the source files, most of which concern text that we know requires translation, but which is not "live" in the PDF and may not be live in the source file from which the PDF came. It's the equivalent of hard-coded strings in software, or localizing a binary without the .properties or resource files.

There are, of course, utilities for converting PDF to RTF to capture the live text and formatting, and that's better than nothing, but it's probably still a far cry from the Quark or InDesign or even MS Word file from which you started. I've sent one of my new clients back to the drawing board several times this week already:
  1. He gave me a PDF and I asked for the source file.
  2. He found the source file (Quark) and I asked for the Photoshop files from which the text-bearing graphics had originated.
  3. There were tables in the Quark file that were Illustrator objects, because these looked much better than Quark's native tables.
  4. Another PDF of a Word document contains eight graphs created by engineers all over the building. He said he'd try to obtain the original artwork (probably PowerPoint, every engineer's favorite Etch-a-Sketch), but I'll be surprised if he can find it.
So, folks, we love to localize your pieces, but try to keep tabs on all the bits that you drop into them. We can do things so much better-cheaper-faster when you do.


19 October 2007

Whaddya know? They asked me first this time!

Do you spend a lot of your time running to catch up to the train? Have you ever been surprised in the middle of a meeting by project plans that were well underway with no thought given yet to localization? Are you getting used to it?

What if they asked you first (or at least early on) about the project's implications for internationalization and localization? Would you know how to react?

This certainly caught me by surprise a few months ago. A client called me in for consultation. He didn't want me to manage the upcoming localization of his user manuals; he wanted me to review and edit the English versions so that they would be ready to localize.

This client, though small, is enlightened. The company is selling English, French, German, Spanish and Japanese versions of several products, and it has a hand-in-glove relationship with its localization company. It knows where its global bread is buttered.

I jumped at the chance to work with people thinking this far in advance, so I reviewed the manuals and submitted changes, almost all of which were acceptable.

How can you review/edit documentation with an eye to translating it?
  1. Take advantage of redundancy. Ensuring that identical sentences and paragraphs remain identical is a good way to lower per-word translation costs. Turn the text into a bookmark at its first occurrence, then invoke or cross-reference that bookmark at subsequent occurrences.
  2. Ensure that the product matches the documentation. Not all organizations get around to this, believe it or not, and it becomes a bit of value added by the internationalization/localization function.
  3. Standardize terms. Especially in companies without a well-developed team of writers, manuals end up with pairs or trios of synonyms that vex translators and add no information, so take the liberty of eliminating all but one:
    • Determine/specify
    • based on/according to
    • click the button/click on the button/select the button
    • lets you/enables you to/allows you to
  4. Mention errors and inconsistencies that have nothing to do with internationalization. Again, you increase the perceived value of the localization function. Even though the result doesn't affect the localized products, the Localization Department (you) are contributing to a better core product.
  5. Axe a few "dead" words. They add little to the explanation, will probably not survive translation, and inflate wordcount:
    • unique
    • basically
    • popular
    • congratulations
    • very much
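The redundancy in item 1 is worth quantifying before handoff. Here is a minimal sketch that counts sentences occurring more than once, a rough proxy for the 100% matches a translation memory will find (the naive sentence splitter is my own assumption; real documents need a smarter tokenizer):

```python
import re
from collections import Counter

def repeated_sentences(text, min_repeats=2):
    """Count sentences that occur more than once -- a rough proxy
    for the 100%-match leverage a translation memory will find."""
    # Naive sentence split; real documents need a smarter tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n >= min_repeats}

sample = (
    "Click Save to store your changes. "
    "Open the Options dialog. "
    "Click Save to store your changes."
)
print(repeated_sentences(sample))
```

Run it over a chapter before and after your edits; if the repeated-sentence count rises, your per-word costs should fall.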
By the way, the review took longer than I'd anticipated, so if you have a similar opportunity, don't bid a flat fee the first time.

Interested in this topic? Have a look at Improved Docs through Localization.

Labels: , , , ,

20 July 2007

Machine translation in action

Has your boss asked you to use Google or AltaVista or some other flavor of machine translation to lower your translation costs?

Here's somebody who has put his money where your boss' mouth is.

Controlled language website attracts visitors from 110 countries

www.muegge.cc, a website dedicated to demonstrating the value of controlled language authoring and machine translation (MT), has attracted visitors from more than 110 countries since its launch in the summer of 2006. One of the unique features of this website is the fact that it uses Google language tools to automatically translate the site's content into 15 language pairs such as German to English or English to Simplified Chinese. The website was created from the ground up for MT, and all text was written in compliance with the CLOUT rule set, a controlled language designed specifically for MT.

muegge.cc, E-mail: info@muegge.cc, Web: http://www.muegge.cc

How do they do it? By controlling the text that goes into the translation machine. The simpler, more predictable and better-structured the text, the more likely the engine is to generate a satisfactory translation. In other words, machine translation would probably work better on a page of Hemingway than on a page of Shakespeare or Faulkner.
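The controlled-language idea can be sketched as a tiny linter. The two rules below (a word cap and a pronoun blacklist) are illustrative stand-ins of my own devising, not the actual CLOUT rule set:

```python
import re

# Illustrative controlled-language checks -- not the actual CLOUT rules,
# just the kind of constraints such rule sets impose.
MAX_WORDS = 20
AMBIGUOUS = {"it", "this", "that"}  # pronouns MT engines often mistranslate

def check_sentence(sentence):
    """Return a list of controlled-language violations for one sentence."""
    issues = []
    words = re.findall(r"[A-Za-z']+", sentence)
    if len(words) > MAX_WORDS:
        issues.append(f"too long ({len(words)} words)")
    for w in words:
        if w.lower() in AMBIGUOUS:
            issues.append(f"ambiguous pronoun: {w!r}")
    return issues

print(check_sentence("This lets you configure it quickly."))
```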

Don't forget, though: What you save in translation, you'll spend in whipping your writers into line. It may not look like real dollars, but it's time.

And time, as they say, is money.

Labels: , ,

11 May 2007

Localizing RoboHelp projects

Is it time for you to localize your RoboHelp projects? What's involved?

"RoboHelp project" is shorthand for "compiled help system." When this lives on a Windows client computer, it is usually a set of HTML Help (CHM) files. There are other variations, like WebHelp, which are also compiled HTML but do not run on the client.

The projects are a set of HTML files, authored in a tool such as--but not limited to--RoboHelp, then compiled into a binary form that allows for indexing, hierarchy and table of contents. Other platforms (Mac OS, Linux, Java) require a different compiler, but the theory is the same.

If you've done localization before, you'll find that RoboHelp projects are relatively easy, compared to a software project. RoboHelp (or whatever your authoring/compilation environment may be) creates a directory structure and file set that is easy to archive and hand off. It includes a main project file, table of contents file and index file. In fact, it's even possible in a pinch to simply hand off the compiled file, and have the localizers decompile it; the files they need will fall into place as a result of the decompilation.

Although you may think of the project as a single entity for localization purposes, each HTML page is a separate component. There may be large numbers of these pages that don't change from one version of your product to the next; nevertheless, you need to hand them off with the project, and you'll likely be charged for a certain amount of "touching" that the localizer's engineers will need to do. You may be able to save them some work and yourself some money by analyzing the project and determining which pages have no translatable changes, but by and large you should consider the costs for touching unchanged pages an unavoidable expense.
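If you do analyze the project for pages with no translatable changes, a checksum comparison between the previous handoff and the current one is a reasonable start. A sketch using only the standard library (the flat directory layout is an assumption; nested projects need a recursive walk):

```python
import hashlib
from pathlib import Path

def changed_pages(old_dir, new_dir):
    """Compare two versions of a help project and report which HTML
    pages actually differ, so unchanged pages can be flagged for the
    vendor (they may still charge a touch fee, but the list helps)."""
    def digest(p):
        return hashlib.sha256(p.read_bytes()).hexdigest()
    old = {p.name: digest(p) for p in Path(old_dir).glob("*.htm*")}
    changed = []
    for p in Path(new_dir).glob("*.htm*"):
        # New pages and pages whose bytes differ both count as changed.
        if old.get(p.name) != digest(p):
            changed.append(p.name)
    return sorted(changed)
```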

The biggest problem with these projects is in-country review. There's no easy way for an in-country reviewer to make changes or post comments in the compiled localized version. We've found that MS Excel is the worst way of doing this (except for all the others), so we've learned to live with it.

In theory, the translators are not mucking about with any tags, so the compiled localized version should work the same as the original. Yeah, right. All the links need to be checked--they do break sometimes--and the index and table of contents should be validated. And, don't forget to try a few searches to make sure they work; your customers surely will, and you want to spare them any unpleasant surprises.
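Link checking can be partly automated on the decompiled HTML before you ever open the compiled file. A sketch using only the standard library (the skip-list of href prefixes is my assumption; adjust it for your projects):

```python
from html.parser import HTMLParser
from pathlib import Path

class LinkCollector(HTMLParser):
    """Collect local hrefs from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip external, in-page, and mail links.
                if name == "href" and value and not value.startswith(("http", "#", "mailto:")):
                    self.links.append(value)

def broken_links(help_dir):
    """Report local hrefs in a help project that point at missing files."""
    broken = []
    for page in Path(help_dir).glob("**/*.htm*"):
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for href in parser.links:
            target = page.parent / href.split("#")[0]
            if not target.exists():
                broken.append((page.name, href))
    return broken
```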

Remember:
  • If you've included graphics in your help project, you'll need to obtain the original source files. These are not GIFs or JPEGs; they will be the application files from which the GIFs and JPEGs were generated. You'll need to hand off files from applications like Adobe Illustrator, or Flash or even PowerPoint, so that the translators can properly edit the text in them. Engineers often do quick mock-ups in Microsoft Word's WordArt that end up in the final product, and it takes a while to track them down.
  • Encoding can be thorny. Some compilers behave oddly if you try to impose the same encoding on both the HTML pages and the table of contents, especially in Japanese, in our experience.

Labels: , , , , ,

27 April 2007

Getting the Writers to Care about Localized Documents

Do your technical writers go through the localized documents before handing them off to production?

I thought not.

It is, of course, just one more thing on a writer's already crowded list of things to do. Add to that the appeal for the writer of going through a book in a language of which s/he has probably no notion, and you have a recipe for can't/won't/don't want to.

You can go through it yourself, localization manager that you are, and you'll probably find a few things wrong. But the writers are looking for very different things, and they have a talent for spotting them immediately. If you can get your writers around the corner on the inconvenience of the exercise, you'll find that they add real value. The movement into and out of translation software can break things in a large document, and who better to detect such things - even with no more than a cursory overview - than the people who wrote the book in the first place?

I've seen writers go through translated versions of their documents and find:
  • unexplained typeface changes
  • broken or dead hyperlinks
  • missing callouts
  • untranslated text
  • incorrect document part numbers
  • corrupted graphics
The real showstopper, though, occurs at the end of a two-month translation cycle for a 300-page manual, when the writer spends ten minutes going through the book, then sends you e-mail that reads, "Nice job on the Chinese manual, but you got the wrong version translated."

Maybe not the optimal time to find this out, but once again: Who besides the writer would have caught this?
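At least one of these checks, untranslated text, can be pre-screened before the writer ever opens the book. For a CJK target language, paragraphs that are still almost entirely ASCII letters are worth flagging (the threshold is arbitrary, and code samples or part numbers will raise false positives):

```python
def suspect_untranslated(paragraphs, threshold=0.9):
    """In a CJK target document, paragraphs that are still almost
    entirely ASCII letters are candidates for untranslated text.
    Heuristic only: code samples and part numbers are legitimately ASCII."""
    flagged = []
    for para in paragraphs:
        letters = [c for c in para if c.isalpha()]
        if not letters:
            continue
        ascii_ratio = sum(c.isascii() for c in letters) / len(letters)
        if ascii_ratio >= threshold:
            flagged.append(para)
    return flagged

doc = ["ファイルを保存します。", "Click Save to store the file."]
print(suspect_untranslated(doc))
```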

Labels: , , ,

23 March 2007

Localizing Declarations of Conformity

Does your documentation contain Declarations of Conformity with European Community standards? If it does, here is some due diligence you should undertake before having the docs translated.

The EC has promulgated a long series of directives on a variety of industries ranging from aerospace to toys. Some of these directives describe industrial policy and consumer protection. If your product falls into the category of those covered by a set of directives, then 1) the product must conform to the directives; and 2) you must declare that it conforms and list the directives with which it conforms.

This second requirement leads to some of the driest text with which you'll ever fill pages in a user guide; for instance:

Protection requirements concerning electromagnetic compatibility to Article 3(1)(b)

Harmonised standards applied:

EN 301-489-1, V1.4.1 (2002-08); Electromagnetic compatibility and Radio spectrum Matters (ERM); Electromagnetic Compatibility (EMC) Standard for Radio Equipment and Service. Part 1: Common technical requirements

ETSI EN 301 489-25 V2.2.1 (2003-05)

Fascinating reading. And, it makes for even more fascinating translation work.

If you're localizing your U.S. product for sale in Germany, the translation of the names of these standards with which you're declaring conformity should match the German names acknowledged by the EC. You could hand off the English text to a German translator, who could trip through several technical dictionaries creating his own translation. The numbers of the directives would be correct (because not translated), but strictly speaking, the titles would not be correct, unless your translator was extremely lucky.

Fortunately, the EC has made this easy. Depending on the industry, they offer accepted translations of the titles and text of the directives in as many as twenty languages on their Web site. With a bit of digging, your translators can find and re-use approved text. This will not only save them (and you) time, but will assure you of a better fit for your localized documentation.
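In practice this argues for a glossary the translators consult before inventing anything: a lookup keyed on the directive number that fails loudly when no approved title is on file. A sketch with placeholder strings (the directive number is a real one, but the stored titles here are stand-ins, not the EC's actual text):

```python
# Hypothetical lookup table: directive numbers mapped to the officially
# published title per language. The title strings are placeholders; the
# real text must come from the EC's own publications.
OFFICIAL_TITLES = {
    ("1999/5/EC", "de"): "<official German title as published by the EC>",
    ("1999/5/EC", "fr"): "<official French title as published by the EC>",
}

def official_title(directive, lang):
    """Return the approved title, or refuse rather than improvise one."""
    try:
        return OFFICIAL_TITLES[(directive, lang)]
    except KeyError:
        raise LookupError(
            f"No approved {lang} title on file for {directive}; "
            "do not let the translator improvise one."
        )
```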

Labels: ,

02 March 2007

Translation non-savings, Part II

Again I ask: How far will you go to improve your localization process? If a big improvement didn't save any obvious money, would your organization go for it?

I selected a sample of 180 files. In one set, I left all of the HTML tags and line-wrapping as they have been; in the other set, I pulled out raw, unwrapped text without HTML tags. My assumption was that the translation memory tools would find more matches in the raw, unwrapped text than in the formatted text.

I cannot yet figure out how or why - let alone what to do about it - but the matching rate dropped as a result of this experiment.
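For the curious, the preprocessing amounted to something like this sketch (not the actual scripts I used, just the idea): strip the tags, then collapse the hard line wraps into single spaces:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags and collect the raw text of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def unwrap(html_source):
    """Drop tags, then collapse hard line wraps into single spaces."""
    parser = TextExtractor()
    parser.feed(html_source)
    text = "".join(parser.chunks)
    return re.sub(r"\s+", " ", text).strip()

print(unwrap("<p>The quick\nbrown fox</p>"))
```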

Match category               Original HTML formatting and tags   Unwrapped, unformatted text
100% match and repetitions   65%                                 51%
95-99% match                 9%                                  14%
No match                     9%                                  15%

This is, as they say in American comedy, a revoltin' development. It means that the anticipated savings in translation costs won't be there - though I suspect that the translators themselves will spend more time aligning and copy-pasting than they will translating - and that I'll have to demonstrate process improvement elsewhere. If I can find an elsewhere.

True, the localization vendor will probably spend less time in engineering and file preparation, but I think I need to demonstrate to my client an internal improvement - less work, less time, less annoyance - rather than an external one.

Labels: , , , , , ,

30 January 2007

Localization Train slowing

We're seeing the localization juggernaut lose some steam.

In the early years, this client localized its flagship software package for developers in China, Japan and Korea (CJK), then added Brazil. It took small, reference applications into as many as 10 languages (including Hebrew and Thai) as those markets showed promise. The budget was pretty fat, the localized products were freshened frequently, and the developers were happy to have software and doc in their own language.

I suppose it was to be expected that this would peter out with time, because markets change, business cases wax and wane, and some regions never return the investment.

The new stressor on localization was less easy to anticipate: bulk. Each generation of improvements to the product brings several hundred more pages of documentation. All of this new documentation is, of course, "free" in English, but somebody has to pull out a checkbook to deal with it in other languages, and that checkbook comes out more slowly and with more misgivings these days.

Engineering and Product Management furrow their brow nowadays when I walk in with cost estimates. I've adapted to this change in attitude with a few techniques:
  1. The Technical Reference is the fattest target and the source of most of the expansion. It lives in a compiled help file (CHM) that is no longer written by Tech Pubs, but generated by Perl scripts from header files written by the engineers. Our modus localizandi has been to hand off the finished help project, now comprising 3700 HTML files, and have the HTML translated. In an effort to lower cost, I'm attempting a proof-of-concept to localize the header files themselves, then tune the scripts to convert them into localized HTML. This should lower our localization engineering costs considerably.
  2. I agitate for interim localization updates, peeling off documentation deltas every few weeks and handing them off for translation, even if there are no plans to release them yet. This reduces the sticker shock and time-to-market delay that comes of getting an estimate on a release only when necessary, which may be a 10- to 18-month interval. Product Management and Engineering, who only think about localization when it's absolutely unavoidable, find the tsunami of untranslated text depressing.
  3. Although it's not a very clean way of doing things, I screen from the localization handoff those items that I know have little to be translated. Sometimes I go to the level of resource files, but more often I take documents to which only a few minor changes have been made from one En version to the next, hand off changed text, then place the translations myself. This is not for the faint of heart, nor for those who don't really know the languages involved, but it can save some money.
  4. I try to keep global plates spinning, in the hope that more people will consider the global dimension of what we do, and the fact that localization is the necessary step for making your product acceptable to people whose use of your product will make you money, if you make it easy for them.
  5. I never impart bad news on Friday.
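The proof-of-concept in item 1, localizing the header files rather than the generated HTML, might start with something like this sketch, which pulls documentation comments out of C headers for handoff (the function and sample header are hypothetical, and the real scripts must also put the translations back):

```python
import re

DOC_COMMENT = re.compile(r"/\*\*(.*?)\*/", re.DOTALL)

def extract_doc_comments(header_source):
    """Pull /** ... */ documentation comments out of a header file so
    they can be handed off for translation, leaving the code untouched.
    A proof-of-concept only; the real scripts must also write the
    translations back into place."""
    comments = []
    for match in DOC_COMMENT.finditer(header_source):
        text = match.group(1)
        # Strip the leading '*' gutter that doc comments usually carry.
        lines = [re.sub(r"^\s*\*\s?", "", ln) for ln in text.splitlines()]
        comments.append("\n".join(lines).strip())
    return comments

header = """
/** Opens the device for reading.
 *  Returns a handle, or NULL on failure. */
int dev_open(const char *name);
"""
print(extract_doc_comments(header))
```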

Labels: , , , , ,

16 January 2007

Improved Docs through Localization

I spent some time on the phone with new clients last week, going through a user guide they plan to have localized. Discussing the usual localization questions (i.e., the ones I figured the translators would ask sooner or later), we began to edge towards the initially depressing realm of Changing Documentation to Suit Localization.

"Don't misunderstand," I intoned repeatedly, "I'm not trying to get you to re-write an already published book just to make localization easier. We're just bringing up small issues in how you can write future books a bit more generically so that you can take exactly what you've published in English and hand it off for localization without customizing it first."

Still, I thought I detected a collective, resigned sigh from them. I've learned by now that it translates to "Writing for translation is really going to be a pain, isn't it?"

They then asked for suggestions about optimizing future documents for localization purposes, in the form of guidelines or style guides. This is good thinking, and I told them so. It amounts to documentation internationalization.

I've read plenty of articles on how to do this (authors include Kit Brown of Comgenesis and Nancy Combe), but I usually find them superficial (leave white space, use numbered callouts, be sure to do the software first...), because the solution doesn't lie in documents, but rather in each organization and in the way that Engineering, Product Management, Tech Pubs and the overseas partners work together.

I told them that they could read up on this for a month, or we could all just go through the process of localizing an already written manual and make our own guidelines. The former won't do any harm, but I think they'll find that the latter will result in more - and more-specific - pointers that will apply to future books.

The important thing is to arrive at gradual changes that the company will tolerate in the next 3/6/12 months, so that their books become more global without the localization tail wagging the dog.

Labels: , ,

20 December 2006

Localization Conundrum

My client received a request from Korea for a localized version 6.5. There are two issues:

  1. It's going to cost a lot, because the last version localized into Ko was version 5.01.
  2. English is up to version 8, and the process of creating the help is much better than in 6.5. Should we include those enhancements in 6.5 Ko, even though they would take it out of parity with 6.5 En?
I experimented to see whether the improvements mattered to the localization process in general and the cost in particular. I re-created portions of version 5.01 help using version 5.01 Perl scripts, then did portions of that same help using version 6.5 Perl scripts. Then I handed both sets off for wordcount analysis. They were within 2-3% of each other, so the cost-savings in translation are not there.

However, I suspected that the vendor would charge me a lot more for engineering on the 5.01 help, because the version 6.5 scripts are much cleaner, and they handle the raw text much better. This compelled me to examine the matter further.

Better help or not, the problem is one of product management. Even if 6.5 help is "better," it differs too much from 5.01 help. I imagine a Korean customer struggling to bounce back and forth between 5.01 En and Ko, and puzzling at the discrepancies, even though the Ko version had a lot more information than the En version.

They are the sort of discrepancies that make cowards of us all (albeit well-advised cowards). I've decided to hand off the pure 5.01 En help system for this project, warts and all.

Labels: , ,