29 May 2008

Localizing Robohelp Files - The Basics

We get a lot of search engine queries like "localize Robohelp file" and "translate help project." I'm pretty sure that most of them come from technical writers who have used Robohelp to create help projects (Compiled HTML Help Format), and who have suddenly received the assignment to get the projects localized.

The short answer
Find a localization company that can demonstrate to your satisfaction that it has done this before, and hand off the entire English version of your project - .hhp, .hhc, .hhk, .htm/.html and, of course, the .chm. Then go back to your regularly scheduled crisis. You should give the final version a quick smoke test before releasing it, for your own edification as well as to see whether anything is conspicuously missing or wrong.

The medium answer
Maybe you don't have the inclination or budget to have this done professionally, and you want to localize the CHM in house. Or perhaps you're the in-country partner of a company whose product needs localizing, and you've convinced yourself that it cannot be that much harder than translating a text file, so why not try it?

You're partially right: it's not impossible. In fact, it's even possible to decompile all of the HTML pages out of the binary CHM and start work from there. But your best bet is to obtain the entire help project mentioned above and then use translation memory software to simplify the process. Once you've finished translating, you'll need to compile the localized CHM using Robohelp or another help-authoring product (even hhc.exe).

The long answer
This is the medium answer with a bit more detail and several warnings.
  • There may be a way to translate inside the compiled help file, but I wouldn't trust it. Fundamentally, it's necessary to translate all of the HTML pages, then recompile the CHM; thus, it requires translation talent and some light engineering talent. If you don't have either one, then stop and go back to The Short Answer.
  • hhc.exe is the Microsoft HTML Help compiler. It doesn't ship with Windows; it's part of the HTML Help Workshop, freely available from Microsoft. This workshop is not an authoring environment like Robohelp, but it offers the engineering muscle to create a CHM once you have created all of the HTML content. If you have to localize a CHM without recourse to the original project, you can use the workshop's Decompile command (or hh.exe -decompile, using the viewer that does ship with Windows) to extract all of the HTML pages from the CHM.
  • Robohelp combines an authoring environment for creating the HTML pages and the hooks to the HTML Help compiler. As such, it is the one-stop shopping solution for creating a CHM. However, it is known to introduce formatting and features that confuse the standard compiler, such that some Robohelp projects need to be compiled in Robohelp.
  • Robohelp was developed by BlueSky Software, which morphed into eHelp, which was acquired by Macromedia, which Adobe bought. Along the way it made some decisions about Asian languages that resulted in the need to compile Asian-language projects with the Asian-language version of Robohelp. This non-international approach was complicated by the fact that not all English versions of Robohelp were available for Asian languages. Perhaps Adobe has dealt with this by now, but if you're still authoring in early versions, be prepared for your localization vendor to tell you that it needs to use an even earlier Asian-language version.
  • Because the hierarchical table of contents is not HTML, you may find that you need to assign to it a different encoding from that of the HTML pages for everything to show up properly in the localized CHM, especially in double-byte languages.
  • The main value in a CHM lies in the links from one page to another. In a complex project, these links can get quite long. Translators should stay away from them, and the best way to accomplish that is with translation memory software such as Déjà Vu, SDL Trados, Across or Wordfast. These tools insulate tags and other untouchable elements from even novice translators.
We've marveled at how many search engine queries there are about localizing these projects, and we think that Robohelp and the other authoring environments have done a poor job explaining what's involved.
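
To make the tag "insulation" in the last bullet concrete, here is a minimal Python sketch of the general idea: swap each tag for a numbered placeholder, let the translator work on the plain text, then put the tags back. This is not how any of the tools named above work internally; the function names and the sample link path are made up for illustration.

```python
import re

TAG = re.compile(r"<[^>]+>")  # naive tag matcher, good enough for a sketch


def mask_tags(html):
    """Replace each HTML tag with a numbered placeholder; return text and tags."""
    tags = []

    def stash(match):
        tags.append(match.group(0))
        return "{%d}" % len(tags)

    return TAG.sub(stash, html), tags


def unmask_tags(text, tags):
    """Put the original tags back in place of their placeholders."""
    return re.sub(r"\{(\d+)\}", lambda m: tags[int(m.group(1)) - 1], text)


# Hypothetical source segment; the href is an invented example path.
source = 'See <a href="../ref/IRedial.htm">IRedial</a> for details.'
masked, tags = mask_tags(source)

# The translator only ever sees the masked text and leaves {1} and {2} alone.
translated = masked.replace("See", "Voir").replace("for details", "pour plus de détails")
restored = unmask_tags(translated, tags)
```

The translator never touches the long link, so it comes back out byte-for-byte identical in the localized page.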

If you liked this article, have a look at "Localizing Robohelp Projects."


15 May 2008

Doxygen and localization

Are you localizing any documentation projects that use Doxygen? It's an open-source tool for documenting source code.

If your documentation set includes things like an API reference or extensive details in programming code, Doxygen allows you to embed tags in the original code or header files, then automatically create entire help systems organized around the tagged text. Doxygen does not compile anything, but takes the tagged bits of source files, turns them into HTML pages, then links them for viewing in a browser.

Like most tools, it's a breath of fresh air when it works properly, but it can require a lot of re-plumbing and retrofitting.

As far as localization goes, it can be a life-saver. In theory, you can have the header files themselves localized, then run them through Doxygen as you would the original English files. Working this far upstream can be a big advantage.

Some months ago a client embarked on a conversion of a help system to Doxygen. While it was still in the proof-of-concept stage, we pseudo-translated some header files and tested the tool for global-readiness.

The good news is that the developers of Doxygen have enabled it for multiple languages. It encodes pages in UTF-8 (or other character sets), so translated text displays properly in the browser. It's possible to set the OUTPUT_LANGUAGE parameter to your target language (e.g., Japanese, in our test scenario) so that the datestamp and other text supplied by Doxygen displays in Japanese, rather than in the default English.

There are some I18n problems with Doxygen, though.
  • Each header file page begins with "Collaboration diagram for" followed by the page title. When the page title contains double-byte characters, the Japanese characters for "Collaboration diagram for" are corrupted. It appears that Doxygen is not pushing UTF-8 characters for this phrase, though it pushes UTF-8 characters in other places.
  • Some hyperlinked words in body text will require translation. If so, it will be important to ensure that they are translated consistently everywhere. Note, however, that Doxygen will not generate the necessary file if the hyperlink has double-byte characters in it (not even on a Japanese OS).
  • Doxygen allows for generation of the .hhc, .hhp and .hhk files needed for Compiled HTML Help (CHM). It can also be configured to execute hhc.exe and compile the project. However, Doxygen outputs the .hhc file in UTF-8 format, which is incompatible with the table of contents pane in the Help viewer. To fix this, open the .hhc in Notepad (preferably on a Japanese OS) and save it back out as Shift-JIS ("ANSI" in Japanese Notepad). Then recompile the CHM by invoking hhc.exe from the command line and the contents will show up properly.
  • Searches using single- or double-byte characters do not work in the resulting CHM.
These strike me as rather large, empty boxes on the checklist of global-readiness. Still, the source code is available, so if your organization has already started down the Doxygen path, you can clean up problems like these for your worldwide versions.
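
If you'd rather not open Notepad for the .hhc fix described in the list above, the same re-encoding can be scripted. What follows is a minimal sketch, assuming the .hhc really is UTF-8 and that Python's cp932 codec (the Windows flavor of Shift-JIS) is what the Help viewer wants; the function name and its default are my own choices, not anything Doxygen provides.

```python
from pathlib import Path


def reencode_toc(hhc_path, target_encoding="cp932"):
    """Rewrite a UTF-8 .hhc file in the target code page (cp932 = Windows Shift-JIS)."""
    path = Path(hhc_path)
    # Work on raw bytes so line endings are left exactly as they were.
    text = path.read_bytes().decode("utf-8-sig")  # also drops a BOM if one is present
    path.write_bytes(text.encode(target_encoding))
    return path


# Example call (hypothetical file name):
# reencode_toc("html/index.hhc")
```

After that, recompile from the command line with hhc.exe as described above. Any character that has no Shift-JIS equivalent will raise an error rather than be silently dropped, which is probably what you want.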

Interested in this topic? You might enjoy another article I've written called Localizing Robohelp Projects.


05 October 2007

"Why are you charging me for that?" - Part 1

Have you ever asked your localization vendor this question? Or, if you're a vendor, has any client ever asked it of you?

For a few clients, we manage large documentation projects, notably HTML Help and Robohelp localization. When the vendor translated 800 HTML pages for version 1.0 of the product, a particular client swallowed hard and paid for all non-matches, because it was the first time localizing the product.

By version 2.0, the Help had grown to 1400 pages. Many of the original 800 pages had no translatable changes, but Trados dutifully scooped up all of those words, dropped them into the "100%" or "95-99%" buckets, and the vendor charged us for them, even if at a greatly discounted rate.

"Why are you charging me for that?" I asked. I'll have more on this topic in an upcoming post, but for now:

If you're on the vendor-side, do you have a good answer for that question? If you're on the client-side, have you ever received an answer to that question that satisfied you?


13 July 2007

Where Translation Memory Goes to Die

Have you ever heard that you're better off not going into the kitchen at your favorite restaurant? You're likely to see a number of things you'd rather not associate with a place and a group of people you like.

The same may apply to your translation memory databases. Unfortunately, you don't have the luxury of ignoring them, because things could be dying in there and costing you money.

Let's start with this sentence:

Some interfaces use "redial function/redial context" semantics instead of using IRedial to specify both.

Any TM tool could store this string and its translation without problems. Suppose, though, that the sentence (segment, in TM terms) only looks contiguous when displayed in an HTML browser, which is a very forgiving viewer, and that the source is actually broken into three pieces:

1. Some interfaces use "redial function/redial context" semantics instead of using
2. to specify both.
3. [HTML tags] IRedial.htm [closing HTML tags] IRedial

The text comes from include files written by engineers for engineers, and no line is longer than 80 characters. The tags come from the well-intentioned Tech Pubs team, which struggles to introduce some organization, hyperlinking and search capability to the product. This is pretty bruising to TM, which relies on being able to fuzzily match new occurrences to old occurrences of similar text. When the full sentence comes through the TM tool, its correspondence to the three broken fragments in TM is sharply impaired, and you (or I, in this case) pay for it.

It gets worse. If an engineer pushes words from one line to the next between versions, or if the tags are modified, match rates suffer in the same way.
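
To get a feel for the damage, here is a rough Python sketch that scores the full sentence against each of the three fragments. It uses difflib from the standard library as a stand-in; real TM tools use their own matching algorithms, so the numbers only illustrate the principle.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0 (stand-in for a TM fuzzy match)."""
    return SequenceMatcher(None, a, b).ratio()


full = ('Some interfaces use "redial function/redial context" semantics '
        'instead of using IRedial to specify both.')

# The three pieces as they actually sit in TM.
fragments = [
    'Some interfaces use "redial function/redial context" semantics instead of using',
    'to specify both.',
    'IRedial',
]

scores = [similarity(full, fragment) for fragment in fragments]
best = max(scores)
```

Even the best-scoring fragment falls short of a perfect match, and the two short fragments score so low that no TM tool would offer them as candidates.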

I've huddled with engineers, Tech Pubs and the localization house on this matter several times, with little progress to show for it, but here's a new twist:

We've offshored one of these projects to a vendor in China. Their solution was to re-align ALL of the English-language HTML pages from the previous version to ALL of the translated HTML pages of the previous version, effectively re-creating TM. They report about 20% higher match rates after doing this. I think this is because they're embracing the broken, dead segments in TM and finding them in the source files for the new version.

This seems like a counterintuitive approach, but who can argue with the benefits?


11 May 2007

Localizing RoboHelp projects

Is it time for you to localize your RoboHelp projects? What's involved?

"RoboHelp project" is shorthand for "compiled help system." When this lives on a Windows client computer, it usually takes the form of HTML Help (CHM) files. There are other variations like Web Help, which is also compiled HTML, but which does not run on the client.

The projects are a set of HTML files, authored in a tool such as--but not limited to--RoboHelp, then compiled into a binary form that allows for indexing, hierarchy and table of contents. Other platforms (Mac OS, Linux, Java) require a different compiler, but the theory is the same.

If you've done localization before, you'll find that RoboHelp projects are relatively easy, compared to a software project. RoboHelp (or whatever your authoring/compilation environment may be) creates a directory structure and file set that is easy to archive and hand off. It includes a main project file, table of contents file and index file. In fact, it's even possible in a pinch to simply hand off the compiled file, and have the localizers decompile it; the files they need will fall into place as a result of the decompilation.

Although you may think of the project as a single entity for localization purposes, each HTML page is a separate component. There may be large numbers of these pages that don't change from one version of your product to the next; nevertheless, you need to hand them off with the project, and you'll likely be charged for a certain amount of "touching" that the localizer's engineers will need to do. You may be able to save them some work and yourself some money by analyzing the project and determining which pages have no translatable changes, but by and large you should consider the costs for touching unchanged pages an unavoidable expense.
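
One way to do that analysis yourself is to compare only the visible text of each page between versions, ignoring tags and whitespace. Here is a minimal Python sketch of that idea, using only the standard library. It is deliberately crude: it ignores translatable attributes such as alt and title, and it treats script and style content as text, so treat its verdict as a first pass and not as a substitute for the vendor's analysis.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the page's text with tags removed and whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def has_translatable_changes(old_html, new_html):
    """True if the text differs once tags and whitespace are ignored."""
    return visible_text(old_html) != visible_text(new_html)
```

A page whose only changes are a new class attribute, a retargeted link or a re-wrapped line comes back as unchanged, and those are the pages worth flagging to the vendor.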

The biggest problem with these projects is in-country review. There's no easy way for an in-country reviewer to make changes or post comments in the compiled localized version. We've found that MS Excel is the worst way of doing this (except for all the others), so we've learned to live with it.

In theory, the translators are not mucking about with any tags, so the compiled localized version should work the same as the original. Yeah, right. All the links need to be checked--they do break sometimes--and the index and table of contents should be validated. And, don't forget to try a few searches to make sure they work; your customers surely will, and you want to spare them any unpleasant surprises.

Remember:
  • If you've included graphics in your help project, you'll need to obtain the original source files. These are not GIFs or JPEGs; they will be the application files from which the GIFs and JPEGs were generated. You'll need to hand off files from applications like Adobe Illustrator, or Flash or even PowerPoint, so that the translators can properly edit the text in them. Engineers often do quick mock-ups in Microsoft Word's Word Art that end up in the final product, and it takes a while to track them down.
  • Encoding can be thorny. Some compilers behave oddly if you try to impose the same encoding on both the HTML pages and the table of contents, especially in Japanese, in our experience.


20 April 2007

Localization Testbenches, Part IV (Online Help)

What are you using to test your localized products? If you're handing them to your domestic QA team and expecting that they'll intuitively test them with correct language locale settings, you may be in for an unpleasant surprise.

3) Help files
Your online documentation also deserves some testing. After its contents (usually HTML pages or XML documents) have been translated - in the correct encoding for the target language - the help project will be compiled, in the same way that software applications are compiled. This compilation step needs to account for the correct language, locale and encoding, and this doesn't happen by itself, no matter how lucky you may feel today.

Again, it's important to test the help file in an environment that closely matches your customers' environment. Run your Greek help file on a native Greek operating system. Be sure to test the main window, the contents pane and the index for properly displayed characters. Above all, perform a few searches using native characters in the Find field to ensure that your help file's index was properly created and encoded; if your searches are successful, then your customers' searches will probably be successful as well.

Note: HTML Help under Windows has some idiosyncrasies when it comes to the table of contents (TOC) pane and the main window. Most tools like RoboHelp will properly encode the TOC and main pane content for, say, Japanese, when all of the content resides in the same project. However, if you're building your HTML help files with your own tools (e.g., Perl scripts and hhc.exe), you may find that encoding sauce for the goose is not encoding sauce for the gander. We've found, for example, that the HTML pages displayed in the main window are happy with UTF-8, whereas the TOC pane won't support UTF-8 but will support Shift-JIS.
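
If you build with your own scripts, a quick pre-compile check can catch a TOC file that was saved as UTF-8. The Python sketch below is a heuristic under my own assumptions: it flags any file that contains non-ASCII bytes and still decodes cleanly as UTF-8, on the theory that legacy code-page text such as Shift-JIS almost never does. The function name is hypothetical.

```python
def toc_is_utf8(data):
    """Return True if the bytes hold non-ASCII characters and decode cleanly as UTF-8."""
    if data.isascii():
        return False  # pure ASCII displays the same in any of these encodings
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # most likely a legacy code page such as Shift-JIS
    return True


# Example call (hypothetical file name):
# with open("project.hhc", "rb") as f:
#     if toc_is_utf8(f.read()):
#         print("TOC is UTF-8; the contents pane may show garbage")
```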


30 January 2007

Localization Train slowing

We're seeing the localization juggernaut lose some steam.

In the early years, this client localized its flagship software package for developers in China, Japan and Korea (CJK), then added Brazil. It took small, reference applications into as many as 10 languages (including Hebrew and Thai) as those markets showed promise. The budget was pretty fat, the localized products were freshened frequently, and the developers were happy to have software and doc in their own language.

I suppose it was to be expected that this would peter out with time, because markets change, business cases wax and wane, and some regions never return the investment.

The new stressor on localization was less easy to anticipate: bulk. Each generation of improvements to the product brings several hundred more pages of documentation. All of this new documentation is, of course, "free" in English, but somebody has to pull out a checkbook to deal with it in other languages, and that checkbook comes out more slowly and with more misgivings these days.

Engineering and Product Management furrow their brows nowadays when I walk in with cost estimates. I've adapted to this change in attitude with a few techniques:
  1. The Technical Reference is the fattest target and the source of most of the expansion. It lives in a compiled help file (CHM) that is no longer written by Tech Pubs, but generated by Perl scripts from header files written by the engineers. Our modus localizandi has been to hand off the finished help project, now comprising 3700 HTML files, and have the HTML translated. In an effort to lower cost, I'm attempting a proof-of-concept to localize the header files themselves, then tune the scripts to convert them into localized HTML. This should lower our localization engineering costs considerably.
  2. I agitate for interim localization updates, peeling off documentation deltas every few weeks and handing them off for translation, even if there are no plans to release them yet. This reduces the sticker shock and time-to-market delay that comes of getting an estimate on a release only when necessary, which may be a 10- to 18-month interval. Product Management and Engineering, who only think about localization when it's absolutely unavoidable, find the tsunami of untranslated text depressing.
  3. Although it's not a very clean way of doing things, I screen from the localization handoff those items that I know have little to be translated. Sometimes I go to the level of resource files, but more often I take documents to which only a few minor changes have been made from one En version to the next, hand off changed text, then place the translations myself. This is not for the faint of heart, nor for those who don't really know the languages involved, but it can save some money.
  4. I try to keep global plates spinning, in the hope that more people will consider the global dimension of what we do, and the fact that localization is the necessary step for making your product acceptable to people whose use of your product will make you money, if you make it easy for them.
  5. I never impart bad news on Friday.


20 December 2006

Localization Conundrum

My client received a request from Korea for a localized version 6.5. There are two issues:

  1. It's going to cost a lot, because the last version localized into Ko was version 5.01.
  2. English is up to version 8, and the process of creating the help is much better than in 6.5. Should we include those enhancements to 6.5 Ko, even though they would take it out of parity with 6.5 En?
I experimented to see whether the improvements mattered to the localization process in general and the cost in particular. I re-created portions of version 5.01 help using version 5.01 Perl scripts, then did portions of that same help using version 6.5 Perl scripts. Then I handed both sets off for wordcount analysis. They were within 2-3% of each other, so the cost-savings in translation are not there.

However, I suspected that the vendor would charge me a lot more for engineering on the 5.01 help, because the version 6.5 scripts are much cleaner, and they handle the raw text much better. This compelled me to examine the matter further.

Better help or not, the problem is one of product management. Even if 6.5 help is "better," it differs too much from 5.01 help. I imagine a Korean customer struggling to bounce back and forth between 5.01 En and Ko, and puzzling at the discrepancies, even though the Ko version had a lot more information than the En version.

They are the sort of discrepancies that make cowards of us all (albeit well advised cowards). I've decided to hand off the pure 5.01 En help system for this project, warts and all.


16 December 2006

Favorite Localization Tools

Here's a short list of Windows-based tools I use a great deal in managing localization projects:

Beyond Compare
- Clients constantly drill me about the differences between the last version of their product and this version, with an eye to the order of magnitude of localization expense they're in for. Beyond Compare is the best tool I've found for finding the files that have changed, then comparing older and newer versions of files in a specialized viewer. Good technical support as well.

EmEditor - As long as you have the font and OS support installed, you can view multi-byte characters in their appropriate applications under English-language Windows, but EmEditor allows you to change the encoding of a text file to better display it, or so that you can edit it. My standard text editor is UltraEdit, which has excellent search-and-replace capability, but it's not as deft as EmEditor for multibyte work on an English OS.

SDLX Glue - An obscure utility inside the SDLX suite, this will append up to I don't know how many hundred HTML files together. Translation vendors like it for work on big sites because it slashes the number of files being slung around. Naturally, it includes an unglue utility as well.

FAR - A technical writer introduced me to this utility, which includes a compiler system for HTML Help and MS Help. It will compile CHM files in any language such that, if you have a good HTML authoring tool, you don't need RoboHelp to build your CHMs. (Unfortunately, I've had problems when I've tried to use FAR on projects that have been created in RoboHelp, but there are some ways around them.)

Moreover, FAR stands for "Find And Replace", and this is hands down the best front end on regular expressions that I've ever found. The Holy Grail of search-and-replace is ignoring line breaks, and while regex supports that, not many utilities (that I've found) implement it. For instance, in the text

In a white room

with black curtains

at the station

if your goal was to find "room with black curtains at", most utilities would not be able to locate it because of the line breaks. FAR does find it, and even allows you to replace the text with line breaks. Top-flight technical support also.
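
For the curious, here is roughly what that kind of search looks like when spelled out as a regular expression. This is a sketch in Python's re module, not FAR's actual implementation: it treats every gap between words as "one or more whitespace characters," line breaks included. The function name is my own.

```python
import re


def find_across_breaks(phrase, text):
    """Search for a phrase, letting any run of whitespace stand in for each space."""
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.search(pattern, text)


lyrics = "In a white room\n\nwith black curtains\n\nat the station"
match = find_across_breaks("room with black curtains at", lyrics)
```

A plain substring search for the same phrase comes up empty, while the regex version returns a match that spans all three lines.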

Most of these are shareware, but they're well worth the US$25-$50.

(compiling CHMs, finding and replacing across line breaks)


25 September 2006

Doing the Localization Vendor's Work?

Sometimes I know too much about this process.

Or, maybe I'm just too nice a guy.

To make things easier for the vendor (and cheaper for me) I've resolved to carve the 3200 HTML files in the API Reference CHM into different buckets, depending on whether and how much they require translation vs. engineering. Naturally, the ultimate arbiter is the Trados or SDLX analysis that the vendor will perform, but I've already mentioned my concern about false positives and need write no more on the topic here.

My tool of choice is the extremely capable Beyond Compare which, at US$30, is worth it just to see how well thought-out a software package it is. I compare version 3.9 files against version 4 files, tuning the comparison rules to groom the file buckets as accurately as possible.

The distribution is not perfect, if for no other reason than because its first level of triage is the filename and not the file contents, but it's better than guessing, and it's much better than thousands of false positives.
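
For readers without Beyond Compare, the same first-level triage (match by filename, then compare contents) can be roughed out in a few lines of Python. This is a sketch of the idea and not a replacement for the tool: the three bucket names are my own, the comparison is byte-for-byte with no tunable rules, and the default pattern only picks up .htm files.

```python
import filecmp
from pathlib import Path


def bucket_files(old_dir, new_dir, pattern="*.htm"):
    """Sort the new version's files into unchanged / changed / new, matched by filename."""
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    buckets = {"unchanged": [], "changed": [], "new": []}
    for new_file in sorted(new_dir.rglob(pattern)):
        rel = new_file.relative_to(new_dir)
        old_file = old_dir / rel
        if not old_file.exists():
            buckets["new"].append(rel.as_posix())
        elif filecmp.cmp(old_file, new_file, shallow=False):  # byte-for-byte comparison
            buckets["unchanged"].append(rel.as_posix())
        else:
            buckets["changed"].append(rel.as_posix())
    return buckets


# Example call (hypothetical directory names):
# buckets = bucket_files("help_v3.9", "help_v4")
```

Like the filename-first triage described above, this treats a renamed file as brand new, so the "new" bucket deserves a second look before it goes to the vendor.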

Once I've gone through the files, I'll have a better idea of how to label the buckets in a way that meets both my needs and those of the vendor.

At least, I think I'm being too nice a guy. Maybe this is just a big pain for the vendor, and they're too polite to inform me of that.
