A client recently had a problem where content that was sent for translation wasn't translated fully. The returned translation was only half done and appeared to be truncated. After a quick investigation I discovered there was an error in the HTML for a node.
<ahref='...'>link text</a>
(a space was missing between 'a' and 'href')
Browsers often manage to display pages with broken HTML, so it's difficult to notice there's a problem. In this case the link is missing, but the text is all there.
Broken HTML leads to problems
This is what ICanLocalize does when you send content for translation:
- The node title and body text are sent to the ICanLocalize server
- The ICanLocalize server parses the HTML
- The HTML parser extracts the text for translation
The HTML parser extracts the text in such a way that translators only have to edit text and not HTML tags. While translators are editing, a preview panel shows them how the translated document would appear.
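To give a feel for the extraction step, here is a minimal sketch using Python's standard html.parser module. It is not the ICanLocalize parser, just an illustration of pulling the text out of the markup; the names TextExtractor and extract_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text segments of a document and leave the tags out."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        # Keep only segments that contain something other than whitespace.
        text = data.strip()
        if text:
            self.segments.append(text)


def extract_text(html):
    """Return the list of text segments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.segments


print(extract_text("<h2>Title</h2><p>Hello <b>world</b></p>"))
```

A lenient parser like this one should treat a malformed tag as an unknown element rather than stop, so the text survives even when the link itself is lost. A stricter parser may give up at that point, which is how a translation can end up truncated.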
Our parser is fairly robust but obviously in this situation it failed.
Remember that the ICanLocalize server is not the only computer to process your pages. Search engine crawlers (Googlebot, for example) read your pages and try to make sense of them. When they encounter broken HTML, they get confused. Parts of a page, or even entire pages, can be lost if search engines cannot process them and cannot follow links.
Make sure your HTML is valid
Before sending content for translation, it's always a good idea to make sure the HTML in your nodes is valid. I've done a quick search for Drupal markup validation modules and this one looks useful: http://drupal.org/project/w3c_validator
You can use the validator directly: http://validator.w3.org/
My personal favorite is the Firefox HTML validator. It's pretty simple. Green means GO, red means no-go and yellow means check. You get instant validation for entire pages as soon as you save the first draft.
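If you have many nodes to check, a rough scripted pre-check can catch the most obvious mistakes before you run a real validator. The sketch below, again using Python's standard html.parser, only looks for closing tags that don't match the last opened tag. It is no substitute for the W3C validator, and it is deliberately strict: HTML allows some closing tags (such as </p> or </li>) to be omitted, and this check will flag those too. The names TagBalanceChecker and find_tag_problems are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and record closing tags that don't match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)


def find_tag_problems(html):
    """Return a list of human-readable problems; empty if none were found."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    unclosed = ["unclosed <%s>" % tag for tag in checker.open_tags]
    return checker.problems + unclosed


# The missing space turns the opening tag into an unknown element,
# so the closing </a> has nothing to match.
for problem in find_tag_problems("<p><ahref='page.html'>link text</a></p>"):
    print(problem)
```

An empty list only means this simple check found nothing; the page can still be invalid in ways a full validator would report.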