Hi Intellego!

What is achievable?

Translation quality like Google’s

Nope, not by a long shot. Google has several key things Mozilla doesn’t have:

  • Data: they parse every page on the web, every book ever printed and much more. This gigantic amount of data boosts translation quality and isn’t available to us.
  • Engine: they have their proprietary Google Translate engine, which they have been tweaking for ten years. They’re probably ahead of the open-source engines out there.
  • Infrastructure: putting a translation model in 100 GB of RAM? Peanuts for Google. They can easily parallelize huge tasks and have enormous computing power at their fingertips.
  • Manpower: lots of smart PhDs who have spent many years sharpening their expertise in this field.
  • Money: the fuel that keeps all of the above running.

Better translations for minority languages

One of the slogans of Intellego was “translation is not a commodity”, in the sense that major languages have better translation systems than relatively “exotic” languages. This was attributed to a lack of research on such languages, driven by low ROI. That may be true in part, but the major reason is probably something else: the lack of linguistic resources.

The more data you feed into a translation system, the better it will perform. With the millions (billions) of parallel English-French sentences available to build the models, it goes without saying that they are pretty good. With only a few thousand translated sentences for a language like Javanese, building translation models from them will result in pretty poor (crappy) translation quality.

Modest translations integrated into Firefox / as a web service

Yes, this is actually achievable, provided that you pay some expert to do it. At least if you want something decent.

Do we really need to hire someone? Well, at least I think so. You need a lot of knowledge and know-how to set up such a system, and you are probably not going to get there with a bunch of volunteers. It’s not something you can hack together in a week of coding. I think the fact that nothing is available after nearly a year of Intellego kind of proves the point.

It will also require proper infrastructure to work, not a small server in a basement. Here, too, a budget is involved.

On the brighter side, if Mozilla can pull it off, they could reap the benefits too: user retention, mainly.

 

Planning

Infrastructure

1. Where is the infrastructure?

Machine translation engines are hungry beasts. Ideally, you’ll need:

  • one or more servers per language pair, with enough RAM to hold all the models in memory.
  • machines to compute the models used by the translation engines (also called “training”) and to run experiments.

To get a satisfying, efficient system, the servers have to be correspondingly powerful. This is no place for low-balling with 1 GB RAM servers: they’ll quickly run out of memory or lag like crazy.

While the translation service itself should run on dedicated servers, the training and experimentation machines may very well take advantage of cloud infrastructure, as the need for computing power varies greatly over time depending on the training runs and experiments being performed.

So, who takes care of that? Where are the servers? …it goes without saying that this comes with a budget.

 

Resources

1. Which corpora?

Gathering all the bilingual resources you can find is a task in itself. This is a good place for the community to help, as it requires no prerequisite knowledge and taps into a large pool of individuals who may know of this or that bilingual text.

Who is in charge of this? Where is the website listing the available corpora?

Not only bilingual but also monolingual resources are used to build models, in particular language models. As usual, the more data you have, the better the translation will be. This can amount to a rather large volume of text. Just to give you a rough idea:

http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html

Remember, this was 2006 …a long, long time ago. Their English sample covered 1,024,908,267,229 words, and the 5-grams extracted from it amounted to 24 GB of compressed data. Now, more than 8 years later, it’s likely to be much more. This is just to give you a rough idea of the dimensions of the data we’re dealing with.

Gathering data blindly is not particularly useful either. If you scrape every forum, you are probably going to end up with translations like: “u can translate 4 fun butt it suckz”. On the other hand, pull only legal sites and the translation engine will speak like an attorney …not particularly helpful either. Final words: quality matters.
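To make “quality matters” a bit more concrete, here is a rough Python sketch of the kind of cheap sanity filters a crawled bilingual corpus typically goes through before anything serious happens to it. The function name and all thresholds are made up for illustration, not taken from any existing tooling.

```python
# Sketch of cheap quality filters for crawled sentence pairs.
# All thresholds are illustrative guesses, not tuned values.

def looks_reasonable(src: str, tgt: str) -> bool:
    """Keep a sentence pair only if it passes a few crude sanity checks."""
    src_tokens, tgt_tokens = src.split(), tgt.split()

    # Reject empty or extremely short/long sentences.
    if not (3 <= len(src_tokens) <= 80 and 3 <= len(tgt_tokens) <= 80):
        return False

    # Reject pairs whose length ratio suggests a misalignment.
    ratio = len(src_tokens) / len(tgt_tokens)
    if ratio < 0.5 or ratio > 2.0:
        return False

    # Reject lines that are mostly digits, markup or other non-letters.
    for text in (src, tgt):
        letters = sum(ch.isalpha() for ch in text)
        if letters < 0.5 * len(text):
            return False
    return True


pairs = [
    ("u can translate 4 fun", "lol"),
    ("The committee approved the proposal.", "Le comité a approuvé la proposition."),
]
print([(s, t) for s, t in pairs if looks_reasonable(s, t)])
# Only the second, cleaner pair survives.
```

Register problems (the “speaks like an attorney” issue) are harder to fix; that is usually about balancing the mix of domains rather than filtering line by line.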

 

2. Processing corpora

How you process your raw data is hugely important. Poor processing will break the best translation engine and make it look like crap.

What’s processing corpora? Filtering bullshit out of bilingual corpora, removing badly aligned sentences and non-UTF-8 symbols, homogenizing symbols like quotes and accents… up to tokenization itself, which is a black art of its own. Since this is the beginning of the pipeline, it affects the quality of everything up to the final translation. There are a lot of nitpicky details in there, and they can have a surprising impact on the final quality of the system.

Lastly, it goes without saying that all resources should be tokenized the same way. Using corpora processed in different ways is possible, but it is not optimal.
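Just to illustrate what this looks like in practice, here is a naive Python sketch (my own, not anything Intellego uses) that fixes the usual suspects: invalid UTF-8, the many flavours of quotes, inconsistent Unicode forms, plus a deliberately simplistic tokenization. A real pipeline would use a proven tokenizer, such as the scripts shipped with Moses.

```python
import re
import unicodedata

# Naive normalization/tokenization sketch. A real pipeline would rely on a
# battle-tested tokenizer (e.g. the Moses tokenizer scripts) instead.

QUOTE_MAP = str.maketrans({
    "“": '"', "”": '"', "„": '"',   # curly double quotes -> plain
    "‘": "'", "’": "'",             # curly single quotes -> plain
    "«": '"', "»": '"',             # guillemets -> plain
})

def normalize(raw: bytes) -> str:
    # Drop bytes that are not valid UTF-8 instead of choking on them.
    text = raw.decode("utf-8", errors="ignore")
    # Give accented characters one canonical Unicode form.
    text = unicodedata.normalize("NFC", text)
    # Homogenize the many flavours of quotation marks.
    text = text.translate(QUOTE_MAP)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    # Deliberately simplistic: lowercase, split off punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())

raw = "« Bonjour », dit-il. It’s “fine”.".encode("utf-8")
print(tokenize(normalize(raw)))
```

The point of doing this once, centrally, is exactly the consistency issue above: every corpus that feeds the models must go through the same normalization and the same tokenizer.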

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

 

3. Sentence alignment

Do you want to do that yourselves, or only use pre-aligned bilingual corpora? If the former, who does it? What tools are you using for it? Who verifies the output and ensures the quality?
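For what it’s worth, sentence alignment is usually done with length-based methods (Gale & Church) or tools like hunalign. The toy sketch below only shows the length-based intuition on a tiny scale: it allows 1-1 matches plus skips and minimizes character-length differences, nothing more. It is not a usable aligner.

```python
# Toy length-based sentence aligner (Gale & Church intuition only).
# It allows 1-1 matches plus insertions/deletions and minimizes a cost
# based on character-length differences. A real project would use a
# proper aligner (hunalign, Gale-Church implementations, etc.).

def align(src_sents, tgt_sents, skip_penalty=50):
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    # cost[i][j]: best cost aligning the first i source and j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match: cost = length difference
                c = cost[i][j] + abs(len(src_sents[i]) - len(tgt_sents[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n:            # source sentence left unaligned
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "skip_src")
            if j < m:            # target sentence left unaligned
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "skip_tgt")
    # Backtrack to recover the 1-1 pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((src_sents[pi], tgt_sents[pj]))
        i, j = pi, pj
    return list(reversed(pairs))

print(align(["Hello.", "How are you today?"],
            ["Bonjour.", "Comment allez-vous aujourd'hui ?"]))
```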

 

Derivative Resources

1. N-Grams

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?
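Conceptually, n-gram extraction is as simple as the sketch below; the hard part is doing it at the volumes mentioned earlier, which is why dedicated tools (the counting built into KenLM, SRILM, etc.) are used instead of a Python dictionary.

```python
from collections import Counter
from itertools import islice

# Toy n-gram counting; real pipelines use dedicated tools because the
# counts for a serious corpus do not fit in a Python dictionary.

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def count_ngrams(sentences, n=3):
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(ngrams(tokens, n))
    return counts

corpus = ["the cat sat on the mat", "the cat ate the fish"]
for gram, freq in count_ngrams(corpus, n=3).most_common(3):
    print(freq, " ".join(gram))
```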

2. Word alignment

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?
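In practice this step means running GIZA++ or fast_align over the tokenized parallel corpus. To show what “word alignment” actually learns, here is a toy sketch of IBM Model 1 expectation-maximization, the simplest member of that family; it is purely illustrative and nowhere near production tools.

```python
from collections import defaultdict

# Toy IBM Model 1: learn word translation probabilities t(f | e) by EM.
# Real pipelines run GIZA++ or fast_align over millions of sentence pairs.

def train_ibm1(pairs, iterations=10):
    # pairs: list of (english_tokens, foreign_tokens); add a NULL source word.
    pairs = [(["NULL"] + e, f) for e, f in pairs]
    e_vocab = {w for e, _ in pairs for w in e}
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # uniform init for t[(f, e)]

    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f, e)
        total = defaultdict(float)                # expected counts c(e)
        for e_sent, f_sent in pairs:              # E-step
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():           # M-step
            t[(f, e)] = c / total[e]
    return t

pairs = [("the house".split(), "la maison".split()),
         ("the book".split(), "le livre".split()),
         ("a house".split(), "une maison".split())]
t = train_ibm1(pairs)
# 'maison' should clearly dominate the distribution t(. | house):
print(sorted(((f, round(p, 2)) for (f, e), p in t.items() if e == "house"),
             key=lambda x: -x[1]))
```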

 

Models

1. Language models

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?
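Language models for this kind of system are typically built with KenLM or SRILM over huge monolingual corpora. The toy sketch below only illustrates the underlying idea: an n-gram model with crude add-one smoothing that scores fluent word order higher than scrambled word order. Real tools use far better smoothing, such as modified Kneser-Ney.

```python
import math
from collections import Counter

# Toy bigram language model with add-one smoothing. Real systems use
# KenLM/SRILM with Kneser-Ney smoothing over billions of words.

class BigramLM:
    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for s in sentences:
            tokens = ["<s>"] + s.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, word)] + 1          # add-one smoothing
            den = self.unigrams[prev] + self.vocab_size
            lp += math.log(num / den)
        return lp

lm = BigramLM(["the cat sat on the mat", "the dog sat on the rug"])
print(lm.logprob("the cat sat"))   # higher (less negative) ...
print(lm.logprob("cat the sat"))   # ... than the scrambled word order
```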

2. Phrase tables

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?
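Phrase tables are extracted from the word-aligned corpus; Moses’ training scripts do this corpus-wide as part of their pipeline. The sketch below implements the usual “consistent phrase pair” extraction on a single aligned sentence, just to show what kind of entries end up in the table (it skips the handling of unaligned words a real extractor has).

```python
# Toy phrase-pair extraction from one word-aligned sentence pair,
# following the usual "consistent phrase" criterion used when building
# phrase tables.

def extract_phrases(src, tgt, alignment, max_len=4):
    """alignment: set of (src_index, tgt_index) word alignment points."""
    phrases = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target indices linked to this source span.
            t_points = [t for s, t in alignment if s_start <= s <= s_end]
            if not t_points:
                continue
            t_start, t_end = min(t_points), max(t_points)
            if t_end - t_start >= max_len:
                continue
            # Consistency: no word inside the target span may be aligned
            # to a source word outside the source span.
            consistent = all(s_start <= s <= s_end
                             for s, t in alignment if t_start <= t <= t_end)
            if consistent:
                phrases.add((" ".join(src[s_start:s_end + 1]),
                             " ".join(tgt[t_start:t_end + 1])))
    return phrases

src = "the black cat".split()
tgt = "le chat noir".split()
alignment = {(0, 0), (1, 2), (2, 1)}   # the-le, black-noir, cat-chat
for pair in sorted(extract_phrases(src, tgt, alignment)):
    print(pair)
```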

3. Other models

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

 

Translation engines

…it’s not “download, install, run”; you will have to learn a lot of stuff to get one running well.
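To give a feel for the moving parts, here is a deliberately tiny, hypothetical sketch of greedy, monotone phrase-based decoding: it combines made-up phrase-table entries with a crude unigram language model. A real engine such as Moses adds reordering, beam search over hypotheses, several feature functions and tuning of their weights; that is exactly the stuff you will have to learn.

```python
import math

# Tiny greedy, monotone phrase-based "decoder": walk the source left to
# right and at each position pick the phrase whose translation probability
# times a (length-normalized) unigram LM score is best. Real engines do
# beam search with reordering, several models and tuned weights.

PHRASE_TABLE = {                       # hypothetical p(target | source)
    ("le",): [("the", 0.7)],
    ("chat",): [("cat", 0.5), ("chat", 0.1)],
    ("chat", "noir"): [("black cat", 0.9)],
    ("noir",): [("black", 0.7)],
}
LM = {"the": 0.2, "cat": 0.1, "black": 0.1, "chat": 0.001}  # toy unigram LM

def lm_score(phrase: str) -> float:
    words = phrase.split()
    # Geometric mean so longer phrases are not unfairly penalized.
    return math.prod(LM.get(w, 1e-4) for w in words) ** (1 / len(words))

def greedy_decode(source_tokens, max_phrase_len=2):
    out, i = [], 0
    while i < len(source_tokens):
        best = None
        for length in range(1, max_phrase_len + 1):
            src_phrase = tuple(source_tokens[i:i + length])
            for tgt_phrase, p in PHRASE_TABLE.get(src_phrase, []):
                score = p * lm_score(tgt_phrase)
                if best is None or score > best[0]:
                    best = (score, tgt_phrase, length)
        if best is None:               # unknown word: copy it through
            out.append(source_tokens[i])
            i += 1
        else:
            out.append(best[1])
            i += best[2]
    return " ".join(out)

print(greedy_decode("le chat noir".split()))   # -> "the black cat"
```

Even this toy hints at why real decoders search instead of being greedy: a locally best phrase choice can easily lead to a globally worse sentence.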
