Ten Years of TAUS Data Cloud Taught Us How to Fix the Data Gap

by Jaap van der Meer 7 Jan 2019

Since the launch of the Data Cloud, TAUS has gathered many insights from the industry's data experts regarding their needs and expectations. Our solution to fix the data gap is the TAUS Matching Data service: a high-performance clustered search methodology, based on data selection techniques.

Keywords: Data Cloud, industry-shared language data, TAUS Matching Data service, data gap, global translation sector, clustered search methodology, data selection techniques, statistical phrase-based MT engines, neural MT, Crawled Data, TAUS Matching Data Library, Data Test Bed

In 2008, TAUS launched the Data Cloud, the first (and only) industry-shared language data repository. It was a revolutionary idea at the time, trusted and funded by a group of 45 of the largest global tech companies and their language service providers. They set two simple rules. The first was that copyright remains with the data owner who makes data available for others to exploit in whatever way they like. The second was that anyone who uploads data earns credits to download data from other members and users. This reciprocal model has proven to be successful: in ten years the TAUS Data repository grew to 70 billion words in 2,300 language pairs and has helped many of the leading MT suppliers to train and improve their engines.

And yet, as we have also learned over the years, there are still things that can be improved. Besides, times have changed, and with them our data requirements. Following consultations with the member community, TAUS has embarked on a roadmap that will help to fix the data gap in the global translation sector. Below we share a preview of this roadmap.

1. Matching Data on TAUS Data Cloud

One of the major drawbacks of the current TAUS Data Cloud is the difficulty of finding exactly the right data in the very large repository. Until now, the only way to select data has been to choose one of the 17 preset industry categories, which the data owner assigned as the relevant attribute when the dataset was first uploaded. This is not very reliable, as datasets can be very diverse and contain text from multiple domains.

That relatively rough data selection method worked well for statistical phrase-based MT engines, where more data was always better data (as they say). It also served the early days of MT development quite well, because the focus was mostly on building generic and better-performing baseline engines. Now, with the new generation of neural MT, we are all looking for cleaner, higher-quality and more specific data that can help us customize our engines and make them work better for our own companies and products.

As a result, TAUS is launching Matching Data. This is a high-performance clustered search methodology, based on data selection techniques developed in the DatAptor project. Here is how it works: based on a sample mini-corpus we identify the best matching data, on a segment-level basis, from the entire data repository. We can increase or decrease the selections by adjusting the matching rates.
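To make the idea of segment-level matching against a sample mini-corpus more concrete, here is a minimal sketch. It is not the DatAptor or TAUS implementation: it assumes a plain bag-of-words cosine similarity as the matching score, and the names `tokenize`, `cosine`, `select_matching` and the `threshold` parameter are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a segment and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_matching(query_corpus, repository, threshold=0.3):
    """Keep repository segments that score at or above the threshold.

    Assumption: the whole query corpus is collapsed into one word-frequency
    profile, and each repository segment is scored against that profile.
    """
    profile = Counter(tok for seg in query_corpus for tok in tokenize(seg))
    scored = [(cosine(Counter(tokenize(seg)), profile), seg) for seg in repository]
    return [seg for score, seg in scored if score >= threshold]


query = ["book a hotel room", "hotel room with breakfast"]
repo = ["the hotel room was clean", "install the printer driver"]
selected = select_matching(query, repo, threshold=0.3)
```

In this toy setup, raising `threshold` shrinks the selection and lowering it widens the selection, which loosely mirrors adjusting the matching rate described above.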

The customized corpora built with the Matching Data service are collections of segments extracted from all datasets in the TAUS Data Cloud. This tailored approach to data selection significantly reduces the data volume requirements and fine-tunes the MT training process. By way of reference: the sample mini-corpus provided by the user should contain ten to twenty thousand segments, monolingual or bilingual, representative of the specific domain. Experience so far has shown that we can condense the data selection by a factor of ten.

Matching Data as a service is now available! We plan to make Matching Data available as a new automated feature on the TAUS Data Cloud in Q3 of 2019.

2. Matching Data on TAUS Crawled Data

Crawling data from the web has been the primary source of data collection for all of the large MT development companies and projects. TAUS has also built up extensive experience in web crawling and the development of bilingual corpora as a partner in several European Commission funded projects: MosesCore, MMT and ParaCrawl. The volumes of aligned text harvested from the web add up to hundreds of billions of words. While these resources are of course overwhelming, the concern is that the crawled data may not be consistent, clean or specific enough.

Therefore, TAUS offers to extend the Matching Data service to the vast volume of crawled data available to TAUS. Where and when necessary, the web crawling can be extended as a customized service. This way we can expand to virtually all domains and, in doing so, fix a crucial data gap.

Matching Data on TAUS Crawled Data will be available as a service from January 2019.

3. TAUS Matching Data Library

This Matching Data service leads to a unique new offering that TAUS plans to release in collaboration with its users and members - we call it the TAUS Matching Data Library. Let’s see this service in action: a TAUS member, whose business is travel and hotel bookings, uses the Matching Data service (both on TAUS Data Cloud and on Crawled Data) and TAUS builds a set of corpora tuned to the Hospitality Domain in fifteen different languages. These domain corpora are then offered to other members and users in the TAUS Matching Data Library. The domain corpora are offered at a fixed price or can be purchased with credits that members have acquired by uploading their own data. The member or user who initiates a new domain corpus, and provides the Query Corpus that starts the clustered search, gets a discount on the purchase of the first corpora in the new domain.

The TAUS Matching Data Library will be available on the TAUS Data Cloud from December 2018.

4. Data Test Bed

“What if….”, many users asked us, “we could immediately test whether an engine trained with the domain-specific corpora delivers better translation results for our own documents?” TAUS plans to make this possible with a Data Test Bed service. NMT engines trained with the corpora from the Matching Data Library will be accessible through the TAUS Data Cloud, allowing users to measure the impact of the new data on their own documents with standard metrics such as BLEU.

The TAUS MT Test Bed will be launched on the TAUS Data Cloud in Q2 of 2019.

5. Cleaning Data

Another stumbling block in sharing data and using it to train MT engines is a lack of confidence that the data we obtain are ‘clean’. Concerns about ‘cleanliness’ range from doubts about whether the language identifiers are correct, whether the segments are properly aligned, or whether a segment is translated at all, to fears that the translations are of poor quality due to typos, spelling errors, terminology mistakes, and perhaps MT hallucinations.

Consequently, TAUS will launch Cleaning Data, first as a service together with the Matching Data service, and later as an online and automated feature on the TAUS Data Cloud. The Cleaning Data service will be based on open-source tools such as bicleaner, TMop, Zipporah, Okapi CheckMate, and also on some proprietary tools developed by TAUS.
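The kinds of checks mentioned above can be illustrated with a few simple heuristics. The sketch below is not how bicleaner, TMop, Zipporah, Okapi CheckMate or the TAUS tools work; it only assumes three basic filters (empty side, identical source and target, extreme length mismatch), and the function names and the `max_ratio` default are hypothetical.

```python
def is_clean(source, target, max_ratio=3.0):
    """Apply three simple checks to a (source, target) segment pair."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src.lower() == tgt.lower():
        return False  # target equals source: possibly left untranslated
    # A large character-length mismatch can hint at a misaligned pair.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


def clean_corpus(pairs, max_ratio=3.0):
    """Return only the segment pairs that pass every check."""
    return [(s, t) for s, t in pairs if is_clean(s, t, max_ratio)]


pairs = [
    ("Good morning", "Buenos días"),
    ("Check-in", "Check-in"),
    ("Yes", "El desayuno se sirve de siete a diez de la mañana"),
    ("Room", ""),
]
kept = clean_corpus(pairs)
```

Heuristics like these are deliberately crude: identical source and target can be legitimate (brand names, for instance), so a real pipeline would likely flag such pairs for review rather than drop them outright.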

Cleaning Data will be available as a service from January 2019 and will be incorporated as a feature on the TAUS Data Cloud in Q3 of 2019.

6. Anonymizing Data

Language service providers, translators and translation buyers would feel a lot more comfortable sharing their language data if there were a way of automatically filtering out all personally identifiable information (PII). This concern has only grown, particularly in Europe, since the new General Data Protection Regulation came into force in May 2018.

Therefore, TAUS will launch an Anonymizing Data service. The anonymization is cast as a substitution task, whereby sensitive information is replaced with specific placeholders. Among all possible PII, we will specifically focus on proper names, email addresses, URLs, long integers, codes, and addresses. The Anonymizing Data service will be based on open-source tools from ParaCrawl and proprietary tools from the TAUS data pipeline. To increase the anonymization coverage and meet users’ requirements, we will support loading a list, supplied by the data provider, of PII items that need to be obscured.
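As an illustration of anonymization as a substitution task, the sketch below replaces a few PII types with placeholders. It is an assumption-laden toy, not the ParaCrawl or TAUS tooling: the regular expressions, the placeholder names (`<EMAIL>`, `<URL>`, `<NUMBER>`, `<PII>`) and the choice of six or more digits as a "long integer" are all hypothetical. Proper names are only covered through the provider-supplied list here.

```python
import re

# Hypothetical patterns; real PII detection needs far more care.
PATTERNS = [
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<URL>", re.compile(r"(?:https?://|www\.)\S+")),
    ("<NUMBER>", re.compile(r"\b\d{6,}\b")),  # assumption: 6+ digits is "long"
]


def anonymize(segment, custom_terms=()):
    """Replace provider-listed terms and pattern matches with placeholders."""
    # First obscure the terms explicitly supplied by the data provider.
    for term in custom_terms:
        segment = re.sub(re.escape(term), "<PII>", segment, flags=re.IGNORECASE)
    # Then apply the generic patterns in a fixed order.
    for placeholder, pattern in PATTERNS:
        segment = pattern.sub(placeholder, segment)
    return segment


text = "Contact Jane Doe at jane.doe@example.com or visit https://example.com/book ref 12345678."
masked = anonymize(text, custom_terms=["Jane Doe"])
# masked == "Contact <PII> at <EMAIL> or visit <URL> ref <NUMBER>."
```

Using typed placeholders rather than simply deleting the text keeps the segment grammatical, which matters if the anonymized data is later used for MT training.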

Anonymizing Data will be available as a service from Q2 2019 and will be incorporated as a feature on the TAUS Data Cloud in Q4 of 2019.

7. Marketplace

Finally, a real inhibitor to language service providers, translators and translation buyers sharing their language data is the lack of incentives, at least under the current reciprocal business model of the TAUS Data Cloud. Yes, they can earn credits if they upload data, but unless they are in the business of MT, these credits do not hold much value for them.

For that reason, TAUS plans to replace the reciprocal model with a marketplace model. Under the marketplace model, a data owner receives a monetary reward every time another user buys their data. Since the Matching Data feature has changed the scale from trading complete files or datasets to selecting and exchanging segments, implementation of the marketplace model will require a micro-payment system. Imagine for instance the Hospitality Domain corpus in the TAUS Matching Data Library in the English to Spanish language pair, which may contain five million words sourced from twenty different translators and translation agencies. The buyer pays one price, for instance 5,000 Euro, but this amount needs to be distributed among all twenty data owners in accordance with their contributions to the corpus.
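The arithmetic behind such a distribution can be sketched as follows. The article does not say how contributions will be measured, so this example assumes a split proportional to word count; the function name `split_payment` and the owner names are hypothetical. Amounts are handled in cents, and leftover cents go to the owners with the largest remainders so the shares always add up to the price paid.

```python
def split_payment(price_cents, words_by_owner):
    """Split a price among owners in proportion to their word counts.

    Assumes at least one owner contributed a nonzero number of words.
    """
    total_words = sum(words_by_owner.values())
    shares = {}
    remainders = []
    for owner, words in words_by_owner.items():
        whole, rem = divmod(price_cents * words, total_words)
        shares[owner] = whole
        remainders.append((rem, owner))
    # Hand out the cents lost to rounding, largest remainder first.
    leftover = price_cents - sum(shares.values())
    for _, owner in sorted(remainders, reverse=True)[:leftover]:
        shares[owner] += 1
    return shares


# 5,000 Euro for a five-million-word corpus, with three owners for brevity.
payout = split_payment(
    500_000,
    {"agency_a": 2_500_000, "agency_b": 1_500_000, "translator_c": 1_000_000},
)
# payout == {"agency_a": 250000, "agency_b": 150000, "translator_c": 100000}
```

With segment-level selection, the word counts per owner would differ for every purchased corpus, which is why the payout has to be computed per transaction.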

It is not clear yet when exactly the current reciprocal model will be dropped and replaced by a full Data Marketplace.

What else…?

What else could be improved on the TAUS Data Cloud? We don’t ultimately know. Yes, we know people are concerned about the legal aspects, copyright issues, and confidentiality of data. But there isn’t much that we can do about that from the TAUS side. We have a solid legal framework in place: the TAUS Data Provider & Pooling Conditions. These conditions were set up and agreed upon in 2008 by the legal counsels of the 45 founding members of the TAUS Data Cloud, and have governed the exchange of data for the past ten years without any claims or flaws. All that new data providers need to do is check whether they own the copyright to the data that they want to upload, and if they don’t, ask permission from the data owner.

If there’s anything else that TAUS can do to fix the data gap, please let us know by commenting below or writing directly to data@taus.net.
