The availability of language and translation data sets, both in high-demand, business-oriented language pairs and in smaller, usually less-resourced languages, is crucial for the language and localization industry, whether the data is general or domain/origin-specific. State-of-the-art translation technology is data-driven: it learns from the data it is given.
In its nearly 10 years of existence, the TAUS Data Cloud has proven to be a valuable resource, giving access to large industry-shared translation data sets in multiple languages, industry domains and content types for different use cases.
So what is the best way to leverage data from the Data Cloud in your projects? What are some of the Data Cloud use cases?
Once you identify the relevant data sets for your specific purposes from the list, you can view random samples of each data set. This “look-inside” option allows you to browse through the parallel data sets to make sure that they meet your expectations regarding data relevance, quality and style, before you decide to download them.
The primary use of Data Cloud generic and domain/origin-specific data is machine translation (MT); more specifically, the training of MT engines and the evaluation and improvement of MT performance. Some use cases can be seen in the overview of the Data Cloud “Capitalizing on Translation Data” (slides 13-16). While the volume of available project data is an important factor (the more data, the better), in many cases the relevance of the data to a project and the quality of the data have proven to be more important for new-generation translation technologies. So the trade-off between quantity, relevance and quality should definitely be considered.
Another way to use the Data Cloud is to create derivatives from its monolingual or multilingual data through curation-related activities such as processing and managing data. Examples are (further) annotation or cleaning of data so that it fits a specific project better, or extraction of useful, project-specific data. A prime example of the latter is the extraction of industry-shared terminology to create term banks or glossaries for different purposes.
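As an illustration of the cleaning step mentioned above, here is a minimal Python sketch. It assumes the downloaded data has already been loaded as a list of (source, target) segment pairs; the function name `clean_parallel` and the `max_ratio` threshold are hypothetical choices made for this example, not part of any TAUS tool.

```python
def clean_parallel(pairs, max_ratio=3.0):
    """Return segment pairs with blanks, duplicates and length outliers removed.

    `pairs` is an iterable of (source, target) strings. `max_ratio` is an
    illustrative threshold for the character-length ratio between the two sides.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())
        # Drop pairs where either side is empty.
        if not source or not target:
            continue
        # Drop pairs whose lengths differ too much to be likely translations.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_ratio:
            continue
        # Drop case-insensitive duplicates, keeping the first occurrence.
        key = (source.lower(), target.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((source, target))
    return cleaned
```

A length-ratio filter is only a rough heuristic; a suitable threshold depends on the language pair, so treat the default value as a starting point to tune on your own data.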
You can also use Data Cloud resources to enrich existing translation memory (TM) repositories for various purposes, such as boosting post-editing (PE) productivity, enhancing autosuggest results, and retrieving more translation references through a search tool.
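To make the idea of retrieving translation references concrete, the following Python sketch performs a simple fuzzy lookup over TM entries. It is an assumption-laden illustration: the function name `tm_lookup`, the in-memory list of (source, target) pairs and the 0.7 threshold are all hypothetical, and the similarity score comes from the standard library's `difflib.SequenceMatcher` rather than from the matching algorithm of any particular TM tool.

```python
from difflib import SequenceMatcher


def tm_lookup(query, memory, threshold=0.7):
    """Return (score, source, target) tuples for TM entries similar to `query`.

    `memory` is an iterable of (source, target) strings. Matches are sorted
    by similarity score, best first. `threshold` is an illustrative cut-off.
    """
    matches = []
    for source, target in memory:
        # Character-based similarity between 0.0 and 1.0, ignoring case.
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            matches.append((score, source, target))
    return sorted(matches, key=lambda match: match[0], reverse=True)


memory = [
    ("Click the Save button.", "Cliquez sur le bouton Enregistrer."),
    ("Restart the printer.", "Redémarrez l'imprimante."),
]
for score, source, target in tm_lookup("Click the Save button", memory):
    print(f"{score:.2f}  {source}  ->  {target}")
```

The more relevant segments the memory contains, the more often a new sentence finds a usable match, which is why adding domain-specific data to a TM can help post-editors.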
Furthermore, you can leverage the Data Cloud resources for natural language processing (NLP) tasks such as cross-lingual information retrieval, text classification, language modeling, image captioning, question answering, speech recognition and document summarization. And finally, you can use the translation data for projects based on translation or comparative language studies, i.e. to explore two different languages within an interdisciplinary and linguistic context on the basis of a parallel domain-specific Data Cloud data set.
The above-mentioned list of use cases of the Data Cloud is by no means exhaustive, as the volume and richness of Data Cloud resources can facilitate many other use cases and purposes.
There are ongoing efforts to collect more data and make it available through the Data Cloud. These efforts include data business development activities within the TAUS Data Cloud roadmap and large-scale web crawling within collaborative EU-funded projects, such as the 3-year ModernMT project (completed by the end of 2017) and the 18-month ParaCrawl project within the Connecting Europe Facility (CEF) program (launched in September 2017). Both projects have translation technology experts on board from research/academic institutions and from the industry.
TAUS is investigating the best possible business model to tackle the issues surrounding availability and accessibility of translation data sets. Is it the current TAUS Data Cloud reciprocal model or a future TAUS Data Market transaction model that can provide the best solution for driving market adoption?
We invite you to follow some of our efforts, news and achievements by reading relevant TAUS reports such as the Translation Data Landscape Report and the Data Market White Paper.