TAUS spoke with Mark Brayan, the CEO of Appen, to find out what he has learnt from his more than twenty-five years of experience in technology and services. The theme of the conversation was: "What is the single biggest lesson that you have learnt about the translation industry?"
Mark, what is the most important lesson you’ve learnt about the language data business?
In the technology business generally, nothing stands still. You always need to think about Act 2 if you don’t want to be disrupted. As more data is generated and distributed, it all ends up being useful somehow. What we need to be thinking about is how these data types can be exploited. So for me, the most important lesson is that you’ve got to disrupt yourself on a daily basis.
Here’s how we do it. We provide training data for machine learning and AI products. Much of the data we provide is language-based, but we also work in other modalities, such as image and video data. Many of these products pose a threat or an opportunity to people in the language industry, augmenting and/or replacing services such as translation, transcription and interpretation.
These products improve with more data, which is good for our business, but data is expensive to capture and prepare. So we’re keeping an eye on how data can be used and re-used in multiple ways to produce useful outcomes.
For example, in speech recognition the acoustic environment is very important, as microphones pick up extraneous noise. This is why in-vehicle speech recognition lags conventional speech recognition: cars are noisy environments. So combining audio speech data with different non-speech sounds gets more use from existing data.
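As a rough illustration of that idea, the sketch below overlays background noise on clean speech at a chosen signal-to-noise ratio, one common way to squeeze more training value out of an existing audio corpus. It is a minimal, hypothetical example, not a description of Appen's actual tooling; the function name, SNR value and synthetic signals are assumptions.

```python
# Minimal sketch: mix clean speech with background noise at a target SNR.
# Illustrative only -- not Appen's pipeline.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise on clean speech at a given signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mix hits the requested SNR.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

# Synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # stand-in for an utterance
cabin = rng.normal(0, 0.3, 8000)                            # stand-in for in-car noise
noisy = mix_at_snr(clean, cabin, snr_db=10.0)
```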
I also look for data exhaust – unexpected data that is thrown off by the data you are actually looking at. For example, by knowing which lights are on and off in a big building you know how much electricity is being used. But this data can be used in more ways than one: it could show which floors or rooms are occupied, and that can be useful for the real estate industry.
What makes data relevant?
The best AI is narrow AI. That is, the closer it is to the real world use case the better it performs. So while you want to exploit what you have as much as possible, you can’t reuse it as much as you hope because each product needs something slightly different. The data needs to be relevant to the problem you’re trying to solve and the product you’re trying to build.
Can you manufacture data yourself, in the speech domain for example, or do you have to use the crowd?
The problem is that everyone’s voice is unique, which makes voice biometrics a useful technology by the way. What our clients are looking for is as much variety as possible in data sets to cover all situations. Potentially you could generate a voice stream artificially with speech synthesis, but it would not have that richness you get in natural voices. Speech synthesis is improving but using synthesized data to train an engine means that you are not really training it for the real world.
How is a typical data job carried out?
We get client requests for speech data that are tied to their product development lifecycle. They might want five thousand hours of data for a speech recognizer in UK English and Dutch as well as in locally-accented English for the NL market. We collect this, for example, by reaching out to the crowd who then talk into mobile phones for us through an app. Or in some cases such as vehicle collections, we have a project manager on the ground who gets a person to drive around and talk naturally so we get the background noise. We then prepare this by transcribing it and applying other annotations, check the quality and send it to the client. The price of the data is determined by such features as volume, complexity of the collection process, and the extent of the annotations.
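To make the shape of such a request concrete, here is a hypothetical job specification with a toy pricing formula driven by the factors Mark mentions: volume, complexity of the collection process, and the extent of the annotations. The field names, rates and multipliers are invented for illustration and do not reflect Appen's real systems or prices.

```python
# Illustrative job spec and cost estimate for a speech data collection.
# All fields and rates below are hypothetical.
from dataclasses import dataclass

@dataclass
class CollectionJob:
    locale: str              # e.g. "en-GB", "nl-NL", "en-NL" (Dutch-accented English)
    hours: float             # hours of audio requested
    environment: str         # "studio", "mobile app", "in-vehicle", ...
    annotations: list[str]   # e.g. ["transcription", "speaker-id", "noise-tags"]

def estimate_price(job: CollectionJob) -> float:
    """Rough per-hour pricing driven by volume, collection complexity and annotation depth."""
    base_rate = 60.0                                    # hypothetical cost per audio hour
    complexity = {"studio": 1.0, "mobile app": 1.2, "in-vehicle": 1.8}.get(job.environment, 1.5)
    annotation_load = 1.0 + 0.4 * len(job.annotations)  # each annotation layer adds effort
    return job.hours * base_rate * complexity * annotation_load

job = CollectionJob("en-NL", hours=5000, environment="in-vehicle",
                    annotations=["transcription", "noise-tags"])
print(f"Estimated price: ${estimate_price(job):,.0f}")
```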
With the breakthrough of NMT, can we expect that Appen will enter this space as well?
The barrier to entry in our business is not just about having a crowd, though this is a necessity of course. Our real value-add is the underlying technology that makes it all happen. That is the same for a large translation company with respect to the whole process of segmenting work, sending, checking, recompiling etc. We too have complex in-house systems and crowd onboarding tools. So what stops Appen from going into translation at scale is the same as what would make it difficult for a translation company to come into our area: the power of a specific technology.
If all data could be relevant in some way one day, what types of metadata would you like to have to access it, and how do you build the crowd to provide the data?
It has taken many years to build up our data stock, and the profiles of contributors vary widely. We have to call on a bigger pool than the translation business as we need specific demographics for some work such as relevance judging – e.g. people who speak a given language and live in a particular place, satisfy demographic requirements and so on. Also crowds age over time, so profiles change. We are constantly recruiting new crowd workers to have the right active cohort. And the requirements get complex. Working with children’s voices needs special protection, of course, such as carers or parents in the room when recording. We may also have to provide data from very specific representative samples of the population, such as “Spanish speakers who have been in the US for 10 years.” Meeting quality standards like this can be complicated.
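As a brief, hypothetical sketch, screening a contributor pool against a brief like “Spanish speakers who have been in the US for 10 years” might look something like the following; the record fields and sample data are assumptions made purely for illustration.

```python
# Minimal sketch of filtering a contributor pool against a demographic brief.
# Fields and example data are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Contributor:
    languages: set[str]
    country: str
    years_in_country: int
    age: int

def matches_brief(c: Contributor) -> bool:
    """Spanish speakers, resident in the US for at least 10 years, adults only."""
    return ("es" in c.languages
            and c.country == "US"
            and c.years_in_country >= 10
            and c.age >= 18)   # minors need extra safeguards, so exclude them here

pool = [
    Contributor({"es", "en"}, "US", 12, 34),
    Contributor({"es"}, "MX", 3, 29),
    Contributor({"es", "en"}, "US", 15, 16),
]
cohort = [c for c in pool if matches_brief(c)]
print(len(cohort))  # -> 1
```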
How about changes you are seeing in the language spread?
Smart phone manufacturers are always on the look-out for speech data in highly populous developing nations. Yet they are still collecting US-accented English, which is the largest market for a big range of products. The US market is still very lucrative for our customers. At the same time, China is probably the largest AI market outside the US, so it offers a key opportunity for us too.
What's next for Appen – competition or consolidation?
Our central concern looking forward is to continually add value to our customers, staff and shareholders. We believe that we are in a good position, working at scale in data collection for AI, with very deep expertise in language and speech data collection. We’ve collected data from more than 130 countries and 180 languages. There are competitors in our space, but the breadth and scale of what we do, combined with over 20 years of experience, sets us apart.
At the same time, just like everyone else, we have to accept that technology is going to impact our business and others more and more, so we need to stay ahead of that. ‘Act 2’ in the world of language services will be fascinating!