I recently had the good fortune to participate in the BabelNet Workshop, organized by the European Commission, the Publications Office and the European Parliament in Luxembourg on the 2nd and 3rd of March.
The workshop provided valuable insights and a guided technical tour of BabelNet, the largest online multilingual encyclopedic dictionary and lexicalized semantic network, and Babelfy, the state-of-the-art multilingual disambiguation and entity linking system based on BabelNet.
The intense two-day workshop program enabled participants from EU institutions, industry and academia to gain hands-on experience with the different ways to access and leverage BabelNet and Babelfy, to disambiguate text written in mixed languages using the so-called language-agnostic setting as well as to explore ways of linking other resources to BabelNet.
BabelNet’s huge potential for research and industry applications was discussed in a truly interactive atmosphere. Three case studies demonstrated how to make EU resources, namely IATE, EUROVOC and Euramis, more effective by linking them to BabelNet. Speakers from industry and academia gave interesting ‘Lightning talks’, explaining how they used BabelNet for improving text alignment, text classification, sentiment analysis and other areas.
BabelNet is a result of the 5-year MultiJEDI ERC Starting Grant (2011-2016), headed by Prof. Roberto Navigli of Sapienza University of Rome. The project has received funding from the European Union's specific program ‘Ideas’ under the 7th Framework Program. Its two main objectives were to create large-scale lexical resources for dozens of languages and enable multilingual text understanding.
BabelNet has been created by the automatic seamless integration of resources such as WordNet, Wikipedia, Wikidata, Wiktionary, OmegaWiki and GeoNames, and the use of statistical machine translation (SMT) to acquire a large amount of multilingual concept lexicalizations. It currently covers 14 million concepts and named entities (NEs) lexicalized in 271 languages. The multilingual lexicalizations (i.e. words) are grouped into sets of synonyms called Babel synsets.
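To make the idea of a Babel synset concrete, here is a minimal Python sketch of one possible in-memory representation. It is not BabelNet's actual data model or API: the class name `BabelSynset`, the `toy:` identifiers and the sample words are all assumptions made up for illustration. It shows how one concept groups its lexicalizations per language, and how the same word can point to more than one synset.

```python
from dataclasses import dataclass, field


@dataclass
class BabelSynset:
    """One concept or named entity with its words in several languages."""
    synset_id: str  # made-up identifier, not a real BabelNet ID
    lexicalizations: dict = field(default_factory=dict)  # language -> set of words

    def add(self, lang, word):
        """Register a word as a lexicalization of this concept in a language."""
        self.lexicalizations.setdefault(lang, set()).add(word)

    def words(self, lang):
        """Return the synonyms for this concept in one language, sorted."""
        return sorted(self.lexicalizations.get(lang, set()))


def build_sense_index(synsets):
    """Map (language, lowercased word) to the IDs of all synsets containing it."""
    index = {}
    for synset in synsets:
        for lang, words in synset.lexicalizations.items():
            for word in words:
                index.setdefault((lang, word.lower()), set()).add(synset.synset_id)
    return index


# Two toy concepts that share the English word "bank".
institution = BabelSynset("toy:bank_institution")
institution.add("EN", "bank")
institution.add("IT", "banca")
institution.add("DE", "Bank")

riverside = BabelSynset("toy:bank_river")
riverside.add("EN", "bank")
riverside.add("EN", "riverbank")
riverside.add("IT", "riva")

index = build_sense_index([institution, riverside])
senses_of_bank = index[("EN", "bank")]  # both toy synsets: the word is ambiguous
```

Looking up an ambiguous word returns several candidate synsets; choosing the right one in context is the disambiguation task that Babelfy addresses.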
BabelNet is fully integrated with Babelfy, a unified graph-based approach to multilingual entity linking and word sense disambiguation, and with the Wikipedia Bitaxonomy, a state-of-the-art taxonomy of Wikipedia pages aligned to a taxonomy of Wikipedia categories. The BabelNet team is currently working on linking concepts to the 30 domains (topics) of the Wikipedia featured articles, plus a few add-on domains such as ‘Fashion’.
BabelNet and Babelfy can be accessed and queried online either from a browser or programmatically:
The latest version of BabelNet (3.6) is also a knowledge base, offering semantic relations from linked resources and information extraction techniques, domain labels for millions of synsets, and phrases and collocations for most of them. For commercial purposes, the way to go is to contact the BabelNet team to explore the best ways to leverage the resource and collaborate to mutual benefit. BabelNet and Babelfy can also be licensed for offline use, for research purposes only. Both BabelNet and Babelfy resources and their APIs are made available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.
BabelNet is also provided as a Linked Data Interface as part of the Linguistic Linked Open Data Cloud (LLOD). Prof. Asunción Gómez-Pérez of the Technical University of Madrid (UPM) discussed Linked Data as a source of large background knowledge for NLP. More specifically, Gómez-Pérez discussed the need for, and the benefit of, connecting licensed language resources and establishing a license for the links between resources. She said that closed resources can also be linked, in which case the owner negotiates accessibility issues (price, license, etc.) with the interested party.
There are different policies to govern conditional access to Linked Data. For example, access may be provided for a price, either for the whole set or per triple (i.e. a statement in ‘subject/predicate/object’ form). The metadata of the dataset is kept in the DataHub, the content is in the LLOD, and the actual datasets are kept by the owners.
Linked Data has three kinds of users: programmers, who can build applications that issue SPARQL queries and receive RDF; citizens and other end users, who access it through a user interface; and machines, which exchange data with each other and achieve semantic interoperability through RDF.
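For readers who have not worked with triples, the following Python sketch illustrates the idea behind a basic triple pattern query. It is a plain-Python stand-in, not SPARQL and not an RDF library; the `ex:` prefix, the predicate names and the sample data are invented for illustration and do not come from BabelNet's Linked Data interface.

```python
# Toy triples in subject/predicate/object form.
# The "ex:" prefix and every name below are made up for this example.
TRIPLES = [
    ("ex:synset_dog", "ex:hasLexicalization", "dog"),
    ("ex:synset_dog", "ex:hasLexicalization", "cane"),
    ("ex:synset_dog", "ex:isA", "ex:synset_animal"),
    ("ex:synset_cat", "ex:isA", "ex:synset_animal"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# "Which subjects are a kind of animal?"
animals = [s for s, _, _ in match(TRIPLES, predicate="ex:isA", obj="ex:synset_animal")]

# "Which words lexicalize the dog synset?"
dog_words = [
    o for _, _, o in match(TRIPLES, subject="ex:synset_dog", predicate="ex:hasLexicalization")
]
```

The first lookup corresponds roughly to a SPARQL pattern such as `?s ex:isA ex:synset_animal`, where the variable plays the role of the `None` wildcard here. A real application would send such a query to a SPARQL endpoint and receive RDF in return.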
In one of his presentations, Navigli discussed some industrial applications of BabelNet:
Navigli said that BabelNet has a lot of potential for machine translation (MT). The MT community is heavily focused on statistical methods and has so far shown no interest in how BabelNet could help improve machine translation quality. Navigli mentioned that, if such interest does not arise in the near future, he is tempted to demonstrate BabelNet’s potential in MT himself, although MT is not his field.
Content producers and data owners may wonder if linking their data to BabelNet would mean sharing them publicly. Navigli confirmed that they don’t have to, unless they want to. Both parties can benefit from the resource linking in various ways.
On the one hand, BabelNet can benefit by enhancing its knowledge base coverage, by using the data to improve disambiguation techniques, or by introducing a BabelNet Plus version for enhanced services, and so on. On the other hand, content producers and data owners can use their linked resources as they want. For example, by linking a translation memory (TM) resource such as the TAUS Data Cloud to BabelNet, the synsets and concepts could be used for quality assessment of translations. Each TM segment would be linked to specific synsets and concepts, and the segment itself would be the context for disambiguation. An algorithm comparing the matching scores and the distance of synsets in the graph could judge whether the source and the target are (good) translations of each other.
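The quality assessment idea can be sketched in a few lines of Python. This is only one possible reading of it, under stated assumptions: both segments have already been disambiguated into sets of synset IDs, the semantic network is available as an adjacency dictionary, and the scoring rule (1 / (1 + distance), averaged over the source synsets) is an invented heuristic. The function names, the `max_distance` cutoff and the toy graph are all hypothetical.

```python
from collections import deque


def shortest_distance(graph, start, goal):
    """Breadth-first search: number of edges between two synsets, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two synsets are not connected


def translation_score(graph, source_synsets, target_synsets, max_distance=3):
    """Average, over source synsets, of the closeness to the nearest target synset.

    Closeness is 1 / (1 + distance): 1.0 for an identical synset, lower for
    related ones, 0.0 when nothing lies within max_distance edges.
    """
    if not source_synsets or not target_synsets:
        return 0.0
    total = 0.0
    for source in source_synsets:
        best = 0.0
        for target in target_synsets:
            dist = shortest_distance(graph, source, target)
            if dist is not None and dist <= max_distance:
                best = max(best, 1.0 / (1 + dist))
        total += best
    return total / len(source_synsets)


# Toy undirected semantic network; the IDs are made up.
GRAPH = {
    "toy:dog": {"toy:canine"},
    "toy:canine": {"toy:dog", "toy:animal"},
    "toy:animal": {"toy:canine"},
    "toy:bank_river": set(),
}

exact = translation_score(GRAPH, {"toy:dog"}, {"toy:dog"})
related = translation_score(GRAPH, {"toy:dog"}, {"toy:canine"})
unrelated = translation_score(GRAPH, {"toy:dog"}, {"toy:bank_river"})
```

A segment pair whose synsets coincide scores 1.0, a pair that is only related through the graph scores lower, and an unrelated pair scores 0.0. Note that this score only measures how well the source is covered by the target; a production metric would probably check both directions and calibrate the threshold on real data.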
Andrzej Zydron, CTO of XTM International, said that they mainly use BabelNet to create large bilingual dictionaries on the fly in order to address alignment problems such as the runaway problem, i.e. when one misalignment throws off the rest of the document. They have licensed BabelNet for 50 languages. He mentioned that it takes about one hour to produce a large dictionary between any pair of languages.
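How a bilingual dictionary might fall out of multilingual synsets can be illustrated with a short Python sketch. This is not XTM's implementation, which was not described in detail; it simply assumes that synsets are available as a mapping from a synset ID to per-language word sets, and pairs every source-language word with the target-language words that share a synset. All names and data below are made up.

```python
from collections import defaultdict

# Toy synsets: synset ID -> language -> words. All entries are invented.
SYNSETS = {
    "toy:bank_institution": {"EN": {"bank"}, "IT": {"banca"}},
    "toy:bank_river": {"EN": {"bank", "riverbank"}, "IT": {"riva"}},
    "toy:dog": {"EN": {"dog"}, "DE": {"Hund"}},
}


def bilingual_dictionary(synsets, source_lang, target_lang):
    """Pair each source word with all target words found in the same synset."""
    entries = defaultdict(set)
    for lexicalizations in synsets.values():
        targets = lexicalizations.get(target_lang, set())
        for word in lexicalizations.get(source_lang, set()):
            entries[word].update(targets)
    # Drop source words with no translation and sort for stable output.
    return {word: sorted(found) for word, found in entries.items() if found}


en_it = bilingual_dictionary(SYNSETS, "EN", "IT")
```

Because the pairing goes through concepts rather than through a parallel corpus, the same function works for any language pair present in the synsets; an ambiguous word such as "bank" simply receives one translation per sense.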
Navigli gave an introduction to BANG, a collaborative experiment that has no precedent because of the type of annotation involved (Wiktionary is very different). BabelNet users registered on the website can access blocks of synsets to be annotated in their native language. The data will be made openly available under a free Creative Commons license. The incentives for users include helping to achieve larger coverage of their own language and contributing pictures for difficult and abstract concepts, which can be a fun process.
With the completion of the MultiJEDI project, Roberto Navigli is in the process of founding the Sapienza startup company Babelscape to take over and exploit the project’s outcomes: BabelNet and Babelfy. He welcomes interested parties to contact him with ideas on how they can collaborate and leverage BabelNet.
We sure look forward to the release of BabelNet live, which will be updated every day or every week. This will be a crucial improvement in the next versions, as any update will be (almost) immediately available to BabelNet users, especially with regard to Named Entities for disambiguation purposes.
For more information about the event and access to all presentations, you can visit the Luxembourg BabelNet Workshop site. An interesting article by Andrew Joscelyne on BabelNet - How the World Can Help Disambiguate Words was published in TAUS Review #3 in April 2015.