Linked Lexical Knowledge Bases. Iryna Gurevych
of LKB information. Alternatively, one might imagine developing integrated corpus-based and knowledge-based representations that would inherently involve explicit symbolic representations, even though, currently, this might be seen as wishful thinking.
Finally, one would hope that the current book, and work on new lexical representations in general, would encourage researchers to better connect the development of knowledge resources with generic aspects of their utility for NLP tasks. Consider for example the common use of the lexical semantic relationships in WordNet for lexical inference. Typically, WordNet relations are utilized in an application to infer the meaning of one word from another in order to bridge lexical gaps, such as when different words are used in a question and in an answer passage. While this type of inference has been applied in numerous works, surprisingly there are no well-defined methods that indicate how to optimally exploit WordNet for lexical inference. Instead, each work applies its own heuristics, with respect to the types of WordNet links that should be followed, the length of link chains, the senses to be considered, etc. In this state of affairs, it is hard for LKB developers to assess which components of the knowledge and representations that they create are truly useful. Similar challenges are faced when trying to assess the utility of vector-based representations.3
Eventually, one might expect that generic methods for utilizing and assessing lexical knowledge representations would guide their development and reveal their optimal form, based on either implicit or explicit representations, or both.
Ido Dagan
Department of Computer Science
Bar-Ilan University, Israel
1 https://www.wikipedia.org
2 https://www.wikidata.org
3 One effort to address these challenges is the ACL 2016 workshop on Evaluating Vector Space Representations for NLP, whose mission statement is “To develop new and improved ways of measuring the quality or understanding the properties of vector-space representations in NLP.” https://sites.google.com/site/repevalacl16/
Preface
MOTIVATION
Lexical Knowledge Bases (LKBs) are indispensable in many areas of natural language processing (NLP). They strive to encode human knowledge of language in machine-readable form, and as such they are required as a reference when machines are expected to interpret natural language in accordance with human perception. Examples of such tasks are word sense disambiguation (WSD) and information retrieval (IR). The aim of WSD is to determine the correct meaning of ambiguous words in context, and in order to formalize this task, a so-called sense inventory is required, i.e., a resource encoding the different meanings a word can express. In IR, the goal is to retrieve, given a user query formulating a specific information need, the documents from a collection which best fulfill this need. Here, knowledge is also necessary to correctly interpret short and often ambiguous queries, and to relate them to the set of documents.
Nowadays, LKBs exist in many variations. For instance, the META-SHARE repository4 lists over 1,000 different lexical resources, and the LRE Map5 contains more than 3,900 resources which have been proposed as a knowledge source for natural language processing systems. A main distinction, which is also made in this book, is between expert-built and collaboratively constructed resources. While the distinction is not always clear-cut, the former are generally resources which are created by a limited set of expert editors or professionals using their personal introspection, corpus evidence, or other means to obtain the knowledge. Collaboratively constructed resources, on the other hand, are open for every volunteer to edit, with no or only a few restrictions such as registration for a website. Intuitively, the quality of the entries should be lower when laypeople are involved in the creation of a resource, but it has been shown that the collaborative process of correcting errors and extending articles (also known as the “wisdom of the crowds”; Surowiecki [2005]) can lead to results of remarkable quality [Giles, 2005]. The most prominent example is Wikipedia, the largest encyclopedia and one of the largest knowledge sources known. Although originally not meant for that purpose, it has also become a major source of knowledge for all kinds of NLP applications, many of which we will discuss in this book [Medelyan et al., 2009].
Apart from the basic distinction with regard to the production process, LKBs exist in many flavors. Some focus on encyclopedic knowledge (Wikipedia), while others resemble language dictionaries (Wiktionary) or aim to describe the concepts used in human language and the relationships between them from a psycholinguistic (Princeton WordNet [Fellbaum, 1998a]) or a semantic (FrameNet [Ruppenhofer et al., 2010]) perspective. Another important distinction is between monolingual resources, i.e., those covering only one language, and multilingual ones, which not only feature entries in different languages but usually also provide translations. However, despite the large number of existing LKBs, the growing demand for large-scale LKBs in different languages is still not met. While Princeton WordNet has emerged as a de facto standard for English NLP, for most languages corresponding resources are either considerably smaller or missing altogether. For instance, the Open Multilingual Wordnet project lists only 25 wordnets in languages other than English, and only a few of them (like the Finnish or Polish versions) match or surpass Princeton WordNet’s size [Bond and Foster, 2013]. Multilingual efforts such as Wiktionary or OmegaWiki provide a viable option for such cases and seem especially suitable for smaller languages due to their open construction paradigm and low entry requirements [Matuschek et al., 2013], but there are still considerable gaps in coverage which the corresponding language communities are struggling to fill.
A closely related problem is that, even if comprehensive resources are available for a specific language, there is usually no single resource which works best for all application scenarios or purposes, as different LKBs cover not only different words and senses, but sometimes even completely different information types. For instance, the knowledge about verb classes (i.e., groups of verbs which share certain properties) contained in VerbNet is not covered by WordNet, although it might be useful depending on the task, for example to provide subcategorization information when parsing low-frequency verbs.
These considerations have led to the insight that, to make the best possible use of the available knowledge, the orchestrated exploitation of different LKBs is necessary. This not only lets us extend the range of covered words and senses but, more importantly, gives us the opportunity to obtain a richer knowledge representation when a particular meaning of a word is covered in more than one resource.
Examples where such a joint usage of LKBs proved beneficial include WSD using aligned WordNet and Wikipedia in BabelNet [Navigli and Ponzetto, 2012a], semantic role labeling (SRL) using a mapping between PropBank, VerbNet and FrameNet [Palmer, 2009], and the construction of a semantic parser using a combination of FrameNet, WordNet, and VerbNet [Shi and Mihalcea, 2005]. These combined resources, known as Linked Lexical Knowledge Bases (LLKB), are the focus of this book, and we shed light on their different aspects from various angles.
TARGET AUDIENCE AND FOCUS
This book is intended to convey a fundamental understanding of Linked Lexical Knowledge Bases, in particular their construction and use, in the context of NLP. Our target audience consists of students and researchers from NLP and related fields who are interested in knowledge-based approaches. We assume only basic familiarity with NLP methods, and thus this book can be used both for self-study and for teaching at an introductory level.
Note that the focus of this book is mostly on sense linking between general-purpose LKBs, which are most commonly used in NLP. While we acknowledge that there are many efforts to link LKBs, for instance, to ontologies or domain-specific resources, we only discuss them briefly where appropriate and provide references for readers interested in these more specific linking scenarios. The same is true for the recent efforts in creating ontologies from LKBs and formalizing the relationships between them: while we give an introduction to this topic in Section 1.3, we realize that this diverse area of research