Linked Lexical Knowledge Bases. Iryna Gurevych
own, which indeed has been published recently [Chiarcos et al., 2012]. Our attention is rather on the actual algorithmic linking process, and the benefits it brings for applications. Furthermore, we put an emphasis on monolingual linking efforts (i.e., between resources in the same language), as the vast majority of algorithms have covered this scenario in the past and cross-lingual approaches were mostly direct derivatives thereof, for instance by introducing machine translation as an intermediate component (cf. Chapter 3). Nevertheless, we recognize the increasing importance of multilingual NLP and thus provide a dedicated chapter covering applications in this area (Chapter 6).
OUTLINE
After providing a brief description of the typographic conventions applied throughout this book, we start by introducing and comparatively analyzing a selection of LKBs which have been widely used in NLP (Chapter 1). Our description of these LKBs provides a foundation for the main part of this book, where their integration into LLKBs is considered from various angles. We include expert-built LKBs, such as WordNet, as well as collaboratively constructed resources, such as Wikipedia and Wiktionary, and also cover established standards and representation formats which are relevant in this context.
Then, in Chapter 2, we give a more formal definition of LLKBs, and also of word sense linking, which is crucial for combining different resources semantically. We go on to describe various LLKBs which have been suggested, focusing on the current large-scale projects which dominate the field, but also considering smaller, more specialized initiatives which have yielded important insights and paved the way for large-scale resource integration.
In Chapter 3, we approach the core issue of automatic word sense linking. While the notion of similar or even equivalent word senses in different resources is intuitively understandable and often (but not always) quite easily grasped by humans, it poses a complex challenge for automatic processing due to word ambiguities, different sense granularities and information types [Navigli, 2006]. First, to contextualize the challenge, we describe some related tasks in NLP and other fields, and outline how word sense linking relates to them. Then, we discuss in detail different ways to automatically create sense links between LKBs, based on textual descriptions of senses (i.e., glosses), the structure of the resources, or a combination thereof. The broader context of LLKBs lies of course not in the mere linking of resources for its own sake, but in the potential it holds for NLP applications.
Thus, in the following chapters, we present a selection of methods and applications where the use of LLKBs leads to particular benefits for NLP. In Chapter 4, we describe how the disambiguation of textual units benefits from the richer structure and combined knowledge, and also how the clustering of fine-grained word senses by exploiting 1:n links improves WSD accuracy. Building on that, we present more advanced disambiguation techniques in Chapter 5, including a discussion of using LLKBs for distant supervision and in neural vector space models, which are two recent and especially promising topics in machine learning for NLP. In Chapter 6 we briefly present multilingual applications, and computer-aided translation in particular, and show how they benefit from linked multilingual resources. Finally, in Chapter 7, we supplement our considerations of LLKB applications by discussing the enabling technologies, i.e., how LLKBs can be accessed via user interfaces and application programming interfaces. Based on the discussion of access paths for single resources, we describe how interfaces for current complex linked resources have evolved to cater to the needs of researchers and end users.
Chapter 8 concludes this book and points out directions for future work.
TYPOGRAPHIC CONVENTIONS
• Newly introduced terms and example lemmas are typed in italics.
• Synsets (groups of synonymous words) are enclosed by curly brackets, e.g., {car, automobile}.
• Concepts are typed in small caps, e.g., STREET VEHICLE WITH FOUR WHEELS.
• Relations between senses are written as pairs in parentheses, e.g., (car, vehicle).
• Classes of the Lexical Markup Framework (LMF) standard are printed in a monospace font starting with an upper case letter (e.g., LexicalEntry).
• LMF data categories are printed in a monospace font starting with a lower case letter (e.g., partOfSpeech).
We acknowledge support by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, by the German Institute for Educational Research (DIPF), and by the German Research Foundation under grant No. GU 798/17-1. We also thank our colleagues and students for their contributions to this book.
Iryna Gurevych, Judith Eckle-Kohler, and Michael Matuschek
July 2016
4 http://www.meta-share.eu
5 http://www.resourcebook.eu
Acknowledgments
…Mentors matter! The authors of the book are very grateful to each and every one who generously offered their guidance, support, advice, strategic feedback and valuable insights of all kinds during our professional careers. This helped us grow, learn, identify and accomplish the right goals, including this very book.
Iryna Gurevych, Judith Eckle-Kohler, and Michael Matuschek
July 2016
CHAPTER 1
Lexical Knowledge Bases
In this chapter we give an overview of different types of lexical knowledge bases that are used in natural language processing (NLP). We cover widely known expert-built Lexical Knowledge Bases (LKBs), and also collaborative LKBs, i.e., those created by a large community of lay collaborators. First, we define our terminology; then we give a broad overview of various kinds of LKBs that play an important role in NLP. For particular resource-specific details, we refer the reader to the respective reference publications.
Definition Lexical Knowledge Base: Lexical knowledge bases (LKBs) are digital knowledge bases that provide lexical information on words (including multi-word expressions) of a particular language.1 By word, we mean word form, or more specifically, the canonical base word form which is called lemma. For example, write is the lemma of wrote. Most LKBs provide lexical information for lemmas. A lexeme is a word in combination with a part of speech (POS), such as noun, verb or adjective. The majority of LKBs specify the part of speech of the lemmas listed, i.e., provide lexical information on lexemes.
The pairings of lemma and meaning are called word senses or just senses. We use the terms meaning and concept synonymously in this book to refer to the possibly language-independent part of a sense. Each sense is typically identified by a unique sense identifier. For example, there are two meanings of the verb write which give rise to two different senses:2 (write, “to communicate with someone in writing”) and (write, “to produce a literary work”). Accordingly, an LKB might use identifiers, such as write01 and write02, to distinguish between the former and the latter sense. The set of all senses listed in an LKB is called its sense inventory.
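The terminology defined above can be made concrete with a small data model. The following Python sketch is purely illustrative and does not correspond to the format or API of any actual LKB: the class names Lexeme, Sense, and ToyLKB, their fields, and the helper methods are assumptions introduced here. Only the lemma write, its two glosses, and the identifiers write01 and write02 are taken from the example in the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lexeme:
    """A lemma in combination with a part of speech (hypothetical class)."""
    lemma: str
    pos: str


@dataclass(frozen=True)
class Sense:
    """A pairing of lemma and meaning, addressed by a unique sense identifier."""
    sense_id: str
    lexeme: Lexeme
    gloss: str  # textual description of the meaning


@dataclass
class ToyLKB:
    """Hypothetical in-memory LKB for a single language."""
    language: str
    senses: dict[str, Sense] = field(default_factory=dict)

    def add_sense(self, sense: Sense) -> None:
        # Sense identifiers must be unique within one LKB.
        if sense.sense_id in self.senses:
            raise ValueError(f"duplicate sense identifier: {sense.sense_id}")
        self.senses[sense.sense_id] = sense

    def sense_inventory(self) -> list[Sense]:
        # The sense inventory is the set of all senses listed in the LKB.
        return list(self.senses.values())

    def senses_of(self, lemma: str, pos: str | None = None) -> list[Sense]:
        # All senses of a lemma, optionally restricted to one part of speech.
        return [
            s for s in self.senses.values()
            if s.lexeme.lemma == lemma and (pos is None or s.lexeme.pos == pos)
        ]


# The running example from the text: two senses of the verb "write".
write_verb = Lexeme(lemma="write", pos="verb")
lkb = ToyLKB(language="en")
lkb.add_sense(Sense("write01", write_verb, "to communicate with someone in writing"))
lkb.add_sense(Sense("write02", write_verb, "to produce a literary work"))

for sense in lkb.senses_of("write", pos="verb"):
    print(sense.sense_id, "-", sense.gloss)
```

Keeping the sense identifier separate from the lemma reflects the point made above: the same lemma can give rise to several senses, so the lemma alone cannot serve as a key into the sense inventory.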
Depending on their particular focus, LKBs can contain