Linked Lexical Knowledge Bases. Iryna Gurevych

the link targets are not disambiguated in all language editions, e.g., in the English edition, the links merely lead to pages for the lexical entries, which is problematic for NLP applications as we will see later on. The ambiguity of the links is due to the fact that Wiktionary has been primarily designed to be used by humans rather than machines. The entries are thus formatted for easy perception using appropriate font sizes and bold, italic, or colored text styles. In contrast, for machines, data needs to be available in a structured and unambiguous manner in order to become directly accessible. For instance, an easily accessible data structure for machines would be a list of all translations of a given sense, and encoding the translations by their corresponding sense identifiers in the target language LKBs would make the representation unambiguous.
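As a minimal sketch of such a machine-friendly structure, the following Python snippet stores translations as sense identifiers rather than as links to whole entries. The `SenseId` scheme, the class names, and the sense numbering for German "Bank" are hypothetical and chosen only for illustration; they do not reflect any actual Wiktionary data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SenseId:
    """Hypothetical identifier of one sense in one language's LKB."""
    language: str  # language code, e.g. "en"
    lemma: str     # headword of the lexical entry
    index: int     # sense number within the entry (made up here)


@dataclass
class Sense:
    sense_id: SenseId
    gloss: str
    # Each translation points at a target *sense*, not at a whole entry.
    translations: List[SenseId] = field(default_factory=list)


def translations_into(sense: Sense, language: str) -> List[SenseId]:
    """Return the translations of `sense` that belong to `language`."""
    return [t for t in sense.translations if t.language == language]


# An entry-level link such as ("de", "Bank") would leave open whether the
# bench or the financial institution is meant. A sense-level link does not.
bank_money = Sense(
    SenseId("en", "bank", 1),
    "financial institution",
    [SenseId("de", "Bank", 2), SenseId("fr", "banque", 1)],
)

print(translations_into(bank_money, "de"))
```

With this representation, a lookup returns exactly one target sense per translation, so no further disambiguation step is needed downstream.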

      This kind of explicit and unambiguous structure does not exist in Wiktionary, but needs to be inferred from the wiki markup.13 Although there are guidelines on how to properly structure a Wiktionary entry, Wiktionary editors are permitted to choose from multiple variants or to deviate from the standards if this can enhance the entry. This presents a major challenge for the automatic processing of Wiktionary data. Another hurdle is the openness of Wiktionary—that is, the possibility to perform structural changes at any time, which raises the need for constant revision of the extraction software.
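To illustrate what inferring structure from wiki markup can involve, here is a small sketch that extracts translations from a fragment written in the style of the English edition's translation templates (`trans-top`, `t`, `t+`, `trans-bottom`). The sample fragment, the function name, and the regular expressions are assumptions made for this example; real entries vary and deviate from the guidelines, which is exactly why production extractors need far more cases than shown here.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical fragment in the style of an English Wiktionary translation table.
SAMPLE = """\
{{trans-top|part of the fore limb}}
* French: {{t+|fr|main|f}}
* German: {{t+|de|Hand|f}}
* Spanish: {{t+|es|mano|f}}
{{trans-bottom}}
"""

# Gloss of the translation table: first argument of {{trans-top|...}}.
GLOSS_RE = re.compile(r"\{\{trans-top\|([^}|]*)")
# Language code and word form: first two arguments of {{t|...}} or {{t+|...}}.
TRANS_RE = re.compile(r"\{\{t\+?\|([^|}]+)\|([^|}]+)")


def extract_translations(markup: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each table gloss to a list of (language code, word form) pairs."""
    result: Dict[str, List[Tuple[str, str]]] = {}
    current = None  # gloss of the table we are currently inside, if any
    for line in markup.splitlines():
        gloss = GLOSS_RE.search(line)
        if gloss:
            current = gloss.group(1).strip()
            result[current] = []
        elif line.startswith("{{trans-bottom"):
            current = None
        elif current is not None:
            result[current].extend(TRANS_RE.findall(line))
    return result


print(extract_translations(SAMPLE))
```

Note that the result contains only target word forms keyed by a free-text gloss. The target senses remain unidentified, so the ambiguity described above survives the extraction step.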

      Wiktionary as a resource for NLP has been introduced by Zesch et al. [2008b], and has been considered in many different contexts in subsequent work [Gurevych and Wolf, 2010, Krizhanovsky, 2012, Meyer, 2013, Meyer and Gurevych, 2010, 2012b]. While much work on Wiktionary focuses specifically on a few selected language editions, the multilingual LKB DBnary by Sérasset and Tchechmedjiev [2014] has taken a much broader approach and derived an LKB from Wiktionary editions in 12 languages. A major goal of DBnary is to make Wiktionary easily accessible for automatic processing, especially in Semantic Web applications [Sérasset, 2015].

      Particularly interesting for this book are the recent efforts to ontologize Wiktionary and transform it into a standard-compliant, machine-readable format [Meyer and Gurevych, 2012a]. These efforts address issues which are also relevant for the construction of Linked Lexical Knowledge Bases (LLKBs) we will discuss later on. We refer the interested reader to Meyer [2013] for an in-depth survey of Wiktionary from a lexicographic perspective and as a resource for NLP.

      Information Types. In summary, the main information types contained in Wiktionary are as follows.

      • Sense definition—Glosses are given for the majority of senses, but due to the open editing approach, gaps or “stub” definitions are explicitly allowed. This is especially the case for smaller language editions.

      • Sense examples—Example sentences which illustrate the usage of a sense are given for a subset of senses.

      • Sense relations—As mentioned above, semantic relations are generally available, but depending on the language edition, these might be ambiguously encoded. Moreover, language editions vary greatly in the number of relations relative to the number of senses. For instance, the German edition is six times more densely linked than the English one.

      • Syntactic behavior—Lexical-syntactic properties are given for a small set of senses. These include subcategorization frame labels, such as “transitive” or “intransitive.”

      • Related forms—Related forms are available via links.

      • Equivalents—As for Wikipedia, translations of senses to other languages are available by links to other language editions. An interesting peculiarity of Wiktionary is that distinct language editions may also contain entries for foreign-language words. For instance, the English edition also contains German lexemes, complete with definitions etc. in English. This is meant as an aid for language learners and is frequently used.

      • Sense links—Many Wiktionary entries contain links to the corresponding Wikipedia page, thus providing an easy means to supply additional knowledge about a particular concept without overburdening Wiktionary with non-essential (i.e., encyclopedic) information.

      In general, it has to be noted that the flexibility of Wiktionary enables the encoding of all kinds of linguistic knowledge, at least in theory. In practice, the information types listed here are those which are commonly used, and thus interesting for our subsequent considerations.

      OmegaWiki,14 like Wiktionary, is freely editable via its web frontend. The current version of OmegaWiki contains over 46,000 concepts and lexicalizations in almost 500 languages. One of OmegaWiki’s distinguishing features, in comparison to other collaboratively constructed resources, is that it is based on a fixed database structure which users have to comply with [Matuschek and Gurevych, 2011]. OmegaWiki was initiated in 2006 and explicitly designed with the goal of offering structured and consistent access to lexical information, i.e., avoiding the shortcomings of Wiktionary described above.

      To this end, the creators of OmegaWiki decided to limit the degrees of freedom for contributors by providing a “scaffold” of elements which interact in well-defined ways. The central elements of OmegaWiki’s organizational structure are language-independent concepts (so-called defined meanings) to which lexicalizations of the concepts are attached. Defined meanings can thus be considered as multilingual synsets, comparable to resources such as WordNet (cf. Section 1.1.1). Consequently, no specific language editions exist for OmegaWiki as they do for Wiktionary. Rather, all multilingual information is encoded in a single resource.

      As an example, defined meaning no. 5616 (representing the concept HAND) carries the lexicalizations hand, main, mano, etc., and also definitions in different languages which describe this concept, for example, “That part of the fore limb below the forearm or wrist.” The multilingual synsets directly yield correct translations as these are merely different lexicalizations of the same concept. It is also possible to have multiple lexicalizations in the same language, i.e., synonyms. An interesting consequence of this design, especially for multilingual applications, is that semantic relations are defined between concepts regardless of existing lexicalizations. Consider, for example, the Spanish noun dedo: it is marked as hypernym of finger and toe, although there exists no corresponding lexicalization for the defined meaning FINGER OR TOE in English. This is, for instance, immediately helpful in translation tasks, since concepts for which no lexicalization in the target language exists can be described or replaced by closely related concepts. Using this kind of information is not as straightforward in other multilingual resources such as Wiktionary, because their links are not necessarily unambiguous.
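The following sketch models defined meanings as language-independent concepts with attached lexicalizations, and shows how a concept without a lexicalization in the target language can be described through closely related concepts. Only the identifier 5616 for HAND and the words themselves come from the text above; the class, the function, the language codes as keys, and the identifiers 9001 to 9003 are hypothetical, and this is not OmegaWiki's actual database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DefinedMeaning:
    dm_id: int
    lexicalizations: Dict[str, List[str]]  # language code -> word forms
    definitions: Dict[str, str] = field(default_factory=dict)
    hypernyms: List[int] = field(default_factory=list)  # ids of broader concepts


hand = DefinedMeaning(
    5616,
    {"en": ["hand"], "fr": ["main"], "es": ["mano"]},
    {"en": "That part of the fore limb below the forearm or wrist."},
)
# Hypothetical ids: FINGER OR TOE has a Spanish lexicalization but no English one.
finger_or_toe = DefinedMeaning(9001, {"es": ["dedo"]})
finger = DefinedMeaning(9002, {"en": ["finger"]}, hypernyms=[9001])
toe = DefinedMeaning(9003, {"en": ["toe"]}, hypernyms=[9001])

index = {m.dm_id: m for m in (hand, finger_or_toe, finger, toe)}


def lexicalize(index: Dict[int, DefinedMeaning], dm_id: int, lang: str) -> List[str]:
    """Return words for a concept in `lang`, falling back to related concepts."""
    dm = index[dm_id]
    direct = dm.lexicalizations.get(lang, [])
    if direct:
        return direct  # all lexicalizations of one concept are translations
    # No word in the target language: collect words of hyponyms and hypernyms.
    hyponyms = [m for m in index.values() if dm_id in m.hypernyms]
    hypernyms = [index[h] for h in dm.hypernyms if h in index]
    related: List[str] = []
    for m in hyponyms + hypernyms:
        related.extend(m.lexicalizations.get(lang, []))
    return related


print(lexicalize(index, 5616, "fr"))  # direct translation of HAND
print(lexicalize(index, 9001, "en"))  # fallback for FINGER OR TOE
```

Because relations are stored between concept identifiers rather than between words, the fallback works even when the target language has no entry for the concept itself.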

      The fixed structure of OmegaWiki ensures easy extraction of the information due to the consistency enforced by the definition of database tables and relations between them. However, it has the drawback of limited expressiveness. For instance, the coding of grammatical properties is only possible to a small extent. In OmegaWiki, the users are not allowed to extend this structure and thus are tied to what has been already defined. Consequently, OmegaWiki’s lack of flexibility and extensibility, in combination with the fact that Wiktionary was already quite popular when OmegaWiki was created, has caused the OmegaWiki community to remain rather small. While OmegaWiki had 6,746 users at the time of writing, only 19 of them had actively been editing in the past month, i.e., the community is considerably smaller than for Wikipedia or Wiktionary [Meyer, 2013]. Despite the above-mentioned issues, we still believe that OmegaWiki is not only interesting for usage in NLP applications (and thereby for integration into LLKBs), but also as a case study, since it exemplifies how the process of collaboratively creating a large-scale lexical-semantic resource can be guided by means of a structural “skeleton.”

      Information Types. The most salient information types in OmegaWiki, i.e., those encoded in a substantial portion of entries, are as follows.

      • Sense definitions—Glosses are provided on the concept level, usually in multiple languages.

      • Sense examples—Examples are given for individual lexicalizations, but only for a few of them.

      •

