Linked Lexical Knowledge Bases. Iryna Gurevych


semi-automatically from existing French resources (thus also including subcat frames) combined with a translation of the English VerbNet verbs.

Information Types We summarize the main lexical information types for senses present in the English VerbNet.

• Sense definition—Verbnets do not provide textual sense definitions. A verb sense is defined extensionally by the set of verbs forming a VerbNet class; the verbs share common subcat frames, as well as semantic roles and selectional preferences of their arguments.

• Sense relations—The verb classes in verbnets are organized hierarchically and the subclass relation is therefore defined on the verb class level.

• Syntactic behavior—VerbNet lists detailed subcat frames for verb senses.

• Predicate argument structure information—In the English VerbNet, each individual verb sense is characterized by a semi-formal semantic predicate based on the event decomposition of Moens and Steedman [1988]. Furthermore, the semantic arguments of a verb are characterized by their semantic role and linked to their syntactic counterparts in the subcat frame. Most semantic arguments are additionally characterized by their semantic type (i.e., selectional preference information).
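To make the information types above more concrete, the following is a minimal sketch of how a VerbNet-style class could be represented in memory. It is not the VerbNet file format or an existing API: the `VerbClass` structure, the helper functions, and all class identifiers, members, frames, and preferences in `TOY_CLASSES` are invented for illustration. The sketch shows only the two structural points made above, namely that a sense is given extensionally by class membership and that the subclass relation holds between classes.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class VerbClass:
    """Hypothetical, simplified stand-in for a VerbNet class."""
    class_id: str
    members: FrozenSet[str]             # the verbs that define the sense extensionally
    frames: Tuple[str, ...]             # subcat frames shared by all members
    roles: Tuple[Tuple[str, str], ...]  # (semantic role, selectional preference)
    parent_id: Optional[str] = None     # subclass relation, defined on the class level


# Toy data, invented for this sketch (not taken from VerbNet).
TOY_CLASSES = [
    VerbClass("transfer-1", frozenset({"give", "lend"}), ("NP V NP PP",),
              (("Agent", "animate"), ("Theme", "concrete"), ("Recipient", "animate"))),
    VerbClass("transfer-1.1", frozenset({"sell"}), ("NP V NP",),
              (("Agent", "animate"), ("Theme", "concrete")), parent_id="transfer-1"),
    VerbClass("motion-2", frozenset({"run", "walk"}), ("NP V",),
              (("Agent", "animate"),)),
    VerbClass("operate-3", frozenset({"run", "manage"}), ("NP V NP",),
              (("Agent", "animate"), ("Theme", "organization"))),
]


def senses_of(verb: str, classes: List[VerbClass]) -> List[str]:
    """Each class a verb belongs to counts as one sense of that verb."""
    return [c.class_id for c in classes if verb in c.members]


def ancestors(class_id: str, classes: List[VerbClass]) -> List[str]:
    """Follow the subclass relation upwards, nearest parent first."""
    by_id = {c.class_id: c for c in classes}
    chain = []
    parent = by_id[class_id].parent_id
    while parent is not None:
        chain.append(parent)
        parent = by_id[parent].parent_id
    return chain
```

In this toy inventory, `senses_of("run", TOY_CLASSES)` yields two class identifiers, one per sense, although no textual definition exists for either of them.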

1.2 COLLABORATIVELY CONSTRUCTED KNOWLEDGE BASES

More recently, the rapid development of Web technologies and especially collaborative participation channels (often labeled “Web 2.0”) has offered new possibilities for the construction of language resources. The basic idea is that, instead of a small group of experts, a community of users (“crowd”) collaboratively gathers and edits the lexical information in an open and equitable process. The resulting knowledge is in turn free for everyone to use, adapt, and extend. This open approach has turned out to be very promising for handling the enormous effort of building language resources, as a large community can quickly adapt to new language phenomena like neologisms while at the same time maintaining high quality through continuous revision, a phenomenon which has become known as the “wisdom of crowds” [Surowiecki, 2005]. The approach also seems to be suitable for multilingual resources, as users speaking any language and from any culture can easily contribute. This is very helpful for minor, usually resource-poor languages where expert-built resources are small or not available at all.

1.2.1 WIKIPEDIA

Wikipedia¹⁰ is a collaboratively constructed online encyclopedia and one of the largest freely available knowledge sources. It has long surpassed traditional printed encyclopedias in size, while maintaining a comparable quality [Giles, 2005]. The current English version contains around 4,700,000 articles and is by far the largest one, while there are many language editions of significant size. Some, like the German or French editions, also contain more than 1,000,000 articles, each of which usually describes a particular concept.

Although Wikipedia has not been designed as a sense inventory, we can interpret the pairing of an article title and the concept described in the article text as a sense. This interpretation is in accordance with the disambiguation provided in Wikipedia, either as part of the title or on separate disambiguation pages. An example of the former are some articles for Java where its different meanings are marked by “bracketed disambiguations” in the article title such as Java (programming language) and Java (town). An example of the latter is the dedicated disambiguation page for Java which explicitly lists all Java senses contained in Wikipedia.
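The title-based interpretation described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not an established tool: the names `TitleSense`, `split_title`, and `group_senses` are invented here, and the sketch assumes that every trailing parenthesis in a title is a disambiguation marker, which does not hold for all Wikipedia titles.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional


class TitleSense(NamedTuple):
    lemma: str            # the ambiguous term, e.g., "Java"
    label: Optional[str]  # the bracketed disambiguation, if any


# A trailing "(...)" is treated as the disambiguation marker (simplifying assumption).
_BRACKETED = re.compile(r"^(?P<lemma>.+?)\s*\((?P<label>[^()]+)\)$")


def split_title(title: str) -> TitleSense:
    """Split an article title into its lemma and bracketed disambiguation."""
    title = title.strip()
    match = _BRACKETED.match(title)
    if match is None:
        return TitleSense(title, None)
    return TitleSense(match.group("lemma"), match.group("label"))


def group_senses(titles: Iterable[str]) -> Dict[str, List[Optional[str]]]:
    """Collect all disambiguation labels per lemma, i.e., a small sense inventory."""
    inventory: Dict[str, List[Optional[str]]] = {}
    for title in titles:
        sense = split_title(title)
        inventory.setdefault(sense.lemma, []).append(sense.label)
    return inventory
```

Applied to the titles Java (programming language) and Java (town), `group_senses` returns a single entry for the lemma Java with one label per sense, which mirrors what the disambiguation page lists explicitly.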

Due to its focus on encyclopedic knowledge, Wikipedia almost exclusively contains nouns. As with word senses, the interpretation of Wikipedia as an LKB gives rise to the induction of further lexical information types, such as sense relations or translations. Since the original purpose of Wikipedia is not to serve as an LKB, this induction process might also lead to inaccurate lexical information. For instance, the links to corresponding articles in other languages provided for Wikipedia articles can be used to derive translations (i.e., equivalents) of an article “sense” into other languages. An example where this leads to an inaccurate translation is the English article Vanilla extract, which links to a subsection titled Vanilleextrakt within the German article Vanille (Gewürz); according to our lexical interpretation of Wikipedia, this leads to the inaccurate German equivalent Vanille (Gewürz) for Vanilla extract.
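One way to catch cases like the Vanilla extract example is to check whether a cross-language link points to a section rather than to a whole article. The sketch below assumes that each link is available as a string of the form `language:Title#Section`; this input format, the `Equivalent` structure, and the function `parse_interlanguage_link` are assumptions made for illustration and do not describe how any particular Wikipedia dump or API delivers the links.

```python
from typing import NamedTuple, Optional


class Equivalent(NamedTuple):
    source: str             # title of the source article
    language: str           # language code of the target edition
    target: str             # title of the target article
    section: Optional[str]  # section anchor, if the link points into an article

    @property
    def is_exact(self) -> bool:
        """Only links to whole articles are treated as reliable equivalents."""
        return self.section is None


def parse_interlanguage_link(source: str, link: str) -> Equivalent:
    """Parse a link given as 'language:Title#Section' (assumed input format)."""
    language, _, rest = link.partition(":")
    target, _, section = rest.partition("#")
    return Equivalent(source, language, target, section or None)
```

For the link `de:Vanille (Gewürz)#Vanilleextrakt`, the parsed record has a section anchor and `is_exact` is false, so the article title Vanille (Gewürz) would not be taken as the German equivalent of Vanilla extract.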

Nevertheless, Wikipedia is commonly used as a lexical resource in computational linguistics, where it was introduced as such by Zesch et al. [2007]; it has subsequently been used for knowledge mining [Erdmann et al., 2009, Medelyan et al., 2009] and various other tasks [Gurevych and Kim, 2012].

Information Types We can derive the following lexical information types from Wikipedia.

• Sense definition—While by design one article describes one particular concept, the first paragraph of an article usually gives a concise summary of the concept, which can therefore fulfill the role of a sense definition for NLP purposes.

• Sense examples—While usage examples are not explicitly encoded in Wikipedia, they are also inferable by considering the Wikipedia link structure. If a term is linked within an article, the surrounding sentence can be considered as a usage example for the target concept of the link.

• Sense relations—Related articles, i.e., senses, are connected via hyperlinks within the article text. However, since the type of the relation is usually missing, these hyperlinks cannot be considered full-fledged sense relations. Nevertheless, they express a certain degree of semantic relatedness. The same observation holds for the Wikipedia category structure which links articles belonging to particular domains.

• Equivalents—The different language editions of Wikipedia are interlinked at the article level—the article titles in other languages can thus be used as translation equivalents.
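The derivation of sense examples from the link structure can be sketched as follows. The code assumes that an article is available as raw wikitext in which internal links are written as `[[Target]]` or `[[Target|anchor text]]`; the sentence splitting is deliberately naive, and the function names are invented for this illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# [[Target]], [[Target|anchor]], or [[Target#Section|anchor]]
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def strip_links(sentence: str) -> str:
    """Replace each link by its visible text (anchor if present, else target)."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), sentence)


def usage_examples(wikitext: str) -> Dict[str, List[str]]:
    """Map each link target to the sentences in which it is linked."""
    examples: Dict[str, List[str]] = defaultdict(list)
    # Naive sentence splitting on end punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", wikitext.strip()):
        for match in WIKILINK.finditer(sentence):
            examples[match.group(1).strip()].append(strip_links(sentence))
    return dict(examples)
```

Because the link target is the article title, including any bracketed disambiguation, each collected sentence is attached to a specific sense rather than to the ambiguous surface form.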

Related Projects As Wikipedia has become one of the largest and most widely used knowledge sources, there have been numerous efforts to make it more accessible for automatic processing. These include projects such as YAGO [Suchanek et al., 2007], DBPedia [Bizer et al., 2009], WikiNet [Nastase et al., 2010], and MENTA [de Melo and Weikum, 2010]. Most of them aim at deriving a concept network from Wikipedia (“ontologizing”) and making it available for Semantic Web applications. WikiData,¹¹ a project directly rooted in Wikimedia, has similar goals, but within the framework given by Wikipedia. The goal here is to provide a language-independent repository of structured world knowledge, which all language editions can easily integrate.

These related projects contain essentially the same knowledge as Wikipedia, only in a different representation format (e.g., suitable for Semantic Web applications); hence, we will not discuss them further in this chapter. However, some of the Wikipedia derivatives have reached a wide audience in different communities, including NLP (e.g., DBPedia), and have also been used in different linking efforts, especially in the domain of ontology construction. We will describe corresponding efforts in Chapter 2.

1.2.2 WIKTIONARY

Wiktionary¹² is a dictionary “side project” of Wikipedia that was created in order to better cater for the need to represent specific lexicographic knowledge, which is not well suited for an encyclopedia, e.g., lexical knowledge about verbs and adjectives. Wiktionary is available in over 500 languages, and currently the English edition of Wiktionary contains almost 4,000,000 lexical entry pages, while many other language editions achieve a considerable size of over 100,000 entries. Meyer and Gurevych [2012b] found that the collaborative construction approach of Wiktionary yields language versions covering the majority of language families and regions of the world, and that it especially covers a vast amount of domain-specific descriptions not found in wordnets for these languages.

For each lexeme, multiple senses can be encoded, and these are usually described by glosses. Wiktionary contains hyperlinks which lead to semantically related lexemes, such as synonyms, hypernyms, or meronyms, and provides a variety of other information
