Natural Language Processing for the Semantic Web. Diana Maynard
between the tasks of named entity and relation extraction and discussing the major research challenges.
Chapter 5 explains how to perform entity linking by adding semantics into a standard flat information extraction system, of the kind that has been described in the preceding chapters. It discusses why this flat information extraction is not sufficient for many tasks that require greater richness and reasoning and demonstrates how to link the entities found to an ontology and to Linked Open Data resources such as DBpedia and Freebase. Examples of a typical semantic annotation pipeline and of real-world applications are provided.
Chapter 6 introduces the concept of automated ontology development from unstructured text, which comprises three related components: learning, population, and refinement. It discusses these terms and their interaction, examines the relationship between ontology development and semantic annotation, and describes some typical approaches, again building on the notions introduced in the previous chapters.
Chapter 7 describes methods and tools for the detection and classification of various kinds of opinion, sentiment, and emotion, again showing how the NLP processes described in previous chapters can be applied to this task. In particular, aspect-based sentiment analysis (such as which elements of a product are liked and disliked) can benefit from the integration of product ontologies into the processing. Examples of real applications in various domains are given, showing how sentiment analysis can also be slotted into wider applications for social media analysis. Because sentiment analysis is often performed on social media, this chapter is best read in conjunction with Chapter 8.
Chapter 8 discusses the main problems faced when applying traditional NLP techniques to social media texts, given their unusual and inconsistent usage of spelling, grammar, and punctuation amongst other things. Because traditional tools often perform poorly on such texts, they frequently need to be adapted to this genre. In particular, errors introduced by the core pre-processing components described in Chapters 2 and 3 can have a serious knock-on effect on later elements in the processing pipeline. This chapter introduces some state-of-the-art approaches for processing social media and gives examples of some real applications.
Chapter 9 brings together all the components described in the previous chapters by defining and describing a number of application areas in which semantic annotations are required, such as semantically enhanced information retrieval and visualization, the construction of social semantic user models, and modeling online communities. Common approaches and open source tools are described for these areas, including evaluation, scalability, and state-of-the-art results.
The concluding chapter summarizes the main concepts described in the book and discusses the current state of the art, the major problems still to be overcome, and the outlook for the future.
CHAPTER 2
Linguistic Processing
2.1 INTRODUCTION
A number of low-level linguistic tasks form the basis of more complex language processing algorithms. In this chapter, we first explain the main approaches used for NLP tasks, and the concept of an NLP processing pipeline, giving examples of some of the major open source toolkits. We then describe in more detail the various linguistic processing components that are typically used in such a pipeline, and explain the role and significance of this pre-processing for Semantic Web applications. For each component in the pipeline, we describe its function and show how it connects with and builds on the previous components. At each stage, we provide examples of tools and describe their typical performance, along with some of the challenges and pitfalls associated with each component. Specific adaptations to these tools for non-standard text such as social media, and in particular Twitter, will be discussed in Chapter 8.
2.2 APPROACHES TO LINGUISTIC PROCESSING
There are two main kinds of approach to linguistic processing tasks: a knowledge-based approach and a learning approach, though the two may also be combined. There are advantages and disadvantages to each approach, summarized in Table 2.1.
Knowledge-based or rule-based approaches are largely the more traditional methods, and in many cases have been superseded by machine learning approaches now that processing vast quantities of data quickly and efficiently is less of a problem than in the past. Knowledge-based approaches rely on hand-written rules, typically produced by NLP specialists, and require knowledge of the grammar of the language and linguistic skills, as well as some human intuition. These approaches are most useful when the task can easily be defined by rules (for example: “a proper noun always starts with a capital letter”). Typically, exceptions to such rules can be easily encoded too. When the task cannot so easily be defined in this way (for example, on Twitter, people often do not use capital letters for proper nouns), then this method becomes more problematic. One big advantage of knowledge-based approaches is that the results are quite easy to understand. When the system incorrectly identifies something, the developer can check the rules and find out why the error has occurred, and potentially then correct the rules or write additional rules to resolve the problem. Writing rules can, however, be quite time-consuming, and if specifications for the task change, the developer may have to rewrite many rules.
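To make the capitalization example concrete, the following is a minimal sketch of such a rule with two hand-coded exceptions. It is an illustration only and is not taken from any toolkit described in this book: the function name `find_proper_nouns`, the exception list `NOT_PROPER`, and the naive sentence splitting are all assumptions made for the example.

```python
import re

# Assumed exception list: capitalized words that are not proper nouns.
NOT_PROPER = {"I"}


def find_proper_nouns(text):
    """Return the tokens that the capitalization rule marks as proper nouns."""
    found = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        # Exception: the first word of a sentence is capitalized anyway,
        # so it is skipped rather than treated as a proper noun.
        for token in tokens[1:]:
            # Rule: a proper noun starts with a capital letter.
            if token[0].isupper() and token not in NOT_PROPER:
                found.append(token)
    return found


print(find_proper_nouns("Yesterday I met Diana in Sheffield. She works on text mining."))
print(find_proper_nouns("yesterday i met diana in sheffield"))
```

On the well-formed sentence the rule picks out "Diana" and "Sheffield"; on the lowercased, tweet-like version it finds nothing, which is the failure mode described above. The sketch also shows why errors are easy to trace: a proper noun at the start of a sentence is missed, and the cause is visibly the sentence-initial exception.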
Machine learning approaches have become more popular recently with the advent of powerful machines, and because no domain expertise or linguistic knowledge is required. One can set up a supervised system very quickly if sufficient training data is available, and get reasonable results with very little effort. However, acquiring or creating sufficient training data is often extremely problematic and time-consuming, especially if it has to be done manually. This dependency on training data also means that adaptation to new types of text, domain, or language is likely to be expensive, as it requires a substantial amount of new training data. Systems based on human-readable rules therefore tend to be easier to adapt to new languages and text types than those built from statistical models. The problem of sufficient training data can be handled by incorporating unsupervised or semi-supervised methods for machine learning: these will be discussed further in Chapters 3 and 4. However, such methods typically produce less accurate results than supervised learning.
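The dependency on training data can be illustrated with a deliberately tiny supervised learner. The sketch below is a most-frequent-label baseline, chosen here only because it fits in a few lines; it is not a method proposed in this chapter. The labels `PROPER`, `OTHER`, and `UNK`, the function names, and the toy training data are all hypothetical.

```python
from collections import Counter, defaultdict


def train(labelled_tokens):
    """Learn, for each word, the label it was most often seen with."""
    counts = defaultdict(Counter)
    for word, label in labelled_tokens:
        counts[word.lower()][label] += 1
    # Keep only the most frequent label per word.
    return {word: labels.most_common(1)[0][0] for word, labels in counts.items()}


def predict(model, words, default="UNK"):
    """Label each word; words never seen in training get the default label."""
    return [model.get(word.lower(), default) for word in words]


# Toy, hand-made training data (illustrative only).
TRAINING = [
    ("diana", "PROPER"), ("met", "OTHER"), ("in", "OTHER"),
    ("sheffield", "PROPER"), ("the", "OTHER"), ("sheffield", "PROPER"),
]

model = train(TRAINING)
print(predict(model, ["diana", "in", "london"]))
```

No linguistic rule was written, and capitalization plays no part, so lowercased text is handled as long as the words were seen in training. The unseen word "london" receives the fallback label, however, and nothing short of more annotated data will change that.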
Table 2.1: Summary of knowledge-based vs. machine learning approaches to NLP
Knowledge-Based Systems | Machine Learning Systems |
Based on hand-coded rules | Use statistics or other machine learning |
Developed by NLP specialists | Developers do not need NLP expertise |
Make use of human intuition | Require large amounts of training data |
Results are easy to understand | Cause of errors is hard to understand |
Development can be very time-consuming | Development is quick and easy |
Changes may require rewriting rules | Changes may require re-annotation |