Domain-Sensitive Temporal Tagging. Jannik Strötgen

List of Figures

1.1 The two tasks of temporal tagging: extraction and normalization

1.2 Excerpts of the Wikipedia page about “Heidelberg University” and a timeline to which occurring temporal expressions are mapped

1.3 Excerpts of the CNNMoney article of Figure 1.1

1.4 Temporal information retrieval example

2.1 Temporal information is well-defined

2.2 Temporal information can be normalized

2.3 Temporal information can be organized hierarchically

2.4 Different realization types of date expressions in documents

4.1 Example of a news document with arrows indicating required information to interpret the temporal expressions

4.2 Excerpts of documents of the news domain

4.3 Tense information may be misleading to normalize underspecified expressions

4.4 Excerpts of a document of the narrative domain

4.5 Excerpts of a French narrative document of the AncientTimes corpus (i.e., a document about history)

4.6 Excerpts of a narrative document of the English AncientTimes corpus

4.7 Excerpts of colloquial-style documents of the Time4SMS corpus containing short messages with relative and underspecified expressions

4.8 Excerpts of an autonomic-style document of the Time4SCI corpus

4.9 An autonomic timeline for the document shown in Figure 4.8

4.10 Excerpts of autonomic- and narrative-style literary texts

4.11 Distribution of date, time, duration, and set expressions in the corpora of the four different domains

4.12 Distribution of the occurrence types of date and time expressions in the four corpora

4.13 Domain-dependent strategies to detect the reference times for relative and underspecified expressions

4.14 Domain-dependent strategies to identify the temporal relation between an underspecified expression and its reference time

List of Tables

2.1 The four categories how temporal expressions can be realized

3.1 The decisions of a temporal tagger can be categorized using the confusion matrix

3.2 Evaluation example

3.3 Overview of temporal tagging research competitions

3.4 English news corpora annotated with temporal expressions

3.5 Non-English temporally annotated corpora containing news articles

4.1 Non-news corpora containing manual annotations of temporal expressions

4.2 Characteristics, challenges, and examples of the four domains that can be distinguished in the area of temporal tagging

4.3 Statistics of the four temporally annotated corpora

4.4 Six challenges that have to be addressed by a domain-sensitive temporal tagger

5.1 Comparison of the temporal taggers TIPSem, HeidelTime, SUTime, and UWTime

5.2 The official English TempEval-3 results of three HeidelTime versions, SUTime, and TIPSem

5.3 The official Spanish TempEval-3 results of HeidelTime and TIPSemB

5.4 Evaluation results of UWTime and HeidelTime 2.0 on the English TempEval-3 test data reported after the competition

5.5 UWTime’s and HeidelTime’s evaluation results on the WikiWars corpus

5.6 Evaluating HeidelTime, UWTime, SUTime, and TIPSemB on corpora of four different domains

5.7 Comparison of HeidelTime’s manually developed resources and automatically created resources

Preface

Time matters! Whatever document we read, be it a news article, biography, some microblog, or a patient’s record, to name but a few examples, temporal information embedded in the documents typically helps us determine the course of events and actions, correlate events, and eventually get an overview of the documents’ content. Driven by the continuously increasing amount of textual data that is available on the Web, in electronic archives, and in Intranet document repositories, the computer-supported analysis and exploration of textual data has become a necessity and also a challenge in numerous application domains. Named Entity Recognition (NER), that is, the task of information extraction that aims at detecting and classifying elements in some text into predefined classes, such as locations, persons, organizations, and temporal expressions, has become a cornerstone of tools and techniques that help to address this challenge.

Only in the past two decades has the topic of temporal tagging as a specific type of NER task become a major focus in research and development. Temporal tagging addresses the extraction, classification, and normalization of temporal expressions that occur in text documents, and it is the prerequisite for temporal information extraction. By now, the important role of temporal tagging has been well recognized in application domains such as text summarization, question answering, information retrieval, and topic detection and tracking. In these applications of temporal tagging, results can be as simple as the fully automated construction of a timeline of events detected in a document’s content and can be as complex as revealing the temporal discourse structure in documents.

To date, there is no book that provides a comprehensive overview of the various methods, tools, evaluation competitions, and challenges the tasks of temporal tagging are faced with in the presence of diverse types of textual data and application domains. This book aims at closing this gap. Starting from the very fundamental role and concepts of time in documents, it provides an up-to-date overview of annotation standards, techniques, and competitions for evaluating the quality of temporal taggers, annotated corpora (including non-English texts) used for evaluations and developments, as well as a detailed overview of temporal taggers.

As the title indicates, this book focuses particularly on temporal tagging of documents from different domains, including text data different from the well-studied domain of news articles. For
