Domain-Sensitive Temporal Tagging. Jannik Strötgen
APPLICATION EXAMPLES EXPLOITING TEMPORAL TAGGING
For well-known NLP tasks such as named entity recognition (NER), the literature describes many motivating application scenarios. To illustrate the utility of temporal tagging, we present some use cases in which applications can readily exploit extracted and normalized temporal information and thus benefit from the output of temporal taggers.
Figure 1.1: The two tasks of temporal tagging: extraction and normalization.
TEMPORAL TAGGING FOR INFORMATION EXTRACTION
In many text documents, events play an important role. Typically, events happen at some specific time and some specific place [Strötgen and Gertz, 2012a]. The importance of temporal information when organizing and summarizing extracted events is intuitive: given a text document with event mentions, the chronological ordering of the described events obviously benefits from normalized temporal expressions. Similar to temporal information, geographic information is also important in this context. However, the geographic aspect of events is out of the scope of this book.
As illustrated in Figure 1.2, many documents do not mention events in chronological order. Typically, documents are organized into sections about specific topics, and these sections contain temporally overlapping content. Further examples are biographies, which often contain temporally overlapping sections about, for instance, “private life” and “professional life”, and news articles, which report on recent happenings before referring to events from the past. An example of such a news article is shown in Figure 1.3.
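The chronological ordering described above can be sketched in a few lines. This is a minimal illustration, assuming that a temporal tagger has already normalized each expression to a full ISO 8601 date; the event texts, the dates, and the function name `order_chronologically` are hypothetical.

```python
from datetime import date


def order_chronologically(mentions):
    """Sort (event_text, iso_date) pairs by their normalized date.

    Python's sort is stable, so events with the same date keep their
    order of appearance in the document.
    """
    return sorted(mentions, key=lambda m: date.fromisoformat(m[1]))


# Hypothetical event mentions in document order, e.g. from a biography
# whose sections are organized by topic rather than by time.
mentions = [
    ("moved to a new city", "2004-06-15"),
    ("started first job", "1998-09-01"),
    ("graduated", "1998-07-10"),
]

timeline = order_chronologically(mentions)
for text, day in timeline:
    print(day, text)
```

Without normalization, the surface forms of the expressions (for example “last summer” or “in June”) would not be comparable at all, which is why the sort key operates on normalized values only.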
Similar to the task of summarizing and ordering events extracted from documents, temporal fact extraction also requires temporal tagging output. For instance, when collecting facts for a knowledge base, it should be taken into account that most facts are not static but either evolve with time or are valid only during a particular time period [Kuzey and Weikum, 2012]. The fact “Bill Clinton” holdsPoliticalPosition “President of the United States”, for example, is correct but valid only for a specific time period.
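One possible way to represent such a time-scoped fact is sketched below. The class `TemporalFact`, its field names, and the half-open validity interval are assumptions made for illustration and are not taken from any particular knowledge base; the dates of the presidency (20 January 1993 to 20 January 2001) are added here and do not appear in the text above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporalFact:
    """A subject-predicate-object fact with a validity interval.

    The interval is half-open: the fact holds from valid_from
    (inclusive) up to valid_to (exclusive).
    """
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: date

    def holds_at(self, day):
        """Return True if the fact is valid on the given day."""
        return self.valid_from <= day < self.valid_to


fact = TemporalFact(
    subject="Bill Clinton",
    predicate="holdsPoliticalPosition",
    obj="President of the United States",
    valid_from=date(1993, 1, 20),
    valid_to=date(2001, 1, 20),
)

print(fact.holds_at(date(1995, 5, 1)))   # inside the validity interval
print(fact.holds_at(date(2005, 1, 1)))   # after the validity interval
```

Filling `valid_from` and `valid_to` automatically is exactly where normalized temporal expressions from the source text are needed.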
While extracting events and temporal relations from single documents has a rather long tradition and was, for instance, addressed in the TempEval competitions at SemEval 2007 [Verhagen et al., 2007], 2010 [Verhagen et al., 2010], and 2013 [UzZaman et al., 2013], research was more recently extended to cross-document temporal relation extraction, as in the Timeline task of SemEval 2015 [Minard et al., 2015]. A further indication of the importance of temporal tagging for information extraction is that SemEval 2015 featured, in addition to the Timeline task, three further shared tasks for which extracted and normalized temporal expressions are a prerequisite: QA TempEval [Llorens et al., 2015], Clinical TempEval [Bethard et al., 2015], and Diachronic Text Evaluation [Popescu and Strapparava, 2015].
TEMPORAL TAGGING FOR TOPIC DETECTION AND TRACKING
The goal of topic detection and tracking (TDT) is to organize news documents in an event-based way by building clusters of topics [Allan, 2002]. In this context, a topic is typically defined as “a seminal event or activity, along with all directly related events and activities” [Fiscus and Doddington, 2002]. For instance, the very first news article about a plane crash opens a new topic, and following news articles such as reports about the number of fatalities belong to the same topic. In contrast, news articles reporting about another plane crash do not belong to the same cluster. To decide whether an upcoming news document belongs to an existing cluster or opens a new cluster, the similarity between documents is typically determined based on some information extracted from the documents. For instance, Makkonen et al. [2003] create event vectors consisting of (i) names, (ii) locations, (iii) temporals, and (iv) content words.
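To make the idea of class-wise event vectors concrete, the following sketch compares documents along the four classes listed above. It is not the similarity measure of Makkonen et al. [2003]; as a simplifying assumption, each class is represented as a set and compared with Jaccard overlap, and the class scores are combined by a weighted average. All names, locations, and dates are hypothetical, and the temporals are assumed to be already normalized.

```python
CLASSES = ("names", "locations", "temporals", "terms")


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def event_similarity(doc_a, doc_b, weights=None):
    """Weighted average of the class-wise Jaccard overlaps."""
    if weights is None:
        weights = {c: 1.0 for c in CLASSES}
    total = sum(weights[c] for c in CLASSES)
    score = sum(weights[c] * jaccard(doc_a[c], doc_b[c]) for c in CLASSES)
    return score / total


# First report about a (hypothetical) plane crash.
first = {
    "names": {"Airline X"},
    "locations": {"Region A"},
    "temporals": {"2010-05-04"},
    "terms": {"plane", "crash"},
}
# Follow-up report about the same crash.
follow_up = {
    "names": {"Airline X"},
    "locations": {"Region A"},
    "temporals": {"2010-05-04"},
    "terms": {"plane", "crash", "fatalities"},
}
# Report about a different crash: same content words, different event.
other = {
    "names": {"Airline Y"},
    "locations": {"Region B"},
    "temporals": {"2009-11-17"},
    "terms": {"plane", "crash"},
}

print(event_similarity(first, follow_up))
print(event_similarity(first, other))
```

In this toy setting, content words alone cannot separate the two crashes, whereas names, locations, and normalized temporals do.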
Figure 1.2: Excerpts of the Wikipedia page about “Heidelberg University” and a timeline to which occurring temporal expressions are mapped. The content is not reported in chronological order due to different topical sections about Heidelberg University. Thus, temporal tagging is crucial to correctly extract event information and order it chronologically.
Figure 1.3: Excerpts of the CNNMoney article of Figure 1.1. After reporting on a recent happening, it refers to an event from the past in its last paragraph. Again, temporal tagging is crucial to correctly extract and order event information.
In general, ambiguous expressions (such as “Tuesday”, “Friday”, and “March” in the news article shown in Figure 1.3) are quite frequent in news documents. Temporal tagging is therefore again a prerequisite for exploiting the temporal expressions occurring in documents: not just the detection but in particular the normalization of these expressions is crucial for successful topic detection and tracking.
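The following sketch shows why an expression such as “Tuesday” can only be normalized relative to a reference time. It is a deliberately simple heuristic and not the method of any particular temporal tagger: it assumes that the document creation time is known and that the caller already knows whether the expression refers to the past or the future. The function name `resolve_weekday` and its parameters are hypothetical.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(expression, creation_date, direction="past"):
    """Map a weekday name to a concrete date near the creation date.

    direction="past" returns the closest matching day on or before the
    creation date; any other value returns the closest matching day on
    or after it. If the creation date itself falls on that weekday, it
    is returned in both cases.
    """
    target = WEEKDAYS.index(expression.lower())
    if direction == "past":
        delta = (creation_date.weekday() - target) % 7
        return creation_date - timedelta(days=delta)
    delta = (target - creation_date.weekday()) % 7
    return creation_date + timedelta(days=delta)


# Hypothetical document creation time: Thursday, 26 March 2015.
creation_date = date(2015, 3, 26)
print(resolve_weekday("Tuesday", creation_date))            # 2015-03-24
print(resolve_weekday("Tuesday", creation_date, "future"))  # 2015-03-31
```

The same surface string thus receives different normalized values depending on the reference time and the temporal direction, which is why detection alone is not sufficient.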
TEMPORAL TAGGING FOR INFORMATION RETRIEVAL
During recent years, the value of temporal information has been increasingly exploited in the context of information retrieval research and applications [Alonso et al., 2007, 2011, Campos et al., 2014, Derczynski et al., 2015, Kanhabua et al., 2015]. Note, however, that there are different types of temporal information that can be used in information retrieval scenarios. The two main aspects are (i) time as a dimension of relevance and (ii) time as query topic.
On the one hand, when time is used as a dimension of relevance, temporal tagging is not needed. Instead, information about the document creation time is typically used to improve the ranking of documents. For example, for news-related queries, the freshness of search results may be important [see, e.g., Li and Croft, 2003]. In addition to improving search results, time as contextual information can be used to perform time-sensitive query auto-completion [Sengstock and Gertz, 2011, Shokouhi and Radinsky, 2012].
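As a rough illustration of time as a dimension of relevance, the sketch below discounts a content-based score by the age of the document. The exponential decay, the parameter `decay_per_day`, and the function name `freshness_score` are assumptions chosen for illustration and should not be read as the ranking model of any of the cited works. Note that only the document creation time is used, so no temporal tagging of the content is involved.

```python
import math
from datetime import date


def freshness_score(base_score, created, today, decay_per_day=0.01):
    """Discount a relevance score exponentially with document age.

    Documents dated in the future are treated as having age zero.
    """
    age_days = max((today - created).days, 0)
    return base_score * math.exp(-decay_per_day * age_days)


today = date(2015, 12, 1)
recent = freshness_score(1.0, date(2015, 11, 30), today)
old = freshness_score(1.0, date(2015, 1, 1), today)
print(recent > old)  # the more recent document is ranked higher
```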
On the other hand, temporal tagging plays a crucial role when time is a query topic. Whether the temporal part of a query is provided explicitly or implicitly, temporal expressions occurring in potentially relevant documents have to be detected, normalized, and compared to the temporal aspect of the query. Berberich et al. [2010], for instance, integrate temporal expressions into a language modeling approach, and Strötgen and Gertz [2012a] present a query model to explicitly formulate temporal queries in a flexible way. Search engines must handle time as a query topic because temporal queries occur frequently, as query log analyses of web search engines have shown: Nunes et al. [2008] found that 1.5% of queries contain explicit temporal information, Metzler et al. [2009] determined that 7% of queries have implicit temporal intent, and Zhang et al. [2010] reported 13.8% of queries with explicit time and 17.1% with implicit time.
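The comparison step can be sketched as a simple interval overlap test. This is a minimal illustration and neither the language modeling approach of Berberich et al. [2010] nor the query model of Strötgen and Gertz [2012a]. It assumes that temporal expressions have been normalized to values of the form YYYY, YYYY-MM, or YYYY-MM-DD; the function names and the example values are hypothetical.

```python
import calendar
from datetime import date


def to_interval(value):
    """Expand a normalized value (YYYY, YYYY-MM, YYYY-MM-DD) into an
    inclusive (start, end) pair of dates."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:
        year = parts[0]
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    day = date(*parts)
    return day, day


def overlaps(a, b):
    """True if two inclusive intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


def matches_query(doc_values, query_interval):
    """True if any temporal expression of the document falls into, or
    overlaps with, the time interval of the query."""
    return any(overlaps(to_interval(v), query_interval) for v in doc_values)


# Hypothetical temporal query interval: March and April 2015.
query = (date(2015, 3, 1), date(2015, 4, 30))

# A document whose content mentions a day inside the interval matches,
# regardless of the other expressions it contains.
print(matches_query(["2015-11-05", "2015-03-24"], query))
# A document that only mentions times outside the interval does not.
print(matches_query(["2014", "2015-11"], query))
```

A real system would typically rank documents by the degree of overlap rather than apply a binary filter; the binary test is used here only to keep the sketch short.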
Sometimes the creation time of a document is a good indicator of whether the document is relevant for a given query. However, analyzing the documents’ content with a temporal tagger is often crucial to successfully find relevant documents. For instance, both documents shown in Figure 1.4 can be considered relevant for the example information need “Germanwings” with the time interval of interest set to “1st of March 2015 to 30th of April 2015”. While the first document is a news document published during the time interval of interest, the second document is a news article published in November 2015, that is, outside the time interval of interest. However, both