Based on both standards, several research competitions have been organized, and several corpora have been manually annotated to be used as benchmarks. In the following sections, we survey temporal tagging research competitions and present an overview of existing annotated corpora. As different measures have been used in the research competitions to evaluate temporal tagging performance, we first describe how temporal taggers can be evaluated and which issues have to be taken into consideration.
3.2 EVALUATING TEMPORAL TAGGERS
In general, as for many natural language processing tasks, there are two ways of evaluating the extraction and normalization quality of temporal taggers: extrinsically and intrinsically. In the former case, more complex tasks or applications that rely on temporal tagging output are evaluated. Examples are temporal information retrieval [Alonso et al., 2011], temporal relation extraction [UzZaman et al., 2013], and (time-related) question answering [Llorens et al., 2015]. Much more common, however, are intrinsic evaluations, in which manually annotated corpora are used to directly evaluate a temporal tagger's extraction and normalization quality.
CONFUSION MATRIX
For intrinsic evaluations, temporal tagging is treated as a specific sequence tagging task, and the confusion matrix (also called contingency table or contingency matrix) can be used to describe a system's output when compared to a gold standard. As shown in Table 3.1, each decision of a temporal tagger can be assigned via the confusion matrix to one of the following four classes of a binary classification [Manning and Schütze, 2003]:
• true positives (TP): annotated by the system and in the gold standard;
• true negatives (TN): neither annotated by the system nor in the gold standard;
• false positives (FP): annotated by the system but not in the gold standard; and
• false negatives (FN): not annotated by the system but in the gold standard.
Note that because many temporal expressions consist of more than one token, it is also common to distinguish between strict and relaxed matching. Details about the differences will be explained at the end of the section (page 29).
Table 3.1: The decisions of a temporal tagger can be categorized using the confusion matrix
| System Prediction | Gold Standard (Ground Truth) |          |
|                   | Positive                     | Negative |
| Positive          | TP                           | FP       |
| Negative          | FN                           | TN       |
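To make the four classes concrete for temporal tagging, the following minimal sketch (an illustration only, not an official evaluation script) counts true positives, false positives, and false negatives for the extraction task by comparing the spans annotated by a system against gold-standard spans. The (start, end) character-offset representation and the function names are assumptions made for this example; the strict and relaxed modes correspond to exact span match and span overlap, respectively. True negatives are not enumerated explicitly, since every span that is neither annotated by the system nor in the gold standard would count as one.

```python
# Minimal sketch (illustrative only): counting TP, FP, and FN for the extraction
# task by comparing system spans against gold-standard spans.
# Spans are assumed to be (start, end) character offsets, end exclusive.

def overlaps(a, b):
    """True if two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def count_matches(system_spans, gold_spans, strict=True):
    """Return (TP, FP, FN) under strict (exact span) or relaxed (overlap) matching."""
    match = (lambda s, g: s == g) if strict else overlaps
    tp = sum(1 for s in system_spans if any(match(s, g) for g in gold_spans))
    fp = len(system_spans) - tp
    fn = sum(1 for g in gold_spans if not any(match(s, g) for s in system_spans))
    return tp, fp, fn

# Example: the system extends one gold expression by a token and misses another.
gold = [(0, 9), (25, 34)]    # e.g., "yesterday" and "next week"
system = [(0, 15)]           # e.g., "yesterday night"
print(count_matches(system, gold, strict=True))   # -> (0, 1, 2)
print(count_matches(system, gold, strict=False))  # -> (1, 0, 1)
```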
PRECISION, RECALL, F1-SCORE
Both tasks of temporal taggers—the extraction and the normalization of temporal expressions—can be evaluated based on the confusion matrix. For the extraction, true positives are all instances that are correctly extracted by the system, while for the normalization, only instances that are correctly extracted and normalized are considered as true positives. Typically, in an evaluation the measures of precision, recall, and f1-score are determined.
Precision indicates how many of the expressions extracted by the system are correct (Equation 3.1). If all instances marked as positive by the system are correct, precision equals 1; if all instances marked as positive by the system are incorrectly marked, precision equals 0:

precision = TP / (TP + FP)    (3.1)
In contrast, recall indicates how many of the expressions that should be extracted are correctly extracted by the system (Equation 3.2). Thus, recall equals 0 if none of the instances that should be marked as positive is marked as positive by the system, and recall equals 1 if all instances that should be marked as positive are indeed marked as positive by the system:

recall = TP / (TP + FN)    (3.2)
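As a small illustration of how these measures (and the f1-score mentioned above, taken here as the standard harmonic mean of precision and recall) can be computed from the confusion-matrix counts, consider the following sketch; the function names are chosen for this example only.

```python
# Minimal sketch of Equations 3.1 and 3.2, plus the f1-score as the harmonic mean
# of precision and recall (the standard definition).

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Using the relaxed-matching counts from the sketch above: TP=1, FP=0, FN=1.
print(precision(1, 0), recall(1, 1), f1_score(1, 0, 1))  # -> 1.0 0.5 0.666...
```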