Deep Learning Approaches to Text Production. Shashi Narayan
perform summarisation by extracting the key information contained in the input document, aggregating this information, and abstracting from the resulting aggregated text. That is, humans create a summary by abstracting over the content of the document. In practice, however, much of the pre-neural work on automatic text summarisation has focused on extractive summarisation, an approach which simply extracts key sentences from the input document and combines them to form a summary. The difference is illustrated in Figure 2.5. While abstractive summarisation reformulates the selected content, extractive approaches simply stitch together text fragments selected by the extraction step.
There is a vast literature on pre-neural text summarisation. For an overview of this work, we refer the reader to Mani [1999], Nenkova and McKeown [2012], and Nenkova et al. [2011].
Figure 2.5: Abstractive vs. extractive summarisation.
Here, we briefly survey some key techniques used in the three main steps usually involved in extractive summarisation, namely:
• creating an intermediate representation of the sentences contained in the input text;
• scoring these sentences based on that representation; and
• creating a summary by selecting the most relevant sentences [Nenkova et al., 2011].
Representing Sentences. Work on extractive summarisation has explored two main ways of representing text: topic representations and indicator representations.
In topic-based approaches, an input document is assigned a topic representation which captures what the text is about. This topic representation is then used to extract from the input those sentences that are strongly topic-related. Several approaches have been proposed to derive these topic representations.
Frequency-based approaches use counts to identify those words that represent a document topic. Luhn [1958] used a frequency threshold to identify frequent content words in a document as descriptive of the document’s topic. Similarly, Conroy et al. [2006] used the log-likelihood ratio to extract those words that have a likelihood statistic greater than what one would expect by chance. Other approaches have used tf.idf ratio and word probability [Goldstein et al., 2000, Luhn, 1958, Nenkova and Vanderwende, 2005].
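The frequency-threshold idea can be sketched in a few lines. This is only a minimal illustration in the spirit of Luhn [1958], not a reconstruction of the original method: the stopword list, the tokeniser, and the `min_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use a fuller list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "it", "its"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def topic_words(document, min_count=2):
    """Return the content words whose count reaches the frequency threshold."""
    counts = Counter(w for w in tokenize(document) if w not in STOPWORDS)
    return {w for w, c in counts.items() if c >= min_count}
```

For the toy input "Yoga improves breathing. Breathing calms the mind. Yoga is old.", the words "yoga" and "breathing" each occur twice and are therefore returned as topic words.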
Latent semantic analysis (LSA) [Dumais et al., 1988, Gong and Liu, 2001] and Bayesian topic models [Celikyilmaz and Hakkani-Tur, 2010, Daumé III and Marcu, 2006, Haghighi and Vanderwende, 2009, Wang et al., 2009] exploit word co-occurrences to derive an implicit representation of text topics.
Figure 2.6: A document/summary pair from the CNN/DailyMail data set.
Finally, lexical chains [Barzilay and Elhadad, 1997, Galley and McKeown, 2003, Silber and McCoy, 2002] have been proposed to capture the intuition that topics are expressed not by single words but by sets of related words. For example, the words “asana”, “pranayama”, “breathing”, “body”, “soul” indicate a clear topic, even if each of the words is not by itself very frequent. Based on the lexical relations (synonymy, antonymy, part-whole, and general-specific) contained in WordNet, lexical chains track the prominence of different topics discussed in the input by measuring the occurrence of words that are lexically related to each of these topics.
In contrast to topic-based approaches, indicator approaches do not rely on a single topic representation, but on different text-based indicators such as the position of a sentence in the input document or its similarity with the document title [Kupiec et al., 1995]. Two main types of indicator methods can be distinguished: graph-based and vectorial.
In the graph-based approach [Erkan and Radev, 2004, Mihalcea and Tarau, 2004], an input text is represented as a graph whose vertices represent sentences and where edge labels indicate sentence similarity. Sentences that are related to many other sentences are likely to be central and have high weight for selection in the summary.
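A rough sketch of the graph-based idea follows. It builds a sentence graph weighted by bag-of-words cosine similarity and runs a PageRank-style power iteration over it. It is written in the spirit of LexRank and TextRank rather than as a faithful implementation of either: the similarity measure, the `damping` value, and the fixed number of iterations are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Count the lowercase alphabetic tokens of a sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centrality(sentences, damping=0.85, iterations=50):
    """Score each sentence by a PageRank-style walk over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bows = [bag_of_words(s) for s in sentences]
    # Edge weights are pairwise similarities; self-loops are excluded.
    sim = [[cosine(bows[i], bows[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            rank = sum(scores[j] * sim[j][i] / out_weight[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores
```

A sentence that shares no words with the rest of the text receives only the baseline score `(1 - damping) / n`, whereas sentences that are similar to many others accumulate weight from their neighbours.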
Vectorial approaches represent input sentences as feature vectors which can then be exploited by classifiers to determine whether or not a given input sentence should be part of the extracted summary [Hakkani-Tur and Tur, 2007, Leskovec et al., 2005, Lin and Hovy, 2000, Louis et al., 2010, Osborne, 2002, Wong et al., 2008, Zhang et al., 2013, Zhou and Hovy, 2003]. In addition to the topic features that are classically derived by topic-based approaches, common features include the position of the sentence in the document (in news articles, first sentences are almost always informative), position in the paragraph (first and last sentences are often important), sentence length, similarity of the sentence with the document title or headings, weights of the words in a sentence determined by any topic representation approach, presence of named entities or cue phrases from a predetermined list, etc.
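To make the notion of a feature vector concrete, the sketch below computes four of the indicators listed above for a single sentence. The function name, the choice and order of features, and the default cue-phrase list are hypothetical; the cited systems each use their own feature sets.

```python
def sentence_features(sentence, index, n_sentences, title,
                      cue_phrases=("in conclusion", "in summary")):
    """Return [position, length, title overlap, cue phrase] for one sentence."""
    lowered = sentence.lower()
    words = lowered.split()
    title_words = set(title.lower().split())
    # Fraction of title words that also appear in the sentence.
    overlap = len(set(words) & title_words) / len(title_words) if title_words else 0.0
    return [
        1.0 - index / max(n_sentences - 1, 1),  # 1.0 for the first sentence, 0.0 for the last
        float(len(words)),                      # sentence length in whitespace tokens
        overlap,
        1.0 if any(cue in lowered for cue in cue_phrases) else 0.0,
    ]
```

Vectors of this kind would then be passed, together with labels indicating whether each sentence belongs in a reference summary, to a classifier of the reader's choice.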
Scoring Sentences. Based on whichever text representation has been created, each sentence is then assigned a score indicating its importance. For topic representation approaches, the importance of a sentence is usually computed as the number or the proportion of topic words it contains. For vectorial methods, the weight of each sentence is determined by combining the evidence from the different indicators, using machine learning techniques to discover feature weights. In LexRank, a multi-document summarisation system, the weight of each sentence is derived by applying stochastic techniques to the graph representation of the text [Erkan and Radev, 2004].
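For the topic-based case, scoring by the proportion of topic words reduces to a simple ratio. The sketch below assumes that a set of topic words has already been derived by one of the methods described earlier; the function name and the tokenisation are illustrative choices.

```python
import re

def topic_score(sentence, topic_words):
    """Proportion of a sentence's tokens that are topic words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(t in topic_words for t in tokens) / len(tokens)
```

Using the proportion rather than the raw count avoids favouring long sentences merely because they contain more words.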
Selecting Summary Sentences. The last step consists of selecting the best combination of important sentences to form a paragraph-length summary. The extracted summary should obey three main constraints: it should not exceed a given length, it should contain all relevant information, and it should avoid repetitions.
Most summarisation approaches choose content greedily by incrementally selecting the most informative (highest-scoring) sentences until the length limit is reached. A common strategy for greedily constructing a summary one sentence at a time is maximal marginal relevance (MMR) [Carenini et al., 2007], where, at each step, the algorithm is constrained to select a sentence that is maximally relevant and minimally redundant with sentences already included in the summary. Relevance and novelty are measured separately and then combined using some linear combination to produce a single score determining the importance of a sentence at a given stage of the selection process.
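The greedy MMR procedure can be sketched as follows. The relevance scores are taken as given, redundancy is measured with a simple Jaccard word overlap, and the two are combined linearly through a `trade_off` parameter; all three choices, as well as the decision to skip a sentence that does not fit and carry on with the remaining ones, are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

def mmr_select(sentences, relevance, max_words, trade_off=0.7, similarity=jaccard):
    """Greedily pick sentences that are relevant and not redundant with the summary so far."""
    selected, length = [], 0
    remaining = list(range(len(sentences)))
    while remaining:
        def mmr(i):
            # Redundancy is the highest similarity to any sentence already selected.
            redundancy = max((similarity(sentences[i], sentences[j]) for j in selected),
                             default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy
        best = max(remaining, key=mmr)
        remaining.remove(best)
        n_words = len(sentences[best].split())
        if length + n_words <= max_words:  # skip sentences that would exceed the budget
            selected.append(best)
            length += n_words
    return [sentences[i] for i in selected]
```

With a `trade_off` close to 1 the procedure behaves like plain highest-score selection; lower values penalise near-duplicates of sentences already in the summary more heavily.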
Global optimisation algorithms for sentence selection have also been proposed to jointly maximise informativeness, minimise repetition, and conform to summary length restrictions [Gillick et al., 2009, Riedhammer et al., 2008].
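To contrast global with greedy selection, the sketch below searches all sentence subsets exhaustively. It is a toy stand-in and not the method of the cited work: the objective (the total weight of the distinct words covered, so that repeating a word earns nothing), the `concept_weights` dictionary, and the brute-force search are assumptions, and the search is feasible only for a handful of sentences.

```python
from itertools import combinations

def best_subset(sentences, concept_weights, max_words):
    """Return the subset within the word budget that covers the most concept weight."""
    best, best_value = (), 0.0
    indices = range(len(sentences))
    for size in range(1, len(sentences) + 1):
        for subset in combinations(indices, size):
            if sum(len(sentences[i].split()) for i in subset) > max_words:
                continue  # violates the length constraint
            covered = set()
            for i in subset:
                covered.update(sentences[i].lower().split())
            # Each covered word counts once, however many sentences contain it.
            value = sum(concept_weights.get(w, 0.0) for w in covered)
            if value > best_value:
                best, best_value = subset, value
    return [sentences[i] for i in best]
```

Because every subset is evaluated as a whole, informativeness, repetition, and length are traded off jointly rather than one sentence at a time.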
2.4 SUMMARY
In this chapter, we briefly reviewed pre-neural approaches to text production. We saw that these approaches typically decompose the text-production task into several interacting subtasks which vary depending on the specific text-production task being considered. Figure 2.7 illustrates this intuition.
Data-to-text production typically involves text planning (selecting and structuring content) and sentence planning (choosing words, syntactic structures, and means of avoiding repetitions as well as choosing appropriate referring expressions to describe input entities). Simplification, paraphrasing, and compression involve modelling all or some of four operations, namely, phrase rewriting, reordering, deletion, and sentence splitting. Finally, summarisation can be viewed as encompassing three main modules: content selection (identifying key information), aggregation (structuring key information into a coherent text plan), and generalisation (using linguistic means to generate a natural-sounding, fluent summary).
Figure 2.7: Key modules in pre-neural approaches to text production.
In Chapter 3, we will see that initial neural approaches to text production markedly differ from these pre-neural approaches in that they provide a single, unifying framework, moving away from a decomposition of the text-production task into multiple