Deep Learning Approaches to Text Production. Shashi Narayan
the same entity, it is necessary to decide whether to use a pronoun or a definite description. Furthermore, when using a definite description (e.g., “the man”), the content of the description needs to be determined: should it be, e.g., “the man with the hat”, “the tall man with a hat”, or “the president”?
Figure 2.1: A Robocup input and output pair example.
• Aggregation: using anaphora, ellipsis, coordination, and generally any convenient syntactic means of avoiding repeating the verbalisation of data fragments which occur multiple times in the input. For instance, the use of the subject relative clause in the example above (“purple3 made a bad pass that was picked off by pink9”) permits generating a more compact text than the alternative “purple3 made a bad pass to pink9. There was a turnover from purple3 to pink9”.
• Surface Realisation: deciding which syntactic constructs should be exploited to combine the selected lexical items into a well-formed output text. For instance, in the example above, surface realisation determined the use of two verbs, one in the passive voice, the other one in the active voice, both verbs being connected by a subject relative clause.
Figure 2.2 shows how these different sub-tasks can be modeled in a pipeline architecture. As discussed in Gatt and Krahmer [2018], alternative architectures have been explored reflecting different levels of division between modules. Modular approaches maintain a strict division between each sub-task, implementing the interactions between them using either a pipeline, a revision, or a constrained architecture. Planning-based approaches view language production as a goal-driven task (following a “language as action” paradigm) and provide a unifying framework where all modules are implemented by a given set of actions. Finally, integrated or global approaches cut across task divisions, implementing all modules via a single unifying mechanism (e.g., probabilistic context-free grammar or discriminative classifiers) and jointly optimising the various generation sub-tasks.
Overall, though, distinct generation sub-tasks are implemented separately. For instance, even in global, joint optimisation approaches such as Konstas and Lapata [2012a], different rule sets are defined to implement content selection, text planning, and surface realisation. Similarly, in Angeli et al. [2010], three distinct classifiers are used to implement each of these components.
Figure 2.2: Data-to-Text: A pipeline architecture (source: Johanna Moore).
2.2 MEANING REPRESENTATIONS-TO-TEXT GENERATION
For meaning representation to text (MR-to-text) generation, two main types of approaches can be distinguished depending on the nature of the input, the amount of training data available, and on the type of text to be produced: grammar-centric vs. statistical.
Grammar-centric approaches are used when the generation task mainly involves surface realisation, i.e., when the gap between input MR and output text is small and the generation task mainly consists of linearising and inflecting some input structure (usually a graph or an unordered tree) whose nodes are decorated with lemmas or words. In that case, a grammar can be used to define a mapping between input MR and output text. Additional heuristic or statistical modules are usually added to handle the ambiguity induced by the strong non-determinism of natural-language grammars.
Statistical approaches often provide a more robust solution than grammar-centric approaches. They use machine learning to handle a wider range of choices, going beyond syntactic ambiguity resolution.
2.2.1 GRAMMAR-CENTRIC APPROACHES
In a grammar-centric approach, a grammar describing the syntax of natural language is used to mediate between meaning and text. The grammar can be handwritten [Gardent and Perez-Beltrachini, 2017] or automatically extracted from parse trees [Gyawali and Gardent, 2014, White, 2006]. Because natural language is highly ambiguous, grammar-centric approaches additionally integrate heuristics or statistical modules for handling non-determinism. Three main types of disambiguation filters have been used: filters that reduce the initial search space (often called hypertaggers), filters that eliminate unlikely intermediate structures, and filters that select the best solution from the set of output solutions (usually referred to as rankers).
Pruning the Initial Search Space. Bottom-up grammar-based approaches first select the set of lexical entries/grammar rules which are relevant given the input. For instance, if the input contains the predicate “book”, the lexical entries for “book” will be selected. Because the grammar and the lexicon of a natural language are highly ambiguous, this first step (lexical selection) yields a very large search space: the combinations to be considered form the cartesian product of the sets of entries/rules selected for each input token, so their number is the product of the sizes of these sets. For instance, if the input consists of 10 tokens and each token selects an average of 5 entries, the number of possible combinations to be explored is 5^10, i.e., close to 10 million.
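The arithmetic behind this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `search_space_size` is an assumption made for the example and does not come from any of the cited systems.

```python
from math import prod

def search_space_size(entries_per_token):
    """Number of combinations = product of the per-token entry counts."""
    return prod(entries_per_token)

# Ten tokens, five candidate entries each (the example from the text).
uniform = [5] * 10
print(search_space_size(uniform))  # 9765625, i.e. 5**10

# Halving the ambiguity of a single token halves the whole search space.
print(search_space_size([5] * 9 + [2]))
```

Because the size is a product, even a small reduction in the number of entries kept per token shrinks the search space multiplicatively, which is what motivates the filtering techniques discussed next.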
To reduce this initial search space, Espinosa et al. [2008] introduced hypertagging, a technique adapted from the supertagging method first proposed for parsing by Bangalore and Joshi [1999]. In essence, a hypertagger is a classifier trained to select, for each input token, the n most probable grammar rules/lexical entries, with n a small number. Gardent and Perez-Beltrachini [2017] present an interesting application which shows that hypertagging can be used to support the generation of well-formed natural language queries from description logic formulae. In that case, the hypertagger is shown not only to reduce the initial search space, but also to make choices that correctly capture the interactions between the various micro-planning operations (lexicalisation, referring expression generation, sentence segmentation, aggregation, and surface realisation).
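The pruning step itself (keeping the n best entries per token) can be sketched as follows. This is a minimal illustration, not the method of Espinosa et al. [2008]: the classifier is replaced by hard-coded, hypothetical scores, and the entry names (`book_N`, `book_V_trans`, ...) are invented for the example.

```python
import heapq
from math import prod

def hypertag(scores_per_token, n):
    """Keep, for each token, the n highest-scoring lexical entries."""
    return [
        heapq.nlargest(n, scores, key=scores.get)
        for scores in scores_per_token
    ]

# Hypothetical classifier output: one {entry name: probability} dict per token.
scores_per_token = [
    {"book_N": 0.55, "book_V_trans": 0.30, "book_V_intrans": 0.10, "book_Adj": 0.05},
    {"flight_N": 0.90, "flight_N_mod": 0.10},
]

pruned = hypertag(scores_per_token, n=2)
before = prod(len(scores) for scores in scores_per_token)  # combinations before pruning
after = prod(len(entries) for entries in pruned)           # combinations after pruning
print(pruned, before, after)
```

In a real system the scores would come from a trained classifier conditioned on the input meaning representation; only the top-n selection is shown here.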
Other techniques used to filter the initial search space include polarity filtering, a technique which rules out any combination of grammar rules such that the syntactic requirements of verbs are not exactly satisfied [Gardent and Kow, 2007], and hybrid bottom-up, top-down filtering, where the structure of the input is used both top-down, to constrain the selection of applicable rules, and bottom-up, to filter the initial search space associated with local input trees [Narayan and Gardent, 2012].
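The idea behind polarity filtering can be illustrated with a simplified sketch. It assumes, for the purpose of the example only, that each lexical entry carries a polarity count per syntactic category (+1 for a category it provides, -1 for each argument it requires), and that a combination is kept only if everything cancels out except a single root category. The entry names and the `polarity` encoding are hypothetical and much cruder than the automaton-based filtering of Gardent and Kow [2007].

```python
from collections import Counter
from itertools import product

def net_polarity(entries):
    """Sum the polarities of a combination, dropping categories that cancel out."""
    total = Counter()
    for entry in entries:
        total.update(entry["polarity"])  # adds counts, negative values included
    return {cat: value for cat, value in total.items() if value != 0}

def polarity_filter(candidates_per_token, root="s"):
    """Keep combinations whose polarities cancel, leaving exactly one root."""
    return [
        combo
        for combo in product(*candidates_per_token)
        if net_polarity(combo) == {root: 1}
    ]

# Hypothetical lexicon: one noun phrase and two competing verb entries.
candidates_per_token = [
    [{"name": "john", "polarity": {"np": +1}}],
    [
        {"name": "sleeps_intrans", "polarity": {"s": +1, "np": -1}},
        {"name": "sleeps_trans", "polarity": {"s": +1, "np": -2}},
    ],
]

kept = polarity_filter(candidates_per_token)
print([[entry["name"] for entry in combo] for combo in kept])
```

Here the transitive entry is discarded because it requires two noun phrases while the input supplies only one, so its requirements are not exactly satisfied.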
Filtering out Intermediate Structures. Carroll and Oepen [2005] present a subsumption-based local ambiguity factoring technique, together with a procedure that selectively unpacks the generation forest according to a probability distribution given by a conditional, discriminative classifier, so as to filter out unlikely intermediate structures.
To address the fact that there are n! ways to combine any n modifiers with a single constituent, White [2004] proposes to use a language model to prune the chart of identical edges representing different modifier permutations, e.g., to choose between “fierce black cat” and “black fierce cat.” Similarly, Bangalore and Rambow [2000] assume a single derivation tree that encodes a word lattice (“a {fierce black, black fierce} cat”) and use statistical knowledge to select the best linearisation. Gardent and Kow [2007], in turn, propose a two-step surface realisation algorithm for FB-LTAG (Feature-Based Lexicalised Tree-Adjoining Grammar) where, first, substitution is applied to combine trees together and, second, adjunction is applied to integrate modifiers and long-distance dependencies.
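Choosing among modifier permutations with a language model can be sketched as below. This is a toy illustration under strong assumptions: the bigram log-probabilities are invented for the example, and the code simply enumerates all n! orders and keeps the best one, whereas White [2004] prunes edges inside a chart realiser.

```python
from itertools import permutations

# Hypothetical bigram log-probabilities; unseen bigrams get a flat penalty.
BIGRAM_LOGPROB = {
    ("a", "fierce"): -2.0,
    ("fierce", "black"): -1.5,
    ("black", "cat"): -1.0,
    ("a", "black"): -2.0,
    ("black", "fierce"): -4.0,
    ("fierce", "cat"): -2.5,
}
UNSEEN = -10.0

def lm_score(words):
    """Sum of bigram log-probabilities over adjacent word pairs."""
    return sum(BIGRAM_LOGPROB.get(bigram, UNSEEN) for bigram in zip(words, words[1:]))

def best_modifier_order(determiner, modifiers, head):
    """Enumerate all modifier permutations and keep the highest-scoring one."""
    candidates = [[determiner, *order, head] for order in permutations(modifiers)]
    return max(candidates, key=lm_score)

best = best_modifier_order("a", ["black", "fierce"], "cat")
print(" ".join(best))
```

With these made-up scores, “a fierce black cat” (total -4.5) is preferred over “a black fierce cat” (total -8.5). Exhaustive enumeration is only viable for very small n, which is why pruning during chart construction is attractive.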
Ranking. Two main approaches have been used to rank the output of grammar-based sentence generators. Early approaches simply apply language model n-gram statistics to rank alternatives [Bangalore and Rambow, 2000, Langkilde, 2000, Langkilde-Geary, 2002]. Discriminative disambiguation models were later proposed which use linguistically motivated features, often with language model scores as an additional feature [Nakanishi et al., 2005, Velldal and Oepen, 2006].
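The difference between the two ranking styles can be sketched with a simple linear scorer. Everything here is an assumption made for illustration: the feature names (`lm_logprob`, `passive`, `num_words`), the weights, and the candidate scores are hypothetical, and the cited discriminative models learn their weights from data rather than having them set by hand.

```python
def rank(candidates, weights):
    """Sort candidate realisations by a weighted sum of their features."""
    def score(candidate):
        return sum(
            weights.get(name, 0.0) * value
            for name, value in candidate["features"].items()
        )
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidate realisations with hand-set feature values.
candidates = [
    {"text": "pink9 picked off the pass",
     "features": {"lm_logprob": -6.0, "passive": 0, "num_words": 5}},
    {"text": "the pass was picked off by pink9",
     "features": {"lm_logprob": -5.8, "passive": 1, "num_words": 7}},
]

# Language-model-only ranking: only the n-gram score counts.
lm_only = rank(candidates, {"lm_logprob": 1.0})

# Discriminative-style ranking: the LM score is one feature among others.
weights = {"lm_logprob": 1.0, "passive": -0.5, "num_words": -0.1}
combined = rank(candidates, weights)

print(lm_only[0]["text"])
print(combined[0]["text"])
```

With these invented numbers the language model alone prefers the passive variant, while the combined scorer prefers the active one, showing how additional features can override the n-gram score.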
2.2.2 STATISTICAL MR-TO-TEXT GENERATION
Because they permit modelling a wider range of transformations than grammars such as, for instance, aggregation, document planning, and referring expression generation, statistical approaches are generally favoured when the input meaning representation is deeper, i.e., when it abstracts away from surface differences