Automatic Text Simplification. Horacio Saggion
as type/token ratio (i.e., lexical variety) and percentage of words on different Italian word reference lists, etc., morpho-syntactic features such as probability distributions of POS tags in the text, ratio of the number of content words (nouns, verbs, adjectives, adverbs) to number of words in the text, etc., and syntactic features such as average depth of syntactic parse trees, etc. For sentence readability classification (easy-to-read vs. difficult-to-read), they prepared four different datasets based on the document classification task. Sentences from Due Parole are considered easy-to-read; however, assuming that all sentences from La Repubblica are difficult would in principle be incorrect. Therefore, they create four different sentence classification datasets for training models and for assessing the need for manually annotated data: the first dataset (s1) is a balanced dataset of easy-to-read and difficult-to-read sentences (1310 sentences of each class); the second dataset (s2) is an unbalanced dataset of easy-to-read (3910 sentences) and assumed difficult-to-read sentences (8452); the third dataset (s3) is a balanced dataset of easy-to-read (3910 sentences) and assumed difficult-to-read sentences (3910); and, finally, the fourth dataset (s4) is also balanced, containing easy-to-read sentences (1310) and assumed difficult-to-read sentences (1310). They perform classification experiments with maximum entropy models to discriminate between easy-to-read and difficult-to-read sentences, using held-out manually annotated data. They noted that although training on the gold-standard dataset (s1) provides the best results in terms of accuracy, training on a balanced dataset of “assumed” difficult-to-read sentences (i.e., s3) is close behind, suggesting that the effort of manually filtering out difficult sentences to create a dataset can be traded off against a small loss in accuracy.
They additionally study feature contribution to sentence readability and document readability, noting that local features based on syntax are more relevant for sentence classification while global features such as average sentence and word lengths or token/type ratio are more important for document readability assessment.
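As an illustration of two of the lexical and morpho-syntactic features mentioned above, the following minimal sketch computes a type/token ratio and a content-word ratio. It is not the authors' implementation: the function names and the tag labels (NOUN, VERB, ADJ, ADV) are assumptions chosen for the example, and the input is assumed to be already tokenized and POS-tagged.

```python
# Hypothetical tag set for content words; the actual tagset used in the
# cited work is not given in the text, so these labels are an assumption.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    if not tokens:
        return 0.0
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)


def content_word_ratio(tagged_tokens):
    """Share of tokens tagged as noun, verb, adjective, or adverb.

    tagged_tokens is a list of (word, tag) pairs.
    """
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)


if __name__ == "__main__":
    tokens = ["the", "cat", "saw", "the", "dog"]
    tagged = [("the", "DET"), ("cat", "NOUN"),
              ("sleeps", "VERB"), ("soundly", "ADV")]
    print(type_token_ratio(tokens))
    print(content_word_ratio(tagged))
```

Both functions return values between 0 and 1 and can be computed per sentence (a local feature) or per document (a global feature).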
Vajjala and Meurers [2014] investigate the issue of readability assessment for English, also focusing on the readability of sentences. Their approach is based on training two different regression algorithms on WeeBit, a corpus of 625 graded documents for age groups 7 to 16 years that they have specifically assembled, which contains articles from the Weekly Reader (see above) and articles from the BBC Bitesize website. As in previous work, the model contains a number of different groups of features accounting for lexical and POS tag distribution information, superficial characteristics (e.g., word length) and classical readability indices (e.g., Flesch-Kincaid), age-of-acquisition word information, word ambiguity, etc. A 10-fold cross-validation evaluation, using correlation and mean error rate metrics, is carried out, as is validation on available standard datasets from the Common Core Standards corpus (168 documents), the TASA corpus (see Vajjala and Meurers [2014] for details) (37K documents), and the Math Readability corpus (120 web pages). The model achieves high correlation in cross-validation and reasonable correlation across datasets, except on the Math corpus, probably because of the rating scale used. The approach also compares very favorably with several proprietary systems. Where sentence readability is concerned, the model trained on the WeeBit corpus is applied to sentences from the OneStopEnglish corpus, a dataset in which original documents (30 articles, advanced level) have been edited to obtain documents at intermediate and beginner reading levels. Experiments are first undertaken to assess whether the model is able to separate the three different types of documents, and then to evaluate a sentence readability model.
To evaluate sentence readability, each pair of parallel documents (advanced-intermediate, intermediate-beginner, advanced-beginner) is manually sentence-aligned and experiments are carried out to test whether the model is able to preserve the relative readability order of the aligned sentences (e.g., advanced-level sentence less readable than beginner-level sentence). Overall, the model preserves the readability order in 60% of the cases.
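The order-preservation evaluation described above can be sketched as follows. This is a minimal illustration, not the evaluation code of the cited work: the function name, the representation of aligned pairs, and the toy scoring function (sentence length in words) are all assumptions, and any model that returns a difficulty score could be plugged in instead.

```python
def order_preservation_rate(aligned_pairs, score):
    """Fraction of aligned pairs whose relative difficulty the scorer preserves.

    aligned_pairs: list of (harder_sentence, easier_sentence) tuples, e.g.
                   an advanced-level sentence and its beginner-level version.
    score:         callable mapping a sentence to a difficulty score,
                   where a higher score means less readable.
    """
    if not aligned_pairs:
        raise ValueError("at least one aligned pair is required")
    preserved = sum(
        1 for harder, easier in aligned_pairs if score(harder) > score(easier)
    )
    return preserved / len(aligned_pairs)


if __name__ == "__main__":
    # Toy stand-in for a trained readability model: longer means harder.
    def toy_score(sentence):
        return len(sentence.split())

    pairs = [
        ("the committee deferred its final decision", "the group waited"),
        ("prices rose", "prices went up a lot"),
        ("he subsequently departed the premises", "he left"),
    ]
    print(order_preservation_rate(pairs, toy_score))
```

Ties are counted as failures here because the comparison is strict; whether the original evaluation treated ties this way is not stated in the text.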
2.7 READABILITY AND AUTISM
Yaneva et al. [2016a,b] study text and Web accessibility for people with ASD. They developed a small corpus composed of 27 documents evaluated by 27 people diagnosed with an ASD. The novelty of the corpus is that, in addition to induced readability levels, it also contains gaze data obtained from eye-tracking experiments in which ASD subjects (and a control group of non-ASD subjects) were measured reading the texts, after which they were asked multiple-choice text-comprehension questions. The overall difficulty of the texts was obtained from quantitative data relating to answers given to those comprehension questions. For each text, correct and incorrect answers were counted, and the texts were ranked by number of correct answers. The ranking provided a way to separate texts into three difficulty levels: easy, medium, and difficult. The corpus itself was not used to develop a readability model; instead, it was used as test data for a readability model trained on the WeeBit corpus (see previous section), which was transformed into a 3-way labeled dataset (only 3 difficulty levels were extracted from WeeBit to comply with the ASD corpus).
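The ranking step can be illustrated with a short sketch. The text does not say where the cut points between levels were placed, so splitting the ranking into three equal-sized groups is an assumption made for this example, as are the function name, the text identifiers, and the answer counts.

```python
def three_level_split(correct_counts):
    """Rank texts by number of correct answers and label them by thirds.

    correct_counts: dict mapping a text id to its count of correct answers.
    Returns a dict mapping each text id to "easy", "medium", or "difficult".
    Equal-sized thirds are an assumption; the source does not give cut points.
    """
    # Most correct answers first, i.e., easiest text first.
    ranked = sorted(correct_counts, key=lambda t: correct_counts[t], reverse=True)
    n = len(ranked)
    labels = {}
    for rank, text_id in enumerate(ranked):
        if rank * 3 < n:
            labels[text_id] = "easy"
        elif rank * 3 < 2 * n:
            labels[text_id] = "medium"
        else:
            labels[text_id] = "difficult"
    return labels


if __name__ == "__main__":
    # Invented counts for six hypothetical texts.
    counts = {"t1": 25, "t2": 10, "t3": 18, "t4": 22, "t5": 5, "t6": 15}
    for text_id, level in sorted(three_level_split(counts).items()):
        print(text_id, level)
```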
Yaneva et al. grouped sets of features according to the different types of phenomena they account for: (i) lexico-semantic information such as word characteristics (length, syllables, etc.), numerical expressions, passive verbs, etc.; (ii) superficial syntactic information such as sentence length or punctuation information; (iii) cohesion information such as occurrence of pronouns and definite descriptions, etc.; (iv) cognitively motivated information including word frequency, age of acquisition of words, word imageability, etc.; and (v) information arising from several readability indices such as the Flesch-Kincaid Grade Level and the FOG readability index, etc. Two decision-tree algorithms, random forest [Breiman, 2001] and reduced error pruning tree (see [Hall et al., 2009]), were trained on the WeeBit corpus (see previous section), cross-validated on WeeBit, and tested on the ASD corpus. Feature optimization was carried out using a best-first feature selection strategy which identified such features as polysemy, FOG index, incidence of pronouns, sentence length, age of acquisition, etc. The feature selection procedure yields a model with improved performance on training and test data; nonetheless, results of the test on the ASD corpus are not optimal when compared with the cross-validation results on WeeBit. Worth noting is that although some of the features selected might be representative of problems ASD subjects may encounter when reading text, these features emerged from a corpus (WeeBit) that is not ASD-specific, suggesting that the selected features model general text difficulty assessment.
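For reference, the two readability indices named in group (v) can be computed from simple counts using their standard published formulas. The sketch below assumes the counts (words, sentences, syllables, and complex words, i.e., words of three or more syllables) have already been obtained; how the cited work computed them is not described in the text, and the function names are chosen for this example.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning FOG index from raw counts (standard formula).

    complex_words is the number of words with three or more syllables.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


if __name__ == "__main__":
    # Invented counts for a 100-word, 5-sentence passage.
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
    print(round(gunning_fog(100, 5, 10), 2))
```

Both indices grow with average sentence length; they differ in whether word difficulty is approximated by syllables per word or by the share of long words.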
Based on the ASD corpus, a sentence readability assessment dataset composed of 257 sentences was prepared. Sentences were classified into easy-to-read and difficult based on the eye-tracker data associated with the texts. Sentences were ranked by the average number of fixations they received during the readability assessment experiments, and the ranked set was split in two parts to yield the two sentence readability classes. To complement the sentences from ASD and to control for length, short sentences from publicly available sources [Laufer and Nation, 1999] were added to the dataset. The labels for these sentences were obtained through a comprehension questionnaire which subjects with ASD had to answer. Sentences were considered easy to read if at least 60% of the subjects correctly answered the comprehension question associated with the sentence. Binary classification experiments on this dataset were performed using the Pegasos algorithm [Shalev-Shwartz et al., 2007] with features to model superficial sentence characteristics (number of words, word length, etc.), cohesion (proportion of connectives, causal expressions, etc.), cognitive indicators (word concreteness, imageability, polysemy, etc.), and several incidence counts (negation, pronouns, etc.). A cross-validation experiment achieved 0.82 F-score using a best-first feature selection strategy.
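The two labeling procedures described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the text does not say where the ranked list was split, so a split at the middle rank is assumed, as is the reading that fewer fixations indicate an easier sentence. Function names and sentence identifiers are hypothetical.

```python
def label_by_fixations(avg_fixations):
    """Rank sentences by average fixation count and split the ranking in two.

    avg_fixations: dict mapping a sentence id to its average number of fixations.
    Assumes fewer fixations means easier and that the split falls at the
    middle rank; neither detail is given in the source.
    """
    ranked = sorted(avg_fixations, key=lambda s: avg_fixations[s])
    half = len(ranked) // 2
    return {
        sid: ("easy" if rank < half else "difficult")
        for rank, sid in enumerate(ranked)
    }


def label_by_comprehension(correct, total, threshold_percent=60):
    """Label a sentence easy if at least threshold_percent of subjects
    answered its comprehension question correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer comparison avoids floating-point issues at the boundary.
    return "easy" if correct * 100 >= threshold_percent * total else "difficult"


if __name__ == "__main__":
    # Invented fixation averages for four hypothetical sentences.
    fixations = {"s1": 12.0, "s2": 4.5, "s3": 8.0, "s4": 20.0}
    print(label_by_fixations(fixations))
    print(label_by_comprehension(correct=6, total=10))
    print(label_by_comprehension(correct=5, total=10))
```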
2.8 CONCLUSION
Over the years, researchers have tried to come up with models able to predict the difficulty of a given text. Research on readability assessment is important for automatic text simplification in that models of readability assessment can help identify texts or text fragments which would need some sort of adaptation in order to make them accessible to a specific audience. Readability assessment can also help developers in the evaluation of automatic text simplification systems. Although traditional formulas which rely on simple superficial proxies are still used, in recent years the availability of sophisticated natural language processing tools and a better understanding of the text properties that account for text quality, cohesion, and coherence have fueled research in readability assessment, notably in computational linguistics.
This chapter covered several