The Concise Encyclopedia of Applied Linguistics. Carol A. Chapelle

both listening and reading in tasks that require examinees to summarize or otherwise incorporate source material from a reading or listening passage into their writing. Finally, stages of the writing process can be isolated and assessed in tasks such as writing an outline (pre‐writing) or editing a paragraph or essay (revision).

      The most typical form of writing assessment appearing in current research and practice concentrates on direct assessment. Considerations for assessing such writing can be broadly divided into two categories: tasks (what the writer will respond to) and scoring (how the writing will be evaluated). From a theoretical perspective, scholars have long been interested in questions such as the effects of different task characteristics on writing test performance and how raters from different backgrounds evaluate writing samples. From a practical perspective, considerations of task and scoring provide guidance for teachers and administrators to design their own assessments or adopt commercially available tests.

      Task Features

      Weigle (2002, p. 63) provides a taxonomy of task dimensions for writing assessment. These can be divided into the features of the writing task itself (what test takers actually respond to) and features of the test, which include administrative and logistical considerations. Some important features of the test task include subject matter, discourse mode, and stimulus material, which are discussed briefly below.

       Subject matter. Research on the effects of subject matter is limited, in part because it is difficult to separate subject matter from discourse mode or other topic variables (Hamp‐Lyons, 1990). However, two broad distinctions can be made with regard to subject matter. First, some topics are essentially personal (e.g., descriptions of self or family, discussion of personal likes and dislikes) and others are nonpersonal (e.g., argument essays about controversial social issues). Research and experience suggest that nonpersonal topics may be somewhat easier to score reliably; however, personal topics may be more accessible to all test takers and tend to elicit a wider range of responses. Within nonpersonal topics, and specifically in assessing writing for academic purposes, another distinction can be made between topics that are more general and those that are discipline specific. Here some research suggests, not surprisingly, that students may perform better on topics related to their disciplines than on more general topics (Tedick, 1990).

       Discourse mode. Discourse mode refers to the type of writing that candidates are expected to produce. The term discourse mode subsumes a cluster of task features such as genre (essay, letter, etc.), rhetorical task (e.g., narration, description, exposition), pattern of exposition (comparison/contrast, process, etc.), and cognitive demands (Huot, 1990). Research on the effects of specific features on writing test performance suggests that these factors may indeed influence performance in systematic ways; however, it is difficult to isolate individual factors or to separate the effects of, for example, genre from cognitive demands. For test developers, perhaps the most important advice is to consider the authenticity of the task for the intended test takers and, if test takers are offered a choice of tasks or if alternate forms of the test are used, to keep these discourse variables as parallel as possible.

       Stimulus material. While many traditional writing assessment tasks consist merely of the topic and instructions, it is also common to base writing tasks on stimulus material such as pictures, graphs, or other texts. Hughes (2003) recommends that writing tasks be based on visual materials (e.g., pictures) to ensure that it is only writing, and not content knowledge, that is being assessed. At the other end of the spectrum, many academic writing tests use a reading passage or other text as stimulus material for the sake of authenticity, since academic writing is nearly always based on some kind of input text. Considerations for choosing an appropriate input text can be found in Weigle (2002) and Shaw and Weir (2007).

      In addition to factors involving the task itself, several other factors that are more logistical or administrative need to be addressed when designing a writing test; these include time allotment, instructions, whether to allow examinees a choice of tasks or topics, and whether to allow dictionaries. For a summary of research related to these issues, see Weigle (2002, chap. 5). One issue that has gained prominence over the past two decades is whether candidates should write responses by hand or on a computer; clearly, the use of computers is much more prevalent than it was even 10 years ago, and several large‐scale tests have begun requiring responses to be entered on a computer. Pennington (2003) reviewed the literature on handwriting versus word processing; briefly, this literature suggests that, for students with proficient keyboarding skills, using the computer leads to higher‐quality writing and more substantial revisions. On the other hand, some studies suggest that raters tend to score handwritten essays higher than typed ones, even when the essays are otherwise identical (e.g., Powers, Fowles, Farnum, & Ramsey, 1994).

      Scoring Features

      Two important considerations in scoring a writing assessment are (a) designing or selecting a rating scale or scoring rubric and (b) selecting and training people—or, increasingly, machines—to score the written responses. Scoring rubrics can generally be divided into two types: holistic, where raters give a single score based on their overall impression of the writing, and analytic, in which raters give separate scores for different aspects of the writing, such as content, organization, and use of language. A well‐known example of a holistic writing scale is the scale used for the TOEFL iBT® writing test (Educational Testing Service, 2004).

      While both scale types have advantages and disadvantages, a holistic scale is generally preferred in situations where a large number of tests need to be scored in a short time, such as in placement testing. On the other hand, for classroom purposes, an analytic scale can provide more useful information to students. Thorough discussions of different types of rating scales can be found in Weigle (2002, chap. 5) and Shaw and Weir (2007, chap. 5).

      There is no consensus about which aspects of writing an analytic scale should assess. Most analytic scales have at least one subscale for content/ideas, one for organization or rhetorical features, and one or more for aspects of language use. For example, the IELTS has scales for grammatical range and accuracy, lexical range and accuracy, arrangement of ideas, and communicative quality (Shaw & Falvey, 2008). The scale devised by Jacobs, Zinkgraf, Wormuth, Hartfiel, and Hughey (1981), one of the first well‐publicized analytic scales for second language writing, includes the categories of content, organization, vocabulary, language use, and mechanics. The rating scale for the Diagnostic English Language Needs Assessment (DELNA), used to identify the English language needs of students at the University of Auckland, includes three main categories: fluency, content, and form, each with three subcategories (Knoch, 2009).
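      As a concrete illustration of how an analytic scale can be operationalized, the sketch below represents a rubric as a set of weighted subscales and combines one rater's judgments into a composite score. The category names, weights, and score bands are illustrative assumptions that only loosely echo the Jacobs et al. (1981) categories; they do not reproduce any published instrument.

```python
# Illustrative sketch only: an analytic rubric represented as weighted
# subscales. The categories and weights below are hypothetical and only
# loosely echo the Jacobs et al. categories; they are not a published scale.

RUBRIC_WEIGHTS = {
    "content": 0.30,
    "organization": 0.20,
    "vocabulary": 0.20,
    "language_use": 0.25,
    "mechanics": 0.05,
}

def composite_score(subscores: dict[str, float]) -> float:
    """Combine subscale scores (each on a 0-10 band here) into a weighted total."""
    if set(subscores) != set(RUBRIC_WEIGHTS):
        raise ValueError("subscores must cover exactly the rubric categories")
    return round(sum(RUBRIC_WEIGHTS[c] * s for c, s in subscores.items()), 2)

# One rater's analytic judgments for a single essay:
print(composite_score({
    "content": 7, "organization": 6, "vocabulary": 6,
    "language_use": 5, "mechanics": 8,
}))  # prints 6.15 on the same 0-10 band
```

      How the subscales are weighted is itself a substantive decision: a scale that gives half its weight to language use implies a different construct of writing ability than one that foregrounds content and organization.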

      Another important development with regard to scoring is the use of automated essay scoring (AES) systems such as e‐rater®, developed by Educational Testing Service (Attali & Burstein, 2006), and IntelliMetric™ and MY Access!®, developed by Vantage Learning Systems (Elliott, 2003), in part to contain the costs and time involved in scoring large‐scale writing assessments. Research demonstrates that automated systems are at least as reliable as human raters in scoring standard essay tests (see Shermis & Burstein, 2003; Dikli, 2006; and Shermis, 2014, for overviews of automated essay scoring). However, the use of AES systems is controversial; many writing instructors, in particular, are opposed to any machine scoring of writing, while others

