In a 1983 survey by the National Testing Network in Writing (NTNW),
more than two-thirds of responding institutions used writing samples
to assess their incoming students' proficiency in composition.
Half of these tests (53%) relied solely on writing samples.
Researchers believe grades and multiple-choice writing tests fail
to measure writing competence, arguing that,
Multiple-choice tests cannot measure the skills that most writing teachers identify as the domain of composition: inventing, revising, and editing ideas to fit purpose and audience within the context of suitable linguistic, syntactic, and grammatical forms. (Greenberg, 1986, p. xiv)
However, shortcomings have plagued placement testing (White, 1986):
Tests of writing are volatile, subjective instruments: Writers have little time or help with prewriting, composing, and revising. Students may need special knowledge or be required to analyze a text. The tests seldom give cues to help students shape their responses (Ruth & Murphy, 1988). In addition, even well-trained readers fatigue rapidly (Fader, 1986, p. 86). Many readers disagree sharply regarding student uses of diction, convention, or genre (Hake, 1986, p. 158). Finally, both the human and administrative costs of placement tests are high (Greenberg, 1986, p. xi). Whatever their acknowledged virtues, then, even rigorously controlled placement exams achieve dubious reliability and have substantial problems.
As these shortcomings indicate, present methods of rating placement tests fit H. A. Simon's (1960) definition of an unstructured task. Trained readers must rate a 200- to 400-word essay in two or three minutes. Of necessity, they rely heavily on heuristics, logic, trial and error, intuition, and common sense for this task.
To date, only two studies, both of which were conducted at Colorado State University, have used a computerized style checker, WRITER'S WORKBENCH (WWB), to measure essay quality. Stephen Reid and Gilbert Findlay correlated WWB analyses with essay quality, as measured by graders' holistic scores (1986). James Garvey and David Lindstrom later used WWB analyses to compare student writing with professional prose (1989). A year later, at the University of South Carolina-Aiken, I demonstrated how RIGHTWRITER (RW), a computerized style checker, could improve both the students' and my performance in the classroom. My study indicates that, by using RW's comments and my own "macro messages" to guide revisions of their papers, students improved their awareness of genre, topic, and purpose. This improvement is reflected in better RW indexes for readability, sentence length, and strength.
I decided to use RW to design a computerized procedure that simulated the ratings of placement tests by trained readers. The theory of Management Information Systems (MIS) labels this technique a Structured Decision System (SDS). My purpose was to design a prototype SDS to replace impressionistic human evaluations of placement tests with automatic, reliable, and inexpensive ratings. Once created and validated, an SDS carries out decision-making processes with no role for human discretion or error.
My design of a prototype SDS had a specific objective related
to the use of a computerized style checker in a college composition
course. By using RW's stylistic indexes to rate placement tests,
my SDS could place students in appropriate writing courses more
accurately and efficiently than trained readers could. College
writing programs could use the SDS to rate placement samples automatically
without subjecting their faculty to onerous training and grading
exercises. The RW analyses cannot interpret essay content, of
course, or evaluate an essay's general effectiveness. However,
RW's blindness to content cancels reader bias, and its precise
measurement of stylistic flaws and virtues can be linked to traits
such as fluency, completeness, and coherence.
Initially, I obtained a representative sample of 46 placement essays written by incoming first-year students at the University of Utah in the fall of 1990. The writing program at the University of Utah provides its placement test readers with written Criteria for Rating Placement Essays. These criteria base ratings of quality on the sensitivity with which student writers respond to audience, topic and purpose. Each fall, the Utah writing program asks about 2,000 incoming first-year students to respond to a set topic meant to assess their writing competence. Students are asked to write a division-and-classification essay supported by examples and reasons. They are given 45 minutes to describe a situation that disturbs them; they are also told to explain what changes they want to see made, and then draw conclusions about how people respond generally to unpleasant situations. They may use a dictionary and handbook.
Readers then score the essays on a 4-point scale, placing students
in basic remedial, remedial, regular composition, and advanced
composition courses. The Criteria define holistic standards
for each placement level: the ability to link the topic to readers;
to control the subject to support a point; and to address structure,
syntax, diction, and mechanics. The readers, mostly college faculty
and some high school writing teachers, use the Criteria
and a set of ranked anchor essays to standardize their ratings.
In addition, the Utah writing program has learned to estimate
the percentages within which the ratings will fall, as shown in Table 1.
I typed both the anchor essays and the representative samples
into WORDPERFECT 5.0, analyzed them all with RW, and entered the
RW counts and indexes into a specially designed QUATTRO spreadsheet.
Because I initially wanted to find out which stylistic traits
correlated most closely with the ratings of the Utah anchor exams,
I checked for all eight RW counts and indexes:
All eight RW analyses correlate wholly or partly with the 1-4 ratings of the Utah anchor exams, as charted in Table 2. Two of the RW stylistic measurements closely parallel all four ratings. As the syllable count rises with the 1-4 ratings, the percentage of unique words falls. That is, these traits correlate positively and negatively, respectively, with the quality of anchor papers. The accompanying graph illustrates these trends (see Figure 1).
Table 2. RW Counts and Indexes for the 1-4 Rated Utah Anchor Exams. (Percentage of unique words by rating, 1 through 4: 66.1, 51.4, 48.6, 31.6.)
Figure 1. Global Indicators in Utah Placement Exams.
Figure 2. Paper Length in Utah Placement Exams.
Four other RW stylistic measurements track the most important 2-4 ratings. Total words and percentage of prepositions both rise in step with them, a positive correlation. The accompanying graph reveals the close link between paper length and quality of anchor exams (see Figure 2). On the other hand, average sentence length and RW's "descriptiveness" index both drop as ratings rise, a negative correlation. In addition, high and low readability levels set apart the best "4" rated samples and the worst "1" rated samples from the anchor pool. The advanced composition paper achieves nearly tenth-grade readability (9.7) on the Flesch-Kincaid scale. The basic remedial sample barely reaches a fourth-grade level (4.44). Finally, the "1" and "3" rated papers register higher "strength" ratings than the "2" and "4" rated papers, respectively, limiting the scope of this stylistic indicator.
Therefore, these correlations pass the "black box" test
(Murdick, 1986, p. 61), in that, regardless of any theoretical
link between stylistic traits and the holistic scores of anchor
exams, the outputs change regularly and predictably with the inputs.
That is, the dependent variables (the holistic ratings) closely
parallel the independent variables (RW's stylistic measurements).
Although these correlations can help rank samples, they cannot
reliably sort them: The RW stylistic measurements appear as points
on scales. They provide no ceilings or floors on numerical scales
to divide different ratings from one another.
To rank the 46 representative Utah placement exams by quality, I drew upon rhetorical and readability research, both experimental and theoretical. To sort papers by assigning ratings, I used Utah's apportionment of representative exams by percentage (as presented in Table 1). I based my quality ranking of the Utah placement essays on essay length. This choice is strongly confirmed by the link between this measurement and the holistic rankings of the 2-4 Utah anchor exams. Experimental and theoretical research also singles out fluency as the primary indicator of writing excellence, particularly in timed exams.
In their correlation of WWB analyses with holistic ratings, Reid
and Findlay linked essay length most closely to the quality of
writing, both statistically and rhetorically:
The longer essays correlate significantly with quality writing because they demonstrate development within paragraphs, structural completeness, and scribal fluency (the skill of keeping the pen on the page, keeping the flow of prose going). (1986, p.12)
Writing teachers intuitively share Ruth and Murphy's view that
"the primary aim of the testing . . . is to see not only
how much but how well the student can write in response to the
topic" (1988, p.278). However, Thomas and Donlan (1982)
identified the number of words as the variable most highly correlated
with essay quality, regardless of the grade level of the writers.
Gordon Brossell has also found that "the longer an essay
was in a 200- to 400-word range, the likelier it was to get a
higher score [mostly because of] the amount of information presented
in the topics" (1986, p. 173).
To rate (or sort) the 46 Utah samples, I applied Frederick Taylor's theory of management by exception: To set standards, look first at exceptional (very good or very bad) performance. The "3" rated papers, which assign writers to regular first-year composition, comprise by far the greatest segment in any representative sample: about 60%. Therefore, I first extracted the two weakest papers (3% of the sample), those requiring basic remedial writing. Attachment B charts the ranking and sorting of all 46 samples. The papers coded UTAH18 and UTAH44 managed a meager 112 and 150 total words, respectively. By comparison, the shortest "2" rated essay achieved 167 words. The dividing line between basic and regular remedial essays falls neatly between 150 and 167 total words: ≤ 160 words.
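This first "management by exception" pass can be sketched in a few lines of Python. The counts for UTAH18 and UTAH44 are those just cited; the other file names and counts are illustrative, not the actual Utah data.

```python
# Extract basic remedial ("1" rated) papers on word count alone.
# UTAH18 and UTAH44 counts come from the article; the rest are made up.
essays = {"UTAH18": 112, "UTAH44": 150, "UTAH07": 167, "UTAH23": 412}

BASIC_REMEDIAL_CEILING = 160  # dividing line: 160 words or fewer

basic_remedial = [name for name, words in essays.items()
                  if words <= BASIC_REMEDIAL_CEILING]
remaining = {name: words for name, words in essays.items()
             if words > BASIC_REMEDIAL_CEILING}

print(basic_remedial)  # ['UTAH18', 'UTAH44']
```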
Moving from the extremely weak to the extremely strong end of the quality scale, I extracted seven papers for tentative "4" ratings (15% of the sample). However, a fluency floor of ≥ 499 words is suspect. At this point, a meager 3-word margin separates the shortest "4" paper from the longest "3" paper (at 496 words). This negligible distinction requires other stylistic measurements to confirm or revise the use of number of words to sort the strongest "4" samples from the weakest "3" samples.
To confirm or override fluency as the sole dividing line between the "3" and the "4" rating, I adopted two other RW stylistic measurements: high syllable count and low percentage of unique words. To confirm a "4" rating, I placed the floor for average syllables per word at ≥ 1.45; the ceiling for unique words rises no higher than ≤ 50%. On this basis, one paper on each side of the fluency dividing line exchanged ratings: UTAH19's rating rises to a "4," while UTAH10's rating drops to a "3." This adjustment also lowers the initial dividing line based on fluency by three words: A floor of ≥ 496 words now separates the advanced composition papers from those of writers assigned to regular composition.
Like the ranking criterion--essay length--use of these sorting criteria is warranted by both experimental and theoretical research. However, the significance of RW's stylistic measurements--syllable length and percentage of unique words--needs an explanation. Readability theory calls short, Anglo-Saxon terms "function" words. Users pick up these function words by speaking and hearing them often. On the other hand, readability theory applies the term "content" words to long, polysyllabic terms derived from Latin and Greek. These content words convey meaning, and users learn them systematically, usually from print (Gilliland, 1972, pp. 60-61).
Thus, as a writer increasingly relies on a more learned vocabulary, the percentage of unique "function" words drops, and the number of syllables per word rises. This change in style reflects an increasingly rich vocabulary derived more from reading than from speech. Therefore, measurements of syllable count and percentages of unique words are partly redundant. The following QUATTRO formula automatically extracts "4" rated papers from a representative group: [CELL]>=496#AND#[CELL]>=1.45#AND#[CELL]<=.50. The formula's three conditions test fluency, average number of syllables, and percentage of unique words, respectively. It extracts seven "4" rated papers from the representative essays, well within Utah's 12-16% range for placement in advanced composition.
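For readers without QUATTRO, the same extraction rule can be written as a Python predicate. The three thresholds are those derived above; the sample measurements passed to it are invented for illustration.

```python
# Python equivalent of the QUATTRO "4" extraction formula:
# fluency AND syllable floor AND unique-word ceiling must all hold.
def rates_four(total_words, syllables_per_word, pct_unique):
    """True if an essay meets all three criteria for a "4" (advanced) rating."""
    return (total_words >= 496          # fluency floor
            and syllables_per_word >= 1.45   # syllable floor
            and pct_unique <= 0.50)          # unique-word ceiling

print(rates_four(520, 1.51, 0.47))  # True: all three criteria met
print(rates_four(520, 1.41, 0.47))  # False: fails the syllable floor
```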
Reid and Findlay's study finds significant correlations between
these two indexes and holistic ratings, especially for the best writers:
[In] impromptu essays, the overall weight of longer words, indicating a mature lexicon, increases essay quality . . . [At the same time] these middle- and upper-range writers can manipulate abstractions better than can the more basic writers. Conversely, the vague word percentage . . . is highest for the low group and lowest for the high group. (1986, p. 14, 17)
As paper length rises, the proportion of unique words falls. The correlation between the paper length of the Utah samples and their percentage of unique words is a high 59.6%. Moreover, the Utah Criteria strongly favor a literate style. They endorse an awareness of "differences between the requirements for oral and written language" (p. 2). They also favor a polysyllabic style which "moves toward precision and abstraction" (p. 5).
After these extractions, a pool of 37 papers remains, ranked by fluency. Setting the ceiling at ≤ 284 words reliably extracts eight "2" rated essays (16% of the total group of samples). No other stylistic measurements need be applied. By default, the remaining 29 papers rated "3" (for regular composition) comprise 63% of the total sample group. This percentage closely approaches Utah's estimate for regular composition placements (60%).
The length of the "3" rated papers ranges from a low
of 290 words to a high of 506; they average 390 words with a standard
deviation of 67.6 words. By comparison, the advanced composition
papers use longer words on the average than the regular composition
papers: 1.53 compared with 1.43. The advanced composition samples
also use a smaller average percentage of unique words, 46.96%
as opposed to 48.48% for the regular composition essays. To the
reader these differences appear slight. However, a holistic scorer
would notice writers' uses of precise or unusual diction. The
advanced composition samples also achieve an average readability
level nearly a grade and a half higher than the regular composition
papers. And the best writers use a higher average sentence length--by
nearly a word per sentence. Reid and Findlay found high correlations
between both these measures and holistic scores. Readability--though
not sentence length--also sets the weaker "3" rated
Utah anchor exams apart from the stronger "4" rated anchor exams.
This prototype SDS reliably ranks and sorts all 46 representative Utah samples on the basis of paper length alone--with the exception of the "4" rated advanced composition placements. Minor adjustments must be made to placements on either side of the dividing line between the "3" rating and the "4" rating. I applied two other stylistic measurements: high syllable counts and a low percentage of unique words. When this SDS is applied to the Utah anchor exams, the SDS's sorting criteria reliably match the holistic ratings of the "2," "3," and "4" rated anchor papers. However, one further set of adjustments is required--to the extreme low end of the quality scale. At 287 words, the "1" rated anchor exam far exceeds the ≤ 160-word ceiling separating basic remedial papers from the representative pool. Yet this weak anchor paper uses very short words and a very high percentage of unique words--1.2 average syllables and 66.1% unique words, respectively. These traits group the weak anchor paper with the two basic remedial papers extracted from the representative pool. Thus, these two additional traits must be incorporated in the criteria for extracting basic remedial papers.
To sum up, this prototype SDS evolves rather elegantly from Taylor's
theory of management by exception. Fluency alone cannot extract
papers with extremely low and high ratings: Adjustments must
be made. For the weakest basic remedial papers, low syllable
averages and high use of unique words must be brought into
play. To identify the strongest advanced composition papers,
high syllable averages and low use of unique words confirm
the sorting. The formulas below are listed
in the order of their application:
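Applied in order, the rules can be sketched as a single Python function. The thresholds are those worked out in the preceding sections; the essay measurements in the sample calls are invented for illustration.

```python
# The prototype SDS as an ordered rule sequence. Thresholds come from the
# article; the sample measurements below are illustrative only.
def place(words, syllables, pct_unique):
    """Return a 1-4 placement rating for one essay's RW measurements."""
    # 1. Basic remedial: very short, very plain, or very "oral" vocabulary.
    if words <= 160 or syllables <= 1.2 or pct_unique >= 66.0:
        return 1
    # 2. Advanced composition: long, polysyllabic, low share of unique words.
    if words >= 496 and syllables >= 1.45 and pct_unique <= 50.0:
        return 4
    # 3. Remedial vs. regular composition on fluency alone.
    return 2 if words <= 284 else 3

print(place(112, 1.20, 66.1))  # 1: basic remedial
print(place(250, 1.35, 55.0))  # 2: remedial
print(place(400, 1.43, 48.5))  # 3: regular composition
print(place(510, 1.53, 47.0))  # 4: advanced composition
```

Because the basic remedial rule runs first and joins its three tests with OR, it also catches the short-worded, high-unique-word anchor paper that length alone would misplace.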
The ratings assigned by this prototype SDS to the Utah placement exams reflect face and construct validity. That is, the ratings of the representative group of samples are consistent with holistic ratings of the Utah anchor exams. A growing body of experimental and theoretical research also confirms their validity. However, the concurrent validity of this SDS needs to be established. How consistent are its ratings with test scores--the verbal sections of the ACT or SAT, for example? Second, how much predictive validity does this SDS have? That is, how well do its scores predict the grades earned in first-year writing courses? Finally, this prototype needs to be tested by rating all placement exams administered for a given period and school.
An SDS for rating placement exams offers significant gains, both monetary and non-monetary. Its monetary gains can be easily estimated. The greatest expense involves typing handwritten essays into a word processing package. A typist capable of 120 words a minute could type a 45-minute placement exam in 2-5 minutes; analyzing a sample with RW takes no more than a minute or two and could easily be automated. Thus, creating each style-checked file need cost no more than $.90 to $1.00. Permitting or requiring students to write placement exams on computer could reduce this expense significantly. The rest of the work--transferring data to a spreadsheet, ranking, and sorting--can be fully computerized.
By comparison, one administration of the English Composition Test asked 85,000 students to each write 20-minute essays; each essay was scored twice. Gertrude Conlan estimates the cost of scoring the batch of essays at $500,000. Thus, rating each paper costs $5.88 (1986, p. 111). Conlan's estimate leaves out the administrative costs of recording grades, etc. Longer placement exams would drive the grading costs much higher.
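Conlan's per-paper figure follows directly from her totals:

```python
# Per-paper scoring cost, from Conlan's estimates for one administration
# of the English Composition Test.
total_cost = 500_000   # estimated scoring cost in dollars
papers = 85_000        # essays scored (each read twice)

print(round(total_cost / papers, 2))  # 5.88
```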
In addition, whatever database is used, ranking and sorting the RW data involves some initial one-time programming. However, most colleges and universities have already committed themselves to computerization for research and classroom purposes--word processing for writing classes and spreadsheet programs for accounting, for example. The development and operation of an SDS for placement testing require only modest additional outlays. Although such programming is expensive, it is straightforward, uses existing hardware and software, and need be done only once. In MIS terms, the one-time programming costs are limited. Data transfers easily between programs, and the process can rely on inexpensive batch processing rather than costlier online data processing. Then too, operators can run batch processing with minimal training. A computerized SDS justifies itself by cheaply replicating ratings reached by trained readers.
However, a placement-rating SDS offers more than improved accuracy and economy. It offers significant non-monetary gains for future college students and their high school teachers, for the instructors of college writing courses, and for researchers. It does so by increasing the value of placement exams as information. As Erika Lindemann points out, placement exams are now valued for administrative use only (1987, p. 203). Neither the writers nor their instructors get any feedback from them. A fully validated SDS could shift high school writing instruction away from impressionistic, localized criteria of writing quality toward more precise measures.
In addition, SDS data could give college writing teachers accurate profiles of their incoming students' writing strengths and weaknesses, individually and by class. My article, "A Decision Support System for Improving First-Year Writing Courses," suggests ways instructors might use stylistic analyses of placement exams in their writing courses. Reliable and valid ratings of their incoming students' initial writing skills would give colleges a base point to assess their students' writing ability as they move toward graduation.
Finally, the quantified indexes of the quality of students' writing
products would help researchers in composition. No longer would
student essays . . . [sit] in millions of computers [while] many thousands of teachers are trying to determine how best to take advantage of this opportunity for writing analysis. (Wresch, 1988, p. 16)
However, the significant benefits of a placement exam SDS must
be balanced against faculty resistance to technological change.
Before committing themselves to computerized evaluation of placement
exams, a writing faculty must make several difficult decisions:
They must agree on the type of exam, criteria for a rating system,
representative samples, the apportionment of students among writing
course levels, and other matters. The scarcity of working models
of such programs reflects the difficulties of these tasks. A
writing faculty needs to be experienced and comfortable with a
range of computerized word processing, error and style checkers,
reformatters, and grading utilities. Only then could they realistically
expect to make the conceptual leaps required by such a program.
Emil Roy teaches in the Department of English,
University of South Carolina-Aiken.
Brossell, G. (1986). Current research and unanswered questions
in writing assessment. In K. L. Greenberg, H. S. Wiener, &
R. A. Donovan (Eds.), Writing assessment: Issues and strategies
(pp. 168-182). New York: Longman.
Conlan, G. (1986). "Objective" measures of writing
ability. In K. L. Greenberg, H. S. Wiener, & R. A. Donovan
(Eds.), Writing assessment: Issues and strategies (pp.
109-125). New York: Longman.
Cooper, C. R. & Odell, L. (Eds.). (1977). Evaluating
writing: Describing, measuring, judging. Urbana, IL: National
Council of Teachers of English.
Fader, D. (1986). Writing samples and virtues. In K. L. Greenberg,
H. S. Wiener, & R. A. Donovan (Eds.), Writing assessment:
Issues and strategies (pp. 79-92). New York: Longman.
Garvey, J. J. & Lindstrom, D. H. (1989). Pro's prose meets
writer's workbench: Analysis of typical models for first-year
writing courses. Computers and Composition, 6(2),
Gilliland, J. (1972). Readability. London: Hodder & Stoughton.
Hake, R. (1986). How do we judge what they write? In K. L.
Greenberg, H. S. Wiener, & R. A. Donovan (Eds.), Writing
assessment: Issues and strategies (pp. 153-167). New York: Longman.
Harrison, C. (1980). Readability in the classroom. Cambridge,
England: Cambridge University Press.
Klare, G. R. (1963). The measurement of readability.
Ames, IA: Iowa State University Press.
Lindemann, E. (1987). A rhetoric for writing teachers
(2nd ed.). New York: Oxford University Press.
Murdick, R. C. (1986). MIS concepts and design (2nd ed.).
Englewood Cliffs, NJ: Prentice Hall.
Neilsen, L. & Piche, G. (1981). The influence of headed
nominal complexity and lexical choice on teachers' evaluation
of writing. Research in the Teaching of English, 15,
Nold, E. & Freedman, S. (1977). An analysis of readers'
responses to essays. Research in the Teaching of English,
Reid, S. & Findlay, G. (1986). Writer's workbench analysis
of holistically scored essays. Computers and Composition,
Roy, E. A. (19??). Decision support system for improving first-year
writing courses. Computer-Assisted Composition Journal,
Ruth, L. & Murphy, S. (1988). Designing writing tasks
for the assessment of writing. Norwood, NJ: Ablex.
Simon, H. A. (1960). The new science of management decisions.
New York: Harper & Row.
Thomas, D. & Donlan, D. (1982). Correlations between
holistic and quantitative methods of evaluating student writing,
grades 4-12. Washington, DC: GPO (ERIC Document Reproduction
Service No. ED 211 976).
University Writing Program (1989). Criteria for rating placement
essays. Unpublished manuscript. University of Utah, University
Writing Program, Salt Lake City.
White, E. M. (1986). Pitfalls in the testing of writing. In
K. L. Greenberg, H. S. Wiener, & R. A. Donovan (Eds.), Writing
assessment: Issues and strategies (pp. 53-78). New York: Longman.
Wresch, W. (1988). Six directions for computer analysis of student
writing. Computing Teacher, 15(7), 13-16.
|Note: Standard: <=160 #OR# <=1.2 #OR# >=66%|
|% of total||17.39%|
|Note: Standard: <=284 (Sort "2"-Rated from "3"-Rated)|
|% of total|
|Note: Standard: >=285 (Sort "3"-Rated from "2"-Rated)|
|% of total||15.22%|
|Note: Standard: >=496 #AND# >=1.45 #AND# <=50% (Sort "4"-Rated from "3"-Rated)|