COMPUTERS and COMPOSITION 10(2), April 1993, pages 45-58
The January 1966 issue of Phi Delta Kappan contained an
article by a former high school English teacher, Ellis Page, who
was experimenting with the computer analysis of student essays.
Entitled "The Imminence of Grading Essays by Computer,"
the article described Page's initial computer analyses and summarized
his views on the role computers might play in the future. That
his ideas were controversial is clearly indicated by the introduction
the editor of Kappan gave the article:
Breakthrough? Or buncombe and ballyhoo? You should know, after reading this careful description of efforts at the University of Connecticut to rescue the conscientious English teacher from his backbreaking burden. It is authored by the researcher whose very first computer strategy for essay grading yielded marks indistinguishable from those of experts. Mr. Page, himself a refugee from English teaching, answers questions that will occur to the skeptical educator. (p. 238)
After raising the possibility that this could all be "buncombe" and that the author is a "refugee from English teaching," the editor does at least describe Page as "careful." In the years that followed, Page's work would earn him far worse labels than refugee, but more on that later. For the moment, Page, and his new application for computers, was in print and ready to take on the educational world.
The fact that 25 years later none of us are using Page Grading Machines or subscribing to Page Computer Evaluation Services tells us, of course, the ultimate outcome of the educational storm that followed Page's article. But the story of his research is an interesting one, and although the grading of essays by computer is no more imminent today than it was 25 years ago, there is much to be learned by reviewing his computer program and the ones that have followed.
In 1965, Page presented a plan to the College Entrance Examination
Board (CEEB) to have computers automatically evaluate student
writing samples. He had already completed two years of research
in which he found computers could grade an essay as effectively
as any human teacher--just by checking for such easily detectable
attributes as sentence length, essay length, and the presence
or absence of particular words. His plea at the time was that
computers take over much of essay grading, not because they were
superior to teachers, but because they could do as well and do
it cheaply and with few problems of fatigue. His hope was that
with computer grading programs available, teachers would assign
more writing, and so students would get the practice they needed
to develop as writers--practice that was not possible in most
classrooms because of the burden it placed on writing teachers.
Here is how Page (1968) stated the problem:
There are those who find great comfort in such surviving pockets of antiquity [manually grading essays]. Yet time is invariably cruel to the inefficient, and some cruelties seem visible today: Teachers in the humanities are often overworked and underpaid, harassed by mounting piles of student themes, or twinged with guilt over not assigning enough for a solid basis of student practice and feedback. (p. 211)
From this initial assessment of the problem faced by writing teachers,
and with the power of newly available computers, Page and a series
of others attempted to bring the statistical capabilities of computers
to the task of writing analysis. Considering the limitations of
the computers of their day, it was a tremendous effort, but one
that bore significant results. Every study published by Page and
others proved that computers could be profitably used for just
such analysis. Unfortunately, every study was derided or ignored.
Page's research focused on correlations between simple features of student texts and the grades assigned by teachers. To do his statistical analyses, he looked for computer variables that approximated the intrinsic values found by human raters. In his terms, he looked for proxes (computer approximations) to correlate with the human trins (intrinsic variables). A trin might be a human measure of value such as aptness of word choice. A prox would be whatever computer-identifiable trait might correlate with that trin value.
His experimental sample was 276 essays written by high school
students in Wisconsin. Each essay was read and evaluated by four
experienced high school English teachers. Page (1968) tried to
count features that correlated most highly with positive human
ratings. He used 30 features--the most statistically significant
are listed in Table 1.
|Proxes||Correlation with Criteria|
|Length of essay in words|
|Average sentence length|
|Number of commas|
|Number of prepositions|
|Number of spelling errors|
|Number of common words|
|Average word length|
His proxes combined to give a multiple correlation coefficient of .71, or in other words, combined to account for most of the attributes found to be important by the human judges. In fact, his combined correlation of proxes was high enough that when they were used on other essays the computer program predicted grades quite reliably--at least the grades given by the computer correlated with the human judges as well as the humans had correlated with each other.
Ignoring for the moment their combined weights, let's look at which measures seemed to matter most. Sentence length seemed to correlate little with human measures of quality. Essay length correlated highly, as did word length. Meanwhile, common words (from the Dale list of words commonly found in reading books) correlated negatively. The portrait of human ratings gained from the correlations is of people who look for fully developed themes and mature vocabulary, but have little concern for sentence complexity, at least complexity as measured by something as basic as length.
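To make the idea of a prox concrete, here is a minimal sketch of how a few of the surface counts in Table 1 might be tallied today. It is not Page's program: the function name `page_style_proxes`, the regular expressions, and the sample sentence are all assumptions made for illustration, and the sentence splitter is deliberately crude.

```python
import re

def page_style_proxes(essay: str) -> dict:
    """Tally a few surface features of the kind listed in Table 1 (illustrative only)."""
    # Words: runs of letters, optionally with one internal apostrophe (e.g. "don't").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", essay)
    # Sentences: split on terminal punctuation; abbreviations will fool this.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "essay_length_words": n_words,
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        "comma_count": essay.count(","),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }

# Hypothetical sample text, not drawn from any study.
sample = "The river rose quickly, and we ran. Nobody looked back."
proxes = page_style_proxes(sample)
print(proxes)
```

Counts like these say nothing about quality by themselves; in Page's design they matter only to the extent that they correlate with the ratings human judges have already given.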
Several years later, Henry Slotnick (1972) of the National Assessment
of Educational Progress (NAEP) examined these proxes again, but
in the context of specific trins. He enumerated specific intrinsic
values held by human raters. He then grouped proxes that helped
identify each of these values. His six values are presented in
Table 2.
|Fluency||Number of total words, number of different words, comma and sentence counts.|
|Spelling||Misspellings tabulated for common, uncommon, and difficult words|
|Diction||Mean and standard deviation of word length|
|Sentence Structure||Number of sentences, mean sentence length|
|Punctuation||Frequency of colons, semicolons, quotes and parentheses|
|Paragraph Development||Number of paragraphs and mean paragraph length|
Slotnick felt this more focused approach helped better identify what human raters were calling "quality." He established his correlations by analyzing the papers of 476 high school students in New York. As in Page's study, Slotnick's computer counts correlated highly with the grades given by human raters.
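Slotnick's grouping of proxes under named values can be sketched as a nested report, one entry per value. The following is an illustration under stated assumptions, not Slotnick's code: the function name `slotnick_style_profile`, the regular expressions, and the sample text are hypothetical, and the Spelling group is omitted because it would need a word list that the article does not supply.

```python
import re
from statistics import mean, pstdev

WORD = re.compile(r"[A-Za-z]+")

def count_words(text: str) -> int:
    return len(WORD.findall(text))

def slotnick_style_profile(essay: str) -> dict:
    """Group surface counts under the values they are meant to approximate."""
    words = [w.lower() for w in WORD.findall(essay)]
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    lengths = [len(w) for w in words]
    return {
        "fluency": {
            "total_words": len(words),
            "different_words": len(set(words)),
            "commas": essay.count(","),
            "sentences": len(sentences),
        },
        "diction": {
            "mean_word_length": mean(lengths) if lengths else 0.0,
            "sd_word_length": pstdev(lengths) if lengths else 0.0,
        },
        "sentence_structure": {
            "sentences": len(sentences),
            "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        },
        "punctuation": {
            "colons": essay.count(":"),
            "semicolons": essay.count(";"),
            "quotes": essay.count('"'),
            "parentheses": essay.count("(") + essay.count(")"),
        },
        "paragraph_development": {
            "paragraphs": len(paragraphs),
            "mean_paragraph_length": mean(count_words(p) for p in paragraphs) if paragraphs else 0.0,
        },
    }

# Hypothetical two-paragraph sample.
sample = "We left early; the road was long.\n\nThe town (a small one) was quiet."
profile = slotnick_style_profile(sample)
print(profile["fluency"], profile["punctuation"])
```

Organizing the counts this way keeps each number tied to the human value it is supposed to stand in for, which is the point of Slotnick's refinement.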
An interesting third study was conducted several years later by Patrick Finn (1977). Rather than using gross measures such as sentence length or number of words, he chose to identify levels of quality by comparing student word choice to word usage in published texts. This effort was aided by the existence of several books that contain counts of words found in various places. Finn chose to use the Carroll, Davies, and Richman tabulation, which contains counts of words found in over one thousand textbooks used in American schools in 1969. The Carroll tabulation includes a Standard Frequency Index (SFI) for each word. An SFI of 90 means a word appeared once every 10 words (e.g., the). An SFI of 80 means a word appeared once every 100 words (e.g., is).
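The two reference points above (an SFI of 90 at one occurrence per 10 words, an SFI of 80 at one per 100) imply a logarithmic scale on which each drop of 10 points marks a tenfold drop in frequency. A minimal sketch of that relationship follows. The formula is inferred from those two examples rather than quoted from the Carroll tabulation, and the function names are assumptions.

```python
import math

def sfi_from_rate(occurrences_per_word: float) -> float:
    """SFI implied by a word's share of running text (0.1 means once every 10 words)."""
    if occurrences_per_word <= 0:
        raise ValueError("rate must be positive")
    return 10 * (math.log10(occurrences_per_word) + 10)

def words_per_occurrence(sfi: float) -> float:
    """Inverse view: roughly how many running words pass between occurrences."""
    return 10 ** (10 - sfi / 10)

print(sfi_from_rate(0.1))        # the article's first example: once every 10 words
print(sfi_from_rate(0.01))       # the article's second example: once every 100 words
print(words_per_occurrence(40))  # on this scale, a word seen about once per million words
```

On this reading, the low frequency words Finn was interested in sit far down the scale, many orders of magnitude rarer than words like the or is.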
Finn's assumption was that better students, having a better vocabulary,
would use more low frequency (less common) words than their classmates.
He also expected to find growth in use of low frequency words
as students moved through the grades. He corrected for problems
such as slang (low frequency in textbooks, but not usually a sign
of brilliance) and topic imposed words (an assignment on pollution
might force even the most ordinary students to use advanced vocabulary
just to write the most basic theme), and discovered that there
was a clear difference in the written vocabularies of older students.
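Finn's procedure, as described here, amounts to a filter: keep the low frequency words, but set aside slang and words the topic forces on every writer. The sketch below illustrates that filter under several assumptions. The function name, the cutoff of 50, the choice to skip words missing from the table, and every SFI value in `toy_sfi` are placeholders invented for the example; none of them come from Finn or from the Carroll tabulation.

```python
def mature_word_choices(words, sfi_table, threshold=50.0,
                        slang=frozenset(), topic_words=frozenset()):
    """Return low-frequency words after excluding slang and topic-imposed words."""
    chosen = []
    for word in words:
        w = word.lower()
        if w in slang or w in topic_words:
            continue  # excluded by the corrections Finn describes
        sfi = sfi_table.get(w)
        # Assumption: words absent from the table are skipped rather than guessed at.
        if sfi is not None and sfi < threshold:
            chosen.append(w)
    return chosen

# Placeholder SFI values for illustration only.
toy_sfi = {"the": 90.0, "river": 60.0, "effluent": 38.0,
           "bummer": 35.0, "pernicious": 36.0}

essay_words = ["The", "effluent", "was", "a", "pernicious", "bummer"]
result = mature_word_choices(
    essay_words,
    toy_sfi,
    slang={"bummer"},          # rare in textbooks, but not a sign of maturity
    topic_words={"effluent"},  # imposed by an assignment on pollution
)
print(result)
```

Counting the surviving words per essay, and comparing those counts across grades, would give the kind of growth measure Finn was after.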
The fact that 25 years have now passed since the original studies were published and no high schools or colleges use computer essay grading sums up the reaction to these studies. Statistics or no, there is little interest in using computers in this way.
Ken Macrorie (1969) summed up the opposition to the original research in a roundtable essay in Research in the Teaching of English. He regarded the student essays themselves as the problem--essays written under strict constraints that remove any possibility of good writing. He referred to such constrained writing as "Engfish." Because the student essays examined were just the usual stuff written under the usual pressures (Engfish), comments about such essays, or judgments based on such essays, "no longer hit[s] the center of teaching English" (p. 235). English teaching, he argued, was about to enter a new era to be called "New English which parallels the New Math" (p. 231). In this new era, "for five years now students have been abandoning Engfish and using their own voices" (p. 233). In such a world the computer has no place because it subscribes to the old rules. Computer programs for text analysis ". . . are, like all Engfish teachers and handbooks, essentially uptight about expression. And removed from life" (p. 235). Macrorie then goes on to point out a number of words that computers could never possibly be programmed to understand. A word such as busted, for instance, connotes a relationship to police that is beyond computer intelligence.
Macrorie might cringe these days at the comparison of his New English curriculum to New Math and might feel differently about the virtues of understanding a word such as busted, but his arguments held through the 1960s. The argument seemed to say that in canned situations where human expression has been eliminated, the computer might be able to correlate well with human graders. But when students were allowed to express themselves fully, the resulting text would be too creative for silicon chips to follow. It was folly even to try. "When language gets into action like that, correcting and rating are the wrong responses" (Macrorie, 1969, p. 236). Exit the computer (and exit human correcting and rating).
It is possible that we entered a world of New English 25 years
ago, and maybe it even lasted longer than New Math, but here is
the world of English (New or Old) seen by Page (1966):
The college English instructor, devoted to literature, seldom has the knowledge, desire, or incentive to perform the necessary laborious and searching correction. What often happens on college campuses would be laughable if it were not so ruinous: First, the English faculty is 'appalled' at the poor English ability of the incoming freshmen. Second, the English faculty uses some screening instrument to sort the students (unrealistically) into two or three groups. Certain 'remedial' students then undergo one or two semesters in composition, which often bears a close resemblance to the unsuccessful high school classes. Third, the English faculty, carefully avoiding any posttests of these 'remedied' students, passes them along into the educational stream, where they never receive any more direct help in composition. (p. 239)
Page's view of colleges is 25 years old, but it seems more accurate
than any vision of a New English. Nevertheless, Page's computer
analysis program was never used outside of a few research settings
and has largely disappeared except for some recent derivative
work. What would it look like if it were used? Here are some of
the places where the researchers themselves felt computer analysis
would be used.
Don Coombs (1969) felt the substance of the research had a different focus than most people realized. He said, "Rather than an investigation of how students' essays can be adequately graded as part of an ongoing instructional program, this is a study of the cognitive processes of experienced essay graders" [emphasis his] (p. 225). The approach taken by Page seems to emphasize this concern with the human grader. In each study, quality was some measure defined by humans and correlated by computers. It could be argued that the programs never examined the essays at all but examined what the human graders saw in them.
Once a correlation has been run and measures have been found to fit the grades given by experienced graders, the correlations shed a great deal of light on the processes of these graders. But they can also be used to compare graders. New graders could receive comparisons on Page's proxes as part of a discussion of their standards and processes. Being made aware how heavily they weigh word choice or value sentence structure could be a learning experience for all faculty. For those who are new to grading or who stray greatly from the practices of the majority, the counts of Page could be a method of learning not so much what the computer does, but what their peers do.
Slotnick (1972) referred to this indirectly when he asserted that
one use of computer tabulations might be to establish norms for
student writing. He was careful to point out that he was not advocating
having the computer set standards for writing. What he meant was
that the computer could be used to tabulate the qualities found
for students at a given grade. For instance, counts could clarify
what vocabulary levels we might expect at a certain grade or what
syntactic structures might be expected. Student essays could quickly
be compared with these norms and teachers apprised of the results.
This process would create a foundation for expectations teachers
might have. As a result, teachers might feel more confident they
were responding fairly to the writing and not damning or praising
students for reasons external to writing instruction (such as
handwriting or choice of essay subject).
Finn (1977) envisioned a system that used information on word
choice to make comments to students as they wrote drafts. For
instance, he felt a computer programmed to discern high and low
frequency words might respond to a student essay with questions
such as the following:
Finn would have his word frequency program also make comments
to an instructor. Assuming the program had not only checked for
high and low frequency words but also words that were imposed
by the assigned topic, he envisioned the program making comments
such as these:
Discovery of the Original
Most hallway discussion of computer assessment leads, at some
point, to the comment that the computer would have a leveling
influence. Typically we hear questions such as, "How could
it note creativity? What would it say about Hemingway?" Early
researchers responded to this directly. Page would try to define
what it means to be original and look for such traits in student
writing. Slotnick took a different perspective. He would program
for the norm and have unusual papers flagged. These unusual papers
might be the work of genius, or the work of fools, but he would
have humans make that determination--the computer would just point
out that they were unusual. But notice, in either case, the computer
would not be forcing students to a prescribed approach: It would
be looking for the original approach. Rather than eliminating
creativity, the computer would be actively searching for it.
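Slotnick's proposal, to program for the norm and flag whatever departs from it, can be sketched with a simple standard-score test. This is one possible reading, not his method: the function name `flag_unusual`, the cutoff of two standard deviations, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def flag_unusual(values_by_essay: dict, z_cutoff: float = 2.0) -> list:
    """Return the ids of essays whose value lies far from the group norm."""
    values = list(values_by_essay.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every essay is identical on this measure
    return [essay_id for essay_id, v in values_by_essay.items()
            if abs(v - mu) / sigma > z_cutoff]

# Hypothetical average word lengths for seven essays.
avg_word_length = {"a": 4.1, "b": 4.3, "c": 4.0, "d": 4.2,
                   "e": 4.1, "f": 4.2, "g": 6.5}
flagged = flag_unusual(avg_word_length)
print(flagged)
```

The function makes no judgment about whether a flagged essay is brilliant or muddled. As Slotnick intended, that determination is left to a human reader.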
In the world of computers, 25 years is a long time. Page did his
work by typing student essays onto punched cards (one line per
card) and then waiting for hours while the cards were processed.
These days, countless second graders use personal computers and
word-processing programs to write, edit, and publish everything
from book reports to birthday cards. Computers are ubiquitous.
Writing on computers is commonplace. Computer text analysis may
not be commonplace, but there has been some work in the area since
Page. These efforts may not make computer essay grading imminent, but they
do show some of the possibilities in the field.
Grading Business Prose
Hal Hellwig of Idaho State University at Boise currently carries on research most closely connected to the Page tradition. His efforts have been to find a way of evaluating business writing. What is new in his work is that rather than relying totally on the variables originated by Page (word and sentence length, etc.), he has brought in the Semantic Differential Scale, a scale based on the "feel" of 1,000 commonly used words. Hellwig (1990) counts the words used from the scale, adds up a total for "potency" and "evaluation" for the words used, and puts them in the following formula:
He then combines this assessment of feel with a calculation of
total essay length and uses the result to generate a grade. Here
is his grading formula:
He compared the grades given by his formula with the grades given by an independent (human) grader and found that the formula accounted for 74% of the variance in the human grades (r squared = .7399, p < .0001).
Although Hellwig's formula and Semantic Differential Scale appeared
to work with one particular sample of business reports, an attempt
to replicate his work with more traditional college essays was
tried by the Alaska Assessment Project (described later). The
Semantic Differential Scores were not found to have any correlation
to the ratings given by human graders. Nevertheless, his work
is interesting in that it raises the possibility of correlating
rater judgments with subjective judgments founded on word choice.
Computers may not be able to feel, but they can quickly tally
scores humans have previously given particular words and guess
at a reaction to a passage.
Guiding Student Revision
Although automatic grading by a computer may be unusual, revision with the aid of a computer is becoming common. We are all used to spell checkers, and proofreading programs such as RIGHTWRITER and GRAMMATIK have gained a wide following, especially in business. Proofreading programs have come under some fire for their lack of accuracy (they usually correctly identify only 30% to 40% of errors in a paper); thus their classroom acceptance has been limited.
Even if proofreading programs were 100% accurate, they only operate at the sentence level. With great advances in computational linguistics research, they might someday catch a comma splice, but they will never detect a malformed paragraph or an inconsistent argument--they don't attempt to do more than parse individual sentences and determine if the parse works.
There are, however, revision programs that operate on larger units
of text. WRITER'S WORKBENCH, developed by
AT&T, and WRITER'S HELPER, published by
Conduit, both provide a large amount of information to writers
about things such as coherence, development, style, and tone.
That such information can accurately direct writers in their revision
efforts has been demonstrated by independent research:
[T]he revision components of WRITER'S HELPER appear to be fairly reliable predictors of essay quality. . . . An assumption that might be established is that, if writers analyze their writing via these evaluation routines and follow the suggestions, their essays will be of higher quality and within an effective range of syntactic complexity. (Reed, 1989, p. 80)
'Style' 's index [a subtest within WRITER'S WORKBENCH] agreed with human readers--82% of the time. . . . The characteristics that influenced readers the most, according to this study, were the number of words per sentence, word length, and readability. . . . Whether the readers were aware of it or not, these variables influenced their judgments. (King, 1982, p. 10)
Another study by Reed (1990) quantified just how much better students could write when they used WRITER'S HELPER for revision. Comparing WRITER'S HELPER to word-processing packages alone, Reed found students who used WRITER'S HELPER achieved holistic scores of 5.5 versus 3.9 achieved by students who only used word processing. Such studies demonstrate that computer programs can help in revision.
The approach of these two programs is divergent from the original
intent of Page--rather than provide information on essay quality
to teachers for grading, WRITER'S HELPER and
WRITER'S WORKBENCH provide information to
students. But the source of the information, specific features
of text, is the same. The programs do word counts, measure sentence
lengths, look for markers of coherence, and infer style. Although
some in the field are troubled by such programs, the results of
these studies indicate that such information can be a help to
The Alaska Assessment Project
The Yukon-Koyukuk School District (Nenana, Alaska) includes some of the most rugged country in the United States and covers an area larger than Wisconsin. In this district, attended primarily by Native American students (Athabascans), senior administrators worked hard to improve the quality of writing curricula, moving away from the grammar drills that too often dominate minority schools and adopting the use of computers for word processing and desktop publishing. After years of such effort, the U. S. Department of Education requested an assessment of the progress they had made. Standardized tests would yield numbers but would focus on a kind of writing instruction they had resolutely avoided--the focus on grammar and usage--on error. Holistic grading worked but required time and training, and it said less than they wanted about student progress. They wanted more information than a scale of 1 to 6.
So project administrators Nicki and Alan McCurry (1992) worked
with a statistician and a software developer to create an assessment
instrument that searched for and counted the features listed in
Table 3.
|text length||to be verbs|
|average paragraph length||'ion words|
|total number of paragraphs||pronouns|
|average sentence length||articles|
|total number of sentences||the|
|standard deviation of length (sentence)||localism/slang|
|unique words||conditional verbs|
|Flesch readability||vague words|
|coordinates||number common words|
|subordinates||number uncommon words|
They then began testing their computer tabulation program on several sets of essays that had also been holistically graded. The essays were from third, sixth, and ninth grade district students, and from three colleges: San Jose State University, University of Texas at El Paso, and City University of New York. Their statistician did a multiple regression analysis of the results, looking for correlations between features of these student papers and their holistic scores. What the statistician was looking for were computer-detectable features that consistently correlated with holistic scores and variables that seemed to increase as students moved through the grades.
The results were even better than the results achieved by Page. In all cases, the measures used correlated very highly with the holistic score given by a team of teachers. Correlations between the computer variables and human holistic scores ran as high as .96. The computer instrument also picked up areas of growth through the three graded samples. In short, the variables proved sensitive enough to account for the vast majority of difference between essay grades.
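For readers who want to see what such a correlation measures, here is a minimal sketch of the simple (single-feature) Pearson correlation between one computer count and a set of holistic scores. It is not the multiple regression the project's statistician ran, and the function name and all of the numbers below are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: text length in words and holistic score for five essays.
text_lengths = [120, 180, 240, 300, 360]
holistic_scores = [2, 4, 3, 4, 6]
r = pearson_r(text_lengths, holistic_scores)
print(round(r, 2))
```

A multiple regression extends the same idea by weighting many such features at once, which is how a set of individually modest counts can combine into a strong predictor of the holistic score.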
What is interesting is that the measures also detected differences
between the way writing samples had been rated. The differences
between the City University of New York ratings and those of San
Jose State University were most significant. The variables that
mattered most in each sample are shown in Table 4.
|San Jose State|
|Mean word length|
|Deviation of word length|
|Number to be|
|Number uncommon words|
|City University of New York|
|Mean sentence length|
It seems the human graders were using different standards, which reinforces exactly what Don Coombs (1969) said more than 20 years ago. To repeat his words, "Rather than an investigation of how students' essays can be adequately graded as part of an ongoing instructional program, this is a study of the cognitive processes of experienced essay graders" [emphasis his] (p. 225). The comparison of holistic scores made it clear that features such as vocabulary and spelling and slang were valued quite differently by different teams of graders. This difference is probably unknown to the raters themselves.
From the results of the Alaska Assessment Project, it appears that the kind of text analysis Page pioneered 25 years ago may prove to be useful in a different way than originally conceived. Rather than prescribe responses to student essays, such software can describe both the traits of student writing and the traits of writing evaluators. Consider, if you will, how a summary of essay traits and of rater evaluations might be used for normalizing sessions prior to holistic grading sessions.
We will surely never elect to abandon human grading. No matter how time consuming or difficult, we feel human grading procedures are our duty and our students' right. But tools such as the Alaska Assessment Project can both inform human raters of their tendencies, improving their self-awareness, and provide additional information about student writing--both for general research and for curricular assessment. In any case, it appears that the work of Ellis Page was not in vain. Computer essay grading may actually be less imminent than it was 25 years ago, but computer text analysis is still alive and well and making a contribution to our understanding of writing.
"Breakthrough? Or buncombe and ballyhoo?" In retrospect,
it appears Page's work was neither. It was early research that
explored the possible role of an emerging technology. It provided
initial results and hinted at possibilities worth pursuing. Twenty-five
years later we are still exploring. Nothing momentous is currently
imminent, but it has been an interesting trip just the same. I
think most of us are glad we came on board.
William Wresch is the Chair of the Department
of Mathematics and Computing at the University of Wisconsin--Stevens
Point.
Carroll, J., Davies, P., & Richman, B. (1971). Word frequency book. Boston, MA: Houghton Mifflin.
Coombs, D. (1969). Roundtable review. Research in the Teaching of English, 3, 225.
Finn, P. (1977). Computer-aided description of mature word choices in writing. In M. Cooper & L. Odell (Eds.), Evaluating writing: Describing, measuring, judging (pp. 69-90). Urbana, IL: National Council of Teachers of English.
Heise, D. (1965). Semantic differential profiles for 1000 most frequent English words. Psychological Monographs: General and Applied, 79, 1-31.
Hellwig, H. (1990, March). Computational text analysis for predicting holistic writing scores. Paper presented at Conference on College Composition and Communication, Chicago, IL.
King, W. (1982). Style's index compared with reader's scores. (Technical Report). Davis, CA: University of California.
Macrorie, K. (1969). Roundtable review. Research in the Teaching of English, 3, 228-236.
McCurry, N., & McCurry, A. (1992). Writing assessment for the twenty-first century. Computer Teacher, 19, 35-37.
Page, E. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47, 238-243.
Page, E. (1968). Analyzing student essays by computer. International Review of Education, 14, 210-225.
Reed, W. (1989). The effectiveness of composing process software: An analysis of WRITER'S HELPER. Computers in the Schools, 6, 67-82.
Reed, W. (1990, April). The effect of composing process software on the revision and quality of persuasive essays. Paper presented at the Eastern Educational Research Association, Clearwater, FL.
Slotnick, H. (1972). Toward a theory of computer essay grading.
Journal of Educational Measurement, 9, 253-263.
Alaska Assessment Program, Yukon-Koyukuk School District, P. O. Box 80210, Fairbanks, Alaska 99708.
WRITER'S HELPER, Conduit, University of Iowa--Oakdale Campus, Iowa City, Iowa, 52242.
WRITER'S WORKBENCH, AT&T, P. O. Box 19901,
Indianapolis, IN 46219.