The Imminence of Grading Essays by Computer--25 Years Later

COMPUTERS and COMPOSITION 10(2), April 1993, pages 45-58

William Wresch

The January 1966 issue of Phi Delta Kappan contained an article by a former high school English teacher, Ellis Page, who was experimenting with the computer analysis of student essays. Entitled "The Imminence of Grading Essays by Computer," the article described Page's initial computer analyses and summarized his views on the role computers might play in the future. That his ideas were controversial is clearly indicated by the introduction the editor of Kappan gave the article:

Breakthrough? Or buncombe and ballyhoo? You should know, after reading this careful description of efforts at the University of Connecticut to rescue the conscientious English teacher from his backbreaking burden. It is authored by the researcher whose very first computer strategy for essay grading yielded marks indistinguishable from those of experts. Mr. Page, himself a refugee from English teaching, answers questions that will occur to the skeptical educator. (p. 238)

After raising the possibility that this could all be "buncombe" and that the author is a "refugee from English teaching," the editor does at least describe Page as "careful." In the years that followed, Page's work would earn him far worse labels than refugee, but more on that later. For the moment, Page, and his new application for computers, was in print and ready to take on the educational world.

The fact that 25 years later none of us are using Page Grading Machines or subscribing to Page Computer Evaluation Services tells us, of course, the ultimate outcome of the educational storm that followed Page's article. But the story of his research is an interesting one, and although the grading of essays by computer is no more imminent today than it was 25 years ago, there is much to be learned by reviewing his computer program and the ones that have followed.

In 1965, Page presented a plan to the College Entrance Examination Board (CEEB) to have computers automatically evaluate student writing samples. He had already completed two years of research in which he found computers could grade an essay as effectively as any human teacher--just by checking for such easily detectable attributes as sentence length, essay length, and the presence or absence of particular words. His plea at the time was that computers take over much of essay grading, not because they were superior to teachers, but because they could do as well and do it cheaply and with few problems of fatigue. His hope was that with computer grading programs available, teachers would assign more writing, and so students would get the practice they needed to develop as writers--practice that was not possible in most classrooms because of the burden it placed on writing teachers. Here is how Page (1968) stated the problem:

There are those who find great comfort in such surviving pockets of antiquity [manually grading essays]. Yet time is invariably cruel to the inefficient, and some cruelties seem visible today: Teachers in the humanities are often overworked and underpaid, harassed by mounting piles of student themes, or twinged with guilt over not assigning enough for a solid basis of student practice and feedback. (p. 211)

From this initial assessment of the problem faced by writing teachers, and with the power of newly available computers, Page and a series of others attempted to bring the statistical capabilities of computers to the task of writing analysis. Considering the limitations of the computers of their day, it was a tremendous effort, but one that bore significant results. Every study published by Page and others proved that computers could be profitably used for just such analysis. Unfortunately, every study was derided or ignored.

The Original Studies

Page's research focused on correlations between simple features of student texts and the grades assigned by teachers. To do his statistical analyses, he looked for computer variables that approximated the intrinsic values found by human raters. In his terms, he looked for proxes (computer approximations) to correlate with the human trins (intrinsic variables). A trin might be a human measure of value such as aptness of word choice. A prox would be whatever computer-identifiable trait might correlate with that trin value.

His experimental sample was 276 essays written by high school students in Wisconsin. Each essay was read and evaluated by four experienced high school English teachers. Page (1968) tried to count features that correlated most highly with positive human ratings. He used 30 features--the most statistically significant are listed in Table 1.

Table 1
Variables Used in Project Essay Grade I-A (Page, 1968, p. 216)


Proxes	Correlation with Criteria

Length of essay in words	.32
Average sentence length	.04
Number of commas	.34
Number of prepositions	.25
Number of spelling errors	-.21
Number of common words	-.48
Average word length	.51
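
Several of the proxes in Table 1 are plain surface counts, and a short Python sketch shows how little machinery they need. The function name count_proxes, the tokenizing rules, and the tiny preposition list are assumptions made for illustration; this is not Page's program, which worked from punched cards and fuller word lists.

```python
import re

# Hypothetical, deliberately tiny preposition list; a real list would be longer.
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "to", "from"}

def count_proxes(essay: str) -> dict:
    """Count a few surface features of the kind listed in Table 1."""
    words = re.findall(r"[A-Za-z']+", essay)
    # Treat runs of ., !, or ? as sentence boundaries (a rough assumption).
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    return {
        "essay_length": n_words,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "commas": essay.count(","),
        "prepositions": sum(w.lower() in PREPOSITIONS for w in words),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }
```

Spelling errors and common words are left out because they depend on a dictionary and on the Dale list, neither of which is reproduced here.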

His proxes combined to give a multiple correlation coefficient of .71; in other words, taken together they accounted for about half of the variance in the human judges' ratings (.71 squared is roughly .50). In fact, his combined correlation of proxes was high enough that when they were used on other essays the computer program predicted grades quite reliably--at least the grades given by the computer correlated with the human judges as well as the humans had correlated with each other.

Ignoring for the moment their combined weights, let's look at which measures seemed to matter most. Sentence length seemed to correlate little with human measures of quality. Essay length correlated highly, as did word length. Meanwhile, common words (from the Dale list of words commonly found in reading books) correlated negatively. The portrait of human ratings gained from the correlations is of people who look for fully developed themes and mature vocabulary, but have little concern for sentence complexity, at least complexity as measured by something as basic as length.
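
Each figure in Table 1 is a correlation between one prox and the human ratings. The sketch below shows how such a coefficient might be computed, assuming the usual Pearson product-moment formula; the source does not say which statistic or software Page used, and the five data points are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up illustration data: average word length and a teacher's grade
# for five imaginary essays. These are not Page's numbers.
avg_word_length = [3.8, 4.1, 4.4, 4.6, 5.0]
teacher_grade = [2, 3, 3, 4, 5]
r = pearson(avg_word_length, teacher_grade)
```

A value near 1 means the prox rises with the grade, a value near -1 means it falls as the grade rises (as with common words in Table 1), and a value near 0 means the prox tells us little (as with sentence length).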

Several years later, Henry Slotnick (1972) of the National Assessment of Educational Progress (NAEP) examined these proxes again, but in the context of specific trins. He enumerated specific intrinsic values held by human raters. He then grouped proxes that helped identify each of these values. His six values are presented in Table 2.

Table 2
Slotnick's Six Factors (Slotnick, 1972, p. 262)


Number	Quality	Characteristic Proxes

1.	Fluency	Number of total words, number of different words, comma and sentence counts.
2.	Spelling	Misspellings tabulated for common, uncommon, and difficult words
3.	Diction	Mean and standard deviation of word length
4.	Sentence Structure	Number of sentences, mean sentence length
5.	Punctuation	Frequency of colons, semicolons, quotes and parentheses
6.	Paragraph Development	Number of paragraphs and mean paragraph length

Slotnick felt this more focused approach helped better identify what human raters were calling "quality." He established his correlations by analyzing the papers of 476 high school students in New York. Like Page's, Slotnick's computer counts correlated highly with grades given by human raters.
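
The grouping in Table 2 can be pictured as a nested record: one entry per factor, each holding its characteristic proxes. The following is a minimal sketch under several assumptions: the function name slotnick_proxes is hypothetical, paragraphs are taken to be separated by blank lines, the essay is non-empty, and the spelling factor is omitted because it needs a dictionary.

```python
import re
from statistics import mean, pstdev

def slotnick_proxes(essay: str) -> dict:
    """Group a few surface counts under the factor names from Table 2."""
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay)
    lengths = [len(w) for w in words]
    return {
        "fluency": {
            "total_words": len(words),
            "different_words": len({w.lower() for w in words}),
            "commas": essay.count(","),
        },
        "diction": {
            "mean_word_length": mean(lengths),
            "sd_word_length": pstdev(lengths),
        },
        "sentence_structure": {
            "sentences": len(sentences),
            "mean_sentence_length": len(words) / len(sentences),
        },
        # Colons, semicolons, double quotes, and parentheses.
        "punctuation": {mark: essay.count(mark) for mark in ':;"()'},
        "paragraph_development": {
            "paragraphs": len(paragraphs),
            "mean_paragraph_length": len(words) / len(paragraphs),
        },
    }
```

Keeping the counts grouped by factor makes it possible to report, factor by factor, which aspect of "quality" a given set of human ratings tracks most closely.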

An interesting third study was conducted several years later by Patrick Finn (1977). Rather than using gross measures such as sentence length or number of words, he chose to identify levels of quality by comparing student word choice to word usage in published texts. This effort was aided by the existence of several books that contain counts of words found in various places. Finn chose to use the Carroll, Davies, and Richman tabulation, which contains counts of words found in over one thousand textbooks used in American schools in 1969. The Carroll tabulation includes a Standard Frequency Index (SFI) for each word. An SFI of 90 means a word appeared once every 10 words (e.g., the). An SFI of 80 means a word appeared once every 100 words (e.g., is).
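
The two SFI examples above fit a logarithmic scale on which every drop of 10 points marks a word that is ten times rarer. The sketch below assumes the relation SFI = 10 * (log10(p) + 10), where p is the proportion of running words the word accounts for; that formula matches both examples, but the function names are illustrative and the text itself does not spell the formula out.

```python
import math

def sfi_from_proportion(p: float) -> float:
    """Standard Frequency Index for a word making up proportion p of running text."""
    return 10 * (math.log10(p) + 10)

def proportion_from_sfi(sfi: float) -> float:
    """Inverse: the share of running words implied by an SFI value."""
    return 10 ** (sfi / 10 - 10)
```

By the same rule, an SFI of 70 would correspond to a word appearing about once every 1,000 words.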

Finn's assumption was that better students, having a better vocabulary, would use more low frequency (less common) words than their classmates. He also expected to find growth in use of low frequency words as students moved through the grades. He corrected for problems such as slang (low frequency in textbooks, but not usually a sign of brilliance) and topic imposed words (an assignment on pollution might force even the most ordinary students to use advanced vocabulary just to write the most basic theme), and discovered that there was a clear difference in the written vocabularies of older students.
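
Finn's measure can be approximated as the share of an essay's words that fall below some SFI cutoff, once slang and topic-imposed words have been set aside. The sketch below is one possible reading, not Finn's procedure: the cutoff of 50, the function name, and every SFI value other than the two examples quoted earlier (the, is) are invented for illustration.

```python
def low_frequency_share(words, sfi_table, excluded, threshold=50.0):
    """Share of scored words whose SFI falls below `threshold`.

    words:     tokens from one essay
    sfi_table: dict mapping word -> SFI
    excluded:  slang and topic-imposed words to leave out of the count
    """
    scored = [w.lower() for w in words
              if w.lower() in sfi_table and w.lower() not in excluded]
    if not scored:
        return 0.0
    rare = [w for w in scored if sfi_table[w] < threshold]
    return len(rare) / len(scored)

# Apart from "the" and "is", these SFI values are hypothetical.
SFI = {"the": 90.0, "is": 80.0, "river": 60.0, "pollution": 48.0,
       "insidious": 38.0, "gonna": 35.0}
essay = "The pollution is insidious the river is gonna".split()
share = low_frequency_share(essay, SFI, excluded={"pollution", "gonna"})
```

Here "pollution" is excluded as topic imposed and "gonna" as slang, so only "insidious" counts as a mature word choice among the six remaining words.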

Reaction to the Studies: Nightmares and Visions

The fact that 25 years have now passed since the original studies were published and no high schools or colleges use computer essay grading sums up the reaction to these studies. Statistics or no, there is little interest in using computers in this way.

Ken Macrorie (1969) summed up the opposition to the original research in a roundtable essay in Research in the Teaching of English. He regarded the student essays themselves as the problem--essays written under strict constraints that remove any possibility of good writing. He referred to such constrained writing as "Engfish." Because the student essays examined were just the usual stuff written under the usual pressures (Engfish), comments about such essays, or judgments based on such essays, "no longer hit[s] the center of teaching English" (p. 235). English teaching is about to enter a new era to be called "New English which parallels the New Math" (p. 231). In this new era, "for five years now students have been abandoning Engfish and using their own voices" (p. 233). In such a world the computer has no place because it subscribes to the old rules. Computer programs for text analysis ". . . are, like all Engfish teachers and handbooks, essentially uptight about expression. And removed from life" (p. 235). Macrorie then goes on to point out a number of words that computers could never possibly be programmed to understand. A word such as busted, for instance, connotes a relationship to police that is beyond computer intelligence.

Macrorie might cringe these days at the comparison of his New English curriculum to New Math and might feel differently about the virtues of understanding a word such as busted, but his arguments held through the 1960s. The argument seemed to say that in canned situations where human expression has been eliminated, the computer might be able to correlate well with human graders. But when students were allowed to express themselves fully, the resulting text would be too creative for silicon chips to follow. It was folly even to try. "When language gets into action like that, correcting and rating are the wrong responses" (Macrorie, 1969, p. 236). Exit the computer (and exit human correcting and rating).

It is possible that we entered a world of New English 25 years ago, and maybe it even lasted longer than New Math, but here is the world of English (New or Old) seen by Page (1966):

The college English instructor, devoted to literature, seldom has the knowledge, desire, or incentive to perform the necessary laborious and searching correction. What often happens on college campuses would be laughable if it were not so ruinous: First, the English faculty is 'appalled' at the poor English ability of the incoming freshmen. Second, the English faculty uses some screening instrument to sort the students (unrealistically) into two or three groups. Certain 'remedial' students then undergo one or two semesters in composition, which often bears a close resemblance to the unsuccessful high school classes. Third, the English faculty, carefully avoiding any posttests of these 'remedied' students, passes them along into the educational stream, where they never receive any more direct help in composition. (p. 239)

Page's view of colleges is 25 years old, but it seems more accurate than any vision of a New English. Nevertheless, Page's computer analysis program was never used outside of a few research settings and has largely disappeared except for some recent derivative work. What would it look like if it were used? Here are some of the places where the researchers themselves felt computer analysis would be used.

Grader Education

Don Coombs (1969) felt the substance of the research had a different focus than most people realized. He said, "Rather than an investigation of how students' essays can be adequately graded as part of an ongoing instructional program, this is a study of the cognitive processes of experienced essay graders" [emphasis his] (p. 225). The approach taken by Page seems to emphasize this concern with the human grader. In each study, quality was some measure defined by humans and correlated by computers. It could be argued that the programs never examined the essays at all but examined what the human graders saw in them.

Once a correlation has been run and measures have been found to fit the grades given by experienced graders, the correlations shed a great deal of light on the processes of these graders. But they can also be used to compare graders. New graders could receive comparisons on Page's proxes as part of a discussion of their standards and processes. Being made aware how heavily they weigh word choice or value sentence structure could be a learning experience for all faculty. For those who are new to grading or who stray greatly from the practices of the majority, the counts of Page could be a method of learning not so much what the computer does, but what their peers do.

Slotnick (1972) referred to this indirectly when he asserted that one use of computer tabulations might be to establish norms for student writing. He was careful to point out that he was not advocating having the computer set standards for writing. What he meant was that the computer could be used to tabulate the qualities found for students at a given grade. For instance, counts could clarify what vocabulary levels we might expect at a certain grade or what syntactic structures might be expected. Student essays could quickly be compared with these norms and teachers apprised of the results. This process would create a foundation for expectations teachers might have. As a result, teachers might feel more confident they were responding fairly to the writing and not damning or praising students for reasons external to writing instruction (such as handwriting or choice of essay subject).

Student Feedback

Finn (1977) envisioned a system that used information on word choice to make comments to students as they wrote drafts. For instance, he felt a computer programmed to discern high and low frequency words might respond to a student essay with questions such as the following:

You use the word 'workers' seven times. Could you combine some of the ideas about workers into the same sentences?
Would your argument be more convincing if you used 'cannot' instead of 'can't'?
You have used the word 'they' eleven times. Is the reference always clear? (p. 87)
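
Questions of this kind can be generated from nothing more than word counts. The sketch below is a rough illustration rather than Finn's design: the stop-word list, the repeat limit of five, and the function name student_questions are all assumptions.

```python
import re
from collections import Counter

# Assumed settings, not taken from Finn: ignore very common function words
# and question any remaining word used at least this many times.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "that"}
REPEAT_LIMIT = 5

def student_questions(draft: str) -> list:
    """Turn simple word counts into revision questions for the writer."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", draft)]
    counts = Counter(words)
    questions = []
    # Heavily repeated content words first.
    for word, n in counts.most_common():
        if n >= REPEAT_LIMIT and word not in STOP_WORDS and "'" not in word:
            questions.append(
                f"You use the word '{word}' {n} times. "
                "Could you combine some of those ideas?")
    # Then negative contractions such as can't or don't.
    for word in sorted(w for w in counts if w.endswith("n't")):
        questions.append(
            "Would your argument be more convincing "
            f"without the contraction '{word}'?")
    return questions
```

The program only asks; deciding whether seven uses of a word is too many remains the writer's call.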

Instructor Feedback

Finn would have his word frequency program also make comments to an instructor. Assuming the program had not only checked for high and low frequency words but also words that were imposed by the assigned topic, he envisioned the program making comments such as these:

Very few Topic Imposed Words. Has the student written on the assigned topic?
Extraordinarily high proportion of pronouns. Is the reference always clear?
Might sentence combining be in order? (Finn, 1977, p. 87)
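
The instructor comments follow the same pattern, this time comparing proportions against cutoffs. In the sketch below the pronoun list and both cutoffs are arbitrary placeholders chosen for illustration, and the function name instructor_comments is hypothetical; Finn gives no such values in the passage quoted.

```python
import re

# A short, illustrative pronoun list; not taken from Finn.
PRONOUNS = {"he", "she", "it", "they", "them", "his", "her", "their",
            "this", "these"}

def instructor_comments(essay, topic_words, min_topic_share=0.02,
                        max_pronoun_share=0.15):
    """Flag essays with few topic-imposed words or many pronouns.

    The two cutoffs are arbitrary placeholders, not values from Finn.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", essay)]
    if not words:
        return ["Empty essay."]
    topic_share = sum(w in topic_words for w in words) / len(words)
    pronoun_share = sum(w in PRONOUNS for w in words) / len(words)
    comments = []
    if topic_share < min_topic_share:
        comments.append("Very few Topic Imposed Words. "
                        "Has the student written on the assigned topic?")
    if pronoun_share > max_pronoun_share:
        comments.append("Extraordinarily high proportion of pronouns. "
                        "Is the reference always clear?")
    return comments
```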

Discovery of the Original

Most hallway discussion of computer assessment leads, at some point, to the comment that the computer would have a leveling influence. Typically we hear questions such as, "How could it note creativity? What would it say about Hemingway?" Early researchers responded to this directly. Page would try to define what it means to be original and look for such traits in student writing. Slotnick took a different perspective. He would program for the norm and have unusual papers flagged. These unusual papers might be the work of genius, or the work of fools, but he would have humans make that determination--the computer would just point out that they were unusual. But notice, in either case, the computer would not be forcing students to a prescribed approach: It would be looking for the original approach. Rather than eliminating creativity, the computer would be actively searching for it.
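
Slotnick's idea of programming for the norm and flagging the exceptions can be sketched as a simple outlier test. The version below assumes a single numeric measure per essay and a cutoff of two standard deviations; both choices, and the function name flag_unusual, are illustrative rather than anything Slotnick specified.

```python
from statistics import mean, pstdev

def flag_unusual(scores: dict, cutoff: float = 2.0) -> list:
    """Return essay ids whose score lies more than `cutoff` standard
    deviations from the group mean, in either direction."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [essay_id for essay_id, v in scores.items()
            if abs(v - mu) / sigma > cutoff]
```

The test is symmetric on purpose: it does not distinguish the work of genius from the work of fools, and leaves that judgment to a human reader.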

Three Current Computer Approaches

In the world of computers, 25 years is a long time. Page did his work by typing student essays onto punched cards (one line per card) and then waiting for hours while the cards were processed. These days, countless second graders use personal computers and word-processing programs to write, edit, and publish everything from book reports to birthday cards. Computers are ubiquitous. Writing on computers is commonplace. Computer text analysis may not be commonplace, but there has been some work in the area since Page. These efforts may not make computer essay grading imminent, but they do show some of the possibilities in the field.

Grading Business Prose

Hal Hellwig of Idaho State University at Boise currently carries on research most closely connected to the Page tradition. His efforts have been to find a way of evaluating business writing. What is new in his work is that rather than relying totally on the variables originated by Page (word and sentence length, etc.), he has brought in the Semantic Differential Scale, a scale based on the "feel" of 1,000 commonly used words. Hellwig (1990) counts the words used from the scale, adds up a total for "potency" and "evaluation" for the words used, and puts them in the following formula:

(100*Potency - 80*Evaluation)/10.

He then combines this assessment of feel with a calculation of total essay length and uses the result to generate a grade. Here is his grading formula:

He compared the grades given by his formula with the grades given by an independent (human) grader and found that the formula accounted for 74% of the variance in the human grades (r squared = .7399, p < .0001).

Although Hellwig's formula and Semantic Differential Scale appeared to work with one particular sample of business reports, an attempt to replicate his work with more traditional college essays was tried by the Alaska Assessment Project (described later). The Semantic Differential Scores were not found to have any correlation to the ratings given by human graders. Nevertheless, his work is interesting in that it raises the possibility of correlating rater judgments with subjective judgments founded on word choice. Computers may not be able to feel, but they can quickly tally scores humans have previously given particular words and guess at a reaction to a passage.
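
The "feel" score quoted above, (100*Potency - 80*Evaluation)/10, is easy to sketch once each word carries a potency and an evaluation rating. The example below stops at that score and does not attempt the step that combines it with essay length into a grade. The ratings are invented for illustration and are not values from Heise (1965); skipping words that are missing from the scale is also an assumption.

```python
def feel_score(words, scale):
    """Apply the quoted formula (100*Potency - 80*Evaluation)/10.

    scale maps a word to a (potency, evaluation) pair. Words missing
    from the scale are skipped, which is an assumption of this sketch.
    """
    potency = sum(scale[w][0] for w in words if w in scale)
    evaluation = sum(scale[w][1] for w in words if w in scale)
    return (100 * potency - 80 * evaluation) / 10

# Invented ratings for illustration; not values from Heise (1965).
SCALE = {"strong": (2.0, 1.0), "profit": (1.0, 2.0), "loss": (0.5, -1.5)}
score = feel_score(["strong", "profit", "report", "loss"], SCALE)
```

With these made-up ratings the potency total is 3.5 and the evaluation total is 1.5, so the score is (350 - 120)/10 = 23.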

Guiding Student Revision

Although automatic grading by a computer may be unusual, revision with the aid of a computer is becoming common. We are all used to spell checkers, and proofreading programs such as RIGHTWRITER and GRAMMATIK have gained a wide following, especially in business. Proofreading programs have come under some fire for their lack of accuracy (they usually correctly identify only 30% to 40% of errors in a paper); thus their classroom acceptance has been limited.

Even if proofreading programs were 100% accurate, they only operate at the sentence level. With great advances in computational linguistics research, they might someday catch a comma splice, but they will never detect a malformed paragraph or an inconsistent argument--they don't attempt to do more than parse individual sentences and determine if the parse works.

There are, however, revision programs that operate on larger units of text. WRITER'S WORKBENCH, developed by AT&T, and WRITER'S HELPER, published by Conduit, both provide a large amount of information to writers about things such as coherence, development, style, and tone. That such information can accurately direct writers in their revision efforts has been demonstrated by independent research:

[T]he revision components of WRITER'S HELPER appear to be fairly reliable predictors of essay quality. . . . An assumption that might be established is that, if writers analyze their writing via these evaluation routines and follow the suggestions, their essays will be of higher quality and within an effective range of syntactic complexity. (Reed, 1989, p. 80)
Style's index [a subtest within WRITER'S WORKBENCH] agreed with human readers 82% of the time. . . . The characteristics that influenced readers the most, according to this study, were the number of words per sentence, word length, and readability. . . . Whether the readers were aware of it or not, these variables influenced their judgments. (King, 1982, p. 10)

Another study by Reed (1990) quantified just how much better students could write when they used WRITER'S HELPER for revision. Comparing WRITER'S HELPER to word-processing packages alone, Reed found students who used WRITER'S HELPER achieved holistic scores of 5.5 versus 3.9 achieved by students who only used word processing. Such studies demonstrate that computer programs can help in revision.

The approach of these two programs is divergent from the original intent of Page--rather than provide information on essay quality to teachers for grading, WRITER'S HELPER and WRITER'S WORKBENCH provide information to students. But the source of the information, specific features of text, is the same. The programs do word counts, measure sentence lengths, look for markers of coherence, and infer style. Although some in the field are troubled by such programs, the results of these studies indicate that such information can be a help to students.

The Alaska Assessment Project

The Yukon-Koyukuk School District (Nenana, Alaska) includes some of the most rugged country in the United States and covers an area larger than Wisconsin. In this district, attended primarily by Native American students (Athabascans), senior administrators worked hard to improve the quality of writing curricula, moving away from the grammar drills that too often dominate minority schools and adopting the use of computers for word processing and desktop publishing. After years of such effort, the U. S. Department of Education requested an assessment of the progress they had made. Standardized tests would yield numbers but would focus on a kind of writing instruction they had resolutely avoided--the focus on grammar and usage--on error. Holistic grading worked but required time and training, and it said less than they wanted about student progress. They wanted more information than a scale of 1 to 6.

So project administrators Nicki and Alan McCurry (1992) worked with a statistician and a software developer to create an assessment instrument that searched for and counted the features listed in Table 3:

Table 3
Items Tabulated by Alaska Assessment Project


text length	to be verbs
average paragraph length	'ion words
total number of paragraphs	pronouns
average sentence length	articles
total number of sentences	the
standard deviation of length (sentence)	localism/slang
unique words	conditional verbs
Fog readability	prepositions
Flesch readability	vague words
transitions	opinionated words
coordinates	number common words
subordinates	number uncommon words
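
The two readability entries in Table 3 are normally computed from a handful of counts. The sketch below uses the standard published forms of the Flesch Reading Ease and Gunning Fog formulas; whether the Alaska project used exactly these variants is an assumption, and the counts are passed in directly so that no syllable-counting heuristic has to be guessed at.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index: roughly the school grade needed to read the text.

    complex_words counts words of three or more syllables.
    """
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))
```

Both formulas lean on average sentence length, which is one reason readability scores and sentence-length counts tend to rise and fall together in tabulations like this one.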

They then began testing their computer tabulation program on several sets of essays that had also been holistically graded. The essays were from third, sixth, and ninth grade district students, and from three colleges: San Jose State University, University of Texas at El Paso, and City University of New York. Their statistician did a multiple regression analysis of the results, looking for correlations between features of these student papers and their holistic scores. What the statistician was looking for were computer-detectable features that consistently correlated with holistic scores and variables that seemed to increase as students moved through the grades.

The results were even better than the results achieved by Page. In all cases, the measures used correlated very highly to the holistic score given by a team of teachers. Correlations between the computer variables and human holistic scores ran as high as .96. The computer instrument also picked up areas of growth through the three graded samples. In short, the variables proved sensitive enough to account for the vast majority of difference between essay grades.

What is interesting is that the measures also detected differences between the way writing samples had been rated. The differences between the City University of New York ratings and those of San Jose State University were most significant. The variables that mattered most in each sample are shown in Table 4.

Table 4
Correlations of Holistically Scored Essays


Variable	Cumulative Correlation

San Jose State
Total sentences	.07
Number punctuation	.18
Mean word length	.35
Deviation of word length	.48
Number to be	.61
Number subordinates	.63
Number transitions	.69
Number pronouns	.76
Number 'ion	.85
Number the	.92
Number uncommon words	.96

City University of New York
Total words	.17
Fog Readability	.22
Mean sentence length	.26
Number prepositions	.28
Number slang	.53

It seems the human graders were using different standards, which reinforces exactly what Don Coombs (1969) said more than 20 years ago. To repeat his words, "Rather than an investigation of how students' essays can be adequately graded as part of an ongoing instructional program, this is a study of the cognitive processes of experienced essay graders" [emphasis his] (p. 225). The comparison of holistic scores made it clear that features such as vocabulary and spelling and slang were valued quite differently by different teams of graders. This difference is probably unknown to the raters themselves.

From the results of the Alaska Assessment Project, it appears that the kind of text analysis Page pioneered 25 years ago may prove to be useful in a different way than originally conceived. Rather than prescribe responses to student essays, such software can describe both the traits of student writing and the traits of writing evaluators. Consider, if you will, how a summary of essay traits and of rater evaluations might be used for normalizing sessions prior to holistic grading sessions.

We will surely never elect to abandon human grading. No matter how time consuming or difficult, we feel human grading procedures are our duty and our students' right. But the use of such tools as the Alaska Assessment Project can both inform human raters of their tendencies and improve their self awareness, and it can provide additional information about student writing--both for general research and for curricular assessment. In any case, it appears that the work of Ellis Page was not in vain. Computer essay grading may actually be less imminent than it was 25 years ago, but computer text analysis is still alive and well and making a contribution to our understanding of writing.

"Breakthrough? Or buncombe and ballyhoo?" In retrospect, it appears Page's work was neither. It was early research that explored the possible role of an emerging technology. It provided initial results and hinted at possibilities worth pursuing. Twenty-five years later we are still exploring. Nothing momentous is currently imminent, but it has been an interesting trip just the same. I think most of us are glad we came on board.

William Wresch is the Chair of the Department of Mathematics and Computing at the University of Wisconsin--Stevens Point.

References

Carroll, J., Davies, P., & Richman, B. (1971). Word frequency book. Boston, MA: Houghton Mifflin.

Coombs, D. (1969). Roundtable review. Research in the Teaching of English, 3, 225.

Finn, P. (1977). Computer-aided description of mature word choices in writing. In M. Cooper & L. Odell (Eds.), Evaluating writing: Describing, measuring, judging (pp. 69-90). Urbana, IL: National Council of Teachers of English.

Heise, D. (1965). Semantic differential profiles for 1000 most frequent English words. Psychological Monographs: General and Applied, 79, 1-31.

Hellwig, H. (1990, March). Computational text analysis for predicting holistic writing scores. Paper presented at Conference on College Composition and Communication, Chicago, IL.

King, W. (1982). Style's index compared with reader's scores. (Technical Report). Davis, CA: University of California.

Macrorie, K. (1969). Roundtable review. Research in the Teaching of English, 3, 228-236.

McCurry, N., & McCurry, A. (1992). Writing assessment for the twenty-first century. The Computing Teacher, 19, 35-37.

Page, E. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47, 238-243.

Page, E. (1968). Analyzing student essays by computer. International Review of Education, 14, 210-225.

Reed, W. (1989). The effectiveness of composing process software: An analysis of WRITER'S HELPER. Computers in the Schools, 6, 67-82.

Reed, W. (1990, April). The effect of composing process software on the revision and quality of persuasive essays. Paper presented at the Eastern Educational Research Association, Clearwater, FL.

Slotnick, H. (1972). Toward a theory of computer essay grading. Journal of Educational Measurement, 9, 253-263.

Software Described

Alaska Assessment Program, Yukon-Koyukuk School District, P. O. Box 80210, Fairbanks, Alaska 99708.

WRITER'S HELPER, Conduit, University of Iowa--Oakdale Campus, Iowa City, Iowa, 52242.

WRITER'S WORKBENCH, AT&T, P. O. Box 19901, Indianapolis, IN 46219.