School of Education

How reliable is automated evaluation software in assessing student writing?

To help educators better understand the advantages of common and novel methods for assessing student writing, UD associate professor Joshua Wilson and his co-authors examined the reliability of hand-scoring and automated evaluation software (AES) in assessing the writing of upper-elementary students with different writing abilities. Using multivariate generalizability theory and sample essays in three different genres from 113 students in grades 3 to 5, they found that both scoring methods were highly reliable, each achieving a reliability score of 0.90 (with 1 indicating perfect reliability) across grade levels, genres and writing abilities. But they also found that, in some cases, AES was more reliable for non-struggling students (those above the 25th percentile for their age or grade level), while hand-scoring was more reliable for struggling students (those at or below the 25th percentile). For example, in grade 5 informative writing from non-struggling students, AES achieved a reliability score of 0.96, while hand-scoring achieved a score of 0.93. Among grade 4 struggling writers, however, AES reliability was weak for informative writing (0.45), moderate for persuasive writing (0.67) and acceptable for narrative writing (0.80). Given the results of their study, Wilson and his co-authors make recommendations for how educators can best use both scoring methods in their instructional decision-making and genre-specific writing assessment.

In “Examining Human and Automated Ratings of Elementary Students’ Writing Quality: A Multivariate Generalizability Theory Application,” published in American Educational Research Journal, UD alumna Dandan Chen, now of Duke University School of Medicine, Michael Hebert of University of California, Irvine, and Wilson are the first researchers to use multivariate generalizability theory (G theory) to examine the reliability of both hand-scoring and AES on the same set of student writing samples. This study specifically analyzed Project Essay Grade (PEG), a form of AES that provides writing quality ratings on a 1-5 scale for ideation, organization, style, sentence structure, word choice and conventions. In addition, this study is the first to analyze the reliability of these methods across three genres of student writing (persuasive, narrative and informative). In doing so, Wilson and his co-authors identify the complementary strengths of each scoring method so that educators can leverage them.

Wilson and his co-authors make several recommendations for how educators can use both scoring methods in different contexts. First, they suggest that educators obtain multiple writing samples or use multiple raters when making decisions that have consequences for students, such as instructional placement and disability identification. Their findings suggest that a single writing performance assessment will not give teachers sufficiently reliable data to make correct decisions, regardless of the scoring method.

Second, they suggest that educators consider using PEG or another AES for periodic, class-wide formative assessment. Their findings show that PEG provides immediate and reliable evaluation of student writing overall, especially for non-struggling writers.

Third, they suggest that educators blend AES with hand-scoring, relying on human raters to make the most consequential educational decisions. PEG is highly reliable for average or higher-performing writers, and it can save valuable human resources, as assessing writing is both laborious and time-intensive. However, PEG is less reliable than hand-scoring for making decisions for vulnerable students. Thus, Wilson and his co-authors recommend that educators rely on hand-scoring to assess the performance of struggling writers.

“Efforts to establish prevention-intervention frameworks around writing have been stymied, in part, due to a lack of efficient and reliable assessments that educators can use to make timely and valid data-based decisions about their students,” Wilson said. “We illustrate a path forward that involves combining the strengths of technology and teachers, and we envision that this path will be increasingly feasible to implement as AES continues to expand in K–12 education in the future.”

Article by Jessica Henderson. Photography by Elizabeth Adams.

About Joshua Wilson

Dr. Joshua Wilson is an associate professor in CEHD’s School of Education at the University of Delaware. His research broadly focuses on ways to improve the teaching and learning of writing and specifically focuses on ways that automated writing evaluation systems can facilitate those improvements. His research has been supported by grants from federal, foundation, and industry sponsors and has been published in journals such as Computers & Education, Journal of Educational Psychology, Elementary School Journal, International Journal of Artificial Intelligence in Education and Journal of School Psychology, among others.


Literacy and Language Faculty at CEHD

Wilson’s research complements the work of the literacy and language faculty at CEHD who study writing, including David Coker (assessment, development and instruction), William Lewis (secondary content area writing), Charles A. MacArthur (development and instruction for struggling writers), Kristen Ritchey (students with writing difficulties and disabilities) and Sharon Walpole (elementary writing curriculum).