On-line Grading of Student Essays: PEG goes on the World Wide Web


Assessment & Evaluation in Higher Education, Vol. 26, No. 3, 2001

MARK D. SHERMIS, Department of Psychology, Indiana University Purdue University Indianapolis, USA
HOWARD R. MZUMARA, IUPUI Testing Center, Indiana University Purdue University Indianapolis, USA
JENNIFER OLSON, Department of Psychology, Indiana University Purdue University Indianapolis, USA
SUSANMARIE HARRINGTON, Department of English, Indiana University Purdue University Indianapolis, USA

ABSTRACT This study examined the feasibility of employing Project Essay Grade (PEG) software to evaluate web-based student essays that serve as placement tests at a large Mid-Western university. The results of two experiments are reported. In the first experiment, the essays of 1293 high school and college students were used to create a statistical model for the PEG software. PEG identified 30 proxes (observed variables) that could be incorporated into an evaluation of the written work. In the second experiment, the ratings from a separate sample of 617 essays were used to compare the ratings of six human judges against those generated by the computer. The inter-judge correlation of the human raters was r = 0.62 and was r = 0.71 for the computer. Finally, the PEG software was an efficient means for grading the essays, with a capacity of approximately three documents graded every second. Cycle time from the web submission of a document to producing a score report was about 2 minutes. Although PEG would appear to be a cost-effective means of grading written work of this type, several cautionary notes are included.
Introduction

Remediation in the basic skills continues to be a problem for colleges and universities, with rates of remediation running as high as 90%, especially in states that lack a comprehensive community college (Chronicle of Higher Education, 1996; National Center for Education Statistics, 1995). Inadequately prepared students can contribute to lower retention rates, higher levels of dissatisfaction, and unfinished student goals. While high school counsellors spend a good deal of their time preaching the benefits of taking challenging coursework, these voices of concern are not always heard.

ISSN 0260-2938 print; ISSN 1469-297X online/01/030247-13 © 2001 Taylor & Francis Ltd. DOI: 10.1080/02602930120052404

Shermis et al. (1997) introduced a programme designed to allow high school students to take a set of university placement tests as a way to gauge their level of preparedness for college-credit courses. Starting in the tenth grade, students can take objective web-based college placement tests in mathematics and reading. A written essay is also available as an assessment for English. The feedback provided suggests whether or not students are on a trajectory to take college-level coursework, and what remedial options the district offers for those who are not. Students can take the tests annually to gain a sense of their rate of progress.

The mathematics test is a computerised adaptive instrument (Hsu & Shermis, 1989; Shermis & Chang, 1997) while the reading assessment is a linear-based test (Shermis et al., 1996). Both instruments are objectively scored and feedback is provided immediately to the student. The writing assessment consists of a short narrative sample of work that requires human intervention to rate and score. Feedback on writing requires the efforts of at least one rater, and typically takes 3–4 days to generate.
In the short run, the project is funded for the assessment of 10,000+ essays produced annually in the high schools, but in the long run some cost-effective alternative was desired.

A technology that holds some promise for the long-term effort is Project Essay Grade (Page, 1994). This computer software was designed to grade prose based on stable statistical models configured specifically for the type of writing to be assessed. The software used here is not 'intelligent' in the sense of evaluating content, but rather emulates the behaviour of raters. "PEG is not aimed so much at AI [Artificial Intelligence] … as at 'IA'—'Intelligent Assistance.' PEG won't replace the English teacher, but will serve as a useful, time-saving check on quality in writing" (Page, 1996, p. 2). Could PEG be employed to provide feedback to the high school students?

Most of the previous studies involving PEG have been directed toward assessing the accuracy of simulating expert human ratings. Page and Petersen (1995) note that human judges are notoriously unstable, with judgements typically correlating at about 0.50–0.60 with one another. In a study which utilised 1314 ETS Praxis writing samples, Page and Petersen (1995) compared the ratings of six independent judges against those of the Project Essay Grade computer programme. A total of 1014 essays were used to set up the statistical model, and 300 essays to form the test of their hypothesis. The ETS judges were well trained and many had extensive experience in performing essay ratings. The average correlation among the six judges was r = 0.65, while the average correlation between the computer program and the judges was r = 0.74.

While the PEG technology was the first to be developed (Page, 1966), there are several other competing programs which purport to evaluate essays using computer technology.
For example, the Intelligent Essay Assessor (www.knowledge-technologies.com; Landauer et al., 1998) uses latent semantic analysis to determine relationships among words in an essay. Vantage Technologies has also developed IntelliMetric™, which is likewise designed to score essays (cf. www.intellimetric.com/). Finally, ETS has developed a natural language processor that serves the same purpose (Burstein & Kaplan, 1995). The main distinction between PEG and the competing systems has to do with the focus of assessment. The competing systems are designed to evaluate content correctness, whereas PEG's focus is directed towards the assessment of general writing ability.

Research Hypotheses

The previous work of Page and Petersen (1995) suggests three research hypotheses:

(1) The computer ratings would surpass the accuracy of the usual two judges. (Accuracy is defined as the average agreement with a larger population of judges.)
(2) The essays would be graded much more rapidly, since not as many human ratings would be required, if any at all.
(3) Machine-readable essays would be graded more economically, saving perhaps up to half or more of the cost of the current procedures.

Method

Participants

Study 1 (forming the model). Participants were 1293 students drawn from a large Mid-Western university and a suburban high school. All entering students at the university are required to take tests of mathematics, reading, and written English essays in order to be placed in appropriate courses. Students from the high school were participating as part of an experimental programme to determine if taking placement tests at the secondary school produces a higher proportion of better prepared college students (Shermis et al., 1997; Shermis, 1997). Table 1 shows the demographic characteristics of both the university and high school samples. These characteristics are representative of individuals enrolled in their respective institutions.

Study 2 (testing the model).
Participants were 617 students drawn from the same large Mid-Western university and suburban high school as before. Table 2 shows the background characteristics of the test sample from both the university and high school samples.

TABLE 1. Demographic characteristics of the sample which formed the statistical model (N = 1293)

                   University    High School
                   (N = 860)     (N = 433)
Variable           %             %
Gender
  Male             44.7          66.7
  Female           55.3          33.3
Ethnicity
  White            80.6          100.0
  Non-white        18.4          0
Class level
  Freshman         94.9          NA
  Sophomore        4.2           NA
  Junior           0.6           NA
  Senior           0.4           NA
                   Mean   SD     Mean   SD
Age                22.5   6.9    NA     NA

Note: NA = Not ascertained

TABLE 2. Demographic characteristics of the sample which formed the statistical test (N = 617)

                   University    High School
                   (N = 860)     (N = 433)
Variable           %             %
Gender
  Male             41.8          66.7
  Female           58.2          33.3
Ethnicity
  White            84.4          100.0
  Non-white        15.6          0.0
Class level
  Freshman         97.8          NA
  Sophomore        0.8           NA
  Junior           0.6           NA
  Senior           0.4           NA
  Graduate         0.4           NA
                   Mean   SD     Mean   SD
Age                20.1   4.9    NA     NA

Note: NA = Not ascertained

Instruments

English placement exam. The English placement exam is a one-hour exam that asks students to write an essay that explains and supports their opinion on a current social issue. Students have a choice of two questions, each providing a brief explanation of the issue for the context in which the test question is posed (Harrington et al., 1998). Students are also asked to evaluate their answer and explain what changes they might make, had they the time to do so. This is considered a 'low stakes' test in that admission to the university is not contingent upon test performance. However, students are generally motivated to do their best work because they want the highest possible course placement.

The human rating system is based on models developed at the University of Pittsburgh (Smith, 1993) and Washington State University (Haswell & Wyche-Smith, 1995). Numeric scores are assigned monotonically to represent a spectrum of pre-college and college writing ability.
Scores of 1–4 result in placement into a pre-basic writing course; 5–11 indicate placement into basic writing; 12–18 indicate placement into first-year composition; and 19–22 indicate placement into honours.

While placement rates may vary from year to year, on the whole 60% of the students taking the test are placed into first-year composition, 35% are placed into basic writing, and roughly 5% are placed into either honours, English as a Second Language (ESL), or other special courses. Most ratings are provided by faculty who teach first-year composition and basic writing; honours placements are made by faculty who teach honours courses. Each of the essays received a minimum of two ratings. Based on pilot work with an earlier set of data that included 178 essays and 11 pairings of raters, the median correlation among the raters was r = 0.65, with a range from r = 0.00 to 0.92.

Research on comparable scoring systems at other institutions suggests that training and shared teaching expertise create acceptable levels of inter-judge agreement (cf. Smith, 1993; White, 1995). The predictive validity of the test has been computed, with correlations in the low 0.20s (Mzumara et al., 1996) using course grades as an outcome variable. This figure represents a low correlation, but is not atypical for placement validity coefficients involving writing placement.

Procedure

How PEG works. In much the same way as one might develop a statistical model with observed and latent variables, the evaluation of writing can be expressed in terms of trins and proxes. Trins are intrinsic variables of interest such as fluency or grammar (Page & Petersen, 1995). Proxes (from approximations) are the observed variables with which the computer works, and are statistically calculated from the various writing samples. Examples of proxes might include the length of the essay or average word length.
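To make the notion of proxes concrete, a few such observable statistics can be computed directly from an essay's text. The sketch below is our illustration only; the function name and the particular proxes chosen are not PEG's actual feature set, which comprised 30 such variables:

```python
def extract_proxes(essay: str) -> dict:
    """Compute a few illustrative proxes (observable text statistics)."""
    words = essay.split()
    # Word lengths with surrounding punctuation stripped
    lengths = [len(w.strip(".,;:!?\"'")) for w in words]
    return {
        "essay_length": len(words),
        "avg_word_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "sentence_count": sum(essay.count(c) for c in ".!?"),
    }

sample = "Writing placement tests measure readiness. Longer essays often score higher."
proxes = extract_proxes(sample)
# proxes["essay_length"] == 10, proxes["avg_word_length"] == 6.5
```

Each prox is cheap to compute yet correlates with rated quality, which is what allows the regression model described below to approximate human judgement without evaluating content.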
The statistical model for evaluating essays is formulated by optimising the regression weights for the proxes so as to best predict rater averages of these trins. The rating generated by the statistical model is, in turn, compared against a new or pull-out sample of average ratings among human judges.

More recent work by Page et al. (1997) studied whether human raters have higher levels of agreement when ratings are provided holistically or when such trins are explicitly identified. In that study, trins were specified at the trait level and included: content, organisation, style, mechanics, and creativity. Eight judges were asked to provide both trait and holistic ratings on 495 essays from the 1988 NAEP. The results showed that the agreement coefficients for holistic ratings among human judges were higher than their corresponding trait agreement ratings. Moreover, for both holistic and trait ratings, PEG had coefficients that were as good as or considerably higher than the ratings between two or more judges.

Study 1. In the study just completed, students entered their essays using a screen (or web form) similar to that shown in Figure 1. Once an essay was completed, the student submitted the text to a database controlled by a web server. Figure 2 illustrates a typical database entry. Six raters drawn from a pool of 15 instructional faculty provided their assessments on-line by reading the essays and scoring them [1]. The database compared the ratings to determine an appropriate course placement. If a large enough discrepancy between ratings existed, then a third rater was scheduled. For the purposes of forming the statistical model, only the first two ratings were used. Essays from the first sample were analysed to form the statistical model as part of Study 1. In this study, the proxes were identified and optimally weighted using the average judges' ratings as the outcome variable.

Study 2.
In our second phase, essays were first sent to the database and rated by the instructors as before. PEG automatically queried the database to determine if new essays were present. If so, it transferred and processed the text, and returned the PEG score to the database. PEG scores are given as whole numbers, but are converted to z-scores.
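The model-then-score pipeline of the two studies can be sketched in miniature: fit regression weights on a training sample of proxes and averaged human ratings (Study 1), then predict scores for new essays and standardise them as z-scores (Study 2). All numbers below are invented for illustration, and a single prox stands in for PEG's 30:

```python
import statistics

# Hypothetical training data: one prox (essay length in words) and the
# average of two human ratings per essay (values invented for illustration).
train_prox = [120, 340, 210, 450, 300, 180]
train_rating = [6.0, 14.5, 9.0, 18.0, 13.0, 8.5]

# Ordinary least squares for a single predictor: rating ~ a + b * prox
xm = statistics.mean(train_prox)
ym = statistics.mean(train_rating)
b = (sum((x - xm) * (y - ym) for x, y in zip(train_prox, train_rating))
     / sum((x - xm) ** 2 for x in train_prox))
a = ym - b * xm

# Score new essays with the fitted weights, then standardise to z-scores.
new_prox = [150, 400, 260]
raw = [a + b * p for p in new_prox]
mu, sigma = statistics.mean(raw), statistics.stdev(raw)
z_scores = [(r - mu) / sigma for r in raw]
```

With many proxes, the single slope `b` becomes a vector of weights fitted by multiple regression, but the structure (fit on a rated sample, predict and standardise for new essays) is the same.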