Abstract: The benefits of giving and receiving peer reviews have been well documented (e.g., Topping, 1998, 2005), but less is known about the format and content of effective peer reviews. If peer reviews are conducted formatively, that is, before students submit their final products, an effective review should help students decide what to keep and what to revise. But what kind of feedback triggers this meta-cognition? This study explored two research questions: 1) What type of peer review feedback (positive, negative or mixed) has the greatest effect on student learning (as measured by student revision to their work); and 2) What type of feedback do students value? Three sections of an undergraduate education course provided anonymous reviews using Blackboard’s peer assessment tool in order to compare three feedback conditions: positive, mixed, and negative. Some evidence indicated students who received negative feedback were slightly more likely to revise, but these findings were not conclusive and suggest other factors, perhaps internal to the student, are more predictive of a student’s decision to revise than the tone or content of the feedback they receive. In contrast, the findings overwhelmingly indicate that students prefer negative feedback over positive, which corresponds to earlier research showing that more experienced students seek out and benefit from negative feedback (Fishbach, Eyal & Finkelstein, 2010).


Abstract: This paper describes the multi-disciplinary collaboration of six faculty members who implemented peer review in order to improve student writing. Each faculty member developed their own assignments and peer review process, but followed the same general guidelines. Students submitted drafts to peers who made comments and used a rubric to provide formative feedback. The instructors used a variety of tools to support peer review including Google Drive, Blackboard, and Expertiza, a dedicated peer-review system. Students reflected on the peer review process in an online survey after each round of peer review. The survey results varied considerably between the classes, suggesting the importance of the instructor, assignment, and peer review process, but there were also common themes across courses. Quantitative results suggest students valued peer review more when they knew who they were reviewing and who reviewed them, and that students found the first round of peer review more helpful than subsequent rounds. Qualitative results differed, showing many students appreciated the anonymity provided by computer-supported peer reviews. Students had little tolerance for technology that was not intuitive or reliable. In general, students appreciated receiving feedback from their peers, especially comments that were specific and pointed out areas for improvement. They were less receptive to being graded by peers, especially when low marks were not explained. Students liked seeing other students’ work as this helped them gauge their performance, but many felt uncomfortable giving critical feedback and expressed concern about their lack of expertise.


Luca de Alfaro and Michael Shavlovsky. Dynamics of Peer Grading: An Empirical Study. UC Santa Cruz Technical Report.

Abstract: Peer grading is widely used in MOOCs and in standard university settings. The quality of grades obtained via peer grading is essential for the educational process. In this work, we study the factors that influence errors in peer grading. We analyze 288 assignments with 25,633 submissions and 113,169 reviews conducted with CrowdGrader, a web-based peer grading tool. First, we found that large grading errors are generally more closely correlated with hard-to-grade submissions than with imprecise students. Second, we detected a weak correlation between review accuracy and student proficiency, as measured by the quality of the student’s own work. Third, we found little correlation between review accuracy and the time it took to perform the review, or how late in the review period the review was performed. Finally, we found clear evidence of tit-for-tat behavior when students give feedback on the reviews they received. We conclude with remarks on how these data can lead to improvements.


de Alfaro, L., Polychronopoulos, V., & Shavlovsky, M. (2015, September). Reliable aggregation of boolean crowdsourced tasks. In Third AAAI Conference on Human Computation and Crowdsourcing.

Abstract: We propose novel algorithms for the problem of crowdsourcing binary labels. Such binary labeling tasks are very common in crowdsourcing platforms, for instance, to judge the appropriateness of web content or to flag vandalism. We propose two unsupervised algorithms: one simple to implement albeit derived heuristically, and one based on iterated Bayesian parameter estimation of user reputation models. We provide mathematical insight into the benefits of the proposed algorithms over existing approaches, and we confirm these insights by showing that both algorithms offer improved performance on many occasions across both synthetic and real-world datasets obtained via Amazon Mechanical Turk.


de Alfaro, L., Polychronopoulos, V., & Shavlovsky, M. (2015). Incentives for Truthful Peer Grading. UC Santa Cruz Technical Report.  

Abstract: Peer grading systems work well only if users have incentives to grade truthfully. One example of non-truthful grading that we observed in classrooms is students assigning the maximum grade to all submissions. With a naive grading scheme, such as averaging the assigned grades, all students would receive the maximum grade. In this paper, we develop three grading schemes that provide incentives for truthful peer grading. In the first scheme, the instructor grades a fraction p of the submissions, and penalizes students whose grade deviates from the instructor grade. We provide lower bounds on p to ensure truthfulness, and conclude that these schemes work only for moderate class sizes, up to a few hundred students. To overcome this limitation, we propose a hierarchical extension of this supervised scheme, and we show that it can handle classes of any size with bounded (and little) instructor work, and is therefore applicable to Massive Open Online Courses (MOOCs). Finally, we propose unsupervised incentive schemes, in which the student incentive is based on statistical properties of the grade distribution, without any grading required by the instructor. We show that the proposed unsupervised schemes provide incentives for truthful grading, at the price of being possibly unfair to individual students.
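
Illustration (not from the paper): the contrast between naive averaging and the first, supervised scheme can be sketched as follows. The abstract does not give the form of the penalty, so the mean absolute deviation used here, the function names, and the sample data are all assumptions made for this sketch.

```python
from statistics import mean


def naive_grade(peer_grades):
    """Naive scheme from the abstract: average the grades peers assigned."""
    return mean(peer_grades)


def reviewer_penalties(reviews, instructor_grades):
    """Mean absolute deviation from the instructor, per reviewer.

    reviews: {reviewer: {submission: grade}}
    instructor_grades: {submission: grade} for the spot-checked subset only.
    The absolute-deviation penalty is an assumption of this sketch.
    """
    penalties = {}
    for reviewer, given in reviews.items():
        gaps = [abs(grade - instructor_grades[sub])
                for sub, grade in given.items() if sub in instructor_grades]
        if gaps:  # reviewers with no spot-checked submission get no entry
            penalties[reviewer] = mean(gaps)
    return penalties


# Hypothetical data: one reviewer gives every submission the maximum grade.
reviews = {
    "inflater": {"s1": 10, "s2": 10},
    "careful": {"s1": 6, "s2": 9},
}
instructor_grades = {"s1": 6}  # the instructor graded only s1
penalties = reviewer_penalties(reviews, instructor_grades)
```

In this toy data the reviewer who always assigns the maximum is penalized on the one spot-checked submission, while averaging alone would never reveal the inflation.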


Edward F. Gehringer and Ferry Pramudianto, “Research in student peer review: a cooperative web-services approach,” Proc. AAAS/NSF Symp. on Envisioning the Future of Undergraduate STEM Education, Washington, DC, April 27–29, 2016. (to appear)

Abstract: Peer review is effective at providing students with formative and summative feedback, as well as improving their metacognitive skills. However, current systems are lacking in (i) giving reliable summative feedback and (ii) improving the quality of often haphazard formative feedback. Our research addresses both of these issues with a common set of web services that can be used by any peer-review system, and new visualizations that identify students’ strengths and weaknesses, and gauge improvement over time.


Yang Song, Zhewei Hu, and Edward F. Gehringer, “Closing the circle: use of students’ responses for peer-assessment rubric improvement,” Proc. ICWL 2015, 14th Int’l. Conference on Web-Based Learning, Guangzhou, China, Nov. 5–8, 2015.

Abstract: Educational peer assessment has proven to be a useful approach for providing students timely feedback and allowing them to help and learn from each other. Reviewers are often expected both to provide formative feedback (textual feedback telling the authors where and how to improve the artifact) and peer grading at the same time. Formative feedback is important for the authors because timely and insightful feedback can help them improve their artifacts, and peer grading is important to the teaching staff, as it provides more input to help determine final grades. In a large class or MOOC when the help from teaching staff is limited, formative feedback from their peers is the best help that the authors may receive. To guarantee the quality of the formative feedback and reliability of peer grading, instructors should keep on improving peer-assessment rubrics. In this study we used students’ feedback from the last 3 years in the Expertiza peer-assessment system to analyze the quality of 15 existing rubrics on 61 assignments. A set of patterns on peer-grading reliability and comment length were found and a set of guidelines are given accordingly.


Yang Song, Zhewei Hu, and Edward F. Gehringer, “Pluggable reputation systems for peer review: a web-service approach,” Frontiers in Education 2015, 45th Annual Conference, El Paso, TX, October 21–24, 2015.  

Abstract: Peer review has long been used in education to provide students more timely feedback and allow them to learn from each other’s work. In large courses and MOOCs, there is also interest in having students determine, or help determine, their classmates’ grades. This requires a way to tell which peer reviewers’ scores are credible. This can be done by comparing scores assigned by different reviewers with each other, and with scores that the instructor would have assigned. For this reason, several reputation systems have been designed; but until now, they have not been compared with each other, so we have no information about which performs best. To make the reputation algorithms pluggable for different peer-review systems, we are carrying out a project to develop a reputation web service. This paper compares two reputation algorithms, each of which has two versions, and reports on our efforts to make them “pluggable,” so they can easily be adopted by different peer-review systems. Toward this end, we have defined a Peer-Review Markup Language (PRML), which is a generic schema for data sharing among different peer-review systems.


Ravi K. Yadav and Edward F. Gehringer, “Metrics for automated review classification: what review data show,” PRASAE 2015, 2nd Workshop on Peer Assessment and Self-Assessment in Education, associated with International Conference on Smart Learning Environments, Sinaia, Romania, Sept. 23–25, 2015, in New Horizons in Web-Based Learning, Springer.

Abstract: Peer review is only effective if reviews are of high quality. In a large class, it is unrealistic for the course staff to evaluate all reviews, so a scalable assessment mechanism is needed. In an automated system, several metrics can be calculated for each review. One of these metrics is volume, which is simply the number of distinct words used in the review. Another is tone, which can be positive (e.g., praise), negative (e.g., disapproval), or neutral. A third is content, which we divide into three subtypes: summative, advisory, and problem detection. These metrics can be used to rate reviews, either singly or in combination. This paper compares the automated metrics for hundreds of reviews from the Expertiza system with scores manually assigned by the course staff. Almost all of the automatic metrics are positively correlated with manually assigned scores, but many of the correlations are weak. Another issue is how the review rubric influences review content. A more detailed rubric draws the reviewer’s attention to more characteristics of an author’s work. But ultimately, the author will benefit most from advisory or problem detection review text. And filling out a long rubric may distract the reviewer from providing textual feedback to the author. The data fail to show clear evidence that this effect occurs.
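
Illustration (not from the paper): the volume metric described above, the number of distinct words in a review, can be sketched in a few lines. The abstract does not say how words are tokenized or whether case matters, so the lowercase, letters-and-apostrophes tokenization below is an assumption, as is the function name.

```python
import re


def review_volume(review_text: str) -> int:
    """Count the distinct words in a review.

    Assumption of this sketch: words are runs of letters or apostrophes,
    compared case-insensitively. The paper's tokenization may differ.
    """
    words = re.findall(r"[a-z']+", review_text.lower())
    return len(set(words))


sample = "Good work. Good structure, but the work needs tests."
volume = review_volume(sample)  # 'good' and 'work' each count once
```

A count like this rewards varied vocabulary rather than sheer length, which is presumably why the metric is defined over distinct words.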


Edward F. Gehringer, Kai Ma, and Van Duong, “What peer-review systems can learn from online rating sites,” PRASAE 2015, 2nd Workshop on Peer Assessment and Self-Assessment in Education, associated with International Conference on Smart Learning Environments, Sinaia, Romania, Sept. 23–25, 2015, in New Horizons in Web-Based Learning, Springer.

Abstract: As their core functionality, peer-review systems present ratings of student work. But online ratings are not a new concept. Sites rating products or services have a long history on the Web, and now boast hundreds of millions of users. These sites have developed mechanisms and procedures to improve the accuracy and helpfulness of their reviews. Peer-review systems have much to learn from their experience. Online review sites permit users to flag reviews they consider inappropriate or inaccurate. Peer-review systems could do the same. Online review systems have automated metrics to decide whether reviews should be posted. It would be good for peer-review systems to post only reviews that pass an automatic quality check. Online rating sites give recognition to their best reviewers by means of levels or badges. Recognition is often dependent on upvotes by other users. Online review sites often let readers see helpfulness ratings or other information on reviewers. Peer-review systems could also allow authors to see ratings of the students who reviewed their work.


Edward F. Gehringer, “Automated and scalable assessment: present and future,” Paper #12821, American Society for Engineering Education Annual Conference, Seattle, WA, June 14–17, 2015.

Abstract: A perennial problem in teaching is securing enough resources to adequately assess student work. In recent years, tight budgets have constrained the dollars available to hire teaching assistants. Concurrent with this trend, the rise of MOOCs has raised assessment challenges to a new scale. In MOOCs, it’s necessary to get feedback to, and assign grades to, thousands of students who don’t bring in any revenue. As MOOCs begin to credential students, accurate assessment will become even more important. These two developments have created an acute need for automated and scalable assessment mechanisms, to assess large numbers of students without a proportionate increase in costs. There are four main approaches to this kind of assessment: autograding, constructed-response analysis, automated essay scoring, and peer review. This paper examines the current status of these approaches, and surveys new research on combinations of these approaches to produce more reliable grading.


Lakshmi Ramachandran and Edward F. Gehringer, “Identifying content patterns in peer reviews using graph-based cohesion,” FLAIRS-28 (Florida Artificial Intelligence Res. Society), May 18–20, 2015, Hollywood, FL.

Abstract: Peer-reviewing allows students to think critically about a subject and also learn from their classmates’ work. Students can learn to write effective reviews if they are provided feedback on the quality of their reviews. A review may contain summative or advisory content, or may identify problems in the author’s work. Reviewers can be helped to improve their feedback by receiving automated content-based feedback on the helpfulness of their reviews. In this paper we propose a cohesion-based technique to identify patterns that are representative of a review’s content type. We evaluate our pattern-based content identification approach on data from two peer-reviewing systems: Expertiza and SWoRD. Our approach achieves an accuracy of 67.07% and an f-measure of 0.67.


Edward F. Gehringer, “A survey of methods for improving review quality,” in Peer-Review, Peer-Assessment, and Self-Assessment in Education, associated with the 13th International Conference on Web-Based Learning, Tallinn, Estonia, August 14–17, 2014, in New Horizons in Web-Based Learning, Springer.

Abstract: For peer review to be successful, students need to submit high-quality reviews of each other’s work. This requires a certain amount of training and guidance by the review system. We consider four methods for improving review quality: calibration, reputation systems, meta-reviewing, and automated meta-reviewing. Calibration is training to help a reviewer match the scores given by the instructor. Reputation systems determine how well each reviewer’s scores track scores assigned by other reviewers. Meta-reviewing means evaluating the quality of a review; this can be done either by a human or by software. Combining these strategies effectively is a topic for future research.
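
Illustration (not from the paper): one simple way to measure how well a reviewer's scores track those of other reviewers is the average gap between each score and the mean of the other reviewers' scores on the same submission. This leave-one-out measure, the function name, and the sample data are assumptions made for this sketch; it is not one of the reputation algorithms the survey covers.

```python
from statistics import mean


def reviewer_deviation(scores):
    """Average gap between each reviewer and their co-reviewers.

    scores: {submission: {reviewer: score}}
    Returns {reviewer: mean absolute difference from the mean of the
    other reviewers' scores on the same submission}. Lower values mean
    the reviewer tracks the others more closely.
    """
    gaps = {}
    for by_reviewer in scores.values():
        if len(by_reviewer) < 2:
            continue  # nothing to compare against
        for reviewer, score in by_reviewer.items():
            others = [s for r, s in by_reviewer.items() if r != reviewer]
            gaps.setdefault(reviewer, []).append(abs(score - mean(others)))
    return {reviewer: mean(g) for reviewer, g in gaps.items()}


# Hypothetical scores for two submissions, three reviewers each.
scores = {
    "s1": {"a": 8, "b": 8, "c": 2},
    "s2": {"a": 6, "b": 7, "c": 10},
}
deviation = reviewer_deviation(scores)
```

Here reviewer "c" ends up with the largest average gap, so a system built on this measure would give that reviewer's scores the least weight.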