Example: Calibrating Student Exam Scores

In this example we consider the exam results of 15 students (labelled 1-15), which are the objects, \(o\), as evaluated by 8 teachers (labelled A-H), who are the assessors, \(a\). For each exam they assess, a teacher gives a score, \(s\), and a confidence in that score, \(c\), based upon their expertise. There are 45 evaluations in total, shown in the example spreadsheet (Figure 1), which can be downloaded here.

The Calibrate with Confidence algorithm uses the assessor-object scores, \(s_{ao}\), and confidences, \(c_{ao}\), to solve the linear equation:

\[\sum_{a'} \left( \sum_{o}\frac{c_{ao}c_{a'o}}{C_o} \right) b_{a'}-C'_ab_a=\sum_o\frac{c_{ao}V_o}{C_o}-B_a\]

where \(b_a\) is a teacher's bias, and \(C_o\) and \(C'_a\) are the total confidences in the exam result and of the teacher respectively. \(V_o\) and \(B_a\) are vectors representing the 'true' values of the exams and the teachers' biases respectively.
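As a rough illustration of how the linear equation can be solved, the sketch below builds it for a tiny, hypothetical data set (three teachers, three exams; not the spreadsheet data) and recovers the biases and 'true' values. It makes several assumptions that are not stated on this page: scores follow an additive model \(s_{ao} \approx v_o + b_a\); \(V_o=\sum_a c_{ao}s_{ao}\) and \(B_a=\sum_o c_{ao}s_{ao}\) are computed as confidence-weighted score sums; and the 'true' values follow from \(v_o = (V_o-\sum_a c_{ao}b_a)/C_o\). Because the equation alone fixes the biases only up to a common constant, one row is replaced by the weighted condition \(\sum_a C'_a b_a = 0\) used later in this example. The function names `solve_linear` and `calibrate` are illustrative only.

```python
LAMBDA = 1.75
WEIGHT = {"h": LAMBDA ** 2, "m": 1.0, "l": LAMBDA ** -2}

# Hypothetical toy data, NOT the spreadsheet: (teacher, student, score, confidence).
# Scores were generated as value + bias with values {1: 60, 2: 70, 3: 80}
# and biases {A: -5, B: 0, C: +5}, so the solver should recover them.
EVALUATIONS = [
    ("A", 1, 55.0, "h"), ("A", 2, 65.0, "m"),
    ("B", 2, 70.0, "m"), ("B", 3, 80.0, "l"),
    ("C", 3, 85.0, "h"), ("C", 1, 65.0, "m"),
]


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def calibrate(evaluations):
    """Return (values, biases) for a list of (a, o, score, label) tuples."""
    assessors = sorted({a for a, _, _, _ in evaluations})
    objects = sorted({o for _, o, _, _ in evaluations})
    c = {(a, o): WEIGHT[label] for a, o, _, label in evaluations}
    s = {(a, o): score for a, o, score, _ in evaluations}

    # Totals C_o and C'_a, plus the ASSUMED weighted score sums V_o and B_a.
    C_o = {o: sum(c[k] for k in c if k[1] == o) for o in objects}
    C_a = {a: sum(c[k] for k in c if k[0] == a) for a in assessors}
    V_o = {o: sum(c[k] * s[k] for k in c if k[1] == o) for o in objects}
    B_a = {a: sum(c[k] * s[k] for k in c if k[0] == a) for a in assessors}

    def conf(a, o):
        # Confidence is zero where a teacher did not mark an exam.
        return c.get((a, o), 0.0)

    # One row per assessor a of the linear equation in the biases b_a'.
    matrix = [
        [sum(conf(a, o) * conf(a2, o) / C_o[o] for o in objects)
         - (C_a[a] if a == a2 else 0.0) for a2 in assessors]
        for a in assessors
    ]
    rhs = [sum(conf(a, o) * V_o[o] / C_o[o] for o in objects) - B_a[a]
           for a in assessors]

    # The rows sum to zero, so one is redundant: replace the last with the
    # weighted degeneracy-breaking condition sum_a C'_a b_a = 0.
    matrix[-1] = [C_a[a] for a in assessors]
    rhs[-1] = 0.0

    bias = dict(zip(assessors, solve_linear(matrix, rhs)))
    # Back-substitute (assumed relation): v_o = (V_o - sum_a c_ao b_a) / C_o.
    value = {o: (V_o[o] - sum(conf(a, o) * bias[a] for a in assessors)) / C_o[o]
             for o in objects}
    return value, bias


values, biases = calibrate(EVALUATIONS)
print(values)
print(biases)
```

Because the toy scores contain no noise, the recovered values and biases should match the ones used to generate them, up to floating-point rounding. A dense solver from a numerical library would be the more usual choice for real data; plain Gaussian elimination is used here only to keep the sketch dependency-free.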

For more details on the algorithm please visit the Model page or download a version of the latest paper on arXiv which also includes several case studies and supplementary information.

The Problem

15 student exams are evaluated by 8 teachers with a total of 45 evaluations. These evaluations are shown in the example spreadsheet (Figure 1), Columns A:D, which can be downloaded here. The confidences have been set to high (h), medium (m) or low (l) which equate to {\(\lambda^2\), 1, \(\lambda^{-2}\)} where \(\lambda=1.75\). Here we use a weighted degeneracy-breaking condition \( \left(\sum_a C'_a b_a = 0 \right) \).
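To make the confidence levels concrete, the short sketch below maps the labels h, m and l to the weights \(\lambda^2\), 1 and \(\lambda^{-2}\) with \(\lambda=1.75\), and evaluates the left-hand side of the degeneracy-breaking condition. Such a condition is needed because, assuming an additive model \(s_{ao} \approx v_o + b_a\), adding a constant to every 'true' value and subtracting it from every bias leaves all predicted scores unchanged. The helper names and the example inputs are hypothetical.

```python
LAMBDA = 1.75

# Confidence labels mapped to the weights {lambda^2, 1, lambda^-2}.
CONFIDENCE_WEIGHT = {"h": LAMBDA ** 2, "m": 1.0, "l": LAMBDA ** -2}


def total_confidence(labels):
    """Sum the weights of a list of confidence labels ('h', 'm' or 'l')."""
    return sum(CONFIDENCE_WEIGHT[label] for label in labels)


def weighted_bias_sum(bias, assessor_confidence):
    """Left-hand side of the condition sum_a C'_a b_a = 0."""
    return sum(assessor_confidence[a] * bias[a] for a in bias)


# Hypothetical exam marked once with high and once with medium confidence.
print(CONFIDENCE_WEIGHT)
print(total_confidence(["h", "m"]))
```

With these weights a high-confidence score counts roughly three times as much as a medium one, and a low-confidence score roughly a third as much.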

The results are shown in Columns F:M and show the Student labels (Column F), 'True' exam results (Column G), Robustness of exam results (Column H), Total confidence in exam result (Column I), Teacher labels (Column J), Teacher bias (Column K), Robustness of bias (Column L) and Total confidence expressed by each teacher (Column M). The confidence-weighted RMS |dv,db| is displayed in cell "O38".

The results are calculated by solving the linear equation above, which uses the similarities between teachers' assessments to normalise the exam results. We can examine these relationships visually via network graphs (Figures 2 and 3).

Figure 1 - Template Spreadsheet


The Assessor-Object Graph

To examine the relationships between teachers and the students' exam results we can use the assessor-object network graph (Figure 2).

The nodes on the left represent the teachers and the nodes on the right represent the students' exam results. An edge shows that a teacher has evaluated a given exam paper, and its weight indicates how confident the teacher is in that evaluation: a thicker line indicates more confidence. The total confidence in each exam result or teacher is represented by the size of its node, with a larger node indicating greater confidence. Node colour represents the bias of the teachers (scaled red to blue for negative to positive bias) and the 'true' values of the exam results (scaled light to dark green for low to high scores).

All of the teachers have evaluated at least one exam paper and all of the exam papers have been evaluated by at least one teacher. Furthermore, all of the exam papers have at least one high confidence score and all of the teachers have marked at least one exam for which they are highly confident.

We can see from the graph that teacher A is the hardest marker, with the most negative bias (dark red node), whilst teacher C is the easiest marker, with the most positive bias (dark blue node). The overall teacher confidences are similar except for teacher B, who assessed only one exam paper (that of student 1), albeit with high confidence. We can also see that students 2, 3 and 8 achieved the highest results (darkest green nodes), but we can be most confident in the result of student 2, which has the largest total confidence (largest node size).

Figure 2 - Assessor-Object Graph


The Assessor Graph

We must now determine where teachers have evaluations in common and their relative confidences in these evaluations. This is shown graphically in the assessor graph (Figure 3).

As above, the total confidence in each teacher's assessments is represented by the size of the node, with a larger node indicating greater confidence, whilst the node colour represents the bias of the teacher, scaled red to blue for negative to positive bias. The edges represent the connections between teachers due to common evaluations. The weight of an edge indicates the teachers' confidences in their common evaluations as a proportion of the total exam result confidences.

From the graph we can see that there are particularly strong confidences in the markings between teachers A-D, C-G, D-E, D-G and G-H. It is also evident that there are few similarities between teacher B and the other teachers, as teacher B has evaluated only one exam paper in common with teachers A and C.
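The edge weights of the assessor graph can be sketched in a few lines. The version below assumes that the weight between teachers \(a\) and \(a'\) is the coefficient \(\sum_o c_{ao}c_{a'o}/C_o\) from the linear equation, which matches the description of common-evaluation confidence as a proportion of total exam confidence. It also checks whether the graph is connected, since a single degeneracy-breaking condition should only be enough when every teacher is linked to every other through shared exams. The data and the function names are hypothetical, not taken from the spreadsheet.

```python
from itertools import combinations

LAMBDA = 1.75
WEIGHT = {"h": LAMBDA ** 2, "m": 1.0, "l": LAMBDA ** -2}

# Hypothetical toy data, NOT the spreadsheet: (teacher, student, confidence).
EVALUATIONS = [
    ("A", 1, "h"), ("A", 2, "m"),
    ("B", 2, "m"), ("B", 3, "l"),
    ("C", 3, "h"), ("C", 1, "m"),
    ("D", 4, "h"),  # D shares no exam with anyone, so D has no edges
]


def assessor_edges(evaluations):
    """Edge weight for each teacher pair: sum_o c_ao * c_a'o / C_o."""
    by_object = {}
    for a, o, label in evaluations:
        by_object.setdefault(o, {})[a] = WEIGHT[label]
    edges = {}
    for marks in by_object.values():
        total = sum(marks.values())  # C_o, total confidence in this exam
        for a, a2 in combinations(sorted(marks), 2):
            edges[(a, a2)] = edges.get((a, a2), 0.0) + marks[a] * marks[a2] / total
    return edges


def is_connected(assessors, edges):
    """True if every teacher is reachable from every other via shared exams."""
    neighbours = {a: set() for a in assessors}
    for a, a2 in edges:
        neighbours[a].add(a2)
        neighbours[a2].add(a)
    start = next(iter(assessors))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == set(assessors)


edges = assessor_edges(EVALUATIONS)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
print(is_connected({"A", "B", "C", "D"}, edges))
```

In this toy data the A-C edge is the strongest because the two teachers share an exam that one of them marked with high confidence, while teacher D is isolated and so could not be calibrated against the others.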

Figure 3 - Assessor Graph