Calibrate with Confidence

Frequently, a set of objects has to be evaluated by a panel of assessors, but not every object is assessed by each assessor and different assessors may have different standards or use the marking scale differently, which we term “assessor bias”. In addition to assessor bias, the levels of confidence in each assessment may vary according to the match in expertise between assessor and object and the stringency of assessment.

Examples include the evaluation of:

A problem facing such panels is how to account for different levels of bias and confidence amongst panel members. Here we propose a method to achieve robust calibration of assessors and hence robust scores for outputs. It is based upon the premise that we may infer something about assessors' relative standards by examining discrepancies between their evaluations when their evaluations have objects in common, also taking into account possible differences in declared confidences in their evaluations.

Please feel free to download and test our algorithm, which has been designed to run on both Windows and Unix-based operating systems via Python and Microsoft Excel macro packages. To help us test and improve the algorithm, we would be very grateful if you could leave feedback on your findings.

You can download the latest version of the paper on arXiv which also includes several case studies and supplementary information.

How does the model work?

A common approach to assessment of a range of objects by a panel is to assign to each object the average of the scores assigned by the assessors who evaluate that object. But this ignores the likely possibility that different assessors have different levels of stringency, expertise and bias (Meadows, 2006). Some panels shift the scores for each assessor to make the average of each take a normalised value, but this ignores different levels of confidence amongst assessors and the possibility that the set of objects assigned to one assessor may be of a genuinely different standard from that assigned to another. For an experimental scientist, the issue is obvious: calibration.

One approach is to seek to calibrate the assessors beforehand on a common subset of objects, perhaps disjoint from the set to be evaluated (Paul, 1981). This means that they each evaluate all the objects in the subset and then some form of rescaling is agreed to bring the assessors into line as far as possible. This would not work well, however, in a situation where the range of objects is broader than the expertise of an assessor. It also ignores the possibility of bias, such as due to familiarity effects (Fuchs and Fuchs, 1986). Regardless of how well the assessors are trained, differences between individuals’ assessments of objects remain in such ad hoc approaches (Naes, Brockhoff and Tomic, 2010).

If the expertise of two assessors overlaps on some subject, however, any discrepancy between their evaluations can be used to infer information about their relative standards. Thus if the graph of the set of assessors, formed by linking two whenever they assess a common object, is sufficiently well connected, one can expect to be able to infer a robust calibration of the assessors and hence robust scores for the outputs.
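The connectivity requirement can be checked before any scores are collected. The following is a minimal sketch, assuming the assignment is available as a list of (assessor, object) pairs; the function name `assessors_connected` and the example data are hypothetical and not part of our released code.

```python
from collections import defaultdict

def assessors_connected(pairs):
    """Return True if the assessor graph is connected.

    `pairs` is an iterable of (assessor, object) tuples. Two assessors are
    linked whenever they have assessed at least one object in common.
    """
    by_object = defaultdict(set)
    assessors = set()
    for a, o in pairs:
        by_object[o].add(a)
        assessors.add(a)
    if not assessors:
        return True

    # Adjacency between assessors who share an object.
    neighbours = defaultdict(set)
    for group in by_object.values():
        for a in group:
            neighbours[a] |= group - {a}

    # Depth-first search from an arbitrary assessor.
    start = next(iter(assessors))
    seen = {start}
    stack = [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == assessors

linked = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 3)]
split = [("A", 1), ("B", 1), ("C", 2), ("D", 2)]
print(assessors_connected(linked))  # True
print(assessors_connected(split))   # False
```

In the second assignment the pairs A, B and C, D never share an object, so nothing can be inferred about the standards of one pair relative to the other.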

One approach to achieving such calibration was developed by Fisher (1940), in the context of trials of crop treatments. Fisher’s approach is known as additive incomplete block analysis and a large body of associated literature has since been developed. We are not aware, however, of approaches which allow for the inclusion of assessor-provided confidences in their scores, which we believe is an important extension. Here, such a method, which incorporates confidence, is proposed and tested. We demonstrate that the method can achieve a far greater degree of accuracy with fewer assessors than the simple averaging approach.



The basic model

Let us suppose that each assessor is assigned a subset of the objects to evaluate. Denote the resulting set of (assessor, object) pairs by \(E\). Let us further suppose that the score \(s_{ao}\) that assessor, \(a\), assigns to object, \(o\), is a real number related to a 'true' value \(v_o\) for the object by

\begin{equation} s_{ao}=v_o+b_a+\varepsilon_{ao} \end{equation}

where \(b_a\) can be called the bias of assessor, \(a\), and \(\varepsilon_{ao}\) are independent zero-mean random variables. Such a model forms the basis for additive incomplete block analysis (see e.g. Giesbrecht, 1986). This was also proposed by Thorngate, Dawes and Foddy (2008) but without a method to extract the 'true' scores. Here we will make a significant improvement, namely the incorporation of varying confidences in the scores.

To take into account the varying expertise of the assessors, we propose that in addition to the score, \(s_{ao}\), each assessor is asked to specify a level of confidence for that evaluation. This could be in the form of a rating such as 'high', 'medium', 'low', as requested by some funding agencies, but we propose something more general and akin to experimental science. Confidence should be understood not as a measure of how close the assessor believes their score is to the 'true' value, but as a measure of their uncertainty about their score, e.g. how tightly distributed the scores would be if the assessor provided repeated evaluations for the same item. This can be done by asking them to specify a standard deviation, \(\sigma_{ao}\), for their score. So let us suppose that

\begin{equation} \varepsilon_{ao}=\sigma_{ao}\eta_{ao} \end{equation}

with \(\eta_{ao}\) independent, zero-mean random variables of common variance, \(w\). Here we set \(w=1\). Thus our basic model is

\begin{equation} s_{ao}=v_o+b_a+\sigma_{ao}\eta_{ao} \end{equation}
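To make the basic model concrete, synthetic scores can be generated from it. This is only an illustrative sketch: the panel size, the ranges of the 'true' values and biases, the assignment probability, the menu of standard deviations and the choice of normally distributed \(\eta_{ao}\) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_assessors, n_objects = 4, 6
v_true = rng.uniform(40.0, 80.0, size=n_objects)   # 'true' values v_o
b_true = rng.normal(0.0, 5.0, size=n_assessors)    # assessor biases b_a

# E as a boolean mask: assessed[a, o] is True when (a, o) is in E.
assessed = rng.random((n_assessors, n_objects)) < 0.6
# Declared standard deviations sigma_ao.
sigma = rng.choice([1.0, 2.0, 4.0], size=(n_assessors, n_objects))
# eta_ao: independent, zero mean, unit variance (normal by assumption).
eta = rng.standard_normal((n_assessors, n_objects))

# s_ao = v_o + b_a + sigma_ao * eta_ao, with NaN where no score is returned.
scores = v_true[None, :] + b_true[:, None] + sigma * eta
scores = np.where(assessed, scores, np.nan)
print(np.round(scores, 1))
```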



Solution of the basic model

Given the data \(\{(s_{ao} , \sigma_{ao} ) : (a, o) \in E\}\) for all assigned assessor-object pairs, we wish to extract the 'true' values, \(v_o\), and assessor biases, \(b_a\). The simplest procedure is to minimise the sum of squares

\begin{equation} \sum_{(a,o)\in E}\eta^2_{ao}=\sum_{(a,o)\in E}c_{ao}(s_{ao}-v_o-b_a)^2 \end{equation}

where the confidence level is defined as

\begin{equation} c_{ao}=1/\sigma^2_{ao} \end{equation}

This procedure can be justified if the \(\eta_{ao}\) are assumed to be normally distributed, because then it gives the maximum-likelihood values for \(v_o\) and \(b_a\). It can also be viewed as orthogonal projection of the vector of scores to the subspace of the form \(s_{ao}=v_o+b_a\) in the metric given by \(\sqrt{\sum_{ao}c_{ao}s^2_{ao}}\).
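Expression (4) is straightforward to evaluate for any candidate \(v\) and \(b\). The sketch below assumes scores and standard deviations are held in arrays indexed by (assessor, object), with NaN marking pairs outside \(E\); the function name and the small data set are hypothetical.

```python
import numpy as np

def weighted_sum_of_squares(scores, sigma, v, b):
    """Expression (4): sum over assessed pairs of c_ao (s_ao - v_o - b_a)^2.

    scores, sigma: arrays of shape (n_assessors, n_objects); unassessed
    pairs are marked by NaN in `scores` and their sigma is ignored.
    """
    assessed = ~np.isnan(scores)
    c = np.zeros_like(scores, dtype=float)
    c[assessed] = 1.0 / sigma[assessed] ** 2          # c_ao = 1 / sigma_ao^2
    resid = np.where(assessed, scores - v[None, :] - b[:, None], 0.0)
    return float(np.sum(c * resid ** 2))

scores = np.array([[62.0, 71.0, np.nan],
                   [58.0, np.nan, 50.0]])
sigma = np.array([[2.0, 2.0, 1.0],
                  [1.0, 1.0, 4.0]])
v = np.array([60.0, 69.0, 52.0])

print(weighted_sum_of_squares(scores, sigma, v, np.array([2.0, -2.0])))  # 0.0
print(weighted_sum_of_squares(scores, sigma, v, np.zeros(2)))            # 6.25
```

The first call uses the values and biases from which these noise-free scores were constructed, so every residual vanishes; ignoring the biases in the second call leaves residuals of size 2, weighted by the respective confidences.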

Now expression (4) is minimised with respect to \(v_o\) if and only if

\[ \sum_{a:(a,o)\in E}c_{ao}(s_{ao}-v_o-b_a)=0 \]

and with respect to \(b_a\) if and only if

\[ \sum_{o:(a,o)\in E}c_{ao}(s_{ao}-v_o-b_a)=0 \]

It is notationally convenient to extend the sums to all assessors (respectively objects) by assigning the value \(c_{ao} = 0\) to any assessor-object pair that is not in \(E\) (i.e. for which a score was not returned). Then the above conditions can be written as

\begin{equation} C_o v_o+\sum_ab_ac_{ao}=V_o \end{equation}

\begin{equation} \sum_oc_{ao}v_o+C'_ab_a=B_a \end{equation}

Here, \(V_o=\sum_ac_{ao}s_{ao}\) is the confidence-weighted total score for object \(o\), \(B_a=\sum_oc_{ao}s_{ao}\) is that for assessor \(a\),

\begin{equation} C_o=\sum_ac_{ao} \end{equation}

is the total confidence in the assessment of an object and

\begin{equation} C'_a=\sum_oc_{ao} \end{equation}

is the total confidence expressed by assessor, \(a\).
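With the convention \(c_{ao}=0\) outside \(E\), the four aggregate quantities are row and column sums of two arrays. A small sketch with hypothetical data (two assessors, three objects):

```python
import numpy as np

# Rows are assessors, columns are objects.
scores = np.array([[62.0, 71.0, 0.0],
                   [58.0, 0.0, 50.0]])   # 0.0 is a placeholder where no score exists
c = np.array([[0.25, 0.25, 0.0],
              [1.0, 0.0, 0.0625]])       # c_ao = 0 for pairs not in E

V = (c * scores).sum(axis=0)   # V_o: confidence-weighted total score per object
B = (c * scores).sum(axis=1)   # B_a: confidence-weighted total score per assessor
C_obj = c.sum(axis=0)          # C_o: total confidence in each object
C_ass = c.sum(axis=1)          # C'_a: total confidence expressed by each assessor

print(V)      # [73.5   17.75   3.125]
print(B)      # [33.25  61.125]
print(C_obj)  # [1.25   0.25    0.0625]
print(C_ass)  # [0.5    1.0625]
```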

Equations (6) and (7) form a linear system of equations for the \(v_o\) and \(b_a\). It has an obvious degeneracy, in that one could add a constant \(k\) to all the \(v_o\) and subtract \(k\) from all the \(b_a\) and obtain another solution. We can remove this degeneracy by, for example, imposing the condition

\begin{equation} \sum_ab_a = 0 \end{equation}

This is the simplest possibility and corresponds to a translation that brings the average bias over assessors to zero.

Define a graph, \(\Gamma\), linking assessor, \(a\), to object, \(o\) if and only if \((a, o) \in E\). The question of whether the set of equations (6) and (7) has a unique solution after breaking the degeneracy depends on the connectivity of \(\Gamma\). Define a linear operator, \(L\), by writing equations (6) and (7) as

\begin{equation} L\begin{bmatrix}v\\b\end{bmatrix}=\begin{bmatrix}V\\B\end{bmatrix} \end{equation}

where \(v,b,V\) and \(B\) denote the column vectors formed by the \(v_o,b_a,V_o\) and \(B_a\) respectively. \(L\) has null space of dimension equal to the number of connected components of \(\Gamma\) (this follows from Perron-Frobenius theory, see e.g. Meyer, 2001). Thus if \(\Gamma\) is connected, the null space of \(L\) has dimension one, so corresponds precisely to the null vectors \(v_o = k \ \forall o\), \(b_a = -k \ \forall a\), which we already noticed and dealt with. This condition ensures that if (11) has a solution then there is a unique one satisfying (10).

It remains to check that the right hand side of equation (11) lies in the range of \(L\), thus ensuring that a solution exists. This is true if all null forms of the adjoint operator \(L^\dagger\) send the right hand side to zero. The null space of \(L^\dagger\) has the same dimension as that of \(L\), because \(L\) is square, and an obvious non-zero null form is defined by

\begin{equation} \alpha(v,b)=\sum_ov_o-\sum_ab_a \end{equation}

It follows from the definition of \(V\) and \(B\) that \(\alpha(V,B)=0\).

Thus under the assumption that the assessor-object graph, \(\Gamma\), is connected, equations (6) and (7) have a unique solution \((v, b)\) satisfying equation (10). Note that connectedness of \(\Gamma\) is also necessary for uniqueness: otherwise one could follow an analogous procedure, adding and subtracting constants independently in each connected component of \(\Gamma\), and thereby produce more solutions.
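The statements about the null space and the unique solution can be checked numerically on small examples. The sketch below assembles \(L\) from an array of confidences, reports the dimension of its null space, and solves equations (6) and (7) together with condition (10) by least squares. The function name, the use of `numpy.linalg.lstsq` and the data are choices made for this illustration only.

```python
import numpy as np

def solve_full_system(scores, c):
    """Solve equations (6) and (7) with condition (10) by least squares.

    scores, c: arrays of shape (n_assessors, n_objects), with c_ao = 0
    (and any finite placeholder score) for pairs outside E.
    Returns (v, b, nullity), where nullity is the null-space dimension of L.
    """
    n_a, n_o = c.shape
    L = np.block([[np.diag(c.sum(axis=0)), c.T],
                  [c, np.diag(c.sum(axis=1))]])
    rhs = np.concatenate([(c * scores).sum(axis=0), (c * scores).sum(axis=1)])
    nullity = (n_o + n_a) - np.linalg.matrix_rank(L)

    # Append condition (10), sum_a b_a = 0, as an extra row.
    constraint = np.concatenate([np.zeros(n_o), np.ones(n_a)])
    A = np.vstack([L, constraint])
    y = np.append(rhs, 0.0)
    solution, *_ = np.linalg.lstsq(A, y, rcond=None)
    return solution[:n_o], solution[n_o:], nullity

# Connected example: both assessors score the first object.
scores = np.array([[62.0, 71.0, 0.0],
                   [58.0, 0.0, 50.0]])
c = np.array([[0.25, 0.25, 0.0],
              [1.0, 0.0, 0.0625]])
v, b, nullity = solve_full_system(scores, c)
print(np.round(v, 6), np.round(b, 6), nullity)

# Disconnected example: each assessor scores a different object.
nullity_split = solve_full_system(np.array([[60.0, 0.0], [0.0, 70.0]]), np.eye(2))[2]
print(nullity_split)
```

The scores in the connected example were constructed without noise from \(v=(60,69,52)\) and \(b=(2,-2)\), so the fit should recover those values with a one-dimensional null space; the disconnected example has two components and hence a two-dimensional null space.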

The equations have a special structure, due to the bipartite nature of \(\Gamma\), that is worth exploiting. The first equation can be written as

\begin{equation} v_o=\frac{V_o-\sum_ab_ac_{ao}}{C_o} \end{equation}

This can be substituted into the second equation (7) to obtain

\begin{equation} \sum_o\frac{c_{ao}(V_o-\sum_{a'}b_{a'}c_{a'o})}{C_o}+(\sum_oc_{ao})b_a=B_a \end{equation}

so that

\begin{equation} \sum_{a'} \left( \sum_{o}\frac{c_{ao}c_{a'o}}{C_o} \right) b_{a'}-C'_ab_a=\sum_o\frac{c_{ao}V_o}{C_o}-B_a \end{equation}

The dimension of the system (15) is the number of assessors (rather than the sum of the numbers of assessors and objects). Replacing one of the equations in (15) by equation (10) gives a system with a unique solution that can be solved for \(b\) by any method of numerical linear algebra, e.g. LUP decomposition. Then \(v\) can be recovered from equation (13).
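The reduced procedure can be sketched as follows, assuming every object has been assessed at least once (so that \(C_o>0\)) and that \(\Gamma\) is connected. The function name `calibrate` and the data are hypothetical; `numpy.linalg.solve` is used here as one readily available LU-based solver, and this is not necessarily how our released code is organised.

```python
import numpy as np

def calibrate(scores, c):
    """Solve system (15) with condition (10) for b, then recover v from (13).

    scores, c: arrays of shape (n_assessors, n_objects), with c_ao = 0
    (and any finite placeholder score) for pairs outside E.
    """
    C_obj = c.sum(axis=0)            # C_o, assumed strictly positive
    C_ass = c.sum(axis=1)            # C'_a
    V = (c * scores).sum(axis=0)     # V_o
    B = (c * scores).sum(axis=1)     # B_a

    # Left- and right-hand sides of (15).
    M = (c / C_obj) @ c.T - np.diag(C_ass)
    rhs = (c / C_obj) @ V - B

    # Replace the last equation by condition (10): sum_a b_a = 0.
    M[-1, :] = 1.0
    rhs[-1] = 0.0

    b = np.linalg.solve(M, rhs)
    v = (V - c.T @ b) / C_obj        # equation (13)
    return v, b

scores = np.array([[62.0, 71.0, 0.0],
                   [58.0, 0.0, 50.0]])
c = np.array([[0.25, 0.25, 0.0],
              [1.0, 0.0, 0.0625]])
v, b = calibrate(scores, c)
print(np.round(v, 6))  # [60. 69. 52.]
print(np.round(b, 6))  # [ 2. -2.]
```

In this example the simple averages of the returned scores would be 60, 71 and 50. The second and third objects were each seen by only one assessor, and the calibrated values 69 and 52 correct for the biases inferred from the object the two assessors have in common.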



Robustness

A key question with any black-box solution like the one presented here is how robust the outcome is to mistakes or anomalous judgements. For \(s = (s_{ao})_{(a,o) \in E}\), define the operator \(K\) by

\begin{equation} Ks=\begin{bmatrix}V\\B\end{bmatrix} \end{equation}

as a shorthand for the definitions in (6) and (7), so that the equations can be written as

\begin{equation} L\begin{bmatrix}v\\b\end{bmatrix}=Ks \end{equation}

Thus, if a change \(\delta s\) is made to the scores, we obtain changes \(\delta v\), \(\delta b\) of size

\begin{equation} \left\Vert \begin{array}{c}\delta v\\\delta b\end{array}\right\Vert \leq\left\Vert L^{-1}K\right\Vert \left\Vert \delta s\right\Vert \end{equation}

where \(L^{-1}\) is defined by restricting the domain of \(L\) to (10) and its range to \(\alpha (V,B)=0\), and appropriate norms are chosen. We propose that appropriate choices are

\begin{equation} \left\Vert \delta s\right\Vert _{scores} = \sqrt{\sum_{ao} c_{ao} \delta s^2_{ao}} \end{equation}

and

\begin{equation} \left\Vert (\delta v, \delta b) \right\Vert _{results} = \sqrt{\sum_{o} C_o \delta v^2_o + \sum_a C'_a \delta b^2_a} \end{equation}

With a preferred degeneracy-breaking condition \(\sum_a C'_ab_a=0\) instead of (10) we obtain

\begin{equation} \left\Vert L^{-1}K \right\Vert \leq \frac{\sqrt{2}}{\sqrt{\mu_2}} \end{equation}

where \(\mu_2\) is the second smallest eigenvalue of a certain matrix formed from the confidences. In particular, this gives

\begin{equation} \left\vert \delta v_o \right\vert \leq \frac{1}{\sqrt{C_o}} \frac{\sqrt{2}}{\sqrt{\mu_2}} \sqrt{\sum_{ao}c_{ao} \delta s^2_{ao}} \end{equation}

The task for the designer of \(E\) is to make none of the \(C_o\) much smaller than the others and to make \(\mu_2\) significantly larger than 0. The former is evident (no object should receive significantly less assessment or less expert assessment than the others), but the latter depends on the graph.
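Because the fitted values depend linearly on the scores, the effect of a change \(\delta s\) can be explored directly by solving the system with \(\delta s\) in place of \(s\). The sketch below does this under the weighted degeneracy-breaking condition and reports the ratio of the results norm (20) to the scores norm (19). It does not compute \(\mu_2\); the function names and data are hypothetical.

```python
import numpy as np

def solve_weighted(scores, c):
    """Least-squares fit with the weighted condition sum_a C'_a b_a = 0."""
    n_a, n_o = c.shape
    C_obj, C_ass = c.sum(axis=0), c.sum(axis=1)
    L = np.block([[np.diag(C_obj), c.T], [c, np.diag(C_ass)]])
    rhs = np.concatenate([(c * scores).sum(axis=0), (c * scores).sum(axis=1)])
    A = np.vstack([L, np.concatenate([np.zeros(n_o), C_ass])])
    sol, *_ = np.linalg.lstsq(A, np.append(rhs, 0.0), rcond=None)
    return sol[:n_o], sol[n_o:]

def amplification(c, delta_s):
    """Ratio of the results norm (20) to the scores norm (19) for delta_s."""
    # The solution is linear in the scores, so the response to delta_s
    # alone gives the changes (delta v, delta b).
    dv, db = solve_weighted(delta_s, c)
    size_scores = np.sqrt(np.sum(c * delta_s ** 2))
    size_results = np.sqrt(np.sum(c.sum(axis=0) * dv ** 2)
                           + np.sum(c.sum(axis=1) * db ** 2))
    return size_results / size_scores

c = np.array([[0.25, 0.25, 0.0],
              [1.0, 0.0, 0.0625]])

# One anomalous score: assessor 0 moves object 1 by one unit.
single = np.zeros_like(c)
single[0, 1] = 1.0
print(amplification(c, single))

# A uniform shift of every score moves every v_o by the same amount
# and leaves the biases unchanged, so the ratio is 1.
print(amplification(c, np.ones_like(c)))
```

Trying such perturbations on a proposed assignment \(E\) gives an empirical feel for how far a single anomalous judgement can propagate, complementing the bound above.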

Robustness Measures Implemented

The robustness measure of the results depends on whether confidences or standard deviations have been entered and the degeneracy-breaking condition.

If standard deviations have been entered (for a weighted degeneracy-breaking condition) the robustness of the 'true' values of the examinations, \((\left\vert \delta v_o \right\vert)\), is calculated as:

\[\left\vert \delta v_o \right\vert \leq \sqrt{\frac{2N}{\mu_2 C_o}} \]

where \(N\) is the number of evaluations and \(\mu_2\) is the second smallest eigenvalue of a certain matrix formed from the confidences. We also calculate the confidence-weighted root mean square bound:

\[\frac{1}{\sqrt{\frac{\mu_2 C}{N}}}\]

where \(C=\sum_aC'_a\) is the total confidence.

If confidences are entered instead, then (for a weighted degeneracy-breaking condition) the robustness of the 'true' values of the examinations, \((\left\vert \delta v_o \right\vert)\), is calculated as:

\[\left\vert \delta v_o \right\vert \leq \sqrt{\frac{2Nw}{\mu_2 C_o}}\]

where the posterior variance, \(w\), is given by:

\[w=\frac{R}{N-N_o-N_a-1}\]

Here \(N_o\) and \(N_a\) denote the numbers of objects and assessors respectively, and \(R\) is the residual from the least squares fit \((\bar v, \bar b)\) for \((v,b)\), given by:

\[R=\sum_{ao} c_{ao}(s_{ao}-\bar v_o - \bar b_a)^2\]

The confidence-weighted root mean square bound is given as:

\[ \sqrt{\frac{w}{2C}\sum \mu_j^{-1}} \]

where \(\mu_j\) are the non-zero eigenvalues of a certain matrix formed from the confidences.

If a simple degeneracy-breaking condition is used then the robustness measures are divided by \(\sqrt{2}\). The robustness of the assessor biases, \((\left\vert \delta b_a \right\vert)\), is calculated in a similar manner to that of the objects.



References

  • Meadows, M. Can we predict who will be a reliable marker? (http://www.iaea.info/documents/paper_1162a18fb7.pdf) (2006)
  • Paul, S.R. Bayesian methods for calibration of examiners, British Journal of Mathematical and Statistical Psychology 34, 213-223 (1981)
  • Fuchs, D. and Fuchs, L.S. Test procedure bias: A Meta-Analysis of Examiner Familiarity Effects, Review of Educational Research 56, 243-262 (1986)
  • Naes, T., Brockhoff, P. and Tomic, O. Statistics for Sensory and Consumer Science, Wiley (2010)
  • Fisher, R.A. An Examination of the Different Possible Solutions of a Problem in Incomplete Blocks, Annals of Eugenics 10, 52-75 (1940)
  • Giesbrecht, F.G. Analysis of data from incomplete block designs, Biometrics 42, 437-448 (1986)
  • Thorngate, W., Dawes, R.M. and Foddy, M. Judging merit, Psychology Press (2008)
  • Meyer, C.D. Matrix analysis and applied linear algebra, SIAM (2001)