Understanding Cohen’s Kappa
Every sporting event, whether it is a football match or a water polo contest, tends to come with its fair share of controversy over the referees' decisions. In sports scored by ranking, such as Olympic ice skating, you need several judges to ensure the stability of the score. If only one judge were responsible for the final score, a competitor would risk a bad result simply because that judge was having a bad day, which would be pretty unfair.
However, even with several judges, you are still far from a reliable scoring system: such a system “relies upon human observers maintaining a great degree of consistency between observers. If even one of the judges is erratic in their scoring system, this can jeopardize the entire system and deny a participant their rightful prize¹.”
Therefore, it is important to ensure that the generated scores “meet the accepted criteria defining reliability¹”. “In statistics, inter-rater reliability […] is the degree of agreement among raters. It is a score of how much homogeneity or consensus exists in the ratings given by various judges⁷.” So, reaching a good level of inter-rater reliability is what we need to ensure our ice skaters are fairly judged.
Percentage agreement
There are multiple techniques to measure inter-rater reliability: “a partial list includes percent agreement, Cohen’s kappa (for two raters), the Fleiss kappa (adaptation of Cohen’s kappa for 3 or more raters), the contingency coefficient, the Pearson r and the Spearman Rho, the intra-class correlation coefficient, the concordance correlation coefficient, and Krippendorff’s alpha (useful when there are multiple raters and multiple possible ratings)²”.
Percentage agreement is the traditional and most straightforward technique: “it is calculated by dividing the number of cases in which the raters agreed by the total number of ratings. For instance, if 100 ratings are made and the raters agree 80% of the time, the percent agreement is 80/100 or 0.80. A major disadvantage of simple percent agreement is that a high degree of agreement can be obtained simply by chance; thus, it is difficult to compare percent agreement across different situations when agreement due to chance can vary⁴.” In 1960, Jacob Cohen, a psychologist and statistician, criticized the use of percentage agreement for its inability to account for chance agreement and presented the kappa statistic as a solution to this problem.
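To make the formula concrete, here is a minimal Python sketch of percent agreement; the function name and the sample ratings are made up for illustration.

```python
def percent_agreement(ratings_a, ratings_b):
    """Fraction of cases where the two raters gave the same rating."""
    assert len(ratings_a) == len(ratings_b), "both raters must rate the same cases"
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return agreements / len(ratings_a)

# Hypothetical example: 100 ratings, the raters agree on 80 of them.
rater_1 = ["Yes"] * 80 + ["No"] * 20
rater_2 = ["Yes"] * 100
print(percent_agreement(rater_1, rater_2))  # 0.8
```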
The kappa statistic
Cohen’s kappa is symbolized by the lowercase Greek letter κ and “was originally devised to compare two raters or tests and has since been extended for use with larger numbers of raters. […] In supervised machine learning, one “rater” reflects ground truth (the actual values of each instance to be classified), obtained from labeled data, and the other “rater” is the machine learning classifier used to perform the classification. Ultimately it doesn’t matter which is which to compute the kappa statistic. [The kappa statistic] has two uses: as a test statistic to determine whether two sets of ratings agree more often than would be expected by chance (which is a dichotomous, yes/no decision) and as a measure of the level of agreement⁴”.
Kappa has a range of -1 to 1:
- κ = 1 if all cases are in agreement between the two raters
- κ = 0 if the observed agreement is the same as the chance agreement
- κ can be negative if there is “no effective agreement between the two raters or the agreement is worse than random⁴”. According to Cohen, this is unlikely to occur in practice.
“The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves. In addition, it takes into account random chance (agreement with a random classifier), which generally means it is less misleading than simply using accuracy as a metric⁵.”
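Because kappa is so commonly used to evaluate classifiers, most statistics and ML libraries ship it; for example, scikit-learn exposes it as sklearn.metrics.cohen_kappa_score. A minimal sketch, with made-up labels standing in for ground truth and predictions:

```python
from sklearn.metrics import cohen_kappa_score

# One "rater" is the ground truth, the other is the classifier's predictions.
y_true = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
y_pred = ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"]

print(cohen_kappa_score(y_true, y_pred))  # ≈ 0.5 for these made-up labels
```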
Calculating Kappa
To calculate kappa, we first need to know Po, the observed agreement, and Pe, the expected agreement (a small code sketch of both formulas follows this list):
- Po: the number of cases in agreement divided by the total number of cases⁴. It is similar to the accuracy metric.
- Pe: the number of cases in agreement expected by chance⁴.
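Here is the sketch mentioned above: a small Python function (the name cohen_kappa_2x2 is just illustrative) that computes Po, Pe, and kappa from the four cell counts a, b, c, d of a two-rater Yes/No table, laid out as in the example that follows.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa from the cell counts of a 2x2 agreement table.

    a: both raters said "Yes"        b: A said "Yes", B said "No"
    c: A said "No",  B said "Yes"    d: both raters said "No"
    """
    total = a + b + c + d
    p_o = (a + d) / total                            # observed agreement (Po)
    p_yes = ((a + b) / total) * ((a + c) / total)    # both say "Yes" by chance
    p_no = ((c + d) / total) * ((b + d) / total)     # both say "No" by chance
    p_e = p_yes + p_no                               # expected agreement (Pe)
    return (p_o - p_e) / (1 - p_e)
```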
To illustrate the kappa calculation, let’s take the example from the Cohen’s kappa Wikipedia page: two raters, A and B, each classified the same 50 items as “Yes” or “No”. The four cells of the resulting 2×2 table are:
- a = 20 items where both raters said “Yes”
- b = 5 items where A said “Yes” and B said “No”
- c = 10 items where A said “No” and B said “Yes”
- d = 15 items where both raters said “No”
1. First, calculate Po
- Po is similar to accuracy: Po = (a + d) / (a + b + c + d) = (20 + 15) / 50 = 0.7
2. Then, calculate Pe
Pe is computed in three steps. First, we calculate Pyes, the probability that both raters say “Yes” by chance:
- A said “Yes” 50% of the time (25/50) and B said “Yes” 60% of the time (30/50)
- Pyes = ((a+b) / (a+b+c+d)) * ((a+c) / (a+b+c+d)) = 0.5 * 0.6 = 0.3
Then we calculate Pno, the probability that both raters say “No” by chance:
- Pno = ((c+d) / (a+b+c+d)) * ((b+d) / (a+b+c+d)) = 0.5 * 0.4 = 0.2
Finally, we sum Pyes and Pno. The probability that the two raters agree by chance, on either “Yes” or “No”, is:
- Pe = Pyes + Pno = 0.3 + 0.2 = 0.5
3. Finally, calculate kappa

κ = (Po - Pe) / (1 - Pe) = (0.7 - 0.5) / (1 - 0.5) = 0.4
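As a sanity check on the arithmetic, the sketch below rebuilds the 50 paired ratings from the cell counts (a = 20, b = 5, c = 10, d = 15) and passes them to scikit-learn’s cohen_kappa_score:

```python
from sklearn.metrics import cohen_kappa_score

# Rebuild the 50 paired ratings from the 2x2 cell counts.
rater_a = ["Yes"] * 20 + ["Yes"] * 5 + ["No"] * 10 + ["No"] * 15
rater_b = ["Yes"] * 20 + ["No"] * 5 + ["Yes"] * 10 + ["No"] * 15

print(cohen_kappa_score(rater_a, rater_b))  # ≈ 0.4
```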
Interpretation
Landis and Koch (1977) proposed a simple guideline for interpreting kappa:
- < 0 Poor
- 0–0.20 Slight
- 0.21–0.40 Fair
- 0.41–0.60 Moderate
- 0.61–0.80 Substantial
- 0.81–1.0 Almost perfect
So with κ = 0.4, the agreement is considered “fair”. Another way of thinking about it: the classifier covered 40% of the distance between the expected accuracy (50%) and perfect accuracy (100%). In other words, it performed 40% (κ = 0.4) of 50% (100% - 50%) above random chance, which corresponds to an observed accuracy of 70%.
“If expected accuracy was 80%, that means that the classifier performed 40% (because kappa is 0.4) of 20% (because this is the distance between 80% and 100%) above 80% (because this is a kappa of 0, or random chance), or 88%.⁵”
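Concretely, this is just the kappa formula inverted: since κ = (Po - Pe) / (1 - Pe), the observed accuracy is Po = Pe + κ × (1 - Pe). A quick sketch checking both examples (the 50% case above and the 80% case from the quote):

```python
def observed_accuracy(kappa, expected_accuracy):
    """Invert the kappa formula: Po = Pe + kappa * (1 - Pe)."""
    return expected_accuracy + kappa * (1 - expected_accuracy)

print(observed_accuracy(0.4, 0.5))  # 0.7  (our worked example)
print(observed_accuracy(0.4, 0.8))  # ≈ 0.88 (the quoted example)
```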
Pitfalls
- “The kappa statistic should always be compared with an accompanied confusion matrix if possible to obtain the most accurate interpretation⁵.”
- “Acceptable kappa statistic values vary on the context. For instance, in many inter-rater reliability studies with easily observable behaviors, kappa statistic values below 0.70 might be considered low. However, in studies using machine learning to explore unobservable phenomena like cognitive states such as day dreaming, kappa statistic values above 0.40 might be considered exceptional⁵”.
- Kappa is considered appropriate as a test statistic (to determine that agreement exceeds chance levels); however, its second use, as a measure of the level of agreement, is more controversial. “Kappa’s calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. However, the term is relevant only under the conditions of statistical independence of raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable⁶.”
Sources
1. Interrater Reliability, by Martyn Shuttleworth
2. Interrater reliability: the kappa statistic, by Mary L. McHugh
3. Wikipedia: Cohen’s kappa
4. Statistics in a Nutshell: A Desktop Quick Reference, by Sarah Boslaugh, O’Reilly
5. Stack Exchange: Cohen’s kappa in plain English
6. Kappa Coefficients: A Critical Appraisal, by John Uebersax
7. Wikipedia: Inter-rater reliability