The more variable the prevalence of the codes, the lower the overall rate of agreement tends to be. At the .90 level of observer accuracy, there were 33, 32, and 29 perfect matches for the equiprobable, moderately variable, and extremely variable conditions, respectively. A good example of why the magnitude of kappa results is a source of concern comes from an article that compared human visual detection of abnormalities in biological samples with automated detection (12). The results showed only moderate agreement between the human and automated raters (κ = 0.555), yet the same data yielded an excellent percent agreement of 94.2%. The problem with interpreting these two statistics is this: how are researchers supposed to decide whether the raters are reliable or not? Do the results indicate that the vast majority of patients receive accurate laboratory results, and therefore correct medical diagnoses, or not? In the same study, the researchers chose one data collector as the standard and compared the results of five other technicians with that standard. Although the article does not report enough data to calculate a percent agreement, the kappa results were only moderate. How is the laboratory manager supposed to know whether the results represent high-quality readings with little disagreement among trained laboratory technicians, or whether there is a serious problem requiring additional training? Unfortunately, kappa statistics do not provide enough information to make such a decision. In addition, a kappa can have such a wide confidence interval (CI) that it encompasses everything from good to poor agreement. Previous research has, however, shown that several factors influence the value of kappa: observer accuracy, the number of codes, the prevalence of specific codes, observer bias, and observer independence (Bakeman & Quera, 2011).
Interpretations of kappa, including definitions of what counts as a good kappa, should therefore take the circumstances into account. If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpreting any given value problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable, or do their probabilities vary?) and bias (are the marginal probabilities of the two observers similar or different?). Other things being equal, kappas are higher when the codes are equiprobable. Kappas are also higher when the codes are distributed asymmetrically by the two observers. In contrast with the effect of prevalence, the effect of bias is greater when kappa is small than when it is large.[11]:261–262 Another factor is the number of codes: once there are more than 12 codes, the increase in the expected kappa value flattens out, so simple percent agreement could already serve as a measure of the degree of agreement. Likewise, the values of sensitivity-based performance measures reach an asymptote beyond 12 codes. The formula for Cohen's kappa for two raters is κ = (Po − Pe) / (1 − Pe), where Po is the relative agreement observed between the raters and Pe is the hypothetical probability of chance agreement.
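As a minimal sketch of this formula, the snippet below computes Po, Pe, and kappa from a small 2×2 table of counts for two raters; the counts are invented purely for illustration.

```python
# Minimal sketch of the kappa formula for two raters (hypothetical counts).
# Rows = rater 1's codes, columns = rater 2's codes.
table = [[20, 5],
         [10, 15]]

n = sum(sum(row) for row in table)

# Po: observed proportion of agreement (the diagonal cells).
po = sum(table[i][i] for i in range(len(table))) / n

# Pe: chance agreement expected from each rater's marginal proportions.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
pe = sum((r / n) * (c / n) for r, c in zip(row_totals, col_totals))

kappa = (po - pe) / (1 - pe)
print(f"Po = {po:.3f}, Pe = {pe:.3f}, kappa = {kappa:.3f}")  # Po = 0.700, Pe = 0.500, kappa = 0.400
```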
Applying this, we find greater similarity between raters A and B in the second case than in the first. This is because, although the percent agreement is the same, the percent agreement that would occur by chance is significantly higher in the first case (0.54 compared with 0.46). For example, to calculate the percent agreement between the numbers five and three, take five minus three to get a value of two for the numerator. Cohen suggested interpreting the kappa result as follows: values ≤ 0 indicate no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. However, this interpretation allows rather little agreement between raters to be described as "substantial". Expressed as percent agreement, 61% agreement can immediately be seen as problematic: nearly 40% of the data in the dataset would be erroneous. In health research, this could lead to recommendations for changing practice on the basis of faulty evidence. For a clinical laboratory, it would be an extremely serious quality problem if 40% of sample evaluations were incorrect. This is why many texts recommend 80% agreement as the minimum acceptable interrater agreement. Given that kappa values are typically lower than the corresponding percent agreement, some lowering of the standard relative to percent agreement seems logical. However, labeling 0.40 to 0.60 as "moderate" may imply that even the lower value (0.40) represents adequate agreement.
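To make the chance-agreement point concrete, the sketch below uses two invented pairs of marginal distributions (not the tables from the example above): the observed agreement Po is identical in both cases, but Pe differs, so kappa differs as well.

```python
# Hypothetical illustration: same observed agreement, different chance agreement.
def kappa_from(po, marginals_a, marginals_b):
    """Kappa from observed agreement and each rater's marginal proportions."""
    pe = sum(a * b for a, b in zip(marginals_a, marginals_b))
    return (po - pe) / (1 - pe)

po = 0.80  # in both invented cases the raters agree on 80% of items

# Case 1: both raters use the two codes unevenly (70% / 30%) -> Pe = 0.58
k1 = kappa_from(po, [0.7, 0.3], [0.7, 0.3])

# Case 2: both raters use the two codes equally often (50% / 50%) -> Pe = 0.50
k2 = kappa_from(po, [0.5, 0.5], [0.5, 0.5])

print(f"Case 1: kappa = {k1:.3f}")  # ~0.524
print(f"Case 2: kappa = {k2:.3f}")  # 0.600
```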
A more logical interpretation is suggested in Table 3. Given that any agreement less than perfect (1.0) is a measure not only of agreement but also of the inverse, disagreement among the raters, the interpretations in Table 3 can be simplified as follows: any kappa below 0.60 indicates inadequate agreement among the raters, and little confidence should be placed in the study results. Figure 1 illustrates the concept of research data sets consisting of both correct and incorrect data. Kappa values below zero, although unlikely to occur in research data, are an indicator of a serious problem. A negative kappa represents agreement worse than expected, that is, disagreement. Low negative values (0 to −0.10) may generally be interpreted as "no agreement". A large negative kappa represents substantial disagreement among the raters. Data collected under conditions of such disagreement are not meaningful; they resemble random data more than properly collected research data or high-quality clinical laboratory measurements. Such data are unlikely to represent the facts of the situation (whether research or clinical) with any reasonable degree of accuracy. Such an outcome calls for retraining the raters or reworking the instruments.
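The simplified reading described above can be expressed as a small helper function; the thresholds follow the text (negative values as disagreement, anything below 0.60 as inadequate), while the exact wording of the labels is illustrative.

```python
def interpret_kappa(kappa: float) -> str:
    """Simplified reading of a kappa value as described above (a sketch, not Table 3 itself)."""
    if kappa < -0.10:
        return "large disagreement - data unlikely to be meaningful"
    if kappa < 0:
        return "no agreement"
    if kappa < 0.60:
        return "inadequate agreement - place little confidence in the results"
    return "adequate agreement (see Table 3 for finer gradations)"

print(interpret_kappa(0.555))  # the human-vs-automated detection example discussed earlier
```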
In the past, percent agreement (the number of agreement scores divided by the total number of scores) was used to determine interrater reliability. However, chance agreement due to raters guessing is always a possibility, just as a chance "correct" answer is possible on a multiple-choice test. Kappa statistics take this element of chance into account. Both percent agreement and kappa have strengths and limitations. Percent agreement is easy to calculate and can be interpreted directly; its main limitation is that it does not account for the possibility that raters guessed on some scores, so it may overestimate the true agreement among raters. Kappa is designed to account for the possibility of guessing, but the assumptions it makes about rater independence and other factors are not well supported, so it may lower the estimate of agreement excessively. In addition, kappa cannot be interpreted directly, and it has therefore become common for researchers to accept low kappa values in their interrater reliability studies.
A low level of interrater reliability is unacceptable in health care or clinical research, particularly when the results of a study may change clinical practice in a way that leads to poorer patient outcomes. Perhaps the best advice for researchers is to calculate both percent agreement and kappa.
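A minimal sketch of that advice, assuming the two raters' codes are available as parallel lists: the ratings below are invented for illustration, and scikit-learn's cohen_kappa_score is used for the chance-corrected statistic.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by two raters to the same ten samples.
rater_1 = ["abnormal", "normal", "normal", "abnormal", "normal",
           "normal", "normal", "abnormal", "normal", "normal"]
rater_2 = ["abnormal", "normal", "normal", "normal", "normal",
           "normal", "normal", "abnormal", "normal", "abnormal"]

# Percent agreement: the share of samples given the same code by both raters.
percent_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)

# Cohen's kappa: agreement corrected for chance, given each rater's marginal distribution.
kappa = cohen_kappa_score(rater_1, rater_2)

print(f"Percent agreement: {percent_agreement:.0%}")  # 80%
print(f"Cohen's kappa:     {kappa:.3f}")              # ~0.524
```

Reporting the two numbers side by side makes it easier to judge whether a low kappa reflects genuine disagreement between the raters or merely a skewed prevalence of codes.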