Imagine two ophthalmologists measuring intraocular pressure with a tonometer. Each patient then has two measured values, one from each observer. The intraclass correlation coefficient (ICC) provides an estimate of the overall concordance between these measurements. It is somewhat similar to analysis of variance in that it expresses the between-pair variance as a proportion of the total variance of the observations (i.e., the total variability of the 2n observations, which is the sum of the within-pair and between-pair variances). The ICC can take values from 0 to 1, where 0 indicates no agreement and 1 indicates perfect agreement.

Percentage agreement, by contrast, is criticized for its inability to account for chance-expected agreement, the proportion of agreement you would expect from two raters simply by chance. In most applications, there is usually more interest in the magnitude of kappa than in its statistical significance. The following classifications have been proposed to interpret the strength of agreement as a function of Cohen's kappa value (Altman 1999; Landis and Koch 1977). Cohen's kappa can be used for two categorical variables, which can be either two nominal or two ordinal variables; other variants also exist. It is defined as κ = (observed agreement [Po] − expected agreement [Pe]) / (1 − expected agreement [Pe]).

The highly significant Maxwell test statistic above shows that the raters differ significantly in at least one category, and McNemar's generalized statistic shows that the disagreement is not evenly distributed.
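To make the formula concrete, here is a minimal sketch in R, using made-up ratings for illustration (none of these data come from the examples in this article):

    # Hypothetical ratings of 40 cases by two raters (illustrative data only)
    rater1 <- factor(c(rep("present", 22), rep("absent", 18)),
                     levels = c("absent", "present"))
    rater2 <- factor(c(rep("present", 18), rep("absent", 4),
                       rep("present", 3), rep("absent", 15)),
                     levels = c("absent", "present"))
    tab <- table(rater1, rater2)

    po <- sum(diag(tab)) / sum(tab)                       # observed agreement Po
    pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # chance-expected agreement Pe
    kappa <- (po - pe) / (1 - pe)                         # Cohen's kappa
    kappa
    # roughly 0.65 for these made-up counts

The same value can also be obtained from ready-made functions such as kappa2() in the irr package or Kappa() in the vcd package mentioned later.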
Cohen's kappa is a commonly used measure of agreement that corrects for this chance agreement. In other words, it takes into account the possibility that raters may simply guess on some items because of uncertainty. For this reason, many texts recommend 80% agreement as the minimum acceptable inter-rater agreement. Any kappa below 0.60 indicates inadequate agreement among the raters, and little confidence should be placed in the results of the study.

If two instruments or techniques are used to measure the same variable on a continuous scale, Bland-Altman plots can be used to estimate agreement. Such a plot is a scatter plot of the difference between the two measurements (Y axis) against the average of the two measurements (X axis). It thus provides a graphical display of the bias (the mean difference between the two observers or techniques) together with the 95% limits of agreement, which are given by the mean difference ± 1.96 × the standard deviation of the differences.

In the case of two raters, this function gives Cohen's kappa (weighted and unweighted), Scott's Pi and Gwet's AC1 as measures of inter-rater agreement for the categorical ratings of two raters (Fleiss, 1981; Fleiss, 1969; Altman, 1991; Scott, 1955). For three or more raters, it gives extensions of Cohen's kappa method, due to Fleiss and Cuzick (1979) in the case of two possible responses per rater, and to Fleiss, Nee and Landis (1979) in the general case of three or more responses per rater.

Fleiss' kappa was calculated to evaluate the agreement between three doctors in diagnosing psychiatric disorders in 30 patients. There was fair agreement between the three doctors, κ = 0.53, p < 0.0001.
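A sketch of how such a multi-rater analysis might be run in R, assuming the irr package is available; the diagnosis matrix below is randomly generated filler standing in for the actual 30-patient data, which are not reproduced here:

    # install.packages("irr")   # assumed available
    library(irr)

    # Placeholder data: 30 patients (rows) rated by 3 doctors (columns)
    set.seed(123)
    categories <- c("depression", "personality disorder", "schizophrenia",
                    "neurosis", "other")
    diagnoses <- matrix(sample(categories, 30 * 3, replace = TRUE),
                        nrow = 30, ncol = 3,
                        dimnames = list(NULL, c("doctor1", "doctor2", "doctor3")))

    kappam.fleiss(diagnoses)                  # overall Fleiss' kappa and z test
    kappam.fleiss(diagnoses, detail = TRUE)   # adds the category-wise (individual) kappas

The detail option reports one kappa per diagnostic category, which is how the individual kappas quoted next would be obtained.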
The individual kappas for "depression", "personality disorder", "schizophrenia", "neurosis" and "other" were 0.42, 0.59, 0.58, 0.24 and 1.00, respectively.

Consider a situation in which we want to evaluate the agreement between hemoglobin measurements (in g/dL) made with a bedside hemoglobinometer and with a formal photometric laboratory technique in ten people [Table 3]. The Bland-Altman plot for these data shows the difference between the two methods for each person [Figure 1]. The mean difference between the values is 1.07 g/dL (with a standard deviation of 0.36 g/dL) and the 95% limits of agreement are 0.35 to 1.79 g/dL. This implies that a given person's hemoglobin level measured by photometry may be anywhere from 0.35 g/dL to 1.79 g/dL higher than the level measured by the bedside method (this holds for 95% of individuals; in 5% of individuals, the difference may fall outside these limits). This, of course, means that the two techniques cannot be used interchangeably. It is important to note that there is no single criterion for what constitutes acceptable limits of agreement; this is a clinical decision that depends on the variable being measured.
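The bias and 95% limits of agreement for such data can be computed directly in base R; the two vectors below are placeholders standing in for the ten paired measurements of Table 3, which are not reproduced here:

    # Placeholder paired hemoglobin values for 10 people (g/dL); substitute the real data
    lab     <- c(11.2, 12.5,  9.8, 13.1, 10.4, 14.0, 12.2, 11.8, 13.6, 10.9)  # photometric lab
    bedside <- c(10.1, 11.3,  8.9, 12.0,  9.5, 12.8, 11.1, 10.7, 12.4,  9.9)  # bedside device

    diffs <- lab - bedside
    avgs  <- (lab + bedside) / 2

    bias <- mean(diffs)                          # mean difference (bias)
    loa  <- bias + c(-1.96, 1.96) * sd(diffs)    # 95% limits of agreement

    plot(avgs, diffs,
         xlab = "Mean of the two methods (g/dL)",
         ylab = "Difference, lab - bedside (g/dL)",
         main = "Bland-Altman plot")
    abline(h = bias, lty = 1)   # bias
    abline(h = loa,  lty = 2)   # limits of agreement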
The R function Kappa() [vcd package] can be used to calculate both unweighted and weighted kappa. The unweighted version corresponds to the Cohen's kappa discussed in this chapter. Weighted kappa should only be considered for ordinal variables and is described in detail in Chapter @ref(weighted-kappa).
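A brief sketch of that call, assuming the vcd package and a hypothetical 3 x 3 cross-tabulation of two raters' ordinal scores:

    # install.packages("vcd")   # assumed available
    library(vcd)

    # Hypothetical confusion table of two raters' ordinal ratings
    ratings <- matrix(c(20,  5,  1,
                         4, 18,  6,
                         2,  7, 17),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(rater1 = c("mild", "moderate", "severe"),
                                      rater2 = c("mild", "moderate", "severe")))

    res <- Kappa(ratings)   # returns both the unweighted and the weighted estimate
    res
    confint(res)            # confidence intervals for both estimates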