Repeatability
DOI: 10.1093/med/9780198814726.003.0012
12.1 Introduction
The previous chapter focused on the challenges in obtaining valid information from participants within epidemiological studies in ‘free living’ populations. An equally relevant consideration is the difficulty in obtaining consistency of the results. Several terms are used in describing this phenomenon. They are often used interchangeably, although some authors have attempted to distinguish between them (Box 12.1).
Common sense determines that there may be a variety of reasons for failure to obtain consistent or reproducible results.
◆ First, there may be true subject variation. This is particularly relevant in physical measurements such as blood pressure, heart rate and similar variables that, in normal individuals, vary with time, even over short periods. There is also normal variation in lifestyle habits such as amount of exercise or diet, as well as physical and psychological symptoms.
◆ The greater concern, however, is a lack of consistency in the result obtained in the absence of true subject change. This lack of consistency may be due either to a single observer being inconsistent in the method of measurement or to inconsistencies between two or more observers. If this occurs to any major extent, data obtained from studies may be difficult to interpret because apparent differences between subjects might reflect differences in measurement.
12.2 Reducing subject variation
In any study the aim is to ascertain real individual measures that have biological meaning but exclude those that contribute random error, sometimes referred to as ‘noise’. What constitutes ‘noise’ depends on the aim of the investigation.
Example 12.i
A study aimed to compare blood pressure between adolescents from different ethnic groups. In planning the study, the researchers were aware that several factors can affect blood pressure, including time of day and the time since the last meal and the last exercise. Thus, in addition to standardizing the approach to measurement, the participants were measured within a narrow time window and with limits on recent exercise and food. By contrast, if the researchers wished to examine time of day as a predictor of blood pressure, the participants might still need similar restrictions on the time since the last meal and exercise, but time of day would be a relevant predictor to record.
12.3 Reducing observer variation
As mentioned, individual observers may vary in their results due to variation in their methods of measurement, even when subject-level variation is kept to a minimum. Training to assess this and continuous observation during a study can help. Ensuring that observers follow a rigid protocol, which applies as much to administering a questionnaire as to physical measures, is also recommended. In general, using a single observer reduces variability compared with using multiple observers.
In many studies it may be impossible to use a single observer to obtain all the necessary data and therefore multiple observers are required. However, a single observer may prove to be just as inconsistent over time as multiple observers. In theory, with rigorous training and standardization of approach, it should be possible to measure and minimize inconsistencies. Training may need to be repeated during the data collection period of a study to ensure maintenance of quality. Some illustrative examples are shown next.
Example 12.ii
A study involved the use of a stadiometer to measure height. In pilot studies of a single observer undertaking repeated measurements on the same subjects, some inconsistency was noted. After extra training this inconsistency disappeared.
Example 12.iii
In the same study as in Example 12.ii, a second observer had difficulty in obtaining acceptably close results to the first observer when assessing the same test subject. Again, training removed this inconsistency. As a quality control against ‘drift’, the two observers were checked every three months during the study by duplicate measures on six randomly selected subjects.
There is, however, the potential problem of subject–observer interaction such that even in response to an interview, different subjects respond differently to minor differences between observers: this requires very close attention during training.
Example 12.iv
In an interview-based study that involved a quality of life assessment, a pilot study suggested that certain questions were answered differently depending on the sex of the interviewer. The answer to this was to use interviewers of one gender only.
It is not always necessary to formally test consistency within observers. If between-observer agreement is good, it can be assumed that within-observer agreement is also satisfactory. Clearly, the converse is not the case.
Example 12.v
In a study assessing perceived severity of eczema on a 0–4 scale, the investigator was concerned about inconsistency in assessment by an observer. It was not sensible for the observer to undertake duplicate assessments on the same patients because it would be impossible not to be influenced by the results of the first assessment, given the difficulty in ensuring blindness of assessment. The investigators overcame this problem by undertaking a training session using photographs, which included repeats that showed good between-observer consistency.
12.4 Observer bias and agreement
These are different concepts, but both may explain inconsistencies or lack of reproducibility in results between observers. Bias occurs if one observer systematically scores in a different direction from another observer, whereas poor agreement can also occur when the lack of precision by a single observer is random.
Example 12.vi
In a study that required observers to score X-rays by making measurements of the images based on specific landmarks, it was found that one observer routinely measured from different landmarks on the films. This led to substantial disagreement and a systematic bias between that observer and the others. However, there was also variation due to the inherent difficulty in precisely locating the landmark on some of the films because of the quality of the images.
The possibility of bias is relatively easily assessed from differences in the distribution of measures obtained. Thus, (i) for continuous variables it is relatively simple to compare the results from two observers by using a paired t-test or similar approach, and (ii) for categorical variables the frequencies of those scored in the different categories can be compared with a chi-squared or similar test (see also Section 12.5.1).
The assessment of agreement is discussed next in detail.
12.5 Study designs to measure repeatability
In any study in which multiple observers are to be used, it is necessary to measure their agreement before the start of information gathering. The same principles apply when there is a concern about a lack of consistency within a single observer. The essence of any study to investigate this phenomenon is to obtain multiple observations on the same subjects by the different observers (or replicate measures by the same observer) done sufficiently closely in time to reduce the likelihood of true subject change. An ideal approach is to enlist some subject volunteers who are willing to be assessed in a single session by multiple observers. It is important to take account of any order effect (i.e. where there is a systematic difference in measurement response with increasing number of assessments). One strategy to allow for this is to use the so-called ‘Latin square’ design (Table 12.1). In the example illustrated, five subjects are assessed by five observers in a predetermined order. With this kind of design, it is relatively simple statistically, by using an analysis-of-variance approach, to separate the variation between different observers from that due to order and, of course, that due to the subjects themselves.
A similar approach may be used when testing for reproducibility within an observer. One problem, however, particularly in assessing interview schedules, is that both the subject and the observer may remember the response. In such circumstances the replicate interviews need to be spaced in time, but not to such an extent that the true state of the subject has changed. The particular details of individual studies will determine what an appropriate interval is.
Table 12.1 The ‘Latin square’ design for a study of repeatability: five subjects (1–5) and five observers (A–E) giving the order in which the observers assess the subjects
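| Observer | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 |
|---|---|---|---|---|---|
| A | 1st | 2nd | 3rd | 4th | 5th |
| B | 5th | 1st | 2nd | 3rd | 4th |
| C | 4th | 5th | 1st | 2nd | 3rd |
| D | 3rd | 4th | 5th | 1st | 2nd |
| E | 2nd | 3rd | 4th | 5th | 1st |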
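The assessment order in Table 12.1 follows a simple cyclic pattern, which can be generated for any number of observers. The following is a minimal Python sketch under that assumption; the function names `latin_square_orders` and `is_latin_square` are illustrative and do not come from the text.

```python
from string import ascii_uppercase


def latin_square_orders(n):
    """Return {observer: [position in which subjects 1..n are assessed]}.

    Assumes a cyclic Latin square: observer i (0-based) assesses
    subject j (0-based) at position ((j - i) mod n) + 1.
    """
    return {
        ascii_uppercase[i]: [((j - i) % n) + 1 for j in range(n)]
        for i in range(n)
    }


def is_latin_square(orders):
    """Check that every position occurs once per observer and once per subject."""
    rows = list(orders.values())
    n = len(rows)
    expected = set(range(1, n + 1))
    rows_ok = all(set(row) == expected for row in rows)
    cols_ok = all({row[j] for row in rows} == expected for j in range(n))
    return rows_ok and cols_ok


orders = latin_square_orders(5)
for observer, positions in orders.items():
    print(observer, positions)
# A [1, 2, 3, 4, 5]
# B [5, 1, 2, 3, 4]
# C [4, 5, 1, 2, 3]
# D [3, 4, 5, 1, 2]
# E [2, 3, 4, 5, 1]
```

Each position appears exactly once for every observer and every subject, which is the property that allows observer, order and subject effects to be separated in the analysis.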
12.5.1 Analysis of repeatability
The analytical approach is different for measures that are categorical and those that are continuous.
Categorical measures
For categorical measures, the kappa (κ) statistic is the appropriate measure of agreement. It is a measure of the level of agreement in excess of that which would be expected by chance. It may be calculated for multiple observers and across measures that are dichotomous or have multiple categories of answers. For the purposes of illustration, the simplest example is of two observers measuring a series of subjects who can be either positive or negative for a characteristic. There will be subjects whom both observers agree have the characteristic and subjects whom both agree do not have it. There will be other subjects where there is disagreement.
The kappa statistic is calculated as follows.
| Judgement by Observer B | Observer A: Positive | Observer A: Negative | Total |
|---|---|---|---|
| Positive | a | b | a + b |
| Negative | c | d | c + d |
| Total | a + c | b + d | a + b + c + d = N |
Proportion that A scored positive = $\frac{a+c}{N}$
Proportion that B scored positive = $\frac{a+b}{N}$
Therefore, by chance alone it would be expected that the proportion of subjects that would be scored positive by both observers = $\left[\frac{a+c}{N}\times \frac{a+b}{N}\right]$
Proportion that A scored negative = $\frac{b+d}{N}$
Proportion that B scored negative = $\frac{c+d}{N}$
Therefore, by chance alone it would be expected that the proportion of subjects that would be scored negative by both observers = $\left[\frac{b+d}{N}\times \frac{c+d}{N}\right]$
Therefore, total expected proportion of agreement = $\left[\frac{a+c}{N}\times \frac{a+b}{N}\right]+\left[\frac{b+d}{N}\times \frac{c+d}{N}\right]={P}_{e}$
Maximum proportion of agreement in excess of chance = $1-{P}_{e}$
Total observed proportion of agreement = $\frac{a+d}{N}={P}_{o}$
Therefore, proportion of observed agreement in excess of chance = ${P}_{o}-{P}_{e}$
The observed agreement in excess of chance, expressed as a proportion of the maximum possible agreement in excess of chance (kappa), is:
$\kappa =\frac{{P}_{o}-{P}_{e}}{1-{P}_{e}}$
Example 12.vii
In this example two observers were asked to score X-rays of 120 subjects as to whether they thought a specific abnormality was present. The results were:

| Observer B | Observer A: Positive | Observer A: Negative | Total |
|---|---|---|---|
| Positive | 57 | 13 | 70 |
| Negative | 16 | 34 | 50 |
| Total | 73 | 47 | 120 |
The two observers had the same judgement on 91 subjects (57 + 34).
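Given only the positive rates for each observer (73/120 for Observer A and 70/120 for Observer B), the product of these two proportions gives the proportion of subjects that both observers would be expected to score positive by chance alone. Adding the corresponding proportion expected to be scored negative by both observers gives the total expected agreement, from which kappa follows:
${P}_{e}=\left[\frac{73}{120}\times \frac{70}{120}\right]+\left[\frac{47}{120}\times \frac{50}{120}\right]=0.355+0.163=0.518$
${P}_{o}=\frac{91}{120}=0.758$
$\kappa =\frac{0.758-0.518}{1-0.518}=0.50$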
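The same calculation can be sketched in Python using the cell labels a, b, c and d from the two-by-two layout above. The function name `cohen_kappa_2x2` is illustrative, not taken from the text, and the sketch does not handle the degenerate case in which expected agreement equals 1.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Kappa for two observers and a dichotomous rating.

    a = both positive, d = both negative,
    b and c = the two kinds of disagreement.
    Returns (observed agreement, expected agreement, kappa).
    """
    n = a + b + c + d
    p_o = (a + d) / n
    # Chance agreement on positives plus chance agreement on negatives
    p_e = ((a + c) / n) * ((a + b) / n) + ((b + d) / n) * ((c + d) / n)
    kappa = (p_o - p_e) / (1 - p_e)  # undefined if p_e == 1
    return p_o, p_e, kappa


# Counts from the X-ray example: 57 and 34 agreements, 13 and 16 disagreements
p_o, p_e, kappa = cohen_kappa_2x2(a=57, b=13, c=16, d=34)
print(f"P_o = {p_o:.3f}, P_e = {p_e:.3f}, kappa = {kappa:.3f}")
# P_o = 0.758, P_e = 0.518, kappa = 0.499
```

Although the two observers agree on 76% of subjects, about 52% agreement would be expected by chance, so kappa is only about 0.50.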
The use of kappa is important, as the often-used proportion of total agreement does not allow for the fact that some agreement is due to chance. The interpretation of kappa values is subjective, but as a guide Table 12.2 may be useful.
Table 12.2 Interpretation of kappa
| Value | Strength of agreement |
|---|---|
| <0.20 | Poor |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Good |
| 0.81–1.00 | Very good |
Mathematically, kappa can range from −1 to +1. Values below zero suggest negative agreement (i.e. if one observer scores a subject positive, the other is more likely to score that subject negative than positive). This is not normally of relevance unless circumstances are bizarre. Values close to zero suggest that the level of agreement is close to that expected by chance.
Bias can be assessed by examining the marginal totals. In Example 12.vii, the proportions scored positive by the two observers were similar (70/120 vs. 73/120), excluding any serious systematic bias even though the agreement is only moderate.
Continuous measures
For continuous measures, the simplest initial approach is to determine for each individual subject the absolute level of disagreement between observers. Several measures of agreement can then be obtained. First, calculation of the mean disagreement and the standard deviation around that mean can give an estimate of the range of disagreements. Secondly, calculation of the standard error of the mean disagreement can be used to provide a 95% confidence range for the likely mean disagreement. Finally, the closer the mean disagreement is to zero, the less likely is the presence of systematic bias.
Example 12.viii
Forty subjects had their waist circumference measured by two observers. The following results were obtained.

| Subject number | Circumference (cm), Observer A | Circumference (cm), Observer B | Difference, d (A − B) |
|---|---|---|---|
| 1 | 64.2 | 64.6 | −0.4 |
| 2 | 71.3 | 71.0 | +0.3 |
| 3 | 80.4 | 84.2 | −3.8 |
| … | … | … | … |
| 40 | 66.2 | 65.4 | +0.8 |
| Mean | 70.6 | 71.0 | −0.4 |

The average difference between the two observers (A − B) was −0.4 cm; assume that the standard deviation of the differences is 0.25 cm.
An assessment of agreement can then be calculated:
Limits of agreement
The 95% range of disagreements (Observer A − Observer B) = $-0.4\pm (1.96\times 0.25)=-0.4\pm 0.49$, i.e. approximately −0.9 to +0.1 cm.
Thus, the disagreement between these two observers for 95% of subjects will be between −0.9 and +0.1 cm.
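The limits of agreement, together with the 95% confidence range for the mean disagreement described earlier in this section, can be computed from the summary figures alone. The following Python sketch assumes the values given in the example (mean difference −0.4 cm, standard deviation 0.25 cm, 40 subjects) and uses the normal multiplier 1.96 throughout. With 40 subjects a t-distribution multiplier would give a slightly wider confidence interval. The function name `agreement_summary` is illustrative.

```python
import math


def agreement_summary(mean_diff, sd_diff, n, z=1.96):
    """Return (limits of agreement, 95% CI for the mean difference).

    Limits of agreement: mean_diff +/- z * SD of the differences.
    Confidence interval: mean_diff +/- z * SD / sqrt(n).
    """
    half_range = z * sd_diff
    limits = (mean_diff - half_range, mean_diff + half_range)
    se = sd_diff / math.sqrt(n)  # standard error of the mean difference
    ci = (mean_diff - z * se, mean_diff + z * se)
    return limits, ci


limits, ci = agreement_summary(mean_diff=-0.4, sd_diff=0.25, n=40)
print("95%% limits of agreement: %.2f to %.2f cm" % limits)
print("95%% CI for mean difference: %.2f to %.2f cm" % ci)
# 95% limits of agreement: -0.89 to 0.09 cm
# 95% CI for mean difference: -0.48 to -0.32 cm
```

The limits of agreement describe how far apart the two observers are likely to be on an individual subject, whereas the confidence interval describes the uncertainty in the average difference between them.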
In addition, the data can be plotted graphically, with the mean of the two values on the x axis and the difference between them on the y axis (a ‘Bland–Altman’ plot). The plot displays the range of disagreements and shows whether there is a systematic bias between the two observers. It can also show whether the amount of disagreement is related to the level of the measurement.
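The coordinates for such a plot are straightforward to derive from the paired measurements. The Python sketch below uses only the four subjects whose values are shown in the waist circumference table (subjects 1, 2, 3 and 40), so it illustrates the calculation rather than reproducing the full data set. The function name `bland_altman_points` is illustrative.

```python
def bland_altman_points(pairs):
    """Convert (observer A, observer B) pairs into (mean, difference) points.

    The mean goes on the x axis and the difference (A - B) on the y axis.
    """
    return [((a + b) / 2, a - b) for a, b in pairs]


# Only the rows displayed in the example table; the other 36 are not given
shown_rows = [(64.2, 64.6), (71.3, 71.0), (80.4, 84.2), (66.2, 65.4)]

points = bland_altman_points(shown_rows)
for mean, diff in points:
    print(f"mean = {mean:.2f} cm, difference (A - B) = {diff:+.1f} cm")
```

The resulting points can be passed to any plotting library, with horizontal reference lines drawn at the mean difference and at the two limits of agreement.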
Further reading
Dunn G (1992). Design and analysis of reliability studies. Stat Methods Med Res, 1(2), 123–57.
Landis JR, Koch GG (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–74.
White E, Armstrong BK, Saracci R (2009). Principles of Exposure Measurement in Epidemiology: Collecting, Evaluating and Improving Measures of Disease Risk Factors. Oxford Scholarship Online (Chapter 4: Validity and reliability studies); DOI: 10.1093/acprof:oso/9780198509851.001.0001