The purpose of this chapter is to explain how the diagnostic and treatment selection process can be made ‘evidence-based’.
The first chapter explained how a diagnostic lead can be used to provide a differential diagnosis; examples of such diagnostic leads and their differential diagnoses are shown throughout this book.
Each page shows how other findings can be used to ‘suggest’ a diagnosis by probably differentiating between diagnoses in a list so that one becomes more probable and others less probable.
A diagnosis can be ‘confirmed’ by the presence of findings that occur by definition only in patients with that diagnosis and not in others. Such a definition is a matter of convention and the term used is also the title to what is imagined or predicted after suitable follow-up with and without treatment or advice.
A diagnosis is the title to a group of predictions about a number of phenomena in the past, present, and future, with or without interventions of various kinds. The purpose of most diagnostic predictions is to help the patient but the predictions arising from a diagnosis may also be of relevance to other parties such as social services, insurance companies, disability pension services, the police, and the courts. Post-mortem diagnoses also provide groups of predictions, none of which benefit the patient of course.
A diagnosis, e.g. ‘Type 2 diabetes mellitus’, means that its recognized treatments, advice or other actions should be considered but these should only be suggested to the patient if other findings also indicate a significant chance of benefit (e.g. by being treated with insulin). The combination of findings that suggests that an action may benefit (called an indication) is sometimes also regarded as the criteria for a sub-diagnosis, e.g. ‘Type 2 diabetes mellitus with severe insulin deficiency’.
If a diagnosis is to be of some practical value to the patient then patients with its ‘sufficient’ diagnostic criteria should have a significant chance of benefit from at least one of the treatments, advice, or other actions suggested by that diagnosis. If some patients are labelled with a diagnosis but do not stand to benefit in any practical way (and may even be exposed to unnecessary harm from the labelling), then this would be an example of ‘over-diagnosis’. Over-diagnosis and under-diagnosis represent the two sides of ‘mis-diagnosis’.
Conversely if a diagnosis is excluded by the absence of its ‘necessary’ criteria then no patients should lose out by not being offered any of its treatments, advice, or other actions. Therefore such ‘necessary’ criteria should not exclude patients with some prospect of benefit. If such patients are excluded, this would represent ‘under-diagnosis’.
In order to make each of the above steps ‘evidence-based’ we have to show some track record for each prediction based on the combination of findings (symptoms, signs, or test results) used to make each prediction. Ideally this track record would be the proportion of times each such prediction was correct.
In other words, we must match the feeling of certainty about each prediction with an observed proportion of times that prediction was correct. This means that we must have a clear understanding of the arithmetic of proportions and probabilities.
Probabilities are abstract representations of mental processes. Thinking about thinking may strike some as being excessively introspective and doing so with mathematics would be too much to bear.
It is simpler than it sounds, however, because the mathematics of probability obeys the same rules as the arithmetic of proportions. For example, a probability of 0.33 that a patient has appendicitis is the degree of certainty that one would experience if there were 300 people admitted to an emergency ward in a month, 100 had appendicitis and all one knew about a person was that he or she was one of the people admitted to that ward.
The correspondence between observed proportions and probability means that we can reason with proportions instead of abstract degrees of belief. For example assume that 60 patients admitted to an emergency ward had the findings of localized right-lower quadrant pain and guarding, and of these 57 had appendicitis. If we came across a patient from this group and were only told that he or she had these findings then the probability of appendicitis would be 57/60 = 0.95.
The mental feeling of certainty of 0.95 would be based upon the external physical ‘evidence’ of an observed proportion of 57/60 = 95%. Thus, a belief of 0.95 would have been ‘given substance’ or ‘substantiated’ by the real tangible population of 57/60. If there were 100 patients with appendicitis in the ward and 57 of these had the three findings, then the converse probability (known as the ‘likelihood’) of one of these patients having the three findings would be 57/100 = 0.57.
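The two directions of the same count can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are ours and the counts are those quoted above.

```python
from fractions import Fraction

# Counts quoted in the text.
with_findings = 60                   # patients with the findings
with_findings_and_appendicitis = 57  # of those, patients who had appendicitis
with_appendicitis = 100              # all patients with appendicitis in the ward

# Probability of appendicitis given the findings.
p_appendicitis_given_findings = Fraction(with_findings_and_appendicitis, with_findings)

# 'Likelihood': proportion of patients with appendicitis who show the findings.
likelihood_findings_given_appendicitis = Fraction(with_findings_and_appendicitis, with_appendicitis)

print(float(p_appendicitis_given_findings))           # 0.95
print(float(likelihood_findings_given_appendicitis))  # 0.57
```

The numerator is the same in both cases; only the denominator (the group we condition on) changes.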
If a new patient was admitted to the ward with the three findings then the number would increase from 60 to 61. If the patient had appendicitis then the new proportion would become 58/61 = 0.951 and if not it would be 57/61 = 0.934, so that the ‘future probability’ would be either 0.934 or 0.951, the pair of values reflecting the uncertainty. If the higher of the two probabilities is called PH and the lower is PL, then this pair of values allows one to calculate the original proportion on which that ‘future probability’ pair of values is based. The original numerator can be calculated as 1/((PH/PL)–1) = 57, the denominator being (1/(PH–PL))–1 = 60. This information in turn would allow confidence intervals and other statistics to be calculated for the ‘future probability’.
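A short sketch shows how the pair of ‘future probabilities’ encodes the original counts. It assumes the numerator is recovered as 1/((PH/PL) − 1) and the denominator as 1/(PH − PL) − 1; the function names are illustrative only.

```python
from fractions import Fraction

def future_probability_pair(numerator, denominator):
    """Return (PL, PH) after one more patient with the findings is admitted."""
    p_low = Fraction(numerator, denominator + 1)       # new patient does not have the diagnosis
    p_high = Fraction(numerator + 1, denominator + 1)  # new patient has the diagnosis
    return p_low, p_high

def original_counts(p_low, p_high):
    """Recover the original numerator and denominator from the pair."""
    numerator = 1 / (p_high / p_low - 1)
    denominator = 1 / (p_high - p_low) - 1
    return numerator, denominator

p_low, p_high = future_probability_pair(57, 60)
print(round(float(p_low), 3), round(float(p_high), 3))  # 0.934 0.951
recovered = original_counts(p_low, p_high)
```

Exact fractions are used so that the recovered counts come out as whole numbers; with rounded values such as 0.934 and 0.951 the result would only be approximate.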
In the remainder of this chapter we will do all arithmetic with proportions instead of probabilities. When we apply the result of such reasoning to an individual patient, we will convert the proportion, e.g. 57/60 = 95% to a probability, e.g. 0.95. Sometimes we will use guesswork based on careful assumptions. By doing this we aim to maintain good judgement in that if we check the different predictions that we make with a probability of 0.95, then in the long run, we should be correct about 95% of the time. We could do this as part of audit. If we found that for all 0.95 probabilities we were only correct 75% of the time for example, then this would be poor judgement and we could investigate where we are getting it wrong and why.
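As a sketch of how such an audit might be tallied, the snippet below compares a stated probability with the observed proportion of correct predictions. The audit log is hypothetical and is chosen only to reproduce the 75% figure mentioned above.

```python
def observed_accuracy(outcomes):
    """Proportion of predictions that turned out to be correct (True)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical audit log: 20 predictions, each made with a stated probability of 0.95.
stated_probability = 0.95
audit_outcomes = [True] * 15 + [False] * 5

accuracy = observed_accuracy(audit_outcomes)
calibration_gap = stated_probability - accuracy
print(accuracy)  # 0.75
```

A gap near zero would suggest good judgement; a large gap, as here, would prompt an investigation of where the predictions are going wrong.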
Proportions and their associated probabilities can be represented pictorially by a Venn diagram, as shown in Fig. 13.1.
The big box is the set of 300 patients studied. The 11 numbers inside the 11 sectors add to 300. The total number with ‘guarding’ is 34+3+3+57+23 = 120. The total number with localized right-lower quadrant (LRLQ) pain is 120+3+57+18+2 = 200. The total number with non-specific abdominal pain is 120+3+3+24 = 150 and the total number with appendicitis is 18+57+23+2 = 100.
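The sector counts can be held in a small table and the totals recomputed. In this sketch the assignment of each number to a sector is our reading of the sums quoted above, and the count of 14 for patients with another diagnosis and neither finding is inferred so that the sectors add to 300.

```python
# (diagnosis, has LRLQ pain, has guarding) -> number of patients in that sector
sectors = {
    ("appendicitis", True, True): 57,
    ("appendicitis", True, False): 18,
    ("appendicitis", False, True): 23,
    ("appendicitis", False, False): 2,
    ("NSAP", True, True): 3,
    ("NSAP", True, False): 120,
    ("NSAP", False, True): 3,
    ("NSAP", False, False): 24,
    ("other", True, False): 2,
    ("other", False, True): 34,
    ("other", False, False): 14,  # inferred, not quoted in the text
}

def total(diagnosis=None, lrlq=None, guarding=None):
    """Add up the sectors that match the given conditions (None means 'any')."""
    return sum(
        count
        for (dx, has_lrlq, has_guarding), count in sectors.items()
        if (diagnosis is None or dx == diagnosis)
        and (lrlq is None or has_lrlq == lrlq)
        and (guarding is None or has_guarding == guarding)
    )

print(total(), total(guarding=True), total(lrlq=True))  # 300 120 200
```

For example, `total("appendicitis", lrlq=True)` gives the 75 patients who have both appendicitis and LRLQ pain.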
It can be seen from the Venn diagram in Fig. 13.1 (and by reading the top arrow from left to right in the ‘P’ map in Fig. 13.2) that of patients with LRLQ pain, a proportion of 75/200 have appendicitis. Reading the P map from right to left, of the patients with appendicitis a proportion of 75/100 had LRLQ pain, and (reading down from right to left) a proportion of 80/100 had guarding. Of the patients with guarding (reading up from left to right), 80/120 had appendicitis. Finally, of patients with LRLQ pain, 60/200 had guarding and of the patients with guarding, 60/120 had LRLQ pain.
If we examine Fig. 13.2, we will see that it displays 6 proportions. If we multiply the 3 proportions in a clockwise direction we get 75/200 x 80/100 x 60/120 = 0.15. If we multiply the 3 inverse proportions in an anti-clockwise direction we get 75/100 x 60/200 x 80/120 = 0.15, so the product is the same in the clockwise and anti-clockwise direction. This is because the numerators 60, 75, and 80 appear once and the denominators 100, 120, and 200 also appear once in both directions. This will be true for any number of proportions or probabilities multiplied together in the same circuit in a clockwise and anti-clockwise direction. We shall call this the ‘inverse probability circuit rule’.
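The circuit rule can be verified numerically. The sketch below uses exact fractions and the counts quoted above; the variable names are illustrative.

```python
from fractions import Fraction

# Set sizes and pairwise overlaps quoted in the text.
lrlq, appendicitis, guarding = 200, 100, 120
lrlq_and_appendicitis, appendicitis_and_guarding, guarding_and_lrlq = 75, 80, 60

clockwise = (
    Fraction(lrlq_and_appendicitis, lrlq)
    * Fraction(appendicitis_and_guarding, appendicitis)
    * Fraction(guarding_and_lrlq, guarding)
)
anticlockwise = (
    Fraction(lrlq_and_appendicitis, appendicitis)
    * Fraction(guarding_and_lrlq, lrlq)
    * Fraction(appendicitis_and_guarding, guarding)
)
print(clockwise, anticlockwise)  # 3/20 3/20
```

Both products reduce to (75 x 80 x 60) / (200 x 100 x 120), which is why they must agree.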
If we examine Fig. 13.3, then the same ‘inverse probability circuit rule’ applies. However, the proportion of the sub-set of those with LRLQ pain who are in the study set is 200/200 and the proportion of the sub-set of those with appendicitis who are in the study set is 100/100. By rearranging this special case and cancelling out, we get Bayes’ rule that the proportion of those with appendicitis in those with LRLQ pain (the ‘converse’ of those with LRLQ pain in appendicitis) is the proportion with appendicitis in the study set, multiplied by the proportion with LRLQ pain in those with appendicitis, divided by the proportion with LRLQ pain in the study set: (100/300 x 75/100) / (200/300) = 75/200.
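As a check on this rearrangement, a minimal sketch of Bayes’ rule using the proportions from the study set of 300 patients (variable names are ours):

```python
from fractions import Fraction

study_size = 300
p_appendicitis = Fraction(100, study_size)        # proportion with appendicitis in the study set
p_lrlq = Fraction(200, study_size)                # proportion with LRLQ pain in the study set
p_lrlq_given_appendicitis = Fraction(75, 100)     # proportion with LRLQ pain in appendicitis

# Bayes' rule: converse proportion = prior x likelihood / frequency of the finding.
p_appendicitis_given_lrlq = p_appendicitis * p_lrlq_given_appendicitis / p_lrlq
print(p_appendicitis_given_lrlq)  # 3/8
```

The result, 3/8, is the same as the directly counted proportion of 75/200.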
If those with appendicitis and LRLQ pain are also a subset of those with LRLQ pain, as in Fig. 13.4, by rearranging and cancelling out again, this gives ‘Aristotle’s syllogism’, that the proportion of all those studied in those with LRLQ pain and appendicitis is:
We can also reason that the ‘converse’ (the proportion with LRLQ pain and appendicitis in all those studied) is:
If a patient complained of acute abdominal pain, this would suggest a number of possible differential diagnoses: a spurious complaint due to some misunderstanding, non-specific abdominal pain (NSAP), appendicitis, cholecystitis, peptic ulceration, pancreatitis, diverticulitis, etc. If the patient had localized right-lower quadrant abdominal tenderness and this only occurred in appendicitis and NSAP but never by definition in any of the other conditions, then the latter are ‘eliminated’. If the patient had ‘guarding’ that occurs often in appendicitis and never by definition in NSAP, then the latter is eliminated so that the only remaining possibility is appendicitis; the diagnosis is then certain.
In practice, the ‘eliminating’ findings in the above example do not form part of the definitions of diseases, so the reasoning has to be qualified by saying that each of the diagnoses is ‘probably’ eliminated and that the diagnosis is ‘probably’ appendicitis. This also applies when the reasoning process is applied by Sherlock Holmes! The same reasoning is used when trying to predict a patient’s response to some treatment or when trying to explain the results of a scientific study.
If a patient with Type 2 diabetes mellitus was found to have a 24h urinary albumin excretion rate (AER) of over 20 micrograms/min, then this could be due to a spurious chance result (that could be ‘probably’ eliminated if a repeat measurement was also high). It could also be due to recent excessive exercise, prolonged standing, a urinary infection, fever, heart failure, high blood pressure, chronic nephritis, any recent severe illness, or a chronic glomerular albumin leak due to diabetes mellitus (in which case the patient would benefit from an angiotensin converting enzyme inhibiting or an angiotensin receptor blocking drug). In order to show that the latter is as probable as possible, we have to show that the other possibilities are as improbable as possible by a process of ‘probable elimination’. This process of identifying patients responsive to treatments is now called ‘stratified’ or ‘personalized’ medicine.
If we read in a paper describing a scientific study on patients with type 2 diabetes mellitus and treated hypertension that only 1 out of 77 patients had an AER between 20 and 40 micrograms/min, then this could mean that if we went on to study 77 patients in our own clinic, then we might be able to replicate the result by observing 3/77 or less again (or fail to do so by getting a result greater than 3/77). This might occur due to chance or because our patients are very different to those in the paper, or because the methods section of the paper was inaccurate so that we could not repeat the study properly, or because there were other studies with results of greater than 3/77 that we had not taken into account. In order that the probability of replication is high, we have to show that the probability of non-replication for all other reasons (including chance) is low.
Reasoning by probable elimination uses the idea that if a single finding occurs infrequently in those with a diagnosis (or some other phenomenon) then a combination of findings that includes that single finding will occur even less frequently. If a finding never occurs in a diagnosis, then a combination of findings that includes that finding will never occur in the diagnosis either.
It is important therefore to understand clearly the reasoning process of probable elimination. It is easier to illustrate it with an example from differential diagnosis, e.g. of RLQ abdominal pain.
If a patient has LRLQ pain then this is a good short lead, the differential diagnosis being probably appendicitis or non-specific abdominal pain (NSAP). If there is guarding, then as this is likely to occur in appendicitis but less likely in NSAP, the diagnosis is probably appendicitis.
This can be stated in terms of proportions too. Most patients with LRLQ pain have appendicitis or NSAP. Guarding occurs commonly in appendicitis but less commonly in NSAP, so in those with LRLQ pain and guarding the diagnosis is usually appendicitis. We have thus ‘stratified’ patients with LRLQ pain to improve a prediction. There is intense research interest in such ‘stratified’ or ‘personalized’ medicine at present in order to improve response to treatments.
We can also repeat this reasoning process in more detail using the data from Fig. 13.1. Most patients arriving in a surgical admission unit with LRLQ abdominal pain would have appendicitis (e.g. 75/200) or NSAP (e.g. 123/200) with a few (e.g. 2/200) having something else. If the proportion with LRLQ pain in the studied population was 200/300, then the proportion of the studied population with ‘something else’ as well as LRLQ pain and guarding can be no more than 2/200 x 200/300 = 2/300.
If the patient had ‘guarding’ and this only occurred in 6/150 patients with NSAP, then the combination of ‘guarding’ with LRLQ pain can occur in no more than 6/150 of those with NSAP. If the proportion with NSAP in the surgical admission unit was 150/300, then the proportion with NSAP and guarding and LRLQ pain and anything else could be no more than 6/150 x 150/300 = 6/300 (the lowest frequency).
If ‘guarding’ occurred in 80/100 of those with appendicitis and LRLQ pain occurred in 75/100, then both must occur together in at least 80/100 + 75/100 – 100/100 = 55/100 even if they occur together as infrequently as possible. If the proportion with appendicitis in the study population was 100/300, then the proportion with appendicitis, guarding, and LRLQ pain would be at least 100/300×55/100 = 55/300.
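The two bounds used in this reasoning can be written as small helper functions: a combination of findings can occur no more often than its rarest member, and no less often than the sum of the individual frequencies minus (number of findings − 1). The function names below are illustrative.

```python
from fractions import Fraction

def max_joint_frequency(*frequencies):
    """Upper bound: a combination cannot be more frequent than its rarest finding."""
    return min(frequencies)

def min_joint_frequency(*frequencies):
    """Lower bound: the overlap that remains even if the findings coincide as rarely as possible."""
    return max(Fraction(0), sum(frequencies) - (len(frequencies) - 1))

# Appendicitis: guarding in 80/100 and LRLQ pain in 75/100.
least_overlap = min_joint_frequency(Fraction(80, 100), Fraction(75, 100))
print(least_overlap * Fraction(100, 300))  # 11/60, i.e. 55/300

# NSAP: guarding in 6/150 and LRLQ pain in 123/150.
most_overlap = max_joint_frequency(Fraction(6, 150), Fraction(123, 150))
print(most_overlap * Fraction(150, 300))   # 1/50, i.e. 6/300
```

The lower bound is used for the diagnosis to be confirmed and the upper bound for each rival, so that the final estimate errs on the cautious side.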
If the proportion with LRLQ pain and guarding and appendicitis in the study population was exactly 55/300, the proportion with LRLQ pain and guarding and NSAP was exactly 6/300, and the proportion with LRLQ pain and guarding and ‘something else’ was exactly 2/300, the proportion of patients with LRLQ pain and guarding who had appendicitis would be: 55/300 / (55/300 + 6/300 + 2/300) = 55/63 = 0.87. But as 55/300 was a minimum value and as 6/300 and 2/300 were maximum values, then at least 87.3% must have appendicitis. (The actual proportion in this example data set is 57/60 = 0.95 = 95% as shown in Fig. 13.1). This reasoning, using the lowest available ‘eliminating’ frequencies always, will produce a proportion of at least:(A)
The corresponding probability would be at least:(B)
The formal proof is given in Proof for the reasoning by probable elimination theorem, p.[link].
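The worked example can be reproduced with a few lines of arithmetic. This sketch simply combines the minimum proportion for appendicitis with the maximum proportions for its rivals, as described in the text; the variable names are ours.

```python
from fractions import Fraction

study_size = 300
min_appendicitis = Fraction(55, study_size)  # at least this proportion has appendicitis with both findings
max_nsap = Fraction(6, study_size)           # at most this proportion has NSAP with both findings
max_other = Fraction(2, study_size)          # at most this proportion has something else with both findings

lower_bound = min_appendicitis / (min_appendicitis + max_nsap + max_other)
actual = Fraction(57, 60)                    # proportion observed in Fig. 13.1

print(lower_bound, round(float(lower_bound), 3))  # 55/63 0.873
```

The observed proportion of 0.95 lies above the bound of 0.873, as it must.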
It is not always possible to count all proportions. For example, we can count the proportion of men and women who are taller than 1.5m but it would be difficult to count the proportions that are exactly 1.50000m as such people would be very rare or might not even exist. However, we can estimate the proportion by plotting the continuous distribution of heights in men and women and using this distribution to estimate the proportion at 1.50000m. The proportion would be very low of course and if we used it in the calculations for reasoning by elimination (see A worked example, p.[link]) then we would only get a result of ≥ 0. However, we would have known this already.
We can use an estimate during reasoning by elimination by assuming that a ratio of 4% (e.g. the proportion with guarding in NSAP) to 80% (e.g. the proportion with guarding in appendicitis) has the same effect on the probability of a diagnosis as if the ratio was 0.0004% to 0.008% or if it was a ratio of (0.04/0.8 = 0.05) to 1. In the same way, a ratio of 0.01 to 0.75 could be assumed to have the same effect on the probability of a diagnosis as a ratio of (0.01/0.75 = 0.013) to 1. If we make these assumptions about ‘a ratio to 1’, then the expression:(B)
can be replaced by the expression:(C)
By rearranging (C) we get (see Proof for the reasoning by probable elimination theorem, p.[link], for the formal proof):(D)
Expression (D) gives an estimate of 0.908, which is closer to the actual value of 0.95 shown in Fig. 13.1.
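The estimate of 0.908 can be reproduced under one assumption about the form of expression (D): that it is 1 divided by (1 plus one term per rival diagnosis), each term being the rival’s frequency in the study population times the frequency of the eliminating finding in the rival, divided by the same product for the diagnosis to be confirmed. The frequency of 2/50 for LRLQ pain in ‘something else’ is inferred from the counts in Fig. 13.1.

```python
from fractions import Fraction

# Frequencies of each diagnosis in the study population of 300.
p_appendicitis, p_nsap, p_other = Fraction(100, 300), Fraction(150, 300), Fraction(50, 300)

# Frequencies of the eliminating findings within each diagnosis.
guarding_in_nsap = Fraction(6, 150)          # 0.04
guarding_in_appendicitis = Fraction(80, 100)
lrlq_in_other = Fraction(2, 50)              # inferred from the Venn diagram counts
lrlq_in_appendicitis = Fraction(75, 100)

# One term per rival diagnosis (assumed form of expression (D)).
nsap_term = (p_nsap * guarding_in_nsap) / (p_appendicitis * guarding_in_appendicitis)
other_term = (p_other * lrlq_in_other) / (p_appendicitis * lrlq_in_appendicitis)

estimate_d = 1 / (1 + nsap_term + other_term)
print(round(float(estimate_d), 3))  # 0.908
```

Only the ratio of the two likelihoods matters in each term, which is why 4% to 80% has the same effect as 0.05 to 1.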
From Bayes’ rule in Fig. 13.3:
Thus by replacing [0.5x0.04] by [0.4x0.05] and replacing [0.33x0.8] by [0.4x0.67] in expression (D) we get:
The 0.4s cancel each other out to an expression with no ‘likelihoods’:(E)
The conventional wisdom is that it is the ‘likelihood ratio’ that is the best ‘evidence’ of the performance of a finding in diagnosis or other predictions. However, there are two types of ‘likelihood ratio’.
Type 1. The first is the ‘overall’ likelihood ratio, which is the frequency of a finding in those with a diagnosis divided by the frequency of the same finding in those without the diagnosis (also called the ‘sensitivity’ of a test divided by its ‘false positive rate’, i.e. 1 minus the specificity).
Type 2. The second type is the ‘differential’ likelihood ratio, which is the frequency of the finding in those with a diagnosis divided by the frequency of the same finding in a rival differential diagnostic possibility. This involves one ‘sensitivity’ being divided by another ‘sensitivity’.
It is the second type, the differential likelihood ratio, that allows a finding to be assessed for use in diagnosis by probable elimination. This ratio needs to be as high as possible; if it is infinity, then it shows that the rival diagnosis is impossible. When a likelihood ratio is used in calculations, it is the rival diagnosis’s sensitivity that is put on top, so that the likelihood ratio of infinity becomes a likelihood ratio of zero.
This means that the evidence for a finding’s role in reasoning by probable elimination is the magnitude of its differential likelihood ratio: the lower the better. The finding with the lowest ratio available should be used in the eliminating process (a ratio of zero, if available). The differential likelihood ratio will only be zero, however, if it forms a part of the definition of a diagnosis, e.g. if it has been deemed a ‘necessary’ condition of that diagnosis. For example, when considering the differential diagnosis of proteinuria, if there is blood in the urine, then that patient cannot have diabetic microalbuminuria as its definition specifies that all such patients must not have blood in the urine. Provided that the diagnosis being considered (e.g. glomerulonephritis) sometimes has blood in the urine, the likelihood ratio will be zero.
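The two types of ratio can be contrasted with the ‘guarding’ data from Fig. 13.1 (guarding in 80/100 with appendicitis, in 40/200 of those without appendicitis, and in 6/150 with NSAP). The function names in this sketch are illustrative only.

```python
from fractions import Fraction

def overall_likelihood_ratio(freq_in_diagnosis, freq_in_all_without_diagnosis):
    """Type 1: sensitivity divided by the false positive rate."""
    return freq_in_diagnosis / freq_in_all_without_diagnosis

def differential_likelihood_ratio(freq_in_rival, freq_in_diagnosis):
    """Type 2, with the rival on top: lower is better, zero eliminates the rival."""
    return freq_in_rival / freq_in_diagnosis

guarding_in_appendicitis = Fraction(80, 100)
guarding_without_appendicitis = Fraction(40, 200)
guarding_in_nsap = Fraction(6, 150)

overall = overall_likelihood_ratio(guarding_in_appendicitis, guarding_without_appendicitis)
differential = differential_likelihood_ratio(guarding_in_nsap, guarding_in_appendicitis)
print(overall, differential)  # 4 1/20
```

The overall ratio compares appendicitis with everyone else, whereas the differential ratio compares it with one named rival, here NSAP.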
A finding (F1) that provides the lowest differential likelihood ratio for a pair of diagnoses (D1 and D2) will also provide the best differential probability ratio (or differential odds) for that pair of diagnoses because according to Bayes’ rule:
Before a differential likelihood ratio can be used, there must be a finding that suggests a list of differential diagnoses so that the probability of each diagnosis adds to as near to 1 as possible (i.e. so that the probability of something not in the list is very low). This is the second type of evidence of a finding’s performance in reasoning by probable elimination—its ability to act as a good ‘lead’.
If a finding is a good lead then the probability of each differential diagnosis in its list will be higher than the probability of the diagnoses not in the list. These ratios are the differential probability ratios (or the differential odds). A finding that acts as a good lead will, therefore, be a finding that has a low likelihood ratio between each diagnosis in the list and all other diagnoses not in that list.
Assume that localized LRLQ tenderness (as opposed to LRLQ pain) occurs in 50/100 patients with appendicitis, in 100/150 patients with ‘non-specific abdominal pain’ (NSAP), and in 0/50 patients (i.e. never) with ‘all other diagnoses’.
Assume also that a raised neutrophil count (RNC) occurs in 25/100 patients with appendicitis, in 0/150 patients (i.e. never) with NSAP, and in 50/50 patients with ‘all other diagnoses’.
This means that LRLQ tenderness AND a RNC can only occur together in appendicitis, which makes the diagnosis of appendicitis certain. Note that the list of 2 diagnoses linked to LRLQ tenderness is different to the list of 2 diagnoses linked to RNC, so their lists are ‘independent’. This is analogous to solvable simultaneous equations being ‘independent’. So, each list ‘complements’ the other by eliminating different diagnoses.
However, the overall likelihood ratio of LRLQ tenderness regarding appendicitis is 50/100 divided by 100/200 = 1 suggesting that it is a useless test! Furthermore, the overall likelihood ratio of a RNC regarding appendicitis is 25/100 divided by 50/200 = 1, also suggesting that it is a useless test!
If we assume statistical independence between the occurrence of both these findings in those with and without appendicitis, then their combined likelihood ratio is 1 x 1 =1 and by Bayes’ rule the probability of appendicitis, given both findings, will be the same as its frequency in the study population of 100/300 = 0.33. But this is wrong as we know the diagnosis is certain. This means that using the overall likelihood ratio with Bayes’ rule gives us a false result because it fails to model ‘complementing’ differentiation.
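This misleading result can be reproduced in a few lines. The sketch below applies the odds form of Bayes’ rule with the two overall likelihood ratios, assuming statistical independence as described; the dictionary layout and names are ours.

```python
from fractions import Fraction

group_sizes = {"appendicitis": 100, "NSAP": 150, "other": 50}
# Number of patients in each group who have the finding.
tenderness = {"appendicitis": 50, "NSAP": 100, "other": 0}
raised_neutrophils = {"appendicitis": 25, "NSAP": 0, "other": 50}

def overall_lr(counts, diagnosis="appendicitis"):
    """Frequency of the finding in the diagnosis divided by its frequency in everyone else."""
    rivals = [d for d in group_sizes if d != diagnosis]
    with_diagnosis = Fraction(counts[diagnosis], group_sizes[diagnosis])
    without_diagnosis = Fraction(
        sum(counts[d] for d in rivals), sum(group_sizes[d] for d in rivals)
    )
    return with_diagnosis / without_diagnosis

prior = Fraction(100, 300)
prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * overall_lr(tenderness) * overall_lr(raised_neutrophils)
posterior = posterior_odds / (1 + posterior_odds)
print(posterior)  # 1/3
```

Both overall ratios equal 1, so the posterior stays at the prior of 1/3 even though the two findings together occur only in appendicitis.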
The differential likelihood ratio for LRLQ tenderness between NSAP and appendicitis is 100/150 divided by 50/100 = 1.33 and between appendicitis and ‘all other diagnoses’ it is 0/50 divided by 50/100 = 0. If LRLQ tenderness is used as a lead, then the probability of appendicitis is 50/150 = 0.33, the probability of NSAP is 100/150 = 0.67, and the probability of ‘all other diagnoses’ is 0/150 = 0.
The differential likelihood ratio for a RNC between NSAP and appendicitis is 0/150 divided by 25/100 = 0 and between appendicitis and ‘all other diagnoses’ it is 50/50 divided by 25/100 = 4.
When the lowest differential likelihood ratios or odds are used in the expression (D) to calculate the probability of a diagnosis by elimination and thus using ‘complementing’ differentiation, the probability of appendicitis is:
which by inserting the numbers is:
The use of differential likelihood ratios with the reasoning by probable elimination theorem to model diagnostic thinking (instead of Bayes’ rule) gives the correct result. This is because unlike Bayes’ rule with ‘overall likelihood ratios’, it makes use of ‘complementing’ differentiation.
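A sketch of the same example using differential likelihood ratios follows. It assumes that expression (D) has the form 1 / (1 + sum over rivals of [rival frequency / diagnosis frequency] x [lowest differential likelihood ratio for that rival]); the names and data layout are illustrative.

```python
from fractions import Fraction

group_sizes = {"appendicitis": 100, "NSAP": 150, "other": 50}
findings = {
    "LRLQ tenderness": {"appendicitis": 50, "NSAP": 100, "other": 0},
    "raised neutrophil count": {"appendicitis": 25, "NSAP": 0, "other": 50},
}

def lowest_differential_lr(rival, diagnosis="appendicitis"):
    """Smallest ratio (rival on top) offered by any of the patient's findings."""
    return min(
        Fraction(counts[rival], group_sizes[rival])
        / Fraction(counts[diagnosis], group_sizes[diagnosis])
        for counts in findings.values()
    )

def probability_by_elimination(diagnosis="appendicitis"):
    """Assumed form of expression (D): 1 / (1 + sum of the rival terms)."""
    rival_terms = sum(
        Fraction(group_sizes[rival], group_sizes[diagnosis])
        * lowest_differential_lr(rival, diagnosis)
        for rival in group_sizes
        if rival != diagnosis
    )
    return 1 / (1 + rival_terms)

print(probability_by_elimination())  # 1
```

Each rival is eliminated by a different finding (NSAP by the raised neutrophil count, ‘other’ by LRLQ tenderness), which is the ‘complementing’ differentiation that the overall likelihood ratio cannot express.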
The ‘overall likelihood ratio’ is the frequency of a finding occurring in a diagnosis (e.g. ‘guarding’ in appendicitis) divided by the frequency of that finding in all those without the diagnosis (e.g. ‘guarding’ in those without appendicitis).
In 300 patients admitted to a surgical department over a month (see Fig. 13.1), 120 had ‘guarding’ and 100 patients turned out to have appendicitis. Eighty patients had both appendicitis and ‘guarding’. This meant that of those 100 patients with appendicitis, 80/100 = 80% had ‘guarding’. Also, of the 120 patients with ‘guarding’, 80/120 = 66.67% also had appendicitis.
The frequency of ‘guarding’ in those without appendicitis was 40/200 = 20%, corresponding to a probability of 0.20 (because many patients had other conditions that can cause guarding). Therefore, the ‘overall’ likelihood ratio would be 0.80/0.20 = 4.
If the 300 patients had been admitted to the surgical ward in a month and 100 of these had appendicitis, then the incidence per month would be 100/300 = 0.33. If 120 of these 300 patients had ‘guarding’, then its incidence would be 120/300 = 0.40. However, if all patients with appendicitis and all patients with ‘guarding’ during the same month had been sent to hospital from a catchment area of 300,000, then the incidence of appendicitis in the catchment population would be 100/300,000 = 0.00033 and the incidence of guarding would be 120/300,000 = 0.00040 per month.
In the catchment area, the proportion of those with appendicitis who had guarding would also be 80/100 = 80%. The proportion of those without appendicitis who had ‘guarding’ would be 40/299,900 = 0.000133. Therefore, the likelihood ratio would be 0.8 divided by 40/299,900, which is 5998 (five thousand nine hundred and ninety eight!), compared to a ratio of 4 inside the surgical department. Despite this, the probability of any patient in the catchment area having appendicitis who is known to have guarding would still be 80/120 = 66.67%, which is exactly the same as for patients in the hospital. If the frequency of guarding in those with NSAP is also 4% in the community, the differential likelihood ratio, differential odds and probability of the diagnosis will also be the same.
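The contrast between the ward and the catchment area can be checked directly. In this sketch only the size of the surrounding population changes; the function name and its default arguments are illustrative.

```python
from fractions import Fraction

def guarding_summary(population, with_appendicitis=100, with_guarding=120, with_both=80):
    """Return (overall likelihood ratio, probability of appendicitis given guarding)."""
    sensitivity = Fraction(with_both, with_appendicitis)
    false_positive_rate = Fraction(with_guarding - with_both, population - with_appendicitis)
    overall_lr = sensitivity / false_positive_rate
    p_appendicitis_given_guarding = Fraction(with_both, with_guarding)
    return overall_lr, p_appendicitis_given_guarding

ward_lr, ward_probability = guarding_summary(300)
community_lr, community_probability = guarding_summary(300_000)

print(ward_lr, community_lr)                    # 4 5998
print(ward_probability, community_probability)  # 2/3 2/3
```

Adding 299,700 people who have neither the finding nor the diagnosis inflates the overall likelihood ratio but leaves the probability of appendicitis given guarding unchanged.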
So, the overall likelihood ratio will be greater if the population contains larger numbers of patients without the diagnosis or the finding (e.g. healthy people). It is best used as evidence for the ability of a test to screen such populations for asymptomatic diseases and should only be applied in the population from which it was derived.
The above example assumed that all patients with guarding and appendicitis seen in the community were sent into hospital. However, if the primary care physicians had been able to send all patients with appendicitis but only those with severe forms of NSAP and other diagnoses into hospital, then this may create a difference in the differential likelihood ratios between the community and the hospital. It is important, therefore, to check whether all likelihood ratios, overall and differential, are the same in other populations.
Diagnoses are often pursued actively by looking for findings that are ‘likely’ to occur in patients with the diagnosis that one is trying to confirm, and ‘unlikely’ to occur in those with diagnoses that one is trying to ‘eliminate’. However, another approach is to think of each of the patient’s findings in turn (e.g. LRLQ pain, guarding, etc.), and to consider if there is only one diagnosis that is common to some lists of differential diagnoses. This approach only depends on ‘differential probability ratios’ (see equation (E) in Using very low frequencies or probability densities, p.[link]).
If only a single diagnosis (e.g. appendicitis) occurs commonly in a number of leads (e.g. LRLQ pain and guarding), it follows that that single diagnosis will become probable, i.e. it will occur very frequently in a group of patients with those lead findings (e.g. appendicitis will occur frequently in those with a combination of LRLQ pain and guarding). The frequency with which the diagnosis will be found can be estimated by using equation (E).
In order to estimate the frequency given a combination of findings by using observed frequencies given single lead findings, choose the best lead (i.e. the finding with shortest list of differential diagnoses). Thus the shortest list of differential diagnoses is that of LRLQ pain—the list is appendicitis (in 37.5%) or NSAP (in 61.5%). These account for 99% of patients with LRLQ pain, the other 1% not being in the list. For each other diagnosis in the list (i.e. NSAP alone in this case), choose another finding that provides the best (i.e. the lowest) ‘differential probability ratios’. For example, guarding is associated with NSAP in only 6/120 = 5% of cases and appendicitis occurs in 80/120 = 66.67% of patients with guarding, so the differential probability ratio is 0.05/0.67 = 0.075.
It is this lowest differential probability ratio that is required in order to probably eliminate each rival diagnosis (e.g. NSAP) and thus to estimate the probability of the diagnosis to be confirmed (e.g. appendicitis). This is done by first adding up the lowest differential probability ratio for each differential diagnosis as follows:
If the sum of all the lowest differential probability ratios is zero, then the probability of the diagnosis will be one, of course—certainty.
The lowest differential probability ratio for NSAP is 0.075 (provided by guarding) and the lowest differential probability ratio for the ‘unlisted’ diagnoses, provided by LRLQ pain, is 1/37.5 = 0.027. Therefore, the estimated probability of appendicitis given LRLQ pain and guarding is 1/(1 + 0.075 + 0.027) = 0.908 (see equation (E) in Using very low frequencies or probability densities, p.[link]). Note that this calculation does not use incidence or prevalence and only a small number of frequencies, minimizing the task of data collection when seeking evidence for the usefulness of tests.
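The same estimate can be computed from differential probability ratios alone. The sketch below assumes that equation (E) is 1 / (1 + sum of the lowest differential probability ratio for each rival), as the arithmetic above indicates; the variable names are ours.

```python
from fractions import Fraction

# Probabilities of each diagnosis given a single lead finding (Fig. 13.1).
p_appendicitis_given_lrlq = Fraction(75, 200)
p_other_given_lrlq = Fraction(2, 200)
p_appendicitis_given_guarding = Fraction(80, 120)
p_nsap_given_guarding = Fraction(6, 120)

# Lowest differential probability ratio for each rival (rival on top).
nsap_ratio = p_nsap_given_guarding / p_appendicitis_given_guarding  # provided by guarding
other_ratio = p_other_given_lrlq / p_appendicitis_given_lrlq        # provided by LRLQ pain

estimate_e = 1 / (1 + nsap_ratio + other_ratio)
print(float(nsap_ratio), round(float(other_ratio), 3), round(float(estimate_e), 3))  # 0.075 0.027 0.908
```

No incidence, prevalence, or likelihood appears in this calculation: every input is a proportion of patients with a diagnosis among those with one finding.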
Data are therefore needed to find the frequency with which patients with each differential diagnosis occur in those with a lead finding in different clinical settings (i.e. to calculate the various differential probability ratios). Data are also needed on the frequency of findings in patients with diagnoses to calculate the differential likelihood ratios. These may vary between different communities, hospitals, parts of the country, etc.
Statistical independence plays an important role in the arithmetic of probability. For example, if the probability of throwing an even number on a die is 1/2 and the probability of throwing a ‘five’ is 1/6, then the probability of throwing an even number followed by a ‘five’ is 1/2 x 1/6 = 1/12. This is because it is assumed that the result of each throw is statistically independent of any other result. The same assumption is sometimes made in diagnosis, so that if the frequency of LRLQ pain in NSAP is 1/2 and the frequency of guarding is 1/6, then the frequency of both is guessed as being 1/2 x 1/6 = 1/12.
During reasoning by probable elimination, we do not guess but rely on the fact that the frequency of LRLQ pain and guarding in NSAP can be no more than 1/6. However, if there was statistical dependence, then we could assume that both findings did occur together 1/6 of the time. If we did this we would probably overestimate the true frequency of both. Reasoning by probable elimination also has the advantage of only using a small amount of information, i.e. one finding per rival diagnosis to be eliminated.
If we assume statistical independence, then we usually include all the patient’s findings and multiply together all their frequencies of occurrence in each possible diagnosis. This is well known to over-estimate probabilities, e.g. so that the probability of appendicitis would be 0.999 and its rivals would be 0.001 or less. Also it is not feasible to find the frequency of all findings that exist in all diagnoses that exist. A compromise would be to use the known best two likelihood ratios for the rival diagnosis, especially if each differential likelihood ratio is weak. For example, if rebound tenderness occurs in 0.1 of those with NSAP and 0.9 of those with appendicitis, the ratio is 0.1/0.9 = 0.111. If we keep to a dependence assumption then the estimated probability of appendicitis remains (using equations (D) and (E) in Using very low frequencies or probability densities, p.[link]):
If we assume statistical independence between rebound tenderness and guarding in those with appendicitis and NSAP, then the estimated probability of appendicitis is (using equations (D) and (E)):
It should be noted that the estimate is now already slightly higher than the true value of 0.95. Again by using equation (E) in Using very low frequencies or probability densities, p.[link], the incidence or prevalence of a diagnosis is not used in the calculation.
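The two estimates can be compared in a short sketch. It assumes that, under the dependence assumption, only the lower of the two differential likelihood ratios is used for NSAP, and that under the independence assumption the two ratios are multiplied; the term of 2/75 for the ‘unlisted’ diagnoses is carried over from the earlier worked example.

```python
from fractions import Fraction

nsap_to_appendicitis = Fraction(150, 100)               # ratio of their frequencies in the study population
guarding_lr = Fraction(4, 100) / Fraction(80, 100)      # 0.05
rebound_lr = Fraction(1, 10) / Fraction(9, 10)          # 0.111
other_term = Fraction(2, 75)                            # 'unlisted' diagnoses, from LRLQ pain

# Dependence assumption: keep only the best (lowest) ratio for the rival.
dependence_estimate = 1 / (1 + nsap_to_appendicitis * min(guarding_lr, rebound_lr) + other_term)

# Independence assumption: multiply the two ratios together.
independence_estimate = 1 / (1 + nsap_to_appendicitis * guarding_lr * rebound_lr + other_term)

print(round(float(dependence_estimate), 3), round(float(independence_estimate), 3))  # 0.908 0.966
```

With only two findings, the independence assumption already pushes the estimate above the observed 0.95, which illustrates how multiplying many likelihood ratios can lead to over-confident probabilities.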
A better approach would be to combine weak diagnostic findings to try to form more useful numerical tests and to test their ability to create differential likelihood or probability ratios.
Fig. 13.5 shows the proportion of patients with acute abdominal pain at different ages that turn out to have each diagnosis. In the range 0–9 years of age, 48.6% of patients will have appendicitis, 49.8% of patients will have ‘non-specific abdominal pain’ (NSAP), and 1.6% will have small bowel obstruction. Acute abdominal pain under the age of 10 is thus a very good lead with only 3 differential diagnoses. However, over the age of 70 years, the differential diagnosis was appendicitis (4.6%), diverticulitis (7.6%), perforated duodenal ulcer (4.4%), ‘non-specific abdominal pain’ (32.2%), cholecystitis (35.2%), small bowel obstruction (9.1%), and pancreatitis (6.9%). In the intervening ages, the proportions change as shown in Fig. 13.5.
Fig. 13.5 is a histogram prepared by dividing the ages into ranges and counting the proportions with each diagnosis in each range. However, it is also possible to create smooth curves to display the probabilities. This can be done by plotting the distribution of the ages in a group of patients with each diagnosis; for example, by using kernel spline functions. This can be done by placing a ‘Gaussian bell’ over each data point and then adding the height of the curve for each ‘bell’ at each value (each age of the patient in this case). If the summit of each Gaussian kernel ‘bell’ is set to a value of 1, then the sum of the height of each Gaussian kernel distribution gives the estimated number of the patients whose data contributed to that result. The smoothness of the curve is controlled by varying the standard deviation of each Gaussian curve kernel – the wider the standard deviation, the smoother the curve and the greater will be the estimated number of patients contributing to each point on the curve.
The estimated number of patients at each value for a diagnosis is then divided by the total number of patients with that diagnosis. This gives the ‘estimated’ likelihood of that value (e.g. of a patient being that age). For the sake of simplicity, only three of these distributions are shown in Fig. 13.6: those for non-specific abdominal pain (NSAP), appendicitis, and cholecystitis. The age distributions of patients with NSAP and appendicitis peak at about 14 years of age, whereas the age distribution for cholecystitis peaks at about 70 years of age.
If the estimated number of patients with a diagnosis, e.g. appendicitis, is divided by the total estimated number of patients at each age, then this gives the proportion with each diagnosis at that age. These values are plotted in Fig. 13.7. It can be seen that at the age of 10, the proportions with appendicitis and NSAP are almost the same but they diverge at older ages.
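The kernel procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the ages are invented and are not the data behind Fig. 13.5 to Fig. 13.7, the standard deviation of 5 years is an arbitrary choice, and the function names are hypothetical.

```python
import math

def kernel_count(ages, at_age, sd):
    """Sum of Gaussian 'bells' (each with a summit of 1) evaluated at at_age.

    The sum is the estimated number of patients contributing at that age.
    """
    return sum(math.exp(-0.5 * ((at_age - a) / sd) ** 2) for a in ages)

def estimated_likelihood(ages, at_age, sd):
    """Estimated count at at_age divided by the number with the diagnosis."""
    return kernel_count(ages, at_age, sd) / len(ages)

def estimated_proportions(ages_by_diagnosis, at_age, sd):
    """Proportion with each diagnosis at at_age (estimated count / total)."""
    counts = {d: kernel_count(a, at_age, sd)
              for d, a in ages_by_diagnosis.items()}
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

# Invented ages, for illustration only (assumption, not study data).
ages_by_diagnosis = {
    "appendicitis": [8, 12, 14, 15, 19, 23, 31],
    "NSAP": [6, 11, 13, 14, 17, 22, 28, 40],
    "cholecystitis": [45, 58, 66, 70, 72, 79],
}

proportions_at_14 = estimated_proportions(ages_by_diagnosis, at_age=14, sd=5.0)
proportions_at_70 = estimated_proportions(ages_by_diagnosis, at_age=70, sd=5.0)
```

Increasing `sd` widens each bell, so more patients contribute to each point and the curve becomes smoother, as described in the text.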
The likelihoods in Fig. 13.6 and the probabilities in Fig. 13.7 can be used in reasoning by probable elimination, as described in previous pages. The methods used here can be applied to numerical test results, and clinical scores (e.g. the Wells’ score). The same approach can also be used on estimated probabilities (plotted on the X axis) to see if they are equal to the proportion of correct predictions (plotted on the Y axis) as part of an audit of the accuracy of probabilities.
The diagnosis of ‘diabetic albuminuria’ is based on ‘stratifying’ patients based on the presence of ‘diabetes mellitus’ and ‘albuminuria’, and also on the absence of other causes of ‘albuminuria’, such as transient rises in urine albumin, recent exercise, prolonged standing, urinary tract infection, other infectious illnesses, other severe illnesses, dehydration, hypertension, heart failure, nephritis, or other chronic renal disease.
‘Diabetes mellitus’ is assumed to be present if at least two fasting or random blood glucose levels exceed 7mmol/L or 11mmol/L, respectively, so that a transient or self-limiting glucose rise is probably eliminated. ‘Microalbuminuria’ is assumed to be present if 2 out of 3 albumin excretion rates (AER) exceed 20 micrograms/min (or an albumin–creatinine ratio of >2.5 in females or >3.5 in males), which means that a transient self-limiting rise in albumin excretion is probably eliminated.
Collecting a specimen overnight or on rising means that ‘recent prolonged standing’ or ‘heavy exercise’ is probably eliminated. The absence of any symptoms, signs, or test results solely attributed to the other diagnoses (e.g. no breathlessness or ankle swelling, a controlled BP, negative urine tests for blood, or leucocytes, etc.) means that the other diagnoses too are probably eliminated.
This means that there is ‘persistent albuminuria’ in the presence of diabetes mellitus that can be explained by a leakage of albumin from renal glomeruli due to damage from persistently raised blood glucose. It is also imagined that lowering the blood pressure within each glomerulus with a drug will reduce this leakage and prevent progression until the albumin excretion is over 200 micrograms/min (which would mean by definition they have ‘diabetic nephropathy’). These patients are thus identified by ‘stratification’.
Some patients may not progress on placebo, some may progress unless they are treated, and some may progress despite being treated (these are ‘triage’ groups as used in emergency situations). If patients with the ‘other causes’ of microalbuminuria (e.g. excessive exercise) are not excluded from treatment, then the treatment to prevent progressive glomerular leak will be given to more patients with little prospect of benefiting because they would not have progressed in the first place or because they will progress despite the treatment.
A combination of findings assembled by reasoning by probable elimination that predicts probable benefit from treatment can by common agreement be regarded as a ‘sufficient’ diagnostic criterion. All those with this criterion would then ‘definitely’ have the diagnosis.
If there is another combination of findings (e.g. that includes the albumin–creatinine ratio (ACR) instead of the AER), then this combination could be another ‘sufficient’ criterion. Patients with the diagnosis would be those with at least one of these ‘sufficient’ criteria. In order to exclude diagnoses we need a ‘necessary’ criterion that includes virtually all those who benefit so that if that finding is absent, virtually no patients miss out. These criteria depend on careful analysis and ‘stratification’ of clinical trials based on the principles of ‘triage’.
Patients with provisional criteria of ‘diabetic microalbuminuria’ were randomized to have an angiotensin receptor blocker (ARB) or placebo included in their BP control treatment¹. The proportions developing diabetic nephropathy within 2 years were ‘stratified’ into those starting with an AER of 20–40, 41–80, 81–120, and 121–200 micrograms/min and were plotted as shown in Fig. 13.8. Only 1/77 (1.3%) developed nephropathy on placebo after starting with an AER between 20 and 40 micrograms/min and 1/127 (0.8%) developed nephropathy on an ARB, so that only 0.5% gained from treatment. However, above 40 micrograms/min, more progressed to nephropathy without treatment and more benefited from treatment. This suggests that the cut-off point should have been 40 micrograms/min and that about 1/3 of the patients already offered an ARB do not benefit much within 2 years. The risk of nephropathy and the proportion benefiting from treatment are greater at higher levels of AER; this also shows that AER is a good predictor of outcome. This approach is the aim of ‘stratified’ or ‘personalized’ medical research.
The response to treatment used here was based on the same biochemical measurement (AER) that was used to establish the diagnosis and treatment indication. The same analysis can be carried out on clinical trials where the outcome was based on symptoms or a well-being score. However, not all patients with a diagnosis are offered all the treatments linked to it. Patients with a diagnosis make up a set that encloses sub-sets of patients who benefit from various actions. In the case of ‘diabetic microalbuminuria’, each level of severity (each AER) can be regarded as a ‘stratified’ or ‘personalized’ diagnostic sub-set with its own probability of benefit from 0.5% to 27%. Few patients with the diagnosis would opt for a treatment with a probability of benefit that was about 0.5%.
The role of a doctor is to recommend treatments to patients by arriving at diagnoses and advising them on the probability of success. This ‘stratification’ can be improved by predicting more accurately the ‘triage’ groups of (1) those who will get better without treatment, (2) those who will get better only with treatment, and (3) those who will fail to get better and may have some other progressive illness that might respond to something else, e.g. an unknown renal disease. Its discovery might result in another diagnosis being added to the list of causes of a raised AER. It is also possible that the differential diagnoses might change for different values of AER as in Fig. 13.4, Fig. 13.5, and Fig. 13.6. This would allow more patients who do not respond to treatment with an ARB to be excluded from the criterion for ‘microalbuminuria’.
The clinical trial results in Fig. 13.8 still showed a high probability of benefit between 120 and 200 micrograms/min but the data were sparse at high levels. If more data had been available, then it might have been possible to create a set of curves as shown in Fig. 13.9. The broken curve of those on treatment shows that the proportion with an adverse outcome is lower than the unbroken curve of those on placebo.
If the treatment had been ineffective, then the broken sigmoid curve would have been superimposed on the continuous curve. However, if the test result did not predict the outcome, then both curves would have been flat. If the treatment was effective, then the flat, broken line would be below the continuous flat line as shown; if not they would be superimposed. A poorer test would result in a shallower curve between the flat and sigmoid curve. A better test with fewer untreatable causes would produce a sigmoid curve with longer horizontal segments at the top and bottom with a steeper rise between. A perfect test would give a vertical rise. A number of such tests could be compared in an RCT, e.g. by doing an AER, albumin–creatinine ratio, etc.
The current cut-off for diagnosing and treating patients with diabetic microalbuminuria is 20 micrograms/min. Fig. 13.8 showed that many patients are diagnosed and treated between 20 and 40 micrograms/min when there is little difference between treatment and placebo. In Fig. 13.9 the treatment and placebo curves are also close between ‘160’ and ‘200’ where the condition is therefore too advanced to benefit from treatment.
If a cut-off was placed at 40, then few patients would lose out by not being offered the treatment. However, if the test result was above the cut-off point, then it is important to consider the actual test result, e.g. ‘100’, and to estimate the probability of benefit by subtracting the probability of the outcome on treatment from the probability of the outcome on placebo. At a value of ‘100’ in Fig. 13.9, the probability of the outcome on placebo is about 0.22, whereas on treatment it is 0.09, the difference being 0.22–0.09 = 0.13. This means that the number needed to treat is about 1/0.13 = 7.7 for one to benefit. This would be put to the patient in shared decision-making.
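The arithmetic of the probability of benefit and the number needed to treat can be sketched as follows. The function names are hypothetical; the input values are those quoted in the text (0.22 and 0.09 at a test result of ‘100’, and 1/77 and 1/127 for the 20–40 micrograms/min stratum).

```python
def probability_of_benefit(p_outcome_placebo, p_outcome_treatment):
    """Probability of the adverse outcome on placebo minus that on treatment."""
    return p_outcome_placebo - p_outcome_treatment

def number_needed_to_treat(p_outcome_placebo, p_outcome_treatment):
    """Reciprocal of the probability of benefit."""
    benefit = probability_of_benefit(p_outcome_placebo, p_outcome_treatment)
    if benefit <= 0:
        raise ValueError("no benefit: number needed to treat is undefined")
    return 1.0 / benefit

# Values read from the text for a test result of '100'.
benefit_at_100 = probability_of_benefit(0.22, 0.09)
nnt_at_100 = number_needed_to_treat(0.22, 0.09)

# Stratum with a starting AER of 20-40 micrograms/min: 1/77 vs 1/127.
benefit_20_to_40 = probability_of_benefit(1 / 77, 1 / 127)
nnt_20_to_40 = number_needed_to_treat(1 / 77, 1 / 127)
```

The second pair of values illustrates why few patients in the lowest stratum would opt for treatment: the probability of benefit is about 0.005, so roughly two hundred patients would need to be treated for one to benefit.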
If ‘100’ had been chosen as the cut-off for not considering treatment and the result of a study of the proportion with each outcome plotted as shown in Fig. 13.10, then the marked discontinuity at ‘100’ indicates that the cut-off is too high. This would result in ‘under-diagnosis’ so that many patients would miss out. If the same study was conducted with a cut-off at ‘20’, then the curve would follow the course of the broken line in Fig. 13.9 with no discontinuity, suggesting ‘over-diagnosis’ or that the treatment was ineffective in this range at least. A series of studies could be done by moving the cut-off point in stages to assess treatment effectiveness at various cut-off points. This could be used to check that the result of a published RCT was borne out in other centres or to compare tests.
If a patient presents with acute abdominal pain, then there will be a list of possible explanations. These include appendicitis, a self-limiting condition, and a ‘spurious symptom’, so that when the patient is asked again, it is met by a denial. In the latter case, the finding has failed to be replicated. A doctor listening to a story or looking at test results should consider their reliability—or their probability of replication. The same happens with scientific studies. There may be many reasons for a low probability of replicating a study result, one important cause being that if the number of observations in the study were few, then chance variation would probably lead to a different repeat result.
Chance variation is often assessed by asking what ‘true’ proportion developing nephropathy in the total population would result in the observed proportion of 1/77, or something more extreme (i.e. 0/77 for the upper limit, or 2/77 or more for the lower limit), being seen 2.5% of the time if 77 patients were drawn at random from that total population. The lower 2.5% limit of the confidence interval (CI) in this case would be 0.0003. The upper 2.5% limit of the CI calculated in the same way would be 0.0702.
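The two limits can be reproduced numerically. The sketch below assumes that they were obtained by the ‘exact’ binomial (Clopper-Pearson style) method, which the text does not state explicitly; the function names are hypothetical and the root is found by simple bisection rather than a statistics library.

```python
from math import comb

def binom_cdf(k, n, p):
    """Probability of k or fewer events in n patients when the true proportion is p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect_root(f, lo, hi, iterations=100):
    """Find p in [lo, hi] where f(p) = 0, assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_limits(k, n, tail=0.025):
    """Lower and upper limits for an observed proportion k/n.

    Lower: the true proportion at which k or more events occur 'tail' of the time.
    Upper: the true proportion at which k or fewer events occur 'tail' of the time.
    """
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - tail, 0.0, 1.0)
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - tail, 0.0, 1.0)
    return lower, upper

lower_limit, upper_limit = exact_limits(1, 77)
```

Under this assumption the limits for 1/77 come out close to the 0.0003 and 0.0702 quoted in the text.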
Bayesians think that this approach is wrong and maintain that one should first guess ‘subjectively’ the prior probability of all the possible results in the total population (i.e. of it being 0%, 1%,...15%,...100%) on the basis of the evidence of other studies, personal experience, etc. The likelihood of getting the observed result of 1/77 from each of these possibilities is calculated and then Bayes’ rule is used to get an estimate of the probability of each possible outcome (or by adding several outcomes, a range of them, e.g. from 0/77 to 4/77).
There are objections to both these approaches. Another approach would be to regard the observed outcome of 1/77 developing nephropathy as a member of the set of all observations of 1/77 (about asthma, deep vein thrombosis, etc.), the combined set of all these sets containing 1/77 = 1.3% getting the predicted outcome. We can then calculate the proportion of times we would get an outcome reasonably near to 1/77 (e.g. 0/77, 1/77, 2/77, 3/77) if we selected 77 at random from the pooled population with an outcome of 1.3%. Thus the probability of getting a repeat result between 0/77 and 3/77 inclusive would be 0.98 from the binomial theorem. The probability of getting a repeat result between 0/77 and 2/77 would be 0.92 and the probability of replication between 0/77 and 4/77 would be 0.99.
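The replication probabilities quoted above follow directly from the binomial distribution with 77 patients and a pooled proportion of 1/77. A minimal sketch (the function names are hypothetical):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k events in n patients when the proportion is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def probability_of_replication(observed_k, n, lowest, highest):
    """Probability that a repeat sample of n gives between lowest and highest
    events (inclusive), taking the observed proportion as the pooled proportion."""
    p = observed_k / n
    return sum(binom_pmf(k, n, p) for k in range(lowest, highest + 1))

p_0_to_2 = probability_of_replication(1, 77, 0, 2)
p_0_to_3 = probability_of_replication(1, 77, 0, 3)
p_0_to_4 = probability_of_replication(1, 77, 0, 4)
```

These three values should agree, to two decimal places or better, with the 0.92, 0.98, and 0.99 given in the text.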
This 0.01 probability of non-replication between 0/77 and 4/77 based on an observed result of 1/77 would ‘probably eliminate’ this possibility if it were used in the ‘probable elimination expressions’. If we could also probably eliminate the other causes of non-replication: inaccurate description of methods and results, the author’s patients or population being very different, the absence of contradictory results in other studies, etc, then the probability of replication would be high. The differential probability or differential likelihood ratios for this reasoning would have to be guessed in most cases (unless, for example, some journal editors collected such data from past experience).
Hypotheses are guesses or predictions about currently inaccessible phenomena that are still being investigated. In medical practice, these are ‘working’ diagnoses. A ‘theory’ is a prediction that is not currently being investigated (this is a ‘final’ diagnosis in clinical reasoning).
If a patient presents with acute abdominal pain there will be past experience of the diagnoses being considered, so that if all but one are ‘probably eliminated’, the high probability of the remaining diagnosis can be supported by a track record. With novel scientific hypotheses we may not have thought of all the possibilities, so that even if all but one of those being considered is shown to be improbable, we cannot conclude that the remaining hypothesis is probably correct.
Karl Popper made this point by saying that it was not possible to confirm hypotheses but only to ‘refute’ or ‘falsify’ them. This implied that such reasoning used definitive criteria. For example, if we postulated that undiagnosed diabetes had caused nephropathy in a group of patients, then this hypothesis could be falsified if a high HbA1c was a ‘necessary’ criterion for diabetes and the HbA1c was normal in each member of the group. If a high HbA1c was simply frequent in diabetes then the hypothesis would be ‘probably eliminated’, not falsified.
If a diagnosis or novel hypothesis is still possible (or even ‘confirmed’ in the case of a diagnosis) there is added uncertainty. This is because diagnoses and hypotheses are titles to many things that we imagine (i.e. predict) about the present, past, and future in terms of phenomena that can or cannot be verified directly by observation.
All probabilities represent a degree of uncertainty about a predicted event. The only highly certain thing about probabilities is that if they are derived appropriately they can predict accurately the frequency of correct predictions of various kinds in the long run. (This is why bookmakers make a profit.) This may well happen if the probabilities are based on past experience. However, probabilities are usually estimates, e.g. the probability of nephropathy based on an AER of 102 micrograms/min when no past patients may have had this actual result.
The only way to assess the accuracy of all probabilities is to check how often all similar probabilities (e.g. of 0.8) are linked to correct predictions in the long run (it should be 80% of the time). This would be a ‘probability audit’. If all probabilities have a corresponding predictive success rate, then a plot of probability against correct predictions should be a straight line from 0 to 1 (or a line of identity). If it is not, then the plot can act as a calibration curve to correct the probability.
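A ‘probability audit’ of this kind can be sketched as follows. The record of predictions is invented for illustration, and grouping the stated probabilities by rounding to one decimal place is an assumption about how ‘similar probabilities’ might be pooled; the function name is hypothetical.

```python
from collections import defaultdict

def probability_audit(predictions, decimals=1):
    """Group predictions by their stated probability and return, for each group,
    the observed proportion of correct predictions.

    predictions: iterable of (stated_probability, was_correct) pairs.
    """
    groups = defaultdict(list)
    for probability, was_correct in predictions:
        groups[round(probability, decimals)].append(1 if was_correct else 0)
    return {p: sum(outcomes) / len(outcomes)
            for p, outcomes in sorted(groups.items())}

# Invented record of past predictions (assumption, for illustration only).
predictions = (
    [(0.8, True)] * 8 + [(0.8, False)] * 2 +
    [(0.6, True)] * 9 + [(0.6, False)] * 1
)

audit = probability_audit(predictions)
```

In this invented record the predictions made with a probability of 0.8 lie on the line of identity, whereas those made with 0.6 were correct 90% of the time, so the plot would act as a calibration curve that corrects 0.6 upwards.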
A ‘probability audit’ can be done for all predictions made by a person, or those only connected to medicine, or a speciality in medicine, or even a single diagnosis. However, some predictions cannot be verified to see if they are ‘correct’ (e.g. molecular changes). We then have to assume that our probabilities connected with non-verifiable events are equally accurate to our verifiable probabilities.
People appear to conduct informal ‘probability audits’ subconsciously during their day to day lives by modulating their sense of certainty to avoid being over-confident or under-confident. If we fail to do this, we would make more misjudgements than necessary.
The aim of the arguments below is to prove expression (1), which is explained in footnote *. When (a) Fi is any of the ‘findings’ F1, F2,...Fn actually observed (e.g. symptoms), (b) Dj is a hidden phenomenon (i.e. making up a diagnostic criterion), (c) Ďx = ‘not Dx’ such that p(Ďx) = 1–p(Dx), where Dx is a suspected diagnosis chosen from the list D1, D2,...Dm, (d) FL is one of the findings represented by Fi chosen as a ‘diagnostic lead’, FL being chosen so that the value of ‘m’ (the number of diagnostic possibilities linked to it) and p(D0/FL) (the proportion of patients without one of these diagnostic possibilities) are both as low as possible, and (e) p(Fi/Dj) for each Dj other than Dx can use any Fi, although the lowest p(Fi/Dj) for each Dj will give the highest lower bound for p(Dx/F1∙...Fn):
Substituting (5) in (4) gives:
Substituting (6) in (2) gives:
[see footnote *]
Make a ‘likelihood ratio equivalence assumption’ that a differential likelihood ratio of xi/yi and (xi.ki)/(yi.ki) result in the same value of p(Dx/F1∙...Fn) and provided that 1 ≥ p(Fi/Dj).ki ≥ 0; 1 ≥ p(Fi/Dx).ki ≥0 and 1≥ p(FL/D0).kL ≥0 and 1≥ p(FL/Dx).kL ≥0 then:
Substituting (10) in (9):
Now let ki = 1/p(Fi/Dx) and kL = 1/p(FL/Dx) so that:
Substituting (12) in (11) when the term ‘likelihood ratio’ means the ‘differential likelihood ratio’, which is the same as a ratio of two ‘sensitivities’:
Substituting (15) in (14):
It is noteworthy that expression (16) does not use ‘likelihoods’ at all. Equations (14) and (16) always provide estimates of p(Dx/F1∙...Fn) that are greater than or equal to zero even when the values of the denominator likelihoods are low (e.g. in probability densities). Any Fi can be chosen to ‘probably eliminate’ a Dj by showing that each p(Dj/F1∙...Fn) is low but in expressions (1) and (7), the closest upper bound for each p(Dj/F1∙...Fn) and the closest lower bound for p(Dx/F1∙...Fn) will be obtained from using the lowest p(Fi/Dj) for each Dj in the numerator. However, in the simpler expressions (14) and (16), the closest estimate for p(Dx/F1∙...Fn) will be obtained from using the lowest ratio of p(Fi/Dj) / p(Fi/Dx) in expression (14) or the lowest probability ratio of p(Dj/Fi)/p(Dx/Fi) in expression (16). When these ratios are actually zero, then p(Dx/F1∙...Fn) = 1 providing a perfectly accurate result.
If p(Fi/Dj) or p(Dj/Fi) is not known for some values of ‘i’ then the expressions still hold true for the value of ‘i’ that provides the lowest known likelihood or likelihood ratio. It should be noted that expression (14) provides an estimate of the upper bound of the likelihood ratio of the total evidence (i.e. all the symptoms, signs and test results etc.) and thus expressions (14) and (16) estimate the lower bound of the probability of the diagnosis given the total evidence even though many of these findings will not be used in the calculation.
Expressions (14) and (16) are inequalities but if the lowest known likelihood ratio is assumed (by a ‘dependence’ assumption) to be EQUAL to the likelihood ratio for the total evidence, then expression (14) provides an estimate of the probability of the diagnosis given the total evidence. Thus the ‘inequality’ (14) after a dependence assumption becomes an ‘equality’:
and the ‘inequality’ (17) gives rise to the ‘equality’ (18):
The principle of using the most highly predictive combination of findings (which can be described as the ‘central’ or ‘most relevant’ evidence) as an estimate of the probability given the total evidence can be regarded as a ‘heuristic’ that simplifies the interpretation of the ‘total evidence’.
A further assumption of statistical independence can be made between p(Fi/Dj) / p(Fi/Dx) and p(Fi+1/Dj) / p(Fi+1/Dx) in expression (17) when Fi+1 provides the next lowest likelihood ratio to that provided by Fi. Expression (17) will then become:
By applying the same assumption to expression (18) we get:
An assumption of statistical independence can be made between any number of p(Fi/Dj) and p(Fi/Dx) up to ‘n’. When the maximum of ‘n–1’ such assumptions are made we get the following result from expression (17):
The corresponding result from expression (18) is:
The value of p(Dx/F1∙...Fn) can be estimated for each Dx in turn using any of the above approaches, and each p(Dx/F1∙...Fn) can then be divided by the sum of all the p(Dx/F1∙...Fn) for values of x from 1 to m to give a normalised estimate pne(Dx/F1∙...Fn), so that the normalised estimates sum to 1.
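The normalisation step amounts to dividing each estimate by the sum of all the estimates. In the sketch below the three unnormalised estimates are invented for illustration and the function name is hypothetical.

```python
def normalise(estimates):
    """Divide each estimated probability by the sum of all the estimates."""
    total = sum(estimates.values())
    if total <= 0:
        raise ValueError("estimates must have a positive sum")
    return {diagnosis: p / total for diagnosis, p in estimates.items()}

# Invented unnormalised estimates of p(Dx/F1...Fn) (assumption, for illustration).
unnormalised = {"appendicitis": 0.90, "NSAP": 0.15, "cholecystitis": 0.05}

normalised = normalise(unnormalised)
```

The normalised values keep the same ratios to one another as the original estimates but now sum to 1.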
The probability estimates arising from any of the expressions given here can be calibrated against the frequency of correct predictions.
1. Llewelyn DEH, Garcia-Puig J (2004). How different urinary albumin excretion rates can predict progression to nephropathy and the effect of treatment in hypertensive diabetics. J Renin Angiotensin Aldosterone Syst 5, 141–5.
2. Llewelyn DEH (1979). Mathematical analysis of the diagnostic relevance of clinical findings. Clin Sci 57, 477–9.
3. Llewelyn DEH (1981). Applying the principle of logical elimination to probabilistic diagnosis. Med Inform 6, 25–32.
4. Llewelyn DEH (1988). Assessing the validity of diagnostic tests and clinical decisions. MD thesis. University of London.
* Expressions (1) and (9) are inequality identities based on probability axioms alone and involve no potentially false assumptions but the denominator can be zero or less when the denominator likelihoods are low e.g. when they are probability densities, giving a correct but unhelpful result e.g.
† Expressions (14) and (16) are approximations based on an assumption of ‘likelihood equivalence’, which is only known to be true when p(Fi/Dj) or p(Dj/Fi) or p(FL/D0) or p(D0/FL) are zero. The assumption of ‘likelihood equivalence’ assumes that a likelihood ratio of p(Fi/Dj)/p(Fi/Dx) = x/y or x.k/y.k give the same probability of p(Dx/F1∙...Fn). This will also be true for example when there is statistical independence between p(Fi/Dj) and the remaining findings in p(F1∙...Fn) and p(Fi/Dx) and the remaining findings in p(F1∙...Fn).