Randomized controlled trials 

Sube Banerjee

PRINTED FROM OXFORD MEDICINE ONLINE ( © Oxford University Press, 2016. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Medicine Online for personal use (for details see Privacy Policy and Legal Notice).

There are three main questions in health care: ‘what is going on?’, ‘why?’ and ‘what do we do about it?’. ‘What is going on?’ forms the basis for clinical assessment including history taking, examination, and diagnosis. The question ‘why?’ underlies all aetiological research from laboratory science to epidemiology. The cross-sectional, case–control and cohort methodologies discussed in other chapters in this book provide the methodology for addressing ‘why?’ questions. However, just as medicine is more than diagnosis and also covers treatment, medical research is more than aetiology: it also necessarily extends to the evaluation of interventions.

Aetiological research which cannot be translated into health benefits through new or improved interventions is at best sterile and at worst self-indulgent, begging another important question: ‘so what?’. Flawed evaluations of interventions can be even more problematic since these may harm rather than help. Intervention studies (of which randomized controlled trials (RCTs) are the most important type, on the basis of quality of evidence available from them) take aetiological insights into action and provide the best evidence upon which to found clinical practice.

In this chapter we will consider some of the more important factors in the design, conduct, analysis, and interpretation of RCTs.

What is an RCT?

Last (1995) defines an RCT as:

An epidemiologic experiment in which participants in a population are randomly allocated into groups, usually called ‘study’ and ‘control’ groups, to receive or not to receive an experimental, preventative or therapeutic procedure, manoeuvre or intervention. The results are assessed by rigorous comparison of rates of disease, death, recovery, or other appropriate outcome in the study and control groups respectively. Randomized controlled trials are generally regarded as the most scientifically rigorous method of hypothesis testing available in epidemiology.

An RCT therefore sets out to find out the effect of an intervention. Stages in this process are summarized in Box 10.1. While this process may be clear it is not necessarily simple. RCTs require a substantial investment of resources in terms of time, expertise, personnel, and finance. This is not to say that the constituent components of assessment, intervention, reassessment, analysis, and interpretation need themselves be complex. Indeed much of the rigour in trial design revolves around ensuring that these components are simple, meaningful, and explicit before the start of the experiment.

What are RCTs for?

Treatment is fundamental to health care and we need information to decide what treatment to give, and to invest in. At its simplest, an RCT answers the question ‘does Treatment A work?’; one level of complexity higher is the design which answers the question ‘does new Treatment B work better than established Treatment A?’. Studies that simply observe the effects of treatment without randomization or control groups are difficult to interpret, in particular with respect to the size and direction of effect. The use of historical controls (e.g. comparing two case series, one before a new drug and one after) is also problematic (Altman and Bland 1999), generally resulting in an overestimation of the effect of the new intervention (Sacks et al. 1982).

The orthodoxy for the past 40 years has therefore been that RCTs rather than observational studies are the best way to judge the effect of an intervention. Two recent studies have challenged this (Benson et al. 2000; Concato et al. 2000). These compared data from observational studies and RCTs studying the same questions and concluded that observational studies yielded very similar odds ratios to RCTs with less variability of response. Ioannidis et al. (2001) in an editorial response questioned these findings, in particular highlighting: (a) the relatively small degree of actual overlap between the two methods; (b) some major negative correlations not quoted; (c) reconsideration of the data showing a lower level of agreement than was presented; and (d) the potential problems of pooling data. The authors concluded with a sensible statement of the particular values and roles of the two study types. They state that where observational studies have shown large harmful effects RCTs are likely to be unethical, as may also be the case where large beneficial effects have been demonstrated (e.g. risk ratios less than 0.4). Interventions with modest effect sizes (risk ratios from 0.4 to 0.9) are particularly suited to RCTs, while for those with very small effect sizes (0.9–1.0), or for very rare outcomes, RCTs may be difficult to perform and observational evidence may be the best data available. On balance there is a role for both study types. However, for the majority of disorders and interventions RCTs potentially provide the clearest and least biased evaluation of effect.
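
The risk ratio bands quoted above can be made concrete with a small sketch. The function names, the illustrative counts, and the treatment of the boundary values 0.4 and 0.9 are assumptions made here for illustration; the text gives only the approximate ranges.

```python
def risk_ratio(events_treated, n_treated, events_control, n_control):
    """Risk of the outcome in the treated group divided by risk in the control group."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    return risk_treated / risk_control


def suggested_design(rr):
    """Map a risk ratio onto the bands quoted in the text.

    Which band the boundary values 0.4 and 0.9 fall into is an assumption.
    """
    if rr < 0.4:
        return "large beneficial effect: an RCT may be unethical"
    if rr <= 0.9:
        return "modest effect: particularly suited to an RCT"
    if rr <= 1.0:
        return "very small effect: an RCT may be difficult to perform"
    return "outside the beneficial bands quoted in the text"


# Made-up counts: 10 of 100 have the outcome on treatment, 20 of 100 on control.
rr = risk_ratio(10, 100, 20, 100)
print(round(rr, 2), "->", suggested_design(rr))
```

A risk ratio below 1 here means fewer adverse outcomes in the treated group, so smaller values correspond to larger beneficial effects.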

The methodology and practice of RCTs cannot be divorced from their impact upon clinical practice and health policy, and the wider financial and industrial context. There are increasing moves worldwide to base clinical practice on good quality evidence. This is a reaction to the realization that practice has often varied widely and doctors of all specialties have very often been unable to support their actions with anything other than protestations of established practice, anecdote or peer group consensus. Quality of evidence for clinical effectiveness forms a hierarchy with consistent findings from several well conducted RCTs at the top.

So on one level the purpose of RCTs is to allow clinicians to select the best treatment for a patient and to allow patients to receive the treatment with the best benefit to risk ratio for their condition. However there are other agencies interested in the data produced by RCTs. Those who purchase health care, be they health insurance agencies or governmental health authorities, use evidence of effectiveness to focus or ration care within a limited financial envelope. These agencies therefore factor an assessment of the financial and political costs of sanctioning treatment into their interpretation of RCTs.

The pharmaceutical industry has a particularly strong interest in the findings of RCTs. Billions of dollars are spent in research and development of novel compounds by drug companies each year. It is they who design, conduct, analyse, and promote the vast majority of RCTs, albeit within a framework of governmental control agencies. While the betterment of humanity may occasionally be a by-product, the data from RCTs are the primary channel through which an experimental compound can be turned into a marketable commodity, so that investment can be turned to income and profit. The complex and competing inter-relationships of clinicians, patients, researchers, health purchasers, and drug companies are of profound importance in any consideration of an RCT. Occurrences such as the discontinuation of the FAME trial of a lipid-lowering statin in older adults for purely commercial reasons raise both scientific and ethical issues (Evans and Pocock 2001; Lievre et al. 2001).

Fundamental design issues

As with all other epidemiological studies, primary objectives in the design of RCTs are to exclude bias and to measure confounding so that its effect can be controlled for in the analysis. Important aspects in the design are summarized in Box 10.2. Lind's elegant mid-eighteenth century comparative trial of different treatments for scurvy contains many of these methodological components (Lind 1753; Bull 1959; Lilienfield 1982). In this study he took 12 patients with scurvy on board the Salisbury who were ‘as similar as I could have them’. He gave them a common diet, divided them into six intervention groups and supplemented the groups’ diets in different ways. The two people in each group received either: a quart of cider a day; ‘25 gutts of elixir vitriol’; two spoonfuls of vinegar; a course of sea water; ‘the bigness of a nutmeg’; or two oranges and a lemon. After six days the group on the oranges and lemon had recovered to such an extent that one returned to duty and the other became the nurse for the remaining patients. However, the translation of research findings into clinical practice was just as much of a problem in the eighteenth century as it is in the twenty-first, and it took a further 50 years before the Royal Navy provided its sailors with lemon juice (Pocock 1983).

One way to understand the particular challenges posed by the carrying out of an RCT is to set up a perfect scenario and then to see just how far this differs from clinical circumstances in general and evaluations of interventions in psychiatry in particular.

A perfect trial

The perfect conditions for a trial would involve a disorder (D) which could be diagnosed with absolute precision at minimal cost and whose course was such that it did not matter when it was diagnosed or treatment started. We would then need a putative treatment (X) which had a compelling scientific basis for its use but whose efficacy was in doubt so that a trial was ethically acceptable. By preference there would be no other active treatment for D so that an inert placebo could be used as a comparator, and X should have no properties by which it could be distinguished from the placebo (e.g. taste, side effects, perception of effect) either by patient or doctor. D should also be sufficiently common and serious for it to be a good candidate for a clinical trial and full funding should be available.

For ascertaining outcome, we should have a perfect knowledge of the natural history of D and it should consistently lead to an unequivocal outcome, such as a disease-specific death, within a given time (say, one year on average). We should then be able to recruit a group of affected people who were absolutely representative of all people with D. These would be randomly allocated on a 1 : 1 basis to receive a course of X or placebo. We would then follow up all participants in both groups and measure after one year the death rate in the intervention and the control groups, comparing the two to ascertain the efficacy of X. The findings would be absolutely unequivocal. Just to make things perfect, X should be a compound which is not under drug company license and which is very cheap and very simple to manufacture and distribute so that the whole world can benefit from this work.

Preferably several independent research groups would have carried out separate similar trials at the same time, and they would all be published together. The role of X in the treatment of D would be systematically reviewed showing a consistent and powerful treatment effect in the individual studies and after meta-analysis. The fairy story would end with there being no political or clinical resistance to the introduction of the treatment and the health care delivery systems being universally available to make X available to all without prejudice. And we would all live happily ever after.

Clearly this is to argue by absurdity but it is important to make the point that the real world is a messy place full of uncertainty and complexity. Diagnosis or case ascertainment may be imprecise and difficult, and it may be modified by help-seeking behaviour which has an influence on outcome. There may be co-morbidity with other conditions and the patient may be taking other medication. Recruitment and obtaining informed consent may be difficult. There may be competing treatments and established practice may mean that placebos are not ethically justifiable. Treatments have side effects which may be unpleasant or serious, affecting both compliance and blinding to treatment group. Outcome may be difficult to measure accurately and loss to follow-up may compromise the validity of the study. When translating outcomes from RCTs into clinical practice, issues of cost and practicality invariably need to be taken into account.

‘Scientific’ vs. ‘practical’ studies–efficacy and effectiveness

One of the major fault lines in the design of trials is the dynamic between purity and generalizability. Generalizability is the degree to which findings of a study can be extrapolated from the study population to other populations of interest: most often to a much broader range of patients and services in general clinical settings. ‘Scientific’ (also known as efficacy, speculative or explanatory) studies take the view that it is important to investigate the effect of the intervention in ideal circumstances; while ‘practical’ (also known as effectiveness, pragmatic or management) studies seek to investigate whether the intervention works in real clinical settings. In practice the main differences between the two sorts of RCT lie in the participants selected for entry and the nature of the intervention. Issues which principally separate ‘scientific’ and ‘practical’ trials are the nature of the studied population (particularly regarding exclusion criteria) and the way in which the intervention is delivered (Box 10.3).

It is worth considering some of the arguments put forward for each approach and the reasons why they might be articulated. As mentioned above, the arguments tend to revolve around the relative merits of maximizing either scientific purity or clinical generalizability. ‘Scientific’ studies produce data on the efficacy of an intervention: that is, the extent to which a specific intervention produces an effect in ideal conditions; while ‘practical’ trials produce data on the effectiveness of an intervention: that is, the extent to which a specific intervention produces an effect under more normal clinical circumstances in terms of patients and services. Efficacy is of course necessary for effectiveness, but it is not always sufficient. The terms ‘scientific’ and ‘practical’ are problematic since ‘practical’ trials may be far better science than ‘scientific’ trials and occasionally ‘scientific’ trials may be of practical use. Therefore in the rest of this chapter the terms efficacy and effectiveness will be used to describe the two types of study.

Efficacy trials are designed to produce the maximum effect size and so are often smaller than an equivalent effectiveness study, but may take time to recruit participants due to multiple exclusion criteria. Part of their attraction to drug companies and researchers alike lies in this maximization of effect. Efficacy trials are more likely to come up with a positive finding, which is good for marketing a drug and good for the researcher's publication record. In contrast effectiveness studies generally have to be larger in order to measure smaller effect sizes. They also tend to be more complex to analyse and interpret because more ‘typical’ groups of participants may be less likely to want to participate in a trial, and more likely to be lost to follow-up. They may also be more likely to have incidental adverse events, such as death, since they are not selected on the basis of being unusually healthy as in many efficacy trials.

Incidental adverse events are worth dwelling on since they are another reason why drug companies may prefer efficacy trials. Severe adverse events are very problematic in a Phase III trial (see below for a description of the drug trial phases) and may occur entirely by chance. However, they are unlikely to be sufficiently common to assort equally across the intervention and control groups. They can get a drug a bad name (giving rival companies ammunition to attack the new drug) and even halt trials, whether or not they have occurred by chance. It is therefore often held that it is more sensible to exclude those with a greater likelihood of such events (e.g. the ill, the disabled, those with co-morbid physical disorder, those who are on other medication) from trials entirely rather than run the risk of losing the investment.

In reality there is a continuum with the gap between efficacy and effectiveness depending on the disorder being studied and the simplicity of the intervention. In our example of the perfect trial, efficacy is the same as effectiveness. However in many studies the gap between research findings and clinical applicability may be huge. An example of this is provided by a study of donepezil, a treatment for Alzheimer's disease (Rogers and Friedhoff 1996) whose exclusion criteria are listed in Box 10.4. No mention is made in this paper of exclusions on the grounds of concurrent medication, but this is also likely given that in follow-up trials (Rogers et al. 1998a,b) patients on anticholinergics, anticonvulsants, antidepressants, and antipsychotics were also excluded as well as potentially those taking other drugs with CNS activity. To me, as a clinical old age psychiatrist, these exclusions seem to result in a study population about as far away from those I am called to assess as possible, so bringing into question the applicability of the findings to clinical practice. This is not to say that such drugs are not of use (they may well be), but the evidence that we are often presented with means that real clinical practice is informed by an extrapolation of trial data rather than by its direct application.

Another striking example has been described by Yastrubetskaya et al. (1997). In a Phase III trial of a new antidepressant, 188 patients were screened and 171 (91%) of them met the inclusion criteria of having sufficiently severe depression for the trial. However, when the multiple exclusion criteria were applied to this real-world sample of people with depression, only 8 (4.7%) of those eligible for inclusion could be recruited into the trial. Furthermore, at least 70% of the original sample required antidepressant treatment and were provided with it. Perhaps the way through these conundrums is to acknowledge that each study type has its place and it may be that at times a single study can provide efficacy data which are so clearly generalizable to clinical populations that they are in effect close to effectiveness data. However clinicians and purchasers need to know whether an intervention works in the real world. It may be that licensing organizations such as the MCA and FDA can suggest that such data are desirable and base their decisions more directly upon their availability. Alternatively where there are efficacy data but no data on clinical effectiveness, it may be necessary to commission effectiveness studies, either before or after licensing. This raises questions of who could and should fund such work. Should funding for effectiveness trials be levied from drug companies as a necessary part of the licensing procedure? Or should governmental and private agencies involved in purchasing health care fund such work? In either case there is a strong case for such trials being independent of those who stand to gain by the sale of the intervention and also those who stand to pay by purchasing the intervention.

Elements of a randomized controlled trial

In this section we will deal sequentially with some of the major practical elements of the design and conduct of RCTs. These cannot be dealt with in detail here due to constraints of space and the reader is referred to comprehensive texts such as Pocock's Clinical trials: a practical approach (1983) for further details and Last's A dictionary of epidemiology (1995) for succinct explanations of terms.

Review the literature

Before embarking on a trial there is a need to investigate systematically the existing evidence to ascertain the current state of the therapeutics of the disorder being studied and of the intervention being proposed. If there is already compelling evidence in the public domain then it may not be necessary, and therefore ethical, to carry out a trial. Techniques for the conduct of systematic reviews are increasingly well developed and the methods and outputs of the Cochrane Collaboration (see Chapter 11) should be consulted and used.

Clear formulation of a single primary hypothesis to be tested

The trial needs to have a single primary hypothesis which is to be tested by means of the RCT. There may then be secondary hypotheses but these should be limited to avoid multiple significance testing and ‘data dredging’. At this point the level of statistical significance should be set for the primary (e.g. p < 0.05) and secondary (e.g. p < 0.01) hypotheses.
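
The reason for limiting the number of hypotheses can be illustrated with a short calculation. The sketch below assumes that the tests are independent and that each is run at the same significance level, which is a simplification; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests,
    each carried out at significance level alpha, when no true effect exists."""
    return 1 - (1 - alpha) ** n_tests


# How quickly the chance of a spurious 'significant' finding grows.
for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
```

Under these assumptions a single test at 0.05 carries a one in twenty risk of a false positive, while ten such tests carry a risk of roughly four in ten that at least one will appear significant by chance.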

Specify the objectives of the trial

The hypothesis should be stated clearly and simply, for example, ‘the objective of the study is to test if Treatment B is more effective than Treatment A in Disease X’. However this means that you need to have decided what constitutes ‘more effective’. If we know that A gets 30% of people with X better how much more effective does B need to be? These are complex questions when A is an acceptable, economic, and widely available treatment with known side effects. We may say that we are only interested in B if it gets another 20% of people with X better, but to an extent these figures will always be arbitrary. They are however vital to the study design since the effect size being sought will determine the size of the study, with larger studies required to detect smaller differences.

It is at this point that the pre-study power calculations need to be completed. At the very least these will require:

  • an estimation of the treatment effect of your comparison group (i.e. the percentage of people with X who respond to known Treatment A, or in the case of a placebo-controlled trial the spontaneous recovery rate from X);

  • an estimate of the minimum effect size of your new Treatment B for it to be considered useful;

  • the level of statistical significance required for it to be accepted that there is indeed a true difference between the two groups. This is generally set at 0.05, that is, a random ‘false positive’ result (type I error) is acceptable on one in twenty occasions;

  • the ‘power’ of the study to detect effect. This is generally set at 80–90%, that is, ‘false negative’ results (type II error) are acceptable on between one in five and one in ten occasions.

Power calculations are not generally complex but specialised statistical help is advisable. Lower acceptable rates of types I and II error and smaller potential differences in effect require larger numbers of participants. Recruitment targets also need to be inflated to allow for those who will withdraw from the study and those who are lost to follow-up.
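
The four ingredients listed above can be combined in a short sketch. It uses one common normal-approximation formula for comparing two proportions with equal allocation and a two-sided test; this is an assumption made for illustration, not necessarily the method a trial statistician would choose. The response rates follow the example in the text (30% on Treatment A, 50% sought on Treatment B), and the 15% attrition figure is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(p_control, p_new, alpha=0.05, power=0.80):
    """Approximate number of participants needed in each arm.

    Normal-approximation formula for two proportions, two-sided test,
    1:1 allocation. One textbook approximation among several.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against type I error
    z_beta = NormalDist().inv_cdf(power)           # guards against type II error
    variance = p_control * (1 - p_control) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_new) ** 2
    return math.ceil(n)


def inflate_for_attrition(n, expected_loss):
    """Raise the recruitment target to allow for withdrawal and loss to follow-up."""
    return math.ceil(n / (1 - expected_loss))


n80 = n_per_group(0.30, 0.50, power=0.80)  # A helps 30%; B must help 50%
n90 = n_per_group(0.30, 0.50, power=0.90)
n_small_effect = n_per_group(0.30, 0.40)   # a smaller difference to detect
print(n80, n90, n_small_effect, inflate_for_attrition(n80, 0.15))
```

Under these assumptions the sketch gives roughly 90 participants per arm at 80% power and roughly 120 at 90% power, and the requirement rises steeply when the difference sought is halved, which matches the point that smaller differences and lower error rates require larger studies.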

The statement of the study objectives should form the start of a detailed study protocol which sets out why the study is being carried out and exactly how the study will be conducted and analysed. This will form the basis for the ethical approval which is necessary for all trials.

Define the reference population

Define the population to which you wish to generalise. In the case of the donepezil trial discussed above, this might have been ‘extraordinarily fit and well people with Alzheimer's Disease’.

Select study population

It is not generally feasible to create a list of all people in a reference population and randomly select cases for inclusion in the study, unless the disorder is very rare and there are very careful records. It is more common that the work will be focussed in a single or a few sites for ease of administration and to control quality. These centres should ideally be representative of centres as a whole and their patients representative of the reference population. Reliance upon research-friendly ‘centres of excellence’ may compromise this. Even in the most inclusive of effectiveness studies there will be entry criteria to be applied to potential participants; these may be simple (e.g. age) or complex (e.g. stage of disorder). One should be careful not to select a study population which automatically and irrevocably limits the applicability of any findings obtained.

Participant identification and recruitment

In this phase participants are recruited by the plan set out in detail in the protocol. Since the RCT may be being carried out simultaneously at multiple sites, it is important that the same processes are adhered to in all study centres so that any selection bias can be minimized. Comprehensive and up-to-date lists of possible cases will need to be drawn up and used as a sampling frame from which to randomly sample cases for assessment. Those participants who meet the pre-determined inclusion and exclusion criteria are eligible for entry into the study. A fairly solid rule is that the more exclusions there are, the more compromised is the generalizability of the study. An increasing number of scientific journals now require the CONSORT guidelines to be followed before they will publish a trial (Begg et al. 1996). These include a flow diagram summarising the effect of all inclusion and exclusion criteria and loss to follow-up through the study (see Fig. 10.1). The presentation of such data is an invaluable aid to assessing generalizability and therefore the clinical robustness of a study. Those studies presented without such data should be appraised with care.

Fig. 10.1 Flow diagram summarizing progress of participants through both arms of a randomized controlled trial (Begg et al. 1996).


Informed consent

There is insufficient space here for a detailed consideration of ethical issues in RCTs, and major issues are well summarized by Edwards et al. (1998). If there is no doubt of the efficacy of an intervention then there is no ethical reason for withholding it, and such withholding is implicit in a trial. If there is insufficient evidence of the potential for efficacy then there may be poor ethical grounds for conducting a trial and such evidence should be collected using other methodology. If a trial has insufficient statistical power to demonstrate the required difference between intervention and control groups then again it is unethical since it cannot provide useful data. Equally, poorly designed trials where any observed difference may be a function of bias or confounding are also unethical on the same grounds.

If the RCT design is satisfactory there remains the problem of recruiting participants into the study and the dilemma of how to obtain truly informed consent. In this chapter, I will leave to one side the issue of capacity to consent, which is of importance in mental health research, not only for people with dementia and learning disabilities, but also for people with psychotic and other disorders. Obtaining informed consent may involve a tension between the requirement to provide full information and the objectives for the trial itself. Comprehensive details of every conceivable risk may reduce participation, potentially compromising recruitment, generalizability, and the possibility of important therapeutic advances. Participants will need to receive written and verbal information on the trial and have the chance to discuss any questions they might have. They might also need time to consider and consult with family. All of this is time-consuming and difficult for research teams. Consent will almost always need to be written and witnessed with stipulations of being able to withdraw at any time without giving any reason and without such withdrawal compromising their medical care in any way. These documents need to be submitted to and approved by an appropriate research ethics committee.

Silverman and Chalmers (2001) have summarized elegantly some of these ethical issues and the value of random allocation of treatment: ‘. . . when there is uncertainty about the relative merits of the double edged swords we wield in medicine today, we are wise to employ this ancient technique of decision making. It is a fair way of distributing the hoped for benefits and the unknown risks of inadequately evaluated treatments’. There is a tension where recruiting physicians stand to gain from recruiting individuals into trials. This gain may be direct, such as financial payments from the pharmaceutical industry per participant recruited, or indirect, mediated by the scientific kudos of completing a trial or being seen as successful by peers and seniors. In this context it is of great concern that reports from physicians recruiting patients into trials indicate that a half to three-quarters thought that few of the patients they had recruited understood the trial even though they had given written consent (Spaight et al. 1984; Blum et al. 1987; Taylor and Kelner 1987). In the circumstances that apply in a trial how good are doctors in protecting their patients’ rights?

Baseline measurements

The literature review will have pointed to important possible confounding variables. These need to be measured with accuracy so that their potential effect on the outcome can be measured and controlled for in the analysis.

At this stage social, demographic, and other variables of interest (e.g. financial state, service use) which might change as part of the study need to be recorded. In mental health studies we seldom have hard outcomes such as unequivocal disease-related death. ‘Change’ scores are probably the most frequently used alternative. The measurement of the presence and/or severity of the disorder at recruitment is therefore a vital consideration. This must be achieved accurately and dispassionately, without conscious or unconscious bias, and without any knowledge of which treatment group the individual will be randomized to.

Randomization

Randomization is the single most powerful element of the RCT design. Its purpose is to ensure that all variables which might have an effect on outcome (known and unknown) other than the intervention(s) being studied are distributed as equally as possible between the intervention and the control group so that the effect of the intervention can be accurately estimated.

The application of randomization has developed over the course of the twentieth century. Its roots however are deeper: as early as 1662 a chemist named van Helmont advocated the drawing of lots to compare the effectiveness of competing contemporary treatments (Armitage 1982). His excellent concise protocol suggested: ‘Let us take out of the hospitals . . . 200, or 500 poor People that have Fevers, Pleurises, etc. Let us divide them into half, let us cast lots, that one half may fall to my share, and the other to yours . . . We shall see how many funerals both of us shall have’. Another early proposal for the random allocation of treatments in human health referred to studies of cholera and typhoid in the first decade of the twentieth century (Greenwood and Yule 1915; Pocock 1983). While randomization was first actually applied in agricultural research in 1926 (Box 1980), stratified randomization of matched groups was used by Amberson et al. (1931) to investigate the efficacy of a gold compound in pulmonary tuberculosis (TB). However the first trial to be reported which used full randomization, using in this case sealed envelopes, was the Medical Research Council's (MRC) careful and methodologically advanced trial of streptomycin in TB in the late 1940s (MRC 1948). It is interesting to note that this trial was only ethically possible because the ‘small amount of streptomycin available made it ethically permissible for the control participants to be untreated by the drug . . . ’ (D’Arcy Hart 1999).

Randomization uses individual-level unpredictability to achieve group-level predictability. So if randomization is on a 1:1 basis, we have no idea whether the individual in front of us will be randomized to the intervention or the control group. The result is that there will be a predictably equal distribution between the two groups of known and unknown potential confounders.

In small trials there is the possibility that variables of interest will not assort equally across the intervention and control groups. This may be controlled by the use of stratification at baseline (although this also imposes complexity) or by adjusting the analysis for baseline variables (Roberts and Torgerson 1999).
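One common way of implementing stratification at baseline is to generate a separate allocation sequence for each stratum, built from small balanced blocks whose internal order is shuffled. The following is a minimal sketch of that idea, not a description of any particular trial's procedure: the use of permuted blocks, the block size of four, the stratum names, and the function names are all assumptions made for illustration.

```python
import random

ARMS = ("intervention", "control")

def permuted_block_sequence(n_blocks, block_size=4, seed=None):
    """Build an allocation sequence from shuffled blocks that are balanced 1:1."""
    if block_size % len(ARMS) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        block = list(ARMS) * (block_size // len(ARMS))
        rng.shuffle(block)  # the order within each block is unpredictable
        sequence.extend(block)
    return sequence

def stratified_sequences(strata, n_blocks, block_size=4, seed=0):
    """Generate one independent permuted-block sequence per stratum."""
    rng = random.Random(seed)
    return {
        stratum: permuted_block_sequence(n_blocks, block_size, seed=rng.getrandbits(32))
        for stratum in strata
    }

# Hypothetical strata based on baseline severity.
sequences = stratified_sequences(["mild", "severe"], n_blocks=3)
for stratum, sequence in sequences.items():
    print(stratum, sequence.count("intervention"), sequence.count("control"))
```

Within each stratum the two arms are balanced after every completed block, which is what limits chance imbalance in a small trial. In practice such a sequence would be generated and held by someone independent of recruitment.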

It is vital that the process of randomization is removed entirely from recruiting researchers since any knowing or unknowing compromise of the chance element in group allocation will undermine the whole basis of the study (Schulz 1995; Altman and Schulz 2001). This will usually require the involvement of a third party who can ensure that strict randomization is implemented (e.g. telephoning with the name/study number and only then being assigned a randomization code). The method of assignment and concealment of allocation are important components. If there is an open list of random numbers (or if date of birth or hospital numbers are used) then the process of recruitment is open to influence. For example, if we were interviewing someone and felt that she might have a poor response to the treatment we were testing, and we had worked out that, because she had an even-numbered birthday (or hospital number we had glimpsed on an appointment card), she would be allocated to the intervention group, we might knowingly or unknowingly, in the process of gaining informed consent, discourage her from participation. Equally, if we knew she would be in the control group then we might knowingly or unknowingly encourage her to participate.

Altman and Schulz (2001) suggest that there are two main requirements for adequate concealment of allocation. First, the person generating the allocation sequence must not be the same person determining whether a participant is eligible and enters the trial. Second, the method for treatment allocation should not include anyone involved in the trial. Where the second is not possible they conclude that the only other plausible method is the use of serially numbered, opaque sealed envelopes, although this may still be open to external influence (Schulz 1995; Torgerson and Roberts 1999). In practice, given the expense and complexity of trial design and the vital role that randomization and concealment of allocation play in a trial, it should be a priority to set up an external incorruptible system.

In a useful systematic review Kunz and Oxman (1998) demonstrated that studies which failed to use adequately concealed random allocation generated distorted effect size estimates, with the majority overestimating effect. The effects of not using such concealed allocation were often of comparable size to those of the interventions. Another study by Schulz et al. (1995) suggested that RCTs without adequately concealed randomization produce effect size estimates that are 40% higher than trials with good quality randomization. They concluded that while the main effect was to produce a poorer response in the control group, there were also occasions where effects of interventions were obscured or reversed in direction. These data provide strong support for the use of robust and concealed methods for randomization and the need to be very sceptical about data from trials not using, or not declaring that they used, such methodology.

In this chapter the focus has been on simple individual intervention and randomization. However, the unit of randomization need not be the individual. In a trial of a general practitioner (GP) psychoeducational package the unit might be the GP or a group of GPs in an individual practice, even though its efficacy is assessed by measurement of their patients. Equally, where the intervention is population wide, as in trials of water fluoridation to prevent dental caries, the unit of randomization will be the entire catchment area of a reservoir system. Statistical power in such cluster randomization depends more on the number of clusters (i.e. the number of units of randomization) than the numbers within the clusters, as well as the intra-cluster correlation of outcome (Kerry and Bland 1998).
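The dependence of power on the number of clusters can be illustrated with the widely used design effect, 1 + (m − 1) × ICC, where m is the number of participants per cluster and ICC is the intra-cluster correlation. The sketch below assumes equal cluster sizes; the formula is a standard approximation rather than something given in this chapter, and the function names and example figures are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from randomizing clusters of equal size."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent individuals the clustered sample is 'worth'."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)

# Two hypothetical designs, each with 1000 participants and ICC = 0.05.
few_large = effective_sample_size(n_clusters=10, cluster_size=100, icc=0.05)
many_small = effective_sample_size(n_clusters=20, cluster_size=50, icc=0.05)
print(round(few_large), round(many_small))
```

With the same total number of participants, the design with more (and smaller) clusters retains a larger effective sample size, which is why adding clusters generally buys more power than adding people to existing clusters.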

Intervention, control groups, and blinding

At its most simple, participants are randomized into an intervention or a control group. The intervention group receives the novel treatment and the control group a placebo if there is no established treatment—or the best established treatment if there is one. If the study design is sound and the randomization robust then the control group should differ from the intervention group only in the treatment allocated to it.

The problems start when either the participant, the clinical staff or the researchers can work out which group they are in. There are fewest problems with drug trials. Placebos or active control treatments can be formulated to look like the novel treatment. Inert placebos may, however, be discernible from active interventions if they differ in side effects (e.g. anti-cholinergic side effects) which may alert a patient or clinician to the intervention status. The use of placebos which contain side-effect-mimicking compounds may partially address this difficulty. Problems are far greater when the intervention cannot be concealed, for example, in a trial of psychotherapy. It is salutary to bear in mind Bradford Hill's (1963) defence of not subjecting control patients in the MRC Tuberculosis Trial to the four months of four times daily intramuscular injections which the streptomycin intervention group received, that there was ‘no need in the search for precision to throw common sense out of the window’.

Blinding is different from concealment of random allocation, and concerns the degree to which participants and/or researchers are unaware of intervention groups after these have been allocated. This is an important tool in minimizing potential bias but may not always be possible depending on the nature of the intervention. Single blind studies usually describe a situation where the participant does not know their group but the researcher does. In a double blind study both the participant and the investigators are unaware of group membership. In a triple blind study the participants, the researchers, and the statisticians analysing the data are unaware of group membership. In an open trial everybody knows what is going on and making solid inferences can therefore be difficult. Blinding is not only important in RCTs: for example, the performance of diagnostic tests may be overestimated when the test result is known (Lijmer et al. 1999).

A particular issue for RCTs in mental health is the complexity of the intervention. Procedures which we wish to evaluate are often multifaceted and multidisciplinary rather than confined to different tablets (Banerjee and Dickenson 1997). This compromises blinding and may make intervention seem less precise. Certainly it can be difficult to pinpoint what element of intervention may be of help. These are issues across the whole of health care and the UK Medical Research Council (MRC) has published a framework for the design and evaluation of complex interventions to improve health (Campbell et al. 2000). In this the authors deal with interventions that are ‘made up of various interconnecting parts’ citing examples including the evaluation of specialist stroke units and group psychotherapies. They identify a lack of development and definition of the intervention as a frequent difficulty, and propose a five stage iterative process of trial development (Box 10.5).

Follow-up and reassessment

Given the obstacles to be overcome in getting this far, it is unlikely that assiduous attempts will not be made to follow up all participants in both groups. However the longer the study the more likely there are to be drop outs due to defaulters, and people moving or dying. It is important to attempt as complete a follow-up as possible and to get as much information on those lost to follow-up as possible since incompleteness introduces bias. Assessment of outcome may occur continuously during the trial (e.g. mortality in a cancer chemotherapy trial), intermittently at multiple predetermined timepoints or simply at the end of the defined period of the trial.

A cardinal rule is that assessment of outcome should be completed by a researcher who is blind to randomization group membership. This requires that personnel for the recruitment and the follow-up stages do not assess the same people if they have any knowledge of randomization group. Also, any information which might unblind the assessor should be either collected in a different way or left to the end so as not to influence the assessment of outcome in any conscious or unconscious way.

Outcome measures should preferably be understandable if they are to be influential in changing clinical practice. Most drug trials rely on rating scales which generate continuous scores (e.g. of depression or cognitive impairment) where these have been widely used and are held to be sensitive to differential change with treatment. However it may be difficult to assess, for example, what a two point change on the Hamilton Depression Rating Scale or the ADAS-Cog actually means in a clinical situation. Clinically relevant outcomes such as recovery from depression need to be used more widely. There is also an increasing role for measures which take a more holistic view of the participant and the impact of the experimental intervention, such as health related quality of life (Guyatt et al. 1998).

Major trials will generally have a monitoring committee set up to inspect the data that emerge from the trial before completion. Its remit is to decide whether the trial needs to be stopped early. This may be because of accumulating evidence for a strong benefit from the experimental intervention, or evidence that it appears to be harmful. Trials may also be stopped if they appear to be futile: that is, where interim analyses show that there is no treatment benefit and that the remaining trial would not allow for a benefit to become manifest. Important considerations in setting up such committees are the need for confidentiality, regular review and pre-agreed criteria for discontinuation (Pocock 1992; Flemming and De Mets 1993).

‘Intention to treat’ analysis

The strategy for statistical analysis should be specified prior to the commencement of the study. The primary and secondary hypotheses are tested by comparing the outcomes of the intervention and the control groups. This will usually involve a multivariate analysis to control for the effects of any unevenly distributed confounders and to attempt to delineate the size of the treatment effect. Where continuous measures are used statistical methods such as analysis of covariance (ANCOVA) may be appropriate (Vickers and Altman 2001). As with cohort studies a major concern in RCTs is the completeness of follow up. It is often hard to persuade participants to stay in the trial, and if they drop out of treatment, it is usual for studies to collect no further information on them. This causes problems, especially if there is differential drop out between the treatment and control groups. If, for example, an antidepressant was highly effective in those who could tolerate it, but caused such unpleasant side effects that over half of participants dropped out of treatment, an analysis which compared outcome on just those who completed treatment and the placebo group (in whom only 10% dropped out of treatment), would tend greatly to exaggerate the effectiveness of the treatment. This would, in effect, be a form of selection bias.

Ideally, follow-up information should be collected on as many participants as are randomized. Follow-up information should be collected even if the participant has dropped out of treatment, because the outcome of those who are unable to tolerate treatment, or drop out because it is ineffective, is just as important as that of ‘completers’. This process can be greatly enhanced if the outcome is a simple one (e.g. mortality), but in psychiatric RCTs there is a tendency to measure outcome on complex symptom inventories. The approach where data on all randomized participants are analysed irrespective of how much of the treatment they have received, is referred to as ‘intention to treat analysis’. Using this form of analysis it is irrelevant whether an individual complied with treatment since it is the offer of the treatment which is being evaluated. As Last (1995) states ‘failure to follow this step defeats the main purpose of random allocation and can invalidate the results’.
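The contrast between a ‘completers only’ comparison and an intention to treat analysis can be made concrete with some arithmetic. The sketch below uses entirely hypothetical counts, chosen to resemble the antidepressant scenario described earlier (over half of the drug group and 10% of the placebo group dropping out of treatment), and it assumes that outcome data were obtained on those who dropped out. The `Arm` class and its field names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Arm:
    completers: int
    completers_recovered: int
    dropouts: int
    dropouts_recovered: int

    def completer_rate(self):
        """Recovery rate among those who finished treatment only."""
        return self.completers_recovered / self.completers

    def itt_rate(self):
        """Recovery rate among everyone randomized to this arm."""
        randomized = self.completers + self.dropouts
        recovered = self.completers_recovered + self.dropouts_recovered
        return recovered / randomized

# Hypothetical trial with 100 participants randomized to each arm.
drug = Arm(completers=45, completers_recovered=36, dropouts=55, dropouts_recovered=11)
placebo = Arm(completers=90, completers_recovered=36, dropouts=10, dropouts_recovered=2)

completer_difference = drug.completer_rate() - placebo.completer_rate()
itt_difference = drug.itt_rate() - placebo.itt_rate()
print(round(completer_difference, 2), round(itt_difference, 2))
```

In this invented example the difference in recovery rates among completers is several times larger than the difference among all those randomized, which is the exaggeration that intention to treat analysis is designed to avoid.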

Inevitably, there are situations where it is impossible to gain full information on participants, and researchers have to account for such incomplete data. Probably the best approach is to present as much information as can be gathered from the data collected, and to then perform sensitivity analyses which account for missing data. One of the most common is to perform a ‘last observation carried forward’ analysis, where endpoint data are substituted with previous results. This is usually a conservative approach, but in situations where there is differential drop out, for example, in a psychotherapy trial comparing an active treatment with treatment as usual, it is common for there to be more drop outs in the ‘treatment as usual’ group. Using ‘last observation carried forward’ in this situation would tend to lead to the treatment group appearing to have improved more. Another approach is to impute missing values using regression techniques. Using multi-level modelling is another suitable approach, which is particularly useful for handling missing data. However, no statistical approach is a substitute for good study design and conduct which minimize drop outs.
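As an illustration of ‘last observation carried forward’, the sketch below fills each missing assessment with the most recent observed score for the same participant. It is a minimal example under assumed conventions: missing visits are represented by `None`, scores are listed in visit order, and the function name and the example scores are hypothetical.

```python
def last_observation_carried_forward(scores):
    """Replace each missing (None) score with the latest earlier observed score."""
    filled = []
    last_seen = None
    for score in scores:
        if score is not None:
            last_seen = score
        filled.append(last_seen)  # stays None if nothing has been observed yet
    return filled

# Hypothetical depression rating scores at baseline and three follow-up visits;
# the participant dropped out after the second assessment.
observed = [24, 20, None, None]
print(last_observation_carried_forward(observed))
```

The participant's endpoint value becomes the score recorded at the last attended visit, so any change that would have occurred after drop out is treated as zero.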

Interpretation of data

Following analysis, the data need to be presented in a way that can be understood. This is helped by the use of clinically relevant outcome measures, but the paraphernalia of statistical inference can be difficult to penetrate. The presentation of the number needed to treat may aid comprehensibility. This is the number of patients from your study population who need to be given the new intervention for the study period in order to achieve the desired outcome (e.g. recovery), or to prevent an undesired one (e.g. death). It is calculated as the reciprocal of the risk difference between treatment group and control (Sackett et al. 1991). If, for example, the outcome is recovery from depression, and the risk of recovery at 6 weeks in the treatment group is 0.7, whereas the risk of recovery in the control is 0.5, the risk difference is 0.2, and the NNT 5. To illustrate the meaning of this, one could imagine 10 patients receiving the control condition and five of them getting better. If the same number was given the new treatment, seven would get better. Therefore in 2 out of 10 the treatment would have been responsible for recovery (assuming that the results of the trial were valid), in other words, one would need to treat 5 patients with the treatment, to bring about one recovery attributable to the treatment. The NNT has the unusual characteristic of having a null value of infinity (i.e. one would need to treat an infinite number of patients to bring about a recovery, if the treatment was no better than control), and therefore where a non significant finding is being reported the 95% confidence intervals will span infinity (e.g. NNT = 40, 95% CI 25, ∞, –50). A negative value on the NNT suggests that the treatment is doing harm.
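The NNT arithmetic can be sketched in a few lines. The example below reproduces the depression figures given above and also inverts a confidence interval for the risk difference to show why the interval for a non-significant result passes through infinity. The function names are hypothetical, and the risk-difference limits of −0.02 and 0.04 are assumed values chosen so that they invert to the illustrative limits of –50 and 25.

```python
import math

def number_needed_to_treat(risk_treatment, risk_control):
    """Reciprocal of the risk difference; infinite when there is no difference."""
    risk_difference = risk_treatment - risk_control
    if risk_difference == 0:
        return math.inf
    return 1 / risk_difference

def nnt_limits(rd_lower, rd_upper):
    """Invert non-zero risk-difference limits into NNT limits.

    Returns (limit from the upper bound, limit from the lower bound,
    whether the interval passes through infinity).
    """
    if rd_lower == 0 or rd_upper == 0:
        raise ValueError("limits of exactly zero correspond to an infinite NNT")
    spans_no_effect = rd_lower < 0 < rd_upper
    return 1 / rd_upper, 1 / rd_lower, spans_no_effect

# Recovery from depression: 0.7 with treatment versus 0.5 with control.
print(round(number_needed_to_treat(0.7, 0.5), 2))

# Assumed 95% CI for a risk difference that includes zero.
benefit_limit, harm_limit, through_infinity = nnt_limits(-0.02, 0.04)
print(round(benefit_limit), round(harm_limit), through_infinity)
```

Here the outcome is a desirable one, so a positive NNT indicates benefit and a negative one harm; for an undesired outcome such as death the subtraction would be reversed.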

Publication, communication, and dissemination

Once a trial has been completed its findings need to be communicated. This usually requires the preparation of a scientific paper or series of papers and their submission to peer reviewed journals. This is a quality control measure, designed to assess the robustness and scientific strength of the study and its conclusions. Unfortunately publication bias can lead to a tendency for editors and authors to prepare and publish new and significant data rather than replications or negative findings. This can distort an estimation of the true effect of findings and the techniques of systematic review and meta-analysis have developed to attempt to locate unpublished data and to incorporate it into aggregate estimates of effect size. Publication should be seen as the start of a communication strategy for novel findings as it is by drug companies. McCormack and Greenhalgh (2000) have argued powerfully, using data from the UK prospective diabetes study, that there can be a problem with interpretation bias at this stage of a trial with widely disseminated conclusions being unsupported by the actual data presented in the papers. They identified powerful motivations for researchers, authors, editors, and presumably other stakeholders such as the drug industry and the voluntary sector to impart positive spin to trial data. The interpretive biases included the following:

  • ‘We've shown something here’ bias—researcher enthusiasm for a positive result;

  • ‘The result we've all been waiting for’ bias—prior expectations moulding interpretation;

  • ‘Just keep taking the tablets’ bias—overestimating the benefits of drugs;

  • ‘What the hell can we tell the public’ bias—political need for high impact breakthroughs;

  • ‘If enough people say it, it becomes true’ bias—a bandwagon of positivity preceding publication.

That said, there is a need to ensure that important findings are made available, in an accessible form, to those who formulate policy and purchase services and also to those who use them. Andrews (1999) has argued that mental health services may be particularly resistant to changing practice on the basis of empirical evidence, citing the persistence of psychoanalytic psychotherapy and the lack of implementation of family interventions for people with schizophrenia as examples.

Additional information on randomized controlled trials

In this section we will consider some of the more common supplementary questions raised by RCTs.

The phases of pharmaceutical trials

The meanings of and distinctions between the various phases of new drug development can seem opaque. They are best viewed as the necessary processes which need to be completed so that the drug company can satisfy regulatory authorities such as the United States Food and Drug Administration (FDA) or the United Kingdom's Medicines Control Agency (MCA). These phases explicitly refer only to experiments on human participants; a substantial programme of in vitro and animal experiments, which is beyond the scope of this chapter, will have been completed before the Phase I trials begin.

Phase I: Clinical pharmacology and toxicology. This represents the first time a drug is given to humans—usually healthy volunteers in the first place followed by patients with the disorder. The purpose is to identify acceptable dosages, their scheduling and side effects. These are most often carried out in a single centre, requiring 20–80 patients.

Phase II: Initial evaluation of efficacy. These trials are to determine whether the compound has any beneficial activity. They continue the process of safety monitoring and require close observation. They may be used to decide which of a number of competing compounds go through to Phase III trials. They may be single or multi-centre, and generally require 100–200 patients.

Phase III: Evaluation of treatment effect. This is a competitive phase where the new drug is tested against standard therapy or placebo. There may also be a further element of optimal dose finding. The format for this evaluation is that of an RCT. This phase usually requires large numbers (100s–1000s) and therefore a multi-centre design.

Phase IV: Post-marketing surveillance. Once a drug has been put on the market, there is a need to continue monitoring for rare and common adverse effects including mortality and morbidity. These may only become evident when the drug is used in large numbers in real clinical populations.

Other types of trial

Crossover trials: In a crossover trial the intention is that each patient acts as their own control. Participants are randomized to receive either the intervention or the control first; after a wash-out period they then receive the treatment not received in the first phase.

Factorial trials: Interventions may be given alone or together so that their individual and joint effects can be evaluated.

Community trials: As discussed above, sometimes the unit of randomization (i.e. the entity to which the treatment is given) is not an individual but a community. A good example of this is water fluoridation for dental disease, which can only be achieved on a population basis, so that the reservoir and the population it serves become the unit of randomization. The number within each community is of secondary importance and may add little to the statistical power of the study. With such interventions there are clear possibilities of problems with compliance (e.g. choosing to drink bottled water only), contamination (travel to fluoridated communities) and blinding (it may be politically unacceptable to prevent the community from knowing what is being done to them).

Other intervention designs

Comparison with historical controls

In this design a group of people are treated with a novel intervention and their progress is compared with a group that has been studied in the past with a different or no treatment and whose outcome is known. A major problem with this approach lies in the other changes which may have occurred as well as the intervention (e.g. lifestyle, diet, healthcare delivery, and other risk and prognostic factors). It may be difficult or impossible to adjust for these factors if they have been incompletely recorded or measured in a different way.

Simultaneous comparison of differently treated groups

This design strategy is subject to bias since there is seldom any element of randomization. The groups for treatment are usually selected either by the treating physicians or by the patients themselves and so are very unlikely to be representative of all individuals with the disorder. It may be impossible to adjust for the effects of variables other than the treatment being studied such as other healthcare provided, illness severity, and concomitant disorders. Inferences concerning the relative efficacy of the interventions may therefore be limited substantially. The same problems are associated with ‘waiting list control’ evaluations where there is the possibility of any discretion on the part of patients or treating physicians.

Patient preference trials

In trials of this sort the patient takes a more or less active role in deciding which of the treatment arms he or she will complete. This clearly compromises the power of randomization and blinding. However, such approaches may be necessary where the belief systems of a study population mean that a standard RCT is not possible. In these trials patients with strong views as to treatment are given the intervention they want and those without preferences (and those with preferences who still agree to randomization) are randomized in the normal way. The data from such studies are often difficult to interpret and they will usually need to be very large to enable adequate statistical power for between-group comparisons.


  1. Discuss the following with respect to their implications for a randomized controlled trial:

    (a) Extent of exclusion criteria—can you identify a true ‘effectiveness’ study in your field of research? Is this important?

    (b) Randomization—how can this be achieved in service-level research?

    (c) Blinding—is this feasible in psychiatry?

    (d) Sample size—can a study ever be too large?

    (e) Intention to treat analysis—what outcomes should be assigned to people who are immediately lost to follow-up?

  2. Pick a controversial treatment in a chosen area of practice—ideally an intervention which has proven efficacy but which is not yet fully accepted by clinicians and/or ‘established’ as cost-effective. Divide the students into three groups. One group are to represent patients or their advocates, one group are to represent prescribing doctors (or those who will deliver the treatment), and one group are to represent policy makers (e.g. advisors to a health minister) who have to consider the possible costs of the treatment (and assume that what is spent on this will have to be taken away from other aspects of care). Allow the groups about 30 min to prepare a brief presentation and let the fight commence! (Note: the choice of the treatment will depend on the nature of the class and current wider debate. Previously successful examples have included atypical antipsychotic agents, anticholinesterase treatments for Alzheimer's disease and novel pharmacological interventions for smoking cessation. The purpose of the exercise is to emphasize that proof of efficacy is only the beginning . . . )


Altman, D. G. and Bland, J. M. (1999) Treatment allocation in controlled trials: why randomise? British Medical Journal, 318, 1209.

Altman, D. G. and Schulz, K. F. (2001) Concealing treatment allocation in randomised trials. British Medical Journal, 323, 446–7.

Amberson, J. B., McMahon, B. T., and Pinner, M. (1931) A clinical trial of sanocrysin in pulmonary tuberculosis. American Review of Tuberculosis, 24, 401–35.

Andrews, G. (1999) Randomised controlled trials in psychiatry: important but poorly accepted. British Medical Journal, 319, 562–4.

Armitage, P. (1982) The role of randomisation in clinical trials. Statistics in Medicine, i, 345–52.

Banerjee, S. and Dickenson, E. (1997) Evidence based health care in old age psychiatry. Psychiatry in Medicine, 27, 283–92.

Begg, C., Cho, M., Eastwood, S., et al. (1996) Improving the quality of reporting of randomized controlled trials: the CONSORT statement. Journal of the American Medical Association, 276, 637–9.

Benson, K. and Hartz, A. J. (2000) A comparison of observational studies and randomised controlled trials. New England Journal of Medicine, 342, 1878–86.

Blum, A. L., Chalmers, T. C., Deutch, E., Koch-Weser, J., Rosen, A., Tygstrup, N., et al. (1987) The Lugano statement on controlled clinical trials. Journal of International Medical Research, 15, 2–22.

Box, J. F. (1980) R. A. Fisher and the design of experiments, 1922–1926. American Statistician, 34, 1–7.

Bull, J. P. (1959) The historical development of clinical therapeutic trials. Journal of Chronic Diseases, 10, 218–48.

Campbell, M., Fitzpatrick, R., Haines, A., Kinmouth, L., Sandercock, P., Spiegelhalter, D., and Tyrer, P. (2000) Framework for design and evaluation of complex interventions to improve health. British Medical Journal, 321, 694–6.

Concato, J., Shah, N., and Horwitz, R. I. (2000) Randomised controlled trials, observational studies and the hierarchy of research designs. New England Journal of Medicine, 342, 1887–92.

D’Arcy Hart, P. (1999) A change in scientific approach: from alternation to randomised allocation in clinical trials in the 1940s. British Medical Journal, 319, 572–3.

Edwards, S. J. L., Lilford, R. J., and Hewison, J. (1998) The ethics of randomised controlled trials from the perspective of patients, the public, and healthcare professionals. British Medical Journal, 317, 1209–12.

Evans, S. and Pocock, S. (2001) Societal responsibilities of clinical trial sponsors. British Medical Journal, 322, 569–70.

Flemming, T. R. and De Mets, D. L. (1993) Monitoring of clinical trials: issues and recommendations. Controlled Clinical Trials, 14, 183–97.

Greenwood, M. and Yule, G. U. (1915) The statistics of anti-typhoid and anti-cholera inoculations and the interpretations of such statistics in general. Proceedings of the Royal Society of Medicine, Sect Epidemiol State Med, 8, 113–94.

Guyatt, G. H., Juniper, E. F., Walter, S. D., Griffith, L. E., and Goldstein, R. S. (1998) Interpreting treatment effects in randomised trials. British Medical Journal, 316, 690–3.

Hill, A. B. (1963) Medical ethics and controlled trials. British Medical Journal, i, 1943.

Ioannidis, J. P. A., Haidich, A.-B., and Lau, J. (2001) Any casualties in the clash of randomised and observational evidence? British Medical Journal, 322, 879–80.

Kerry, S. M. and Bland, M. (1998) The intracluster correlation coefficient in cluster randomisation. British Medical Journal, 316, 1455–60.

Kunz, R. and Oxman, A. D. (1998) The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials. British Medical Journal, 317, 1185–90.

Last, J. M. (1995) A dictionary of epidemiology. Oxford University Press, Oxford.

Lievre, M., Menard, J., Bruckert, E., Cogneau, J., Delahaye, F., Giral, P., et al. (2001) Premature discontinuation of clinical trial for reasons not related to efficacy, safety, or feasibility. British Medical Journal, 322, 603–6.

Lijmer, J. G., Mol, B. W., Heisterkamp, S., Bonsel, G. J., Prins, M. H., van der Meulen, J. H., et al. (1999) Empirical evidence of design-related bias in studies of diagnostic tests. Journal of the American Medical Association, 282, 1061–6.

Lilienfield, A. M. (1982) Ceteris paribus: the evolution of the clinical trial. Bulletin of the History of Medicine, 56, 1–56.

Lind, J. (1753) A treatise of the scurvy. Sands, Murray & Cochran, Edinburgh.

McCormack, J. and Greenhalgh, T. (2000) Seeing what you want to see in randomised controlled trials: versions and perversions of the UKPDS data. British Medical Journal, 320, 1720–3.

Medical Research Council (1948) Streptomycin treatment of pulmonary tuberculosis. British Medical Journal, ii, 769–82.

Pocock, S. J. (1983) Clinical trials: a practical approach. Wiley, Chichester.

Pocock, S. J. (1992) When to stop a clinical trial. British Medical Journal, 305, 235–40.

Roberts, C. and Torgerson, D. J. (1999) Baseline imbalance in randomised controlled trials. British Medical Journal, 319, 185.

Rogers, S. L. and Friedhoff, L. T. (1996) The efficacy and safety of donepezil in patients with Alzheimer's Disease: results of a US multicentre, randomised, double-blind, placebo-controlled trial. Dementia, 7, 293–303.

Rogers, S. L., Farlow, M. R., Doody, R. S., et al. (1998a) A 24-week, double-blind, placebo-controlled trial of donepezil in patients with Alzheimer's Disease. Neurology, 50, 136–45.

Rogers, S. L., Doody, R. S., Mohs, R. C., et al. (1998b) Donepezil improves cognition and global function in Alzheimer's Disease. Archives of Internal Medicine, 158, 1021–31.

Sackett, D. L., Haynes, R. B., Guyatt, G. H., et al. (1991) Clinical epidemiology: a basic science for clinical medicine. Little Brown, Boston.

Sacks, H., Chalmers, T. C., and Smith, H. (1982) Randomized versus historical controls for clinical trials. American Journal of Medicine, 72, 233–40.

Schulz, K. F. (1995) Subverting randomisation in controlled trials. Journal of the American Medical Association, 274, 1456–8.

Schulz, K. F., Chalmers, I., Hayes, R. J., and Altman, D. G. (1995) Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association, 273, 408–12.

Silverman, W. A. and Chalmers, I. (2001) Casting and drawing lots: a time honoured way of dealing with uncertainty and ensuring fairness. British Medical Journal, 323, 1467–8.

Spaight, S. J., Nash, S., Finison, L. J., and Patterson, W. B. (1984) Medical oncologists’ participation in cancer clinical trials. Progress in Clinical and Biological Research, 156, 49–61.

Taylor, K. M. and Kelner, M. (1987) Interpreting physician participation in randomized clinical trials — are patients really informed? Journal of Health and Social Behaviour, 28, 389–400.

Torgerson, D. J. and Roberts, C. (1999) Randomisation methods: concealment. British Medical Journal, 319, 375–6.

Vickers, A. J. and Altman, D. G. (2001) Analysing controlled trials with baseline and follow up measurements. British Medical Journal, 323, 1123–4.

Yastrubetskaya, O., Chiu, E., and O’Connell, S. (1997) Is good clinical research practice for clinical trials good clinical practice? International Journal of Geriatric Psychiatry, 12, 227–31.