# Randomized Trials

DOI: 10.1093/med/9780195314465.003.0013

PRINTED FROM OXFORD MEDICINE ONLINE (www.oxfordmedicine.com). © Oxford University Press, 2016. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Medicine Online for personal use (for details see Privacy Policy and Legal Notice).

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. … Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. … Her task is to divide the eight cups into two sets of four, agreeing, if possible, with the treatments received.

Sir Ronald Fisher

## Introduction

### Can Ginkgo Biloba Prevent Dementia?

Dementia, a chronic condition that includes Alzheimer’s disease and related disorders, afflicts millions of Americans. Few modifiable risk factors have been identified. However, use of the herbal drug Ginkgo biloba has been popular in many areas of the world in an attempt to preserve memory. It contains antioxidants and other ingredients that may retard aggregation of amyloid protein into the neurotoxic brain deposits characteristic of Alzheimer’s disease. In the United States, sales of Ginkgo biloba in 2006 were estimated to have exceeded \$249 million, although evidence on the drug’s effectiveness for preventing memory decline remained limited (DeKosky et al., 2008).

To determine whether Ginkgo biloba actually prevents dementia, the Ginkgo Evaluation of Memory (GEM) study enrolled 3,069 older adult volunteers over an eight-year period at five study centers across the United States. Among the participants, 84% had normal cognition on entry, while the rest had mild cognitive impairment but not dementia. Subjects were then randomized to take either a twice-daily dose of Ginkgo biloba extract or twice-daily placebo. Participants were re-evaluated every 6 months using a standardized set of neuropsychological tests and were interviewed about possible side effects. The results were interpreted by an expert panel of clinicians to determine whether criteria for dementia were met. Neither the subjects themselves nor study staff responsible for follow-up or outcome assessment knew which drug each subject was taking. An independent Data Safety and Monitoring Board reviewed the results periodically during the trial to determine whether the study should be stopped early, due either to proof of efficacy or to safety concerns.

After a median follow-up of 6.1 years, the incidence of dementia among Ginkgo biloba recipients was 3.3 cases per 100 person-years, compared to 2.9 among placebo recipients (hazard ratio = 1.12, 95% CI: 0.94–1.33). Ginkgo biloba also had no evident effect on the incidence of dementia in the subset of participants with mild cognitive impairment at baseline (hazard ratio = 1.13, 95% CI: 0.85–1.50). The investigators concluded that Ginkgo biloba could not be recommended for prevention of dementia.
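The figures above can be checked with a little arithmetic. The sketch below is not the GEM investigators' analysis: it assumes the reported confidence limits are symmetric on the log scale (a Wald-type interval), and the function names are hypothetical. Under that assumption, a standard error and an approximate two-sided p-value can be recovered from each reported hazard ratio and its 95% CI.

```python
import math
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Approximate standard error of a log ratio, recovered from its reported CI.

    Assumes the interval is symmetric on the log scale (Wald-type).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)

def two_sided_p(estimate, lower, upper, level=0.95):
    """Approximate two-sided p-value for the null hypothesis that the ratio is 1."""
    z = math.log(estimate) / se_from_ci(lower, upper, level)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Crude incidence-rate ratio from the rates quoted in the text.
crude_rate_ratio = 3.3 / 2.9

# Hazard ratios and 95% CIs quoted in the text.
p_overall = two_sided_p(1.12, 0.94, 1.33)
p_mci = two_sided_p(1.13, 0.85, 1.50)

print(f"crude rate ratio: {crude_rate_ratio:.2f}")
print(f"approximate p (all participants): {p_overall:.2f}")
print(f"approximate p (mild cognitive impairment): {p_mci:.2f}")
```

The crude rate ratio is close to the reported hazard ratio. Neither approximate p-value is small, which agrees with both confidence intervals spanning 1.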

The GEM study is an example of a randomized trial—a comparative study in which study subjects are assigned by a formal chance mechanism to two or more intervention strategies. Randomized trials occupy a special place among epidemiologic study designs because they can provide particularly strong evidence to support a hypothesis of a causal link between an exposure and an outcome. Randomized-trial results can sometimes reverse inferences drawn from non-randomized studies (Anderson et al., 2004; Omenn et al., 1996; Odgaard-Jensen et al., 2011).

### Why Randomize?

First, randomization generally offers excellent protection against confounding. As we saw in Chapter 12, a necessary condition for confounding is that there be an association between exposure and the potential confounder. Randomization tends to distribute potential confounders similarly among exposure groups, which removes this necessary precondition. In causal-diagram terms, the chance mechanism alone determines exposure, so no arrow points from any potential confounder to exposure.

Randomization thus tends to break the links between exposure (here, the randomized treatment assignment) and all potential confounders, removing all backdoor paths from exposure to outcome. That said, while randomization balances potential confounders between exposure groups on average, some degree of imbalance may still occur by chance. With sufficiently large samples, any such imbalances are likely to be modest in size and can, if necessary, be addressed using the same kinds of analytic techniques that are routinely used to deal with confounding in non-randomized studies. Moreover, as discussed later, randomization can sometimes be implemented in such a way as to guarantee balance between exposure groups on a key potential confounder.

For example, in the GEM study, presence of the APOE ε4 allele is known to be strongly associated with the risk of Alzheimer’s disease and was thus an important potential confounding factor. Each of the 578 subjects who carried this genetic marker had an equal chance of being assigned to Ginkgo biloba or to placebo. It turned out that 297 subjects with the allele were assigned by chance to Ginkgo biloba, and 281 were assigned to placebo. The resulting prevalence of APOE ε4 was 24.1% in the active treatment group and 23.0% in the placebo group: not identical, but very close indeed. The two groups also proved to be similar on a variety of other health-related characteristics that were assessed in the trial. Moreover, randomization works to prevent confounding even by factors that have not been measured or that may be unknown to the investigator.
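How surprising is a 297 versus 281 split? The sketch below assumes simple randomization, in which each of the 578 carriers is assigned independently with probability 1/2. That is a simplification for illustration, not a description of the GEM allocation procedure, and the function name is hypothetical. It computes the exact binomial probability of a split at least that uneven.

```python
from math import comb

def prob_split_at_least_as_uneven(n, k):
    """P(|X - n/2| >= |k - n/2|) for X ~ Binomial(n, 1/2).

    X is the number of the n subjects who land in one particular arm
    under independent fair-coin assignment.
    """
    gap = abs(k - n / 2)
    favorable = sum(comb(n, x) for x in range(n + 1) if abs(x - n / 2) >= gap)
    return favorable / 2 ** n

p_uneven = prob_split_at_least_as_uneven(578, 297)
print(f"P(split at least as uneven as 297 vs. 281) = {p_uneven:.2f}")
```

Under this assumption, an imbalance of that size or larger would arise by chance in roughly half of all randomizations, so the observed split is unremarkable.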

Second, if potential study subjects or those who refer subjects into a trial can anticipate which treatment arm a subject would be assigned to, self-selection or selective referral can lead to a biased comparison. A desperately ill patient might prefer an experimental treatment to placebo and might choose to enroll only if likely to receive the experimental treatment. Such self-selection could skew the experimental-treatment arm toward sicker patients with worse prognosis. The specific characteristics that influence such choices in a particular context can be very difficult to identify, measure, and control adequately. The unpredictability of assignments that results from randomization helps prevent imbalances in those characteristics across treatment groups. In some trials, unpredictability due to randomization also makes it possible to keep participants unaware of which treatment group they are in (“blinded”), which helps avoid bias in self-reported outcomes and differential subject attrition.

Third, randomization supports statistical inference. It allows the probability of any given set of trial results under the null hypothesis to be calculated from statistical theory. This provides a solid basis for obtaining p-values and confidence limits: the assumptions behind standard statistical tests are satisfied by design.
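The tea-tasting experiment quoted at the head of the chapter illustrates this point compactly. If the lady is merely guessing, every way of choosing four cups out of eight as "milk first" is equally likely, so the chance distribution of the number she gets right follows directly from counting. The sketch below does that counting; the function name is an illustrative choice.

```python
from fractions import Fraction
from math import comb

def tea_null_distribution(cups=8, milk_first=4):
    """Chance distribution of the number of milk-first cups labeled correctly.

    The taster must label exactly `milk_first` cups as milk-first, so under
    pure guessing the number of correct labels is hypergeometric.
    """
    tea_first = cups - milk_first
    ways_total = comb(cups, milk_first)
    return {
        k: Fraction(comb(milk_first, k) * comb(tea_first, milk_first - k), ways_total)
        for k in range(milk_first + 1)
    }

dist = tea_null_distribution()
p_all_correct = dist[4]
for k, p in dist.items():
    print(f"{k} of 4 milk-first cups correct: {p} ({float(p):.3f})")
```

Only one of the 70 possible selections is entirely correct, so a perfect score has probability 1/70 under the null hypothesis. No assumption about the distribution of the data is needed beyond the random presentation order.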

### When Can a Randomized Trial Design Be Used?

Some exposure–outcome relationships are more amenable than others to study in a randomized trial. Conditions that favor use of the randomized-trial design include the following:

1. The exposure is potentially modifiable. Factors such as genotype, family history, and race/ethnicity cannot be altered, so their influence on disease occurrence must be studied with non-randomized designs.

2. The exposure is potentially modifiable by the investigator. Certain exposures such as occupation, marital status, and smoking habits are, in principle, modifiable, but it is often impossible or impractical to assign people at random among the possible categories for research purposes. However, even this limitation does not always preclude use of a randomized trial in some form, as in the following example:

Example 13-1. Many observational studies have found cigarette smoking during pregnancy to be associated with increased risk of having a low birth-weight baby. However, part of this association could represent confounding by maternal risk factors associated with smoking. Sexton and Hebel (1984) randomly assigned 935 pregnant smokers to either a smoking-cessation intervention or a usual-care control group. By the eighth month of pregnancy, 57% of mothers in the intervention group still smoked, versus 80% of mothers in the control group. The average birth weight of babies born to all intervention-group mothers was 92 grams higher than that in the control group (p < 0.05). These results suggested not only that the intervention was effective at helping pregnant mothers to stop smoking, but also that maternal smoking was truly a cause of low birth weight.

Here, the exposure of main interest (smoking during pregnancy) was not the same as the randomized intervention actually studied (participation in a smoking-cessation program). But if it is safe to assume that the narrowly focused smoking cessation program did not influence birth weight in any other way than by changing mothers’ smoking behavior, then the observed difference in birth weight represented a “diluted” effect of maternal smoking, biased away from finding a difference. When an association was found nonetheless, the case for a causal effect of smoking gained strong support.
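The size of the dilution can be gauged with a rough calculation that Sexton and Hebel did not report; it is offered here only as a sketch. It assumes, as above, that the program affected birth weight solely through smoking. It further assumes that smoking status in the eighth month adequately summarizes exposure. Dividing the between-arm difference in birth weight by the between-arm difference in smoking prevalence then gives an "undiluted" estimate. The function name is hypothetical.

```python
def undiluted_effect(outcome_difference, exposure_prev_control, exposure_prev_intervention):
    """Scale a between-arm outcome difference by the between-arm shift in exposure.

    Valid only if the intervention affects the outcome solely by changing exposure.
    """
    shift = exposure_prev_control - exposure_prev_intervention
    if shift <= 0:
        raise ValueError("intervention must lower exposure prevalence")
    return outcome_difference / shift

# 92 g higher mean birth weight; 80% vs. 57% still smoking in the eighth month.
effect_grams = undiluted_effect(92, 0.80, 0.57)
print(f"Implied birth-weight deficit attributable to smoking: about {effect_grams:.0f} g")
```

Under these assumptions the 92-gram difference corresponds to a deficit of about 400 grams per smoking mother. The figure is imprecise and may be overstated, for example if some intervention-group mothers who kept smoking nonetheless smoked less.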

3. There is genuine uncertainty about which intervention strategy is superior. Sometimes other available evidence indicates beyond a reasonable doubt that one intervention strategy is superior to another. If so, it may be unethical to offer people the inferior method in lieu of the superior one. For example, a tongue-in-cheek article in the British Medical Journal pointed out that parachutes are thought to “reduce the risk of injury after gravitational challenge, but their effectiveness has not been proved with randomised controlled trials” (Smith and Pell, 2003).

4. The primary outcomes are relatively common and occur relatively soon. Randomized trials almost always involve measuring outcomes prospectively in two or more comparison groups. As will be shown later, the power of a trial to detect an intervention effect on incidence depends on occurrence of a sufficient number of outcome events. For this reason, randomized trials are less well suited to studying rare or long-delayed outcomes, because large samples or prolonged follow-up would be needed to get the necessary number of outcome events.
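To see why the number of outcome events matters in the last condition, consider one widely used approximation for a two-arm trial with equal allocation, analyzed by a log-rank test (Schoenfeld's formula). It is used here as an illustrative assumption and is not necessarily the derivation this chapter presents later. The required number of events depends on the hazard ratio to be detected, not directly on the number of subjects. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def events_needed(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate total events needed (both arms combined, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_power) ** 2 / math.log(hazard_ratio) ** 2)

def person_years_needed(events, rate_per_100_py):
    """Crude follow-up needed to accrue `events` at a given average incidence rate."""
    return events / (rate_per_100_py / 100)

for hr in (0.50, 0.75):
    d = events_needed(hr)
    py = person_years_needed(d, rate_per_100_py=3.0)
    print(f"HR {hr}: about {d} events, roughly {py:,.0f} person-years at 3 per 100")
```

With these assumed inputs, detecting a hazard ratio of 0.75 calls for roughly 380 events, compared with about 66 for a hazard ratio of 0.5. When the outcome is rare, accruing that many events requires either a very large sample or very long follow-up.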

Randomized trials can be expensive, especially when large numbers of participants are required, the period of follow-up is long, and/or the process of intervening and collecting data on each subject is costly. Still, it is important to keep in mind that the key distinguishing feature of a randomized trial is simply that the comparison groups be formed by a formal chance mechanism—a process that is technically quite easy to carry out and is neither costly nor time-consuming. Indeed, many randomized trials have been conducted speedily and at low cost when they concerned common, easily measured outcomes that occurred soon after the intervention was applied—e.g., Rosa et al. (1998); Klein et al. (1995). As recounted in the headnote to this chapter, Sir Ronald Fisher, inventor of the randomized trial, proved this point in a classic experiment about whether a lady could truly tell whether tea or milk was first added to the cup. (She did!)

### Historical Milestones

Randomized trials are a relatively modern innovation in research design (Chalmers, 2001). Although a few comparative treatment experiments had been carried out before 1900, none used formal randomization to form the comparison groups. Fisher first developed randomized-trial methodology for use in agriculture (Fisher, 1935), and some terminology for experimental designs (e.g., “split plot”) still reflects this legacy. A groundbreaking early randomized clinical trial showed that the antibiotic streptomycin was effective treatment for pulmonary tuberculosis (Medical Research Council, 1948). Growing appreciation of the design’s unique benefits led to rapid development and acceptance of its use throughout the 1950s, including a large trial of the Salk vaccine for prevention of polio that involved 1.8 million schoolchildren (Dawson, 2004). In 1962, the United States Food and Drug Administration began requiring evidence of efficacy from properly controlled randomized trials before a new drug could be approved for general use in the United States (Junod, 2012). The 1970s–1980s saw mounting of many large multi-site trials, including the University Group Diabetes Program (University Group Diabetes Program, 1970a,b), the Multiple Risk Factor Intervention Trial for prevention of cardiovascular disease (Multiple Risk Factor Intervention Trial Research Group, 1982), and cooperative cancer chemotherapy groups (DeVita and Chu, 2008). In the 1990s, the Cochrane Collaboration—named for Archibald Cochrane, a tireless and influential advocate of randomized trials (Cochrane, 1972; Winkelstein, 2009)—was formed to provide critical systematic reviews of evidence from randomized trials relevant to medical practice and health policy (Friedrich, 2013). In 1996, the first CONSORT guidelines (discussed below) for reporting of trial results were published (Begg et al., 1996) and were soon widely endorsed by medical journals. 
In 2004, the International Committee of Medical Journal Editors mandated that all clinical trials be registered in a publicly accessible trial registry (e.g., www.ClinicalTrials.gov), prior to patient enrollment, as a condition of consideration for publication (DeAngelis et al., 2004).

## CONSORT

In an attempt to improve the completeness and quality of clinical trial reports in medical journals, an international committee of trialists, biostatisticians, epidemiologists, and journal editors developed the Consolidated Standards of Reporting Trials (CONSORT) in 1996 (Begg et al., 1996). The CONSORT statement has since been revised periodically (Moher et al., 2001; Schulz et al., 2010), and expansions of it have been published for several specialized types of randomized trials (CONSORT, 2013).

The CONSORT statement has been endorsed by more than 600 medical journals. Accordingly, investigators who plan to conduct a randomized trial would be wise to know what information journals will be likely to require in a trial report. The scientific basis for the CONSORT guidelines is presented in an extensively referenced companion publication (Moher et al., 2010), which explains not only why each element should be addressed in a trial report, but also why the corresponding aspect of trial design and conduct is important scientifically. The CONSORT guidelines will be referenced repeatedly in this chapter.

## Trial Objectives

### Explanatory vs. Pragmatic Aims

Randomized trials are undertaken to determine whether one intervention strategy is better than another in some way. But why would the answer matter? Sometimes a potentially useful theory predicts a certain result, and the proposed trial would provide a rigorous test of the theory. Trials designed for this reason are often called explanatory trials (Schwartz and Lellouch, 1967; Thorpe et al., 2009). They are akin to basic research. Their focus is on understanding causal mechanisms, and the main product is improved knowledge about how the world works.

Example 13-2. To follow up on earlier research that had found an inverse association between dietary calcium intake and body weight, Bendsen et al. (2008) sought to test the hypothesis that high calcium intake causes increased fecal excretion of dietary fat. Eleven adult volunteers were identified through advertisements in shops and campus websites. The investigators then provided subjects with two tightly controlled diets, one containing 2,300 mg. of daily calcium and the other containing 700 mg. of daily calcium. Randomization determined which of the two diets a given subject followed first, and after following that initial diet for seven days, the participant was switched to the other diet for seven more days. Total fat excretion was found to be 11.5 grams/day on the high-calcium diet and 5.4 grams/day on the low-calcium diet (p < 0.001), supporting the theory that a high-calcium diet can promote weight loss.
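In a crossover design like this one, what is randomized is the order of the two diets, not which diet a subject receives. A minimal sketch of such an allocation follows. It assumes an independent fair coin flip per subject; the source does not describe the procedure Bendsen et al. actually used, and the function name, subject numbering, and seed are hypothetical.

```python
import random

DIETS = ("high-calcium", "low-calcium")

def assign_diet_orders(subject_ids, seed=None):
    """Randomly choose, for each subject, which diet comes first.

    Every subject eventually receives both diets; only the order is random.
    """
    rng = random.Random(seed)  # seeded so the allocation list can be reproduced
    orders = {}
    for sid in subject_ids:
        first = rng.choice(DIETS)
        second = DIETS[1] if first == DIETS[0] else DIETS[0]
        orders[sid] = (first, second)
    return orders

orders = assign_diet_orders(range(1, 12), seed=2008)  # eleven subjects
for sid, (first, second) in orders.items():
    print(f"subject {sid:2d}: {first} for 7 days, then {second} for 7 days")
```

Because each subject serves as his or her own control, the comparison of fat excretion between diets is made within individuals. Randomizing the order guards against any systematic effect of which period comes first.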

In other situations, knowing which intervention strategy is better matters because a practical decision must be made between two or more possible courses of action. Time, money, lives, comfort, or other valued outcomes are at stake. Trials that are undertaken to guide decision-making in the “real world” are often called pragmatic trials. They are akin to applied research. The focus is on choosing between feasible alternatives that are potentially widely applicable (Tunis et al., 2003). Whatever value the trial has as a test of a theory is a happy by-product.

Example 13-3. Women with breast cancer that has spread to a nearby lymph node are normally treated by surgical removal of the tumor, radiation therapy to the breast, and possibly chemotherapy. But there has been debate about whether the surgeon should do additional, more extensive dissection of axillary lymph nodes at the time of surgery in an attempt to find and remove other nodes that may contain cancer. Doing so may limit cancer spread, but it also prolongs surgery and poses risk of infection, lymphedema, and other complications.

Giuliano et al. (2011) conducted a collaborative randomized trial at 115 clinical centers, at which 891 eligible and consenting women with breast cancer and evidence of one or two affected local lymph nodes were randomized to receive either: (A) further dissection of at least ten axillary lymph nodes, or (B) no further axillary surgery. At follow-up, five-year survival was 91.8% for group A and 92.5% for group B. Disease-free survival was 82.2% for group A and 83.9% for group B. The investigators concluded that more-extensive axillary dissection offered little or no advantage in terms of survival or disease recurrence among such patients.

Explanatory and pragmatic aims are both perfectly good reasons for carrying out a trial. Sometimes the same study can satisfy both kinds of aims well. But choices about specific features of trial design must often be made between mutually exclusive alternatives. These choices can involve trade-offs between testing a theory rigorously and maximizing relevance to practical decision-making. Table 13.1 lists aspects of trial design that can be influenced by which kind of goal takes precedence. A clear view of whether a trial’s primary purpose is explanatory or pragmatic provides a consistent philosophy to guide these choices.

Table 13.1. Influence of Explanatory vs. Pragmatic Orientations on Trial Design

| Feature | Explanatory | Pragmatic |
| --- | --- | --- |
| Experimental intervention | What theory predicts should work best, whether practical for wider application or not | Strategy practical for use in real-world settings |
| Comparison intervention | Sharply defined, often maximizing contrast between experimental and control conditions | Realistic practical alternative to experimental intervention, sometimes just “usual care” |
| Study subjects | Those in whom an effect of intervention is expected to be greatest, including highly compliant people | Relatively broad sample from the potential target population |
| Outcome variables | Variables most sensitive to effects predicted by theory | Outcomes most relevant to subjects and care providers |

A closely related distinction is between the efficacy and the effectiveness of an intervention (Porta, 2008; Koepsell et al., 2011). Efficacy refers to how well an intervention can work under ideal circumstances—e.g., when administered by well-trained experts and aimed at perfectly compliant recipients—even if those conditions are artificial and hard to mimic outside the research setting. Effectiveness refers to how well an intervention does work under “field conditions”—e.g., when administered by ordinary practitioners and offered to a relatively unselected target population.

### Clinical Trial Phases

The United States Food and Drug Administration (FDA) has set forth a typology of phases of clinical trials. The typology was originally developed to describe and classify trials of drugs, but terminology based on it is now widely used for trials of other kinds of experimental interventions as well. Note that the FDA typology uses the term “trial” to refer to an intervention study in humans, not necessarily a randomized intervention study.

• Phase 1 trials aim to determine a safe dosage range and to identify side effects, usually by administering the drug or other intervention to 20–80 healthy volunteers. No control group is studied.

• Phase 2 trials seek preliminary data about effectiveness by giving the experimental intervention to about 30–300 persons with the target health condition. Safety is also monitored. Some phase 2 studies lack a control group and base inferences only on changes over time in the treated individuals. Other phase 2 studies involve randomization between two arms, one of which is treated with a placebo or an alternative treatment.

• Phase 3 trials are nearly always randomized, seeking to evaluate the intervention’s effectiveness and assess its side effects and safety compared to placebo or an alternative. Phase 3 studies typically involve about 1,000–3,000 patients.

• Phase 4 studies are conducted after a new drug or other treatment is on the market, to obtain additional data about its benefits and risks in regular use. Phase 4 studies are typically observational studies that can involve thousands of patients, but without a control group. Due to their larger size, they can often identify rare but important side effects.

This chapter relates most closely to phase 3 studies and to phase 2 studies that include a randomized control group.

## Treatment Arms

The alternative conditions to which trial participants are assigned are often termed arms or treatment arms of the trial. (The term “treatment” in this context should be interpreted broadly to mean an intervention of any kind, not necessarily a therapeutic one.) A simple and common trial design involves just two arms, commonly called the experimental and control arms. Other design variations are considered later.

### Experimental

Interest in a particular new intervention (or an older one in which there is renewed interest) is usually what motivates a trial in the first place, and the trial is built around it. The form taken by this experimental intervention may be tailored to meet explanatory or pragmatic goals, as noted above.

### Control

The extent to which an experimental intervention is found to affect outcomes can depend heavily on what it is compared to. Several common options for the control arm follow.

#### Nothing

If there is no widely accepted competitor for the experimental intervention, one option is simply to compare it to nothing at all. For example, a randomized trial of low-dose aspirin for prevention of coronary heart disease was conducted among male British physicians, in which participants agreed to take either 500 mg. of aspirin daily or to avoid aspirin and aspirin-containing products (Peto et al., 1988). As one consequence of this design feature, the trial became an “open label” study, in which each participant knew full well whether he had been assigned to take aspirin or not. Such knowledge can influence self-reporting of outcomes, as discussed below.

#### Placebo

In drug trials, a placebo is a preparation that resembles the experimental drug to the senses but that omits the active ingredient and thus is believed to have no true biological effect (Vickers and de Craen, 2000). Placebos have long been credited with having the potential to make patients feel better by inducing expectations of benefit, although evidence in support of this claim is mixed (Hrøbjartsson and Gøtzsche, 2001; Walsh et al., 2002). In a randomized trial, the main function of a placebo is to help keep subjects unaware of which treatment they are receiving—i.e., to preserve blinding—which in turn helps prevent bias in ascertainment of outcomes or from differential attrition in the two arms. The GEM study of Ginkgo biloba used identical-appearing placebo tablets for just this purpose.

In trials of non-pharmacological interventions, the same function can be served by a “placebo-like” control intervention that appears similar to the experimental intervention but that lacks its presumably active component.

Example 13-4. Arthroscopic debridement and lavage has been a common orthopedic procedure to relieve knee pain that is due to osteoarthritis and has not responded to drug therapy. The procedure involves cutting away damaged knee cartilage and rinsing out the joint space to remove fragments of debris, all done through an arthroscope to minimize the size of the surgical incision and to speed post-operative healing. Although about 650,000 such procedures were performed annually in the United States in the late 1990s, there was little firm evidence that the procedure improved outcomes. Moseley et al. (2002) conducted a randomized trial that compared arthroscopic surgery with sham surgery. They described what happened in the operating room and afterward for control-group subjects as follows:

After the knee was prepped and draped, three 1-cm incisions were made in the skin. The surgeon asked for all instruments and manipulated the knee as if arthroscopy were being performed. Saline was splashed to simulate the sounds of lavage. No instrument entered the portals for arthroscopy. The patient was kept in the operating room for the amount of time required for a debridement. Patients spent the night after the procedure in the hospital and were cared for by nurses who were unaware of the treatment-group assignment.

It may seem alarming that half the participants in this trial underwent an invasive surgical procedure in which, deliberately, no action was taken to treat their underlying disease. However, it should be kept in mind that all the study subjects knew in advance that there was a 50% chance that they would receive arthroscopic surgery and an equal chance that they would receive sham surgery, and they willingly gave their informed consent to participate. The study protocol was also approved by an institutional review board. In the end, knee pain and functional outcomes up to two years later differed very little between patients in the two arms. In retrospect, then, subjects in the sham surgery arm did not necessarily receive inferior treatment, and they contributed to important new knowledge about the (in)effectiveness of a costly and invasive treatment for their condition.

Other novel examples of non-drug “placebos” include sham foot orthotics in a treatment trial for plantar fasciitis (Landorf et al., 2006), sham hemodialysis in a trial of treatment for schizophrenia (Carpenter et al., 1983), placebo steam in a trial of moist air for relief of upper respiratory infections (Macknin et al., 1990), and sham electroconvulsive therapy in trials of treatment for depression (Johnstone et al., 1980).

#### Delayed Intervention

In some instances, the only feasible comparison option may be to assign participants at random to receive the experimental intervention either right away or at a later time. Outcomes in both groups are monitored concurrently during the period when the early-intervention group has received the experimental intervention but the control group has not.

Example 13-5. Gray et al. (2007) investigated the extent to which circumcision could reduce a man’s risk of becoming infected with HIV in a high-prevalence area. Some 4,996 Ugandan men agreed to be randomized to receive circumcision either immediately or after a 24-month delay. Over the ensuing 24 months, the incidence of HIV infection was found to be 0.66 cases per 100 person-years in the early-circumcision group, compared to 1.33 cases per 100 person-years in the control group.
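The two incidence rates can be turned into the usual summary measures with simple arithmetic. The sketch below uses only the crude rates quoted above, so its results may differ slightly from the adjusted estimates in the published report. The variable names are illustrative.

```python
# Incidence rates quoted in the text, in cases per 100 person-years.
rate_early = 0.66    # immediate-circumcision arm
rate_delayed = 1.33  # delayed-circumcision (control) arm

rate_ratio = rate_early / rate_delayed
efficacy = 1 - rate_ratio                  # proportion of incidence prevented
rate_difference = rate_delayed - rate_early

print(f"rate ratio: {rate_ratio:.2f}")
print(f"efficacy: {efficacy:.0%}")
print(f"rate difference: {rate_difference:.2f} cases per 100 person-years")
```

On these crude figures, incidence in the early-circumcision arm was about half that in the control arm. In absolute terms this amounts to roughly two-thirds of a case prevented per 100 person-years.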

A delayed-intervention control strategy can be attractive when it is not practically or politically feasible to deny the experimental intervention entirely to control participants, and/or when resources are too scarce to provide that intervention to everyone immediately. A disadvantage is that once the control group receives the intervention, a concurrent randomized comparison of intervention vs. no intervention is no longer possible, which sacrifices the ability to gauge effects on longer-term outcomes.

#### Active Alternative

An older, established intervention may already be accepted as beneficial in comparison to nothing at all. If a new competitor then comes along, it is likely to be both more ethical and more relevant to compare the new intervention to the older one, rather than to compare the new one with placebo or with nothing. For example, randomized trials done in the 1960s showed that treatment of high blood pressure with certain antihypertensive drugs could reduce the risk of cardiovascular complications, compared to the risk among those who received no such treatment (Veterans Administration Cooperative Study Group on Antihypertensive Agents, 1967). Nowadays, a new antihypertensive drug would need to be shown to be at least as good as older drugs of proven benefit if the new drug is to merit a role in clinical practice (ALLHAT Collaborative Research Group, 2000).

#### Usual Care

A new treatment can be compared to what study subjects would otherwise receive. This option can be attractive when the trial is chiefly pragmatic in orientation. However, the researcher often has little control over what happens to participants who are assigned to usual care, and what they actually receive can prove to be an important determinant of trial results.

Example 13-6. The Multiple Risk Factor Intervention Trial (MRFIT) sought to determine whether a multi-component intervention would reduce mortality from coronary heart disease (Multiple Risk Factor Intervention Trial Research Group, 1982). Some 12,866 high-risk men aged 35–57 years were randomized to either (1) a special intervention that included hypertension control, smoking cessation, and/or dietary reduction of blood cholesterol; or (2) usual care. When a subject entered the trial, risk-factor levels were evaluated through extensive testing during three baseline visits prior to randomization and annually thereafter. For men in the special-intervention group, data from these visits were used to tailor the intervention to their individual needs. For men in the usual-care group, reports from the baseline visits were sent to the participant’s regular physician without specific recommendations about what should be done.

During follow-up averaging seven years, coronary heart disease mortality was slightly but not significantly lower in the special-intervention group than in the usual-care group (−7%, 95% confidence interval: −25% to +15%), while mortality from all causes was actually slightly higher in the special-intervention group. Blood-pressure levels, cholesterol, and smoking prevalence had all declined in special-intervention group men, but declines were also observed in the usual-care group. Coronary heart disease mortality in both groups proved to be much lower than had been expected when planning the trial. Possible explanations suggested for the trial’s unexpected negative result included (1) the special intervention did not work; or (2) risk-factor changes in the usual-care group led to the trial’s being unable to detect a modest relative benefit of the special intervention. A later report after ten years of follow-up found more convincing evidence of mortality reduction in special-intervention men (Multiple Risk Factor Intervention Trial Research Group, 1990).

## Enrollment of Study Subjects

### Eligibility Criteria

The CONSORT guidelines mandate reporting the specific eligibility criteria for trial participants, and the settings and locations where the data were collected. Several factors come into play in setting eligibility criteria (Van Spall et al., 2007):

#### Internal Validity

A trial’s ability to reach a correct conclusion for subjects who actually take part is termed its internal validity. Examples of eligibility criteria motivated by a desire to protect internal validity include the following:

• Subject retention. People who expect to move away or who have an illness that may cut short their participation are often excluded.

• Data quality. People who do not share the native language of study personnel are often excluded because of the difficulty and expense of translating data instruments satisfactorily into other languages and concern about increases in measurement error.

• Compliance. In trials that seek to determine efficacy, potential subjects may be excluded if they would be unable or unwilling to comply with the intervention to which they are assigned. One strategy for enhancing compliance is to screen potential subjects during a run-in phase (Lang, 1990; Pablos-Mendez et al., 1998). Before being accepted and randomized, screenees are asked to do on a test basis what they would be expected to do during the main trial, such as taking a drug and/or providing certain data. Only those who show that they are willing and able to do so are accepted into the main phase of the trial. For example, in a trial of β-carotene and retinol for cancer prevention, only potential participants who took at least half of a supply of assigned placebo capsules during a three-month run-in phase were accepted into the trial (Thornquist et al., 1993).

#### External Validity

Generalizability of a trial’s findings to non-participants is termed its external validity. A pragmatic trial can be most useful if it includes a broadly representative sample of persons for whom the practical decision addressed by the trial would arise. The kinds of exclusions listed above, which seek to protect the trial’s internal validity, may involve sacrificing generalizability.

The right balance between these competing goals depends in part on how much is already known about the experimental intervention. If there is uncertainty about whether it works even under highly favorable conditions, then establishing its efficacy may need to take priority—without it, there is little reason to consider wide dissemination. But if previous studies have shown that the intervention can work, at least in certain settings, the emphasis may properly shift to evaluating whether and how well it works in a broader target population (Rothwell, 2005; Weiss et al., 2008; Koepsell et al., 2011).

#### Risks and Benefits to Subjects

Finally, the experimental intervention, or the control condition, may pose unusual risks to certain kinds of people. A vaccine that is prepared in eggs could be dangerous to someone with a known allergy to eggs, so he or she would probably be excluded from a trial evaluating the vaccine. An experimental treatment may be known to be highly toxic or risky, but the disease itself may be almost uniformly fatal. Eligibility for a trial of such a treatment may thus be open only to persons for whom other therapeutic options have been exhausted.

Any exclusion rule that stems from special risks or benefits posed by one of the trial arms must nonetheless apply to all potential subjects. Bias could be introduced if such an exclusion were applied selectively after the outcome of randomization became known, because doing so would upset the similarity of the groups formed by random assignment, thus potentially losing the key benefit of randomization.

### Number of Subjects

The CONSORT guidelines also mandate that a trial report specify how the trial’s sample-size requirements were determined. The statistical methodology for estimating sample size for cohort studies, as described in Appendix 5B of Chapter 5, also applies to two-arm parallel-groups trials and is not repeated here; simply read “experimental arm” for “exposed” and “control arm” for “unexposed.” Chapter 5 also provides guidance on choosing input values for sample-size formulas and references to biostatistical sources for specialized designs.

When estimating a trial’s sample-size requirements, specifying the size of the intervention effect—for example, the difference in incidence of outcome events between trial arms—often poses a challenge. Uncertainty about the size of this effect is, after all, the motivation for the trial in the first place, and it may seem circular to have to assume a value in order to plan the trial itself. This paradox can be resolved by thinking of the value specified for the intervention effect not as a prediction about what will happen, but as a judgment about the smallest effect that the trial needs to be able to detect. Ideally, that threshold value would be motivated by theory in an explanatory trial (How big an effect would it take to confirm or refute the theory?), or by practical considerations in a pragmatic trial (How big a difference in outcomes would justify opting for one strategy over the other, given the other factors affecting that decision?).

Perhaps because of the difficulty in making such judgments, investigators are sometimes tempted to specify the target intervention effect based on what was observed in a pilot study. That strategy can be hazardous, not only because it ignores the theoretical issue just discussed, but also because the small size of most pilot studies usually implies that the resulting estimate of intervention effect is quite imprecise. The pilot-study result may thus be far from the truth in either direction just by chance. A large effect that happens to be found in a small pilot study may spur enthusiasm for the intervention but could cause the main trial to be badly underpowered. A spuriously small effect size in a pilot study may cause a perfectly good intervention to be abandoned prematurely. Instead, a pilot study is best regarded as a small-scale evaluation of the feasibility of various aspects of study methodology, not as a preliminary test of the main hypothesis itself (Kraemer et al., 2006; Gore, 1981b; Wittes and Brittain, 1990).

In some instances, the available number of subjects is limited by external factors, so that sample size is not under the investigator’s sole control. To help determine whether a trial is worth doing under those circumstances, the formulas that are used to estimate sample size can often be rearranged to solve for β, the maximum tolerable probability of a Type II error under a specified alternative hypothesis. A decision on whether to proceed would be based on whether study power (= 1 − β) is judged to be high enough. One can also solve for the smallest detectable difference in outcome frequency, given specified values of the sample size, α, β, and possibly other design parameters.
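The rearrangement described above can be sketched in code. The example below is only an illustration: it assumes a two-arm trial with equal group sizes, a binary outcome, and a common normal-approximation formula for comparing two proportions, which may differ in detail from the formulas in Appendix 5B. The function names `n_per_arm` and `power_for_n` and the example incidence values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution

def n_per_arm(p_control, p_experimental, alpha=0.05, power=0.80):
    """Subjects needed per arm to detect the stated difference (two-sided test)."""
    p_bar = (p_control + p_experimental) / 2
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    spread_null = sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = sqrt(p_control * (1 - p_control)
                      + p_experimental * (1 - p_experimental))
    numerator = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return ceil(numerator / (p_control - p_experimental) ** 2)

def power_for_n(n, p_control, p_experimental, alpha=0.05):
    """Same formula solved for power (1 - beta) when n per arm is fixed."""
    p_bar = (p_control + p_experimental) / 2
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    spread_null = sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = sqrt(p_control * (1 - p_control)
                      + p_experimental * (1 - p_experimental))
    z_beta = (abs(p_control - p_experimental) * sqrt(n)
              - z_alpha * spread_null) / spread_alt
    return _Z.cdf(z_beta)

# Hypothetical planning values: 10% incidence in controls, 5% with intervention.
needed = n_per_arm(0.10, 0.05)
print("needed per arm:", needed)
print("power if only 200 per arm are available:", round(power_for_n(200, 0.10, 0.05), 2))
```

Here the 5-percentage-point difference plays the role of the smallest effect the trial needs to be able to detect, not a prediction of what will be observed.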

In practice, setting a target sample size can be far easier than achieving it. Recruitment into trials is a frequent problem (Lovato et al., 1997; Taylor et al., 1984), particularly for minorities, older adults, and certain other subpopulations (Ford et al., 2008; Swanson and Ward, 1995). Campbell et al. (2007) have evaluated and summarized evidence behind various ways to enhance enrollment into trials, and Lai et al. (2006) did so for minorities. A pilot study can be valuable for identifying recruitment problems that may arise and for verifying that the target number of subjects is realistic.

An important determinant of sample-size requirements is the expected frequency of the outcome events of main interest: other things being equal, rarer outcomes require larger sample sizes. This statistical fact is one reason why randomized trials are considered relatively inefficient for the detection of rare but serious adverse events. One proposed way to help overcome this limitation is to design and conduct large, simple trials—typically two-arm trials with uncomplicated eligibility criteria and bare-bones data collection requirements that focus on essential outcomes (Yusuf et al., 1984). By minimizing the cost per participant, more subjects can be enrolled for a given overall trial cost. Moreover, recruiting a large number of subjects can be facilitated by keeping eligibility criteria simple. As one example, the Rotavirus Efficacy and Safety Trial (Vesikari et al., 2006) studied 68,038 healthy infants in 11 countries to test whether a new vaccine against rotavirus increased the risk of intussusception, an important adverse outcome that had been associated with earlier anti-rotavirus vaccines and whose estimated incidence was about 50 cases per 100,000 infant-years (Heyse et al., 2008). Within one year after receiving either the new vaccine or a placebo, 12 vaccine recipients and 15 placebo recipients developed intussusception (RR = 0.8, 95% CI: 0.3 – 1.8). The trial’s large size yielded sufficiently narrow confidence limits around the final relative-risk estimate to provide reassurance that the new vaccine led to little or no increase in risk of intussusception.

### Informed Consent

Randomized trials funded by any United States government agency must meet ethical standards concerning involvement of human subjects (U.S. Department of Health and Human Services, 1991). Many research organizations apply these policies regardless of their funding source. With rare exceptions, participants must give their informed consent to serve as research subjects. Required elements of informed consent involve informing potential subjects about:

• The fact that they would be participating in research

• Procedures involved

• The nature of risks and discomfort they might face

• Potential benefits to those who take part and to other people as a result of subjects’ trial participation

• Alternative treatments or procedures that might be to subjects’ advantage

• The extent to which information gathered would remain confidential

• Compensation available in the event of injury, if more than minimal risk is involved

• Whom subjects may contact with questions about the research

• The fact that participation is voluntary and that subjects may withdraw without penalty or loss of benefits to which they might otherwise be entitled.

Ethical aspects of the conduct of randomized trials are discussed more fully by Sugarman (2002), Kahn et al. (1998), and Kodish et al. (1990).

## Randomization

Random assignment of participants to treatment groups is the defining feature of a randomized trial. In this context, random does not mean simply “haphazard” or “with no apparent pattern”; rather, it means resulting from a formal chance mechanism by which each subject has a known probability of being assigned to a given treatment group, such that the outcome of that assignment is not predictable in advance.

Because randomization is so important, the CONSORT guidelines mandate reporting on three aspects of randomization, each of which is discussed further below:

• Sequence generation: The specific method that was used to generate the random-allocation sequence, including the type of randomization employed and details of any restrictions, such as blocking and block size.

• Allocation concealment: How the random-allocation sequence was kept concealed prior to subjects’ assignments to treatment groups.

• Implementation: Who generated the random allocation sequence, who enrolled participants, and who assigned participants to treatments.

### Generating the Sequence of Random Assignments

Several types of randomization can be used to generate a sequence of treatment-group assignments (Schulz and Grimes, 2002b). The choice among them involves two goals that can compete with each other:

• Balance. To prevent confounding, randomization should ideally form intervention groups that are similar on both measured and unmeasured characteristics that may influence outcomes. In a particular instance, the risk and extent of imbalance can depend on the specific type of randomization employed, as well as sample size.

• Unpredictability. To prevent bias due to self-selection or selective referral, neither potential subjects themselves nor others who refer potential subjects into a trial should be able to anticipate the intervention arm to which a subject would be assigned if he/she joined the trial. Maintaining unpredictability depends in part on keeping the random-allocation sequence adequately concealed, as discussed later. But it also depends on the extent to which someone who looks closely at past treatment-group assignments could use that information to predict the next assignment.

The main types of randomization are described below. For simplicity, we assume here that the design calls for forming two treatment groups of approximately equal size; with minor changes, all of the methods can also be applied when the goal is a ratio of group sizes other than 1:1. Appendix 13A gives more technical detail on how each method can be implemented by computer, with examples.

#### Simple Randomization

Simple (or unrestricted) randomization is tantamount to flipping a coin. Every subject has an equal chance of being assigned to either group, and all assignments are independent of one another. Simple randomization thus maximizes unpredictability: the sequence of past assignments provides no clue about the group to which the next patient is likely to be assigned.
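As a minimal sketch of the coin-flip idea, the following uses Python's standard pseudo-random number generator. It is not the algorithm of Appendix 13A; the function name `simple_randomization` and the seed value are illustrative assumptions. Recording the seed allows the same list to be regenerated later for auditing.

```python
import random

def simple_randomization(n_subjects, seed):
    """Assign each subject independently to 'E' or 'C' with probability 1/2."""
    rng = random.Random(seed)  # a recorded seed makes the list reproducible
    return [rng.choice("EC") for _ in range(n_subjects)]

allocation = simple_randomization(20, seed=2024)  # seed chosen arbitrarily
print("".join(allocation))
print("E:", allocation.count("E"), "C:", allocation.count("C"))
```

Because each assignment is independent, nothing forces the two counts to be equal, which is the imbalance issue taken up next.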

The main disadvantage of simple randomization is the possibility of chance imbalances, particularly with small sample sizes. To illustrate: if 10 subjects with a certain characteristic are randomized between two groups, the probability that seven or more of them will end up in one group while the remaining three or fewer end up in the other group is about 0.34. With 30 subjects randomized, the probability of a split as lopsided as 70%:30% or worse is about 0.04; with 100 subjects randomized, it is about 0.00008; and with 1,000 subjects randomized, it is about 2 × 10⁻³⁵. In short, simple randomization balances the size and composition of the intervention groups better and better as the number of units randomized increases.
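Probabilities of this kind follow from the binomial distribution with success probability 1/2. The sketch below computes them exactly; the function name `prob_lopsided` and its parameters are illustrative, and "70%:30% or worse" is interpreted here as the larger group holding at least 70% of the subjects.

```python
from math import comb

def prob_lopsided(n, num=7, den=10):
    """P(larger group holds at least num/den of n subjects) under fair coin flips."""
    threshold = -(-num * n // den)  # ceiling of num*n/den using integer arithmetic
    favorable = sum(comb(n, k) for k in range(n + 1)
                    if max(k, n - k) >= threshold)
    return favorable / 2 ** n

for n in (10, 30, 100):
    print(n, prob_lopsided(n))
```

For 10, 30, and 100 subjects this gives values that round to the 0.34, 0.04, and 0.00008 quoted in the text.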

Restricted randomization refers to the use of another procedure along with random assignment in order to promote balance in the sizes or composition of study groups.

#### Block Randomization

Block randomization, the most common form of restricted randomization, involves first grouping study subjects into sets, or blocks. Within each block, an equal number of subjects is then randomly assigned to each treatment group.

The block size is usually a small integer multiple of the number of treatment groups. For example, if there are two treatment groups, block sizes of 2, 4, or another even number could be used. Often the basis for grouping subjects into blocks is their order of entry into the trial: for block size = 2, the first two subjects entering the trial belong to the first block, the next two subjects to the second block, etc. Doing so assures that the sizes of the treatment groups will be exactly equal at the end of each block during subject enrollment, even if recruitment ends early.

Block randomization gives priority to keeping the sizes of treatment groups balanced. However, it does so at the expense of unpredictability. For example, suppose that the recent sequence of treatment-group assignments (E = Experimental, C = Control) has been: E C E C C E E C C. It is not hard to detect a pattern: within each consecutive pair of subjects, there is one E and one C. The last subject was assigned to C, so the next will probably go to E in order to complete the pair. A care provider who really prefers that his patient, Mr. Jones, receive the experimental treatment may delay referral of Mr. Jones into the trial until it looks likely that he would be assigned to E. Even if the provider lacks complete information about recent assignments or does not detect the pattern, simply waiting to refer Mr. Jones until the previous trial participant went to C has a 3/4 chance of steering Mr. Jones toward receiving the experimental treatment. Needless to say, any such subversion of randomization in forming the treatment groups can create skewed trial results. Unfortunately, experience suggests that subversion of randomization does occur (Schulz, 1995), and it can be very difficult to “repair” a trial that is tainted by it.
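The 3/4 figure can be checked by simulation. The sketch below assumes blocks of size 2 formed by order of entry, as in the example; the function name, the number of simulated blocks, and the seed are arbitrary choices made for illustration.

```python
import random

def chance_next_is_E_after_C(n_blocks=100_000, seed=1):
    """Estimate P(next assignment is E | previous assignment was C), block size 2."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        block = ["E", "C"]
        rng.shuffle(block)  # each pair is EC or CE with equal probability
        sequence.extend(block)
    followers = [nxt for prev, nxt in zip(sequence, sequence[1:]) if prev == "C"]
    return followers.count("E") / len(followers)

print(chance_next_is_E_after_C())
```

The reasoning behind the result: half the time the previous C opened a pair, so the next assignment must be E; the other half it closed a pair, so the next assignment is E with probability 1/2, giving 1/2 + 1/4 = 3/4.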

One way to help prevent corruption of block randomization is to use larger block sizes. With a block size of four, as many as four participants in a row could be assigned to the same treatment group (the first two at the end of one block, the second two at the start of the next block), making it more difficult to detect a pattern and to subvert true random allocation. Block size can also be varied randomly to interject even greater unpredictability.
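A permuted-block sequence with randomly varied block sizes might be generated along the following lines. This is a sketch under stated assumptions, not the procedure of Appendix 13A: two arms, 1:1 allocation, even block sizes, and blocks formed by order of entry. The function name and the default block sizes are hypothetical.

```python
import random

def permuted_block_sequence(n_subjects, block_sizes=(2, 4, 6), seed=0):
    """Two-arm allocation list built from randomly permuted blocks of even size."""
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_subjects:
        size = rng.choice(block_sizes)  # each block's size is itself drawn at random
        block = ["E"] * (size // 2) + ["C"] * (size // 2)
        rng.shuffle(block)              # random order within the block
        sequence.extend(block)
    # Cutting off the last block can leave a small imbalance at the end.
    return sequence[:n_subjects]

seq = permuted_block_sequence(24, seed=5)
print("".join(seq), "E:", seq.count("E"), "C:", seq.count("C"))
```

Passing a single block size, for example `block_sizes=(4,)`, gives ordinary fixed-size block randomization.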

A special case of block randomization can be used to create two equal-sized treatment groups when the exact number of study subjects is known in advance. For example, if k subjects will be studied in a two-arm trial, exactly k/2 of them can be assigned to each of the two treatment groups. Having equal-sized treatment groups maximizes the statistical power of a trial, although there is generally only slight loss of power with modest departures from equal group sizes (Hewitt and Torgerson, 2006; Hedden et al., 2006).

#### Stratified Randomization

Sometimes trial designers deem it important to guarantee that a strong potential confounding factor will be balanced between treatment groups. Stratified randomization can serve this purpose (Kernan et al., 1999). Each level on the confounding factor becomes a stratum, and randomization is carried out separately within each stratum. Block randomization (or some other form of restricted randomization) must be used within each stratum to assure that the ratio of experimental to control subjects is equal across strata.

Because there may be many potential confounders, it can be tempting to try to “micro-manage” randomization by stratifying simultaneously on several possible confounders. Unfortunately, this strategy soon becomes self-defeating. Within each stratum, block randomization assures equal allocation of subjects to treatment arms only at the completion of a block. If there are numerous small strata, many of them can contain uncompleted blocks when recruitment ends, and the desired balance may not be achieved, especially if larger block sizes are used to preserve unpredictability. Hence if stratified randomization is used, the number of strata is best kept small (Kernan et al., 1999). As consolation, post-hoc stratification can always be used in the analysis to correct for imbalances, even if stratified randomization was not used to assign treatments.
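The mechanics, and the problem of uncompleted blocks, can be sketched as follows. The class name `StratifiedRandomizer`, the stratum labels, and the block size are illustrative assumptions; each stratum simply keeps its own permuted block.

```python
import random
from collections import defaultdict

class StratifiedRandomizer:
    """Permuted-block randomization carried out separately within each stratum."""

    def __init__(self, block_size=4, seed=0):
        self.block_size = block_size      # assumed even, for 1:1 allocation
        self.rng = random.Random(seed)
        self.pending = defaultdict(list)  # unused assignments in each stratum's open block

    def assign(self, stratum):
        if not self.pending[stratum]:     # start a new block for this stratum
            half = self.block_size // 2
            block = ["E"] * half + ["C"] * half
            self.rng.shuffle(block)
            self.pending[stratum] = block
        return self.pending[stratum].pop()

randomizer = StratifiedRandomizer(block_size=4, seed=11)
strata = [("site A", "age < 75"), ("site A", "age >= 75"), ("site B", "age < 75")]
for stratum in strata * 2 + strata[:1]:   # seven hypothetical enrollees
    print(stratum, randomizer.assign(stratum))

open_blocks = sum(1 for left in randomizer.pending.values() if left)
print("strata with an uncompleted block:", open_blocks)
```

With only a few subjects per stratum, every stratum here ends with an open block, which illustrates why balance can fail when strata are numerous and small.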

#### Adaptive Allocation and Related Methods

A variety of other allocation methods have been described, mostly seeking to improve on the balancing property of randomization, including minimization (Scott et al., 2002), biased-coin allocation (Efron, 1971), and urn randomization (Wei and Lachin, 1988). These methods are called adaptive because the probability of the next subject’s assignment to a given study group is allowed to vary depending on the outcome of previous assignments. The strongest case for their use can be made for small trials, when the risk of imbalance with simple randomization is greatest, and for sequential trials (described below) that may end early with modest sample sizes (Hedden et al., 2006). However, standard statistical tests that rely on an assumption of simple randomization may no longer apply, requiring that special analysis methods be used for proper statistical inference.
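To make the idea of an adaptive rule concrete, here is a minimal sketch of biased-coin allocation in the spirit of Efron (1971): when the groups are equal in size a fair coin is used, and otherwise the smaller group is favored with a fixed probability (2/3 is assumed here as the default). The function name and parameters are illustrative, and the sketch does not address the special analysis methods mentioned above.

```python
import random

def biased_coin_sequence(n_subjects, p_favor_smaller=2 / 3, seed=0):
    """Two-arm allocation in which the currently smaller group is favored."""
    rng = random.Random(seed)
    counts = {"E": 0, "C": 0}
    sequence = []
    for _ in range(n_subjects):
        if counts["E"] == counts["C"]:
            p_experimental = 0.5                  # groups balanced: fair coin
        elif counts["E"] < counts["C"]:
            p_experimental = p_favor_smaller      # E is behind: favor E
        else:
            p_experimental = 1 - p_favor_smaller  # C is behind: favor C
        arm = "E" if rng.random() < p_experimental else "C"
        counts[arm] += 1
        sequence.append(arm)
    return sequence

seq = biased_coin_sequence(40, seed=3)
print("".join(seq), "E:", seq.count("E"), "C:", seq.count("C"))
```

Unlike block randomization, no assignment is ever fully determined by its predecessors, yet large imbalances are pulled back toward equality.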

#### Remarks on Sequence Generation

While there are no universally accepted rules for which randomization method is best under which circumstances, some general guidance can be offered, based largely on Schulz and Grimes (2002b). For trials with about 100 or more subjects per group and no interim analyses planned, simple randomization can be expected to produce good balance, and it maximizes unpredictability. If total sample size is known in advance, single-block randomization equalizes group sizes, thus maximizing power. Otherwise, block randomization with many small blocks based on order of entry can keep group sizes balanced throughout recruitment, which facilitates interim analyses and possible early termination. Randomly varying block sizes promotes unpredictability. Stratified randomization is needed most in trials with fewer than about 50 subjects per treatment group when perhaps one or two strong prognostic factors are known a priori. Within strata, block randomization (possibly single-block, if stratum sample sizes are known in advance) is needed to keep treatment group sizes balanced. Adaptive allocation can be worth considering in special situations when there are good reasons to doubt the adequacy of conventional randomization methods and if resources are available to apply the non-standard analysis methods that may be required.

Regardless of the method chosen, randomization is not technically difficult or costly to carry out. In order to reap its major theoretical benefits, and in view of the CONSORT guidelines that call for detail in reporting, careful efforts should be made to randomize properly. Appendix 13A describes straightforward algorithms that use computer-generated random numbers produced by standard statistical software and that can be implemented so as to leave an audit trail.

Once the study design and type of randomization have been decided upon, the sequence of random assignments can almost always be generated at the beginning of the trial, before any subjects are enrolled. Doing so allows the sequence to be checked for technical correctness when there is still time to fix any programming errors gracefully. Early preparation of the allocation list can also simplify the process of assigning subjects to treatment groups when enrollment gets underway. Once a subject has been found eligible and has given consent, authorized study staff need only look up the subject’s assignment on a list that has already been prepared.

### Allocation Concealment

While there are good reasons to generate the allocation list for a trial early, access to the list must be carefully controlled in order to preserve unpredictability and prevent selection bias. Inadequate allocation concealment has been found in several studies to be associated with differences in trial results (Odgaard-Jensen et al., 2011), suggesting that bias is not just a theoretical possibility.

Fortunately, this kind of bias is almost always preventable. Probably the most effective method of allocation concealment is centralized randomization. When a potential new trial participant has been identified, the central study office is contacted by telephone or other means. A trained staff member whose primary responsibility is to protect study integrity receives the call, verifies the subject’s eligibility, registers his/her identifying information, finds the subject’s treatment assignment on a master allocation list that is kept secure at study headquarters, and then provides appropriate instructions to the caller about action to be taken for the newly enrolled subject. Alternatively, for placebo-controlled drug trials, trained pharmacy staff can register the subject and dispense sequentially numbered containers that have been pre-filled with the appropriately randomized experimental drug or placebo. Other systems include using sequentially numbered, sealed, opaque envelopes that contain treatment-group assignments and that can be opened in the field only after a new subject has been duly enrolled. However, it is difficult to make such a system completely tamper-proof without oversight by an independent third party responsible to the trial (Hewitt et al., 2005; Schulz and Grimes, 2002a).

### Implementation

The CONSORT guidelines also call for reporting how key procedures related to randomization were actually carried out: in particular, who was responsible for which tasks. CONSORT recommends complete separation between (a) the persons who generate the random assignment sequence and who keep it concealed; and (b) those who enroll participants and carry out trial procedures once a subject has been assigned to a treatment group (Moher et al., 2010). The aim is to maintain an information “firewall” that prevents premature disclosure of treatment assignments.

## Data Collection

### Baseline Characteristics

Information on study subjects at entry into a trial serves several purposes:

• Verify eligibility. Some data must be gathered to confirm that a subject qualifies for entry according to the trial’s eligibility criteria.

• Describe the study population. Baseline data also serve to describe the study population and thus help consumers of the results to judge the scope of generalizability. Often the first table in a published trial report shows demographic and clinical characteristics of the groups formed by randomization, as called for by the CONSORT guidelines (Schulz et al., 2010; Altman and Dore, 1990; Assmann et al., 2000).

• Assess potential confounding. While randomization works well on average to create balanced treatment groups, “accidents of randomization” do occur. The initial table describing the study population in a trial report usually also permits a check on the degree to which randomization balanced the groups. Characteristics that are known determinants of the trial’s main outcomes are especially important, because imbalance between treatment arms would make these factors confounders. Information on such factors may also be needed for use in stratified or blocked randomization.

• Identify planned subgroups. Some trials have a priori hypotheses about variation in the size of treatment effects across different subgroups (effect modification). Hence baseline information is needed to place study subjects in the relevant subgroup for later analysis. For example, in the Multiple Risk Factor Intervention Trial, the investigators hypothesized in advance that men with a normal baseline electrocardiogram would benefit most from intervention (Multiple Risk Factor Intervention Trial Research Group, 1982).

• Enhance study power. If a study outcome is also a characteristic that can be measured at baseline—e.g., blood pressure, body weight, or bone density—then a trial’s statistical power can often be enhanced by using each subject’s baseline value in the analysis of later outcomes. To the extent that baseline and follow-up values are correlated within subjects, including the baseline value in the analysis can remove a source of between-subject variation that would otherwise be treated as part of random variation in outcomes (Fleiss, 1986, Chapter 7).
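As a sketch of why using baseline values helps, assume for illustration only that the baseline value X and the follow-up value Y have the same between-subject variance σ² and a within-subject correlation ρ. The variability that enters a comparison between treatment arms then depends on how the baseline value is used:

$$
\begin{aligned}
\operatorname{Var}(Y) &= \sigma^2 && \text{(follow-up value alone)}\\
\operatorname{Var}(Y - X) &= 2\sigma^2(1-\rho) && \text{(change from baseline)}\\
\operatorname{Var}(Y - \rho X) &= \sigma^2(1-\rho^2) && \text{(regression adjustment for baseline)}
\end{aligned}
$$

Under these assumptions, change scores are less variable than follow-up values alone only when ρ exceeds 1/2, whereas regression adjustment never increases the variance. With ρ = 0.7, for example, the adjusted variance is 51% of the unadjusted variance, and because required sample size is proportional to outcome variance, roughly half as many subjects would be needed for the same power.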

### Outcomes

A trial’s outcome measures usually follow naturally from its objectives. But like cohort studies, randomized trials lend themselves to studying multiple outcomes of a given exposure or intervention, and these outcomes can be distinguished from one another in several ways.

#### Primary and Secondary Outcomes

The CONSORT guidelines mandate that a trial’s primary and secondary outcomes be clearly specified in trial reports. Primary outcomes are those deemed most important to users of trial results and are those used in sample-size or power calculations. Whether the trial has chiefly explanatory or pragmatic aims can affect the choice of measures: outcomes that matter for testing theory may matter less for guiding practice, and vice versa (Table 13.1). But it is often easy to add secondary outcome measures to the data collection plan without complicating the basic study design. The GEM study, for example, provided a convenient setting in which to examine the effects of Ginkgo biloba on endpoints other than dementia, including serious bleeding episodes, coronary heart disease, and stroke (DeKosky et al., 2008).

#### Intermediate Outcomes

Investigators often have in mind a causal model under which an intervention should produce certain effects in a certain order. For example, in a trial of sodium fluoride to prevent bone fractures in postmenopausal women with osteoporosis, sodium fluoride was expected to increase cancellous bone formation and thus to increase bone density, leading to reduced incidence of fractures (Riggs et al., 1990). Both bone density and fracture incidence were measured as outcomes. Bone density would be termed an intermediate outcome, while fracture occurrence would be a later clinically important outcome. Often, as here, an intermediate outcome is a biomarker measuring some physiological variable or physical sign that the experimental treatment is expected to affect early but that may be imperceptible to participants. A later, clinically important outcome reflects a subject’s symptoms, functioning, or survival (Lassere, 2008; Bucher et al., 1999).

One function of an intermediate outcome is to permit testing assumptions about an intervention’s mechanism of action. In the randomized trial described earlier (Example 13-1) of smoking cessation counseling to prevent low birth weight in infants of smoking mothers, two outcomes were measured: (1) whether pregnant smokers stopped smoking, and (2) infant birth weight. More than twice as many pregnant smokers in the intervention group as in the control group quit smoking (43% vs. 20%), and babies born to intervention-group mothers averaged 92 grams heavier. Together, these findings suggested that the intervention did indeed work as intended. But had there been no difference in mean birth weight between groups, one might wonder whether: (1) the intervention had been ineffective in getting pregnant mothers to stop smoking, or (2) the causal link between smoking in pregnancy and low infant birth weight was not as strong as had been supposed. Knowing whether maternal smoking behavior differed between groups would have provided a useful clue about where the hypothesized causal chain was broken.

The situation is different, however, if only an intermediate outcome is measured while the clinically important outcome is not. In that context, the intermediate outcome is often termed a surrogate outcome: it is used as a substitute for the clinically important outcome. An association between the surrogate outcome and the later clinically important endpoint may be considered to be already well established, so measurement of the surrogate outcome alone is proposed as sufficient for evaluating efficacy of an intervention. A major attraction of this approach is that the trial can almost certainly be completed earlier and may require many fewer subjects than if the clinically important outcome were the primary target. However, the following example illustrates a pitfall of this strategy:

Example 13-7. The Cardiac Arrhythmia Suppression Trial (CAST) (Echt et al., 1991) was motivated by two observations: (1) myocardial infarction survivors whose electrocardiogram showed frequent premature ventricular contractions (PVCs) had an increased risk of sudden cardiac death, and (2) anti-arrhythmic drugs could reduce the frequency of PVCs in these patients. In early testing, two drugs—encainide and flecainide—were found to be particularly effective in suppressing PVCs (Cardiac Arrhythmia Pilot Study Investigators, 1988). The main CAST trial was then mounted to assess the ability of these drugs to save lives. After 10 months, the results shown in Table 13.2 were obtained, and the trial was stopped. Unexpectedly, both arrhythmic and non-arrhythmic cardiac deaths were more common among patients who received encainide or flecainide than among those who received placebo. Although the precise mechanisms for the observed excess deaths were unclear, these drugs evidently had important cardiac side effects beyond their ability to suppress PVCs. Encainide was subsequently taken off the market, and the clinical indications for flecainide were curtailed. Note that using PVC suppression alone as an intermediate endpoint to measure effectiveness would have been dangerously misleading.

Table 13.2. Mortality Experience in the Cardiac Arrhythmia Suppression Trial

| | Encainide or flecainide | Placebo |
|---|---|---|
| Patients randomized | 755 | 743 |
| Number of deaths due to: | | |
| Arrhythmia | 43 | 16 |
| Other cardiac cause | 17 | 5 |
| Non-cardiac cause | 3 | 5 |
| All causes | 63 | 26 |

(Based on data from Echt et al. [1991])
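The mortality comparison in Table 13.2 can be made explicit with a few lines of arithmetic. The following is a minimal sketch in Python that takes the counts from the table and computes cumulative mortality in each arm and the corresponding risk ratio by cause of death. The function and variable names are illustrative only, and no confidence limits or significance tests are attempted.

```python
# Counts from Table 13.2 (Echt et al., 1991), given as (active drug, placebo).
RANDOMIZED = (755, 743)
DEATHS = {
    "Arrhythmia": (43, 16),
    "Other cardiac cause": (17, 5),
    "Non-cardiac cause": (3, 5),
    "All causes": (63, 26),
}


def risk_ratio(events_a, n_a, events_b, n_b):
    """Ratio of cumulative incidence in group A to that in group B."""
    return (events_a / n_a) / (events_b / n_b)


# Risk ratio (active drug vs. placebo) for each cause of death.
cast_rr = {
    cause: risk_ratio(drug, RANDOMIZED[0], placebo, RANDOMIZED[1])
    for cause, (drug, placebo) in DEATHS.items()
}

for cause, rr in cast_rr.items():
    drug, placebo = DEATHS[cause]
    print(f"{cause:20s} {100 * drug / RANDOMIZED[0]:4.1f}% vs. "
          f"{100 * placebo / RANDOMIZED[1]:4.1f}%   RR = {rr:.2f}")
```

On these counts, all-cause mortality was roughly 2.4 times as high in the active-drug arm as in the placebo arm, even though the drugs had the intended effect on the surrogate outcome.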

The sodium fluoride trial to prevent bone fractures, described earlier, is a second example. Bone density was indeed 10–35% higher in sodium fluoride recipients than in placebo recipients, depending on anatomical location, but the incidence of non-vertebral fractures proved to be about three times greater in the sodium fluoride group. Although treatment with sodium fluoride made skeletal bones denser, it evidently also made them more fragile (Riggs et al., 1990).

These and other studies using intermediate outcomes (Lassere, 2008; Grimes and Schulz, 2005; Bucher et al., 1999; Psaty et al., 1999; Avorn, 2013) have taught sobering lessons about the potential danger of relying on surrogate outcomes as evidence of clinical effectiveness. The key problem is that an intervention can affect clinical outcomes through mechanisms other than those involving the intermediate outcome. These other mechanisms may not be known in advance but can turn out to be very important, even predominant.

Prentice (1989) set forth the following principle: a valid surrogate outcome is “a response variable for which a test of the null hypothesis of no relationship to the treatment groups under comparison is also a valid test of the corresponding null hypothesis based on the true endpoint.” That is, the intermediate outcome measure must fully capture the effects of treatment on the clinical endpoint. Unfortunately, it is difficult to verify that this requirement is met without having data on both the intermediate and the clinically important endpoints, a situation that, ironically, eliminates the need to rely exclusively on the intermediate outcome. Lassere (2008) and Weir and Walley (2006) review statistical approaches aimed at this problem.

#### Composite Endpoints

Some trials define their primary outcome as a composite endpoint: a subject is counted as having had an outcome event if he/she experiences any one of several possible clinical occurrences, some of which may be fatal and others nonfatal (Freemantle et al., 2003).

Example 13-8. The Trial to Assess Chelation Therapy (TACT) aimed to determine whether sodium EDTA infusions, a controversial treatment used mainly by alternative and complementary medicine practitioners, would improve outcomes after myocardial infarction (Lamas et al., 2013). A total of 1,708 participants from 134 clinical sites were randomized to either 40 EDTA infusions or 40 placebo infusions administered over several months. The primary endpoint was any occurrence of: death from any cause, reinfarction, stroke, undergoing coronary revascularization, or hospitalization for cardiac chest pain. The incidence of this composite outcome proved to be 18% lower (95% CI: 1%–31%) in EDTA recipients than in placebo recipients—a difference that barely crossed the threshold for statistical significance. Analyses that focused on individual components of the composite outcome showed roughly similar proportionate reductions in each of the individual components, but none were statistically significant at p < 0.05.

A major motivation for using a composite endpoint is to reduce the trial’s sample-size requirements: incidence of the composite endpoint will be greater than the incidence of any of its components, leading to lower sample-size requirements to detect a certain proportionate reduction in incidence. Using a single composite outcome can also help circumvent a multiple-testing problem that might otherwise occur if statistical tests are conducted on each of several outcome types. A limitation of composite endpoints, however, is that they implicitly treat their different components as equally important—a view that may not be shared by consumers of trial results. In addition, reporting a single treatment effect may be taken to imply a similar effect on all components of the composite (Freemantle et al., 2003). The trial results themselves may even suggest different effects on different components, but with insufficient precision to confirm or refute this inference conclusively. The TACT trial was criticized on those grounds (Nissen, 2013).
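The sample-size argument can be illustrated with a rough calculation. The sketch below uses the familiar normal-approximation formula for comparing two proportions; the incidences (5% for a single component, 20% for a composite) and the 20% proportionate reduction are hypothetical values chosen only for illustration, and a real trial would use a more careful method that allows for follow-up time and attrition.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control, rel_reduction, alpha=0.05, power=0.80):
    """Approximate subjects needed per arm to detect a given proportionate
    reduction in cumulative incidence (two-sided test, equal allocation)."""
    p_treated = p_control * (1 - rel_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical control-group incidences; same 20% proportionate reduction.
n_single_component = n_per_group(p_control=0.05, rel_reduction=0.20)
n_composite = n_per_group(p_control=0.20, rel_reduction=0.20)

print("Per-arm sample size, single component:", n_single_component)
print("Per-arm sample size, composite endpoint:", n_composite)
```

Under these assumptions the composite endpoint requires far fewer subjects per arm, which is the attraction described above; the calculation says nothing about whether the components deserve equal weight.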

### Blinding

Blinding (or masking, especially in trials of eye diseases) refers to keeping persons involved in a trial unaware of which study subjects are in which treatment arm during the trial. (Allocation concealment, discussed earlier, involves hiding information about the treatment group to which a subject will be assigned until after the assignment is made; blinding involves continuing to keep that information hidden throughout the trial.) The main reasons for this deliberate withholding of information are to prevent bias in ascertainment of outcomes and to minimize differential attrition of subjects.

A trial is double-blind if both the study subjects and the research staff members responsible for measuring outcomes are kept unaware of treatment-group assignments. A trial is single-blind if only one of these parties (usually study subjects) is kept unaware. However, blinding can also be extended to people who play other roles. In clinical trials, a patient’s caregiver may not be the person who measures outcomes, but the caregiver’s actions can nonetheless influence outcomes and be influenced by knowing which intervention the patient is receiving. The study protocol may therefore keep caregivers blinded while providing an escape mechanism to break the blinding if knowing to which arm of the study the patient has been assigned becomes critical for care delivery. Even the statistician(s) responsible for day-to-day data analysis may be kept blinded to the identity of the intervention being received by each of the study groups, because it may be hard for a statistician who is in regular contact with other members of a research team to remain completely neutral as to the expected outcome. Most large trials create an independent Data Safety and Monitoring Board, whose members have exclusive access to information about which group is which. These people are charged with deciding whether any differences in outcomes across study groups justify breaking the blinding and halting the trial.

Blinding is not always possible. In community trials, for example, the intervention may involve a mass-media campaign to influence health behavior (Bauman and Koepsell, 2006), which is clearly at odds with keeping people unaware of which treatment arm they are in. Even when blinding is attempted, it may be only partly successful because it is difficult to design a perfect placebo. Certain characteristic side effects, for example, may be hard to mimic.

The success of blinding can often be evaluated at trial’s end simply by asking people to guess which treatment they received (Colagiuri, 2010). In the GEM study, for example, 61% of all participants who completed an exit interview believed they had been taking placebo, and 39% believed they had been taking Ginkgo biloba, but those percentages were nearly identical in both treatment groups. However, attempts at blinding can also fall well short of their goal (Fergusson et al., 2004; Byington et al., 1985; Howard et al., 1982). Sometimes differential attrition from the trial can also provide indirect evidence that suggests unblinding of at least some participants. For example, in the TACT trial of chelation therapy after myocardial infarction (Example 13-8), a larger proportion of patients in the placebo group dropped out during the trial, suggesting that some of them may have detected that they had been randomized to ineffective therapy (Lamas et al., 2013; Nissen, 2013). Nonetheless, even if some trial participants correctly judge the nature of the intervention arm to which they have been assigned, removing part of the potential bias is arguably better than removing none at all.

## Analysis

Initial analyses describe the study sample and compare the treatment groups, especially with regard to known determinants of study outcomes, as a check on randomization. Occasional statistically significant differences can occur by chance, but an unexpectedly large number of them may raise suspicion about whether sequence generation or allocation concealment went awry.

### Intent-to-Treat Principle

An important potential pitfall in the analysis and interpretation of randomized trials is illustrated by a now-famous example:

Example 13-9. The Coronary Drug Project was a multicenter, randomized, double-blind, placebo-controlled trial, one aim of which was to test whether treatment with the lipid-altering agent clofibrate could reduce mortality from coronary heart disease (Coronary Drug Project Research Group, 1980). Compliance with the intended drug regimen was checked by pill counts at study visits. Among trial participants assigned to the clofibrate arm who actually took 80% or more of the clofibrate dispensed, five-year cumulative mortality was 15.0%, compared to 24.6% among patients who were less compliant (p = 0.00011). It is tempting to infer that clofibrate worked well among those who took it, but that it could not benefit patients who failed to take it. However, among patients in the control arm who were 80%+ compliant with placebo, five-year cumulative mortality was 15.1%, compared with 28.3% among those who were less compliant with placebo (p < 5 × 10⁻¹⁶). Multivariable adjustment for 40 baseline characteristics had little effect on these findings. Five-year cumulative mortality among all patients randomized to clofibrate was 20.0%, compared to 20.9% among those randomized to placebo (p = 0.55), suggesting that use of clofibrate itself had little effect.

Under the intent-to-treat principle, the primary analysis in a randomized trial should compare outcomes between the groups formed by randomization. In other words, each participant is categorized according to what intervention he or she was intended to receive. Recall that the main reason for using the randomized-trial design in the first place is to form groups that can be assumed to be similar, even with regard to unmeasured factors that may affect outcomes. If the composition of one or more of those groups is altered, this balancing property of randomization is lost. A study that began as a randomized trial would, in effect, be converted into a non-randomized study in which confounding may be present.

Several circumstances can tempt investigators to depart from the intent-to-treat principle:

• Non-compliance with the intended treatment can occur, as in the Coronary Drug Project.

• Crossing over can occur if persons who were originally randomized to one treatment end up receiving the other. For example, in a comparison of watchful waiting versus early surgical repair for men with a minimally symptomatic inguinal hernia, 23% of those randomized to watchful waiting ended up receiving surgical repair, while 17% of those randomized to early surgical repair ended up opting for watchful waiting (Fitzgibbons et al., 2006).

• Selective exclusions can occur if participants are deliberately dropped from analysis after randomization. For example, in the Joint Study of Extracranial Arterial Occlusion (Sackett and Gent, 1979), the risk of recurrent transient ischemic attack, stroke, or death was reportedly reduced by 27% (p = 0.02) in patients who had surgery to bypass an arterial occlusion when the analysis was based on those “available for follow-up.” However, this analysis excluded 15 patients who had been randomized to surgery but who died early or had perioperative strokes, while only one patient who had been randomized to medical treatment was excluded after randomization. In an intent-to-treat analysis, patients who were assigned to undergo surgery had only a 16% reduction in risk (p = 0.09).

Late exclusions can often be reduced by delaying randomization until as late as possible, so that some persons who are destined to drop out early do so before randomization.

The price paid for keeping the benefits of randomization in an intent-to-treat analysis is typically a diluted treatment effect. (Exclusions after randomization can shift the results in either direction depending on the reasons for exclusion.) In an effectiveness trial, this dilution may be an accurate reflection of what to expect under real-world conditions and thus may be consistent with trial goals. In an efficacy trial, a diluted treatment effect is unwanted, but at least the direction of bias is generally predictable, toward finding no difference. Any treatment effect found in an intent-to-treat analysis is thus likely to be a conservative estimate of efficacy.

### Estimating Efficacy Indirectly

In some trials, efficacy of an intervention can be estimated from study data even in the face of non-compliance with the intended treatment, under certain assumptions. One method was described in Chapter 12, which involved using the randomized treatment-group assignment as an instrumental variable. Another method, based on counterfactuals, is illustrated in the following example:

Example 13-10. A randomized trial was conducted in Indonesia to determine whether vitamin A supplementation would reduce mortality among preschool children (Sommer and Zeger, 1991; Sommer et al., 1986). Of 450 study villages, 229 were randomly selected to have their age-eligible children receive two doses of oral vitamin A. Children in the control villages received no supplements. However, for various reasons, including logistical problems and parental non-compliance, about 20% of children in the vitamin-A villages did not receive vitamin A supplements as intended. During follow-up, child deaths were distributed as follows:

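Table 13.3.

| | n | No. of deaths | Cumulative mortalityᵃ |
|---|---|---|---|
| Intervention | 12,094 | 46 | 3.8 |
| Received vitamin A | 9,675 | 12 | 1.2 |
| Did not receive vitamin A | 2,419 | 34 | 14.1 |
| Control | 11,588 | 74 | 6.4 |

ᵃ Deaths per 1,000 children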

The effect of the overall intervention program can be estimated from an intent-to-treat analysis. The risk difference was 3.8 − 6.4 = − 2.6 deaths per 1,000 children, or a (6.4 − 3.8)/6.4 × 100% = 41% reduction in cumulative mortality.

However, the investigators also wanted to estimate the efficacy of vitamin A supplements in children who actually received them. A naïve estimate of 1.2 − 14.1 = −12.9 deaths per 1,000 (a 91% reduction) would be confounded by whatever factors influenced receipt or non-receipt of vitamin A in the study group, and these factors would be difficult to identify and control.

A better estimate that avoids this problem can come from envisioning the results if the treatment assignments had been reversed. Had the control group been assigned to receive vitamin A, some children in that group would have received the supplements and some would not. The two treatment groups and their subgroups can thus be organized and labelled as follows:

Table 13.4.

| | n | No. of deaths | Cumulative mortalityᵃ |
|---|---|---|---|
| Intervention | 12,094 | 46 | 3.8 |
| A: Received vitamin A | 9,675 | 12 | 1.2 |
| B: Did not receive vitamin A | 2,419 | 34 | 14.1 |
| Control | 11,588 | 74 | 6.4 |
| C: Would have received vitamin A | | | |
| D: Would not have received vitamin A | | | |

ᵃ Deaths per 1,000 children

Because randomization produces two exchangeable groups, whatever factors influence a child’s receipt of vitamin A should be equally common in the control group and the intervention group. Therefore, given that 2,419/12,094 = 20% of the intervention group did not receive vitamin A, we would expect that 20% of the control group would not have received vitamin A either, if it had been their assigned treatment. Subgroup D should thus contain about 11,588 × 20% ≈ 2,318 children. Moreover, it is reasonable to expect that the cumulative mortality in subgroup D should be the same as the cumulative mortality actually observed in subgroup B. This is because they are equivalent subgroups, and neither received vitamin A supplements (albeit for different reasons). Thus, there should be about 2,318×14.1/1,000 ≈ 33 deaths among children in subgroup D. Now, given that the size and number of deaths in subgroup D have been estimated, the size and number of deaths in subgroup C can be obtained by subtraction from the entire control group, yielding:

Table 13.5.

| | n | No. of deaths | Cumulative mortalityᵃ |
|---|---|---|---|
| Intervention | 12,094 | 46 | 3.8 |
| A: Received vitamin A | 9,675 | 12 | 1.2 |
| B: Did not receive vitamin A | 2,419 | 34 | 14.1 |
| Control | 11,588 | 74 | 6.4 |
| C: Would have received vitamin A | 9,270 | 41 | 4.4 |
| D: Would not have received vitamin A | 2,318 | 33 | 14.1 |

ᵃ Deaths per 1,000 children

The effect of receiving vitamin A supplements can now be estimated as 1.2 − 4.4 = −3.2 per 1,000 children, or about a 73% reduction in cumulative mortality. This result can be interpreted as the estimated efficacy of vitamin A supplements among children who received them, or who would have received them if assigned to do so.
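The steps of this counterfactual calculation can be retraced in code. The sketch below is one possible rendering in Python, with illustrative variable names; it follows the text in rounding the estimated size of subgroup D and its deaths to whole numbers, and it relies on the same assumptions (the two arms are exchangeable, and assignment to the intervention has no effect on children who would not receive the supplement).

```python
def per_1000(deaths, n):
    """Cumulative mortality expressed as deaths per 1,000 children."""
    return 1000 * deaths / n


# Observed counts (Sommer and Zeger, 1991).
n_a, deaths_a = 9_675, 12        # A: intervention arm, received vitamin A
n_b, deaths_b = 2_419, 34        # B: intervention arm, did not receive it
n_ctrl, deaths_ctrl = 11_588, 74

# Subgroup D: controls who would not have received vitamin A if assigned.
# Assumed to be the same fraction, with the same mortality, as subgroup B.
frac_not_received = n_b / (n_a + n_b)
n_d = round(n_ctrl * frac_not_received)
deaths_d = round(n_d * deaths_b / n_b)

# Subgroup C: obtained by subtraction from the whole control arm.
n_c = n_ctrl - n_d
deaths_c = deaths_ctrl - deaths_d

mort_a = per_1000(deaths_a, n_a)
mort_c = per_1000(deaths_c, n_c)
risk_difference = mort_a - mort_c
relative_reduction = (mort_c - mort_a) / mort_c

print(f"D: n = {n_d}, deaths = {deaths_d}")
print(f"C: n = {n_c}, deaths = {deaths_c}, mortality = {mort_c:.1f} per 1,000")
print(f"Risk difference = {risk_difference:.1f} per 1,000")
print(f"Relative reduction = {100 * relative_reduction:.0f}%")
```

Because the code carries unrounded mortality rates forward, the relative reduction it prints (about 72%) differs slightly from the 73% obtained from the rounded rates 1.2 and 4.4.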

### Subgroup Analyses

The primary aim of most trials is to determine whether one strategy is better overall than another. Still, it is usually reasonable to assume that different subjects may respond differently to an intervention and that it may therefore be more effective in some subjects than in others. Often particular subgroups are identified in advance in whom an intervention effect is expected to be larger or smaller than in other subgroups. For example, in the study of β‎-carotene for prevention of lung cancer, the investigators hypothesized that participants who had a low serum β‎-carotene level at baseline would benefit more from active treatment than those who had a high baseline β‎-carotene level (Omenn et al., 1996). Even when there are few or no such a priori subgroup hypotheses, investigators often conduct exploratory analyses to search for larger or smaller intervention effects in multiple subgroups defined according to a variety of factors. The temptation to do so is especially great when trial results suggest little or no overall effect, which may represent a disappointing failure of what was thought to be a promising intervention. A “positive” effect in a subgroup can be viewed as rescuing an otherwise “negative” study. Pocock et al. (2002) found that in a sample of 50 trial reports in several major medical journals, 70% presented results of subgroup analyses.

Subgroup analyses are subject to several important pitfalls, however (Rothwell, 2005; Barraclough and Govindan, 2010; Oxman and Guyatt, 1992; Head et al., 2013; Yusuf et al., 1991):

1. Increased risk of Type II errors. The target number of subjects for most trials is driven by a main hypothesis that posits an overall treatment effect of a certain size. Because a subgroup is inevitably smaller than the full study population, statistical tests for a similar treatment effect within subgroups have less statistical power. In principle, if an a priori subgroup hypothesis is of sufficient scientific importance, trial size can be increased accordingly during the planning phase, but this is unusual.

2. Increased risk of Type I errors. Because a study population can be divided up in many ways, subgroup analyses can quickly present a multiple-comparisons problem (Schulz and Grimes, 2005). The more ways one looks for subgroup differences, the more likely it is that some “statistically significant” ones will be found, even if they reflect only the play of chance. This possibility is especially likely in a post-hoc exploratory search for an intervention effect in each of many subgroups.

To illustrate this point, investigators on a trial of early aspirin versus placebo in patients with suspected myocardial infarction found that, overall, early aspirin reduced 5-week mortality by about 23% (95% CI: 15%, 31%) (ISIS-2 Collaborative Group, 1988). However, after a search for subgroup differences, they found that whether early aspirin was beneficial or not appeared to depend on the presence or absence of a certain marker. Among subjects with this marker, mortality in the aspirin group was actually 9% higher than in the placebo group (95% CI: –15%, 37%), while among those without the marker, mortality in the aspirin group was 28% lower (95% CI: 18%, 38%). So what was this magic marker? Having an astrological sign of Gemini or Libra!

One way to help combat the multiple-comparisons problem is to apply interaction tests. For a factor C with two or more categories, one first conducts a single omnibus test of whether adding all (treatment group) × (C) interaction terms at once significantly improves the fit of a statistical model predicting outcomes, compared to a model that includes only treatment group and C (without their interactions) as predictors. Only if this omnibus interaction test is statistically significant does one proceed to examine treatment effects within the individual subgroups defined by different categories of C (Pocock et al., 2002). Methods also exist to “correct” subgroup-specific p-values for the number of different statistical tests done, but their use remains controversial (Savitz and Olshan, 1995; Bender and Lange, 2001). Methodologists generally advise limiting the severity of the multiple-comparisons problem by specifying only a few well-justified, planned subgroup comparisons in advance, and by interpreting post-hoc subgroup comparisons with great caution (Schulz and Grimes, 2005; Brookes et al., 2004; Yusuf et al., 1991).

3. Formation of subgroups on characteristics that may be influenced by treatment assignment. Bias can arise if subgroups are formed according to characteristics influenced by events after randomization. Whether a trial participant ends up in a certain subgroup may then depend on the treatment group to which he or she was assigned. In the Coronary Drug Project example, one might be tempted to try to circumvent the “dilution” of treatment effects that would occur in an intent-to-treat analysis by comparing outcomes only among patients who were compliant with their assigned treatment. But compliance was determined only after randomization and may well have depended on the particular treatment received. A patient who was compliant with placebo might not have complied with clofibrate, and vice versa. Hence this analysis cannot be considered a true randomized comparison.
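The inflation of Type I error described under the second pitfall can be quantified with a simple calculation. As a minimal sketch, assume k subgroup tests that are statistically independent, each conducted at significance level α, when no true subgroup effect exists. This is an idealization, since real subgroup tests are often correlated. The probability of at least one spurious “significant” finding is then 1 − (1 − α)^k, which the hypothetical helper below evaluates.

```python
def prob_any_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent tests is 'significant'
    at level alpha when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 12, 20):
    print(f"{k:2d} independent tests: "
          f"P(at least one false positive) = {prob_any_false_positive(k):.2f}")
```

Under this assumption, a dozen independent subgroup tests would carry nearly an even chance (about 0.46) of producing at least one nominally significant result by chance alone.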

Issues surrounding the interpretation of associations whose presence or size varies among subgroups of the study population are not confined to randomized trials and will be further discussed in Chapter 18.

## Design Variations

So far, we have mainly considered randomized trials in which individual study subjects are randomized to one of two arms and are then followed concurrently to assess and compare outcomes. But while random assignment to exposure groups is the defining characteristic of a randomized trial, a number of other design features can be used in conjunction with randomization to address different kinds of scientific questions. A few are discussed below. These features are not necessarily mutually exclusive and can be used in combination.

### Noninferiority Trials

Most trials seek to determine whether one intervention strategy is superior to another. But suppose that one of two treatments has already been well studied, is deemed effective, and has become an accepted standard. The other treatment, of unknown effectiveness, is a new alternative that offers some other advantage(s) over the standard treatment: for example, it may be less costly, have fewer side effects, be more widely available, or be easier to use. If the new treatment is nearly as effective as the standard treatment, the new one would probably be preferred on the basis of its other advantages. The main goal of a trial comparing the two treatments is then to test whether the new treatment is no less effective than the standard: in other words, whether the new treatment is noninferior.

More formally, it is assumed that users would actually prefer the new treatment even if it were slightly less effective than the standard, in view of its other advantages. The difference-in-effectiveness value at which users would be indifferent between the new and old treatment is termed the noninferiority threshold, Δ. A value for Δ must be specified in order to plan a noninferiority trial. Δ can be specified in terms of the true difference in cumulative incidence of the primary outcome between treatments, or on a ratio scale (e.g., relative risk or hazard ratio). Considerations in setting a value for Δ are discussed by Mulla et al. (2012) and Fleming et al. (2011).

Once the trial has been completed, it produces an observed risk difference D for the primary outcome, and confidence limits for D that specify an interval within which the true difference probably lies. The trial’s results are then interpreted according to where the confidence interval for D falls in relation to Δ.

Example 13-11. In an influenza pandemic, health care workers would be likely to be at high risk for infection. If no specific vaccine were available, other protective measures might be needed, such as face masks. Type N95 respirators are designed to filter out 95% of airborne particulate matter, but they are bulkier, costlier, and less widely available than standard surgical masks. Surgical masks offer some protection against airborne respiratory droplets, but they do not filter out smaller particles, and they allow more leakage around the mask edges.

Loeb et al. (2009) conducted a randomized noninferiority trial of surgical masks vs. fit-tested N95 respirators among 426 nurses at eight Ontario hospitals during the 2008–2009 influenza season. The investigators decided in advance on clinical grounds that surgical masks would be judged noninferior if the upper 95% confidence limit for the difference in cumulative incidence of laboratory-confirmed influenza infection (surgical mask minus respirator) was below 9 per 100.

At trial’s end, the cumulative incidence of influenza infection in the surgical mask group was 23.6 per 100, compared to 22.9 in the N95 respirator group, for a risk difference of +0.7 (95% CI: −7.3, +8.8) per 100. The upper 95% confidence limit for the risk difference was less than the pre-specified threshold of 9 per 100, so surgical masks were judged to be noninferior to N95 respirators for this purpose.

Although a value for Δ‎ is needed to plan a noninferiority trial, the results are arguably best reported in terms of an estimate of effect (e.g., the risk difference D) and confidence limits for that estimate, rather than simply as a binary verdict about noninferiority. Doing so allows users to interpret the results in relation to their own noninferiority threshold, which may differ from that of the trial planners (Mulla et al., 2012).
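The interpretation rule can be written down compactly. The sketch below is a hypothetical helper, not a published procedure: it takes the confidence limits for the risk difference D (new minus standard, for an adverse outcome) and a threshold Δ, and reports where the interval falls. The three verdict labels are a simplification introduced here for illustration.

```python
def noninferiority_verdict(ci_lower, ci_upper, delta):
    """Classify a result from the confidence interval for the risk
    difference D (new minus standard) and the noninferiority threshold."""
    if ci_upper < delta:
        return "noninferior"      # entire interval lies below the threshold
    if ci_lower >= delta:
        return "inferior"         # entire interval lies at or above it
    return "inconclusive"         # interval straddles the threshold


# Reported limits from Loeb et al. (2009), per 100 nurses.
ci = (-7.3, 8.8)

verdict_trial_threshold = noninferiority_verdict(*ci, delta=9)
verdict_stricter_reader = noninferiority_verdict(*ci, delta=5)  # hypothetical

print("Trial planners' threshold of 9 per 100:", verdict_trial_threshold)
print("A reader's stricter threshold of 5 per 100:", verdict_stricter_reader)
```

The same confidence interval supports a verdict of noninferiority under the planners' threshold but not under the stricter hypothetical one, which is why reporting the estimate and its limits is more informative than reporting the verdict alone.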

Noninferiority trials differ from superiority trials in other ways as well. Concepts in the standard statistical-hypothesis testing framework have different meanings for noninferiority trials than for superiority trials, as summarized in Table 13.6. These differences call for different methods to estimate the required sample size (Julious and Owen, 2011). Also, non-compliance or crossovers, which usually attenuate treatment-group differences in an intent-to-treat analysis, will tend to bias results toward finding the new treatment to be noninferior. Accordingly, proposed extensions to the CONSORT guidelines for noninferiority trials recommend that results of both intent-to-treat and as-treated analyses be presented (Piaggio et al., 2006).

Table 13.6. Differences in the Meaning of Concepts Involved in Statistical Hypothesis Testing between Superiority and Noninferiority Trials

Notation:

| Symbol | Denotes |
|---|---|
| $R_s$ | True cumulative incidence of adverse outcome with standard treatment |
| $R_n$ | True cumulative incidence of adverse outcome with new treatment |
| $\Delta$ | Smallest clinically important difference in cumulative incidence, expressed as a positive quantity |

| | Superiority trial | Noninferiority trial |
|---|---|---|
| Null hypothesis | $R_s - R_n = 0$ | $R_n - R_s \ge \Delta$ |
| Alternative hypothesis | $\lvert R_s - R_n \rvert \ge \Delta$ ᵃ | $R_n - R_s < \Delta$ |
| Type I error | Conclude that one treatment is superior when in fact they are equally effective | Conclude that new treatment is noninferior when in fact it is inferior |
| Type II error | Conclude that treatments are equally effective when in fact one of them is superior | Conclude that new treatment is inferior when in fact it is not |

ᵃ 2-sided

Equivalence trials are related to noninferiority trials but seek to test the hypothesis that the risk difference (or other effect measure) falls between −Δ and +Δ: in other words, that neither treatment is superior to the other by more than a prespecified Δ (Christensen, 2007).

### Sequential Trials

For both ethical and economic reasons, it can be desirable to end a trial early if the accumulated evidence shows that one treatment strategy is clearly superior. The CAST study was stopped early when it became clear that encainide and flecainide had unexpectedly increased cardiovascular mortality. Alternatively, the accumulated evidence may make it very unlikely that continuing the trial could reveal any meaningful difference in outcomes between treatment arms. Trials that involve continuous or periodic comparisons of outcomes as the study proceeds are termed sequential trials.

Example 13-12. Atherosclerotic narrowing of a major intracranial artery is a known strong risk factor for stroke. Angioplasty with stent placement became widely used in an attempt to remove stenosis and maintain blood flow, despite limited evidence on effectiveness of this procedure compared to medical therapy. Chimowitz et al. (2011) conducted a randomized trial of aggressive drug therapy, with or without angioplasty and stenting, in patients who had 70–99% occlusion of a major intracranial artery. As subject recruitment continued over a period of several years, the trial’s Data Safety and Monitoring Board reviewed interim results every six months. The initial target sample size had been 764 patients, but after 451 patients were randomized, enrollment was stopped because the 30-day cumulative incidence of stroke or death was 14.7% in the stented group vs. 5.8% in the non-stented group (p = 0.002). Medication therapy alone remained superior in continued follow-up to one year after randomization.

The design and analysis of sequential trials must account for multiple “looks” at the results, which pose another kind of multiple-comparisons problem. Without special measures, repeated statistical testing would inflate the overall probability of a Type I error beyond the desired level. Fortunately, biostatistical methods are available to deal with this problem (Whitehead, 1999; Todd, 2007). It is possible technically to reanalyze the data each time a new outcome event has occurred. But in practice it is usually more feasible to plan a few interim analyses at regular intervals for review by a data monitoring committee (Ellenberg et al., 2002; Slutsky and Lavery, 2004).
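The inflation of Type I error from repeated looks can be demonstrated by simulation. The sketch below is a deliberately simplified model and not one of the cited biostatistical methods: each simulated trial accumulates standard normal observations with a true mean of zero (so any “significant” result is a false positive), and an unadjusted two-sided test at α = 0.05 is applied at each look. The function name, number of looks, and sample sizes are all hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks, n_per_look, n_sims=2000, alpha=0.05, seed=1):
    """Fraction of simulated null trials declared 'significant' at any look
    when the same unadjusted critical value is reused at every look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    stopped = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += sum(rng.gauss(0.0, 1.0) for _ in range(n_per_look))
            n += n_per_look
            z = total / n ** 0.5      # z-statistic for a true mean of zero
            if abs(z) > z_crit:
                stopped += 1
                break                 # trial "stopped early" on a false signal
    return stopped / n_sims


# Same total sample size (100), analyzed once vs. at five interim looks.
single_look = false_positive_rate(n_looks=1, n_per_look=100)
five_looks = false_positive_rate(n_looks=5, n_per_look=20)

print(f"One final analysis:   {single_look:.3f}")
print(f"Five unadjusted looks: {five_looks:.3f}")
```

With a single analysis the false-positive fraction should be close to the nominal 5%, whereas five unadjusted looks push it well above that level; formal sequential methods choose stricter critical values at each look to hold the overall error rate at the desired level.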

### Factorial Trials

A factorial randomized trial involves one treatment group for every possible combination of two or more interventions.

Example 13-13. Adenomas of the colon are benign growths that can evolve over time into malignancy. When adenomas are detected by colonoscopy or X-ray, they are normally removed, but persons who have had an initial adenoma are at high risk for later recurrence. Results from observational studies suggested that people who take aspirin may be less likely to develop adenomas, and likewise for people with relatively high dietary intake of folate.

To determine whether aspirin use and/or folate supplements could prevent recurrent adenomas, Logan et al. (2008) conducted a trial in which consenting patients who had just had an adenoma removed were randomized to one of four possible combinations of aspirin (300 mg daily) and/or folate (0.5 mg daily):

Table 13.7.

| Group | Drug(s) assigned |
|---|---|
| A | Aspirin + folate |
| B | Aspirin only |
| C | Folate only |
| D | Neither |

To permit blinding, two kinds of placebos were used, one resembling aspirin and the other resembling folate. Depending on his/her treatment group assignment, each participant was thus instructed to take two pills daily: (1) either aspirin or an aspirin-like placebo, and (2) either folate or a folate-like placebo.

Over the subsequent three years, recurrent adenomas were detected as shown in Tables 13.8 and 13.9. Based on these results, several comparisons of interest are possible:

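Table 13.8.

| Group | Active drug(s) assigned | No. of patients examined | No. with recurrent adenoma | Cumulative incidence |
|---|---|---|---|---|
| A | Aspirin + folate | 217 | 50 | 23.0% |
| B | Aspirin only | 217 | 49 | 22.6% |
| C | Folate only | 215 | 65 | 30.2% |
| D | Neither | 204 | 56 | 27.5% |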

Table 13.9.

| Comparison | Groups compared | RR | (95% CI) |
|---|---|---|---|
| (Aspirin + folate) vs. nothing | A vs. D | 0.84 | (0.60–1.17) |
| Aspirin alone vs. nothing | B vs. D | 0.82 | (0.59–1.15) |
| Folate alone vs. nothing | C vs. D | 1.10 | (0.81–1.49) |
| (Aspirin + folate) vs. folate alone | A vs. C | 0.76 | (0.56–1.05) |
| (Aspirin + folate) vs. aspirin alone | A vs. B | 1.02 | (0.72–1.44) |
| Aspirin vs. no aspirin | (A + B) vs. (C + D) | 0.79 | (0.63–0.99) |
| Folate vs. no folate | (A + C) vs. (B + D) | 1.07 | (0.85–1.34) |

The first three comparisons suggest that aspirin with folate, and aspirin alone, may have modestly reduced the risk of recurrent adenoma, while folate alone may have slightly increased the risk. However, confidence limits around all three RR estimates are fairly wide and include 1.00, so the results based on these comparisons are inconclusive.

The next two comparisons suggest that aspirin, when added to folate, may have reduced the risk by about 24%. Folate, when added to aspirin, had virtually no effect. But again, the confidence limits for both RR estimates are wide enough to include 1.00, so again these results are inconclusive.

These first five comparisons all involved pairwise contrasts between just two of the four groups, so each comparison used data on only half of all trial participants. The last two comparisons use data on all trial participants and take advantage of the fact that the factorial design prevents confounding of one drug’s effect by that of the other drug. Among aspirin users (groups A + B), half took folate (group A) while the rest did not (group B). Likewise, among aspirin non-users (groups C + D), half took folate (group C), while the rest did not (group D). (The exact percentages of folate users differ slightly from 50% because the percentage of participants who underwent follow-up colonoscopy varied slightly among the treatment groups.) Thus, the comparison of aspirin users vs. aspirin non-users cannot be confounded by folate use. By similar logic, the comparison of folate users vs. folate non-users cannot be confounded by aspirin use.

These last two comparisons estimate the overall effects of aspirin and of folate. The confidence limits for the RR comparing aspirin vs. no aspirin exclude 1.00, reflecting the added precision gained from being able to use data on the full study population. These comparisons presuppose negligible effect modification: that is, aspirin and folate are assumed to be neither synergistic nor antagonistic. Statistically, this assumption can be checked with a significance test for an aspirin × folate interaction effect. Here, that interaction test yields a p-value of 0.745, which is compatible with no true interaction and justifies estimating a single overall effect for each drug.
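The pooled comparisons can be checked directly from the group counts in Table 13.8. The sketch below assumes a crude risk ratio with a large-sample confidence interval computed on the log scale; the text does not state which interval method the trial report used, so this choice, along with the function and variable names, is an assumption made for illustration.

```python
import math

# (number with recurrent adenoma, number examined) for each group, from Table 13.8
counts = {"A": (50, 217), "B": (49, 217), "C": (65, 215), "D": (56, 204)}


def pooled(*groups):
    """Sum events and patients over one or more treatment groups."""
    events = sum(counts[g][0] for g in groups)
    patients = sum(counts[g][1] for g in groups)
    return events, patients


def risk_ratio(events_1, n_1, events_0, n_0, z=1.96):
    """Crude risk ratio with a log-scale large-sample 95% confidence interval."""
    rr = (events_1 / n_1) / (events_0 / n_0)
    se_log_rr = math.sqrt(1 / events_1 - 1 / n_1 + 1 / events_0 - 1 / n_0)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Aspirin vs. no aspirin: (A + B) vs. (C + D)
aspirin = risk_ratio(*pooled("A", "B"), *pooled("C", "D"))
# Folate vs. no folate: (A + C) vs. (B + D)
folate = risk_ratio(*pooled("A", "C"), *pooled("B", "D"))

for label, (rr, lower, upper) in [("Aspirin", aspirin), ("Folate", folate)]:
    print(f"{label}: RR = {rr:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Under these assumptions the results agree with the last two rows of Table 13.9 to two decimal places. Pooling roughly doubles the number of patients in each comparison, which is why the aspirin interval is narrow enough to exclude 1.00.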

The factorial trial design offers two main advantages (McAlister et al., 2003; Green, 2002; Stampfer et al., 1985):

• If the effect of each intervention is independent of the other(s)—i.e., there is no synergy or antagonism—then the entire study population can be used to estimate and test the main effect of each intervention. This proved to be the case in the adenoma study, which was thus able to evaluate two interventions for little more than the cost of evaluating one.

• If the effect of each intervention is not independent of the other(s), the factorial design can detect and quantify the synergy or antagonism between them. This would not be possible if a separate study were done for each intervention, or if a three-arm trial (intervention #1 vs. intervention #2 vs. nothing) were conducted instead. The added information thus gained may be helpful in choosing a particularly effective combination of intervention approaches, or in raising cautions that one intervention’s effectiveness may be compromised among persons exposed to another.

The factorial design strategy can be extended to simultaneous study of three or more interventions. Cook et al. (2007) describe a 2 × 2 × 2 factorial trial of three chemopreventive agents against cardiovascular disease in women, and Day et al. (2002) evaluated three intervention approaches to prevention of falls in older adults using a 2 × 2 × 2 factorial design.

### Randomizing Within an Individual

In all of the trials considered so far, individual people were randomized to different treatments. But entities smaller in scale than a person can be randomized as well. Being able to make comparisons within the same people offers two main advantages. First, patient-level characteristics that may influence outcomes, such as overall disease severity and symptom perceptions, apply in common to both intervention and control conditions. Hence the potential for confounding by these characteristics is largely eliminated. Second, statistical power is usually enhanced because the variability in responses within the same individual tends to be smaller than the variability in responses between different individuals (Gore, 1981a). Accordingly, when these designs can be used, they often require fewer participants than a parallel-groups design in which each person receives only one treatment (Louis et al., 1984).

#### Randomizing Body Parts

When a treatment is expected to affect only a localized part of the body, other similar parts of the same person’s body can receive a different treatment and thus provide a matched control. One such trial sought to determine whether laser photocoagulation treatments could slow the progression of retinal complications of diabetes (Blankenship, 1979). For each eligible diabetic patient, one eye was chosen at random to be treated with laser photocoagulation, while the other eye served as a matched control.

#### Crossover Trials

In a crossover trial, each subject receives each of the alternative treatments at a different time. The order of exposure is randomized (Mills et al., 2009).

Example 13-14. A crossover trial was conducted to test whether taking the drug candesartan, an angiotensin II receptor blocker, could prevent migraine attacks in people prone to such headaches (Tronvik et al., 2003). Sixty patients who typically had two to six migraine attacks per month were randomized into two groups. One group took candesartan for 12 weeks, followed by a four-week “washout” period with no treatment, then took placebo for 12 more weeks. The other group took placebo for 12 weeks, then had a four-week “washout” period, then took candesartan for 12 weeks. The mean number of days with headache proved to be 18.5 during placebo periods vs. 13.6 during candesartan periods (p < 0.001), implying that the drug was effective for migraine prevention.

A crossover design is most suitable when the expected effects of treatment occur promptly after treatment is begun and taper off within a reasonable time period after it is stopped.

#### N-of-1 Trials

The so-called n-of-1 trial design extends the concept behind crossover trials even further, conducting a randomized trial on only a single individual (Guyatt et al., 1986, 1988; Gabler et al., 2011).

Example 13-15. The stimulant drug methylphenidate has been tried as empirical treatment for geriatric patients with severe depression or apathy. But uncertainty persists about the drug’s effectiveness, and patients have been observed to respond differently to it. Jansen et al. (2001) conducted five separate n-of-1 trials on five different older adults with depression or apathy. Each trial involved five one-week periods. Each period involved giving either methylphenidate or placebo on Monday and Tuesday, giving no drug on Wednesday, and then giving the opposite drug on Thursday and Friday. Randomization applied to each one-week period determined whether methylphenidate was given first or second during that period. A geriatrician who was blinded to each patient’s treatment status evaluated outcomes using standardized measures of depression, apathy, and possible drug side effects. Upon trial completion, three patients were judged to be significantly improved when taking methylphenidate; one was significantly worse when taking methylphenidate; and one patient was unable to complete the trial due to loss of speech, which made it impossible to assess key outcomes.

N-of-1 trials thus use randomization to guide choice of the best treatment approach for an individual patient. As with crossover studies, n-of-1 trials are most suitable when the therapy under investigation produces its effects promptly and when those effects wane fairly soon after the treatment is stopped.

### Randomizing Groups of Individuals

The design variations just discussed involved randomizing something within an individual. Another variation involves randomizing aggregates of individuals. As discussed in Chapter 16, a group-randomized trial can also be considered as a type of ecological study.

Example 13-16. The overall public health impact of vaccination against many infectious diseases is believed to depend in part on herd immunity: people who have not been vaccinated themselves may benefit nonetheless by being surrounded by other vaccinated people who cannot serve as a source of infection. One strategy for prioritizing influenza vaccination has been to target schoolchildren and adolescents. Young people spend much of their time in schools where the virus can spread easily, and they can then infect adults at home and elsewhere. To test this vaccination strategy, 49 Hutterite colonies in three Canadian provinces agreed to be randomized to have their children receive either influenza vaccine or hepatitis A vaccine, which served as a control (Loeb et al., 2010). Each Hutterite colony was a close-knit, relatively isolated community of Anabaptist families, typically with a population of 60–120 people. All eligible children within a given colony received the same vaccine. Over a six-month follow-up period that included a flu season, the results summarized in Table 13.10 were obtained. The incidence of influenza was reduced by more than half among both influenza vaccine recipients and non-recipients living in the same communities, demonstrating herd immunity in action.

Table 13.10. Selected Results of a Group-Randomized Trial of Influenza Vaccination in Canadian Hutterite Colonies

| Group | Influenza cases | Person-days | Incidence rateᵃ | Vaccine effectivenessᵇ (est.) | (95% CI) |
|---|---|---|---|---|---|
| Vaccine recipients | | | | 55% | (−21%, 84%) |
| Influenza vaccine arm | 41 | 70,377 | 5.8 | | |
| Control arm | 79 | 58,954 | 13.4 | | |
| Vaccine non-recipients | | | | 61% | (8%, 83%) |
| Influenza vaccine arm | 39 | 182,866 | 2.1 | | |
| Control arm | 80 | 151,902 | 5.3 | | |
| Everyone | | | | 59% | (5%, 82%) |
| Influenza vaccine arm | 80 | 253,243 | 3.2 | | |
| Control arm | 159 | 210,856 | 7.5 | | |

ᵃ Influenza cases per 10,000 person-days

ᵇ From proportional-hazards regression, accounting for clustering within colonies

(Based on data from Loeb et al. [2010])
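The incidence rates in Table 13.10 follow directly from the case counts and person-days. The sketch below recomputes them and also derives a crude effectiveness estimate as one minus the rate ratio. That crude figure is an illustrative assumption only: the published estimates come from proportional-hazards regression that accounts for clustering within colonies, so the crude values are expected to differ somewhat from the table. All names in the code are hypothetical.

```python
# (influenza cases, person-days) by subgroup and trial arm, from Table 13.10
table_13_10 = {
    "Vaccine recipients": {"vaccine": (41, 70_377), "control": (79, 58_954)},
    "Vaccine non-recipients": {"vaccine": (39, 182_866), "control": (80, 151_902)},
    "Everyone": {"vaccine": (80, 253_243), "control": (159, 210_856)},
}


def rate_per_10k(cases, person_days):
    """Incidence rate expressed as cases per 10,000 person-days."""
    return cases / person_days * 10_000


rates = {}
for subgroup, arms in table_13_10.items():
    vaccine_rate = rate_per_10k(*arms["vaccine"])
    control_rate = rate_per_10k(*arms["control"])
    rates[subgroup] = (vaccine_rate, control_rate)
    # Crude effectiveness ignores clustering and is NOT the published estimate.
    crude_effectiveness = 1 - vaccine_rate / control_rate
    print(f"{subgroup}: {vaccine_rate:.1f} vs. {control_rate:.1f} per 10,000 "
          f"person-days; crude effectiveness {crude_effectiveness:.0%}")
```

The crude calculation yields point estimates only. Valid confidence limits require an analysis that treats the colony, not the person, as the unit of randomization.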

A group-randomized design is worth considering when:

• By its nature, the experimental intervention applies non-selectively to an entire group. For example, fluoridation of a community’s water supply or broadcasting health promotion messages via the mass media would affect nearly everyone in a community, so randomizing individuals within those communities to receive or not to receive such interventions would be difficult or impossible.

• Intervention effects are thought to be transmissible from person to person. The Hutterite colony influenza vaccination trial provided one such example, but non-biological attributes such as attitudes, norms, or behaviors can also be viewed as contagious within a social group. For example, Peterson et al. (2000) randomized 40 school districts either to implement a smoking-prevention intervention or to serve as controls. The intervention sought to change norms about the desirability of smoking, so that schoolchildren would reinforce each other’s decisions not to smoke. The intervention mechanism itself thus depended on interactions between individuals within a school. In other contexts, transmissibility of intervention effects may instead be a potential source of unwanted contamination of controls if individuals were randomized (Torgerson, 2001). That contamination may be largely avoidable if groups that are not in close communication with each other are randomized instead.

Study planning and data analysis for a group-randomized trial are typically more complex than when individuals are randomized (Murray, 1998; Donner and Klar, 2000; Atienza and King, 2002; Bauman and Koepsell, 2006). Because outcome measurements on people within the same group are likely to be correlated, analyzing the data as if individuals had been randomized tends to exaggerate statistical significance (Eldridge et al., 2004; Donner et al., 1981; Koepsell et al., 1991). Avoiding this non-conservative bias requires taking both individual-level and group-level random variation within treatment arm into account (Murray, 1998; Donner and Klar, 2000; Murray et al., 2008). The total number of individuals studied to achieve a certain level of statistical power must usually be greater in a group-randomized study than in an individual-randomized study of the same topic.
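The sample-size penalty for group randomization can be illustrated with a widely used approximation from the cluster-randomization literature, the design effect 1 + (m − 1)ρ, where m is the number of individuals per group and ρ is the intraclass correlation. The sketch below assumes equal group sizes, and the numerical inputs are hypothetical values chosen for illustration, not figures from any trial discussed here.

```python
import math


def design_effect(cluster_size, icc):
    """Variance inflation factor for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(n_individual, cluster_size, icc):
    """Total individuals needed when groups, not individuals, are randomized."""
    return math.ceil(n_individual * design_effect(cluster_size, icc))


# Hypothetical inputs: 400 subjects would suffice under individual randomization,
# groups average 90 members, and the intraclass correlation is 0.02.
deff = design_effect(cluster_size=90, icc=0.02)
n_needed = clustered_sample_size(n_individual=400, cluster_size=90, icc=0.02)
print(f"Design effect: {deff:.2f}; individuals needed: {n_needed}")
```

Even a small intraclass correlation can inflate the required sample size substantially when groups are large, because the correlation is multiplied by nearly the full group size.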

## Conclusion

Most of the time, epidemiologists must rely on non-randomized study designs. These methods provide us with many ways to detect and to quantify associations, and there is broad consensus (albeit not unanimity) about how evidence from observational studies can be interpreted to support or refute causal inferences. Nonetheless, a relationship observed in a randomized trial provides perhaps the sturdiest bridge we have from association to causation. Historically, results from a randomized trial have not uncommonly contradicted a theory that had been built on multiple prior observational studies. These surprises ought to keep us vigilant and humble when we seek to interpret the results of observational studies. They also show why randomized trials deserve a special place in our set of research tools and why well-trained epidemiologists need to know when and how to use them.

## Appendix 13A: Algorithms for Randomization by Computer

Four methods of randomization are described and illustrated below. For each method, the starting point is a list of subject identifiers (ID). If all subjects are known in advance, each subject’s ID may be his/her position on a master list of subjects (which can be in any order). If instead a randomized assignment list is to be created before any subjects are enrolled, the starting IDs may simply be a list of consecutive integers that will refer to subjects’ order of entry into the trial. Here, each method is illustrated for the first eight subjects.

All methods involve pairing each subject ID with a random number drawn from a uniform distribution between 0 and 1. All widely used statistical packages and some spreadsheet programs provide a function to generate such random numbers. Below, the variable containing these random values is labelled RN.

For illustration, assume that the design involves two groups, Experimental and Control, of approximately equal size. With minor changes, these allocation methods could be used to create two groups of different sizes, or to allocate subjects among three or more groups.

### Simple Randomization

A suitable way to carry out simple randomization is: if RN > 0.5, assign the subject to Experimental; otherwise, assign the subject to Control. For example:

Table 13.11.

| ID | RN | Assignment |
|---|---|---|
| 1 | 0.81422 | Experimental |
| 2 | 0.90634 | Experimental |
| 3 | 0.32979 | Control |
| 4 | 0.05449 | Control |
| 5 | 0.32959 | Control |
| 6 | 0.06776 | Control |
| 7 | 0.72420 | Experimental |
| 8 | 0.29415 | Control |
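The simple randomization rule can be sketched in a few lines. The example below reuses the RN values from Table 13.11 so the result can be compared with the table; the function name is hypothetical, and in practice the RN values could be drawn with a generator such as Python's `random.random()`, which returns uniform values in [0, 1).

```python
import random


def simple_randomization(rn_values):
    """Assign Experimental when RN > 0.5, otherwise Control."""
    return ["Experimental" if rn > 0.5 else "Control" for rn in rn_values]


# RN values copied from Table 13.11.
rn = [0.81422, 0.90634, 0.32979, 0.05449, 0.32959, 0.06776, 0.72420, 0.29415]
assignments = simple_randomization(rn)
for subject_id, (value, arm) in enumerate(zip(rn, assignments), start=1):
    print(subject_id, f"{value:.5f}", arm)

# For a new trial, fresh RN values would be generated instead:
fresh_assignments = simple_randomization([random.random() for _ in range(8)])
```

Nothing in this rule forces the two groups to be equal in size: here three subjects are assigned to Experimental and five to Control.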

### Block Randomization

For block size = 2, a suitable way to carry out block randomization is: whichever subject within a given block has the larger value of RN is assigned to Experimental, and the other block member is assigned to Control. For example:

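Table 13.12.

| ID | Block | RN | Assignment |
|---|---|---|---|
| 1 | 1 | 0.81422 | Control |
| 2 | 1 | 0.90634 | Experimental |
| 3 | 2 | 0.32979 | Experimental |
| 4 | 2 | 0.05449 | Control |
| 5 | 3 | 0.32959 | Experimental |
| 6 | 3 | 0.06776 | Control |
| 7 | 4 | 0.72420 | Experimental |
| 8 | 4 | 0.29415 | Control |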

For a larger block size k (with k even), this rule generalizes: assign the k/2 subjects within each block who have the largest values of RN to Experimental, and assign the remaining block members to Control.
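Block randomization for any even block size can be sketched as follows. The function name is hypothetical, blocks are assumed to consist of consecutive subject IDs, and the RN values are those used in the tables of this appendix.

```python
def block_randomization(rn_values, block_size=2):
    """Within each block, the block_size/2 largest RN values get Experimental."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equal-sized groups")
    assignments = []
    for start in range(0, len(rn_values), block_size):
        block = rn_values[start:start + block_size]
        # Positions within the block, ordered from largest RN to smallest.
        ranked = sorted(range(len(block)), key=lambda i: block[i], reverse=True)
        experimental = set(ranked[:block_size // 2])
        # Note: a final incomplete block may not split evenly.
        assignments.extend(
            "Experimental" if i in experimental else "Control"
            for i in range(len(block))
        )
    return assignments


rn = [0.81422, 0.90634, 0.32979, 0.05449, 0.32959, 0.06776, 0.72420, 0.29415]
blocks_of_two = block_randomization(rn, block_size=2)
blocks_of_four = block_randomization(rn, block_size=4)
print(blocks_of_two)
print(blocks_of_four)
```

With a block size of 2 the output matches the block randomization table above. With a block size of 4, group sizes are guaranteed equal only after every fourth subject, but the sequence of assignments is harder to predict.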

### Single-Block Randomization

If it is known that there will be k subjects in all, the above block-randomization method can be applied to all k subjects treated as a single block. The k/2 subjects who have the largest values of RN are assigned to Experimental and the remaining subjects to Control. For example, if k = 8:

Table 13.13.

| ID | RN | Assignment |
|---|---|---|
| 1 | 0.81422 | Experimental |
| 2 | 0.90634 | Experimental |
| 3 | 0.32979 | Experimental |
| 4 | 0.05449 | Control |
| 5 | 0.32959 | Control |
| 6 | 0.06776 | Control |
| 7 | 0.72420 | Experimental |
| 8 | 0.29415 | Control |
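Single-block randomization amounts to ranking all subjects by RN and assigning the top half to Experimental. A minimal sketch, with a hypothetical function name and the RN values from Table 13.13:

```python
def single_block_randomization(rn_values):
    """Assign the k/2 subjects with the largest RN to Experimental, the rest to Control."""
    k = len(rn_values)
    # Zero-based positions of subjects, ordered from largest RN to smallest.
    ranked = sorted(range(k), key=lambda i: rn_values[i], reverse=True)
    experimental = set(ranked[:k // 2])
    return ["Experimental" if i in experimental else "Control" for i in range(k)]


rn = [0.81422, 0.90634, 0.32979, 0.05449, 0.32959, 0.06776, 0.72420, 0.29415]
assignments = single_block_randomization(rn)
experimental_ids = [i + 1 for i, arm in enumerate(assignments) if arm == "Experimental"]
print(experimental_ids)  # subject IDs assigned to Experimental
```

This approach guarantees exactly equal group sizes when k is even, but it requires knowing the total number of subjects before assignments are made.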

### Stratified Randomization

Randomization is carried out separately within each stratum, using block randomization (or some other form of restricted randomization) to achieve equal or nearly equal Experimental and Control group sizes within the stratum. For example, if gender is the stratification factor and block size = 2:

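Table 13.14.

| Stratum | ID (within stratum) | Block (within stratum) | RN | Assignment |
|---|---|---|---|---|
| Males | 1 | 1 | 0.81422 | Control |
| Males | 2 | 1 | 0.90634 | Experimental |
| Males | 3 | 2 | 0.32979 | Experimental |
| Males | 4 | 2 | 0.05449 | Control |
| Females | 1 | 1 | 0.32959 | Experimental |
| Females | 2 | 1 | 0.06776 | Control |
| Females | 3 | 2 | 0.72420 | Experimental |
| Females | 4 | 2 | 0.29415 | Control |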
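Stratified randomization can be sketched by applying block randomization separately to each stratum's list of RN values. The function names and the dictionary layout are illustrative assumptions; the RN values are those in Table 13.14.

```python
def assign_within_blocks(rn_values, block_size=2):
    """Block randomization: the block_size/2 largest RN values in each block get Experimental."""
    assignments = []
    for start in range(0, len(rn_values), block_size):
        block = rn_values[start:start + block_size]
        ranked = sorted(range(len(block)), key=lambda i: block[i], reverse=True)
        experimental = set(ranked[:block_size // 2])
        assignments.extend(
            "Experimental" if i in experimental else "Control"
            for i in range(len(block))
        )
    return assignments


def stratified_randomization(rn_by_stratum, block_size=2):
    """Randomize separately within each stratum, using blocks within the stratum."""
    return {
        stratum: assign_within_blocks(rn_values, block_size)
        for stratum, rn_values in rn_by_stratum.items()
    }


rn_by_stratum = {
    "Males": [0.81422, 0.90634, 0.32979, 0.05449],
    "Females": [0.32959, 0.06776, 0.72420, 0.29415],
}
stratified = stratified_randomization(rn_by_stratum)
for stratum, arms in stratified.items():
    print(stratum, arms)
```

Because blocks are formed within each stratum, the Experimental and Control groups end up balanced on the stratification factor as well as on overall size.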

## References

ALLHAT Collaborative Research Group. Major cardiovascular events in hypertensive patients randomized to doxazosin vs chlorthalidone: the antihypertensive and lipid-lowering treatment to prevent heart attack trial (ALLHAT). JAMA 2000; 283:1967–1975.

Altman DG, Dore CJ. Randomisation and baseline comparisons in clinical trials. Lancet 1990; 335:149–153.

Anderson GL, Limacher M, Assaf AR, Bassford T, Beresford SAA, Black H, et al. Effects of conjugated equine estrogen in postmenopausal women with hysterectomy: the Women’s Health Initiative randomized controlled trial. JAMA 2004; 291:1701–1712.

Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet 2000; 355:1064–1069.

Atienza AA, King AC. Community-based health intervention trials: an overview of methodological issues. Epidemiol Rev 2002; 24:72–79.

Avorn J. Approval of a tuberculosis drug based on a paradoxical surrogate measure. JAMA 2013; 309:1349–1350.

Barraclough H, Govindan R. Biostatistics primer: what a clinician ought to know: subgroup analyses. J Thorac Oncol 2010; 5:741–746.

Bauman A, Koepsell TD. Epidemiologic issues in community interventions. Chapter 6 in Brownson RC, Petitti DB (eds.), Applied epidemiology (2nd ed.). New York: Oxford University Press, 2006.

Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA 1996; 276:637–639.

Bender R, Lange S. Adjusting for multiple testing—when and how? J Clin Epidemiol 2001; 54:343–349.

Bendsen NT, Hother AL, Jensen SK, Lorenzen JK, Astrup A. Effect of dairy calcium on fecal fat excretion: a randomized crossover trial. Int J Obes (Lond) 2008; 32:1816–1824.

Blankenship GW. Diabetic macular edema and argon laser photocoagulation: a prospective randomized study. Ophthalmology 1979; 86:69–78.

Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Epidemiol 2004; 57:229–236.

Bucher HC, Guyatt GH, Cook DJ, Holbrook A, McAlister FA. Users’ guides to the medical literature: XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. Evidence-Based Medicine Working Group. JAMA 1999; 282:771–778.

Byington RP, Curb JD, Mattson ME. Assessment of double-blindness at the conclusion of the β-Blocker Heart Attack Trial. JAMA 1985; 253:1733–1736.

Campbell MK, Snowdon C, Francis D, Elbourne D, McDonald AM, Knight R, et al. Recruitment to randomised trials: Strategies for Trial Enrollment and Participation Study. The STEPS study. Health Technol Assess 2007; 11:iii, ix–105.

Cardiac Arrhythmia Pilot Study Investigators. Effects of encainide, flecainide, imipramine and moricizine on ventricular arrhythmias during the year after acute myocardial infarction: the Cardiac Arrhythmia Pilot Study (CAPS). Am J Cardiol 1988; 61:501–509.

Carpenter WT, Sadler JH, Light PD, Hanlon TE, Kurland AA, Penna MW, et al. The therapeutic efficacy of hemodialysis in schizophrenia. N Engl J Med 1983; 308:669–675.

Chalmers I. Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments. Int J Epidemiol 2001; 30:1156–1164.

Chimowitz MI, Lynn MJ, Derdeyn CP, Turan TN, Fiorella D, Lane BF, et al. Stenting versus aggressive medical therapy for intracranial arterial stenosis. N Engl J Med 2011; 365:993–1003.

Christensen E. Methodology of superiority vs. equivalence trials and non-inferiority trials. J Hepatol 2007; 46:947–954.

Cobb LA, Thomas GI, Dillard DH, Merendino KA, Bruce RA. An evaluation of internal-mammary-artery ligation by a double-blind technic. N Engl J Med 1959; 260:1115–1118.

Cochrane AL. Effectiveness and efficiency: random reflections on health services. London: Nuffield Provincial Hospitals Trust, 1972.

Colagiuri B. Participant expectancies in double-blind randomized placebo-controlled trials: potential limitations to trial validity. Clin Trials 2010; 7:246–255.

CONSORT, 2013. Extensions of the CONSORT statement. http://www.consort-statement.org/extensions/.

Cook NR, Albert CM, Gaziano JM, Zaharris E, MacFadyen J, Danielson E, et al. A randomized factorial trial of vitamins C and E and beta carotene in the secondary prevention of cardiovascular events in women: results from the Women’s Antioxidant Cardiovascular Study. Arch Intern Med 2007; 167:1610–1618.

Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N Engl J Med 1980; 303:1038–1041.

Dawson L. The Salk Polio Vaccine Trial of 1954: risks, randomization and public involvement in research. Clin Trials 2004; 1:122–130.

Day L, Fildes B, Gordon I, Fitzharris M, Flamer H, Lord S. Randomised factorial trial of falls prevention among older people living in their own homes. BMJ 2002; 325:128–133.

DeAngelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, et al. Clinical trial registration: a statement from the International Committee of Medical Journal Editors. JAMA 2004; 292:1363–1364.

DeKosky ST, Williamson JD, Fitzpatrick AL, Kronmal RA, Ives DG, Saxton JA, et al. Ginkgo biloba for prevention of dementia: a randomized controlled trial. JAMA 2008; 300:2253–2262.

DeVita VT Jr, Chu E. A history of cancer chemotherapy. Cancer Res 2008; 68:8643–8653.

Donner A, Birkett N, Buck C. Randomisation by cluster: sample size requirements and analysis. Am J Epidemiol 1981; 114:906–914.

Donner A, Klar N. Design and analysis of cluster randomisation trials in health research. New York: Edward Arnold, 2000.

Echt DS, Liebson PR, Mitchell LB, Peters RW, Obias-Manno D, Barker AH, et al. Mortality and morbidity in patients receiving encainide, flecainide, or placebo. The Cardiac Arrhythmia Suppression Trial. N Engl J Med 1991; 324:781–788.

Efron B. Forcing a sequential experiment to be balanced. Biometrika 1971; 58:403–417.

Eldridge SM, Ashby D, Feder GS, Rudnicka AR, Ukoumunne OC. Lessons for cluster randomized trials in the twenty-first century: a systematic review of trials in primary care. Clin Trials 2004; 1:80–90.

Ellenberg SS, Fleming TR, DeMets DL. Data monitoring committees in clinical trials: a practical perspective. Hoboken, NJ: Wiley and Sons, 2002.

Fergusson D, Glass KC, Waring D, Shapiro S. Turning a blind eye: the success of blinding reported in a random sample of randomised, placebo controlled trials. BMJ 2004; 328:432.

Fisher RA. The design of experiments. London: Oliver and Boyd, 1935.

Fitzgibbons RJ Jr, Giobbie-Hurder A, Gibbs JO, Dunlop DD, Reda DJ, McCarthy M Jr, et al. Watchful waiting vs repair of inguinal hernia in minimally symptomatic men: a randomized clinical trial. JAMA 2006; 295:285–292.

Fleiss JL. The design and analysis of clinical experiments. New York: Wiley and Sons, 1986.

Fleming TR, Odem-Davis K, Rothmann MD, Li Shen Y. Some essential considerations in the design and conduct of non-inferiority trials. Clin Trials 2011; 8:432–439.

Ford JG, Howerton MW, Lai GY, Gary TL, Bolen S, Gibbons MC, et al. Barriers to recruiting underrepresented populations to cancer clinical trials: a systematic review. Cancer 2008; 112:228–242.

Freemantle N, Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA 2003; 289:2554–2559.

Friedrich MJ. The Cochrane Collaboration turns 20: assessing the evidence to inform clinical care. JAMA 2013; 309:1881–1882.

Gabler NB, Duan N, Vohra S, Kravitz RL. N-of-1 trials in the medical literature: a systematic review. Med Care 2011; 49:761–768.

Giuliano AE, Hunt KK, Ballman KV, Beitsch PD, Whitworth PW, Blumencranz PW, et al. Axillary dissection vs no axillary dissection in women with invasive breast cancer and sentinel node metastasis: a randomized clinical trial. JAMA 2011; 305:569–575.

Gore SM. Assessing clinical trials—design I. Br Med J 1981a; 282:1780–1781.

Gore SM. Assessing clinical trials—first steps. Br Med J 1981b; 282:1605–1607.

Gray RH, Kigozi G, Serwadda D, Makumbi F, Watya S, Nalugoda F, et al. Male circumcision for HIV prevention in men in Rakai, Uganda: a randomised trial. Lancet 2007; 369:657–666.

Green S. Design of randomized trials. Epidemiol Rev 2002; 24:4–11.

Grimes DA, Schulz KF. Surrogate end points in clinical research: hazardous to your health. Obstet Gynecol 2005; 105:1114–1118.

Guyatt G, Sackett D, Adachi J, Roberts R, Chong J, Rosenbloom D, et al. A clinician’s guide for conducting randomized trials in individual patients. CMAJ 1988; 139:497–503.

Guyatt G, Sackett D, Taylor DW, Chong J, Roberts R, Pugsley S. Determining optimal therapy—randomized trials in individual patients. N Engl J Med 1986; 314:889–892.

Head SJ, Kaul S, Tijssen JGP, Serruys PW, Kappetein AP. Subgroup analyses in trial reports comparing percutaneous coronary intervention with coronary artery bypass surgery. JAMA 2013; 310:2097–2098.

Hedden SL, Woolson RF, Malcolm RJ. Randomization in substance abuse clinical trials. Subst Abuse Treat Prev Policy 2006; 1:6.

Hewitt C, Hahn S, Torgerson DJ, Watson J, Bland JM. Adequacy and reporting of allocation concealment: review of recent trials published in four general medical journals. BMJ 2005; 330:1057–1058.

Hewitt CE, Torgerson DJ. Is restricted randomisation necessary? BMJ 2006; 332:1506–1508.

Heyse JF, Kuter BJ, Dallas MJ, Heaton P, REST Study Team. Evaluating the safety of a rotavirus vaccine: the REST of the story. Clin Trials 2008; 5:131–139.

Hooton TM, Roberts PL, Stapleton AE. Cefpodoxime vs ciprofloxacin for short-course treatment of acute uncomplicated cystitis: a randomized trial. JAMA 2012; 307:583–589.

Howard J, Whittemore AS, Hoover JJ, Panos M. How blind was the patient blind in AMIS? Clin Pharmacol Ther 1982; 32:543–553.

Hrøbjartsson A, Gøtzsche PC. Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. N Engl J Med 2001; 344:1594–1602.

Hulley S, Grady D, Bush T, Furberg C, Herrington D, Riggs B, et al. Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. Heart and Estrogen/progestin Replacement Study (HERS) Research Group. JAMA 1998; 280:605–613.

ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Lancet 1988; 2:349–360.

Jansen IHM, Olde Rikkert MGM, Hulsbos HAJ, Hoefnagels WHL. Toward individualized evidence-based medicine: five “n-of-1” trials of methylphenidate in geriatric patients. J Am Geriatr Soc 2001; 49:474–476.

Johnstone EC, Drakin JFW, Lawler P, Frith CD, Stevens M, McPherson K, et al. The Northwick Park electroconvulsive therapy trial. Lancet 1980; 2:1317–1320.

Julious SA, Owen RJ. A comparison of methods for sample size estimation for non-inferiority studies with binary outcomes. Stat Methods Med Res 2011; 20:595–612.

Junod, 2012. FDA and clinical drug trials: a short history. At http://www.fda.gov/AboutFDA/What-WeDo/History/Overviews/.

Kahn JP, Mastroianni AC, Sugarman J. Beyond consent: seeking justice in research. New York: Oxford, 1998.Find this resource:

Kannus P, Parkkari J, Niemi S, Paganen M, Palvanen M, Jarvinen M, et al. Prevention of hip fracture in elderly people with use of a hip protector. N Engl J Med 2000; 343:1506–1513.Find this resource:

Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI. Stratified randomization for clinical trials. J Clin Epidemiol 1999; 52:19–26.

Klein EJ, Shugerman RP, Leigh-Taylor K, Schneider C, Portscheller D, Koepsell T. Buffered lidocaine: analgesia for intravenous line placement in children. Pediatrics 1995; 95:709–712.

Knuth DE. The art of computer programming. Volume 2: Seminumerical algorithms (3rd ed.). Reading, MA: Addison-Wesley, 1997.

Kodish E, Lantos JD, Siegler M. Ethical considerations in randomized controlled clinical trials. Cancer 1990; 65:2400–2404.

Koepsell TD, Martin DC, Diehr PH, Psaty BM, Wagner EH, Perrin EB, et al. Data analysis and sample size issues in evaluations of community-based health promotion and disease prevention programs: a mixed-model analysis of variance approach. J Clin Epidemiol 1991; 44:701–713.

Koepsell TD, Zatzick DF, Rivara FP. Estimating the population impact of preventive interventions from randomized trials. Am J Prev Med 2011; 40:191–198.

Kraemer HC, Mintz J, Noda A, Tinklenberg J, Yesavage JA. Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry 2006; 63:484–489.

Kramer MS, Barr RG, Dagenais S, Yang H, Jones P, Ciofani L, et al. Pacifier use, early weaning, and cry/fuss behavior: a randomized controlled trial. JAMA 2001; 286:322–326.

Lai GY, Gary TL, Tilburt J, Bolen S, Baffi C, Wilson RF, et al. Effectiveness of strategies to recruit underrepresented populations into cancer clinical trials. Clin Trials 2006; 3:133–141.

Lamas GA, Goertz C, Boineau R, Mark DB, Rozema T, Nahin RL, et al. Effect of disodium EDTA chelation regimen on cardiovascular events in patients with previous myocardial infarction: the TACT randomized trial. JAMA 2013; 309:1241–1250.

Landorf KB, Keenan AM, Herbert RD. Effectiveness of foot orthoses to treat plantar fasciitis: a randomized trial. Arch Intern Med 2006; 166:1305–1310.

Lang JM. The use of a run-in to enhance compliance. Stat Med 1990; 9:87–95.

Lassere MN. The Biomarker-Surrogacy Evaluation Schema: a review of the biomarker-surrogate literature and a proposal for a criterion-based, quantitative, multidimensional hierarchical levels of evidence schema for evaluating the status of biomarkers as surrogate endpoints. Stat Methods Med Res 2008; 17:303–340.

Loeb M, Dafoe N, Mahony J, John M, Sarabia A, Glavin V, et al. Surgical mask vs N95 respirator for preventing influenza among health care workers: a randomized trial. JAMA 2009; 302:1865–1871.

Loeb M, Russell ML, Moss L, Fonseca K, Fox J, Earn DJD, et al. Effect of influenza vaccination of children on infection rates in Hutterite communities: a randomized trial. JAMA 2010; 303:943–950.

Logan RFA, Grainge MJ, Shepherd VC, Armitage NC, Muir KR, ukCAP Trial Group. Aspirin and folic acid for the prevention of recurrent colorectal adenomas. Gastroenterology 2008; 134:29–38.

Louis TA, Lavori PW, Bailar JC 3rd, Polansky M. Crossover and self-controlled designs in clinical research. N Engl J Med 1984; 310:24–31.

Lovato LC, Hill K, Hertert S, Hunninghake DB, Probstfield JL. Recruitment for controlled clinical trials: literature summary and annotated bibliography. Control Clin Trials 1997; 18:328–352.

Macknin ML, Mathew S, Medendorp SV. Effect of inhaling heated vapor on symptoms of the common cold. JAMA 1990; 264:989–991.

McAlister FA, Straus SE, Sackett DL, Altman DG. Analysis and reporting of factorial trials: a systematic review. JAMA 2003; 289:2545–2553.

Medical Research Council. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. Br Med J 1948; II:769–782.

Mills EJ, Chan AW, Wu P, Vail A, Guyatt GH, Altman DG. Design, analysis, and presentation of crossover trials. Trials 2009; 10:27.

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340:c869.

Moher D, Schulz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001; 285:1987–1991.

Moseley JB, O’Malley K, Petersen NJ, Menke TJ, Brody BA, Kuykendall DH, et al. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. N Engl J Med 2002; 347:81–88.

Mulla SM, Scott IA, Jackevicius CA, You JJ, Guyatt GH. How to use a noninferiority trial: users’ guides to the medical literature. JAMA 2012; 308:2605–2611.

Multiple Risk Factor Intervention Trial Research Group. Multiple risk factor intervention trial. Risk factor changes and mortality results. JAMA 1982; 248:1465–1477.

Multiple Risk Factor Intervention Trial Research Group. Mortality rates after 10.5 years for participants in the Multiple Risk Factor Intervention Trial. Findings related to a priori hypotheses of the trial. JAMA 1990; 263:1795–1801.

Murray DM. Design and analysis of group-randomized trials. New York: Oxford, 1998.

Murray DM, Pals SL, Blitstein JL, Alfano CM, Lehman J. Design and analysis of group-randomized trials in cancer: a review of current practices. J Natl Cancer Inst 2008; 100:483–491.

Nissen SE. Concerns about reliability in the Trial to Assess Chelation Therapy (TACT). JAMA 2013; 309:1293–1294.

Odgaard-Jensen J, Vist GE, Timmer A, Kunz R, Akl EA, Schünemann H, et al. Randomisation to protect against selection bias in healthcare trials. Cochrane Database Syst Rev 2011; :MR000012.

Omenn GS, Goodman GE, Thornquist MD, Balmes J, Cullen MR, Glass A, et al. Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N Engl J Med 1996; 334:1150–1155.

Oxman AD, Guyatt GH. A consumer’s guide to subgroup analyses. Ann Intern Med 1992; 116:78–84.

Pablos-Mendez A, Barr RG, Shea S. Run-in periods in randomized trials. Implications for the application of results in clinical practice. JAMA 1998; 279:222–225.

Peterson AV Jr, Kealey KA, Mann SL, Marek PM, Sarason IG. Hutchinson Smoking Prevention Project: long-term randomized trial in school-based tobacco use prevention—results on smoking. J Natl Cancer Inst 2000; 92:1979–1991.

Peto R, Gray R, Collins R, Wheatley K, Hennekens C, Jamrozik K, et al. Randomised trial of prophylactic daily aspirin in British male doctors. Br Med J 1988; 296:313–316.

Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJW, CONSORT Group. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA 2006; 295:1152–1160.

Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med 2002; 21:2917–2930.

Porta M (ed.). A dictionary of epidemiology (5th edition). New York: Oxford, 2008.

Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989; 8:431–440.

Psaty BM, Weiss NS, Furberg CD, Koepsell TD, Siscovick DS, Rosendaal FR, et al. Surrogate end points, health outcomes, and the drug-approval process for the treatment of risk factors for cardiovascular disease. JAMA 1999; 282:786–790.

Riggs BL, Hodgson SF, O’Fallon WM, Chao EY, Wahner HW, Muhs JM, et al. Effect of fluoride treatment on the fracture rate in postmenopausal women with osteoporosis. N Engl J Med 1990; 322:802–809.

Rosa L, Rosa E, Sarner L, Barrett S. A close look at therapeutic touch. JAMA 1998; 279:1005–1010.

Rothwell PM. External validity of randomised controlled trials: “To whom do the results of this trial apply?” Lancet 2005; 365:82–93.

Sackett DL, Gent M. Controversy in counting and attributing events in clinical trials. N Engl J Med 1979; 301:1410–1412.

Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol 1995; 142:904–908.

Schulz KF. Subverting randomization in controlled trials. JAMA 1995; 274:1456–1458.

Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomized trials. Ann Intern Med 2010; 152:726–732.

Schulz KF, Grimes DA. Allocation concealment in randomised trials: defending against deciphering. Lancet 2002a; 359:614–618.

Schulz KF, Grimes DA. Generation of allocation sequences in randomised trials: chance, not choice. Lancet 2002b; 359:515–519.

Schulz KF, Grimes DA. Multiplicity in randomised trials II: subgroup and interim analyses. Lancet 2005; 365:1657–1661.

Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutical trials. J Chron Dis 1967; 20:637–648.

Scott NW, McPherson GC, Ramsay CR, Campbell MK. The method of minimization for allocation to clinical trials: a review. Control Clin Trials 2002; 23:662–674.

Sexton M, Hebel JR. A clinical trial of change in maternal smoking and its effect on birth weight. JAMA 1984; 251:911–915.

Slutsky AS, Lavery JV. Data safety and monitoring boards. N Engl J Med 2004; 350:1143–1147.

Smith GC, Pell JP. Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials. BMJ 2003; 327:1459–1461.

Sommer A, Tarwotjo I, Djunaedi E, West KP Jr, Loeden AA, Tilden R, et al. Impact of vitamin A supplementation on childhood mortality. A randomised controlled community trial. Lancet 1986; 1:1169–1173.

Sommer A, Zeger SL. On estimating efficacy from clinical trials. Stat Med 1991; 10:45–52.

Stampfer MJ, Buring JE, Willett W, Rosner B, Eberlein K, Hennekens CH. The 2 × 2 factorial design: its application to a randomized trial of aspirin and carotene in U.S. physicians. Stat Med 1985; 4:111–116.

Sugarman J. Ethics in the design and conduct of clinical trials. Epidemiol Rev 2002; 24:54–58.

Swanson GM, Ward AJ. Recruiting minorities into clinical trials: toward a participant-friendly system. J Natl Cancer Inst 1995; 87:1747–1759.

Taylor KM, Margolese RG, Soskolne CL. Physicians’ reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. N Engl J Med 1984; 310:1363–1367.

Thornquist MD, Omenn GS, Goodman GE, Grizzle JE, Rosenstock L, Barnhart S, et al. Statistical design and monitoring of the Carotene and Retinol Efficacy Trial (CARET). Control Clin Trials 1993; 14:308–324.

Thorpe KE, Zwarenstein M, Oxman AD, Treweek S, Furberg CD, Altman DG, et al. A Pragmatic-Explanatory Continuum Indicator Summary (PRECIS): a tool to help trial designers. CMAJ 2009; 180:E47–E57.

Todd S. A 25-year review of sequential methodology in clinical studies. Stat Med 2007; 26:237–252.

Torgerson DJ. Contamination in trials: is cluster randomisation the answer? BMJ 2001; 322:355–357.

Tronvik E, Stovner LJ, Helde G, Sand T, Bovim G. Prophylactic treatment of migraine with an angiotensin II receptor blocker. JAMA 2003; 289:65–69.

Tunis SR, Stryer DB, Clancy CM. Practical clinical trials: increasing the value of clinical research for decision making in clinical and health policy. JAMA 2003; 290:1624–1632.

University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: I. Design, methods and baseline results. Diabetes 1970a; 19 (Suppl. 2):747–783.

University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: II. Mortality results. Diabetes 1970b; 19 (Suppl. 2):787–830.

US Department of Health and Human Services. 45 Code of Federal Regulations 46. Fed Reg 1991; 56:28012.

Van Spall HGC, Toren A, Kiss A, Fowler RA. Eligibility criteria of randomized controlled trials published in high-impact general medical journals: a systematic sampling review. JAMA 2007; 297:1233–1240.

Vesikari T, Matson DO, Dennehy P, Van Damme P, Santosham M, Rodriguez Z, et al. Safety and efficacy of a pentavalent human-bovine (WC3) reassortant rotavirus vaccine. N Engl J Med 2006; 354:23–33.

Veterans Administration Cooperative Study Group on Antihypertensive Agents. Effects of treatment on morbidity in hypertension. Results in patients with diastolic blood pressures averaging 115 through 129 mmHg. JAMA 1967; 202:116–122.

Vickers AJ, de Craen AJ. Why use placebos in clinical trials? A narrative review of the methodological literature. J Clin Epidemiol 2000; 53:157–161.

Walsh BT, Seidman SN, Sysko R, Gould M. Placebo response in studies of major depression: variable, substantial, and growing. JAMA 2002; 287:1840–1847.

Wei LJ, Lachin JM. Properties of the urn randomization in clinical trials. Control Clin Trials 1988; 9:345–364.

Weir CJ, Walley RJ. Statistical evaluation of biomarkers as surrogate endpoints: a literature review. Stat Med 2006; 25:183–203.

Weiss NS, Koepsell TD, Psaty BM. Generalizability of the results of randomized trials. Arch Intern Med 2008; 168:133–135.

Whitehead J. A unified theory for sequential clinical trials. Stat Med 1999; 18:2271–2286.

Winkelstein W Jr. The remarkable Archie: origins of the Cochrane Collaboration. Epidemiology 2009; 20:779.

Wittes J, Brittain E. The role of internal pilot studies in increasing the efficiency of clinical trials. Stat Med 1990; 9:65–72.

Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med 1984; 3:409–422.

Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA 1991; 266:93–98.

## Exercises

1. Internal mammary artery ligation enjoyed brief popularity in the 1950s for treatment of coronary artery disease until randomized trials showed the procedure to be of little or no benefit. A key trial by Cobb et al. (1959) involved randomizing twelve men and five women to either internal mammary artery ligation or sham surgery.

1. (a) According to the published report, all five of the women were randomly assigned to the ligated group. But the probability that all of the women would be assigned to the new procedure by chance alone is only about 0.03. Does this imply that the randomization scheme had been improperly carried out or subsequently subverted?

2. (b) Under what circumstances would the resulting imbalance in the sex composition of the two treatment groups bias the outcome of the study?

3. (c) Is there any way by which the investigators could have prevented the gender imbalance, yet still allocate subjects at random to the two groups? If so, how?

4. (d) During follow-up of 3–15 months, patients who received sham surgery tended to report marked improvement in their symptoms and reduced need for nitroglycerin tablets to control their angina attacks. Does this indicate that sham surgery was effective for relieving angina? Why or why not?

5. (e) Suppose you were a devoted advocate of the internal mammary artery ligation technique. Other than possible confounding due to sex or other factors, are there any arguments you would invoke as to why this study should not be considered proof that the technique is valueless?

2. Use of infant pacifiers has been found in several observational studies to be associated with early termination of breastfeeding. Partly on that basis, many pediatricians and some professional organizations have discouraged the use of pacifiers, even though many parents find them a convenient way to respond to a baby’s crying or fussing.

Kramer et al. (2001) sought to test whether pacifier use is actually a cause of early weaning. They randomized 281 mothers who planned to breastfeed their newborn baby to one of two types of counseling sessions. For intervention-group mothers, counseling included a recommendation that they avoid use of pacifiers, and education about other ways to comfort a crying or fussing baby. Control-group mothers received no such advice or education about pacifiers. Follow-up interviews were conducted by research staff who were kept unaware of each mother’s treatment group assignment.

Among intervention-group mothers, 38.6% reported that they totally avoided pacifier use, compared with 16.0% of control-group mothers (ratio 2.4, 95% CI: 1.5–3.8). Other interview data also showed less-frequent pacifier use among intervention-group mothers. But at three months postpartum, the percentage of mothers who had weaned their infant was nearly equal: 18.9% in the intervention group vs. 18.3% in the control group. The reported frequency of crying or fussing was also similar. The researchers concluded that there was no causal link between pacifier use and early weaning, and they recommended that organizations promoting breastfeeding re-examine their opposition to pacifiers.

1. (a) What features of the research problem made it amenable to study with a randomized trial design?

2. (b) The trial report describes the study as “double-blind.” Do you agree? Why or why not? Does it matter in this case?

3. (c) The trial report states: “we estimated that a reduction in daily pacifier use from 60% to 40% would reduce the risk of weaning before the age of 3 months from 40% to 35%. With an α‎ level of .05 and a β‎ of .10, approximately 140 infants were required per group.” It is not clear from this statement whether the target sample size was based on the hypothesized difference in daily pacifier use or on the hypothesized difference in early weaning. Can you determine which difference was used in the sample-size calculations?

4. (d) When the investigators analyzed outcome data from the trial according to actual use of pacifiers (without regard to randomized treatment-group assignments), they found that pacifier use was indeed associated with early weaning, just as previous observational studies had observed. But they argued that the lack of association between pacifier use and early weaning in the randomized-trial analysis “trumped” the positive association seen in the observational-study analysis. What key confounding factor(s) do you think could be controlled in the randomized-trial analysis but not in the observational-study analysis?

3. Cystic fibrosis is a chronic inherited disease that involves abnormally viscous respiratory secretions, which put affected persons at increased risk for lung infections and long-term decline in pulmonary function. Imagine that you are a co-investigator on a proposed trial of a newly developed drug that is supposed to liquefy respiratory secretions in cystic fibrosis patients, making the secretions easier to clear. The trial will be a two-arm, parallel-groups trial comparing the new drug with placebo.

Four cystic fibrosis specialty centers in different cities have contacted the patients for whom they currently provide ongoing care, in order to determine each patient’s eligibility and willingness to participate. Other clinical practices and outcomes vary among the centers, so the investigators want to allocate trial participants in such a way that care-center affiliation is guaranteed not to confound the main treatment comparison.

A total of 100 potential participants have been identified—10 at Center A, 20 at Center B, 30 at Center C, and 40 at Center D—which has been determined to be an adequate total sample size to test the main study hypothesis. Select an appropriate method of randomization, and generate the sequence of treatment group assignments using the method you have chosen.

4. The Heart and Estrogen/progestin Replacement Study (HERS) was a randomized trial that sought to confirm many prior observational studies showing lower risk of coronary heart disease (CHD) associated with use of estrogen supplements during menopause (Hulley et al., 1998). The investigators described it as a “secondary prevention trial” because participating women already had CHD, as evidenced by prior myocardial infarction (MI), past coronary artery bypass graft surgery or percutaneous coronary revascularization, or angiographic evidence of at least a 50% occlusion of one or more major coronary arteries. The main study hypothesis was that hormone supplements would prevent future CHD events, including MI and sudden cardiac death, in this high-risk cohort. However, the results appeared to be at odds with the observational evidence. The abstract from the main published report appears below:

Context. Observational studies have found lower rates of coronary heart disease (CHD) in postmenopausal women who take estrogen than in women who do not, but this potential benefit has not been confirmed in clinical trials.

Objective. To determine if estrogen plus progestin therapy alters the risk for CHD events in postmenopausal women with established coronary disease.

Design. Randomized, blinded, placebo-controlled secondary prevention trial.

Setting. Outpatient and community settings at 20 United States clinical centers.

Participants. A total of 2,763 women with coronary disease, younger than 80 years, and postmenopausal with an intact uterus. Mean age was 66.7 years.

Intervention. Either 0.625 mg. of conjugated equine estrogens plus 2.5 mg. of medroxyprogesterone acetate in 1 tablet daily (n = 1,380) or a placebo of identical appearance (n = 1,383). Follow-up averaged 4.1 years; 82% of those assigned to hormone treatment were taking it at the end of 1 year, and 75% at the end of 3 years.

Main Outcome Measures. The primary outcome was the occurrence of nonfatal myocardial infarction (MI) or CHD death. Secondary cardiovascular outcomes included coronary revascularization, unstable angina, congestive heart failure, resuscitated cardiac arrest, stroke or transient ischemic attack, and peripheral arterial disease. All-cause mortality was also considered.

Results. Overall, there were no significant differences between groups in the primary outcome or in any of the secondary cardiovascular outcomes: 172 women in the hormone group and 176 women in the placebo group had MI or CHD death (relative hazard [RH], 0.99; 95% confidence interval [CI], 0.80–1.22). The lack of an overall effect occurred despite a net 11% lower low-density lipoprotein cholesterol level and 10% higher high-density lipoprotein cholesterol level in the hormone group compared with the placebo group (each p < 0.001). Within the overall null effect, there was a statistically significant time trend, with more CHD events in the hormone group than in the placebo group in year 1 and fewer in years 4 and 5. More women in the hormone group than in the placebo group experienced venous thromboembolic events (34 vs. 12; RH, 2.89; 95% CI, 1.50–5.58) and gallbladder disease (84 vs. 62; RH, 1.38; 95% CI, 1.00–1.92). There were no significant differences in several other end points for which power was limited, including fracture, cancer, and total mortality (131 vs. 123 deaths; RH, 1.08; 95% CI, 0.84–1.38).

Conclusions. During an average follow-up of 4.1 years, treatment with oral conjugated equine estrogen plus medroxyprogesterone acetate did not reduce the overall rate of CHD events in postmenopausal women with established coronary disease. The treatment did increase the rate of thromboembolic events and gallbladder disease. Based on the finding of no overall cardiovascular benefit and a pattern of early increase in risk of CHD events, we do not recommend starting this treatment for the purpose of secondary prevention of CHD. However, given the favorable pattern of CHD events after several years of therapy, it could be appropriate for women already receiving this treatment to continue.

1. (a) The HERS trial’s primary aim was to test the efficacy of postmenopausal hormone therapy for preventing future CHD events in a high-risk cohort of women. But it also sought to evaluate the safety of this regimen in terms of the incidence of other non-cardiovascular health conditions, including several forms of cancer. In what way might a randomized trial of this sort be limited in its ability to establish safety in comparison to other epidemiologic study designs?

2. (b) The study’s criteria for eligibility included a long list of exclusions:

Women were excluded for the following reasons: CHD event within 6 months of randomization; serum triglyceride level higher than 3.39 mmol/L (300 mg/dL); use of oral, parenteral, vaginal, or transdermal sex hormones within 3 months of the screening visit; history of deep vein thrombosis or pulmonary embolism; history of breast cancer or breast examination or mammogram suggestive of breast cancer; history of endometrial cancer; abnormal uterine bleeding, endometrial hyperplasia, or endometrium thickness greater than 5 mm on baseline evaluation; abnormal or unobtainable Papanicolaou test result; serum aspartate aminotransferase level [a liver-function test] more than 1.2 times normal; unlikely to remain geographically accessible for study visits for at least 4 years; disease (other than CHD) judged likely to be fatal within 4 years; New York Heart Association class IV or severe class III congestive heart failure; alcoholism or other drug abuse; uncontrolled hypertension (diastolic blood pressure ≥105 mm Hg or systolic blood pressure ≥200 mm Hg); uncontrolled diabetes (fasting blood glucose level ≥16.7 mmol/L [300 mg/dL]); participation in another investigational drug or device study; less than 80% compliance with a placebo run-in prior to randomization; or history of intolerance to hormone therapy.

Identify at least two exclusions that were included chiefly to protect the safety of participants. Identify two others that were included chiefly to protect the internal validity of the trial.

3. (c) After the study was published, a critic contended that the results may have been biased because diagnostic tests were not performed on all women at the outset of the trial to ascertain the extent of coronary and other vascular disease in participants, in order to prove that the two treatment groups were similar on these factors. Would you agree that this was a serious flaw?

4. (d) Lipid-lowering drugs are often prescribed in order to reduce low-density lipoprotein (“bad”) cholesterol and to raise high-density lipoprotein (“good”) cholesterol. These drugs turned out to be prescribed more often during the course of the trial for placebo recipients than for estrogen/progestin recipients. Should this differential use of lipid-lowering drugs be considered a potential confounding factor in evaluating the effectiveness of these hormones for preventing future CHD events?

5. Hooton et al. (2012) conducted a randomized trial that compared two antibiotics from different drug classes for uncomplicated urinary tract infection in women. Drug A was ciprofloxacin, a fluoroquinolone; Drug B was cefpodoxime, a new, third-generation cephalosporin. At the time of the study, Drug A was widely used, but bacterial resistance to it was becoming more frequent, leading to concerns about overuse and possible treatment failures. Drug B was thought not to have these shortcomings and was equally easy to take. The trial was designed as a noninferiority trial, with a noninferiority threshold of 10 percentage points in the percent clinically cured after 30 days.

In an intent-to-treat analysis, 139/150 (93%) of women treated with Drug A were clinically cured after 30 days, compared with 123/150 (82%) of women treated with Drug B, for a difference of 11 percentage points (95% CI: 3–18 percentage points).

1. (a) Was Drug A superior to Drug B?

2. (b) Was Drug B noninferior to Drug A?

3. (c) How can you reconcile these two conclusions?
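
The interval quoted in the stem of this exercise can be checked by hand. The sketch below recomputes the risk difference and a large-sample (Wald) 95% confidence interval from the counts given above. The Wald method and the function name `risk_difference_wald_ci` are assumptions made for illustration; the exercise does not say how the published interval was calculated.

```python
import math


def risk_difference_wald_ci(x1, n1, x2, n2, z=1.96):
    """Return (difference, lower, upper) for p1 - p2 using a Wald interval.

    x1/n1 and x2/n2 are the cured/treated counts in the two arms.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p1 - p2
    # Standard error of a difference between two independent proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Counts from the exercise: Drug A 139/150 cured, Drug B 123/150 cured
    diff, lower, upper = risk_difference_wald_ci(139, 150, 123, 150)
    print(f"difference = {diff:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

With these counts the result rounds to a difference of about 11 percentage points with limits near 3 and 18, in line with the figures quoted in the exercise.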

6. Falls are common in older adults. Kannus et al. (2000) conducted a randomized trial of thin, inexpensive pads that can be positioned over the greater femoral trochanter and worn under clothing. If the wearer falls, the hip pad is intended to cushion the blow and keep the upper femur from breaking.

The trial involved 1,801 ambulatory adults who resided in 20 Finnish geriatric treatment units. The treatment units were randomly allocated in a 1:2 ratio to be either a hip-pad unit or a control unit.

1. (a) In principle, each individual adult could have been randomized either to wear hip pads or to serve as a control. Why do you suppose the investigators chose to randomize treatment units instead?

2. (b) The published trial report contains a table comparing the baseline characteristics of residents of units assigned to the hip-pad and control groups, from which the results shown in Table 13.15 are excerpted.

We normally expect only about 5% of baseline comparisons to be statistically significant at the 0.05 level. Why do you think so many of the differences shown in this table resulted in such small p-values?

3. (c) After the treatment units had been randomized, older adults on each unit were asked whether they would be willing to take part in the study. Some 31% of patients on hip-pad units declined, vs. 9% of patients on control units. Could this difference have led to bias? If so, how might it have been avoidable by designing the trial differently?

4. (d) Suppose that you are thinking about conducting a confirmatory trial in which older adults in assisted-living settings would be individually randomized with equal probability to hip-pad or control conditions and then followed for 18 months. Drawing on the Finnish study results, you consider it reasonable to assume that the 18-month cumulative incidence of hip fracture among controls should be about 7.5%. You would like the study to have 80% power to detect a 50% reduction in hip-fracture incidence in the hip-pad group. You plan to test the results for statistical significance with a two-tailed test at the 0.05 level. Assume that dropouts will be negligible. How many subjects would be needed in each arm?

Table 13.15. Baseline Characteristics of Residents in the Intervention and Control Groups in a Randomized Trial of Hip Pads

Values are percent or mean ± s.d.

| Characteristic | Hip-pad group (n = 653) | Control group (n = 1,148) | p-value |
|---|---|---|---|
| Sex | | | 0.41 |
| Female | 77% | 79% | |
| Male | 23% | 21% | |
| Age (years) | 81±6 | 82±6 | 0.006 |
| Weight (kg) | 63.1±11.8 | 65.5±13.1 | < 0.001 |
| Medical conditions | | | |
| Heart disease | 52% | 51% | 0.46 |
| Dementia | 33% | 26% | 0.001 |
| Hypertension | 20% | 23% | 0.13 |
| Past stroke | 21% | 15% | 0.002 |
| Mental status | | | < 0.001 |
| Normal | 39% | 42% | |
| Mild impairment | 19% | 25% | |
| Moderate impairment | 22% | 22% | |
| Severe impairment | 21% | 12% | |
| Walking ability | | | 0.001 |
| Independently | 39% | 35% | |
| With cane or walker | 49% | 57% | |
| With help | 12% | 8% | |

(Based on data from Kannus et al. [2000])

## Answers

1.

1. (a) “Imply” is too strong a word. Even under a properly implemented random allocation process, all five women could be assigned to the ligated group with probability $(1/2)^5 = 1/32 \approx 0.03$, an unusual but not impossible outcome of random allocation. After all, people do win the lottery.

2. (b) Because of the apparent “accident of randomization,” gender was strongly associated with exposure (whether the internal mammary artery was ligated). If gender itself were associated with any of the outcomes under study, then it could become a confounding factor for those outcomes. The results were not reported separately by gender, however, so we cannot actually compare outcomes between men and women within the ligated group.

3. (c) They could have used randomly permuted blocks within each gender stratum. For example, if blocks of size two were used, the first two women recruited for study would be considered to belong to the same block. A random number would then be chosen to decide which of the two women would be assigned to the ligated group, with the other woman automatically going to the non-ligated group.

4. (d) No. This was only a before–after comparison. Symptoms of coronary heart disease tend to vary over time within an individual, waxing and waning in severity. Patients would be unlikely to consider surgery at a time when their symptoms were relatively mild or improving; instead, they would be looking for new therapeutic options when their symptoms were relatively severe or were getting worse. Even in the absence of any therapeutic benefit, we might expect such patients’ symptom severity to improve over time as they regress toward their respective mean severity levels. More convincing evidence of the effectiveness of sham surgery would come from a randomized comparison group that did not undergo sham surgery.

5. (e) A clinically important effect of mammary-artery ligation would be very difficult to detect with only 17 study subjects due to low statistical power. Also, the follow-up period was relatively short and may have missed any longer-term differences in outcomes between the ligated and non-ligated groups.
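The permuted-block scheme described in part (c) can be sketched in a few lines of Python. This is a minimal illustration, not the investigators' procedure: the function name `assign_in_blocks_of_two`, the recruitment order, and the seed are all hypothetical, and the count of 12 men is inferred from the 17 subjects and five women mentioned above.

```python
import random

ARMS = ("ligated", "non-ligated")


def assign_in_blocks_of_two(strata_in_order, seed=None):
    """Assign arms using permuted blocks of size two within each stratum.

    strata_in_order: stratum label (e.g. "F" or "M") of each subject,
    listed in order of recruitment.
    """
    rng = random.Random(seed)
    reserved = {}   # stratum -> arm held for the second member of an open block
    schedule = []
    for stratum in strata_in_order:
        if stratum in reserved:
            # Second member of the block automatically gets the other arm.
            arm = reserved.pop(stratum)
        else:
            # First member of a new block: choose an arm at random.
            arm = rng.choice(ARMS)
            reserved[stratum] = ARMS[1] if arm == ARMS[0] else ARMS[0]
        schedule.append((stratum, arm))
    return schedule


# Hypothetical recruitment order: 12 men and 5 women (17 subjects in all).
recruits = list("MMFMMFMMMFMFMMFMM")
schedule = assign_in_blocks_of_two(recruits, seed=7)

women_ligated = sum(1 for s, a in schedule if s == "F" and a == "ligated")
men_ligated = sum(1 for s, a in schedule if s == "M" and a == "ligated")
print(women_ligated, men_ligated)
```

With blocks of size two, the arms can never differ by more than one subject within a stratum, so the five women must split 2 and 3 rather than 5 and 0.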

2. 2.

1. (a)

• Parents largely controlled whether and how often a pacifier was used to calm their baby. As the results showed, this form of parenting behavior could, in turn, be modified by the investigators through an educational intervention.

• The key outcome, early weaning, was relatively common. It also occurred relatively soon after birth, so that prolonged follow-up was not required.

• Besides early weaning, cry/fuss behavior was also of interest as a secondary outcome. A randomized-trial design lends itself to studying multiple outcomes.

• There was uncertainty about the balance of good and bad effects of pacifiers.

2. (b) Use of the term “double-blind” is questionable in this case. Although the research staff who interviewed participants about study outcomes were unaware of which mother was in which treatment group, the educational intervention itself should have made mothers well aware of whether they had been discouraged from using pacifiers. This educational message could have influenced their perceptions and/or reporting of the frequency of pacifier use and cry/fuss behavior.

3. (c) The target sample size of 140 per group appears to have been based on the hypothesized difference in pacifier use. Applying the sample-size estimation method in Appendix 5B:

$Display mathematics$

The investigators may have used a slightly different sample-size formula (some incorporate a continuity correction), or they may have boosted the target sample size slightly to allow for dropouts. In any event, repeating the above calculations with $p_0 = 0.4$ and $p_1 = 0.35$ yields $n \approx 1{,}968$, so it is clear that the study was too small to detect a 5-percentage-point difference in early weaning with the specified power.
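For readers who want to check a figure of this kind, the sketch below implements one widely used sample-size formula for comparing two proportions (the pooled-variance form). It is an assumption that this matches the Appendix 5B method; the function name `n_per_arm`, the two-sided α of 0.05, and the 90% power used in the example are likewise assumptions, chosen because they give a result close to the number quoted above.

```python
from math import sqrt
from statistics import NormalDist


def n_per_arm(p0, p1, alpha=0.05, power=0.90):
    """Approximate subjects needed per arm to compare two proportions.

    Assumed formula (pooled-variance form, no continuity correction):
      n = [Z_alpha*sqrt(2*pbar*qbar) + Z_beta*sqrt(p0*q0 + p1*q1)]^2 / (p0 - p1)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-tailed test
    z_beta = z.inv_cdf(power)
    p_bar = (p0 + p1) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2
    return numerator / (p0 - p1) ** 2


# Early weaning: 40% versus 35% (a 5-percentage-point difference).
n_weaning = n_per_arm(0.40, 0.35)
print(round(n_weaning))
```

Because the required n is inversely proportional to the square of the difference in proportions, halving the difference roughly quadruples the sample size, which is why a 5-percentage-point difference demands so many more subjects than the trial enrolled.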

4. (d) While we cannot know for sure, mothers who used pacifiers more often may have been less committed to longer-term breastfeeding to begin with, or they may have experienced more discomfort or inconvenience from it once they started. In the randomized-trial analysis, we can expect those possibly subtle factors to be fairly well balanced between groups, even though they were not directly measured.

3. 3. Confounding by center can be prevented by assuring that equal numbers of patients are assigned to the new drug and to placebo within each center. The number of willing and eligible patients at each center is already known, so each center’s patients can be randomized as a single block. A simple algorithm for achieving the desired result is:

1. (a) Construct a data set with 100 observations (rows). Call its first column (variable) Center, and assign the value “A” for the first 10 observations, “B” for the next 20, “C” for the next 30, and “D” for the last 40.

2. (b) Add another column called Assignment, and give half of the observations within each center the value “New drug” and the other half “Placebo” in any order. A convenient way to do this is simply to assign “New drug” and “Placebo” in alternating order from the first observation to the last.

3. (c) Add another column called RN, and fill it in with random numbers from a uniform distribution in the range 0–1. Statistical packages and spreadsheet programs typically have a built-in function for this purpose.

4. (d) Sort the rows by RN within Center. This “shuffles” (randomly permutes) the rows within each center without changing the number of patients assigned to each treatment group.

5. (e) For convenience, another column called Seq can be added, which will contain the sequence number of each row (which corresponds to patient) within a center. That is, rows for center A will be numbered 1…10, those for center B will be numbered 1…20, etc.

Standard statistical packages, such as Stata, SAS, R, or SPSS, can all implement the above method. It is also possible to coax a spreadsheet program such as Excel into doing the same job, or to do the whole task “by hand” using a table of random numbers. But for production work, it is preferable to write a computer file that contains the actual input commands for a statistical package to do the job, so that there is an audit trail that allows the method to be checked.

It is also possible, but unnecessary, to make treatment-group assignments in smaller blocks of, say, 2 within each center. In this case, the number of patients at each center is known in advance, so all patients at a center can be allocated within a single block to maximize unpredictability.
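Steps (a) through (e) above can be sketched as a short script. This is a minimal illustration in plain Python rather than one of the statistical packages named above; the function name `randomize_within_centers` and the seed value are hypothetical, and a fixed seed is used only so that the allocation list can be regenerated for auditing.

```python
import random
from collections import Counter


def randomize_within_centers(center_sizes, seed=None):
    """Randomize each center's patients as a single balanced block."""
    rng = random.Random(seed)
    rows = []
    for center, n in center_sizes.items():
        if n % 2:
            raise ValueError(f"center {center} needs an even number of patients")
        for i in range(n):
            rows.append({
                "Center": center,                                        # step (a)
                "Assignment": "New drug" if i % 2 == 0 else "Placebo",   # step (b)
                "RN": rng.random(),                                      # step (c)
            })
    # Step (d): sorting by RN within Center shuffles the rows of each center.
    rows.sort(key=lambda r: (r["Center"], r["RN"]))
    # Step (e): number the rows 1..n within each center.
    seq = Counter()
    for r in rows:
        seq[r["Center"]] += 1
        r["Seq"] = seq[r["Center"]]
    return rows


CENTER_SIZES = {"A": 10, "B": 20, "C": 30, "D": 40}
schedule = randomize_within_centers(CENTER_SIZES, seed=20240101)
counts = Counter((r["Center"], r["Assignment"]) for r in schedule)

for r in schedule[:5]:
    print(r["Center"], r["Seq"], r["Assignment"])
```

The assignment labels travel with their rows during the sort, so each center keeps exactly half of its patients on each treatment while the order in which the treatments come up is random.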

4. 4.

1. (a) Randomized trials are not statistically efficient for detecting rare or delayed, but nonetheless serious, unintended effects of a therapy. For example, the investigators noted that the HERS trial had insufficient power to determine whether long-term use of hormones would affect the risk of breast cancer, given that “only” 2,763 women were included and that the total length of the study was only 4.1 years.

2. (b) Some exclusions intended chiefly to protect participants’ safety were:

• History of intolerance to hormone therapy

• History of endometrial cancer

• History of deep vein thrombosis or pulmonary embolism

• History of breast cancer or abnormal mammogram

• Abnormal liver function tests

Some exclusions intended chiefly to protect the trial’s internal validity were:

• Unlikely to remain geographically accessible for study visits for 4+ years

• Alcoholism or drug abuse

• Disease (other than CHD) judged likely to be fatal within 4 years

3. (c) Given the large sample size, it is reasonable to assume that randomization balanced the groups well on unknown or unmeasured potential confounders, including the nature and extent of preexisting coronary heart disease.

4. (d) No. The less-frequent use of lipid-lowering drugs among estrogen recipients was probably because estrogens had a similar effect on lipid levels, as reported in the study abstract. Placebo had no such effect. But because lipid-lowering drug use would thus be causally downstream from the treatment that was assigned at random, it cannot be considered a confounding factor. Part of the overall effect of hormone replacement on heart disease risk may include making it less likely that the recipient would receive other lipid-lowering drugs.

5. 5.

1. (a) A verdict on superiority is generally based on whether there is a statistically significant difference in outcomes. In this case, the 95% confidence interval for the difference in percentage clinically cured excluded 0, so Drug A would be considered superior to Drug B by this criterion.

2. (b) A verdict on noninferiority is based on whether we can confidently infer from the trial’s results that the true difference in outcomes lies below the prespecified noninferiority threshold. In this case, the confidence interval for the difference in percentage clinically cured extended from 3% to 18%, which straddles the threshold of 10%. This result implies that the true difference in percentage clinically cured could plausibly be either above or below the noninferiority threshold. The trial thus did not show that Drug B is noninferior to Drug A by this criterion.

3. (c) The terms superiority and noninferiority, while well entrenched, can occasionally lead to some apparently paradoxical combinations. The difficulty is that each involves a binary verdict, but these verdicts are based on two different thresholds. Superiority depends on whether the confidence interval for the difference in outcomes excludes zero. Noninferiority depends on whether that confidence interval excludes Δ, a prespecified non-zero value at which somewhat worse outcomes for a new treatment would be balanced by its other advantages over a standard treatment. In this case, the confidence interval excluded zero but included Δ. The results suggest that Drug A probably does cure a larger percentage of afflicted women than Drug B, but its margin of effectiveness may or may not be large enough to sway the choice between drugs.

(Another theoretically possible result would be a confidence interval for the difference in percentage clinically cured extending from, say, 3% to 8%, which would exclude both zero and Δ. By the superiority/noninferiority terminology, such a result would imply that Drug A was superior to Drug B, but that Drug B was noninferior to Drug A!)

Confusion of this sort can be largely avoided by focusing instead on the observed difference in outcomes and its confidence interval once the trial is completed. This information conveys the degree of uncertainty about comparative effectiveness once evidence from the trial is available, and it allows users to apply their own decision threshold in choosing between the alternatives. See Mulla et al. (2012) and Piaggio et al. (2006) for more discussion of these issues.
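The two verdicts discussed in this answer amount to comparing the same confidence interval against two different thresholds, which a small sketch can make explicit. The function name `verdicts` is hypothetical, and the sketch assumes the interval is for the cure percentage on Drug A minus that on Drug B, with a positive margin Δ.

```python
def verdicts(ci_lower, ci_upper, margin):
    """Return (A superior to B, B noninferior to A).

    ci_lower, ci_upper: confidence limits for (% cured on A) - (% cured on B).
    margin: prespecified noninferiority threshold (delta), in the same units.
    """
    a_superior = ci_lower > 0           # whole interval lies above zero
    b_noninferior = ci_upper < margin   # whole interval lies below delta
    return a_superior, b_noninferior


# Trial result described above: interval from 3% to 18%, threshold of 10%.
observed = verdicts(3, 18, 10)

# The "paradoxical" hypothetical: interval from 3% to 8%.
paradox = verdicts(3, 8, 10)

print(observed, paradox)
```

The second call returns a pair of affirmative verdicts, showing how an interval that lies entirely between zero and Δ yields both "A is superior" and "B is noninferior" at once.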

6. 6.

1. (a) They feared that randomizing individual patients within a unit would pose too great a risk of “contamination” if those who were randomized to the control group felt short-changed, got access to hip pads on their own, and started wearing them.

2. (b) Randomization was conducted at the treatment-unit level, but the statistical testing for this table was apparently done as if individual patients had been randomized. (Note that the n’s atop the columns total to 1,801, not 20.) The mix of patients was evidently quite variable among treatment units, with some units catering to older adults, some to adults with dementia, etc. When entire treatment units were randomized, all adults in a unit that served generally older people or people with dementia were assigned en bloc to one of the treatment groups.

Randomizing clusters generally does not balance the treatment arms as evenly as randomizing individuals, because fewer units are randomly allocated. Proper statistical significance testing must account for the cluster randomization and would no doubt have shown many fewer baseline comparisons to be statistically significant in this instance.

3. (c) Yes, it could have led to bias. Many control-group participants probably would have declined participation had they been assigned to the hip-pad group, but they remained in the study. No comparable subjects remained in the hip-pad group, because they declined to take part. Hence selection bias occurring after randomization may have rendered the two groups of participants dissimilar.

This bias might have been avoided if potential participants had been asked first if they would be willing to be randomized to either a hip-pad group or a control group. Only those who consented would then have been randomized and taken part in the rest of the study.

4. (d) For power = 80%, $\beta = 0.2$, which corresponds to $Z_\beta \approx 0.84$. For a two-tailed test at $\alpha = 0.05$, $Z_\alpha \approx 1.96$.

$Display mathematics$
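As a hedged sketch of how such a calculation runs, the worked equation below assumes the pooled-variance formula for comparing two proportions, which may differ in detail from the chapter's own method. It uses $p_0 = 0.075$ for controls, $p_1 = 0.0375$ for the hip-pad group (a 50% reduction), and $\bar{p} = (p_0 + p_1)/2 = 0.05625$, with $q = 1 - p$ throughout.

$$
n = \frac{\left[ Z_\alpha \sqrt{2\bar{p}\bar{q}} + Z_\beta \sqrt{p_0 q_0 + p_1 q_1} \right]^2}{(p_0 - p_1)^2}
= \frac{\left[ 1.96 \sqrt{2(0.05625)(0.94375)} + 0.84 \sqrt{(0.075)(0.925) + (0.0375)(0.9625)} \right]^2}{(0.0375)^2}
\approx \frac{(0.6386 + 0.2728)^2}{0.001406} \approx 591
$$

Under these assumptions, roughly 590 subjects would be needed in each arm. A formula with a continuity correction would give a somewhat larger figure.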

## Notes:

1. Technically, they are pseudo-random numbers. Computers are deterministic machines and cannot generate truly random numbers. Instead, they apply well-studied numerical algorithms to carry out calculations that produce a stream of digits that behave as though they were random (Knuth, 1997). If the underlying algorithm is known, every digit in the sequence can be predicted with certainty; but otherwise, the sequence appears to be random and can pass various tests for randomness.
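The determinism described in this note is easy to demonstrate. The sketch below uses Python's standard `random` module; the seed value 12345 is arbitrary. Two generators started from the same seed produce exactly the same stream of numbers.

```python
import random

# Two independent generators initialized with the same (arbitrary) seed.
first = random.Random(12345)
second = random.Random(12345)

stream_a = [first.random() for _ in range(5)]
stream_b = [second.random() for _ in range(5)]

# A generator started from a different seed follows a different sequence.
third = random.Random(54321)
stream_c = [third.random() for _ in range(5)]

print(stream_a == stream_b, stream_a == stream_c)
```

This reproducibility is what makes a recorded seed useful as part of an audit trail: anyone with the seed and the program can regenerate the same allocation list.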