Understanding clinical trials in palliative care research
Introduction
Components of the randomized, controlled trial (RCT) have been part of clinical research for several centuries, but the modern concept of the importance of the random selection of a control group can be traced to one of the first studies of streptomycin for pulmonary tuberculosis, conducted by the British Medical Research Council and published in 1948 (1, 2). Prior to that time, the evaluation of health care relied on what today would be considered anecdotal evidence. The RCT represents the gold standard for the evaluation of the efficacy of new medical therapies. Despite this status, several potential scientific and ethical difficulties continue to limit the use of RCTs in some clinical contexts, such as palliative care, and hinder the generalizability of their results in others. Understanding these limitations and how to apply the results of such studies to clinical care is an important process.
By presenting the various strengths and limitations that are common to all RCTs and how they apply to trials of palliative care interventions, we will consider how decisions made regarding trial design, conduct, and analysis can influence the trial’s results. Basic issues in the analysis of clinical trial data as they apply to the interpretation of RCT results will also be considered. We hope that this chapter will provide palliative care clinicians with a proper understanding of the structure and inherent problems with clinical trials. No research experiment can ever be perfect, but the information provided by RCTs is extremely useful as a basis for evidence-based approaches to clinical care. With this knowledge, the reader of science should be able to ascertain whether the results of published trials are (1) likely to be valid and (2) likely to apply to their patients.
The anatomy of a trial
Study design issues must be considered from conceptualization, through implementation and ending with the interpretation of results. Subtle flaws in a trial design or conduct may lead to inappropriate conclusions. Decisions made regarding every component of an RCT can dramatically influence the quality of the data and the outcome of a trial.
Even when a clinical trial is perfectly designed, there is no guarantee of finding the right answer to a specific research question. Because of random variation, there is always a probability of reaching a false positive or a false negative conclusion simply by chance. A statistical analysis is conducted primarily to (1) summarize the data to estimate the size of the effect being observed that can be attributed to the treatment being tested and (2) to estimate the probability that the results obtained occurred simply by chance. The conventional selection of a p-value cut-off point of 0.05 means that we are willing to accept a 1/20 probability of getting a false positive answer by chance alone. As a result, replication of a trial’s results is always preferable before clinicians can confidently make decisions about patient care, since no single trial should ever be considered definitive proof of the presence or absence of efficacy.
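The meaning of the 0.05 cut-off can be illustrated with a small simulation. The sketch below is our own illustration rather than part of any cited study: it assumes normally distributed outcomes with a known standard deviation, uses a simple z-test, and repeatedly runs a 'trial' in which the two groups truly do not differ. The function names, the number of simulated trials, and the group size are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(group_a, group_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means (known common SD assumed)."""
    standard_error = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_trials=2000, n_per_group=50, alpha=0.05, seed=1):
    """Share of simulated trials with NO true treatment effect that reach p < alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        # Both groups are drawn from the same distribution: the null hypothesis is true.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(group_a, group_b) < alpha:
            false_positives += 1
    return false_positives / n_trials


rate = false_positive_rate()
print(f"Share of null trials declared significant: {rate:.3f}")
```

Under these assumptions the share of 'significant' results should fall close to 1 in 20, even though no treatment effect exists in any of the simulated trials.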
The question
The initial step in designing any trial or understanding the results is to define what research question is being asked. This can seem like a relatively simple process, but it is often a tremendous challenge to replicate a clinical reality in the research setting. Properly designing a clinical trial requires reducing an important clinical question to a testable hypothesis. Many clinically relevant questions cannot be studied because the appropriate population is not available, because the number of participants required may be prohibitively large, or because of ethical issues. This is especially true in studies of palliative care, where randomizing patients to a potentially less effective treatment may not be considered ethical.
In such situations it is often necessary to modify the research question to one that is more readily answerable. It is important to understand that the answer to a different question may or may not apply to the original clinical scenario. Attempting to answer the right question in the setting of an RCT may result in compromises in the study design, sacrificing either precision or protection from bias. In some cases, a poor study design may reduce the value of the knowledge gained, and hence alter the risk–benefit calculation used to justify the research (3, 4). From the outset, an alternative design should be considered if the investigators are not clear about the clinical importance of answering the proposed question.
The choice of an appropriate outcome and the measure to be used to collect the data is an important step in ensuring that the research question can be answered with a clinically relevant primary outcome that can be measured, analysed, and interpreted. This must be done a priori, before the data are collected or analysed. In order to maintain acceptable limits on the probability of arriving at an incorrect result by chance, both the clinical importance of the effect of an intervention and the characteristics of the data to be collected must be defined. Although multiple secondary outcomes are often also tested within a single trial, each should be identified at the outset, and none should subsequently replace the primary outcome after the data have been collected and analysed (a posteriori).
Choosing outcomes is often dependent on the disease area to be studied. A recently published summary of outcomes created in conjunction with the Agency for Healthcare Research and Quality (AHRQ) has specified areas including symptom management, quality of life (QOL) including function and satisfaction, family burden, and quality-of-death measures. There are a number of evaluative tools that have been used over time or recommended by various groups, but they are beyond the scope of this chapter and well described elsewhere (5). Almost all of the measures used for palliative care studies are patient-oriented self-reports, which require a careful approach to analysis and interpretation.
Types of trial
Part of the process of choosing the question is choosing the basic design of the trial. While there are many different variations on an RCT, the two basic formats are to design a trial to demonstrate an effect (either beneficial or harmful) of an exposure (or treatment) over no exposure; or to design a trial to demonstrate equivalence between two different treatments. In both cases, the goal is to demonstrate an outcome that has a less than 1/20 probability of occurring by chance.
Efficacy trials
An efficacy trial is defined by demonstrating that the two groups are not the same (i.e. that the null hypothesis is false). This trial benefits from the fact that demonstrating two things are different is substantially easier than proving that two things are the same. The calculation of sample size is based on established probability calculations that take into consideration the underlying variability of the measurement and the size of the effect to be evaluated. The majority of this chapter will focus on the issues involved in the design and interpretation of an efficacy trial.
Non-inferiority and equivalence trials
Given the concerns about the difficulties in conducting placebo-controlled trials and maintaining blinding, investigators occasionally attempt to show that a new drug is either ‘no worse than’ (a non-inferiority trial), or ‘as good as’ (an equivalence trial) a treatment that is commonly accepted as effective. In evaluating therapeutics for conditions in which the risks of placebo assignment are widely regarded as too great, such as thrombolytic agents for acute myocardial infarction or stroke, active-controlled, non-inferiority trials are the standard (6, 7).
These trials are presently not considered standard for problems such as hypertension, hyperlipidaemia, and pain, because the risks of temporarily foregoing active treatment are not as obvious, and because there are several potential problems with interpreting such studies( 9 – 13 ). The first problem is that equivalence trials essentially aim to confirm the conventional null hypothesis of no treatment difference. One way to purposefully reduce the chance of showing a difference is to conduct a poorly designed study. This may create inappropriate incentives for conducting ‘sloppy’ research(14 ). Second, such trials generally require larger numbers of participants, because equivalence or non-inferiority must be documented within relatively narrow margins. Third, demonstrating that two treatments are the same does not show that either of them works. This problem is related to ‘assay sensitivity’, because such trials require the assumption that the standard therapy would have proven superior to placebo had a placebo arm been included in the study design used (9 ). Because of these concerns, current regulatory guidelines still call for placebo-controlled trials to evaluate treatments for problems such as pain and other symptoms(15 ).
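The statement that non-inferiority must be documented within a narrow margin, and that this drives up the required numbers, can be made concrete. The following is a minimal sketch under our own assumptions: a binary response outcome, a simple Wald confidence interval for the difference in response proportions, and a hypothetical non-inferiority margin of 10 percentage points. The function name and all counts are invented for illustration.

```python
from statistics import NormalDist


def non_inferiority_check(resp_new, n_new, resp_std, n_std, margin, alpha=0.05):
    """Return (lower, upper, non_inferior) for the difference new - standard.

    The new treatment is called non-inferior when the lower confidence limit of
    the difference in response proportions lies above -margin.
    """
    p_new = resp_new / n_new
    p_std = resp_std / n_std
    diff = p_new - p_std
    standard_error = (p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = diff - z * standard_error, diff + z * standard_error
    return lower, upper, lower > -margin


# Same observed response rates (70% vs 72%), two different trial sizes.
small_trial = non_inferiority_check(140, 200, 144, 200, margin=0.10)
large_trial = non_inferiority_check(280, 400, 288, 400, margin=0.10)

for label, (lower, upper, ok) in (("200 per arm", small_trial), ("400 per arm", large_trial)):
    print(f"{label}: difference CI {lower:+.3f} to {upper:+.3f}, non-inferior: {ok}")
```

With identical observed response rates, only the larger hypothetical trial has a confidence interval narrow enough to exclude the margin. Note that neither result says anything about whether the standard treatment itself would have beaten placebo in this setting, which is the assay sensitivity problem described above.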
Randomization
The most important feature of an experimental clinical trial is the equivalence of the comparison groups at baseline such that any differences measured at the end can be attributed to the differences between the treatments administered. Random selection of subjects from a sufficiently large single population equally distributes known and unknown factors that might otherwise influence the outcome, such as age, sex, and disease severity.
This is in contrast to observational studies (e.g. case–control, cohort, or cross-sectional studies), which depend on nature to set up the experiment. In this type of study, there is a substantial possibility that known or unknown factors may create differences between the comparison groups at baseline, limiting our ability to attribute any subsequent changes to the treatment. Unmeasured confounding or bias can potentially lead to the wrong conclusion from the study. In experimental studies, randomization is the primary mechanism used to create equivalence between the comparison groups. By minimizing the possibility of differences at baseline, an RCT enables investigators to more confidently attribute observed changes over time to the assigned treatments.
True randomization is accomplished by generating a set of random numbers and distributing them via a mechanism that protects the integrity of the random assignment. A centrally managed randomization scheme may help to ensure consistent application of the procedure across sites and staff. Central control of the randomization will also prevent members of the study team from knowingly or unknowingly influencing the assignment, especially if they are not blinded to a patient’s treatment allocation.
Randomization works correctly only when sufficient numbers of patients are enrolled to ensure an equal distribution of all important factors. In smaller trials, or in large, multi-centre trials with few participants from a given centre, chance alone may cause significant differences in the distributions of important demographic or disease-related characteristics between groups. In order to reduce the likelihood of such occurrences, investigators can use a stratified block randomization scheme to ensure that selected participant characteristics will be equally distributed. For example, if investigators wished to guarantee an equal sex distribution between two treatment arms at multiple sites, they may randomize men and women separately at each site in blocks of six participants each, within which three participants would be assigned to each arm.
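One common way to balance a characteristic such as sex across arms is permuted-block randomization carried out separately within each stratum. The sketch below is our own illustration of that technique, not a description of any particular trial's system: the class name, arm labels, site names, and seed are hypothetical, and a real trial would normally run such a scheme through a central, concealed allocation service.

```python
import random
from collections import Counter


class StratifiedBlockRandomizer:
    """Permuted-block randomization carried out separately within each stratum."""

    def __init__(self, arms=("active", "placebo"), block_size=6, seed=None):
        if block_size % len(arms) != 0:
            raise ValueError("block_size must be a multiple of the number of arms")
        self.arms = tuple(arms)
        self.block_size = block_size
        self._rng = random.Random(seed)
        self._pending = {}  # stratum -> assignments remaining in the current block

    def _new_block(self):
        # Each block contains every arm equally often, in random order.
        per_arm = self.block_size // len(self.arms)
        block = [arm for arm in self.arms for _ in range(per_arm)]
        self._rng.shuffle(block)
        return block

    def assign(self, site, sex):
        """Return the next treatment assignment for a participant in (site, sex)."""
        stratum = (site, sex)
        if not self._pending.get(stratum):
            self._pending[stratum] = self._new_block()
        return self._pending[stratum].pop()


randomizer = StratifiedBlockRandomizer(seed=2024)
women_site_a = [randomizer.assign("site_A", "female") for _ in range(6)]
men_site_a = [randomizer.assign("site_A", "male") for _ in range(6)]
print("Women, site A:", Counter(women_site_a))
print("Men, site A:  ", Counter(men_site_a))
```

After every completed block of six within a stratum, exactly three participants of that sex at that site have been assigned to each arm, so the sex distribution cannot drift far out of balance between arms at any site.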
Control group
The purpose of the control group is to provide an appropriate comparison for the treatment group, in order to be able to attribute causality to the difference between the treatments given to each group. This difference in treatment can be as specific as an individual effect of a medication, such as an opioid, or as complex as a whole system approach to care, such as inpatient versus outpatient hospice care. Assuming that the treatment groups are equivalent at baseline (through randomization), then the differences seen in outcome of the groups can be attributed to differences in the treatment. In addition, the degree to which blinding is applied to each group is important for our ability to interpret the resulting data (see following paragraphs). There are three primary types of control groups that can be used in clinical trials: (1) a no-treatment control; (2) a placebo control; and (3) an active control.
Each type provides different information in comparison to the active treatment group, and the usefulness will depend on the research question that is being explored by the study. The no-treatment control group can provide information about changes that happen based on the (1) natural history of the disease (there is a normal variation in the status of any disease state) and (2) regression to the mean (patients with severe symptoms tend to get better). The placebo-treated control group, when properly blinded, also controls for the mind–body interaction that occurs either from participating in a clinical trial or because of the subject’s belief in the therapy. The mind–body response is an especially important part of many symptomatic therapy trials. An active control group is best thought of as a diagnostic test of the design and conduct of the clinical trial. By administering a drug known to be active in the disease being tested, a positive result provides evidence that the study has been properly designed and conducted. If the experimental agent then does not demonstrate an effect, the negative result is more convincing. Conversely, if the active agent does not produce an effect, the design and conduct of the trial are called into question. In this situation, a negative result with the experimental agent is as likely to be due to the problems in the design as a true lack of efficacy.
No-treatment-controlled trials
A no-treatment control group, where participants receive no intervention or a delayed intervention, is applicable in two primary situations. The first situation arises when there are practical and/or ethical problems with using a placebo or sham control. For example, it is often difficult to construct an appropriate sham intervention for many trials of surgical interventions. Even if adequate shams could be constructed, some feel that assigning patients to receive an invasive, but non-active, intervention is unethical (16). In such situations, a randomly assigned control group is clearly better than not having a control group, but the results must be viewed cautiously, since many factors can affect a subject’s response to a treatment when both the subject and the investigator know which group the subject is in. As discussed in the following paragraphs, at least the study staff who collect and record the outcome measures should be blinded to the subject’s group if possible.
The second situation occurs when a goal of the trial is to determine the magnitude of the mind–body placebo effect. The placebo-treated group will have a response that is a mix of the natural history of disease and regression to the mean, along with the mind–body placebo effect. By having a no-treatment control group, the mind–body placebo effect can be estimated. In a meta-analysis of 114 trials employing both placebo- and no-treatment-controls, the placebo effect was less than might be expected, but in symptomatic relief the effect can be large(17 ). In pain research, the placebo-control group patients typically have a more favourable outcome than those in no-treatment-control groups(17 ).
Placebo-controlled trials
A placebo control group is the best known and most widely used of the possible control groups. A placebo is defined as an inactive treatment designed to mimic, as closely as possible, the characteristics of the active treatment. The purpose is to have the control group treated exactly the same as the treatment group except for the specific component being tested. The usefulness of a placebo assumes that at least the study subject will be blinded to the type of therapy they are receiving. Creating a placebo for a drug trial is relatively straightforward. An inactive substance is formulated to have a similar appearance and route of administration to the active treatment. Procedure-oriented therapies are much harder to mimic, and it is therefore significantly harder to obtain true blinding (see following paragraphs). In the absence of blinding, the placebo group is equivalent to the no-treatment group.
The primary benefit of using a placebo control group, rather than a no-treatment control group, is that it enables the specific efficacy of the new intervention to be distinguished from the many non-specific effects of all therapies, including the well-known mind–body interaction (also called the placebo effect) (18, 19). The placebo control group is a group of patients who are treated with a placebo. The response measured in this group includes three separate processes, namely: (1) natural history of the disease; (2) regression to the mean; and (3) the mind–body interaction. The mind–body interaction is a change in brain function that, at least temporarily, leads to improvement in the bodily signs or symptoms. The mind–body interaction is also sometimes known as a non-specific action, while the direct effect of the therapy on the disease is known as a specific action of the therapy.
Assuming a simple additive model of treatment effects, the magnitude of the placebo effect in a given study can be estimated by taking the mean (or median) response in the placebo group and subtracting the response in the no-treatment group. In addition, the placebo group response can be subtracted from the mean response in the active treatment group to estimate the specific efficacy of the new intervention. Though the magnitude of true placebo effects can vary depending on the disease being treated, the treatment modality, and the outcome expectation of the patients, the effect is generally larger in studies of the treatment of symptoms and in the management of pain (17).
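The arithmetic of the additive model can be written out as a short sketch. The numbers below are invented solely for illustration (improvements on a 0–10 pain scale in a hypothetical three-arm trial), and the function name is our own; the additive model itself is the assumption stated above, not an established fact about any given therapy.

```python
from statistics import mean


def decompose_effects(no_treatment, placebo, active):
    """Split mean improvement into components under a simple additive model."""
    background = mean(no_treatment)              # natural history + regression to the mean
    placebo_effect = mean(placebo) - background  # mind-body (non-specific) component
    specific_effect = mean(active) - mean(placebo)  # specific action of the therapy
    return {
        "background": background,
        "placebo_effect": placebo_effect,
        "specific_effect": specific_effect,
    }


# Hypothetical improvements in pain score (points on a 0-10 scale) per participant.
no_treatment_arm = [0.5, 1.0, 1.5, 1.0]
placebo_arm = [2.0, 1.5, 2.5, 2.0]
active_arm = [3.5, 3.0, 4.0, 3.5]

effects = decompose_effects(no_treatment_arm, placebo_arm, active_arm)
for name, value in effects.items():
    print(f"{name}: {value:.1f} points")
```

In this invented example the three components sum to the mean improvement in the active arm, which is exactly what the additive assumption requires; if the components interact, this simple subtraction no longer isolates the specific effect.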
A placebo control group assumes the study will be conducted in a double-blind fashion. This helps to avoid the biases that may ensue if patients, investigators, or both, knew who would be receiving which treatment. But there are also costs to using a placebo control. The first, and most obvious, is that placebo-controlled trials require that some patients be given an inactive treatment despite the existence of other potentially effective interventions. The ethics of placebo-controlled trials in such settings remains a hotly debated topic (8, 18–20), and is considered further elsewhere in this book. The second cost to conducting placebo-controlled trials is that, while they remain the gold standard for documenting absolute efficacy, they do not always answer a clinically relevant question. For practicing clinicians, who have several symptomatic therapies at their disposal, knowing whether another medication works better than nothing is not as important as knowing how the new therapy compares to the existing standard of care (21).
Participant selection
Another critical decision for investigators designing trials, and for clinicians who use the data, regards the selection of study participants. There are two conflicting priorities: (1) ensuring similarities between participants in the experimental and control groups and (2) testing a new treatment in a sample of patients likely to reflect all those who could benefit from using the intervention. To meet the first goal, investigators would attempt to enrol patients who are relatively homogenous so there are fewer differences to even out with randomization. Strict inclusion and exclusion criteria allow greater confidence that the observed differences in outcomes are attributable to the treatments being compared, rather than to undetected confounding variables related to the compositions of participants in each group.
By contrast, meeting the second goal requires enrolling participants from a more heterogeneous population. Because of the large interpersonal variability inherent in such a population, this approach can substantially increase the number of participants required to assure that the trial has adequate statistical power to document a treatment difference, if one exists. Despite this disadvantage, enrolling a heterogeneous sample allows sub-group analyses to be conducted, and so potential variations in a treatment’s efficacy among higher- and lower-risk patients may be identified. There are thus advantages and disadvantages to enrolling more or less homogeneous participants. As a result, early investigations of efficacy are commonly conducted using a select group of participants, whereas later, more definitive trials attempt to enrol more broadly representative patient samples. Physicians should, therefore, consider the composition of a given trial’s sample in order to determine the extent to which the results are generalizable to their own patients.
In palliative care populations, there is the additional issue of the frailty of the population and the potential lack of stability in their disease state over time. Finding patients who will remain relatively stable for the duration of the trial can be a difficult challenge. In addition, vulnerable populations may make choices that are not always consistent with the goals of a trial, either to participate out of desperation or to not participate because they do not want to be part of an experiment. There is frequently the additional problem of the ability of some patients to understand enough about their disease to be able to give an informed consent. When patients are cognitively challenged in addition, the process of recruitment can become a seemingly overwhelming task. The ethical issues surrounding these problems are considered elsewhere in this book.
Blinding
Over the last century, a growing understanding of the ability of the mind to influence functions of the body, along with the desire to enhance the experimental rigor of clinical trials, has increased appreciation of the need for blinding. Recall that the primary goal of a clinical trial is to ensure that any differences between groups seen at the end of the trial may be attributed to the specific treatment being studied. To accomplish this, not only must all comparison groups be similar at the start, but participants in all groups should feel that they have the same probability of getting the real treatment.
Thus, the blinding of the study participants is of substantial importance, and investigators must design the study to prevent the participants from unblinding themselves. In particular, if a medication has a specific taste, common side effects, or other distinctive traits, it is important that the placebo treatment mimic these characteristics as closely as possible. Even in evaluating more invasive interventions, sham procedures have occasionally been employed such as making skin incisions(22 ) or burr holes in the skull(23 ) to mimic the real surgical procedures.
In addition to creating a suitable placebo, investigators should plan to determine whether the blinding was maintained by asking participants what treatment they think they received and why. Such questions should be posed to participants occasionally during the trial, and at the trial’s completion (24 ). If the blinding is successful, the participants’ guesses should be no more accurate than chance (e.g. 50 per cent in a typical two-arm trial). Blinding can be difficult to maintain ( 24 – 32 ). Study participants can often predict their receipt of placebo due to the absence of side effects, or their receipt of the real treatment by noting adverse effects of the intervention.
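Whether participants' guesses are 'no more accurate than chance' can be checked with a simple test. The sketch below is one possible approach under our own assumptions: a two-arm trial with equal allocation, every participant making a guess, and an exact two-sided binomial test against a 50 per cent correct-guess rate. The function name and the counts are hypothetical, and formal blinding indices exist that handle 'don't know' answers, which this sketch ignores.

```python
from math import comb


def binomial_two_sided_p(correct, n, p_chance=0.5):
    """Exact two-sided binomial p-value for `correct` right guesses out of `n`.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under pure guessing.
    """
    probabilities = [
        comb(n, k) * p_chance**k * (1 - p_chance) ** (n - k) for k in range(n + 1)
    ]
    observed = probabilities[correct]
    tail = sum(q for q in probabilities if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical end-of-trial questionnaires from two different trials of 100 people.
p_well_blinded = binomial_two_sided_p(correct=52, n=100)
p_unblinded = binomial_two_sided_p(correct=70, n=100)

print(f"52/100 correct guesses: p = {p_well_blinded:.3f}")
print(f"70/100 correct guesses: p = {p_unblinded:.5f}")
```

A small p-value in this setting suggests that participants could tell which treatment they received, for example from side effects, and that the trial's results should be read with that possible bias in mind.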
When only participants are blinded to the treatment received, the study is labelled as a single-blind trial. The standard use of the term double-blind applies when both the participants and investigators are blinded, assuming that the investigator is collecting the outcome data. If the investigator is not collecting the data, it is critical to blind the person who is, so as to minimize the chance that evaluators will more favourably rate those receiving the innovative treatment, thereby biasing the trial toward finding a benefit of that treatment. Even in studies where the subject fills out their own forms, blinding of the investigator remains essential to minimize the possibilities that they would impart different levels of enthusiasm, or prescribe different co-interventions, to patients in the different treatment groups.
Sample size
The statistical power of an RCT to show a difference between treatments is determined by: (1) the number of participants to be enrolled; (2) the effect size (treatment difference) that is deemed to be clinically important; (3) the variability of the outcomes in the two groups; and (4) the significance threshold (the Type I error rate, or α) chosen to connote statistical significance (typically set at 0.05). In general, the size of the sample to be tested is the variable investigators most commonly adjust to obtain adequate power, that is, an adequate probability of detecting a meaningful treatment difference when one truly exists. Power equals 1 minus the Type II error rate (1 − β) and is conventionally set at 80 per cent or more, although some prefer 90 per cent.
It is a truism that with a sufficiently large sample size, any real difference between groups, no matter how small or clinically irrelevant, can be shown to be statistically significant. The converse is also true: a large, clinically important difference between treatments can fail to reach statistical significance when inadequate numbers of participants are enrolled.
The most common method of calculating the sample size required to achieve 80 per cent power (or greater) is to first determine: (1) the size of the effect that would be considered clinically important; (2) the anticipated response in the control group; and (3) the expected variability of the outcomes in both groups. This last determination may be particularly difficult to estimate, and should, when possible, be based on evidence from prior studies of similar diseases and/or treatments.
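For a continuous outcome compared between two equal-sized groups, these inputs are often combined using the normal-approximation formula n per group = 2 (z for α/2 + z for power)² σ² / δ², where δ is the clinically important difference and σ the expected standard deviation. The sketch below applies that textbook approximation; the function name and the example values (a 1-point difference on a 0–10 pain scale with a standard deviation of 2.5) are our own hypothetical choices, and a trial statistician may prefer a more exact method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a mean difference `delta`.

    Uses the two-sided normal approximation with equal group sizes and a
    common standard deviation `sd`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the Type I error
    z_power = NormalDist().inv_cdf(power)          # corresponds to 1 - Type II error
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


n_80 = sample_size_per_group(delta=1.0, sd=2.5, power=0.80)
n_90 = sample_size_per_group(delta=1.0, sd=2.5, power=0.90)
n_half_delta = sample_size_per_group(delta=0.5, sd=2.5, power=0.80)

print(f"80% power, 1-point difference:   {n_80} per group")
print(f"90% power, 1-point difference:   {n_90} per group")
print(f"80% power, 0.5-point difference: {n_half_delta} per group")
```

The comparison highlights two practical points: raising power from 80 to 90 per cent increases the required sample appreciably, and halving the difference to be detected roughly quadruples it, which is why the choice of a clinically important effect size matters so much at the design stage.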
An alternate method used when the sample size is fixed is to calculate the size of the effect that would need to be present to produce a statistically significant outcome. This approach is rarely preferable to setting the sample size to detect a specified difference but is commonly used when the available population is fixed.
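The fixed-sample approach amounts to rearranging the same normal-approximation formula to solve for the effect size. The following sketch does so for two equal groups; as before, the function name and the example values (50 participants per group, standard deviation of 2.5 points) are hypothetical assumptions chosen only for illustration.

```python
from statistics import NormalDist


def minimum_detectable_difference(n_per_group, sd, alpha=0.05, power=0.80):
    """Smallest true mean difference detectable with the stated power.

    Normal approximation, two equal groups, common standard deviation `sd`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd * (2 / n_per_group) ** 0.5


mdd_50 = minimum_detectable_difference(n_per_group=50, sd=2.5)
mdd_200 = minimum_detectable_difference(n_per_group=200, sd=2.5)

print(f"50 per group:  detectable difference of about {mdd_50:.2f} points")
print(f"200 per group: detectable difference of about {mdd_200:.2f} points")
```

If the detectable difference turns out to be larger than any effect that could plausibly occur, the trial is unlikely to be informative, whatever its result.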
Outcome measurement
Another critical decision to be made in planning an RCT involves how to measure the chosen outcome of interest. For example, if investigators are interested in studying the effects of a new antihypertensive agent on systolic and diastolic blood pressure, should they measure these values with a mercury sphygmomanometer or via an arterial line? In addition to how the outcome will be measured, investigators must further consider when and how often to measure the outcome. Are single readings once each week adequate, or should participants be equipped with ambulatory blood pressure monitors to obtain multiple readings throughout the day? Finally, investigators must consider how to account for other variables that could alter the measurement, such as body position when the blood pressure is assessed.
Regardless of what measurement technique is chosen, it should be characterized by three features. First, the measurement should be reliable: if the same measure is used repeatedly in the same person under identical conditions without this person’s condition changing, then the measure should produce the same results each time. Second, the measure should be valid: it should measure exactly what it is intended to measure. Third, the measure should be responsive: it should change over time if the condition being measured has truly changed. Though a full discussion of these concepts is beyond the scope of this chapter, the topics are well covered in many textbooks (33). If the outcome measure has not been routinely used in other similar research, its reliability, validity, and responsiveness should be formally tested and documented in the intended population.
The criteria of reliability, validity, and responsiveness also depend on what form the outcome takes. For example, in pain management, the primary goal is to improve the patient’s subjective sense of comfort. For this purpose, investigators might ask a simple question, such as, ‘Do you feel better, yes or no?’ Because such a measure has only two possible responses, it may not provide an adequately responsive measure of pain relief.
To help differentiate the level of response, investigators might ask, ‘What percentage of pain relief do you get from the treatment?’ However, such questions require patients to remember their previous conditions. Alternatively, investigators could use a 0–10 numerical rating scale at both the beginning and end of the study to measure the change in pain over time. Deciding which measurement is most appropriate for a given situation should be informed by considerations of how much change in the measure would be important to the patient, and the ability of the chosen scale to detect such a change.
Another measurement concern in palliative care trials relates to the fact that a change in pain or nausea may only provide one component of an overall change in quality of life. Thus, symptomatic reports may be considered surrogate markers for changes in the more important outcome, the overall quality of life. The use of surrogate markers is a widespread practice in clinical trials. For example, investigators routinely monitor changes in serum cholesterol as a surrogate measure for the risk of myocardial infarction. However, using surrogate measures requires making an assumption, in this case that a reduction in cholesterol will lead to a reduced risk of myocardial infarction. Similarly, if the use of an experimental analgesic agent relieves pain but produces substantial side effects, the patient may not consider its use an improvement in quality of life. Therefore, if investigators wish to know an intervention’s effects on both the level of pain and the overall quality of life, then they must employ tools to measure both. Since there is no single measurement strategy that is universally applicable, it is important to carefully consider whether the measured outcome is appropriate to answer the research question being posed in the clinical trial.
Analysis
Like other aspects of a clinical trial, the specific analytic strategy should be defined before commencing the study. Many different analytic approaches are possible and each will produce an answer to a slightly different research question. It is important that the chosen strategy be appropriate to evaluate the primary research question, and be compatible with the numerical distribution of the data collected. The primary role of the analysis is to summarize the data (size of the effect) and to provide an estimate of the likelihood that the result was obtained by chance alone.
Effect size
The first, and most important, result of any analysis is the summary of the size of the effect resulting from the experimental therapy. In RCTs, the size of the effect is estimated by determining a summary value for the primary outcome in each treatment group, and then calculating the difference between these values to reveal the difference in the treatment effect. There are only two primary forms for the summary value for a set of trial data: (1) the central tendency (e.g. mean, median, or mode) of the response among participants, or (2) the proportion of participants who achieve a defined level of response.
For example, in a hypertension trial, investigators can report the mean change in diastolic blood pressure (central tendency), or the proportion of patients who achieve a diastolic blood pressure below 90 mmHg (proportion of responders). If one were interested in the effect of an intervention on hospice length of stay, it might be acceptable to report either the median time spent in the programme for each group (central tendency), or the proportion of patients in each group who die within a week or some other predefined time period. Finally, in trials of pain management, in which the outcome of reported pain symptoms is provided on a numeric scale, investigators might report either the mean response in each group, or the percentage of patients in each group reporting reductions in pain intensity of 33 per cent (or 50 per cent) or greater. In each case, the units of these summary values should correspond to the units of the outcome measure.
Choices regarding how best to present the summary measures should reflect the type of information that is most relevant for practicing clinicians. For most health-care providers, the question of interest is whether a given treatment will work for a given patient rather than the average change. The average change does not provide a unique answer to the question of the number of people who are likely to improve. For example, suppose investigators reported that the mean response in the active treatment group was an improvement of 10 per cent on a standard pain scale. This same result could apply to data indicating that: (1) every patient in the active treatment group improved by 10 per cent (a unimodal distribution); (2) half of the patients in the treatment group improved by 20 per cent and half had no improvement (a bimodal distribution); or (3) half of the patients in the active treatment group improved by 40 per cent, and half deteriorated by 20 per cent (a bimodal distribution in which some patients improve and others deteriorate). Because these three descriptions of the underlying data could yield strikingly different clinical decisions, it is important to present an analysis of the proportions of patients in each group who improve or deteriorate by a clinically important amount.
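The three scenarios above can be checked with a short calculation. The following Python sketch uses hypothetical per-patient data constructed to match the text (ten patients per scenario), and the 20 per cent threshold used to define a 'responder' is an assumption chosen purely for illustration.

```python
from statistics import mean

# Hypothetical percentage improvements per patient (negative = deterioration),
# constructed to mirror the three scenarios described in the text.
scenarios = {
    "every patient improves by 10": [10] * 10,
    "half improve by 20, half unchanged": [20] * 5 + [0] * 5,
    "half improve by 40, half worsen by 20": [40] * 5 + [-20] * 5,
}

def responder_proportion(changes, threshold):
    """Fraction of patients who improve by at least `threshold` per cent."""
    return sum(c >= threshold for c in changes) / len(changes)

def deteriorator_proportion(changes, threshold):
    """Fraction of patients who worsen by at least `threshold` per cent."""
    return sum(c <= -threshold for c in changes) / len(changes)

THRESHOLD = 20  # assumed clinically important change, for illustration only

for name, changes in scenarios.items():
    print(
        f"{name}: mean={mean(changes)}, "
        f"improved={responder_proportion(changes, THRESHOLD):.0%}, "
        f"deteriorated={deteriorator_proportion(changes, THRESHOLD):.0%}"
    )
```

All three scenarios share the same mean, yet the proportions who improve or deteriorate differ, which is the information a clinician would need when deciding whether to offer the treatment.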
What is a clinically important difference?
A common concern about presenting the proportion of ‘responders’ in each group is the need to define a level of response to be considered clinically important. Thus, the determination of a clinically important difference (CID) in a patient’s symptoms plays a key role in the interpretation of symptomatic studies. Two methods for determining the CID are ‘expert opinion’, and an assessment of how changes in symptom scales correspond to responses to global questions( 34 – 36 ). Regardless of the method used, however, each requires that a somewhat arbitrary decision be made in defining the scale to be considered the standard.
Recent studies of pain have adopted an alternative method of displaying response data, by graphing the proportion of responders at each possible outcome level for all the groups in a clinical trial(37 ). This display is a form of a cumulative distribution and allows the readers of the published report to select the level of improvement that they feel is clinically important and then determine the difference between the various groups at that level.
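As a rough illustration of how such a cumulative display is built, the Python sketch below tabulates the proportion of responders at each cut-off for two groups. The patient data and the function name `cumulative_responders` are hypothetical, and the published method may differ in detail.

```python
def cumulative_responders(changes, cutoffs):
    """For each cut-off, the proportion of patients whose improvement
    is at least that large."""
    n = len(changes)
    return {c: sum(x >= c for x in changes) / n for c in cutoffs}

# Hypothetical percentage reductions in pain intensity for ten patients per group.
active = [0, 10, 20, 30, 30, 40, 50, 50, 60, 80]
placebo = [0, 0, 0, 10, 10, 20, 20, 30, 40, 50]
cutoffs = range(0, 101, 10)

active_curve = cumulative_responders(active, cutoffs)
placebo_curve = cumulative_responders(placebo, cutoffs)

# A reader can pick any cut-off and read off the between-group difference.
for c in cutoffs:
    difference = active_curve[c] - placebo_curve[c]
    print(f"cut-off {c:3d}%: active={active_curve[c]:.1f} "
          f"placebo={placebo_curve[c]:.1f} difference={difference:+.1f}")
```

Plotting the two curves against the cut-off gives the graph described above; the table alone already lets a reader compare the groups at whichever level of improvement they consider clinically important.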
Statistical considerations
Statistical significance (p values and confidence intervals)
A p-value < 0.05 is the most commonly accepted threshold for concluding that a given result is unlikely to have occurred by chance. However, this value is strictly arbitrary and is an indication that we are willing to accept a 1/20 chance of a false-positive result. This traditional method of hypothesis testing, in which p-values are reported to quantify the significance of a result, is gradually being replaced by methods used to gauge the range of plausible results that are compatible with the data. The most common method for presenting this range is to report a point estimate of the effect size, along with a 95 per cent confidence interval around this estimate. Confidence intervals constructed in this way will include the true population value of the effect 19 times out of 20 (95 per cent). Thus, the interval can help readers determine the uncertainty inherent in any result: the narrower the interval, the more precise the estimate of the true effect, and thus, the more confident readers can be that the reported result is ‘right’.
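To make the link between sample size and interval width concrete, the following Python sketch computes a point estimate and an approximate 95 per cent confidence interval for a difference in responder proportions. It assumes the simple normal (Wald) approximation, which is only one of several available methods, and the trial counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_in_proportions_ci(resp_a, n_a, resp_b, n_b, level=0.95):
    """Point estimate and approximate (Wald) confidence interval for the
    difference in responder proportions between groups A and B."""
    p_a, p_b = resp_a / n_a, resp_b / n_b
    diff = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95 per cent
    return diff, diff - z * se, diff + z * se

# Hypothetical trials with the same response rates (60% vs 30%) but different sizes.
small_trial = diff_in_proportions_ci(12, 20, 6, 20)
large_trial = diff_in_proportions_ci(120, 200, 60, 200)

for label, (diff, lower, upper) in (("n=20 per group", small_trial),
                                    ("n=200 per group", large_trial)):
    print(f"{label}: difference={diff:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Both hypothetical trials give the same point estimate, but the larger trial yields a much narrower interval, so its estimate of the true effect is more precise.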
Multiple comparisons
It is also important to realize that when investigators choose a p-value of 0.05 as an acceptable Type I (false positive) error rate, this value only applies to a single comparison between groups. In most clinical trials, however, performing multiple comparisons can be informative. The greater the number of comparisons, the more likely it is that at least one of them will be spuriously positive by chance alone. If an a priori decision is made to perform multiple comparisons, the p-value threshold must be adjusted. Of the several available methods for adjusting this value, the simplest, called the Bonferroni adjustment(38 ), is to divide the threshold for one comparison by the number of comparisons to be performed, and to then use this new value as the cut-off for statistical significance across all analyses. While valid, this is a very conservative adjustment, and alternative methods have been developed to deal with multiple comparisons(39 ).
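The arithmetic behind this concern can be sketched in a few lines of Python. The calculation of the overall false-positive rate assumes that the comparisons are statistically independent, which is a simplification; the function names are illustrative only.

```python
def familywise_error_rate(alpha, k):
    """Probability of at least one false positive across k independent
    comparisons, each tested at significance threshold alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(alpha, k):
    """Per-comparison threshold under the Bonferroni adjustment."""
    return alpha / k

ALPHA = 0.05
for k in (1, 5, 10, 20):
    unadjusted = familywise_error_rate(ALPHA, k)
    threshold = bonferroni_threshold(ALPHA, k)
    adjusted = familywise_error_rate(threshold, k)
    print(f"{k:2d} comparisons: chance of a spurious positive = {unadjusted:.2f}; "
          f"Bonferroni threshold = {threshold:.4f} "
          f"(overall rate after adjustment = {adjusted:.3f})")
```

Under these assumptions the overall false-positive rate after adjustment stays just below 0.05 however many comparisons are made, at the cost of a much stricter threshold for each individual comparison.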
A related issue is the distinction between comparisons chosen a priori, and those which investigators choose to conduct post hoc, after the data have been collected. There are times when post hoc comparisons can be informative, but the results of such analyses should never be considered conclusive because they were not explicitly planned at the outset. Rather, results of post hoc analyses may be considered exploratory, intended to guide future investigations. Authors can help highlight this distinction by reporting which comparisons were chosen a priori, and which were not.
Evidence from secondary measures
In addition to the reported effect size and statistical significance of the primary outcome results, corroborative evidence from secondary outcome analyses should be used to support a study’s hypothesis. If multiple related measures are obtained, and the analyses of each show similar results, then it is less likely that any one of the positive results arose by chance. While there is no specific statistical test to document this phenomenon, showing that multiple related measures all produce similar types of effects lends support to the conclusions drawn from the primary outcome.
Evaluating side effects
Evaluating side effects of interventions tested in clinical trials is subject to the same considerations as those used to evaluate measures of efficacy. The major difference is that, in many cases, the side effects to be evaluated are not specified a priori, but evaluated only when they are observed to occur. Since many common side effects can occur spontaneously and independent of the treatment received, it is important to compare the relative incidence in the active treatment and control groups. Differing rates of side effects must be evaluated with caution, since clinical trials are rarely powered to detect such differences, and the large number of different side effects that are possible makes it likely that one or more of the differences observed will be due to chance. Such differences should not be ignored, but observing similar results in multiple trials can increase one’s confidence that the findings may be specifically attributed to the treatment received.
Publication
Thorough presentation of methods and results
Given that many components of a trial are central to interpreting the results, it is vital that trial reports be accurate, complete, and objective in their presentation of all important aspects of the trial. In particular, the a priori hypothesis should be clearly stated, and the discovery of other findings properly identified. All randomized participants must be accounted for in the publication, and an intention-to-treat analysis of all participants is typically appropriate, even when some participants drop out early or never receive their assigned intervention. Subsequent sub-group analyses can focus on those who complete the trial, but these should not be considered as the primary result. A careful description of the randomization and blinding procedures is also important to assure readers that the trial was properly conducted. Finally, brief descriptions of the rationale behind the choice of measurement tools and analytic strategies can be helpful.
Negative studies are as important as positive ones
There is now good evidence for a publication bias against negative studies, since authors prefer to write up positive ones and editors prefer to publish the same( 40 , 41 ). This can lead to difficulties for clinicians who want a true picture of the nature of the evidence for a particular treatment.
Potential limitations of RCTs
There are several issues inherent in the design and conduct of RCTs that may threaten the internal validity of the results—that is, the likelihood that the treatment comparison is free from bias. Furthermore, even when the comparison is internally valid, the external validity, or generalizability of the results, can be limited. Finally, because the conditions in which trials are conducted only weakly approximate clinical reality, physicians must be cautious in using the results as the only guide in clinical decision making. We will briefly discuss each of these potential problems in the following paragraphs. More detailed discussions of these issues are provided by Feinstein(42 ) and by Kramer and Shapiro(43 ).
Under-enrolment
Under-enrolment occurs when too few research participants are enrolled to provide adequate statistical power to answer the study’s primary research questions. The inability to recruit sufficient numbers of eligible patients is the most common cause of insufficient statistical power in RCTs( 44 – 49 ). Such under-enrolment has been attributed to characteristics of: (1) clinicians who refer their patients( 50 – 52 ); (2) patients who choose to be screened(53 ) or enrolled(54 ); (3) investigators who design the trials(55 ); and (4) institutions at which the trials are conducted( 56 , 57 ).
Among the challenges to adequate participant recruitment, potential participants’ reluctance to enrol in RCTs is likely to be the most formidable, especially in palliative care populations. It has been observed that patients are generally less willing to participate in RCTs than in non-randomized, observational studies(43 ). In addition to yielding unacceptably high probabilities for type II errors, the resulting under-enrolment substantially reduces the trial’s precision in quantifying the treatment effect.
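The effect of under-enrolment on the type II error rate can be illustrated with an approximate power calculation. The Python sketch below assumes a two-sided comparison of two responder proportions using a normal approximation; the response rates and group sizes are hypothetical, and a formal sample size calculation for a real trial would need to reflect its actual design.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_active, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions,
    using the normal approximation with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_active * (1 - p_active) / n_per_group)
    # Probability that the observed difference exceeds the significance cut-off.
    return NormalDist().cdf(abs(p_active - p_control) / se - z_alpha)

# Hypothetical true response rates: 30% on control, 50% on active treatment.
for n in (30, 50, 100, 150):
    power = approx_power(0.30, 0.50, n)
    print(f"n={n:3d} per group: power={power:.2f}, "
          f"type II error probability={1 - power:.2f}")
```

Under these assumptions, a trial that recruits only 30 patients per group instead of a planned 100 would be more likely to miss a true 20 percentage point difference than to detect it.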
Selective enrolment
Even when properly designed and carefully conducted, clinical trials can only provide information specific to the population from which the study participants were drawn. Selective enrolment occurs when particular sub-groups within the target population enrol in proportions greater or less than their representation in that population( 58 , 59 ). If this population does not include, for example, elderly patients, women, or children, then applying the results to these clinical populations requires extrapolation. While extrapolating results may sometimes be reasonable, it must always be done cautiously because both the beneficial and adverse effects of an intervention can vary across populations.
The level of response detected by an individual trial will depend on the patient population enrolled. For example, when first studying a novel treatment for a condition for which there is no adequately effective treatment, all patients with the condition are more likely to be willing to volunteer. By including people with relatively early or mild symptoms, the response rate may be higher than expected, although the response in the placebo group may also be larger. In contrast, when a treatment is tested in a population where an effective treatment already exists, only people who do not obtain a response to the available treatments are likely to enrol. This more recalcitrant group may have a lower response rate than expected in the total population, thereby underestimating the treatment’s potential usefulness.
Poor participant adherence
In RCTs, patients often do not adhere completely to their prescribed treatment regimens(43 ). Especially if patients believe they are receiving a non-preferred treatment, their enthusiasm for the trial, and subsequent adherence to their assigned treatment, may wane. This is further complicated if the patients have access to and decide to take either the experimental therapy or a concomitant additional therapy outside of the trial. This occurs more frequently in trials where patients are able to overcome the blinding. There is accumulating evidence that many study participants make concerted efforts to unblind themselves, and that participants who become aware of their treatment assignment may be more likely to drop out of the study. For example, many patients assigned to the placebo groups in both the initial phase II trial of AZT for AIDS patients(60 ), and in a randomized trial of vitamin E for patients with Alzheimer’s disease(61 ), appear to have become unblinded, and even to have obtained the active agents outside the trial( 62 – 64 ). Even more problematically, widespread unblinding in one AIDS Clinical Trial Group study(65 ) not only allowed approximately 9 per cent of those assigned to the placebo to receive zidovudine, but contributed to the drop-out rate in the placebo group being one-third higher than it was in the active treatment group(66 ).
Participant non-adherence and drop-out can substantially bias the results of a trial(67 ). Though intention-to-treat analyses may mitigate this bias, if non-adherence or drop-out rates are higher in one group than in the other, such analyses may also prevent a true effect of treatment from being detected. Thus, investigators should make concerted efforts to monitor participant adherence and drop-outs. When such problems exist, the results of the trial must be interpreted cautiously.
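A simple calculation shows how non-adherence can shrink the effect seen in an intention-to-treat analysis. The Python sketch below rests on a strong simplifying assumption, namely that non-adherent patients in the active group respond at the same rate as placebo patients; all of the rates are hypothetical.

```python
def itt_response_rate(rate_if_adherent, rate_if_not, adherence):
    """Response rate observed in a group analysed as randomized, when only a
    fraction `adherence` of its members actually take the assigned treatment."""
    return adherence * rate_if_adherent + (1 - adherence) * rate_if_not

# Hypothetical true response rates with full adherence.
true_active, true_placebo = 0.50, 0.20

# Assume 70% adherence in the active group, with non-adherers responding
# at the placebo rate (an illustrative assumption, not an empirical finding).
observed_active = itt_response_rate(true_active, true_placebo, adherence=0.70)

true_difference = true_active - true_placebo
observed_difference = observed_active - true_placebo

print(f"true difference:     {true_difference:.2f}")
print(f"observed difference: {observed_difference:.2f}")
```

Because the observed difference is smaller than the true one, a trial sized to detect the true difference would have less power than planned, which is one way a real treatment effect can go undetected.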
Ethical issues in palliative care research (see also Chapter 7.6)
As with all clinical research, palliative care studies require informed consent of the participants and, when cognitive impairment is an issue, from the appropriate family member or medical surrogate. Especially in situations where curative therapies are not likely to be effective, there are a number of important issues to consider, and a detailed discussion of this topic is provided in a separate chapter in this book. The most important issue is the balance between the right of the individual to receive compassionate care and the needs of the population for information on the efficacy and safety of specific therapies. When conducting clinical trials, the investigator must carefully protect the rights and well-being of the subjects in the study. One possible alternative is the use of innovative approaches for the conduct of clinical trials(68 ). Although beyond the scope of this chapter, trial designs such as response-adaptive randomization procedures (e.g. ‘play-the-winner’ or ‘drop-the-loser’) may be more ethically appropriate for the testing of therapies in conditions, such as those seen in the palliative care population, that may have significant consequences for the quantity and/or quality of life. Such designs focus on minimizing the expected treatment failures while maintaining the power and randomization benefits(69 ). ‘Add-on’ trials, where a new treatment is added to the current treatments the patient is receiving, may reduce the consequences to the individual patients from being randomized to a placebo treatment. Building in rescue strategies to the trial design can also reduce the potential risk to study participants(70 ). Crossover trial designs may also be useful in studies of diseases and symptoms that are relatively stable. Using patients as their own control markedly increases the power of the study, but concerns about carryover effects between treatment periods are a serious risk to the validity of the study( 71 , 72 ).
Conclusions
In this chapter, we have outlined several fundamental considerations for investigators planning clinical trials, and for clinicians attempting to discern the applicability of such trials to their practices. We have given special consideration to the nuances of clinical trials for palliative care interventions. In summary, randomized, controlled trials remain the best available means of evaluating novel palliative care interventions, and for determining how these interventions may be optimally used. Despite the strengths of the design, readers of trial reports should be mindful of the many difficulties inherent in extrapolating from the results obtained in a trial setting to the use of these same interventions in clinical practice.
References
1. Medical Research Council. (1948). Streptomycin treatment of pulmonary tuberculosis. British Medical Journal, 2, 769–82.
2. Anonymous. (1998). Fifty years of randomised controlled trials. British Medical Journal, 317.
3. Freedman, B. (1987). Scientific value and validity as ethical requirements for research: A proposed explication. IRB, 9, 7–10.
4. Emmanuel, E.J., Wendler, D., and Grady, C. (2000). What makes clinical research ethical? Journal of the American Medical Association, 283, 2701–11.
5. Mularski, R.A., Rosenfeld, K., Coons, S.J. et al. (2007). Measuring outcomes in randomized prospective trials in palliative care. Journal of Pain and Symptom Management, 34(1 Suppl), S7–S19.
6. Anonymous. (1997). A comparison of continuous infusion of alteplase with double-bolus administration for acute myocardial infarction. The Continuous Infusion versus Double-Bolus Administration of Alteplase (COBALT) Investigators. New England Journal of Medicine, 337, 1124–30.
7. Anonymous. (1997). A comparison of reteplase with alteplase for acute myocardial infarction. The Global Use of Strategies to Open Occluded Coronary Arteries (GUSTO III) Investigators. New England Journal of Medicine, 337, 1118–23.
8. Temple, R., Ellenberg, S.S. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: Ethical and scientific issues. Annals of Internal Medicine, 133, 455–63.
9. Anonymous. (1999). Single-bolus tenecteplase compared with front-loaded alteplase in acute myocardial infarction: the ASSENT-2 double-blind randomised trial. Assessment of the Safety and Efficacy of a New Thrombolytic Investigators. Lancet, 354, 716–22.
10. Temple, R.J. (1997). When are clinical trials of a given agent vs. placebo no longer appropriate or feasible? Controlled Clinical Trials, 18, 613–20.
11. Temple, R. (1996). Problems in interpreting active control equivalence trials. Accountability Research, 4, 267–75.
12. Jones, B., Jarvis, P., Lewis, J.A. et al. (1996). Trials to assess equivalence: the importance of rigorous methods. British Medical Journal, 313, 36–9.
13. Fleming, T.R. (2000). Design and interpretation of equivalence trials. American Heart Journal, 139, S171–6.
14. Ellenberg, S.S. and Temple, R. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 2: Practical issues and specific cases. Annals of Internal Medicine, 133, 464–70.
15. Food and Drug Administration. (2001). Guidance for industry: E 10: Choice of control group and related issues in clinical trials. Department of Health and Human Services, Rockville, MD.
16. Hrobjartsson, A. and Gotzsche, P.C. (2001). Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. New England Journal of Medicine, 344, 1594–602.
17. Chaput de Saintonge, D.M. and Herxheimer, A. (1994). Harnessing placebo effects in health care. Lancet, 344, 995–8.
18. Freedman, B., Glass, K.C., Weijer, C. (1996). Placebo orthodoxy in clinical research II: Ethical, legal, and regulatory myths. Journal of Law and Medical Ethics, 24, 252–9.
19. Freedman, B., Weijer, C., Glass, K.C. (1996). Placebo orthodoxy in clinical research I: Empirical and methodological myths. Journal of Law and Medical Ethics, 24, 243–51.
20. Kleijnen, J., de Craen, A.J.M., Everdingen, J.V. et al. (1994). Placebo effect in double-blind clinical trials: a review of interactions with medications. Lancet, 344, 1347–9.
21. Rothman, K.J., Michels, K.B. (1994). The continuing unethical use of placebo controls. New England Journal of Medicine, 331, 394–8.
22. Halpern, S.D., Karlawish, J.H.T. (2000). Placebo-controlled trials are unethical in clinical hypertension research. Archives of Internal Medicine, 160, 3167–8.
23. Cobb, L.A., Thomas, G.I., Dillard, D.H. et al. (1959). An evaluation of internal-mammary-artery ligation by double-blind technic. New England Journal of Medicine, 260, 1115–8.
24. Macklin, R. (1999). The ethical problems with sham surgery in clinical research. New England Journal of Medicine, 341, 992–6.
25. Freeman, T.B., Vawter, D.E., Leaverton, P.E. et al. (1999). Use of placebo surgery in controlled trials of a cellular-based therapy for Parkinson’s Disease. New England Journal of Medicine, 341, 988–92.
26. Morin, C.M., Colecchi, C., Brink, D. et al. (1995). How “blind” are double-blind placebo-controlled trials of benzodiazepine hypnotics? Sleep, 18, 240–5.
27. Karlowski, T.R., Chalmers, T.C., Frenkel, L.D. et al. (1975). Ascorbic acid for the common cold: A prophylactic and therapeutic trial. Journal of the American Medical Association, 231, 1038–42.
28. Howard, J., Whittemore, A.S., Hoover, J. et al. (1982). The Aspirin Myocardial Infarction Study Research Group. How blind was the patient blind in AMIS? Clinical Pharmacology and Therapeutics, 32, 543–53.
29. Brownell, K.D. and Stunkard, A.J. (1982). The double-blind in danger: Untoward consequences of informed consent. American Journal of Psychiatry, 139, 1487–9.
30. Byrington, R., Curb, D.J., and Mattson, M.E. (1985). Assessment of blindness at the conclusion of the beta-blocker heart attack trial. Journal of the American Medical Association, 253, 1733–6.
31. Rabkin, J.G., Markowitz, J.S., Stewart, J. et al. (1986). How blind is blind? Assessment of patient and doctor medication guesses in a placebo-controlled trial of imipramine and phenelzine. Psychiatry Research, 19, 75–86.
32. Moscussi, M., Byrne, L., Weintraub, M. et al. (1987). Blinding, unblinding and the placebo effect: An analysis of patients’ guesses of treatment assignment in a double-blind clinical trial. Clinical Pharmacology and Therapeutics, 41, 259–65.
33. Fisher, S. and Greenberg, R.P. (1993). How sound is the double-blind design for evaluating psychotropic drugs? Journal of Nervous and Mental Disease, 181, 345–50.
34. Basoglu, M., Marks, I., Livanou, M. et al. (1997). Double-blindness procedures, rater blindness, and ratings of outcome: Observations from a controlled trial. Archives of General Psychiatry, 54, 744–8.
35. Streiner, D.L. and Norman, G.R. (2003). Health Measurement Scales: A practical guide to their development and use, 3rd edition. New York: Oxford University Press.
36. Jaeschke, R., Singer, J., and Guyatt, G.H. (1989). Measurement of health status. Ascertaining the minimal clinically important difference. Controlled Clinical Trials, 10, 407–15.
37. Jaeschke, R., Guyatt, G.H., Keller, J. et al. (1991). Interpreting changes in quality-of-life score in N of 1 randomized trials. Controlled Clinical Trials, 12, 226S–33S.
38. Todd, K.H. (1996). Clinical versus statistical significance in the assessment of pain relief. Annals of Emergency Medicine, 27, 439–41.
39. Farrar, J.T., Dworkin, R.H., Max, M.B. (2006). Use of the Cumulative Proportion of Responders Analysis (CPRA) Graph to Present Pain Data over a Range of Cut-off Points: Making Clinical Trial Data More Understandable. Journal of Pain and Symptom Management, 30(4), 369–77.
40. Hilsenbeck, S.G., Clark, G.M. (1996). Practical p-value adjustment for optimally selected cutpoints. Statistics in Medicine, 15(1), 103–12.
41. Liu, Q., Li, Y., and Boyett, J.M. (1997). Controlling false positive rates in prognostic factor analyses with small samples. Statistics in Medicine, 16(18), 2095–101.
42. Begg, C.B. and Berlin, J.A. (1988). Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society A, 151, 419–63.
43. Reidenberg, M.M. (1998). Decreasing publication bias. Clinical Pharmacology and Therapeutics, 63, 1–3.
44. Feinstein, A.R. (1983). An additional basic science for clinical medicine: II. The limitations of randomized trials. Annals of Internal Medicine, 99, 544–50.
45. Kramer, M.S. and Shapiro, S.H. (1984). Scientific challenges in the application of randomized trials. Journal of the American Medical Association, 252, 2739–45.
46. Freiman, J.A., Chalmers, T.C., Smith, H., et al. (1978). The importance of beta, the type II error and sample size in the design and interpretation of the randomized controlled trial: survey of 71 “negative” trials. New England Journal of Medicine, 299, 690–4.
47. Altman, D.G. (1980). Statistics and ethics in medical research III: How large a sample? British Medical Journal, 281, 1336–8.
48. Collins, J.F., Bingham, S.F., Weiss, D.G., et al. (1980). Some adaptive strategies for inadequate sample acquisition in Veterans Administration cooperative clinical trials. Controlled Clinical Trials, 1, 227–48.
49. Hunningshake, D.B., Darby, C.A., Probstfield, J.L. (1987). Recruitment experience in clinical trials: Literature summary and annotated bibliography. Controlled Clinical Trials, 8, 6S–30S.
50. Meinert, C.L. (1986). Patient recruitment and enrollment. Clinical trials: Design, conduct, and analysis, pp. 149–58. New York: Oxford University Press.
51. Nathan, R.A. (1999). How important is patient recruitment in performing clinical trials? Journal of Asthma, 36, 213–6.
52. Taylor, K.M., Margolese, R.G., and Soskolne, C.L. (1984). Physicians’ reasons for not entering eligible patients in a randomized clinical trial of adjuvant surgery for breast cancer. New England Journal of Medicine, 310, 1363–7.
53. Taylor, K.M. (1992). Physician participation in a randomized clinical trial for ocular melanoma. Annals of Ophthalmology, 24, 337–44.
54. Taylor, K.M., Feldstein, M.L., Skeel, R.T., et al. (1994). Fundamental dilemmas of the randomized clinical trial process: results of a survey of the 1,737 Eastern Cooperative Oncology Group investigators. Journal of Clinical Oncology, 12, 1796–805.
55. Greenlick, M.R., Bailey, J.W., Wild, J., et al. (1979). Characteristics of men most likely to respond to an invitation to be screened. American Journal of Public Health, 69, 1011–5.
56. Barofsky, I. and Sugarbaker, P.H. (1979). Determinants of patient nonparticipation in randomized clinical trials for the treatment of sarcomas. Cancer Clinical Trials, 2, 137–46.
57. Collins, J.F., Williford, W.O., Weiss, D.G., et al. (1984). Planning patient recruitment: Fantasy and reality. Statistics in Medicine, 3, 435–43.
58. Begg, C.B., Carbone, P.P., Elson, P.J., et al. (1982). Participation of community hospitals in clinical trials. Analysis of five years of experience in the Eastern Cooperative Oncology Group. New England Journal of Medicine, 306, 1076–80.
59. Shea, S., Bigger, Jr., T., Campion, J., et al. (1992). Enrollment in clinical trials: Institutional factors affecting enrollment in the Cardiac Arrhythmia Suppression Trial (CAST). Controlled Clinical Trials, 13, 466–86.
60. Mant, D. (1999). Can randomised trials inform clinical decisions about individual patients? Lancet, 353, 743–6.
61. Halpern, S.D., Metzger, D.S., Berlin, J.A., et al. (2001). Who will enroll? Predicting participation in a phase II AIDS vaccine trial. Journal of Acquired Immune Deficiency Syndrome, 27, 281–8.
62. Fischl, M.A., Richman, D.D., Grieco, M.H., et al. (1987). The efficacy of azidothymidine (AZT) in the treatment of patients with AIDS and AIDS-related complex: A double-blind, placebo-controlled trial. New England Journal of Medicine, 317, 185–91.
63. Sano, M., Ernesto, C., and Thomas, R.G. (1997). A controlled trial of selegiline, alpha-tocopherol, or both as treatment for Alzheimer’s disease. New England Journal of Medicine, 336.
64. Kodish, E., Lantos, J.D., Siegler, M. (1990). Ethical considerations in randomized controlled clinical trials. Cancer, 65, 2400–4.
65. Epstein, S. (1996). Impure science: AIDS, activism, and the politics of knowledge. Berkeley: University of California, Berkeley Press.
66. Karlawish, J.H.T. and Whitehouse, P.J. (1998). Is the placebo control obsolete in a world after donepezil and vitamin E? Archives of Neurology, 55, 1420–4.
67. Volberding, P.A., Lagakos, S.W., Koch, M.A., et al. (1990). Zidovudine in asymptomatic Human Immunodeficiency Virus infection: a controlled trial in persons with fewer than 500 CD4-positive cells per cubic millimeter. New England Journal of Medicine, 322, 941–9.
68. Merrigan, T.C. (1990). You can teach an old dog new tricks: How AIDS trials are pioneering new strategies. New England Journal of Medicine, 323, 1341–3.
69. Peto, R., Collins, R., and Gray, R. (1995). Large-scale randomized evidence: Large, simple trials and overviews of trials. Journal of Clinical Epidemiology, 48, 23–40.
70. Streiner, D.L. (2007). Alternatives to placebo-controlled trials. Canadian Journal of Neurological Sciences, 34(Suppl 1), S37–41.
71. Rosenberger, W.F. and Huc, F. (2004). Maximizing power and minimizing treatment failures in clinical trials. Clinical Trials, 1(2), 141–7.
72. Boers, M. (2003). Add-on or step-up trials for new drug development in rheumatoid arthritis: a new standard? Arthritis and Rheumatism, 48(6), 1481–3.
73. Simon, L.J. and Chinchilli, V.M. (2007). A matched crossover design for clinical trials. Contemporary Clinical Trials, 28(5), 638–46.
74. Garcia, R., Benet, M., Arnau, C., et al. (2004). Efficiency of the cross-over design: an empirical estimation. Statistics in Medicine, 23(24), 3773–80.