3.1 Systematic public availability of randomized trial results

In early breast cancer all clinically apparent disease can, by definition, be removed surgically. Following such surgery, adjuvant systemic treatments involving various cytotoxic, hormonal or other therapies may be considered. Despite numerous clinical trials of adjuvant therapy, however, there has been uncertainty as to the net effects of such treatments, particularly on survival. The present report arises from a collaboration that started in 19841,2 between the principal investigators of these randomized trials of adjuvant therapy for early breast cancer that included some evaluation of tamoxifen, of cytotoxic therapy, of radiotherapy3 or of ovarian ablation that was "unconfounded" (i.e. not mixed up with other evaluations: see below).

There have already been over 300 randomized trials assessing various different primary treatments and different adjuvant treatments for early breast cancer, and a few of the main therapeutic questions have already been addressed by several dozen different trials, some published and some not. The purpose of the present report is to bring together the principal results from as many as possible of the trials that evaluated certain adjuvant treatments (particularly tamoxifen or cytotoxic therapy), so that reviews of the trial evidence on those questions can be more reliable and informative than a review of just the published literature.4 The methods and data for each of the trials involved are presented separately by each trial group in the brief trial summaries that form an Appendix [not reproduced here] to this report.

As examples illustrating some of the uses to which such data can be put, systematic overviews of the principal results from certain categories of these trials are also presented using statistical methods that are particularly suited to trial overviews.5,6 The purpose of these overview analyses is not, however, to impose any particular interpretation or method of analysis of the trial results, but merely to demonstrate the ability of an appropriate overview of many trial results to yield informative conclusions. Different readers may prefer to review only a limited number of the trials: if so, the data for such sub-analyses are available in this report. Similarly, different readers may prefer to review only certain subcategories of all women, or to use different statistical methods of review: if so, the data from each separate trial are available in sufficient detail to permit various alternative approaches. Thus, one of the chief functions of the present report is just to make available unbiased data from all (or at least from as many as initially possible) of the relevant randomized trials in early breast cancer, so as to facilitate the construction, discussion and publication of various different interpretations of the trial evidence. The main biases to be avoided are discussed below.

3.2 Informative overviews even of somewhat different trials, despite heterogeneity of design and of size of treatment effect

To illustrate the circumstances to which overviews of different trial results may be relevant, the question of the net effects on mortality of adjuvant treatment with tamoxifen can be considered. There are already a few dozen randomized trials of adjuvant tamoxifen, but they are heterogeneous in their entry criteria, their treatment schedules, their follow-up procedures and their methods for treating relapses. At first sight such heterogeneity may appear to be an important reason not to perform overviews. A converse view, however, is that if a certain type of treatment is of some benefit in one trial then a somewhat similar type of treatment will probably also be of at least some benefit in another trial, and that even though these benefits are probably not the same size in each trial they will, but for the play of chance, probably tend to point in the same direction in each study. If so, then an overview of the separate tamoxifen trials might provide a useful test of whether or not such treatment has any therapeutic effect, together with at least some indication of the sizes of therapeutic effect that are "typical" in the circumstances of these particular trials.

At one extreme, therefore, each tamoxifen trial might be considered in virtual isolation from all others, while at the opposite extreme all tamoxifen trials might be considered together. Both of these extreme views have some merit, and the pursuit of each by different people will perhaps prove more illuminating than too definite an insistence on any one particular approach. This will be still more the case for the chemotherapy trial results, since several very different regimens have been tested, ranging from a few days of cyclophosphamide to some months or years of fairly intensive multiple-agent chemotherapy. Moreover, even chemotherapy schedules that nominally involve the same agents - cyclophosphamide, methotrexate, and 5-fluorouracil (CMF), for example - may differ so much in intensity or duration that their effects are substantially different.

3.3 Importance of reliably assessing MODERATE treatment effects

Opinions differ as to the likelihood of large (e.g. 50% or more) differences in long-term survival being produced by the types of treatment for early breast cancer that are usually compared in randomized trials. Some, remembering past experience with childhood leukemia7 and Hodgkin's disease,8 expect large improvements to emerge, while others regard the expectation of large differences in mortality as somewhat unrealistic.9 There appears, however, to be much wider agreement that moderate (e.g. 15% or 25%) differences in mortality might well exist and that reliable knowledge of such effects could be humanly important when choosing what treatments to recommend for an individual woman. For example, "just" a 20% reduction in a 50% risk of death would represent avoidance of death in 1 out of every 10 women treated, which is substantial. If large mortality differences exist, then discovering them would in general be more important than discovering moderate differences. But even in the absence of large differences, the reliable demonstration, or reliable refutation, of any moderate differences may still be important. For example, each year in North America alone early breast cancer is diagnosed in well over 100,000 women.10 Although the primary treatment can eliminate all clinically apparent disease, more than 30,000 of these women who originally presented with only early breast cancer will eventually suffer disease recurrence and die. Early breast cancer is also common in other parts of the world, so a widely practicable treatment that produced a reduction in mortality as small as 15% could avoid or delay several thousand deaths a year, provided it was adopted widely by the medical profession.
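The arithmetic behind these two examples can be written out explicitly. The sketch below is illustrative only: the function name is hypothetical, and the figure of 30,000 deaths is the rounded North American lower bound quoted above rather than an exact count.

```python
def deaths_avoided(baseline_deaths, proportional_reduction):
    """Deaths avoided when a given number of expected deaths is cut by a fixed proportion."""
    return baseline_deaths * proportional_reduction

# A 20% reduction in a 50% risk: 50 expected deaths per 100 women treated.
per_100_women = deaths_avoided(100 * 0.50, 0.20)

# A 15% reduction applied to the (at least) 30,000 women a year in North America
# who would otherwise eventually die after presenting with early breast cancer.
per_year_north_america = deaths_avoided(30_000, 0.15)

print(f"Deaths avoided per 100 women treated: {per_100_women:.0f}")
print(f"Deaths avoided or delayed per year:   {per_year_north_america:.0f}")
```

Under these assumptions the first figure corresponds to 1 woman in every 10 treated, and the second to roughly 4,500 deaths a year in North America alone, which is consistent with "several thousand" once other parts of the world are included.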

3.4 Two main reasons for using overviews of properly randomized trials to assess MODERATE treatment effects: avoiding selection biases and reducing random errors

Reliable detection (or refutation) of treatment effects that are only moderate in size requires the reliable exclusion of (i) moderate biases and (ii) moderate random errors, either of which might obscure (or mimic) moderate treatment effects. Each of these requirements may be difficult to meet adequately without a proper overview of the unconfounded randomized trials.4

First, without a systematic search for all relevant randomized trials an unrepresentative selection of the trial results may be reviewed. The selection biases caused by this might not matter much if treatment had a large effect on long-term mortality, but these and other selection biases do matter if moderate treatment effects are to be assessed reliably, or if an ineffective treatment is to be reliably recognized as such.

Second, unless comparisons of the effects of treatment are based on trials that together include several hundred (or, preferably, several thousand) deaths, the random play of chance can produce favorable or unfavorable random errors that are comparable in size with any moderate mortality reductions that might exist. Such large numbers of deaths are difficult to achieve in individual trials but may be achievable in an overview of the results of many different trials.
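A rough calculation indicates why several hundred or several thousand deaths are needed. The sketch below rests on a common approximation that is an assumption here, not a statement of the report's own methods: with two equally sized treatment groups, and deaths forming only a modest fraction of all patients, the standard error of the log odds ratio is about 2 divided by the square root of the total number of deaths. The function names are hypothetical.

```python
import math

def approx_se_log_odds_ratio(total_deaths):
    # Assumed approximation: the variance of (observed - expected) deaths is
    # about total_deaths / 4, so the log odds ratio has SE of about 2 / sqrt(deaths).
    return 2.0 / math.sqrt(total_deaths)

def interval_for_odds_ratio(odds_ratio, total_deaths, z=1.96):
    """Approximate 95% interval around a given odds ratio."""
    se = approx_se_log_odds_ratio(total_deaths)
    centre = math.log(odds_ratio)
    return math.exp(centre - z * se), math.exp(centre + z * se)

# A moderate effect: odds of death reduced by 20% (odds ratio 0.80).
for deaths in (100, 400, 4000):
    lower, upper = interval_for_odds_ratio(0.80, deaths)
    print(f"{deaths:5d} deaths: interval {lower:.2f} to {upper:.2f}")
```

On this approximation, an interval based on 100 deaths easily includes an odds ratio of 1 (no effect) even when the estimate is exactly 0.80, whereas an interval based on a few thousand deaths clearly excludes it.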

3.5 Review of many trials, with no data-dependent omissions, to limit selection bias in assessment of treatment effects

Even after a complete and exhaustive review just of the published literature there may be some biases in the selection of the trial results that could then be reviewed, since detailed results of particular trials may remain unpublished unless those results are exceptional.11 Hence, overviews based only on the published literature may well produce moderately biased results.12 Moreover, even a complete review of nothing but the published literature might be such a time-consuming task (especially if additional information needs to be sought by correspondence with some authors) that only fairly determined investigators would be likely to achieve it. Others may be satisfied with an incomplete review that excludes some of the lesser known publications, but this might introduce some further selection bias since even among published studies the favorable or unfavorable play of chance in the results of a trial may substantially influence how well known that trial becomes. Thus, trials that appear to have particularly promising results (or, for some treatments, particularly unpromising results) are likely to be among the best known. Even if any selective biases that this produces are only moderate in size, moderate biases may still make realistically moderate treatment effects impossible to assess reliably. The biases that can be introduced just by selective exclusion of certain randomized trial results can, however, be avoided by systematic review of all (or of an unbiased subset) of the randomized trials ever undertaken.

3.6 Other selection biases either in design or in analysis of trials

Overviews may also be of some limited help in controlling other selection biases in the design or in the analysis of trials. Selection biases in design can be produced not only by failure to allocate treatment properly at random, but also by post-randomization withdrawal of selected patients (see below: "Proper randomization"). Selection bias in analysis can be produced not only by undue emphasis on just a limited number of trials, but also by unduly data-dependent emphasis on the apparent effects of treatment in particular subcategories of patient (for real examples, see Table 1 and the associated discussion on "Selection bias from subgroup analyses").

3.7 Proper randomization (with no subsequent exclusions) to limit selection bias in design of trials of moderate treatment effects

Comparisons using historical controls,13 data-base "efficacy" analyses14 or other non-randomized methods may, no matter what precautions are taken, be subject to moderate biases, the exact size of which cannot be predicted reliably. For example, in a recent United Kingdom Medical Research Council leukemia trial, a significant improvement (P=0.003) in mortality was seen between the first and second half of the same supposedly homogeneous trial.13 Hence, if some completely ineffective additional treatment had been specifically introduced halfway through that study, then a historically controlled "evaluation" of that treatment might have misleadingly concluded that that treatment worked.

Knowledge of the next treatment allocation before patient entry is confirmed (for example, where randomization lists are publicly available, allowing foreknowledge, or where allocation is alternate or based on odd/even dates or record numbers) can produce biases that are uncorrectable even if the original sequence of treatments was completely random. Trials that permit foreknowledge to bias patient entry are not, in fact, properly randomized, although they are often mistakenly described as such.

Non-randomized methods may sometimes suffice as a crude means of deciding whether or not large therapeutic effects exist, but they are generally of little value for reliable detection, or refutation, of moderate therapeutic effects. Hence, the present overviews are restricted to properly randomized trials.

3.8 Selection biases from subgroup analyses: statistical difficulty in the assessment of qualitative "interactions" and of quantitative "interactions"

Patients with early breast cancer may be very different from each other, and the treatment appropriate for one may not be appropriate for another. Ideally, therefore, what is wanted is not only an answer to the question "Is this treatment good on average for a wide range of patients?", but also an answer to the question "For which recognizable categories of patient is this treatment good?". In other words, the ideal would be a reliable description of the categories most likely to benefit from treatment. This ideal is, however, difficult to attain.

"Interactions" - that is, differences between the effects of treatment in different categories of women - may be of two types that can have quite different practical medical consequences. If treatment improves the prognosis appreciably in one category of women but does so to a negligible extent, or not at all, in another category then this is a qualitative interaction. If, however, treatment improves the prognosis appreciably in both categories but the improvement is just somewhat bigger in one category than in the other then this is a quantitative interaction. Unfortunately, the direct use of clinical trial results in particular subgroups of women to refute or to demonstrate any type of interaction is often extremely difficult - and, even if statistically significant evidence of an interaction is found, this may still fall far short of providing reliable evidence of a qualitative interaction.15-18

One possible determinant of the size of the absolute benefit of any therapy in some particular category of patient is the absolute risk of death (or recurrence) without treatment. For example, the number of regional lymph nodes containing breast cancer deposits, divided into three standard categories (N0, N1-3, N4+), is an important prognostic feature. If, therefore, some treatment reduces the risk of death by a similar proportion (e.g. one-quarter) in all patients, the absolute benefit during the first few years after treatment may be greater in poor-prognosis patients (e.g. 40% dead reduced to 30%) than in good-prognosis patients (e.g. 8% dead reduced to 6%). It might, therefore, be useful to know whether the proportional risk reduction (i.e. the "relative risk") produced by a particular treatment really is approximately similar for good-prognosis and for poor-prognosis women.

These and other questions about "interactions" between patient characteristics and treatment effects are easy to ask but surprisingly difficult to answer reliably. This is because quite striking-looking interactions can often be produced just by the play of chance, and these can mimic or obscure some of the moderate treatment effects that one might realistically expect. For example, if the patients in a trial with a highly significant (2P<0.002) overall benefit of treatment were categorized by some completely absurd criterion (e.g. their astrological birth signs17,19) then the effects of treatment in different patient categories may, just by the play of chance, appear to be quite large and highly significant in some categories, but small and non-significant in others (Table 1). For treatments that have a smaller proportional effect on the odds of survival than on the odds of recurrence-free survival, such chance fluctuations between the apparent effects of treatment in different categories of women particularly affect the mortality analyses. Thus, just from direct analyses of data on overall mortality it may be difficult to determine reliably which categories of patient can expect the greatest proportional risk reductions. This is so even when only a small number of categories are considered (e.g. N0, N1-3, N4+), and is still more so when the patients are divided into many categories, based perhaps on more than one criterion (e.g. on both age and nodal status).
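The astrological example can be imitated by simulation. The sketch below uses entirely hypothetical numbers (6,000 patients, a 30% risk of death in controls and 24% in treated patients) and is not a reconstruction of the trial behind Table 1. Each simulated patient is assigned at random to one of 12 meaningless categories, and treatment has exactly the same true effect in every category, so any differences between the category-specific results are produced by chance alone.

```python
import math
import random

SIGNS = 12  # hypothetical, prognostically meaningless categories

def z_score(deaths_t, n_t, deaths_c, n_c):
    """Two-proportion z statistic; positive values favour treatment."""
    pooled = (deaths_t + deaths_c) / (n_t + n_c)
    variance = pooled * (1 - pooled) * (1 / n_t + 1 / n_c)
    return (deaths_c / n_c - deaths_t / n_t) / math.sqrt(variance)

def simulate(n_patients=6000, risk_control=0.30, risk_treated=0.24, seed=1):
    """Return counts[sign][arm] = [deaths, patients]; arm 0 = control, 1 = treated."""
    rng = random.Random(seed)
    counts = [[[0, 0], [0, 0]] for _ in range(SIGNS)]
    for _ in range(n_patients):
        sign = rng.randrange(SIGNS)   # category unrelated to prognosis
        arm = rng.randrange(2)        # simple randomization
        risk = risk_treated if arm else risk_control  # same true effect in all signs
        counts[sign][arm][0] += rng.random() < risk
        counts[sign][arm][1] += 1
    return counts

counts = simulate()
per_sign = [z_score(c[1][0], c[1][1], c[0][0], c[0][1]) for c in counts]
totals = [[sum(c[arm][k] for c in counts) for k in range(2)] for arm in range(2)]
overall = z_score(totals[1][0], totals[1][1], totals[0][0], totals[0][1])

print(f"Overall z = {overall:.2f}")
for sign, z in enumerate(per_sign, start=1):
    flag = "conventionally significant" if abs(z) > 1.96 else "not significant"
    print(f"Sign {sign:2d}: z = {z:5.2f} ({flag})")
```

Changing the seed changes which categories look favorable, which is the point: the pattern across categories carries no information about which patients benefit.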

In principle the statistical difficulties might be limited by restricting subgroup analyses to those based on biological or clinical plausibility, but in practice this may offer little protection, for some sort of plausible-sounding rationale can usually be invented for almost any result.

There are two main remedies for these statistical difficulties, but the extent to which each is helpful is a matter on which informed judgements differ. The first is to use the overall mortality results as a guide (or at least a context for speculation) as to the qualitative effects of treatment in each particular category,15 and to give proportionately less weight to the actual mortality results observed in that category than to extrapolation of the overall results (see below: "Specific categories of patient"). The second is to be influenced not only by mortality data but also by data on recurrence-free survival (see section 3.10 below: "Use of recurrence data").

3.9 Specific categories of patient or of treatment: data-dependent emphasis on subgroup analyses versus indirect extrapolation of overall analyses

In addition to the overall analysis of all the unconfounded randomized trials of some particular type of treatment, such as tamoxifen or chemotherapy, many different subgroup analyses will generally be presented. Among these, however, the play of chance alone is likely to yield several false negative or false positive results. So, for example, if the overall analysis yields results that are not clearly significant, data-dependent emphasis on just a few subgroup analyses that happened to be conventionally significant might well be misleading since these might be "false positive" results generated by chance alone (or at least considerably magnified by the play of chance). Conversely, if the overall analysis yields highly significant evidence that some treatment does, overall, produce a moderate reduction in mortality, subgroup analyses could well generate "false negative" results merely by chance. In the latter instance, data-dependent emphasis on subgroup analyses that are not conventionally significant (or that point in a direction opposite to that of the overall analysis) might again be misleading.
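The risk of "false positive" subgroup results can be put in rough numerical terms. The sketch below assumes, purely for illustration, that the subgroup analyses are statistically independent, that treatment truly has no effect in any of them, and that each is tested at the conventional 5% level; real subgroup analyses overlap and so are not independent.

```python
def prob_at_least_one_false_positive(n_subgroups, alpha=0.05):
    """Chance that one or more of n independent null comparisons reaches significance."""
    return 1 - (1 - alpha) ** n_subgroups

for n in (1, 5, 10, 20):
    p = prob_at_least_one_false_positive(n)
    print(f"{n:2d} subgroup analyses: {p:.0%} chance of at least one 'significant' result")
```

Under these assumptions the chance of at least one conventionally significant result rises quickly with the number of subgroups examined, which is why data-dependent emphasis on whichever subgroups happen to be significant can mislead.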

Because direct analyses just among specific categories of women or of trials can give misleading results, data-dependent emphasis on particular such analyses may lead to importantly biased conclusions. Paradoxically, therefore, even effects among specific categories of women or of trial may best be assessed indirectly by approximate extrapolation from the apparent effects of treatment among all women in a wide class of trials. This would not be true if the treatment effects in the specific category of interest were qualitatively different from the overall effects. But, it might well be true (even though the proportional risk reductions are not exactly the same size in different circumstances) as long as the two treatment effects do at least point in the same direction as each other (see section 5.16: "Practical meaning of a clear effect of treatment in an overview of many trials").

3.10 Use of recurrence data, as well as mortality data, to study interactions unbiasedly

Trials usually observe more recurrences than deaths and, more importantly, the proportional effects of treatment on the odds of an unfavorable outcome may be larger for disease recurrence than for death, especially if breast cancer deaths are diluted by other causes of death that are largely unaffected by treatment. These two factors (more recurrences than breast cancer deaths and, particularly, larger treatment effects on recurrence) have meant that several trials in early breast cancer have been large enough just on their own to demonstrate statistically significant evidence for an effect of treatment on recurrence. They also mean that, from a purely statistical viewpoint, the relative effects of treatment in different subcategories of women or of treatments may be measured more accurately for recurrence than for death.

For example, subgroup analyses based on subcategorization of a 10 standard deviation difference in recurrence among all women will be less subject to the play of chance than are subgroup analyses based on subcategorization of a statistically significant (2P<0.0001) difference in mortality of "only" 4 standard deviations among all women. The statistical difficulties that dominate the assessment of interactions may therefore be considerably less severe for recurrence than for mortality analyses. Hence, it may be useful to analyze not only death but also recurrence when trying to determine whether the proportional effects of treatment are substantially different in different categories of patient (e.g. N0, N1-3, N4+). This remains equally true, of course, when considering any other subsets of women or of trials - and, in particular, when considering indirect comparisons between the apparent effects in different trials of two subcategories of a particular class of treatments.
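The contrast between 10 and 4 standard deviations can be illustrated with a simplified calculation. The assumptions here are not from the report: the women are split into subgroups that each carry an equal share of the statistical information, the true proportional effect is identical in every subgroup, and the overall result equals its expected value. The expected subgroup result is then the overall number of standard deviations divided by the square root of the number of subgroups, with a chance variation of 1 standard deviation around it.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def prob_subgroup_not_significant(overall_z, n_subgroups, critical=1.96):
    """Chance that one equal-information subgroup fails to reach z > critical."""
    expected_z = overall_z / math.sqrt(n_subgroups)
    return normal_cdf(critical - expected_z)

# Three subgroups, e.g. corresponding to N0, N1-3 and N4+ (equal sizes assumed).
for label, overall_z in (("mortality", 4.0), ("recurrence", 10.0)):
    p = prob_subgroup_not_significant(overall_z, 3)
    print(f"{label:10s} overall z = {overall_z:4.1f}: "
          f"P(subgroup not significant) = {p:.5f}")
```

On these assumptions, a subgroup of a 4 standard deviation mortality result has roughly a one in three chance of appearing "non-significant" purely by chance, whereas for a 10 standard deviation recurrence result that chance is negligible.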

A delay of recurrence is not, of course, necessarily equivalent to a delay of death. Indeed, there are instances of treatments (radiotherapy, perhaps) that substantially delay local recurrence without any substantial effect on early mortality (or on distant recurrence) in most women.3 However, for treatments that definitely can reduce breast cancer mortality, assessment of whether their effect on mortality is likely to be markedly different, either between different categories of patient or between different categories of trial, may be assisted by separate analyses in each of those different categories of the apparent effects of treatment on recurrence (or, for localized therapies, on distant recurrence). The extent to which information from recurrence analyses should influence the interpretation of mortality analyses is, however, a matter of judgement on which opinions differ. The aim of the present report is, therefore, merely to make both survival and recurrence-free survival analyses separately available for consideration.

3.11 Use of overviews to assess effects of treatment on other specific causes of early death, or other rare endpoints

In large populations there is observational epidemiologic evidence that ovarian ablation decreases the incidence of breast cancer but increases the incidence of myocardial infarction (MI),20,21 that postmenopausal hormones decrease the incidence of MI but increase the incidence of endometrial cancer,22,23 that various contraceptive pills increase the incidence of MI, pulmonary embolus and stroke but decrease the incidence of ovarian cancer,24 and that estrogenic treatment of prostate cancer is usefully palliative but increases the incidence of MI.25 Thus, the effects of hormonal factors on the primary incidence rates of various vascular and neoplastic diseases can be substantial but difficult to predict. If, therefore, ovarian ablation, tamoxifen or other hormone-related treatments are to be used in breast cancer - and particularly if prolonged treatment is envisaged - it is important to evaluate their effects not only on breast cancer but also on other causes of death. Likewise, chemotherapy26 and radiation27,28 may induce leukemia or various solid tumors, though they may decrease the immediate likelihood of developing a second primary breast cancer that is large enough to be detectable.

A principal goal of therapy is to reduce overall mortality, irrespective of the actual cause of death, and it might appear that an analysis of the effects of treatment on uncommon other causes of early death would not be of much relevance to this particular goal. But, that may not necessarily be the case. For example, if a particular treatment reduces the risk of death from breast cancer but increases the risk of death from myocardial infarction, then (as in prostate cancer25) the balance of cancer risk factors and coronary risk factors could importantly influence the choice of treatment for many patients, particularly those at low risk of death from cancer. Moreover, even if total mortality is reduced in the first few years by a particular treatment for breast cancer, specific adverse effects may become more important later (especially if, in the above example, the annual incidence of myocardial infarction eventually begins to exceed the breast cancer recurrence rate). In any such circumstances, analyses of overall early mortality could be seriously misleading, whereas cause-specific time-to-death analyses may help interpret the implications of the randomized trials appropriately.

In absolute terms, any effects of treatment on rare causes of death may be too small to be reliably detected even by cause-specific analyses of individual trials, although they might be assessed reliably by an overview of many trials. But if, as is the case with the present data, information about cause-specific mortality is available from only a limited number of all trials, it is particularly important to avoid selective emphasis on trials where cause-specific mortality is available just because it indicates something peculiar. For this reason, the present mortality analyses do not distinguish between different causes of death, which means that any fatal side-effects cannot be studied properly. (It also reduces the statistical power of the analyses, especially among older women, a number of whom may have died of unrelated causes.) This will, however, be rectified in future overviews.

3.12 Use of overviews to improve the reliability of data from particular trials

Another contribution of overviews to the correct interpretation of trial evidence is that they involve the scrutiny of each study, which may lead to recognition and, in some cases, correction of methodological errors in certain trials (or to the exclusion of seriously flawed studies). For example, biases may be produced by the loss, or withdrawal some time after randomization, of some patients who deviate from their allocated treatment or scheduled follow-up (or by replacement of non-compliers with new patients), but these biases can be corrected by restoring any excluded randomized patients (and excluding inappropriate replacements whose treatment allocation was not properly random). Similarly, biases produced by differences in the completeness of follow-up between treatment groups can sometimes be reduced or eliminated by imposing a common cut-off date on all follow-up, or by using national mortality records to get unbiased data. In the present study, exhaustive efforts have been made to seek additional data from the trials reviewed so that any such biases may be minimized.