# Section 5: Statistical Methods

#### 5.1 Medical assumptions: (1) modest sizes of treatment effects, and (2) differences of size but not of direction between effects in different circumstances

The two fundamental assumptions that underlie an overview of many trials are not statistical, but medical. One - the human importance of mortality differences that are only moderate in size - has already been discussed at length. The other fundamental assumption is more abstract: although the same type of treatment would probably not produce exactly the same size of therapeutic effect in different circumstances, it would probably at least produce effects that tended to point in the same direction. For example, if two trials have tested approximately similar treatments then, but for the play of chance, they should yield results on specific endpoints that, although perhaps not the same size, at least point in the same direction. Likewise, if a particular treatment is of some value in one category of women then it is probably also of some value in other categories. (More formally, this second medical assumption is that although "quantitative" interactions may be common, there are not likely to be any unanticipated "qualitative" interactions between the effects of treatment on particular modes of death and the sort of information that is commonly available at the start of a trial.)

Consider, for example, two properly randomized trials of tamoxifen in early breast cancer that differ somewhat in the type of breast cancer patient, the type of primary treatment, the proportions also given adjuvant chemotherapy, the tamoxifen dose and duration, the type of treatment on relapse, the definition of relapse, the completeness and duration of follow-up, the proportion of deaths attributable to breast cancer, etc. Because of this heterogeneity, it is unlikely that the real difference in risk between treated and control patients in both of these trials will be exactly the same **size**. Despite this, it is still quite likely that any such differences will tend to point in the same **direction**. In other words, if tamoxifen is of some real value in the circumstances of one trial then it is probably also of some real value in the circumstances of the other, although of course in the **actual** results of one particular trial the play of chance may well be at least as big as the treatment effect that is to be measured, and could therefore obscure this similarity of direction. The same is true when the real treatment effects in each of several dozen tamoxifen trials are considered: common sense suggests that although some differences must exist between the real sizes of the effects in different trials, the real directions of those effects would probably be the same. Although this underlying direction may well be obscured by the play of chance in some of the individual trial results, it is more likely to stand out clearly when many different results are reviewed together.

#### 5.2 Comparisons only of like with like, based on "Observed minus Expected" (O-E) differences in each separate trial

The statistical methods to be used in the overview analyses reflect these medical assumptions. They are of maximal statistical sensitivity for the detection of certain modest treatment effects, and they do not implicitly assume that the real risk reductions in different trials are the same size, but merely that they will tend to point in the same direction. Nor do they require the obviously unjustified assumption that patients in one trial can be compared directly with patients in other trials. The basic principle involved is to make comparisons of treatment with control only within one trial and to avoid completely any direct comparisons of patients in one trial with patients in another. This is achieved by calculating, within each separate trial, the standard quantity "Observed minus Expected" (O-E) for the number of deaths among treatment-allocated patients.^{29-31} O-E is **negative** if the treated group fared **better** than the controls (and positive if it fared worse). In an evenly balanced trial O-E is approximately equal to half the number of deaths avoided: so, for example, an O-E of -7.5 would suggest avoidance of about 15 deaths (see Table 2(a) and its footnotes).

These O-E values, one from each trial, can then simply be added up^{31} (with the variance of the grand total of the individual O-E values given by the sum of the individual variances of each O-E value; this effectively leads to the results of each trial being given a "weight" in the overall assessment that depends appropriately on the amount of statistical information provided by it). If treatment did nothing, then each of the individual O-E values could equally well be positive or negative, and their grand total would likewise differ only randomly from zero: Table 2(b). If, on the other hand, treatment reduced the risk of death to some extent in most or all of the trials, then any individual O-E value would be likely to be somewhat negative (i.e. favoring treatment), so when they are all added up their grand total may be clearly negative. Such arguments do not, of course, assume that the size of treatment effect is the same in all patients, or in all trials; indeed, it is probably not.
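The summation described above can be sketched in a few lines of Python. This is only an illustration: the function name `combine_trials` and the three (O-E, variance) pairs are hypothetical and are not data from the report.

```python
import math

def combine_trials(trials):
    """Add up per-trial (O-E) values and their variances.

    trials: list of (o_minus_e, variance) pairs, one pair per trial.
    Returns the grand total, its standard deviation, and z = total / sd.
    """
    grand_total = sum(o_minus_e for o_minus_e, _ in trials)
    variance = sum(v for _, v in trials)  # variances add across separate trials
    sd = math.sqrt(variance)
    return grand_total, sd, grand_total / sd

# Hypothetical (O-E, variance) pairs; a negative O-E favors treatment.
example = [(-7.5, 20.0), (-2.0, 9.0), (1.5, 7.0)]
gt, sd, z = combine_trials(example)
print(gt, sd, round(z, 2))
```

Because the variances are summed alongside the O-E values, a trial with a larger variance (more statistical information) automatically carries more weight in the grand total.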

#### 5.3 Routine stratification of all main analyses by age (<50, 50+) and by year (1, 2, 3, 4, 5+) of follow-up

As a second step towards the ideal of comparing only like with like, the "expected" numbers of events can be calculated separately in different patient categories within each trial, and the separate O-E values for each category can then be added up to get the overall O-E value for that trial. So, for example, in the present report, expected numbers of events have been calculated separately for women aged under 50 and for women aged 50 or older (see, for example, Table 3M). Moreover, within each age group they have first been calculated separately for each separate year of follow-up and then added up to get the overall O-E for that particular trial. (This yields a "logrank" year-of-death analysis.^{31}) In principle, such "stratified" comparisons may be slightly preferable. In practice, however, the overall result that they yield for each major trial may differ only slightly from that yielded by a crude calculation based simply on comparison within each trial of total mortality in all treatment-allocated patients with total mortality in all control patients.* Stratification by these age groups corresponds approximately with menopausal status. (Two trials that included only postmenopausal women did not provide a subdivision of their results by age: their patients are classified as "50+".)

#### 5.4 Additional stratification of selected analyses for the available information on nodal status or Estrogen Receptor status

For a substantial majority of all patients, some recorded information on axillary lymph node status was available that could be used to classify patients into four main categories (the last of which included "nodal status not reported"). In addition, for a substantial minority of those in tamoxifen trials some information about Estrogen Receptor (ER) measurements in the excised primary tumor was available that could likewise be used to classify patients into four main categories (the last being "ER measurement not reported"). These nodal or ER categories could be used in a cautious search for interactions. Alternatively, they could be used for the "retrospective stratification" of analyses of the overall effects of treatment. But in such a large data set retrospective stratification of the overall analyses is likely to make no material difference to the overall results (unless, by mistake, a stratum of "unknown" values for the stratifying factor had failed to be defined and used).

#### 5.5 Arithmetic procedures for calculation of Observed minus Expected (O-E) numbers of events among treatment-allocated patients, and for obtaining "two-sided" significance levels (2P)

Suppose that a total number N of patients are randomized into a study with a fraction f allocated active treatment (and hence 1-f allocated control), and suppose that in both groups combined a total number D of these N patients die. If there is no difference in mortality between the treatment and control groups, the "Expected" number of deaths among the treatment-allocated patients is given by fD. Let O denote the Observed number of deaths among these treatment-allocated patients. The quantity Observed minus Expected (O-E) then differs only randomly from zero. Its variance can be shown^{5} to be given by the formula fD(1-f)(N-D)/(N-1), and its standard deviation (sd) is the square root of this variance. An example of the calculation of O-E and its variance is given in Table 2.
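As a minimal sketch of this arithmetic, the following Python function evaluates E = fD and the variance fD(1-f)(N-D)/(N-1) for a single trial. The function name and the example counts are hypothetical and are not taken from Table 2.

```python
def observed_minus_expected(n_total, n_treated, deaths_total, deaths_treated):
    """Return (O-E, variance) for deaths among treatment-allocated patients."""
    f = n_treated / n_total                 # fraction allocated active treatment
    expected = f * deaths_total             # E = fD
    variance = (f * deaths_total * (1 - f)
                * (n_total - deaths_total) / (n_total - 1))
    return deaths_treated - expected, variance

# Hypothetical evenly balanced trial: 200 patients, 100 treated,
# 50 deaths in all, 20 of them among treated patients.
o_e, v = observed_minus_expected(200, 100, 50, 20)
print(o_e, round(v, 2), round(v ** 0.5, 2))
```

In this made-up example O-E is -5, which in an evenly balanced trial would suggest the avoidance of about 10 deaths.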

As is usually the case in statistics, (O-E) values that differ from zero by 1 sd or less could easily arise just by the play of chance, and although the play of chance provides a less plausible (two-sided P-value* = 0.05) explanation for differences of about 2 sd, such differences can also arise just by chance, especially when many different comparisons are scrutinized.
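The correspondence between a number of standard deviations and a two-sided P-value can be illustrated as follows. The sketch assumes that (O-E)/sd is approximately normally distributed, which is the usual large-sample approximation; the function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided significance level (2P) for a standard normal deviate z."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.0, 1.96, 2.5, 3.0):
    print(z, round(two_sided_p(z), 4))
```

The value for 1 sd is about 0.32, which is why such differences could easily arise by chance, while 1.96 sd corresponds to 2P of about 0.05.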

#### 5.6 Interpretation of P-values: statistical significance and medical judgement

In an unbiased overview of many properly randomized trial results the number of patients involved may be large, so even a realistically moderate treatment difference may produce a really extreme P-value (e.g. 2P<0.0001). When this happens, the P-value alone is generally sufficient to provide proof beyond reasonable doubt that a real treatment effect does exist. In contrast, P-values that are not conventionally significant, or that are only moderately significant (e.g. 2P=0.01), may be much more difficult to interpret appropriately. There may be circumstances (particularly in sub-analyses of an overall analysis that is not clearly significant) when the existence of a real treatment benefit should not be accepted despite an apparent difference of 2 or 2.5 (or even, perhaps, almost 3) standard deviations between treatment and control. Conversely, there may be circumstances (particularly in sub-analyses of a highly significant overall analysis, or when there is strong indirect evidence from other sources) when statistical significance is of little concern, and when the existence of a real treatment benefit may be quite firmly accepted despite the apparent difference not being conventionally significant (or even, perhaps, being slightly unfavorable). In summary, **over-emphasis on formalistic questions of which differences are moderately significant and which are not is a serious statistical mistake**, and it may have serious medical consequences. Elsewhere,^{30,31} there is more extensive discussion of the common sense medico-statistical principles that underlie appropriate interpretation of P-values in trials.

#### 5.7 RESULTS OF RADIOTHERAPY TRIALS as an example of the summation of (O-E) values from different trials to provide an overall test of the "null hypothesis" of no treatment effect

As an example of the use of (O-E) for the combination of information from related studies, consider the question of whether radiotherapy after mastectomy affects survival in early breast cancer. There have been about 30 trials of radiotherapy after mastectomy, information from 19 of which was obtained in 1985. To describe the statistical methods, an analysis of the results from just these 19 trials is given in Table 3M.*

In Table 3M, the Grand Total (GT) of all the separate O-E values from the radiotherapy trials is -9.0, suggesting that there may have been about 18 deaths avoided in the radiotherapy groups. The standard deviation (sd) of this grand total is 27.8, however, so that the grand total is less than half a standard deviation away from zero (z = GT/sd = -0.3; NS). Even if radiotherapy had no net effect whatever on survival in the circumstances of these trials, such a small difference from zero could well have arisen just by the play of chance. This overview is not as complete as might be wished because data from about a dozen other trials were still not available in 1985 (Table 3M footnote); for future overviews, however, those will be sought.

#### 5.8 Use of (O-E) values to provide a description of the typical reduction in the odds of treatment failure (i.e. to describe the "alternative hypothesis")

For practical purposes, what is required in the assessment of an effective treatment is not only evidence that the treatment does do something (i.e. a "test of the null hypothesis") but also an estimate of how big, and hence how medically worthwhile, the effect of treatment is likely to be (i.e. a description of the "alternative hypothesis"). Fortunately, the quantities already calculated (i.e. the standard deviation of the grand total, and z, the number of standard deviations by which the grand total differs from zero) can provide not only a statistically sensitive test of whether treatment has any effect but also a useful description of the size of the treatment effect that is "typical" of the set of trials being reviewed. A convenient and appropriately weighted estimate of the typical treatment effect is provided by the "typical odds ratio" (TOR) - that is, the typical ratio of the odds of an unfavorable outcome among treatment-allocated patients to the corresponding odds among controls. (For example, a typical odds ratio of 0.8 would correspond to a reduction of about 20% in the odds of an unfavorable outcome.) The typical odds ratio is estimated in Table 3M and elsewhere by the surprisingly simple formula exp(z/sd),* with approximate 95% confidence limits exp(z/sd ± 1.96/sd). The typical percent reduction in the odds of an unfavorable outcome is similarly estimated by r=100-100exp(z/sd), with the approximate standard deviation for this estimated reduction being -r/z.

The typical odds ratio suggested by the partial overview of radiotherapy trials in Table 3M is 0.99, with 95% confidence limits of 0.92 to 1.06. Inclusion of an odds ratio of 1.00 in the confidence interval indicates that the result is not conventionally significant, while the extremes of the interval indicate that the data from these particular 19 trials are readily compatible both with a small mortality reduction and with a small mortality increase.
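The figures quoted for the radiotherapy overview can be checked with a short Python sketch of the formulas exp(z/sd), exp(z/sd ± 1.96/sd), r = 100 - 100exp(z/sd) and -r/z. The function name `typical_odds_ratio` is hypothetical; the inputs (grand total -9.0, sd 27.8) are those given for Table 3M.

```python
import math

def typical_odds_ratio(grand_total, sd, z_crit=1.96):
    """Typical odds ratio, confidence limits, % odds reduction and its sd."""
    z = grand_total / sd
    log_or = z / sd                        # equals grand_total / variance
    tor = math.exp(log_or)
    lower = math.exp(log_or - z_crit / sd)
    upper = math.exp(log_or + z_crit / sd)
    reduction = 100 - 100 * tor            # r, percent reduction in odds
    reduction_sd = -reduction / z if z != 0 else float("nan")
    return tor, lower, upper, reduction, reduction_sd

tor, lower, upper, reduction, reduction_sd = typical_odds_ratio(-9.0, 27.8)
print(round(tor, 2), round(lower, 2), round(upper, 2))
print(round(reduction, 1), round(reduction_sd, 1))
```

Rounded to two decimal places, this reproduces the typical odds ratio of 0.99 with 95% confidence limits of 0.92 to 1.06, and a standard deviation of roughly 4 percentage points for the estimated odds reduction.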

#### 5.9 Graphical display methods for separate trial results, and for an overview

Data for each of the radiotherapy trials in Table 3M are provided separately for women aged under 50 and 50 or older at randomization. Those trials (e.g. 64B1 and 64B2; see Section 6 [not reproduced here]) that consist of more than one part are further subdivided into their constituent parts for separate analysis. In principle this is appropriate and necessary, but in practice it can produce a mass of numbers that is difficult to grasp without juxtaposing graphical and numerical presentations of the data (Figure 1M). In the tabular parts of Figure 1M, for each separate trial the results of the analyses for women aged under 50 and 50 or over - and, for trials with more than one part, the results for each part - are added together, to give just one overall result for that one trial. In the graphical part of Figure 1M (and of all other such figures), a **solid square** indicates the apparent effect of treatment in that trial (i.e. the ratio of the annual odds of death among treatment-allocated patients to that among controls), and a **horizontal line** indicates the 99% confidence limits for this odds ratio.

The visual impression given by the horizontal lines may be slightly misleading, for a small, uninformative trial gives a long, visually striking confidence interval, whereas a large, reliable trial gives a small, unobtrusive confidence interval. To reverse this visual impression, therefore, the sizes of the solid squares have been chosen to be directly proportional to the amount of information each trial contains. So, a large informative trial yields a large black square and a small trial that is much less informative yields a small black square.* The areas of the solid squares indicate that nearly half of the available information comes from one particularly large trial (70B) which has a slightly unpromising result, and that about half the remaining information comes from three moderately large trials (64B, 70A, 71B), none with particularly promising results. Formal summation of the 19 separate O-E values, one per trial, confirms that these studies provide little evidence overall of any real effect on survival. The "typical odds ratio" suggested by this overview of just 19 of the radiotherapy trials is 0.99 ± 0.04, which is nowhere near statistically significant (z = -0.3). In Figure 1M (and in subsequent figures), the statistical reliability of the overview result is depicted by a black square in the left-hand margin (with size proportional to the "information content"** of the overview), and a 95% confidence interval for the "typical odds reduction" suggested by the overview is depicted by a diamond-shaped symbol. It can be seen from the diamond-shaped 95% confidence interval in Figure 1M how reliable the overall result is. Of course, since there are so many individual trials it was quite likely that just by the play of chance one or two of them might have yielded results that are 2 to 2.5 sd away from the overall average. Because of this multiplicity of comparisons, fairly extreme (99%) confidence limits have been plotted for the many individual trials. 
It can be seen from the 99% confidence intervals for the separate trial results that each individual trial result is statistically compatible with the overall result. An overview is not subject to such multiple comparison problems, so for each overview result 95% limits have been plotted. **This convention (99% limits for individual trials, 95% limits for overviews) is used throughout the present report.** An additional general convention is that the symbol ± will be used to denote **one** standard deviation (as, for example, in "a mortality reduction of 16% ± 3").
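The quantities plotted for each trial can be sketched as follows. Two assumptions are made that the text does not state explicitly: that each trial's odds ratio is estimated by the same formula as the typical odds ratio, exp((O-E)/V), and that the "amount of information" governing the size of each square is the variance V of that trial's O-E. The function name and the example values are hypothetical.

```python
import math

def forest_row(o_minus_e, variance, z_crit=2.576):
    """Odds ratio, confidence limits and information for one plotted row.

    z_crit=2.576 gives 99% limits (individual trials);
    z_crit=1.96 gives 95% limits (an overview).
    """
    sd = math.sqrt(variance)
    log_or = o_minus_e / variance
    return {
        "odds_ratio": math.exp(log_or),
        "lower": math.exp(log_or - z_crit / sd),
        "upper": math.exp(log_or + z_crit / sd),
        "information": variance,  # assumed to set the size of the square
    }

small = forest_row(-2.0, 4.0)     # hypothetical small trial
large = forest_row(-10.0, 100.0)  # hypothetical large trial
print(small)
print(large)
```

The small trial gets the long confidence interval and the small square, and the large trial gets the short interval and the large square, which is the reversal of visual emphasis described above.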

#### 5.10 Similarities between risk ratios, death rate ratios and odds ratios when event rates are low

In practice, when comparing mortality (or recurrence) among treated and control patients in trial analyses when substantial proportions have not yet suffered the endpoint of interest, "risk ratios", "death rate ratios" and "odds ratios" are often not importantly different from each other as methods for describing trial results. Consider, for example, a hypothetical group of 100 patients suffering a steady death rate of 1% per month. After 3 years, only about 30 of the original group would be dead (and not 36, since the death rate of 1% per month refers to those alive at the beginning of each month, and their number decreases as time goes by). Thus, the **risk** of death at 3 years would be 30/100 while the **odds** of death would be 30/70. A 20% reduction in the monthly death rate would reduce the death rate from 1.0% to 0.8% per month. So, by 3 years, only 25 would then be dead. Thus, the risk of death would be 25/100 and the odds of death would be 25/75. Comparison of these with the previous figures (30/100 and 30/70 respectively) shows that in this particular example a 20% reduction in the monthly death rate corresponds after 3 years to a 17% reduction in the risk of death, and a 23% reduction in the odds of death. The percent reduction in odds will always be the largest, but the similarity of these three figures suggests that it often does not matter much whether percentage reductions in the risk of death, in the death rate or in the odds of death are used to describe trial results. For reasons of arithmetic simplicity, use will chiefly be made of percentage reductions in the odds of death (or, for trials where data are available separately for each year of follow-up, percentage reductions in the annual odds of death: see below).
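The arithmetic of this hypothetical example can be reproduced with a few lines of Python. The function name `risk_and_odds` is an illustrative choice; the death rates of 1.0% and 0.8% per month and the 36-month horizon are those used in the text.

```python
def risk_and_odds(monthly_death_rate, months=36):
    """Risk and odds of death after a steady monthly death rate."""
    surviving = (1 - monthly_death_rate) ** months  # rate applies to those still alive
    risk = 1 - surviving
    return risk, risk / surviving

risk_c, odds_c = risk_and_odds(0.010)  # control: 1.0% per month
risk_t, odds_t = risk_and_odds(0.008)  # treated: 0.8% per month, a 20% lower rate

risk_reduction = 100 * (1 - risk_t / risk_c)
odds_reduction = 100 * (1 - odds_t / odds_c)
print(round(100 * risk_c), round(100 * risk_t))
print(round(risk_reduction), round(odds_reduction))
```

The unrounded proportions dead at 3 years are about 30.4% and 25.1%, which is why the odds reduction comes to 23% rather than the 22% obtained from the rounded figures 30/70 and 25/75.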

#### 5.11 Logrank "year of death" analyses for statistical significance tests, and for estimation of reductions in annual odds of death

In most of the trials to be reviewed, results are available separately in year 1 (i.e. during the year following randomization), year 2, year 3, year 4 and year 5+ (where "5+" includes all events recorded during or after the fifth year). The numbers of deaths in the treatment-allocated and control-allocated groups in a particular period were related to the numbers of patients in each group that were still alive and being followed up at the start of that period (using methods described in Table 2). For each trial this yields up to five separate (O-E) values and their corresponding variances, one per time period (with fewer than five in those trials with fewer than five years of follow-up). The sum of these five (or fewer) values for one trial yields the logrank test statistic for a year-of-death analysis of that trial, with the logrank variance given by the sum of the five variances. The logrank statistic was first recommended in 1966 for the analysis of single cancer trials,^{29} and it was first recommended in 1976 for the unbiased, statistically efficient combination of information from an overview of the results of several different cancer trials.^{30,31}

The chief advantage of the availability of information from each separate year is not that logrank analyses are more sensitive than crude analyses, for the improvement in sensitivity is only small in trials where most patients survive. It is rather that logrank analyses readily permit separate analyses of the effects of treatment in each separate year, which may improve medical understanding (as long as the important statistical uncertainties in the apparent effects in each separate year are taken properly into account). Hence, when they are available, annual logrank O-E values will be used in preference to overall crude O-E values. In providing a description of the alternative hypothesis, the calculation of exp(z/sd) now yields an estimate of the typical ratio of (and, from this, the percentage reduction in) the **annual** odds of death. The estimate for a given year is, of course, based only on data derived from trials that have follow-up and deaths reported in that year. (Similarly, for analyses of disease-free survival exactly the same statistical methods can be applied to the numbers of patients suffering a first recurrence or prior death.)
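A year-of-death logrank calculation for one trial can be sketched as below. It applies the O-E and variance formulas of Section 5.5 within each period to the patients still alive and being followed up at the start of that period, then sums the results. The function name and the counts are hypothetical, and the example assumes no losses to follow-up.

```python
def logrank_year_of_death(periods):
    """Sum per-period (O-E) values and variances for one trial.

    periods: list of (at_risk_treated, at_risk_control,
                      deaths_treated, deaths_control), one tuple per period.
    """
    per_period = []
    for n_t, n_c, d_t, d_c in periods:
        n, d = n_t + n_c, d_t + d_c
        if n < 2:
            continue  # too few patients at risk to define a variance
        f = n_t / n   # fraction of those still at risk allocated treatment
        o_minus_e = d_t - f * d
        variance = f * d * (1 - f) * (n - d) / (n - 1)
        per_period.append((o_minus_e, variance))
    total_o_minus_e = sum(x for x, _ in per_period)
    total_variance = sum(v for _, v in per_period)
    return total_o_minus_e, total_variance, per_period

# Hypothetical counts for years 1, 2 and 3 of follow-up.
periods = [(100, 100, 5, 10), (95, 90, 6, 8), (89, 82, 4, 7)]
total, variance, per_period = logrank_year_of_death(periods)
print(round(total, 2), round(variance, 2))
```

Keeping the per-period values as well as the totals is what allows the effect of treatment to be examined separately in each year.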

#### 5.12 Life-table estimation for descriptive purposes

Suppose that appropriate analysis of a particular set of trials yielded, in years 1, 2, 3, 4 and 5+, a particular set of five odds ratios. In principle, the meaning of these could be illustrated by applying them either to the failure rates of good-prognosis women (yielding a pair of estimated survival curves that are both "good-prognosis") or to the failure rates of poor-prognosis women (yielding a pair of estimated survival curves that are both "poor-prognosis"). Arbitrarily, we shall generally choose to illustrate them by applying them to a hypothetical category of women whose prognosis in years 1, 2, 3, 4 and 5+ is the average of that for all patients in these particular trials.* This yields a pair of survival curves that will in practice be rather similar to those that would have been obtained by crudely mixing all the trials together. The survival curves actually obtained in this way, however, are not subject to any of the objections that might be raised by analyses of a mixture of many trials. In particular, they can differ systematically only if **within** trials the treatment-allocated patients differ systematically from those allocated control.
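The general idea of turning a set of period-specific odds ratios into a pair of descriptive survival curves can be illustrated with a deliberately simplified sketch. It is not the report's own construction (which is centred on the average prognosis of all patients and is specified in a footnote not reproduced here): for simplicity it treats a hypothetical set of death probabilities as the control curve, applies each odds ratio to obtain the treated curve, and treats every period as a single interval. All names and numbers are hypothetical.

```python
def survival_curves(baseline_death_probs, odds_ratios):
    """Cumulative survival for a baseline group and for a treated group.

    baseline_death_probs: probability of death in each successive period
                          among those alive at its start (hypothetical).
    odds_ratios: odds ratio (treated vs baseline) for each period.
    """
    surv_base, surv_treated = [1.0], [1.0]
    for p_base, odds_ratio in zip(baseline_death_probs, odds_ratios):
        odds_treated = odds_ratio * p_base / (1 - p_base)
        p_treated = odds_treated / (1 + odds_treated)
        surv_base.append(surv_base[-1] * (1 - p_base))
        surv_treated.append(surv_treated[-1] * (1 - p_treated))
    return surv_base, surv_treated

base, treated = survival_curves([0.05, 0.08, 0.08, 0.07], [0.8, 0.8, 0.85, 0.9])
print([round(s, 3) for s in base])
print([round(s, 3) for s in treated])
```

Because the two curves are generated only from within-trial odds ratios, they can separate only to the extent that treatment-allocated patients fared differently from controls within trials.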

#### 5.13 Test of heterogeneity between several different trial results

In the case of a treatment (such as radiotherapy, perhaps) that appears to have little or no overall effect on mortality it may be plausible that in each separate trial it will have little or no overall effect (except for the known side-effects of radiotherapy, which should be slight during the first decade or so^{28}), in which case a test for heterogeneity would be expected to yield a null result. In confirmation of this expectation, the formal chi-squared test of heterogeneity for the 19 radiotherapy results in Figure 1M does yield a completely non-significant result (chi-square = 14.9 on 17 degrees of freedom, NS). In general, however, when different trials address a particular type of treatment some degree of heterogeneity of the **real** effects of the treatment regimens in these trials must be expected.

Standard statistical tests for heterogeneity between many different trials (or patient categories) are, however, of limited value, partly because they are statistically insensitive, and partly because some heterogeneity of the real effects of treatment in the different trials (or categories) is likely to exist no matter what a formal test for heterogeneity may indicate. In general, therefore, even if a standard "chi-squared" test does not provide conventionally significant evidence of any heterogeneity at all, some important heterogeneity may well still exist that failed to be detected because the heterogeneity test is so crude. Sometimes, when the trials or categories can be arranged in some meaningful order, a test for trend can be used instead, and is then likely to be more informative than a test for heterogeneity. But, whether or not there is some real heterogeneity (and whether or not some test for trend or for heterogeneity happens to yield a conventionally significant result), this does not invalidate the standard overview techniques used to analyze the trials.

#### 5.14 Arithmetic details of tests for trend, for heterogeneity and for interaction

If treatment effects are evaluated in various different circumstances (e.g. in each of 4 different age groups, or in each of 40 different trials) then the following approximate procedures will be used to test for a trend between the separate results where there is a natural ordering (e.g. from younger to older), or for heterogeneity among them where no natural ordering exists.

(a) **Test for trend:** The circumstances are numbered in their natural order (e.g. 1 = age <40, 2 = age 40-49, 3 = age 50-59, 4 = age 60+). O-E and its variance, V, are calculated separately for the treatment effect in each circumstance (e.g. O_{1}-E_{1} and V_{1} for circumstance 1), omitting any circumstances for which the variance is zero, i.e. in which there is no useful information. Let k denote the number of circumstances remaining (i.e. with non-zero V). Next, the following values are calculated:

$$
\begin{aligned}
A &= V_{1} + V_{2} + V_{3} + \cdots \\
B &= 1\cdot V_{1} + 2\cdot V_{2} + 3\cdot V_{3} + \cdots \\
C &= 1\cdot 1\cdot V_{1} + 2\cdot 2\cdot V_{2} + 3\cdot 3\cdot V_{3} + \cdots \\
D &= (O_{1}-E_{1}) + (O_{2}-E_{2}) + (O_{3}-E_{3}) + \cdots \\
E &= 1\cdot(O_{1}-E_{1}) + 2\cdot(O_{2}-E_{2}) + 3\cdot(O_{3}-E_{3}) + \cdots \\
F &= (O_{1}-E_{1})^{2}/V_{1} + (O_{2}-E_{2})^{2}/V_{2} + (O_{3}-E_{3})^{2}/V_{3} + \cdots
\end{aligned}
$$

A test for a trend between the odds ratios produced by treatment in these different circumstances may be based on calculation of the quantity (E-DB/A). If there is no real heterogeneity between the odds ratios, then it can be shown that this quantity will differ only randomly from zero, and that its standard deviation (sd) will be approximately √(C-BB/A). Values more extreme than ±1.96 sd would therefore correspond approximately to 2P<0.05, etc. Provided the effects of treatment are not large, the statistical properties^{5,31} of O-E and V imply that this trend test is asymptotically efficient at detecting a steady multiplicative trend in the odds ratios that are produced by treatment on going from one circumstance to the next.

(b) **Test for heterogeneity:** A test for heterogeneity may be obtained by calculating the quantity (F-DD/A). If there is no real heterogeneity between the odds ratios in the k different circumstances being considered, then this quantity will be distributed approximately as a standard chi-squared distribution with k-1 degrees of freedom. Such tests for heterogeneity among several circumstances can, however, be very crude (see above). Hence, if a test for trend makes medical sense (as, for example, when there is a natural ordering between the circumstances being considered) then a test for trend should generally be used rather than a test for heterogeneity, for a trend test is likely to be much more sensitive to any real differences that may exist between the sizes of the treatment effects in different circumstances.

(c) **Test for "interaction" between the treatment effects in just two different circumstances:** In this case, the tests for trend and for heterogeneity (with k=2) can be shown to yield identical significance levels.
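The quantities A to F and the two test statistics can be computed as in the following sketch. The function name and the example values are hypothetical. One assumption is made where the text is not explicit: circumstances with zero variance are skipped but the remaining circumstances keep their original position numbers as scores.

```python
import math

def trend_and_heterogeneity(o_minus_e, variances):
    """Trend and heterogeneity statistics from per-circumstance O-E and V."""
    # Keep the original position (1, 2, 3, ...) as the score; drop zero-variance rows.
    rows = [(i, x, v)
            for i, (x, v) in enumerate(zip(o_minus_e, variances), start=1)
            if v > 0]
    k = len(rows)
    if k < 2:
        raise ValueError("need at least two circumstances with non-zero variance")
    A = sum(v for _, _, v in rows)
    B = sum(i * v for i, _, v in rows)
    C = sum(i * i * v for i, _, v in rows)
    D = sum(x for _, x, _ in rows)
    E = sum(i * x for i, x, _ in rows)
    F = sum(x * x / v for _, x, v in rows)
    trend_sd = math.sqrt(C - B * B / A)
    return {
        "k": k,
        "trend_z": (E - D * B / A) / trend_sd,  # compare with +/-1.96 for 2P<0.05
        "heterogeneity_chi2": F - D * D / A,    # chi-squared on k-1 degrees of freedom
        "heterogeneity_df": k - 1,
    }

# Hypothetical (O-E, V) values for two circumstances.
result = trend_and_heterogeneity([-4.0, 2.0], [10.0, 6.0])
print(result)
```

With only two circumstances the square of the trend statistic equals the heterogeneity chi-squared on 1 degree of freedom, which is the identity noted in (c).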

#### 5.15 Practical meaning of a clear effect of treatment in a single large trial

Suppose, hypothetically, that only one trial had ever addressed a particular therapeutic question, that the trial was extraordinarily large, and that it yielded a very definite 45% mortality reduction with very tight 95% confidence limits (e.g. 40%-50%). What would the practical implications be? Although this strong result would still not imply that another large trial of some approximately similar treatment for some approximately similar patients must also yield a 40-50% effect, it might well imply that such a trial should be expected to yield an effect in the same direction and of **approximately** similar size (e.g. a 30% reduction, or a 60% reduction, perhaps). Likewise, although the original trial result would not guarantee that attempts to use that treatment in the future would produce a 40-50% mortality reduction outside trials (since future patients could well differ in various ways from the trial patients and there could well be important differences between the care of patients in trials and out of trials), an approximately similar effect on mortality might be expected. In making sensible use of really clear results from a single clinical trial, the key assumption is merely that, in somewhat different medical circumstances, therapeutic effects in the same direction but of only approximately similar magnitude are still likely to exist. Extrapolation too far, of course, may lead to mistaken decisions about treatment, but so too may failure to extrapolate far enough. Thus, even for a single trial result, an estimated risk reduction with tight confidence limits implies only that similar, but not necessarily identical, treatment effects will be achieved in other circumstances.

#### 5.16 Practical meaning of a clear effect of treatment in an overview of many trials

Exactly the same is true of the "typical mortality reductions" and associated confidence limits derived from an overview of many trials. As with a single trial, the statistical calculations address the question "Given the studies that were undertaken, what range of results is statistically compatible with the actual data?" They do **not** involve saying "If different trials had been performed, what would have been seen?"

After calculation of the "typical mortality reduction" and its associated confidence limits, medical judgement - with its attendant uncertainties and disputes - is needed to help determine the circumstances to which that result is likely to be approximately relevant, just as was the case with a large single trial result.

It should be noted that just as a positive trial result does not guarantee that all patients will benefit from the treatment being tested, so too a null result in a trial or an overview does not guarantee that no patient will benefit. It does set limits on what the average difference in medium-term mortality is likely to be but, for example, the null result in Figure 1M is easily compatible with the suggestion that several dozen of the 4000 treated patients might have been protected from death within the first decade or so by radiotherapy, but that the play of chance has obscured this. (It is also compatible, at least in principle, with the opposite possibility of several dozen deaths having been caused by radiotherapy, although any serious adverse effects on causes of death other than breast cancer are likely to be revealed more reliably by the cause-specific mortality overviews planned for future reports than by the present all-causes mortality analyses.) Thus, although it is important to be aware of the clinical trial results, it is also important to be aware of their statistical limitations.

#### 5.17 Fixed-effect "assumption-free" methods, and random-effect "assumed representativeness" methods

The general approach that has been described in the present report for analyzing and interpreting overviews is sometimes called the "fixed effects" method, because the overall result that it gets (by comparing like with like within each separate trial) is not directly influenced by any heterogeneity among the true effects of treatment in different trials. This terminology is, however, unsatisfactory, for it misleadingly suggests that any heterogeneity between the true effects of treatment in different trials is assumed to be zero in this general approach, whereas in fact no such unjustified assumptions are involved - indeed, the "assumption-free" method might be a better name.

When several trials have addressed similar questions it might appear that formal statistical analyses of the heterogeneity of their findings (perhaps assessing it by some "random effects" statistical method) are needed to augment the use of medical judgement in determining how far an overview of their results can be trusted. But, the statistical assumptions needed for such statistical methods to be of direct medical relevance are unlikely to be met. In particular, the different trial designs that were adopted would have to have been randomly selected from some underlying set of possibilities that includes the populations about which predictions are to be made. This is unlikely to be the case, since trial designs are adopted for a variety of reasons, many of which depend in a complex way on the apparent results of earlier trials. Moreover, selective factors that are difficult to define may affect the types of patients in trials, and therapeutic factors that are also difficult to define may differ between trials, or between past trials and future medical practice. Finally, tests of the heterogeneity of the results of many trials may be biased by a tendency for trials with extreme results in either direction to stop recruitment early. Thus, whereas various "fixed effects" methods may actually be assumption-free, various "random effects" methods may unjustifiably assume representativeness.* In view of these difficulties, such "assumed representativeness" (random-effects) methods are not used in the present report.

Even if formal statistical estimates of the degree of heterogeneity that exists may not help much in judging how far the overall results should be trusted, it is obviously sensible to scrutinize thoughtfully any "outliers" in an overview of many trial results. There are, however, many different medical questions that may reasonably be asked of such a large body of data, and different readers may prefer different statistical methods. Hence, in the accompanying reports, the data are presented in sufficient detail to allow a variety of different analyses of them.