Research Series: Article 2
July 30, 2019
Multiplicity: Entanglement in Clinical Trial Design and Analysis
By Raphael Yaakov and Brian McManus
Clinical trials fail for a variety of reasons, ranging from improper dose selection or device application to suboptimal assessment schedules, low-quality data collection, and study design flaws. It is not uncommon for a clinical study to have up to 13 endpoints.1 While multiple endpoints can be immensely useful in characterizing a disease, treatment benefit or patient experience, multiplicity poses several problems. The most obvious concern is the increased likelihood of a false positive finding (a Type I error): the probability that at least one result will be found statistically significant despite no underlying effect. Biostatisticians refer to this as experiment-wise or family-wise error. Another important consideration is sample size and study power. The use of multiple or co-primary endpoints increases the Type II error, the probability of a false negative, and decreases statistical power. Co-primary endpoints may therefore demand a larger sample size; however, simply increasing the sample size will not solve this problem if the co-primary endpoints are not independent, and a sample-size calculation that ignores the correlation between them is not statistically meaningful.
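To make the inflation concrete, the short sketch below (Python is used purely for illustration) computes the family-wise error rate under the simplifying assumption of independent tests, each performed at alpha = 0.05:

```python
# Family-wise error rate: probability of at least one false positive
# across k independent tests, each at alpha = 0.05 (an idealization;
# correlated endpoints will inflate error less dramatically).
alpha = 0.05
for k in (1, 2, 5, 13):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} endpoints: family-wise error rate = {fwer:.2f}")
```

With 13 independent endpoints each tested at 0.05, the chance of at least one spuriously significant result is roughly 49%, even when no true treatment effect exists.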
In a review of 59 multi-arm trials, Wason et al. found that only 49% of the published studies performed statistical adjustments for multiplicity.2 This may be because statistical adjustments often appear too complex or impractical. Additionally, it is common for design features to be modified adaptively based on the results of an interim analysis, which adds another layer to the problem. There are a number of effective statistical approaches that can be employed to manage multiple endpoints. Some statistical methods have a built-in adjustment to keep alpha at 0.05 across all comparisons; a post-hoc test following Analysis of Variance (ANOVA) is a good example. In a Bonferroni adjustment, the overall alpha of 0.05 is divided by the number of endpoints, so p < 0.025 is required for each of two primary endpoints. Another approach is to rank the endpoints in order of importance. Hierarchical testing evaluates the most important endpoint first; if it yields p < 0.05, the effect is concluded to be real, and the next most important endpoint is then tested at p < 0.05. Testing stops at the first non-significant result, which is what keeps the overall Type I error controlled. A technique used largely in medical research and digital image analysis is to control the false discovery rate (FDR). This strategy controls the expected proportion of false discoveries among all rejected hypotheses, limiting false positives to a pre-specified fraction.
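For readers who want to see the mechanics, below is a minimal Python sketch of the three approaches just described: Bonferroni adjustment, hierarchical (fixed-sequence) testing, and Benjamini-Hochberg FDR control. The p-values and function names are illustrative assumptions, not taken from any cited source or reference implementation.

```python
# Minimal sketches of three multiplicity adjustments (illustrative only).

def bonferroni(p_values, alpha=0.05):
    """Reject H_i only if p_i < alpha / k; controls family-wise error."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def hierarchical(p_values, alpha=0.05):
    """Fixed-sequence testing: endpoints are tested in order of importance
    at the full alpha; testing stops at the first non-significant result."""
    rejected = []
    for p in p_values:
        if p >= alpha:
            break
        rejected.append(True)
    # Endpoints at or after the first failure are not declared significant.
    return rejected + [False] * (len(p_values) - len(rejected))

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: controls the expected proportion of
    false discoveries among all rejections at level q."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    # Find the largest rank m such that p_(m) <= (m / k) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / k * q:
            cutoff = rank
    rejected = [False] * k
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

p_values = [0.010, 0.030, 0.045]        # hypothetical, ordered by importance
print(bonferroni(p_values))             # [True, False, False]
print(hierarchical(p_values))           # [True, True, True]
print(benjamini_hochberg(p_values))     # [True, True, True]
```

On these illustrative p-values, Bonferroni rejects only the first endpoint, while hierarchical testing and Benjamini-Hochberg reject all three; the procedures differ in which error rate they control, not merely in strictness.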
The FDA’s draft guidance on multiple endpoints in clinical trials provides a granular view of the problems posed by multiple endpoints and highlights statistical approaches that should be critically evaluated (Docket No. FDA-2016-D-4460). The focus of the FDA’s guidance is to control Type I error for primary and secondary endpoints. Lack of appropriate adjustment leaves room for researchers to publish positive results that show statistical significance without any underlying effect. On the other side of the coin, there is also a risk of producing a false negative outcome (a Type II error, in which an effective treatment is deemed ineffective). Often the statistical analysis plan (SAP) is not finalized prior to the first interim analysis, which can delay critical discussion of the study design. It is highly recommended that a biostatistician be engaged at the time of protocol discussion. Given the implications of clinical findings, it is imperative to have a purposeful discussion of study design, endpoints and statistical approach during initial study planning.
Raphael’s experience spans across phase I-IV drug and device multinational trials. He currently serves as Vice President of Clinical Operations at SerenaGroup, a collaborative group dedicated to wound care research.
Brian McManus has worked in clinical research for nearly 20 years, partnering with sponsors at a clinical research organization on dozens of trials with a focus on collecting high-quality endpoint data. As the former manager of a wound imaging core lab, he is well versed in the nuances of running wound healing trials. Working as the Director of Clinical Operations for eKare, Inc., he is invested in helping research clients collect wound healing data optimally.
The views and opinions expressed here are those of the authors and do not necessarily reflect the official policy or position of any other agency, organization, employer or company.
___________________________
1Tufts Center for the Study of Drug Development. Tufts University. Impact Report. 2012.
2Wason JM, Stecher L, Mander AP. Correcting for multiple-testing in multi-arm trials: is it necessary and is it done? Trials. 2014;15:364. Published 2014 Sep 17. doi:10.1186/1745-6215-15-364.