The Bland–Altman Limits of Agreement is a popular and widespread means of analyzing the agreement of two methods, instruments, or raters in quantitative outcomes. An agreement analysis could be reported as a stand-alone research article, but it is more often conducted as a minor quality assurance project in a subgroup of patients, as a part of a larger diagnostic accuracy study, clinical trial, or epidemiological survey. Consequently, such an analysis is often limited to brief descriptions in the main report. Therefore, in several medical fields, it has been recommended to report specific items related to the Bland–Altman analysis. Seven proposals were identified from a MEDLINE/PubMed search on March 03, 2020, three of which were derived by reviewing anesthesia journals. Broad consensus was seen for the a priori establishment of acceptability benchmarks, estimation of repeatability of measurements, description of the data structure, visual assessment of the normality and homogeneity assumptions, and plotting and numerically reporting both bias and the Bland–Altman Limits of Agreement, including respective 95% confidence intervals. Abu-Arafeh et al. provided the most comprehensive and prudent list, identifying 13 key items for reporting (Br. J. Anaesth. 2016, 117, 569–575). The 13 key items should be applied by researchers, journal editors, and reviewers in the future, to increase the quality of reporting Bland–Altman agreement analyses.
The Bland–Altman Limits of Agreement (BA LoA), or simply Bland–Altman plots, are used widely in method comparison studies with quantitative outcomes, as evidenced by more than 34,753 citations of the seminal Lancet paper to date. In this analysis, each subject is measured once with each of the two methods, yielding a pair of observations per subject. Subsequently, the difference of each pair is plotted against the mean of that pair in a scatter plot. The plot usually also shows a line for the estimated mean difference between the two methods (a measure of the bias between the two methods), and lines indicating the BA LoA (within which approximately 95% of all population differences would lie). Use of the BA LoA assumes that the differences are normally distributed.
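The quantities described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the cited papers: the function name `bland_altman` and the dictionary keys are hypothetical, each subject is assumed to contribute exactly one pair of measurements, and the multiplier 1.96 presumes normally distributed differences.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return the points and reference lines of a basic Bland-Altman plot.

    Hypothetical helper: assumes one paired measurement per subject and
    normally distributed differences (hence the 1.96 multiplier).
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods must measure the same subjects")
    if len(method_a) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")

    # y-axis: difference of each pair; x-axis: mean of each pair
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]

    bias = mean(diffs)   # estimated mean difference between the methods
    sd = stdev(diffs)    # sample standard deviation of the differences

    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "sd": sd,
        "lower_loa": bias - 1.96 * sd,
        "upper_loa": bias + 1.96 * sd,
    }


if __name__ == "__main__":
    # Made-up readings, for illustration only
    result = bland_altman([10, 12, 14, 16], [9, 13, 13, 17])
    print(result["bias"], result["lower_loa"], result["upper_loa"])
```

The returned `means` and `diffs` are the x- and y-coordinates of the scatter plot; `bias`, `lower_loa`, and `upper_loa` are the three horizontal lines usually drawn on it.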
Kottner et al. pointed out that agreement and reliability assessment is either conducted in dedicated studies with a respective primary focus or as a part of larger diagnostic accuracy studies, clinical trials, or epidemiological surveys that report agreement and reliability as a quality control. The latter is often done in subsamples, resulting in small to moderate sample sizes; sample sizes as small as 10 are, by no means, an exception. Such a supplementary agreement or reliability analysis is often limited to brief descriptions in the main report, lacking details for sufficient transparency.
The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) comprise a comprehensive checklist of 15 items that support the transparent reporting of agreement and reliability studies. Items no. 10 and 13 relate to the description of the statistical analysis and the reporting of estimates of reliability and agreement, including measures of statistical uncertainty. However, agreement and reliability studies can easily become complex investigations when considering different sources of variation in the data, leading naturally to repeatability coefficients based on variance components analyses. GRRAS has to accommodate this breadth, which is why Items no. 10 and 13 are not specific to agreement analysis performed by means of BA LoA.
During the past two decades, researchers have attempted to establish reporting standards for BA plots in various fields. In a review of methodological reviews, Gerke identified reporting standards for BA agreement analyses and singled out the most comprehensive and appropriate list.
Seven publications, published before March 03, 2020, were identified. Three out of seven studies were published in anesthesia journals, while the remaining stemmed from various fields (Table 1).
Table 1: Characteristics of studies proposing reporting items for BA analysis. N/A: not applicable. This table was reproduced from Gerke.
|Publication|Field/Area|Search Approach or Target Journals|Time Frame|Evidence Base|
|---|---|---|---|---|
|Flegal (2019)|Self-reported vs. measured weight and height|Unrestricted; reference lists of systematic reviews, repetition of 2 PubMed searches of these, “related articles” in PubMed|1986–May 2019|N = 394 published articles|
|Abu-Arafeh (2016)|Anesthesiology|Anaesthesia, Anesthesiology, Anesthesia & Analgesia, British Journal of Anaesthesia, Canadian Journal of Anesthesia|2013–2014|N = 111 papers|
|Montenij (2016)|Cardiac output monitors|N/A|N/A|Expert opinion|
|Olofsen (2015)|Unrestricted|N/A|N/A|Narrative literature review and Monte Carlo simulations|
|Chhapola (2015)|Laboratory analytes|PubMed and Google Scholar|2012 and later|N = 50 clinical studies|
|Berthelsen (2006)|Anesthesiology|Acta Anaesthesiologica Scandinavica|1989–2005|N = 50|
|Mantha (2000)|Anesthesiology|Seven anesthesia journals|1996–1998|N = 44|
Sixteen reporting items were proposed across these seven studies.
Broad consensus was seen for the a priori establishment of acceptable LoA (Item #1); estimation of repeatability of measurements in case of available replicates within subjects (#3); visual assessment of a normal distribution of differences and homogeneity of variances across the measurement range (#4); and plotting and numerically reporting both bias and the BA LoA, including respective 95% confidence intervals (#6–9). A description of the data structure (#2), between- and within-subject variance (or stating that confidence intervals for the BA LoA were derived by accounting for the inherent data structure; #11), and distributional assumptions (#13) followed. Three issues were each raised by only one review: a sufficiently wide measurement range (#10), sample size determination (#14), and correct representation of the x-axis (#15). Upfront declaration of conflicts of interest (#16) also appeared only once, but this can generally be presumed to be covered by the ethics of authorship. Moreover, there seems to be a tacit consensus that the x-axis must show the average values of the two methods compared (#15), as also discussed by Bland and Altman. The issue of sample size determination (#14) was discussed in more detail by Gerke.
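Numerically reporting bias and BA LoA together with their 95% confidence intervals could look like the following sketch. It uses the approximate standard errors usually attributed to Bland and Altman, namely s/√n for the bias and s·√(1/n + 1.96²/(2(n − 1))) for each limit, and it assumes one pair per subject with no repeated measurements. The function name `loa_with_ci` and the `crit` parameter are hypothetical. By default the code falls back on a normal quantile, which is a large-sample simplification; for the small samples typical of agreement sub-studies, the caller should pass the t quantile with n − 1 degrees of freedom instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def loa_with_ci(diffs, crit=None):
    """Bias and limits of agreement with approximate 95% confidence intervals.

    Hypothetical helper. `crit` is the critical value used for the intervals;
    if omitted, the normal quantile (about 1.96) is used, which is only an
    assumption suited to large samples. Pass the t quantile with n - 1
    degrees of freedom for small samples.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two differences are required")
    if crit is None:
        crit = NormalDist().inv_cdf(0.975)

    bias = mean(diffs)
    s = stdev(diffs)

    # Approximate standard errors (single pair per subject assumed)
    se_bias = s / sqrt(n)
    se_loa = s * sqrt(1 / n + 1.96 ** 2 / (2 * (n - 1)))

    def interval(estimate, se):
        return (estimate - crit * se, estimate + crit * se)

    lower, upper = bias - 1.96 * s, bias + 1.96 * s
    return {
        "bias": (bias, interval(bias, se_bias)),
        "lower_loa": (lower, interval(lower, se_loa)),
        "upper_loa": (upper, interval(upper, se_loa)),
    }


if __name__ == "__main__":
    # Made-up differences, for illustration only; 3.182 is the
    # 97.5% t quantile with 3 degrees of freedom.
    for name, (est, (lo, hi)) in loa_with_ci([1, -1, 1, -1], crit=3.182).items():
        print(f"{name}: {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With ten pairs, for example, the appropriate t quantile (9 degrees of freedom) is about 2.262 rather than 1.96, so the intervals are noticeably wider than the normal approximation suggests. Designs with replicates within subjects need a variance-components approach that this sketch does not cover.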
The list of reporting items proposed by Abu-Arafeh et al. was the most comprehensive (13 out of 16 items), followed by those proposed by Montenij et al. (10 out of 16 items) and Olofsen et al. (9 out of 16 items). The latter two lists were complete subsets of Abu-Arafeh et al.’s list, with the exception of Item #14 on the list by Montenij et al. The most recently published list by Flegal et al. comprised items that were derived as a modified version of those suggested by Abu-Arafeh et al. Specifically, they omitted items related to statistical software and repeated measurements, as the latter are rarely applied in studies entailing self-reported weight and height.
A worked example for the reporting items proposed by Abu-Arafeh et al. can be found elsewhere. Such an extended analysis can, generally speaking, easily accompany the report of the main study as Online Supplemental Material. Journal space restrictions are no longer a valid argument for reducing the reporting of agreement or reliability to a few lines of the main report.
The work of Abu-Arafeh et al. represents the most comprehensive and prudent list of reporting items for BA analysis, identifying 13 key items. Considering GRRAS as a broad reporting framework for agreement and reliability studies, Abu-Arafeh et al. concretized its Item 10 (statistical analysis) and Item 13 (estimates of reliability and agreement including measures of statistical uncertainty) in the context of the Bland–Altman analysis in method comparison studies. A rigorous application of and compliance with the 13 key items recommended by Abu-Arafeh et al. will increase both the transparency and quality of such agreement analyses. Researchers, journal editors, and reviewers are obligated to employ more diligence in producing and assessing BA analyses in the future.