Knee injuries account for the largest percentage of severe sport-related injuries (i.e., injuries that cause more than 21 days of missed sport participation). Improved treatment of knee injuries critically relies on accurate and cost-effective detection. In recent years, deep-learning-based approaches have monopolized knee injury detection in MRI studies.
1. Introduction
1.1. Backdrop
Knee injuries account for the largest percentage of severe sport-related injuries (i.e., injuries that cause more than 21 days of missed sport participation) [1,2,3,4]. Anterior cruciate ligament (ACL) ruptures represent more than 50% of the cases, affecting 200,000 individuals in the United States each year [1,5,6,7]. Knee cartilage lesions affect around 900,000 individuals in the United States every year, resulting in over 200,000 surgical procedures [5,6,7,8]. Meniscal injuries are the second most common knee impairment, with an incidence of 12–14% [9] and a prevalence of 60–70 cases per 100,000 in the United Kingdom [2]. ACL injuries alone account for an expenditure of more than $7 billion in the United States [10]. Short- and long-term pain, disability, and reduced health-related quality of life have all been strongly associated with knee injuries [11,12,13]. The more time young and athletic individuals spend engaging in occupational and/or recreational activities, the higher their predisposition to knee injuries, which, in turn, contributes to a higher likelihood of developing osteoarthritis (OA) [14]. On average, half of the individuals who sustain an injury involving an ACL and/or meniscal tear develop radiographically confirmed knee OA ten to twenty years post-injury [15,16]. Two further possible consequences of knee injuries are: (i) structural muscle injuries of the lower limb [17]; and (ii) tendinopathies [18]. All the above reflect the direct and indirect (lost wages, productivity, and disability) socio-economic burden conferred on society by knee injuries. The high prevalence of knee injuries in the general population, and the resulting socio-economic impact, have created a necessity for developing accurate and cost-effective procedures that can detect and quantify the severity of knee injuries. Early diagnosis and, consequently, treatment of ligament rupture, meniscal tear, and/or cartilage lesion can prevent the early onset of knee OA [1].
Arthroscopy is considered the “gold standard” for the diagnosis of intra-articular knee pathologies, but it is limited by potential complications and its invasive nature [19]. Therefore, magnetic resonance imaging (MRI) is the most widely used non-invasive imaging technique for diagnosing knee injuries [20,21]. However, MRI-based diagnosis of knee injuries can be very challenging, with the experience of clinicians playing a critical role in image interpretation. Human image-interpretation pitfalls, such as subjectivity, distraction, and fatigue, as well as diagnostic uncertainties, often lead to inconsistent diagnoses, hindering the optimal management of knee injuries [22,23]. Moreover, clinical-diagnostic discrepancies between non-musculoskeletal radiologists and orthopedic surgeons are commonly encountered in everyday clinical practice [11].
Due to the above factors, as well as the exponentially increasing number of clinical examinations, the idea of using computers to improve the challenging task of interpreting medical examinations has recently been adopted by the scientific community [24]. Imaging data proliferation, algorithmic advances, and recent technological advances in fast computing have already resulted in a strong push towards the utilization of artificial intelligence (AI) algorithms in medical image analysis. The term AI broadly refers to any method that enables computers to mimic human intelligence [25]. Deep learning (DL) in particular is a class of machine-learning (ML) algorithms that is currently driving the AI boom [26]. Numerous applications of DL in medical image analysis have been reported, including skin cancer classification, diabetic retinopathy detection, lung nodule detection, and mammography cancer detection, among others [27]. These AI-empowered solutions are expected to revolutionize medical sectors by improving the accuracy and productivity of diagnostic and therapeutic measures in clinical practice [20].
Turning to the diagnosis of knee injuries, several early DL studies have exhibited better performance than traditional ML techniques, and in some cases they have even proved superior to radiologists [26]. However, previously published review studies in the MRI field were either focused on other application domains (e.g., fracture detection [28]) or limited to the performance of the proposed networks without paying attention to their specifics (learning methodology, processing stages, technical limitations, etc.) [29].
1.2. Machine Learning in a Nutshell: Definitions and Terminology
To enhance the readers’ understanding and for the sake of completeness, this section briefly presents the terminology and definitions relevant to the ML and DL algorithms used in the studies reviewed here. ML is a branch of AI that focuses on the development of algorithms that automatically learn to make accurate predictions by relying on experience (data) rather than on hard-coded instructions.
Supervised ML systems (Figure 1) operate in two phases: the training phase and the testing phase. In a traditional ML pipeline, a feature extraction/selection stage (also referred to as feature engineering) is first implemented to extract or identify the most informative features [16]. These features can be extracted from the input images using various algorithms, including the grey-level co-occurrence matrix (GLCM), first- and second-order statistics, and shape/edge features, among others [30]. Next, an ML model is fit to the extracted features and the optimal model parameters are obtained. During the testing phase, the trained model is shown previously unseen samples (represented as images or features extracted from images), which are then classified. As opposed to traditional programming, where the rules are manually crafted by a programmer, a supervised ML algorithm automatically formulates rules from the data.
Figure 1. Examples of typical machine-learning and deep-learning pipelines.
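To make the two-phase pipeline concrete, the following is a minimal sketch of a traditional ML pipeline (GLCM texture features followed by an SVM classifier), assuming scikit-image and scikit-learn are available; the random arrays stand in for pre-processed MRI slices and are not data from any reviewed study.

```python
# Minimal sketch of a traditional two-phase ML pipeline: hand-crafted GLCM
# texture features (feature engineering) followed by an SVM classifier.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def glcm_features(image_8bit):
    """Extract second-order GLCM texture features from one grayscale slice."""
    glcm = graycomatrix(image_8bit, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

images = [np.random.randint(0, 256, (128, 128), dtype=np.uint8) for _ in range(40)]
labels = np.random.randint(0, 2, 40)            # stand-in labels: healthy/injured

X = np.array([glcm_features(img) for img in images])    # feature engineering stage
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          stratify=labels, random_state=0)

clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)          # training phase
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # testing phase
```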
DL [31] is a subfield of ML that sets an alternative architectural paradigm by shifting the process of extracting features from images to the underlying learning mechanism: the most informative features for the task at hand are extracted by the algorithm itself. The mainstream DL architecture for computer vision applications is the convolutional neural network (CNN). A CNN typically consists of multiple building blocks (layers such as convolutional, pooling, and fully connected) that automatically extract increasingly abstract spatial hierarchies of features. CNN training is carried out via the backpropagation algorithm. The huge popularity of CNNs is attributed to certain characteristics they possess, such as weight sharing and spatial invariance.
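The following is a minimal sketch of such a CNN (convolution, pooling, and a fully connected classifier), assuming PyTorch; the architecture and input size are illustrative only, not those of any reviewed model.

```python
# Tiny CNN mirroring the layer types described above:
# convolution -> pooling -> fully connected classification head.
import torch
import torch.nn as nn

class TinyKneeCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling (down-sampling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected

    def forward(self, x):                  # x: (batch, 1, 224, 224) MRI slice
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = TinyKneeCNN()(torch.randn(4, 1, 224, 224))  # -> shape (4, 2)
```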
Transfer learning is a common strategy where a network that was pre-trained on a large dataset is partly reused to provide decisions on a problem with a different dataset. The main idea behind transfer learning is that generic features learned on a large dataset can be useful and applicable to other domain tasks with a potentially limited amount of accessible data. Numerous pre-trained networks are currently available, such as DenseNet [32], AlexNet [33], and VGG [34]. When employing DL with transfer learning for feature extraction, the pre-trained network is treated as an arbitrary feature extractor: the input image propagates through multiple layers until it reaches a pre-specified layer, the outputs of which are taken as the final extracted features (Table 1).
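As a sketch of this feature-extraction use of transfer learning, the snippet below truncates a pre-trained network at a pre-specified layer, assuming PyTorch/torchvision (the `weights=` API requires torchvision ≥ 0.13); the choice of ResNet-18 and the penultimate layer is illustrative.

```python
# Sketch: a pre-trained network used as an arbitrary feature extractor.
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # stop at the pre-specified (penultimate) layer
backbone.eval()

with torch.no_grad():               # propagate an image; outputs are the features
    x = torch.randn(1, 3, 224, 224)          # placeholder pre-processed slice
    features = backbone(x)                   # shape: (1, 512)
```

The extracted vectors can then feed any downstream classifier, such as the SVM from the traditional pipeline above.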
Table 1. Brief presentation of the feature extraction techniques, the ML and DL models, and the main procedures reported in the reviewed papers.

| Category | Models | Description |
| --- | --- | --- |
| Feature extraction | Histogram of oriented gradients (HOG) [35] | A feature descriptor used in computer vision and image processing for object detection. The technique counts occurrences of gradient orientation in localized portions of an image. |
| Feature extraction | Gray-level co-occurrence matrix (GLCM) [36] | A way of extracting second-order statistical texture features. The texture of an image is estimated by calculating how often pairs of pixels with specific values and a certain spatial relationship occur. |
| Feature extraction | Generalized search tree (GIST) [30] | The GIST descriptor represents holistic spatial scene properties (the spatial envelope) of an image. It summarizes gradient information at different spatial scales and orientations by splitting the image into a grid of cells on several scales and convolving each cell with a Gabor filter bank from different perspectives. |
| Traditional machine learning | k-nearest neighbors (K-NN) [37] | A simple, easy-to-implement supervised ML algorithm that can solve both classification and regression problems. It works by (i) finding the distances between a query and all the examples in the data, (ii) selecting the K nearest neighbors of the query, and (iii) voting for the most frequent label (classification) or averaging the labels (regression). |
| Traditional machine learning | Support vector machines (SVMs) [38] | A supervised method that identifies the hyperplane that best divides the data into two classes. Among the many possible separating hyperplanes, the SVM algorithm finds the slab of maximum thickness, i.e., the maximum distance between data points of the different classes. |
| Traditional machine learning | Shallow artificial neural networks (ANNs) [39] | ANNs loosely simulate the way the human brain analyzes and processes information. They consist of sequential layers: input, hidden, and output. The hidden layers process and transmit the input information to the output layer. |
| Deep learning | Convolutional neural networks (CNNs) [40] | A class of DL algorithms commonly used in computer vision and pattern recognition. CNNs are generally composed of (i) an input layer, (ii) convolution layers, (iii) pooling layers, and (iv) fully connected layers. The convolution layers use filters that perform convolution operations while scanning the input with respect to its dimensions. Pooling is a down-sampling operation, typically applied after a convolution layer. The fully connected layers operate on a flattened input where each input is connected to all neurons of the next layer; they are usually found towards the end of CNN architectures to optimize objectives such as class scores. |
| Deep learning | Region-based convolutional neural networks (R-CNNs) [41] | Detecting and classifying objects in an image is known as object detection. R-CNN is a two-stage DL detection method that blends rectangular region proposals with CNN features. |
| Deep learning | Deep residual networks (ResNets) [42] | An ANN variant that uses residual mapping and shortcut connections to tackle the vanishing and exploding gradients characteristic of deep CNNs. As a consequence, deep residual networks achieve better performance than plain very deep networks and are also easier to train. Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities such as rectified linear units (ReLUs) and batch normalization in between. |
| Deep learning | 3D CNNs | A 3D CNN is the 3D generalization of the 2D CNN. It takes as input a 3D volume or a sequence of 2D frames (e.g., slices in an MRI scan); its kernels then move through the three dimensions of the data, producing 3D activation maps. Overall, 3D CNNs learn powerful representations of volumetric data. |
| Deep learning | Computer vision Transformers [44] | When data are modeled as a sequence of embeddings, the Transformer is a simple yet scalable technique that can be applied to any type of data. Even without typical convolutional pipelines, Transformers can deliver state-of-the-art results in computer vision. It is a DL network that extracts inherent properties of the domain of interest via the self-attention mechanism. |
| Procedure | Training | The standard procedure involves a dataset of paired images and labels (x, y) for training and testing, an optimizer (e.g., stochastic gradient descent, Adam [45]), and a loss function to update the model parameters. The aim of training is to find the values of the network parameters that minimize the loss function (a minimal training-loop sketch follows the table). |
| Procedure | Loss function | The metric that assesses the discrepancy between model predictions and labels. The gradients of the loss function are used to update the weights of the neural network. |
| Procedure | Data augmentation | A strategy that artificially generates more training samples to increase the diversity of the training data. This can be done by applying affine transformations (e.g., rotation, scaling), flipping, or cropping to the original labeled samples. |
| Procedure | Dropout | A regularization method that randomly drops some units from the neural network during training, encouraging the network to learn a sparse representation. It is used to reduce overfitting. |
| Procedure | Transfer learning | Aims to transfer knowledge from one task to a different but related target task, often by reusing the weights of a pre-trained model to initialize the weights of a new model for the target task. Transfer learning can decrease training time and achieve lower generalization error. |
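The sketch below ties together the "Training" and "Loss function" rows of Table 1, assuming PyTorch; the small model and the random tensors standing in for an MRI dataset are illustrative only.

```python
# Minimal supervised training loop: optimizer + loss function update the
# model parameters so that the loss over paired (x, y) samples is minimized.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(                      # any small classifier will do here
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam optimizer [45]
loss_fn = nn.CrossEntropyLoss()     # measures prediction/label discrepancy

data = TensorDataset(torch.randn(32, 1, 64, 64), torch.randint(0, 2, (32,)))
for epoch in range(3):
    for x, y in DataLoader(data, batch_size=8):   # paired (x, y) samples
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)    # loss between predictions and labels
        loss.backward()                # backpropagation: gradients of the loss
        optimizer.step()               # update parameters to minimize the loss
```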
2. Knee Injury Detection Using Deep Learning on MRI Studies
Figure 2 shows an increasing trend in adopting ML-based studies in this application area, with most of the papers published from 2017 onwards (the first ML-based paper in the field appeared in 2013). Medical imaging, and specifically MRI, is one of the most instructive assets in the field of knee injury diagnosis. The proliferation of MRI data has facilitated the effective training of ML and DL networks towards the development of: (i) novel methodologies that could enhance medical experts’ domain knowledge and understanding of MRI; and (ii) new, data-driven tools that could enable a more reliable, fast, and fully automated detection of knee injuries. The main characteristics of the proposed MRI-based learning algorithms and pipelines were identified, along with the data sources investigated (Table 2).
Figure 2. Temporal evolution chart depicting the number of ML papers per category published each year since 2013.
Table 2. Main characteristics and results of the reviewed studies.
| No. | Author | Year | AI Model Used | Pretrained CNN | MRI (T) | Localization Technique | Validation | Performance (Accuracy/AUC) | Application Domain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Awan et al. [46] | 2021 | CNN | ResNet-14 | 1.5 T | Standard region-of-interest (ROI) localization | 5-fold cross-validation | 92%/(healthy = 0.98, partial tear = 0.97, fully ruptured tear = 0.99) | ACL tear |
| 2 | Jeon et al. [47] | 2021 | 3D CNN | VGGNet, AlexNet, and SqueezeNet | 3 T & 1.5 T | Custom localization technique | 5-fold cross-validation | N/A/0.983 and 0.980 on the Chiba and Stanford knee datasets, respectively | ACL tear |
| 3 | Rizk et al. [48] | 2021 | 3D CNN | N/A | 1 T (54%)–1.5 T (9.7%)–3 T (36.3%) | CNN-based localization model | 10-fold cross-validation | Medial = N/A/0.93, Lateral = N/A/0.84 | Meniscus tear |
| 4 | Dai et al. [49] | 2021 | TransMed | N/A | 3 T & 1.5 T | N/A | 120 exams | ACL tear = 94.9%/0.98, Abnormality = 91.8%/0.976, Meniscus tear = 85.3%/0.95 | ACL tear—Meniscus tear—Abnormalities |
| 5 | Astuto et al. [50] | 2021 | 3D CNN | N/A | 3 T | V-Net | Hold out (15% of sample) | N/A/0.83–0.93 | ACL tear—Meniscus tear—Cartilage lesion |
| 6 | Fritz et al. [15] | 2020 | DCNN | N/A | 1.5 T (64%)–3 T (36%) | Class activation map (CAM) of the last convolutional layer, mapped onto an axial knee image, to visually localize the tear | Hold out (10% of sample) | Medial = (86%/0.88), Lateral = (84%/0.78), Overall = (N/A/0.96) | Meniscus tear |
| 7 | Namiri et al. [51] | 2020 | CNN | N/A | 3 T | Three-dimensional V-Net [43] | Hold out (10% of sample) | 3D model = (89%/sensitivity 89%, specificity 88%), 2D model = (92%/sensitivity 93%, specificity 90%) | ACL tear |
| 8 | Zhang et al. [6] | 2020 | CNN | 3D DenseNet, VGG16, ResNet | 1.5 T (74%)–3 T (26%) | - | Hold out (20% of sample) | Custom = (95.7%/0.96), ResNet = (N/A/0.95), VGG16 = (N/A/0.86) | ACL tear |
| 9 | Germann et al. [24] | 2020 | DCNN | N/A | 1.5 T–3 T | Manual cropping | Of 5802 MRI studies, 4802 used for training, 500 for validation, and 500 for initial testing | N/A/0.94 | ACL tear |
| 10 | Azcona et al. [52] | 2020 | CNN | MRNet, ResNet18, ResNet50, ResNet152, ImageNet | 3 T (56.6%)–1.5 T (43.4%) | - | N/A | N/A/0.96–N/A/0.91–N/A/0.94 | ACL tear—Meniscus tear—Abnormalities |
| 11 | Chang et al. [8] | 2019 | CNN | ResNet | 1.5 T–3 T | Object-localization CNN implemented as a fully convolutional network based on the U-Net architecture | 5-fold cross-validation | 96.7%/0.97 | ACL tear |
| 12 | Liu et al. [53] | 2019 | CNN | LeNet-5, DenseNet, VGG16, AlexNet | N/A | YOLO object detection | 50-subject test set (14% of sample) | N/A/0.98 | ACL tear |
| 13 | Couteaux et al. [54] | 2019 | CNN | ResNet-101, ConvNet, R-CNN | N/A | Mask R-CNN framework to localize both menisci and identify tears in each meniscus | 54 cases; the model with the highest validation accuracy was selected | N/A/0.90 | Meniscus tear |
| 14 | Pedoia et al. [55] | 2019 | 2D U-Net, CNN | N/A | 3 T | - | Hold out (20% of sample) | Sensitivity of 89.81% and specificity of 81.98% | Meniscus tear |
| 15 | Roblot et al. [56] | 2019 | CNN | AlexNet, MRNet | N/A | Fast R-CNN and Faster R-CNN object detection | Test dataset of 700 images for external validation | 72.5%/0.85 | Meniscus tear |
| 16 | Bien et al. [27] | 2018 | CNN | AlexNet, MRNet | 3 T (56.6%)–1.5 T (43.4%) | - | 120 exams | 86.7%/0.97–72.5%/0.85–N/A/0.94 | ACL tear—Meniscus tear—Abnormalities |
| 17 | Liu et al. [57] | 2018 | CNN | VGG16 | 3 T | - | Fellowship-trained musculoskeletal radiologist (15 years of clinical experience) | N/A/0.92 | Cartilage lesion |
| 18 | Stajduhar et al. [58] | 2017 | HOG + linSVM, HOG + RF, GIST + rbfSVM, GIST + RF | N/A | 1.5 T | Manual extraction of a rectangular ROI | 10-fold cross-validation | (Injury detection, complete rupture) = (N/A/0.89, N/A/0.94), (N/A/0.88, N/A/0.94), (N/A/0.889, N/A/0.91), (N/A/0.88, N/A/0.90), respectively, for the four models | ACL tear |
| 19 | Mazlan et al. [59] | 2017 | SVM | N/A | N/A | Cropping | Hold out (10% of sample) | 100%/N/A | ACL tear |
| 20 | Zarandi et al. [60] | 2016 | IT2FCM, PNN | N/A | N/A | - | Hold out (20% of sample) | 0-and-1 mode: 90%/N/A; binary mode: 78%/N/A | Meniscus tear |
| 21 | Fu et al. [61] | 2013 | SVM | N/A | N/A | Active Contours without Edges combined with Level Sets (ACLS) | 5-fold cross-validation | SVM: N/A/0.73; SFFS + SVM: N/A/0.91 | Meniscus tear |
| 22 | Abdullah et al. [62] | 2013 | BP ANN, K-NN | N/A | N/A | - | 5-fold and 6-fold cross-validation | BP ANN: 94.44%/N/A; K-NN: 87.83%/N/A | ACL tear |
Although there is no clearly accepted “gold-standard” methodological pipeline for diagnosing knee abnormalities using MRI data, a number of processing steps were commonly employed in the majority of the reported studies. Figure 3 visualizes a DL pipeline that was adopted by most of the papers, comprising a pre-processing step, (optionally) localization by identifying regions of interest, and, finally, a CNN-based classification step. Data augmentation was employed by a significant number of papers on the detection of ACL injuries [6,27,46,47,49,50,51,52], in papers where meniscus injuries were investigated [27,49,50,52,55,56], and, finally, in studies focusing on cartilage lesion abnormalities [27,52]. In particular, the available MRI images were modified (via a number of image transformations such as random rotations, shifting, flipping, and the addition of noise) to expand the training dataset, and thus help improve the performance and generalization ability of the employed DL models; a minimal sketch of such an augmentation step is given after Figure 3. Localization was employed in papers from all three subcategories: (i) ACL studies [6,8,24,46,47,50,51,53,58]; (ii) meniscus injury detection studies [15,48,50,54,55,56,60,61]; and (iii) studies diagnosing lesion abnormalities [57]. Segmentation or object detection algorithms were applied in these studies to extract areas of interest, enabling the application of CNN-based models to focused and more relevant parts of the initially available images. Given that the region of interest (ROI) may appear in slightly different positions within an image and may have different aspect ratios or sizes, identifying ROIs automatically has proven to be a crucial processing step.
Figure 3. A typical DL pipeline for ACL detection.
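The following sketch illustrates the augmentation transformations named above (random rotations, shifting, flipping, additive noise), assuming a recent torchvision with tensor-transform support; the parameter values are illustrative, not those used by any reviewed study.

```python
# Sketch of a data augmentation step for single-channel MRI slices.
import torch
from torchvision import transforms

add_noise = transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x))

augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),                       # random rotation
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # shifting
    transforms.RandomHorizontalFlip(p=0.5),                      # flipping
    add_noise,                                                   # additive noise
])

slice_tensor = torch.rand(1, 224, 224)    # placeholder pre-processed slice
augmented = augment(slice_tensor)         # a new, artificially varied sample
```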
CNN-inspired networks were identified as the dominant approach for extracting informative features from either ROIs or entire MRIs and finally classifying them as normal (healthy) or abnormal (indicating either partial or complete tears). Transfer learning was preferred in most cases, allowing the training of large and powerful deep architectures even when the amount of available data was limited. As networks require a lot of data to be trained from scratch, this technique essentially ‘steals’ knowledge from already pre-trained large networks. Specifically, ResNet variants were used in five papers [6,8,46,52,54], whereas VGG [34], AlexNet [33], and MRNet [27] were each used three times [6,27,52,53,56,57]. Other pre-trained networks that were used at least once in this survey are DenseNet [32], LeNet [63], ImageNet [33], and R-CNN [41]. In five [58,59,60,61,62] out of the 22 studies of the present survey, more traditional ML pipelines were applied, including a separate feature engineering step (where features were manually extracted from images). SVM was the preferred classifier in most of these cases.
Despite the excellent capability of CNNs to produce valuable image representations, these models lack the capacity to capture long-range relationships. To deal with this limitation, recent research studies [44,64] have proposed employing Transformer-based architectures for various image recognition tasks. The Transformer [65] is a neural network architecture that relies on global self-attention mechanisms and was initially designed for sequence-to-sequence prediction. Papers using this architectural paradigm have achieved state-of-the-art results [66,67] in many natural language processing (NLP) tasks. Dai et al. [49] were the first to employ a Transformer-based architecture for MRI-based knee injury detection. In particular, their hybrid (Transformer and CNN) model was used to extract features that capture the long-range dependencies between MRI and other modalities.
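The following is a minimal sketch of the scaled dot-product self-attention operation at the heart of the Transformer, assuming PyTorch; every token (e.g., an image-patch embedding) attends to every other token, which is what lets these models capture long-range dependencies. The token count and dimensions are arbitrary.

```python
# Minimal global self-attention: each of the n tokens mixes information
# from all other tokens, weighted by query-key similarity.
import torch

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # queries, keys, values
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ v                                   # weighted mix of all tokens

n, dim = 16, 64                                       # 16 tokens, 64-dim embeddings
x = torch.randn(n, dim)
Wq, Wk, Wv = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)                   # shape: (16, 64)
```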
3. Conclusion
Notwithstanding the huge potential of AI to improve the medical domain, DL-based methods have yet to achieve significant deployment in clinical environments. This mainly results from: (i) the intrinsic black-box nature of DL algorithms; and (ii) their high computational cost. Explainable AI aims at building trust in AI algorithms by providing medical experts with a diagnostic rationale behind the AI decision processes. The goal of the lightweight DL field is to develop models that have shallower architectures and are also faster and more data-efficient, while retaining high-performance standards. Jeon et al. [47] were the first to get to grips with the clinical deployment of MRI-based knee injury diagnosis. To this end, they proposed using post-inference visualization tools (such as CAM and Grad-CAM), and they also incorporated attention modules, Gaussian positional encoding, squeeze modules, and fewer convolutional filters.
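As a rough sketch of the Grad-CAM idea mentioned above, the snippet below weights the last convolutional feature maps by the pooled gradients of a class score to obtain a coarse localization map, assuming PyTorch; the toy model, layer indices, and input are illustrative and do not reproduce any specific study's implementation.

```python
# Grad-CAM sketch: pooled gradients of the class score weight the feature
# maps of the last convolutional block, yielding a coarse heat map.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),   # "last conv block" for the CAM
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)

x = torch.rand(1, 1, 224, 224)          # placeholder MRI slice
feats = model[:4](x)                    # activations of the last conv block
feats.retain_grad()
score = model[4:](feats)[0, 1]          # class score for the "tear" class
score.backward()

weights = feats.grad.mean(dim=(2, 3), keepdim=True)   # pooled gradients
cam = F.relu((weights * feats).sum(dim=1))            # heat map, shape (1, 224, 224)
```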