Automatic inspection approaches can alleviate these disadvantages and provide more consistent and objective facial palsy diagnosis and evaluation, offering neurologists an efficient decision-support tool
[6]. The automatic quantitative evaluation of facial palsy has been a subject of research for many years. Several approaches use optical markers attached to human faces to determine the degree of palsy
[7][8], as well as full-face laser scanning
[9][10] or electroneurography (ENoG) and electromyography (EMG) signals. The latter approaches, although very accurate, require specialized, high-cost equipment and a constrained clinical environment, and involve physical interventions that are obtrusive and uncomfortable. Moreover, patients cannot apply these approaches on their own to monitor their progress at home.
Recent advances in image analysis algorithms, combined with the increasingly affordable cost of high-resolution capture devices, have led to efficient, simple, and cost-effective vision-based techniques for medical applications that achieve impressive state-of-the-art performance
[11][12][13]. The diagnosis of various diseases is greatly assisted by the computer-vision-based recognition of facial abnormalities
[14][15], progressively incorporating facial recognition into artificial intelligence (AI)-based medicine
[16][17]. Automatic image-based facial palsy assessment could accelerate the diagnosis and progress evaluation of the disease, offering a non-invasive, simple, and time- and cost-saving method that palsy patients could use themselves, without the presence of a human expert.
2. Machine Learning-Based Facial Palsy Detection and Evaluation
Traditional machine learning methods are based on encoding facial palsy with facial asymmetry-related mathematical features. A portable automatic diagnosis system based on a smartphone application for classifying subjects as healthy or palsy patients was presented by Kim et al.
[18]. Facial landmarks were extracted, and an asymmetry index was computed. Classification was implemented using Linear Discriminant Analysis (LDA) combined with Support Vector Machines (SVMs), resulting in 88.9% classification accuracy. Wang et al.
[19] used Active Shape Models (ASMs) to locate facial landmarks, dividing the face into eight regions, and Local Binary Patterns (LBPs) to extract descriptors for recognizing patterns of facial movement in these regions, reaching a recognition rate of up to 93.33%. In
[20], He et al. extracted LBP-based features in the spatio-temporal domain from both sides of the face and validated their method on biomedical videos, reporting an overall accuracy of up to 94% for House–Brackmann (HB) grading. In
[21], the authors automatically measured the ability of palsy patients to smile, using Active Appearance Models (AAMs) for feature extraction and facial expression synthesis, achieving an average accuracy of 87%. McGrenary et al.
[22] quantified facial asymmetry in videos using an artificial neural network (ANN).
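A minimal illustration of the asymmetry-related features these methods build on is sketched below. This is a simplified, hypothetical index for a toy set of landmark pairs, not the exact formulation used in any of the cited works: it reflects the left-side landmarks across the facial midline and measures how far they fall from their right-side counterparts.

```python
import numpy as np

def asymmetry_index(left_pts, right_pts, midline_x):
    """Toy asymmetry index: reflect the left-side landmarks across the
    facial midline and measure their mean distance to the matching
    right-side landmarks, normalized by the face width."""
    left = np.asarray(left_pts, dtype=float)
    right = np.asarray(right_pts, dtype=float)
    mirrored = left.copy()
    mirrored[:, 0] = 2.0 * midline_x - left[:, 0]  # reflect x across the midline
    face_width = right[:, 0].max() - left[:, 0].min()  # crude scale normalizer
    return np.linalg.norm(mirrored - right, axis=1).mean() / face_width

# perfectly symmetric toy landmarks (eye corner, mouth corner) -> prints 0.0
print(asymmetry_index([(30, 50), (35, 80)], [(70, 50), (65, 80)], midline_x=50))
```

For a perfectly symmetric face the index is 0, while a drooping mouth corner or eyelid on one side increases it. Practical systems estimate the midline from the landmarks themselves and typically combine several such measures before feeding them to a classifier.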
Facial asymmetry analysis was also studied in early research by Quan et al.
[23], who presented a method for automatically detecting and quantifying facial dysfunctions based on 3D face scans. The authors extracted a set of feature points that allowed faces to be segmented into local regions, enabling asymmetry evaluation for specific regions of interest rather than the entire face. Gaber et al.
[24] proposed an evaluation system for seven palsy categories based on an ensemble-learning SVM classifier, reporting an accuracy of 96.8%. The authors showed that their classifier remained robust and stable across different training and testing samples. Zhuang et al.
[25] compared various feature extraction techniques and concluded that Histogram of Oriented Gradients (HOG) features on 2D static images tend to be the most accurate. The authors proposed a framework in which landmark and HOG features were extracted, Principal Component Analysis (PCA) was applied separately to each feature set, and the results were fed into an SVM classifier for classification into three classes, achieving an accuracy of up to 92.2% for the entire face. The same research group, in
[26], presented a video-based detection tool, namely the Facial Deficit Identification Tool for Videos (F-DIT-V), which exploits HOG features to achieve a 92.9% classification accuracy. Arora et al.
[27] tested an SVM and a logistic regression classifier on generated facial landmark features, achieving a 76.87% average accuracy with the SVM. In
[28], laser speckle contrast imaging was employed by Jiang et al. to monitor the facial blood flow of palsy patients. Then, faces were segmented into regions based on blood distribution features, and three HB score classifiers were tested for their classification performance: a neural network (NN), an SVM, and a k-NN, achieving an accuracy of up to 97.14%. A set of four classifiers (multi-layer perceptron (MLP), SVM, k-NN, multinomial logistic regression (MNLR)) was also comparatively tested in
[29]. The authors explored regional information, extracting handcrafted features only in certain face areas of interest. Experimental results reported up to 95.61% correct facial palsy detection and 95.58% correct facial palsy assessment in three categories (healthy, slight palsy, and strong palsy).
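A handcrafted-feature pipeline of the kind described above (features, dimensionality reduction, then a shallow classifier, as in the HOG + PCA + SVM framework of [25]) can be sketched end to end. The snippet below is a toy reconstruction on synthetic stripe images, not the authors' implementation: a drastically simplified single-cell orientation histogram stands in for a full HOG descriptor, and scikit-learn is assumed to be available for the PCA and SVM stages.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def hog_like_descriptor(img, bins=9):
    """Very reduced stand-in for HOG: one global histogram of gradient
    orientations, weighted by gradient magnitude and L1-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientations in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

# synthetic stand-in data: class 0 = vertical stripes, class 1 = horizontal
rng = np.random.default_rng(0)
X, y = [], []
for i in range(40):
    img = np.zeros((32, 32))
    if i % 2:
        img[::4, :] = 1.0   # horizontal stripes -> strong vertical gradients
    else:
        img[:, ::4] = 1.0   # vertical stripes -> strong horizontal gradients
    img += rng.normal(0.0, 0.05, img.shape)
    X.append(hog_like_descriptor(img))
    y.append(i % 2)
X, y = np.array(X), np.array(y)

# descriptor -> PCA -> SVM, as in the pipelines described above
model = make_pipeline(PCA(n_components=4), SVC(kernel="linear"))
model.fit(X[:30], y[:30])
acc = model.score(X[30:], y[30:])
```

With real face crops, the descriptor stage would be replaced by a block-wise HOG (e.g., `skimage.feature.hog`), computed per facial region, before the PCA and SVM stages.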
All previous methods are based on hand-crafted features. Deep learning methods can automatically learn discriminative features from the data, without the need to compute them in advance. Deep learning models have achieved state-of-the-art performance in the field of medical imaging
[30]. Based on the above, most of the recent works in vision-based facial palsy detection and evaluation employ deep features. Storey and Jiang
[31] presented a unified multitask convolutional neural network (CNN) for simultaneous face proposal, detection, and asymmetry analysis. Sajid et al.
[32] introduced a CNN to classify palsy into five scales, resulting in a 92.6% average classification accuracy. Xia et al.
[5] suggested a deep neural network (DNN) to detect facial landmarks in palsy. Hsu et al.
[33] proposed a deep hierarchical network (DHN) to quantify facial palsy, comprising a YOLOv2 detector for face detection, a fused neural architecture (line segment network, LSN) for facial landmark detection, and a Darknet-like object detector to locate palsy regions. Preliminary results of the same method were published in
[34]. Guo et al.
[35] investigated the unilateral peripheral facial paralysis classification using GoogLeNet, reaching a classification accuracy of up to 91.25% for predicting the HB degree.
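The deep models cited in this paragraph share a common pattern: a convolutional backbone turns a face crop into a feature map, and a classification head maps it to a palsy grade. The sketch below is a deliberately tiny, generic PyTorch example of this pattern, not any of the cited architectures (whose backbones, such as GoogLeNet, are far larger and pretrained); the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class PalsyGradeCNN(nn.Module):
    """Minimal illustrative CNN mapping a face crop to one of the six
    House-Brackmann grades (I-VI). Purely a structural sketch."""
    def __init__(self, n_grades=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),         # fixed-size map for any input size
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_grades)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = PalsyGradeCNN()
logits = model(torch.randn(2, 3, 64, 64))  # batch of two 64x64 RGB face crops
```

In practice, the randomly initialized backbone above would be replaced by a pretrained network fine-tuned on palsy images, since annotated palsy datasets are small.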
Storey et al.
[36] implemented a facial grading system from video sequences based on a 3D CNN model using ResNet as the backbone, reporting a palsy classification accuracy of up to 82%. Barrios Dell’Olio and Sra
[37] proposed a CNN for detecting muscle activation and intensity in the users of their mobile augmented reality mirror therapy system. In
[38], Tan et al. introduced a facial palsy assessment method comprising a facial landmark detector, a feature extractor based on an EfficientNet backbone, and semi-supervised extreme learning for feature classification, reporting an accuracy of 85.5%. Abayomi-Alli et al.
[39] trained a SqueezeNet network with augmented images and used the activations of the final convolutional layer as features to train a multiclass error-correcting output codes SVM (ECOC-SVM) classifier, reporting a mean classification accuracy of up to 99.34%. In
[40], computed tomography (CT) images were used to train two geometric deep learning models, namely PointNet++ and PointCNN, for the facial part segmentation of healthy and palsy patients for facial monitoring and rehabilitation. Umirzakova et al.
[41] suggested a lightweight deep learning model for analyzing facial symmetry, using a foreground attention block for enhanced local feature extraction and a depth-map estimator to provide more accurate segmentation results.
Table 1 summarizes basic information from all the aforementioned studies, including the followed methodology, dataset, and performance results. Details regarding the mathematical modeling of machine learning and deep learning classification models can be found in
[42][43][44][45][46].
Table 1. Methodologies for facial palsy (FP) detection.
From the information included in Table 1, useful conclusions can be drawn. The lack of available datasets designated for palsy detection and evaluation is evident: most research teams develop their own private sets to test their algorithms. The most widely used public dataset among the referenced works is the YFP dataset; however, it is a limited video dataset. The videos are converted into image sequences, but mild dysfunctions are hardly visible in a single image, so a sequence of frames needs to be examined to draw conclusions. Moreover, the dataset is labeled, but facial landmark points are not annotated. It can also be observed from Table 1 that deep learning methods lead to better performance than machine learning methods relying on hand-crafted features.