Swallowing is a natural yet essential part of daily life. Humans swallow spontaneously (saliva and food/drink) 0.98 times per minute on average
[1]. With different definitions and measurement techniques, Lear et al.
[2] suggested that humans swallow approximately 200 to 1000 times a day, while Rudney et al.
[3] reported that healthy humans swallow spontaneously 18 to 400 times per hour. However, some people have difficulty swallowing, especially older adults or people with chronic conditions. Swallowing difficulty is termed dysphagia: dysphagic individuals have problems chewing and swallowing food or liquids, experience pain during swallowing, or are even unable to swallow. Notably, the bolus may enter the airway and lungs, leading to aspiration pneumonia, which can be fatal yet clinically silent
[4]. Dysphagia is generally chronic but deteriorates with the worsening of cognition and functions in the progression of dementia or other neurological disorders
[5,6]. Therefore, continuous monitoring or assessment may be necessary to identify stages with a high risk of choking or aspiration for timely management and rehabilitation
[7,8]. In addition, dysphagia patients may be reluctant to eat for fear of choking, pain, or difficulty, which can lead to malnutrition, dehydration, depression, and anorexia
[9]. More than one-third of older adults reported dysphagia or swallowing disorders during their lifetime, which were associated with stroke, diabetes, Parkinson’s, and Alzheimer’s disease
[10,11]. Howden
[12] and Ney et al.
[13] reported that the prevalence of dysphagia could be 22% and 40% for seniors aged over 50 and 60, respectively. A recent survey found that one in six adults reported swallowing difficulty, and some of them might not seek medical care
[14].
2. Instrument Configuration
Among the 11 eligible articles, five utilized only acoustics (microphone)
[32,33,34,35,40][29][30][31][32][33], one utilized only accelerometers in the instrument
[39][34], and five applied a multimodal system
[30,31,36,37,38][35][36][37][38][39]. However, two articles on multimodal systems did not fully describe the modalities other than acoustics
[30,37][35][38]. The other multimodal systems involved surface electromyography (sEMG), mechanomyography (MMG), and an airflow pressure sensor.
As shown in
Table 1, a single microphone for detecting swallowing sounds appeared in three articles
[32,35,40][29][32][33]. Skowronski et al.
[40][33] made use of a miniature surface-mounted microphone and characterized the signal using Human Factor Cepstral Coefficients
[41][40], which were originally used for automatic speech recognition. Bi et al.
[32][29] developed the “AutoDietary” system using a throat microphone. The system also displayed the food type recognition results for the users for personal health management. Kurihara et al.
[35][32] customized the device by attaching a bi-directional electret condenser microphone on the ends of an air tube to detect the swallowing sound through the pressure propagation along the air tube. Two studies employed two microphones but with different principles
[33,34][30][31]. The major laryngeal microphone was used to record the swallowing sound directly in both cases. On the one hand, Fukuike et al.
[34][31] further improved the system accuracy by adding a condenser microphone inside the nostril. On the other hand, Fontana et al.
[33][30] used the condenser microphone to detect the swallowing sound in the subsonic range. Additionally, Amft and Troster
[31][36] integrated a stethoscope microphone with sEMG of the cricopharyngeus muscle to recognize swallowing. They also presented separate analyses on dietary movement activity and chewing activity recognition using other sensors
[31][36].
Table 1. Instrument setting, location, and assessment procedures in the reviewed articles.
Author (Year) | Sensors | Location | Swallowing or Non-swallowing events
Afkari [30][35] | Miniature ACC (NM) | Level of thyroid cartilage | Dry (saliva) swallowing; drinking 100 mL of water as fast as possible.
 | sEMG (NM) | Level of cricopharyngeus muscle |
 | Omnidirectional electret MIC (NM) | Level of cricoid cartilage |
Amft and Troster [31][36] | sEMG (Nexus-10, MindMedia) | Collar at infra-hyoid throat region | Participants were allowed to move, chew, and speak normally during the recording. They were asked to drink 5 mL and 15 mL of water, and to eat a spoonful of yogurt and 2 cm³ of bread in one piece.
 | Stethoscope MIC (ECM-C115, Sony) | Collar below hyoid |
Bi et al. [32][29] | Throat MIC (NM) | Over neck close to the jaw | Apple, carrot, chip, cookie, peanut, walnut, water.
Fontana et al. [33][30] | Condenser MIC (CZN-15E) | Thyroid cartilage level, one side of the neck | 5 min quiet sitting, 5 min reading aloud, then a meal of 4 food items (apple, 40 g crackers, low-fat yogurt, 250 mL water) consumed with unlimited time.
 | Piezoelectric MIC (IASUS NT, IASUS Concept Ltd.) | Over laryngopharynx |
Fukuike et al. [34][31] | Condenser MIC (WM-61A, Panasonic, Osaka, Japan) | Fixed on a silicone tube and placed inside the left nostril | Taking a meal and stepping on a foot pedal when swallowing. Yawn, cough, sigh, throat clearing, gargling, and sipping tea.
 | Laryngeal MIC (SH-12iK, Nanzu, Shizuoka, Japan) | Over anterior larynx |
Kurihara et al. [35][32] | Bi-directional electret condenser MIC (EM114, Primo Co., Ltd.) | MIC attached to air tube hung over neck with anterior opening | Swallowing nothing, tea (10 mL), tea with a thickener (10 mL), rice cake (10 g).
Lee et al. [36][37] | Dual-axis ACC (ADXL322) | Below thyroid cartilage, aligned in anterior-posterior and superior-inferior axes | Water, barium suspension (Ba), nectar-thick apple juice (Ne), honey-thick apple juice (Ho), spoon-thick apple juice (Sp).
 | Submental mechanomyography (developed by Silva and Chau [42][41]) | On the geniohyoid |
 | Pressure Transducer (PTAFLITE, Glass Technologies) | At nasal cannula |
Makeyev et al. [37][38] | Throat microphone (IASUS NT, IASUS Concept Ltd.) * | Over laryngopharynx | 10 min silent, 10 min reading aloud, a meal of mixed size consumed with unlimited time (including cheese pizza, yogurt, apple, peanut butter sandwich), 10 min silent, 10 min reading aloud.
Sazonov et al. [38][39] | Throat microphone (IASUS NT, IASUS Concept Ltd.) * | Over laryngopharynx | 20 min rest, a meal, then 20 min rest.
Sejdic et al. [39][34] | Dual-axis accelerometer (ADXL322) | Anterior to cricoid cartilage, along anterior-posterior and superior-inferior axes | Dry (saliva) swallow; drinking water in natural and chin-tucked positions.
Skowronski et al. [40][33] | Miniature surface-mounted MIC (VT506, Voice Technologies, Zurich, Switzerland) | Laterally below the cricoid cartilage | 5 mL liquid, dry swallow, head move, yawn, sniff, tongue move, speech, hum, throat clear, cough.
Accelerometry measurements were presented in three papers
[30][35] and two incorporated in the multimodal system
[36,39][34][37]. Afkari
[30][35] implemented a tri-modal system using miniature accelerometers, sEMG, and an omnidirectional electret microphone, while Lee et al.
[36][37] targeted the nasal airflow measured by a pressure transducer and the submental MMG developed previously
[42][41]. All these devices made use of biaxial accelerometers aligned in anterior-posterior and superior-inferior directions
[30,36,39][34][35][37].
The locations of the sensors varied, which may depend on the sensor types and the suspension methods. Although a few studies only vaguely mentioned that the sensors should be attached over the laryngopharynx, the thyroid cartilage and the cricoid cartilage were the two anatomical landmarks highlighted
[30,36,39,40][33][34][35][37]. The sensors could be glued or taped to the throat surface
[30,39][34][35], collared
[31][36], or in the form of a necklace
[33,34,35,36][30][31][32][37].
3. Assessment Protocol for Swallowing
Since swallowing is a continuous process, segmenting a time frame to stamp the swallowing episode is essential to define the “sample counts” for evaluating accuracy. The episode stamping method could be classified as event-based or epoch-based. Two studies attempted both event-based and epoch-based approaches for the evaluation
[37,38][38][39]. For the other studies, five
[30,32,34,35,39][29][31][32][34][35] adopted the event-based approach, and four
[31,33,36,40][30][33][36][37] adopted the epoch-based approach.
For event-based stamping, the conditions were controlled, and the researchers instructed the participants to perform one maneuver at a time, so that each event could be easily labeled for a period. For the epoch-based approach, the participants were often free to conduct a series of activities during the recording. The recording was then sliced into several non-overlapping time units (epochs) by algorithms or data processing techniques, and each epoch was manually labeled by revisiting the videotape. Alternatively, participants might be asked to press a button or pedal during their swallowing process for labeling
[33,34][30][31].
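The epoch-based stamping described above can be sketched as follows. This is a minimal illustration, not the procedure of any reviewed study: the function name `label_epochs`, the interval format, and the rule that an epoch counts as a swallow whenever it overlaps an annotated interval (for example, from a button or pedal press) are all assumptions.

```python
def label_epochs(duration_s, epoch_s, swallow_intervals):
    """Slice [0, duration_s) into non-overlapping epochs and label each one.

    An epoch is labeled 1 (swallow) if it overlaps any annotated interval,
    otherwise 0. The overlap rule is an assumption made for illustration.
    """
    n_epochs = int(duration_s // epoch_s)
    labels = []
    for i in range(n_epochs):
        start, end = i * epoch_s, (i + 1) * epoch_s
        # Two intervals overlap when each one starts before the other ends.
        overlaps = any(s < end and e > start for s, e in swallow_intervals)
        labels.append(1 if overlaps else 0)
    return labels


# Hypothetical 6 s recording, 1.5 s epochs, two annotated swallows.
labels = label_epochs(6.0, 1.5, [(1.0, 1.6), (4.6, 5.2)])
print(labels)  # [1, 1, 0, 1]
```

Note that a swallow straddling an epoch boundary marks both epochs under this rule, which is one reason epoch counts and event counts can differ.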
The swallowing protocol could be broadly classified as non-swallowing maneuvers and swallowing maneuvers, while some studies attempted to have a fine-grained classification within these two categories (Table 1). For non-swallowing, the dry swallow was referred to as saliva swallowing
[30,39,40][33][34][35], while assessing non-swallowing through silence or talking was often implemented through an epoch-based approach (detailed in the next paragraph)
[31,33,37,38][30][36][38][39]. Some studies investigated different types of throat movements as non-swallowing events, including yawning, coughing, sighing, sniffing, throat clearing, gargling, speech, and tongue moving
[34,40][31][33]. Besides, it should be noted that Fukuike et al.
[34][31] considered sipping tea as a non-swallowing maneuver. On the other hand, there was no consensus on the kinds of food used to prompt swallowing events. For the epoch-based approach, participants were asked to take a meal with a variety of food, without being restricted to one kind of food at a time during the data collection. Besides, drinking water appeared in most of the articles
[30,31,32,33,36,39,40][29][30][33][34][35][36][37], while yogurt was the most common semifluid food
[31,33,37][30][36][38]. For solid food, bread, crackers, cookies, pizza, sandwiches, fruit, and peanuts were some examples considered
[31,32,33,37][29][30][36][38].
4. Segmentation and Feature Extraction Strategy
Researchers had to identify whether a swallowing event happened within a time frame because of the continuous nature of swallowing. Two studies manually segmented the time window
[30,40][33][35], while four studies specified the duration of the segmented time window, ranging from 200 ms to 1.5 s
[31,33,36,37][30][36][37][38]. Fukuike et al.
[34][31], Kurihara et al.
[35][32], and Sejdic et al.
[39][34] utilized the semblable wave period, template matching, and minimum description length-based segmentation, respectively. Two studies accounted for randomized sampling concepts in the segmentation process, including the Hidden Markov Model (HMM) conducted by Bi et al.
[32][29] and the grid search conducted by Sazonov et al.
[38][39].
For the feature extraction strategy, four studies exploited the time-domain raw signals for classification
[30,33,34,39][30][31][34][35], while one made use of the frequency-domain raw signals
[38][39]. Predetermined features were computed for analysis in three articles
[32,35,36][29][32][37]. For example, Amft and Troster
[31][36] considered and fused the spectral features (band energy, autocorrelation coefficient, and energy) and EMG features (total and maximum). Three studies performed some data reduction processes and established specific index parameters before the classification process
[31,37,40][33][36][38], such as using Principal Component Analysis (PCA).
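As an illustration of fusing predetermined features from two modalities, the sketch below computes a few simple descriptors for one time window. The exact feature definitions used in the reviewed studies are not given here, so everything in the code is an assumption: signal energy as the sum of squares, a normalized lag-1 autocorrelation coefficient, and the total and maximum of the rectified EMG. Band energy is omitted to keep the example dependency-free.

```python
def energy(x):
    """Signal energy as the sum of squared samples (assumed definition)."""
    return sum(v * v for v in x)


def autocorr_coeff(x, lag=1):
    """Normalized autocorrelation coefficient of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0  # constant window: define the coefficient as zero
    num = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return num / denom


def emg_features(emg):
    """Total and maximum of the rectified EMG window (assumed definitions)."""
    rectified = [abs(v) for v in emg]
    return sum(rectified), max(rectified)


def fuse_features(audio_window, emg_window):
    """Concatenate acoustic and EMG descriptors into one feature vector."""
    total, peak = emg_features(emg_window)
    return [energy(audio_window), autocorr_coeff(audio_window), total, peak]


# Hypothetical toy windows, not real recordings.
print(fuse_features([1, -1, 1, -1], [0.5, -2.0, 1.0]))  # [4, -0.75, 3.5, 2.0]
```

The resulting vector would then be passed to a data reduction step or directly to a classifier.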
5. Classification and Performance
Depending on the nature of the classification (i.e., swallowing vs. non-swallowing or classification of different food types) and the stamping approach (i.e., event-based vs. epoch-based), studies might apply different classification approaches. In order to classify/identify the swallowing event, three studies applied a threshold-based approach
[30,33,34][30][31][35], while others implemented statistical or machine learning models
[31,32,35,36,37,38,39,40][29][32][33][34][36][37][38][39]. These models included logistic regression, decision tree, Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Artificial Neural Network (ANN), etc.
For the threshold-based approach, a swallowing event was often recognized whenever the collected signal exceeded a predefined threshold value for more than a certain time. Nevertheless, the cut-off level or time range was not adequately justified in the papers, and most of them were empirical. Amft and Troster
[31][36] compared acoustics, accelerometry, and EMG data with a set of reference voltages and integrated them by a logic gate (AND) but without justifying the source of the reference set. Fontana et al.
[33][30] established individualized threshold levels based on the collected signal during a reading task. They also suggested that the time range threshold should be 0.6 s
[33][30], which was an estimated time for a complete swallow
[38][39]. On the other hand, Fukuike et al.
[34][31] used twice the mean baseline as the threshold level, and a recognized event had to last longer than 0.35 s.
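The threshold-based rule can be sketched as below. The defaults follow the figures quoted above (twice the mean baseline, longer than 0.35 s), but the function name, the use of mean absolute amplitude as the baseline level, and the sample data are assumptions for illustration rather than any study's implementation.

```python
def detect_events(signal, fs, baseline, factor=2.0, min_duration_s=0.35):
    """Return (start_s, end_s) for runs where |signal| stays above threshold.

    The threshold is `factor` times the mean absolute baseline amplitude
    (an assumed definition of "mean baseline"). Runs that do not last
    longer than `min_duration_s` are discarded.
    """
    threshold = factor * (sum(abs(v) for v in baseline) / len(baseline))
    events, start = [], None
    for i, v in enumerate(signal):
        above = abs(v) > threshold
        if above and start is None:
            start = i  # run begins
        elif not above and start is not None:
            if (i - start) / fs > min_duration_s:
                events.append((start / fs, i / fs))
            start = None  # run ends
    # Close a run that is still open at the end of the recording.
    if start is not None and (len(signal) - start) / fs > min_duration_s:
        events.append((start / fs, len(signal) / fs))
    return events


# Hypothetical signal sampled at 10 Hz: one 0.5 s burst and one 0.2 s burst.
baseline = [0.1] * 10
signal = [0.0] * 3 + [1.0] * 5 + [0.0] * 4 + [1.0] * 2 + [0.0] * 3
print(detect_events(signal, fs=10, baseline=baseline))  # [(0.3, 0.8)]
```

Only the longer burst is reported, which shows how the duration criterion suppresses brief artifacts that exceed the amplitude threshold.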
For the evaluation of classification performance, accuracy, sensitivity, specificity, and positive predictive value (PPV) are common evaluation metrics. Sensitivity and PPV are also sometimes termed recall and precision, respectively, from the perspective of information retrieval in the field of data science
[43][42]. Sensitivity represents the proportion of swallowing events/classes that were recognized when that event/class did occur, while specificity is the proportion of cases correctly recognized as not a swallowing event/class when that event/class had not occurred. PPV is the proportion of recognized events/classes that were true events/classes. Accuracy is the ratio of correct classifications over the total number of tests. Besides, one study
[32][29] supplemented the receiver operating characteristics (ROC) curve to demonstrate the discrimination capacity.
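For reference, the four metrics can be computed from the counts of a binary confusion matrix as in the sketch below. The function name and the example counts are hypothetical and do not come from any reviewed study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from binary confusion-matrix counts.

    tp/fn: swallowing events detected/missed.
    tn/fp: non-swallowing samples correctly rejected/falsely detected.
    """
    return {
        "sensitivity": tp / (tp + fn),            # also called recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),                    # also called precision
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 50 swallows among 1000 labeled samples.
m = classification_metrics(tp=40, fp=20, tn=930, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.970 while PPV is only about 0.667, which illustrates why accuracy alone can be misleading when swallows are rare.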
As a rule of thumb, classifiers require independent datasets for training and testing (model evaluation) to better evaluate the generalization capability. Sejdic et al.
[39][34] evaluated the model using both synthetic tests and real swallowing signals. Although the number of folds differed, most of the model-based classifiers applied k-fold cross-validation, while Kurihara et al.
[35][32] adopted a leave-one-out approach. In addition, Lee et al.
[36][37] calculated the accuracy metrics based on a bootstrapping augmentation after a 10-fold cross-validation of the model to account for the unbalanced class sizes.
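The relationship between k-fold and leave-one-out validation can be sketched with a small index splitter. This is a simplified illustration under stated assumptions: samples are assigned to folds in an interleaved order without shuffling or stratification, and the function name is hypothetical. Leave-one-out is the special case where k equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Samples are assigned to folds in an interleaved manner (0, k, 2k, ...),
    a simplification; practical pipelines usually shuffle or stratify.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in range(n_samples) if j not in held_out]
        yield train_idx, test_idx


# 3-fold split of six samples.
for train_idx, test_idx in k_fold_indices(6, 3):
    print("train", train_idx, "test", test_idx)

# Leave-one-out: every sample is held out exactly once.
loo_splits = list(k_fold_indices(4, 4))
print(len(loo_splits))  # 4
```

In each round the classifier would be trained on `train_idx` and evaluated on `test_idx`, so no sample is used for both within the same round.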
The 11 reviewed articles involved 15 classifiers in the data synthesis. Accuracy varied widely among studies, ranging from 68.2% to 96.8%. Some studies had a classification performance as unreliable as a random guess (40–60%). Besides, although the accuracy of the reviewed articles was generally satisfactory, the outcomes of other metrics (such as sensitivity, specificity, and PPV) could differ considerably between studies. For example, Makeyev et al.
[37][38] attained 44% sensitivity and 99% specificity in their epoch-based SVM model. Amft and Troster
[31][36] obtained a 20% positive predictive value and 68% sensitivity in their classification method using the agreement of detectors. This could be due to imbalanced class sizes, especially for epoch-based approaches.
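The effect of class imbalance can be shown with a short worked calculation. The sensitivity and specificity below are the 44% and 99% quoted above, but the 5% share of swallowing epochs is a purely hypothetical assumption chosen for illustration and is not reported in this text.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Overall accuracy as a prevalence-weighted mix of the two rates."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Assumed: 5% of epochs contain a swallow (hypothetical prevalence).
acc = accuracy_from_rates(0.44, 0.99, 0.05)
ppv = ppv_from_rates(0.44, 0.99, 0.05)
print(f"accuracy = {acc:.4f}, PPV = {ppv:.3f}")  # accuracy = 0.9625, PPV = 0.698
```

Under this assumed prevalence, a classifier that misses more than half of the swallows would still report an accuracy above 96%, because the abundant non-swallowing epochs dominate the average.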