Swallow Detection with Acoustics and Accelerometric-Based Wearable Technology

Swallowing disorders, especially dysphagia, can lead to malnutrition and dehydration and may ultimately cause fatal aspiration. Benchmark swallowing assessments, such as videofluoroscopy and endoscopy, are expensive and invasive. Wearable technologies using acoustic and accelerometric sensors could offer accessible, home-based, long-term assessment. Identifying valid swallow events is the first step toward enabling the technology for clinical applications.

  • dysphagia
  • deglutition disorder
  • eating disorder

1. Introduction

Swallowing is a natural yet essential part of daily life. Humans perform spontaneous swallows (of saliva and food/drink) 0.98 times per minute on average [1]. With different definitions and measurement techniques, Lear et al. [2] suggested that humans swallow approximately 200 to 1000 times a day, while Rudney et al. [3] reported that healthy humans perform 18 to 400 spontaneous swallows per hour. However, some people, especially older adults and people with chronic conditions, have difficulty swallowing. Swallowing difficulty is also termed dysphagia: dysphagic individuals have problems chewing and swallowing food or liquids, experience pain during swallowing, or may even be unable to swallow. Notably, the bolus may enter the airway and lungs, leading to aspiration pneumonia, which can be fatal yet clinically silent [4]. Dysphagia is generally chronic but deteriorates as cognition and function worsen with the progression of dementia or other neurological disorders [5,6]. Therefore, continuous monitoring or assessment could be necessary to identify stages at high risk of choking or aspiration for timely management and rehabilitation [7,8]. In addition, dysphagia patients may be reluctant to eat because of fear of choking, pain, or difficulty, which leads to malnutrition, dehydration, depression, and anorexia [9]. More than one-third of older adults reported dysphagia or swallowing disorders during their lifetime, which were associated with stroke, diabetes, Parkinson's disease, and Alzheimer's disease [10,11]. Howden [12] and Ney et al. [13] reported that the prevalence of dysphagia could be 22% and 40% for adults aged over 50 and 60, respectively. A recent survey found that one in every six adults reported swallowing difficulty, and some of them might not seek medical care [14].
Swallowing assessment or monitoring is imperative to facilitate early diagnosis, management, and rehabilitation, thereby reducing mortality and improving the quality of life of individuals with dysphagia. Currently, the Videofluoroscopic Swallowing Study (VFSS) and Fiberoptic Endoscopic Evaluation of Swallowing (FEES) are the gold standards for instrumented assessment [15]. VFSS applies a dynamic fluoroscopic imaging technique to visualize the detailed swallowing process in the oral, pharyngeal, laryngeal, and oesophageal regions in real time [16]. In FEES, practitioners pass an endoscope through the nose to inspect the pharyngeal and laryngeal structures while patients speak, eat, and breathe [17]. However, VFSS and FEES are expensive, cause discomfort and risks to the patients, and can only be conducted occasionally.
Non-instrumental bedside assessments of swallowing are alternatives that trade off cost against test frequency and could readily be adopted in nursing homes or care homes by occupational therapists or speech therapists. A standard bedside screening process involves an anamnesis assessment, a morphodynamical evaluation, a gustative function test with specific stimulation, and an oral feeding test [18]. Other related tests include the 3-ounce water swallow test [19], the cough reflex test [20], and cervical auscultation, in which a stethoscope is used to amplify and listen to the swallowing sound [21]. Most of these instruments lack sensitivity and predictive strength and show poor reproducibility and consistency across protocols [21,22], but they can be routinely conducted for initial screening of swallowing function [23].
Cervical auscultation refers to the measurement of sound or vibration at the throat for swallowing assessment, traditionally conducted by physicians using a stethoscope [24]. Wearable technology, such as accelerometry, acoustics, and electromyography, could be more robust and facilitate non-invasive, non-ionizing, continuous monitoring or screening at lower cost. Swallowing accelerometry monitors the vibration transmitted through the aerodigestive tract and the hyoid bone kinetics during swallowing [25]. The acoustic technique uses an inexpensive microphone to record swallowing sounds and is sometimes integrated with the accelerometry approach [26]. Takahashi et al. [27] were among the pioneers who systematically reviewed and evaluated acoustic methods for the detection of swallowing sounds, while Taveira et al. [28] reviewed and compared the diagnostic validity of swallowing-sound-based methods against videofluoroscopy. Since then, further developments have employed multimodal sensors, advanced data processing techniques, and machine learning models.

2. Instrument Configuration

Among the 11 eligible articles, five utilized only acoustics (a microphone) [32,33,34,35,40], one utilized only accelerometers [39], and five applied a multimodal system [30,31,36,37,38]. However, two of the multimodal articles did not fully describe the modalities other than acoustics [30,37]. The other multimodal systems involved surface electromyography (sEMG), mechanomyography (MMG), and an airflow pressure sensor.
As shown in Table 1, a single microphone for detecting swallowing sounds appeared in three articles [32,35,40]. Skowronski et al. [40] made use of a miniature surface-mounted microphone and characterized the signal using Human Factor Cepstral Coefficients [41], which were originally used for automatic speech recognition. Bi et al. [32] developed the "AutoDietary" system using a throat microphone; the system also displayed food-type recognition results to the users for personal health management. Kurihara et al. [35] customized their device by attaching a bi-directional electret condenser microphone to the end of an air tube, detecting the swallowing sound through pressure propagation along the tube. Two studies employed two microphones but with different principles [33,34]; in both cases, the main laryngeal microphone recorded the swallowing sound directly. Fukuike et al. [34] further improved system accuracy by adding a condenser microphone at the nostril, whereas Fontana et al. [33] used the condenser microphone to detect the swallowing sound in the subsonic range. Additionally, Amft and Troster [31] integrated a stethoscope microphone with sEMG of the cricopharyngeus muscle to recognize swallowing; they also presented separate analyses of dietary movement and chewing activity recognition using other sensors [31].
Table 1. Instrument setting, location, and assessment procedures in the reviewed articles.
Author (Year) | Sensors | Location | Swallowing or non-swallowing events
Afkari [30] | Miniature ACC (NM); sEMG (NM); omnidirectional electret MIC (NM) | Level of thyroid cartilage; level of cricopharyngeus muscle; level of cricoid cartilage | Dry (saliva) swallowing; drinking 100 mL of water as fast as possible.
Amft and Troster [31] | sEMG (Nexus-10, MindMedia); stethoscope MIC (ECM-C115, Sony) | Collar at infra-hyoid throat region; collar below hyoid | Participants were allowed to move, chew, and speak normally during the recording; they drank 5 mL and 15 mL of water, ate a spoonful of yogurt, and ate 2 cm3 of bread in one piece.
Bi et al. [32] | Throat MIC (NM) | Over the neck close to the jaw | Apple, carrot, chip, cookie, peanut, walnut, water.
Fontana et al. [33] | Condenser MIC (CZN-15E); piezoelectric MIC (IASUS NT, IASUS Concept Ltd.) | Thyroid cartilage level, one side of the neck; over laryngopharynx | 5 min quiet sitting, 5 min reading aloud, then a meal of four food items (apple, 40 g crackers, low-fat yogurt, 250 mL water) consumed with unlimited time.
Fukuike et al. [34] | Condenser MIC (WM-61A, Panasonic, Osaka, Japan); laryngeal MIC (SH-12iK, Nanzu, Shizuoka, Japan) | Fixed on a silicone tube placed inside the left nostril; over the anterior larynx | Swallowing: taking a meal and stepping on a foot pedal when swallowing. Non-swallowing: yawn, cough, sigh, throat clearing, gargling, and sipping tea.
Kurihara et al. [35] | Bi-directional electret condenser MIC (EM114, Primo Co., Ltd.) | MIC attached to an air tube hung over the neck with an anterior opening | Swallowing nothing, tea (10 mL), tea with a thickener (10 mL), rice cake (10 g).
Lee et al. [36] | Dual-axis ACC (ADXL322); submental mechanomyography (developed by Silva and Chau [42]); pressure transducer (PTAFLITE, Glass Technologies) | Below thyroid cartilage, aligned in anterior-posterior and superior-inferior axes; on the geniohyoid; at nasal cannula | Water, barium suspension (Ba), nectar-thick apple juice (Ne), honey-thick apple juice (Ho), spoon-thick apple juice (Sp).
Makeyev et al. [37] | Throat MIC (IASUS NT, IASUS Concept Ltd.) * | Over laryngopharynx | 10 min silent, 10 min reading aloud, a meal of mixed items (including cheese pizza, yogurt, apple, peanut butter sandwich) consumed with unlimited time, 10 min silent, 10 min reading aloud.
Sazonov et al. [38] | Throat MIC (IASUS NT, IASUS Concept Ltd.) * | Over laryngopharynx | 20 min rest, a meal, then 20 min rest.
Sejdic et al. [39] | Dual-axis accelerometer (ADXL322) | Anterior to the cricoid cartilage, along anterior-posterior and superior-inferior axes | Dry (saliva) swallow; drinking water in natural and chin-tucked positions.
Skowronski et al. [40] | Miniature surface-mounted MIC (VT506, Voice Technologies, Zurich, Switzerland) | Laterally below the cricoid cartilage | 5 mL liquid, dry swallow, head move, yawn, sniff, tongue move, speech, hum, throat clear, cough.
Accelerometry measurements were presented in three papers [30,36,39], two of which incorporated the accelerometer in a multimodal system [30,36]. Afkari [30] implemented a tri-modal system using miniature accelerometers, sEMG, and an omnidirectional electret microphone, while Lee et al. [36] additionally targeted nasal airflow measured by a pressure transducer and the submental MMG developed previously [42]. All these devices used biaxial accelerometers aligned in the anterior-posterior and superior-inferior directions [30,36,39].
There were variations in the locations of the sensors, which may depend on the sensor types and suspension methods. Although a few studies only vaguely mentioned that the sensors should be attached over the laryngopharynx, the thyroid cartilage and the cricoid cartilage were the two anatomical landmarks highlighted [30,36,39,40]. The sensors could be glued or taped to the throat surface [30,39], mounted on a collar [31], or worn in the form of a necklace [33,34,35,36].

3. Assessment Protocol for Swallowing

Since swallowing is a continuous process, segmenting a time frame to stamp each swallowing episode is essential to define the "sample counts" for evaluating accuracy. The stamping method could be classified as event-based or epoch-based. Two studies attempted both event-based and epoch-based approaches for the evaluation [37,38]. Of the remaining studies, five [30,32,34,35,39] adopted the event-based approach and four [31,33,36,40] adopted the epoch-based approach.
For event-based stamping, the conditions were controlled and the researchers instructed the participants to perform one maneuver at a time, so that each event could easily be labeled over a period. For the epoch-based approach, the participants were usually free to conduct a series of activities; the recording was then sliced into non-overlapping time units (epochs) by algorithms or data processing techniques and manually labeled by revisiting the videotape. Alternatively, participants might be asked to press a button or step on a pedal during swallowing to support labeling [33,34].
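As an illustration of epoch-based stamping, the sketch below slices a recording into fixed, non-overlapping epochs and marks an epoch as a swallow if it overlaps any manually annotated swallow interval. The 1.5 s epoch length, the sampling rate, and the annotation values are illustrative assumptions, not settings taken from any of the reviewed studies.

```python
import numpy as np

def epoch_labels(n_samples, fs, epoch_s, swallow_intervals):
    """Split a recording into non-overlapping epochs and label each one.

    swallow_intervals: list of (start_s, end_s) tuples from manual annotation
    (e.g., video review or a button/pedal press). Returns (epoch_bounds, labels)
    where labels[i] is 1 if epoch i overlaps any annotated swallow, else 0.
    """
    epoch_len = int(epoch_s * fs)
    n_epochs = n_samples // epoch_len
    bounds = [(i * epoch_len, (i + 1) * epoch_len) for i in range(n_epochs)]
    labels = np.zeros(n_epochs, dtype=int)
    for i, (s, e) in enumerate(bounds):
        t0, t1 = s / fs, e / fs
        if any(t0 < end and start < t1 for start, end in swallow_intervals):
            labels[i] = 1
    return bounds, labels

# Example: 60 s of hypothetical throat-microphone data at 1 kHz with two annotated swallows.
fs = 1000
bounds, labels = epoch_labels(60 * fs, fs, epoch_s=1.5,
                              swallow_intervals=[(12.3, 13.1), (41.0, 41.8)])
print(labels.sum(), "of", len(labels), "epochs contain a swallow")
```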
The swallowing protocols could be broadly classified into non-swallowing maneuvers and swallowing maneuvers, and some studies attempted a finer-grained classification within these two categories (Table 1). The dry swallow referred to saliva swallowing [30,39,40], while non-swallowing was often assessed through silence or talking under an epoch-based approach, as detailed above [31,33,37,38]. Some studies investigated different types of throat movements as non-swallowing events, including yawning, coughing, sighing, sniffing, throat clearing, gargling, speech, and tongue movement [34,40]. Notably, Fukuike et al. [34] considered sipping tea a non-swallowing maneuver. On the other hand, there was no consensus on the kinds of food used to prompt swallowing events. In the epoch-based approach, participants were asked to take a meal with a variety of foods without being restricted to one kind of food at a time during data collection. Drinking water appeared in most of the articles [30,31,32,33,36,39,40], while yogurt was the most common semifluid food [31,33,37]. For solid food, bread, crackers, cookies, pizza, sandwiches, fruit, and peanuts were among the examples considered [31,32,33,37].

4. Segmentation and Feature Extraction Strategy

Because of the continuous nature of swallowing, researchers had to identify whether a swallowing event happened within a given time frame. Two studies manually segmented the time window [30,40], while four studies specified the duration of the segmented time window, ranging from 200 ms to 1.5 s [31,33,36,37]. Fukuike et al. [34], Kurihara et al. [35], and Sejdic et al. [39] utilized the semblable wave period, template matching, and minimum-description-length-based segmentation, respectively. Two studies incorporated probabilistic modeling or search concepts in the segmentation process, namely the Hidden Markov Model (HMM) used by Bi et al. [32] and the grid search used by Sazonov et al. [38].
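The exact template-matching algorithm of Kurihara et al. [35] is not reproduced here; the sketch below shows one plausible reading of the idea, using normalized cross-correlation between a reference swallow template and the incoming signal to flag candidate segments. The 0.7 threshold and the availability of a pre-recorded template are illustrative assumptions.

```python
import numpy as np

def match_template(signal, template, threshold=0.7):
    """Flag candidate swallow segments by normalized cross-correlation
    between the signal and a reference swallow template."""
    t = (template - template.mean()) / (template.std() + 1e-12)
    n = len(t)
    scores = np.empty(len(signal) - n + 1)
    for i in range(len(scores)):
        w = signal[i:i + n]
        w = (w - w.mean()) / (w.std() + 1e-12)
        scores[i] = float(np.dot(w, t)) / n   # correlation coefficient in [-1, 1]
    candidates = np.where(scores > threshold)[0]
    return candidates, scores

# Usage (with hypothetical arrays): candidates, scores = match_template(recording, reference_swallow)
```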
For the feature extraction strategy, four studies used the time-domain raw signals for classification [30,33,34,39], while one used the frequency-domain raw signals [38]. Predetermined features were computed for analysis in three articles [32,35,36]. For example, Amft and Troster [31] fused spectral features (band energy, autocorrelation coefficient, and total energy) with EMG features (total and maximum values). Three studies performed data reduction and established specific index parameters before the classification process, for example using Principal Component Analysis (PCA) [31,37,40].
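As a rough illustration of the kind of predetermined features mentioned above (band energy, autocorrelation coefficient, total energy), the sketch below computes simple versions of them for one segmented window. The band edges, lag choice, and window variables are assumptions for illustration, not the settings used in the reviewed studies.

```python
import numpy as np

def window_features(x, fs, band=(300.0, 3000.0)):
    """Compute a few simple features for one segmented window x (NumPy array)."""
    x = x - x.mean()
    total_energy = float(np.sum(x ** 2))
    # Band energy from the FFT power spectrum within an assumed frequency band.
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band_energy = float(spec[(freqs >= band[0]) & (freqs <= band[1])].sum())
    # Lag-1 autocorrelation coefficient.
    ac1 = float(np.corrcoef(x[:-1], x[1:])[0, 1]) if len(x) > 2 else 0.0
    return {"total_energy": total_energy,
            "band_energy": band_energy,
            "autocorr_lag1": ac1}
```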

5. Classification and Performance

Depending on the nature of the classification (i.e., swallowing vs. non-swallowing, or classification of different food types) and the stamping approach (i.e., event-based vs. epoch-based), studies applied different classification approaches. To classify or identify swallowing events, three studies applied a threshold-based approach [30,33,34], while the others implemented statistical or machine learning models [31,32,35,36,37,38,39,40]. These models included logistic regression, decision trees, Gaussian Mixture Models (GMM), Support Vector Machines (SVM), and Artificial Neural Networks (ANN), among others.
In the threshold-based approach, a swallowing event was typically recognized whenever the collected signal exceeded a predefined threshold value for more than a certain duration. Nevertheless, the cut-off level or time range was not adequately justified in the papers, and most choices were empirical. Afkari [30] compared the acoustic, accelerometric, and EMG signals with a set of reference voltages and integrated them through a logic (AND) gate, but without justifying the source of the reference set. Fontana et al. [33] established individualized threshold levels based on the signal collected during a reading task; they also suggested a time-range threshold of 0.6 s [33], which was an estimated time for a complete swallow [38]. Fukuike et al. [34] used twice the mean baseline as the threshold level and required a recognized event to last longer than 0.35 s.
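A minimal sketch of this generic threshold-plus-duration rule is shown below, assuming a single pre-processed channel. The moving-average envelope, the "twice the baseline mean" threshold (loosely echoing Fukuike et al. [34]), and the 0.35 s minimum duration are illustrative choices, not any study's exact pipeline.

```python
import numpy as np

def detect_events(x, fs, min_duration_s=0.35, smooth_s=0.05):
    """Flag swallow candidates where the signal envelope stays above a
    threshold (2 x mean envelope) for at least min_duration_s seconds."""
    k = max(1, int(smooth_s * fs))
    env = np.convolve(np.abs(x), np.ones(k) / k, mode="same")  # moving-average envelope
    threshold = 2.0 * env.mean()
    above = env > threshold
    events, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if (i - start) / fs >= min_duration_s:
                events.append((start / fs, i / fs))
            start = None
    if start is not None and (len(above) - start) / fs >= min_duration_s:
        events.append((start / fs, len(above) / fs))
    return events
```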
For the evaluation of classification performance, accuracy, sensitivity, specificity, and positive predictive value (PPV) are common metrics. Sensitivity and PPV are also termed recall and precision, respectively, from the perspective of information retrieval in data science [43]. Sensitivity is the proportion of swallowing events/classes correctly recognized when that event/class did occur, while specificity is the proportion of non-events correctly recognized when the event/class did not occur. Accuracy is the ratio of correct classifications to the total number of tests. In addition, one study [32] supplemented the receiver operating characteristic (ROC) curve to demonstrate discrimination capacity.
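These definitions correspond to the usual confusion-matrix formulas; a small reference sketch is given below (the variable names and example counts are generic, not taken from any reviewed study).

```python
def swallow_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a swallow/non-swallow classifier."""
    sensitivity = tp / (tp + fn)            # recall: swallows correctly detected
    specificity = tn / (tn + fp)            # non-swallows correctly rejected
    ppv = tp / (tp + fp)                    # precision: detections that were real swallows
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, ppv, accuracy

# Example with arbitrary counts.
print(swallow_metrics(tp=44, fp=10, tn=990, fn=56))
```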
As a rule of thumb, classifiers require independent datasets for training and testing (model evaluation) to better assess generalization capability. Sejdic et al. [39] evaluated their model using both synthetic tests and real swallowing signals. Despite differing numbers of folds, most of the model-based classifiers applied k-fold cross-validation, while Kurihara et al. [35] adopted a leave-one-out approach. In addition, Lee et al. [36] calculated the accuracy metrics based on bootstrapping augmentation after a 10-fold cross-validation of the model to account for unbalanced class sizes.
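A minimal sketch of the k-fold evaluation pattern is shown below, assuming a NumPy feature matrix X and label vector y are already available; the use of scikit-learn's StratifiedKFold and an SVM classifier is illustrative and not a reproduction of any reviewed study's pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def kfold_accuracy(X, y, k=10):
    """Average held-out accuracy over k stratified folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC().fit(X[train_idx], y[train_idx])       # train on k-1 folds
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    return float(np.mean(scores))
```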
The 11 reviewed articles contributed 15 classifiers to the data synthesis. There was high variation in accuracy across studies, ranging from 68.2% to 96.8%, and some classifiers performed as unreliably as a random guess (40-60%). Moreover, although the accuracy reported in the reviewed articles was generally satisfactory, the other metrics (such as sensitivity, specificity, and PPV) could differ considerably between studies. For example, Makeyev et al. [37] attained 44% sensitivity and 99% specificity in their epoch-based SVM model, while Amft and Troster [31] obtained a 20% positive predictive value and 68% sensitivity with their agreement-of-detectors classification method. The reason could be the problem of imbalanced class sizes, especially for epoch-based approaches.
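A brief worked example with assumed, illustrative numbers shows the mechanism: holding sensitivity and specificity fixed, PPV falls as swallow epochs become rarer relative to non-swallow epochs.

```python
def ppv(sens, spec, prevalence):
    """PPV as a function of sensitivity, specificity, and the fraction of
    epochs that actually contain a swallow (prevalence)."""
    tp = sens * prevalence
    fp = (1 - spec) * (1 - prevalence)
    return tp / (tp + fp)

# Same illustrative classifier (44% sensitivity, 99% specificity), different imbalance levels.
print(round(ppv(0.44, 0.99, 0.20), 2))   # 1 in 5 epochs is a swallow  -> ~0.92
print(round(ppv(0.44, 0.99, 0.02), 2))   # 1 in 50 epochs is a swallow -> ~0.47
```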

This entry is adapted from the peer-reviewed paper 10.3390/ijerph20010170
