Infant Cry Signal Diagnostic System: History

Early diagnosis of medical conditions in infants is crucial for ensuring timely and effective treatment. However, infants cannot verbalize their symptoms, making it difficult for healthcare professionals to diagnose their conditions accurately; crying is often the only way for infants to communicate their needs and discomfort. The system described in this entry therefore extracts and fuses several audio features from infant cry signals. Different combinations of the fused features are then fed into multiple machine learning algorithms, including random forest (RF), support vector machine (SVM), and deep neural network (DNN) models. Evaluation of the system using accuracy, precision, recall, F1-score, the confusion matrix, and the receiver operating characteristic (ROC) curve showed promising results for the early diagnosis of medical conditions in infants from crying signals alone: the system achieved its highest accuracy of 97.50% using the combination of the spectrogram, harmonic ratio (HR), and Gammatone frequency cepstral coefficients (GFCCs) through the deep learning process.
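As a rough illustration of the evaluation stage, all of the headline metrics named above can be derived from the binary confusion matrix. The sketch below (function name and layout are our own, not from the original system) computes them with plain NumPy:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "confusion_matrix": np.array([[tn, fp], [fn, tp]])}
```

The ROC curve generalizes this by sweeping the classifier's decision threshold and plotting recall against the false-positive rate at each setting.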

  • infant’s crying diagnosis
  • audio domain features
  • HR
  • GFCC

1. Introduction

Even though the worldwide number of infant deaths decreased from 5 million in 1990 to 2.4 million in 2019, newborns still face the highest risk of mortality during the first 28 days of life. In 2019, neonatal deaths accounted for 47 percent of all deaths among children under the age of 5, with nearly one-third dying on the day of birth and nearly three-quarters dying during the first week of life [1]. Infants who die within the first 28 days of life suffer from illnesses and complications linked to inadequate quality of care during delivery or insufficient professional care and treatment shortly after birth and in the early days of life [2]. This shows that newborns are vulnerable to a variety of diseases that can result in lifelong illness or early death, including aspiration, asphyxia, kidney failure, respiratory distress syndrome (RDS), and sepsis. RDS and sepsis are the pathologies most commonly associated with a high mortality rate; thus, this research study started by diagnosing them at early stages.
RDS is considered the major cause of death and illness among preterm newborns [3]. It is a respiratory disorder of neonates that manifests immediately after delivery and is one of the most frequent reasons for neonatal intensive care unit (NICU) admissions and breathing failure in newborns [4]. Causes of this disease include maladaptation or delayed adaptation, preexisting conditions such as surgical or congenital defects, and acquired infections [4]. RDS caused deaths at a rate of 10.7 per 100,000 live births in the United States in 2020 [5]. Diagnosing RDS requires a set of clinical tests, including a chest X-ray, computerized tomography (CT), an electrocardiogram and echocardiogram for the heart, and frequent blood tests to monitor oxygen levels [6].
Moreover, sepsis is a significant source of death and disease, causing 15 deaths per 100,000 live births in the United States in 2020 [5]. The main criterion in the diagnosis of sepsis is isolating the pathogen in one or more blood cultures [6]. However, culturing the pathogenic microorganism is not always possible, for reasons including inadequate sample collection, slow-growing microorganisms, prior antimicrobial therapy, nonbacterial infections, and contamination. In addition, like RDS, sepsis requires a battery of diagnostic tests assessing heart rate, feeding problems, lethargy, fever, hypotonia, convulsions, hemodynamic abnormalities, and apnea [7]. Early detection of such hidden illnesses is critical: most newborns affected by these pathologies appear normal at birth, and, as seen with both RDS and sepsis, diagnosis requires many time-consuming clinical tests that also carry the risk of false-negative and false-positive results [8]. Thus, early detection enabling prompt and successful treatment within the first week of life is critical, as it might save these newborns' lives [9].
On the other hand, the only way infants can communicate with their surroundings is by crying. Through training and experience, experts such as experienced parents, pediatricians, and childcare professionals may be able to understand and distinguish the meaning of infants' crying. However, interpreting newborn cries can be challenging for new parents as well as for inexperienced clinicians and caregivers. Distinguishing infant cries with distinct meanings based on the related audio qualities is therefore critical [10]. Accurately interpreting newborn cry sounds and automatically identifying infant cry signals may help parents and caregivers provide better care to their infants. Early diagnosis of diseases via cry signals is noninvasive and can be conducted without specialists present; hence, it has the potential to save more lives, particularly in developing countries [11].

2. Infant Cry Signal Diagnostic System Using Deep Learning and Fused Features

Numerous research works have been conducted to detect infant crying [17,18,19] and to identify the reason behind the crying, including whether it relates to a pathological condition. Most current research has focused on distinguishing pathological from healthy infants using crying cues [20]. Other works go into more detail to diagnose specific pathologies such as hypoacoustic conditions [21], asphyxia [22,23,24], hypothyroidism [25], sepsis [18], RDS [26], and autism spectrum disorder (ASD) [27]. Such systems mainly involve two stages. The first is feature computation and extraction from the cry audio signal (CAS), based on different audio domains, including cepstral domain features, prosodic domain features, image domain features, time domain features, and wavelet domain features [14]. The computed features are then fed into the second stage, the classification model, which may be a traditional machine learning (ML) model or a deep learning (DL) model; researchers have recently begun exploring DL algorithms for analyzing infant crying. DL approaches have proven effective at automatically extracting useful features from audio signals and at classifying sounds into categories such as healthy and sick infants [19,22,24,28,29,30,31,32].
Most researchers have adopted cepstral domain features for feature extraction from audio signals, such as Mel frequency cepstral coefficients (MFCCs) [33,34,35,36], linear frequency cepstral coefficients (LFCCs) [37], short-time cepstral coefficients (STCCs) [37], and Bark frequency cepstral coefficients (BFCCs) [38], combined with both DL and traditional ML models. MFCCs have been the most widely used in identifying infant pathologies. For instance, in [33], the authors' system classified the causes of infants' crying into eight reasons, including belly pain, discomfort, hunger, sleepiness, and tiredness. The MFCCs were used to train three ML algorithms: the K-nearest neighbors rule (KNN), SVM, and the naïve Bayes classifier (NBC). KNN achieved the highest accuracy, at 76%. In [34], the authors used a dataset of CAS recordings from healthy and pathological infants covering 34 pathologies. As a first step, feature extraction was performed using a set of techniques including MFCC and amplitude modulation features. These features were fed into two machine learning algorithms, a probabilistic neural network and an SVM, which achieved accuracies of 72.80% and 78.70%, respectively.
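For readers unfamiliar with the cepstral pipeline, the following minimal NumPy sketch shows the standard MFCC computation chain (framing, windowing, power spectrum, mel filterbank, log compression, DCT-II). The parameter defaults are illustrative, not those used in the cited studies:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """Minimal MFCC: frame -> window -> power spectrum -> mel bank -> log -> DCT."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_coeffs)[:, None])
    return log_mel @ dct.T  # shape: (n_frames, n_coeffs)
```

The resulting per-frame coefficient vectors are what the cited studies feed into KNN, SVM, NBC, or neural classifiers.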
Moreover, MFCCs were adopted for feature extraction from audio signals in [28] to train a set of machine learning models, including an artificial neural network (ANN), a CNN, and long short-term memory (LSTM). These models were trained for two purposes: first to identify sick versus healthy babies, and then to determine the baby's needs, such as hunger/thirst, a diaper change, or emotional needs. On the first goal, the CNN achieved an accuracy of 95%; an accuracy of 60% was achieved for the second classification task. A similar feature extraction was also used with KNN in [35], achieving an accuracy of 71.42% in determining the reason for crying, including hunger, belly pain, need for burping, discomfort, and tiredness. In [36], MFCCs were used with multiple CNN variants and with a multistage heterogeneous stacking ensemble model consisting of four levels of algorithms: Nu-support vector classification, random forest (RF), XGBoost, and AdaBoost. The CNN model outperformed the other ML algorithms, reaching an accuracy of 93.7%.
Prosodic domain features have also been employed in the analysis and diagnosis of infant crying signals. This domain carries much valuable information, such as variations in intensity, the fundamental frequency (F0), formants, harmonicity, and duration, all of which contribute substantially to the analysis of infant crying signals. Subsequent research has examined whether these features perform better standing alone or combined with cepstral features. For instance, in [39], the proposed model was based on the mean, median, standard deviation, minimum, and maximum of F0 and the first three formants (F1, F2, F3) to distinguish between full-term and preterm infant cries. In contrast, in [22], a combined model of weighted prosodic features and MFCC features was fed into a DL model, which achieved 96.74% accuracy. The obtained results emphasize the value of using both domains to extract and model a more effective feature set.
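Of these prosodic features, F0 is typically estimated frame by frame. One common, simple approach (not necessarily the one used in [39] or [22]) is autocorrelation peak picking, sketched here with illustrative search bounds:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=600.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    Searches for the autocorrelation peak between the lags corresponding
    to fmax (shortest lag) and fmin (longest lag).
    """
    frame = frame - np.mean(frame)                 # remove DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                       # highest candidate F0
    lag_max = min(int(sr / fmin), len(ac) - 1)     # lowest candidate F0
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag
```

Summary statistics of per-frame F0 estimates (mean, median, standard deviation, minimum, maximum) then form the kind of prosodic feature vector described above.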
The authors in [40] relied on wavelet domain audio features, using the discrete wavelet transform (DWT) to extract coefficient characteristics. These coefficients were used for classification with a single-layer feed-forward neural (SLNF) network. The system distinguished five categories of crying: Eh, Eairh, Neh, Heh, and Owh. Each is related to a specific condition in a baby: Heh indicates discomfort, Owh indicates sleepiness, Neh indicates thirst or hunger, and Eairh indicates the need to burp due to air trapped in the chest or stomach. The crying signals were passed through the DWT for feature extraction using five wavelet scaling functions, namely Haar, Db2, Coif1, Sym2, and Bior3.1, with the output of each function used as input to the SLNF network. The average accuracy of all discrete wavelet functions on the baby-language task exceeded 80%.
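To make the wavelet-domain idea concrete, a single Haar DWT level and a toy detail-energy feature vector can be sketched as follows. This is only an illustration of the technique, not the exact feature set of [40]:

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform.

    Returns (approximation, detail) coefficients; len(x) must be even.
    """
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass half-band
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass half-band
    return approx, detail

def haar_features(x, levels=3):
    """Energy of the detail coefficients at each decomposition level,
    a simple wavelet-domain feature vector."""
    feats = []
    for _ in range(levels):
        x, d = haar_dwt(x)
        feats.append(float(np.sum(d ** 2)))
    return feats
```

The orthonormal scaling preserves signal energy across each level, which is why detail-band energies are a meaningful summary of where the cry's content sits in frequency.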
Furthermore, image domain features have been used in this field of study, the main one being the spectrogram, an image or time-frequency representation of audio [14]. For example, the researchers in [32] classified neonatal cry signals into pain, hunger, and sleepiness using the short-time Fourier transform (STFT) to generate spectrogram images. These images were used to train a deep convolutional neural network (DCNN), whose extracted features were then fed to an SVM classifier, reaching an accuracy of 88.89% with the radial basis function (RBF) kernel. Similarly, spectrogram features with an SVM classifier obtained an accuracy of 71.68% [41]. Moreover, the researchers in [29] used the spectrogram with a CNN model to classify whether the baby was sleepy or in pain, obtaining an accuracy of 78.5%.
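A log-magnitude spectrogram of the kind fed to CNNs in these studies can be computed directly from the STFT. The sketch below uses illustrative frame and hop sizes, not the settings of any cited paper:

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude spectrogram via the short-time Fourier transform (STFT).

    Each column is the windowed FFT of one frame; the resulting
    time-frequency image can be fed to a CNN like any other picture.
    """
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)], axis=1)
    mag = np.abs(np.fft.rfft(frames, axis=0))
    return 20.0 * np.log10(mag + 1e-10)   # shape: (n_fft // 2 + 1, n_frames)
```

The overlap between frames (here 50%) trades time resolution against redundancy; the log scaling compresses the large dynamic range of cry signals into something a CNN handles well.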
Some researchers have gone deeper into this topic to diagnose specific diseases. For instance, the authors in [42] proposed a machine learning model to diagnose hypoxic ischemic encephalopathy in newborns based on CAS analysis. Multiple feature extraction techniques were used, including MFCCs and Gammatone frequency cepstral coefficients (GFCCs). These features were utilized by a basic deep network, achieving an accuracy of 96%. The authors in [37] introduced a model to classify healthy versus unhealthy newborn cries. A set of feature extraction techniques was used, including MFCC, LFCC, STCC, and Teager energy cepstral coefficients (TECC). Classification was based on the Gaussian mixture model (GMM) and SVM algorithms. Both models were trained on each extracted feature set separately, and the results demonstrated the superiority of the TECC representation with the GMM classifier, which achieved an accuracy of 99.47%. Furthermore, in [31], the researchers developed a DL approach to classify healthy and pathological babies based on the infant's CAS: the signals were processed using cepstrum analysis to extract the harmonics in the cry records, and the resulting spectrum was fed into three DL models, namely deep feed-forward neural networks (DFFNN), LSTM, and CNN. The last of these outperformed the other algorithms with an accuracy of 95.31%. Similarly, the researchers in [43] adopted the cepstrum to build a model distinguishing healthy from pathological infants based on the crying signal, evaluating a DFFNN, naïve Bayes, SVM, and a probabilistic neural network; the DFFNN achieved 100% accuracy.
Few researchers have followed a combined feature domain approach. In [8], both GFCC and HR features were fused by simple concatenation to distinguish between RDS and sepsis; using SVM and multilayer perceptron (MLP) classifiers, the SVM achieved 95.29%, compared to 92.94% for the GFCC alone and 71.03% for the HR alone. In [44], images containing the prosodic feature lines, including F0, intensity, and formants, were combined with spectrogram-based and waveform-based CNNs, producing a 5% better accuracy. The study in [45] explored DL models with hybrid features to classify asphyxia cries in infants, using a combination of MFCC, chromagram, Mel-scaled spectrogram, spectral contrast, and Tonnetz features. The results showed that the DNN models performed better with the hybrid features, achieving 100% accuracy for normal versus asphyxia cries and 99.96% for nonasphyxia versus asphyxia cries, while the CNN model performed better with the MFCC alone. The study demonstrated the effectiveness of DL models with hybrid features for classifying asphyxia cries in infants.
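The simple-concatenation fusion used in [8] amounts to stacking the per-domain feature vectors into one. The sketch below adds a z-score normalization step before concatenation, which is our own assumption for illustration (not stated in [8]) so that no single domain dominates purely because of its scale:

```python
import numpy as np

def fuse_features(*feature_vectors):
    """Early fusion by simple concatenation, with per-domain z-score
    normalization so differently scaled features remain comparable."""
    normalized = []
    for v in feature_vectors:
        v = np.asarray(v, dtype=float)
        std = v.std()
        normalized.append((v - v.mean()) / std if std > 0 else v - v.mean())
    return np.concatenate(normalized)
```

The fused vector is then fed to a single classifier (SVM, MLP, or a DNN), exactly as the separate feature vectors would be on their own.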

This entry is adapted from the peer-reviewed paper 10.3390/diagnostics13122107
