DL Techniques and Imaging in Head and Neck: Comparison

Deep learning (DL) systems utilize complex algorithms and neural networks featuring numerous intricate layers in order to make decisions and solve advanced problems. Their application in medicine, and specifically in otorhinolaryngology, has increased rapidly. The head and neck region is among the most common locations for cancer, with a substantial occurrence of lymph node involvement and metastases observed in both nearby and distant regions.

  • deep learning
  • artificial intelligence
  • convolutional neural network

1. Head and Neck Imaging

Head and neck surgery relies heavily on imaging, which is often a prerequisite for any further management. Different techniques offer significant advantages in disease diagnosis as well as in follow-up. Over the last decades, computed tomography (CT) and magnetic resonance imaging (MRI) have usually been used in combination across a wide variety of medical conditions to acquire both bone and soft tissue information.
Lately, deep learning algorithms have emerged that enable the conversion of one imaging modality into another. For example, MRI scans acquired with bone-imaging techniques allow a subsequent MRI-to-CT reconstruction, avoiding exposure to ionizing radiation while aiding non-experts in diagnosis [1]. A combination of two generative adversarial networks has also been implemented to generate accurate synthetic CT images from MRI scans [2]. Conversely, non-contrast CT scans can be converted into PET-like images with generative models, eliminating the need for radioactive tracers; the generated PET images demonstrate accuracy comparable to actual FDG-PET images in predicting clinical outcomes [3]. It seems rational to hypothesize that such deep learning pipelines could transform head and neck imaging into a one-step procedure in the future.
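To make the modality-conversion idea concrete, the following is a minimal PyTorch sketch of paired, pix2pix-style GAN training for MRI-to-CT synthesis. The architecture, image size, loss weighting, and the random stand-in batch are all illustrative assumptions, not the models used in the cited studies.

```python
# Minimal sketch of GAN-based MRI-to-CT synthesis (pix2pix-style, paired 2D slices).
# All shapes and hyperparameters are illustrative, not taken from the cited work.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Tiny encoder-decoder mapping a 1-channel MRI slice to a synthetic CT slice."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),            # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 128 -> 64
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 128
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh(),   # 128 -> 256
        )
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """PatchGAN-style critic judging (MRI, CT) pairs as real or synthetic."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, padding=1),  # per-patch real/fake logits
        )
    def forward(self, mri, ct):
        return self.net(torch.cat([mri, ct], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
adv, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
mri, ct = torch.randn(4, 1, 256, 256), torch.randn(4, 1, 256, 256)  # stand-in batch

# Discriminator step: push real pairs towards 1 and synthetic pairs towards 0.
real, fake = D(mri, ct), D(mri, G(mri).detach())
d_loss = adv(real, torch.ones_like(real)) + adv(fake, torch.zeros_like(fake))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the critic while staying close to the true CT (L1 term).
fake_ct = G(mri)
pred = D(mri, fake_ct)
g_loss = adv(pred, torch.ones_like(pred)) + 100.0 * l1(fake_ct, ct)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```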
Moreover, convolutional neural networks (CNNs) are believed to exhibit superior performance compared to traditional radiomic frameworks in detecting image patterns that are often undetectable by the latter, while systems such as ultra-high-resolution CT with a DL-based image reconstruction engine offer significant improvements in subjective and objective image quality, with a higher signal-to-noise ratio, lower noise, and lower radiation exposure [4].
DL techniques applied to medical image analysis allow the incorporation of both qualitative and quantitative imaging characteristics to create prediction models with exceptional diagnostic accuracy. These principles have been applied to head and neck squamous cell carcinoma (HNSCC) imaging in general, but also to particular subtypes. Notably, in the imaging of oral and oropharyngeal cancer, FDG-PET/CT scans can be processed by DL systems to predict local treatment outcomes [5], disease-free survival with high sensitivity and specificity [6], and overall survival [7], and they can even assist in differentiating human papillomavirus-positive from human papillomavirus-negative oropharyngeal carcinomas [8].
At the same time, progress in computer vision and deep learning provides potent techniques for creating supplementary tools capable of automatically screening the oral cavity. These cost-effective and non-invasive tools can offer real-time insights to healthcare practitioners during patient assessments and can also facilitate self-examination. The automated image-based diagnosis of oral cancer predominantly relies on specialized imaging technologies, namely optical coherence tomography [9][10], hyperspectral imaging [11], and autofluorescence imaging [12], but also on white-light photographs [13]. Such DL techniques can come in the form of mHealth applications, assisting in oral and oropharyngeal lesion detection in both hospitals and resource-limited areas, and enabling telediagnosis [14]. Finally, systems offering real-time estimation of cancer risk and biopsy assistance maps of the oral mucosa are very promising [15].
Furthermore, diseases of the nasopharynx have been an area of focus for DL system developers in recent years. From MRI-based applications for the differential diagnosis between benign and malignant nasopharyngeal diseases [16][17] to the automatic detection of pathological lymph nodes and assessment of the peritumoral area in nasopharyngeal carcinoma, DL algorithms can significantly assist in disease prognosis and treatment planning [18]. Interestingly, peritumoral information, especially the largest areas of tumor invasion, has been shown to provide valuable insights for distant metastasis prediction in individuals with nasopharyngeal carcinoma [19].
Imaging of the salivary glands constitutes another significant challenge for radiologists and otolaryngologists, who have many different imaging modalities at their disposal. Specialized DL algorithms have been developed to assist in the differential diagnosis between benign and malignant parotid gland tumors on contrast-enhanced CT images [20] and ultrasonography [21]. MRI remains the gold standard in the diagnosis of salivary gland diseases, where DL models aim to automatically classify salivary gland tumors with very high accuracy [22][23].
Regarding thyroid disease diagnosis, ultrasound (US) is widely acknowledged as the primary technique for examining thyroid nodules and assessing papillary thyroid carcinomas (PTCs) before surgery [24]. DL networks with excellent diagnostic efficiency have been deployed to distinguish benign nodules from thyroid carcinoma [25], improve the detection of follicular carcinoma, differentiate between atypical and typical medullary carcinoma [26], and assess gross extrathyroidal extension in thyroid cancer [27]. AI systems can be very useful in reducing the operator dependence of US and improving diagnostic precision, especially for inexperienced radiologists.
Nevertheless, plenty of other DL techniques are associated with thyroid gland evaluation. Apart from thyroid gland contouring on non-contrast-enhanced CT images [28], special applications have been designed for intraoperative use, assisting surgeons in identifying the recurrent laryngeal nerve [29] and the parathyroid glands. Such algorithms have the potential to improve surgical workflows in the intricate environment of open surgery.
The head and neck region is among the most common locations for cancer, with a substantial occurrence of lymph node involvement and metastases in both nearby and distant regions. The identification of distant metastases is linked to an unfavorable prognosis, often with a median survival of around 10 months [30]. The role of imaging in metastasis diagnosis is well established, and novel convolutional neural networks have been developed in this direction. For example, extended 2D-CNN and 3D-CNN models have been deployed to perform time-to-event analysis for the binary classification of distant metastasis in head and neck cancer patients; these models generate distant metastasis-free probability curves and stratify patients into high- and low-risk groups [31]. CNNs are generally able to detect image patterns that remain undetectable with traditional methods. Thus, it has been shown that CNNs can be trained to forecast treatment outcomes for individuals with HNSCC relying exclusively on information from pre-treatment CT scans [32].
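For intuition, a minimal PyTorch sketch of a 3D CNN producing a binary distant-metastasis risk score from a pre-treatment CT volume is given below; the layer sizes, input crop, and thresholding are hypothetical assumptions, not the published extended 2D-/3D-CNN models.

```python
# Illustrative 3D CNN for binary distant-metastasis classification from a CT volume.
# Architecture and input size are assumptions for this sketch, not the cited models.
import torch
import torch.nn as nn

class Metastasis3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),           # global pooling -> (B, 64, 1, 1, 1)
        )
        self.classifier = nn.Linear(64, 1)     # single logit for metastasis risk

    def forward(self, volume):                 # volume: (B, 1, D, H, W)
        return self.classifier(self.features(volume).flatten(1))

model = Metastasis3DCNN()
ct_crop = torch.randn(2, 1, 64, 128, 128)      # stand-in pre-treatment CT crops
risk = torch.sigmoid(model(ct_crop))           # thresholding splits high/low risk
```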
CNNs assessing pre-treatment MRI scans to predict the likelihood of distant metastases in individuals with nasopharyngeal carcinoma can also be useful, since metastasis is the main reason for radiotherapy failure in this patient group; identifying patients at high risk of distant metastasis can lead to a more aggressive treatment approach [19]. Moreover, pre-therapy MRI scans have been used in patients with advanced (T3N1M0) nasopharyngeal carcinoma to guide clinicians in deciding between induction chemotherapy plus concurrent chemoradiotherapy and concurrent chemoradiotherapy alone [33].
DL models diagnosing lymph node involvement can boost clinical decision-making in the future. One such model detects pathological lymph nodes in individuals with oral cancer [34], while another predicts lymph node involvement in patients with thyroid cancer through the interpretation of multiphase dual-energy spectral CT images [35].
The utilization of deep learning techniques allows for the complete automation of image analysis, providing the user with multiple possibilities (Table 1). Nevertheless, it demands a substantial volume of accurately labeled images. Additionally, prediction-making necessitates detailed patient endpoint data, which are both expensive and time-intensive to collect. Developing more effective models with constrained datasets stands as a critical challenge in AI today.
Table 1. The contributions of deep learning systems in head and neck imaging and radiotherapy.

2. Head and Neck Radiotherapy

Radiotherapy (RT) stands as a fundamental pillar in head and neck cancer (HNC) treatment, whether administered independently, post-surgery, or concurrently with chemotherapy. Defining organs at risk (OARs) and clinical target volumes represents a crucial phase in the treatment protocol. This process typically involves manual work, is time-consuming, and necessitates substantial training. Ideally, these tasks would be substituted by automated procedures requiring minimal clinician involvement, and AI appears competent to undertake this role.
A major challenge and the primary drawback of radiation therapy is that, apart from the cancerous mass, it unavoidably exposes nearby healthy tissues, known as OARs, to some level of radiation. This can result in various adverse effects and toxicities, since contouring organs such as the parotid and submandibular glands and excluding them from the radiation field can be quite arduous [36]. Additionally, DL-based automated segmentation of the masticatory area has successfully reduced the incidence of RT-associated trismus [37].
Several applications exist that aim to achieve auto-segmentation of normal tissue structures from CT images [38][39]. These can include three-dimensional segmentation models and convolutional neural networks for final OAR identification [40]. DL pipelines focusing on tumor segmentation in specific organs, such as the oropharynx [41] and the salivary glands, promise to gradually automate the RT procedure while reducing post-segmentation editing [42].
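As a sketch of what such a segmentation model can look like, here is a tiny U-Net-flavored 3D encoder-decoder with one skip connection; the depth, channel counts, and number of organ classes are illustrative assumptions, not any of the cited pipelines.

```python
# Minimal 3D encoder-decoder for organ-at-risk segmentation from CT (U-Net-flavored).
# Depth, channels, and the organ count are assumptions made for this sketch.
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    def __init__(self, n_organs=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool3d(2),
                                  nn.Conv3d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose3d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv3d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, n_organs + 1, 1))  # organs + background

    def forward(self, x):
        e = self.enc(x)                         # high-resolution features (skip source)
        d = self.up(self.down(e))               # downsample, process, upsample back
        return self.dec(torch.cat([e, d], 1))   # fuse skip and decoded features

# Per-voxel organ labels for a stand-in CT patch of shape (D, H, W) = (32, 64, 64).
masks = TinyUNet3D()(torch.randn(1, 1, 32, 64, 64)).argmax(dim=1)
```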
On the other hand, 3D CNNs aim to consistently and precisely generate clinical target volume contours for the different lymph node levels in HNSCC RT [43][44]. Such applications allow quicker adjustment of the automated delineations than de novo manual contouring, align closely with corrected delineations for specific levels, and reduce interobserver variability.
The possibilities that DL systems offer are countless, ranging from distant metastasis and overall survival prediction in HNSCC using PET-only models without gross tumor volume segmentation [45] to automatic delineation of the gross tumor volume in the FDG-PET/CT images of HNSCC patients [46]. Overall, DL systems have the potential to offer personalized RT guidance for HNSCC patients with limited input from medical experts.

3. Endoscopy and Laryngoscopy

Machine learning has recently been applied experimentally to diagnostic ENT endoscopy to extract meaningful information from digital images, videos, and other visual inputs and to take actions or make recommendations based on that information. Building on early experience acquired in the more standardized field of gastrointestinal endoscopy, AI-based video analysis, or videomics [47], has been applied in various ways to improve automatic image quality control, classification, optical detection, and image segmentation. After numerous proof-of-concept studies, videomics is rapidly moving towards viable clinical applications for detecting pathological patterns in real time during the endoscopic evaluation of the upper aerodigestive tract.
A deep learning model consists of complex multilayer artificial neural networks, among which convolutional neural networks are the most popular in the image analysis field. A CNN does not require instructions on which features describe an object; it can autonomously learn to identify it by observing a sufficient number of examples. Various AI models are available and have been applied [48], although a specific comparison between the different algorithm architectures for this task is still lacking. After this preliminary conceptualization phase, the model undergoes a supervised learning session in which expert otolaryngologists provide the AI with human-annotated images to transfer their ability to recognize the lesions. The higher the quality and quantity of items in the training and validation sets, the more accurate the model will be. After training and validation, the performance of the system is measured on a testing set by comparing the model's predictions with the original human annotations, using diagnostic metrics appropriate to the task.
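The train/validate/test discipline just described can be sketched in a few lines. Purely for brevity, a scikit-learn logistic regression on synthetic feature vectors stands in for a CNN on endoscopic frames; every name and number here is a placeholder.

```python
# Sketch of the supervised workflow: expert-labeled items are split so that the
# validation set tunes the model and the test set is touched exactly once at the end.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 128))           # stand-ins for image feature vectors
y = rng.integers(0, 2, size=600)          # expert labels: 0 = benign, 1 = suspicious

X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  random_state=0)

best_model, best_val = None, -1.0
for C in (0.1, 1.0, 10.0):                # model selection on the validation set only
    m = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, m.predict(X_val))
    if val_acc > best_val:
        best_model, best_val = m, val_acc

# Final, unbiased performance estimate on the held-out test set.
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```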
AI can be used to classify endoscopic images. In that case, the diagnostic metrics of interest are accuracy (percentage of correctly classified images), precision (positive predictive value), sensitivity (percentage of correctly identified images out of all those that should have been recognized), the F1 score (harmonic mean of precision and sensitivity), and the receiver operating characteristic (ROC) curve (plotting the true positive rate against the false positive rate) [49]. In this framework, it is possible to apply AI to classify videos based on their image quality, selecting only the most informative frames for further analysis [50][51]. Another classification task is the optical biopsy [52], predicting the histology of a lesion based on its appearance. At the current state, AI is more accurate in binary classification, e.g., premalignant/malignant [53], whereas it loses diagnostic power in multiclass operation [54]. By expanding and diversifying the training dataset, it is possible to achieve high accuracy in simultaneously identifying different conditions such as glottic carcinoma, leucoplakia, nodules, and vocal cord polyps [55], outperforming otolaryngology trainees in terms of AUC and F1 score [56].
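Each of these metrics corresponds to a standard library call; the snippet below computes them with scikit-learn on a small set of hypothetical predictions (1 = malignant).

```python
# Computing the classification metrics named above on illustrative predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground truth (e.g., histology)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # model's hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # model's predicted probabilities

print("accuracy   :", accuracy_score(y_true, y_pred))   # fraction classified correctly
print("precision  :", precision_score(y_true, y_pred))  # positive predictive value
print("sensitivity:", recall_score(y_true, y_pred))     # true positive rate (recall)
print("F1 score   :", f1_score(y_true, y_pred))         # harmonic mean of the two above
print("ROC AUC    :", roc_auc_score(y_true, y_score))   # area under the ROC curve
```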
Another task for which AI has been devised is the automatic detection of lesions during endoscopic evaluation. The main diagnostic metrics for this function are the F1 score, the intersection over union (IoU; how well the predicted area overlaps with the annotated area), and the mean average precision (mAP; precision and sensitivity at a chosen IoU threshold). Using narrow-band imaging, AI can be trained to localize cancerous mucosal lesions in the pharynx and larynx during endoscopy [57][58][59]. This concept has recently been applied to automatically detect laryngeal cancer in real-time video-laryngoscopy using the open-source YOLO CNN, achieving 67% precision, 62% sensitivity, and 0.63 mean average precision at 0.5 IoU [60], and could be implemented in a self-screening approach for the early detection of tumor recurrence [61]. Beyond simple diagnostic endoscopy, the same approach can be applied intraoperatively to detect pathological tissues, such as in endoscopic parathyroid surgery [62][63].
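The IoU behind these detection metrics is a few lines of box geometry; below is a plain-Python sketch with hypothetical boxes in (x1, y1, x2, y2) form.

```python
# Intersection over union (IoU) of two axis-aligned boxes, the overlap measure
# underlying the mean average precision figures quoted above.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the overlap rectangle (zero if the boxes do not intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A predicted lesion box vs. its annotation: IoU ~0.14, so at the common 0.5
# threshold this detection would not count as a true positive.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))
```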
Finally, CNNs have been used to automatically delineate the boundaries of anatomical structures and lesions in the upper aerodigestive tract. Segmentation performance is evaluated with the IoU and the Dice similarity coefficient (DSC; similarity between the predicted segmentation mask and the ground truth mask). The rationale of segmentation in videomics is to improve lesion follow-up, the definition of tumor resection margins in the operating room, and the delineation of areas of interest for general laryngology. The automated segmentation of cancerous tissue has been successfully attempted in the nasopharynx (DSC 0.78) [64], the oropharynx (DSC 0.76) [65], and laryngeal lesions (DSC 0.814) [66]. Aside from cancer pathology, segmentation may be used to select the region of interest for automated functional laryngeal analysis, such as the identification of the glottic angle [67][68], the glottal midline [69], vocal cord paralysis [70], postintubation granuloma [71], and vocal cord dynamics [72][73], or in the endoscopic evaluation of aspiration and penetration risk in dysphagia (FEES-CAD, DSC 0.925) [74].
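The DSC values quoted above come from a simple overlap formula, DSC = 2|A∩B| / (|A| + |B|); a minimal NumPy sketch on toy binary masks:

```python
# Dice similarity coefficient (DSC) between predicted and ground-truth binary masks.
import numpy as np

def dice(pred, truth, eps=1e-7):
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)

pred = np.zeros((100, 100)); pred[20:60, 20:60] = 1    # predicted lesion mask
truth = np.zeros((100, 100)); truth[30:70, 30:70] = 1  # expert-annotated mask
print(dice(pred, truth))  # ~0.56; 1.0 would be a perfect overlap
```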
Building a sufficiently large and heterogeneous training image dataset is necessary to improve deep learning-based image classifiers. The main obstacles remain the lack of standardization in endoscopic techniques and study designs, which hampers comparison between different experiences, and the complex anatomy of the upper aerodigestive tract, which makes image acquisition and standardization difficult. Although deep learning models can be very good at analyzing images belonging to the same population as the training cohort, they may lack accuracy when tested on different populations. To effectively apply videomics in real-world situations, future research should focus on validating the trained models on external datasets acquired at different institutions and thus diverse in terms of acquisition technique and population demographics. Although AI-aided endoscopy is still at a preclinical stage, the results are promising, and it may soon efficiently assist the otolaryngologist in many tasks, such as the quality assessment of endoscopic examinations, the detection of mucosal lesions during endoscopy, the optical biopsy of selected lesions, the segmentation of cancer margins, and the assessment of laryngeal mobility.