Deep Learning Building Blocks: Comparison

Industry 4.0 characterizes the transformation from traditional automation to engineered cyber-physical systems with human-like intelligence. Indeed, this gives Artificial Intelligence (AI) the privilege of playing a central role in Industry 4.0. Moreover, deep learning (DL) has emerged as the leading branch of AI. DL is an essential subfield of machine learning (ML) characterized by its layered structure of artificial neural networks (ANNs).

  • deep learning
  • artificial neural networks (ANNs)

1. Introduction

Industry 4.0 characterizes the transformation from traditional automation to engineered cyber-physical systems with human-like intelligence (including sensing, problem solving, and other cognitive capabilities). Indeed, this gives current Artificial Intelligence (AI) the privilege of playing a central role in Industry 4.0. Moreover, over the last decade (2013–2023), deep learning (DL) has emerged as the leading branch of AI [1,2]. DL is an essential subfield of machine learning (ML) characterized by its layered structure of artificial neural networks (ANNs). Each layer extracts useful knowledge for making decisions or future predictions in self-supervised, semi-supervised, supervised, and unsupervised learning problems [3,4,5]. In this sense, DL has become the main driver of the current AI hype. Moreover, DL has achieved tremendous success in a wide range of tasks that have historically been extremely difficult for computers [1], leading to more AI models with human-level performance, as pointed out in [2]. This allows manufacturing companies to make intelligent use of the large volumes of data generated in the industrial business environment [6].
Currently, many industry sectors are undergoing a tremendous transformation and integrating DL models into their solutions [7,8,9]. For example, DL is used in the nuclear energy industry to predict cracking in hazardous areas. In the agriculture industry [10], DL is used to analyze historical rainfall patterns, wind direction, and atmospheric pressure to predict storms and river water levels [11]. In the manufacturing industry, it is used for predictive maintenance. In the food industry, it is used to understand current consumer preferences and behavior [12]. In the automotive sector [13], DL is used to guide autonomous vehicles [14]. In the medical industry, it is used to diagnose and predict illnesses [15,16,17]. This increase in popularity is mainly due to three factors: (1) the rapid evolution of hardware with highly parallel structures, (2) the development of open-source platforms for ML, and (3) the predominance of DL models in terms of accuracy and flexibility to represent the world with concepts that range from the simplest to the most complex. All of this is powered by vast amounts of data [18,19].
Due to the popularity of DL, specialized hardware has become necessary because conventional microprocessors and CPUs are not designed for training and inference with DL models. Graphics Processing Units (GPUs) have emerged as the ideal complement to Central Processing Units (CPUs) for bringing intelligence to applications. In addition, neural network accelerators based on Field-Programmable Gate Arrays (FPGAs) are favored over CPUs because they accelerate computation by mapping calculations onto parallel hardware [20]. Furthermore, the cloud has become mainstream and inexpensive. Consequently, industry, academia, and government are making data previously stored locally available to the research community. As DL models are complex and require large amounts of data, the availability of data sets significantly impacts AI research: today, scientists can access large data sets with diverse data points to train complex DL models. Therefore, problems like detecting cancer or predicting rainfall can be solved quickly and accurately thanks to high-performance computers and vast data. In addition, the available open-source platforms allow models to be designed, modified, and tested quickly and help deploy them [2,21].
Despite this importance, DL models are often considered black boxes whose components, or building blocks, are unclear and difficult to understand. Much of the research is based on the straightforward application of deep networks that have already been developed: the authors do not contemplate modifications and avoid discussing new network designs for specific input signal types. Furthermore, explanations of the building blocks needed to design new models, or to modify current ones, are limited, superficial, and scattered across the literature.

2. Deep Learning Building Blocks

In the past, computer applications were developed from a single data processing perspective. Currently, applications have shifted toward ML, with DL being one of the leading research trends, whose success is mainly due to convolutional neural networks (CNNs). DL algorithms can be classified into three groups: (i) recurrent neural networks (RNNs), (ii) multilayer perceptrons (MLPs), and (iii) CNNs.
RNNs are neural networks whose hidden states allow past outputs to be fed back as inputs, making them suitable for sequential data. A particular type of RNN is the long short-term memory (LSTM) network, which is capable of handling the vanishing gradient problem that plain RNNs face [22]. RNNs are commonly used on sequential data or time series, for example in language translation [23], natural language processing (NLP) [24], speech recognition [25], and image captioning [26]. Moreover, RNNs are included in popular applications such as Siri [27] and Google Translate [28]. Currently, some challenges [29] that NLP is addressing with DL are (a) contextual words and homonyms, for example, the same word can have different meanings according to the context of a sentence; (b) irony and sarcasm, for example, a sentence that communicates the opposite of what is said; (c) ambiguity in phrases that admit two or more interpretations; (d) errors in text and speech, for example, misspelled words; (e) slang and colloquialisms; and (f) domain-specific language, for example, the model used to process a legal document differs from that used to process a healthcare document.
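To make the recurrent building block concrete, the following is a minimal sketch of an LSTM sequence classifier written in PyTorch. The framework choice, vocabulary size, and layer widths are illustrative assumptions rather than values prescribed here.

    # Minimal sketch of an LSTM sequence classifier (PyTorch).
    # Vocabulary size, embedding width, and class count are placeholders.
    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=128,
                     hidden_dim=256, num_classes=5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, tokens):                # tokens: (batch, seq_len)
            x = self.embed(tokens)                # (batch, seq_len, embed_dim)
            _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden_dim)
            return self.fc(h_n[-1])               # logits: (batch, num_classes)

    model = LSTMClassifier()
    logits = model(torch.randint(0, 10000, (8, 32)))  # 8 sequences of 32 tokens

The LSTM's gating mechanism is what mitigates the vanishing gradient problem mentioned above; the final hidden state summarizes the whole sequence before classification.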
An MLP consists of fully connected (FC) layers with nonlinear activation functions that can separate data that are not linearly separable. MLPs have been successfully used in pattern classification, recognition, prediction, and approximation tasks. Transformers are among the newest and most powerful architectures built from MLPs; they track relationships in sequential data, such as the words in a sentence, to learn the context and meaning of the data [30]. Vision transformers (ViTs) are the transformer version for computer vision applications [31,32]. In ViTs, an image is split into patches, each of which is flattened into a vector and linearly projected to a smaller dimension; the resulting sequence of vectors is then processed by a transformer. These algorithms have been employed for autonomous driving [33], image classification, object detection, image segmentation [34], video deepfake detection [31], anomaly detection [35], image synthesis [36], and cluster analysis [37].
Generative adversarial networks (GANs) are MLP extensions built from stacked FC layers that generate new data with the same statistics as the training set. A GAN consists of two parts: the generator and the discriminator. The generator models the training data distribution and produces candidate samples, while the discriminator is a binary classifier that decides between real and fake [38]. There are several applications of GANs. For example, CycleGAN is a GAN architecture that uses a technique for training unsupervised image translation models to learn transformations between images of different styles [39]. StyleGAN is a GAN built from a stack of FC layers that generates high-resolution images: the initial layers generate low-resolution images, and further layers refine the resolution [40]. PixelRNN is an auto-regressive generative model capable of learning an explicit data distribution, whereas GANs learn implicit probability distributions [41].
Autoencoders (AEs) and variational autoencoders (VAEs) are generative models explicitly designed to capture the probability distribution of a training set and generate new samples [42]. An encoder is made of FC layers whose dimensions decrease as the encoder becomes deeper; this compresses the input data into an encoded representation several orders of magnitude smaller than the input, producing the latent space of variables. A decoder takes a representation from the latent space and decompresses it, increasing the dimensions of the FC layers as it approaches the output. When combined with prior knowledge, AE-based models have proven successful for anomaly detection in hyperspectral images. For example, in [43], a dynamic low-rank and sparse prior-constrained model was developed that combines a linear low-rank model, a sparse model, and a nonlinear deep AE to detect anomalies and to extract discriminative features between background and anomaly in complex scenes. In [44], a deep self-representation learning framework for hyperspectral anomaly detection was proposed; the model integrates the prior knowledge of robust principal component analysis (PCA) and local spatial information into the AE model, with results that outperform state-of-the-art methods.
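As an illustration of the FC encoder/decoder structure described above, the following is a minimal autoencoder sketch in PyTorch; the 784-dimensional input (a flattened 28 × 28 image) and the layer widths are assumptions chosen for the example.

    # Minimal sketch of a fully connected autoencoder (PyTorch).
    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            # Encoder: FC layers shrink toward the latent space.
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 256), nn.ReLU(),
                nn.Linear(256, latent_dim))
            # Decoder: FC layers grow back to the input dimension.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(),
                nn.Linear(256, input_dim))

        def forward(self, x):
            z = self.encoder(x)       # compressed latent representation
            return self.decoder(z)    # reconstruction of the input

    model = Autoencoder()
    x = torch.rand(16, 784)                       # a batch of flattened images
    loss = nn.functional.mse_loss(model(x), x)    # reconstruction error

Training minimizes the reconstruction error, so the latent vector must retain the information needed to rebuild the input; in anomaly detection, samples that reconstruct poorly are flagged as anomalous.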
CNNs are stacked layers composed of the convolution of a trainable filter with the input signal or receptive field, followed by a pooling layer and an activation layer. The filtering extracts features from previous layers to form a feature map. CNNs are used for classification and prediction in computer vision tasks. For example, LeNet-5 [45,46], AlexNet [47], VGG-16 [48,49], DenseNet121 [50,51], ResNet50 [52,53], and MobileNet-V2 [54,55] have been used for classification tasks, while U-Net [56,57] has been used for semantic segmentation problems. At present, some challenges addressed by computer vision [58] through DL are (a) image and video synthesis to create realistic images and videos for content creation and entertainment; (b) image style transfer to merge the artistic style of one image with the content of another; (c) text-to-image synthesis to extract meaning from a text description and convert it into an image for image editing; (d) enhancing the capabilities of autonomous vehicles to handle difficult driving scenarios more precisely; (e) detecting early signs of diseases before symptoms appear; (f) identifying suspicious behavior or objects more accurately for security purposes; and (g) making DL models more interpretable, especially in applications where human lives, safety, and ethics are involved.
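The following minimal PyTorch sketch, loosely in the spirit of LeNet-5, shows the convolution, activation, and pooling stack described above, followed by an FC classifier; the input shape and channel counts are illustrative assumptions.

    # Minimal sketch of a small CNN classifier (PyTorch).
    import torch
    import torch.nn as nn

    class SmallCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1),   # first feature map
                nn.ReLU(), nn.MaxPool2d(2),                   # 28x28 -> 14x14
                nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second feature map
                nn.ReLU(), nn.MaxPool2d(2))                    # 14x14 -> 7x7
            self.classifier = nn.Linear(32 * 7 * 7, num_classes)

        def forward(self, x):                     # x: (batch, 1, 28, 28)
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))

    model = SmallCNN()
    logits = model(torch.rand(4, 1, 28, 28))      # logits: (4, 10)

Each Conv2d layer slides trainable filters over the previous layer's output to build feature maps, and MaxPool2d downsamples them, matching the convolution-pooling-activation pattern described above.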
In summary, it is essential to mention that works in the literature have addressed only one side, focusing on a single application or topic, such as reviews of CNN architectures. However, these works do not provide a complete understanding of DL topics, such as the concepts and mathematics behind the building blocks used to develop an architecture.

References

  1. Hamid, O. Data-centric and model-centric AI: Twin drivers of compact and robust industry 4.0 solutions. Appl. Sci. 2023, 13, 2753.
  2. Hamid, O.; Smith, N.; Barzanji, A. Automation, per se, is not job elimination: How artificial intelligence forwards cooperative human-machine coexistence. In Proceedings of the 15th IEEE International Conference on Industrial Informatics (INDIN), Emden, Germany, 24 July 2017; pp. 899–904.
  3. Jiang, X.; Hadid, A.; Pang, Y.; Granger, E.; Feng, X. Deep Learning in Object Detection and Recognition; Springer Nature: Singapore, 2019.
  4. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449.
  5. Schmarje, L.; Santarossa, M.; Schröder, S.; Koch, R. A Survey on semi-, self- and unsupervised learning for image classification. IEEE Access 2021, 9, 82146–82168.
  6. Rikalovic, A.; Suzic, N.; Bajic, B.; Piuri, V. Industry 4.0 implementation challenges and opportunities: A technological perspective. IEEE Syst. J. 2022, 16, 2797–2810.
  7. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386.
  8. Sun, Q.; Ge, Z. Deep learning for industrial KPI prediction: When ensemble learning meets semi-supervised data. IEEE Trans. Ind. Inform. 2021, 17, 260–269.
  9. Daud, M.; Saad, H.; Ijab, M. Conceptual design of human detection via deep learning for industrial safety enforcement in manufacturing site. In Proceedings of the 2021 IEEE International Conference on Automatic Control Intelligent Systems (I2CACIS), Shah Alam, Malaysia, 26 June 2021; pp. 369–373.
  10. Liu, Y.; Ma, X.; Shu, L.; Hancke, G.; Abu-Mahfouz, A. From industry 4.0 to agriculture 4.0: Current status, enabling technologies, and research challenges. IEEE Trans. Ind. Inform. 2021, 17, 4322–4334.
  11. Masrur, A.; Deo, R.; Ghahramani, A.; Feng, Q.; Raj, N.; Yin, Z.; Yang, L. New double decomposition deep learning methods for river water level forecasting. Sci. Total Environ. 2022, 831, 154722.
  12. Chen, S.-S.; Choubey, B.; Singh, V. A neural network based price sensitive recommender model to predict customer choices based on price effect. J. Retail. Consum. Serv. 2021, 61, 102573.
  13. Singh, S.; Yadav, B.; Batheri, R. Industry 4.0: Meeting the challenges of demand sensing in the automotive industry. IEEE Eng. Manag. Rev. 2023, 51, 179–184.
  14. Turay, T.; Vladimirova, T. Toward performing image classification and object detection with convolutional neural networks in autonomous driving systems: A survey. IEEE Access 2022, 10, 14076–14119.
  15. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.-L.; Chen, S.-C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. 2019, 51, 1–36.
  16. Shi, D.; Ping, W.; Khushnood, A. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379.
  17. Piccialli, F.; Di Somma, V.; Gianpaolo, F.; Cuomo, S.; Fortino, G. A survey on deep learning in medicine: Why, how and when? Inf. Fusion 2021, 66, 111–137.
  18. Marcus, G. The Next Decade in AI: Four Steps towards Robust Artificial Intelligence. 2020. Available online: https://arxiv.org/abs/2002.06177 (accessed on 22 January 2022).
  19. Ganaie, M.; Minghui, H.; Malik, A.; Tanveer, M.; Suganthan, P. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151.
  20. Dhilleswararao, P.; Boppu, S.; Manikandan, M.; Cenkeramaddi, L. Efficient hardware architectures for accelerating deep neural networks: Survey. IEEE Access 2022, 10, 131788–131828.
  21. Osypanka, P.; Nawrocki, P. Resource usage cost optimization in cloud computing using machine learning. IEEE Trans. Cloud Comput. 2022, 10, 2079–2089.
  22. Ribeiro, A.; Tiels, K.; Aguirre, L.; Schön, T. Beyond exploding and vanishing gradients: Analysing RNN training using attractors and smoothness. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 28–30 March 2020; pp. 2370–2380.
  23. Natarajan, B.; Rajalakshmi, E.; Elakkiya, R.; Kotecha, K.; Abraham, A.; Gabralla, L.A.; Subramaniyaswamy, V. Development of an end-to-end deep learning framework for sign language recognition, translation, and video generation. IEEE Access 2022, 10, 104358–104374.
  24. Choo, S.; Kim, W. A study on the evaluation of tokenizer performance in natural language processing. Appl. Artif. Intell. 2023, 37, 2175112.
  25. Oruh, J.; Viriri, S.; Adegun, A. Long short-term memory recurrent neural network for automatic speech recognition. IEEE Access 2022, 10, 30069–30079.
  26. Sairam, G.; Mandha, M.; Prashanth, P.; Swetha, P. Image captioning using CNN and LSTM. In Proceedings of the 4th Smart Cities Symposium (SCS 2021), Online, 21–23 November 2021; pp. 274–277.
  27. Apple Inc. Speech and Natural Language Processing: Voice Trigger System for Siri. 2023. Available online: https://machinelearning.apple.com/research/voice-trigger/ (accessed on 30 December 2023).
  28. NLP Architect by Intel® AI Lab. Compression of Google Neural Machine Translation Model. 2023. Available online: https://intellabs.github.io/nlp-architect/sparse_gnmt.html (accessed on 30 December 2023).
  29. Khurana, D.; Koli, A.; Khatter, K.; Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 2023, 82, 3713–3744.
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–15.
  31. Coccomini, D.; Messina, N.; Gennaro, C.; Falchi, F. Combining efficientNet and vision transformers for video deepfake detection. In Image Analysis and Processing—ICIAP 2022; Sclaroff, S., Distante, C., Leo, M., Farinella, G., Tombari, F., Eds.; Springer: Cham, Switzerland, 2022; pp. 219–229.
  32. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677.
  33. Ma, J.; Xiong, G.; Xu, J.; Chen, X. CVTNet: A cross-view transformer network for LiDAR-based place recognition in autonomous driving environments. IEEE Trans. Ind. Inform. 2023, 1–10, early access.
  34. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802.
  35. Yao, H.; Luo, W.; Yu, W.; Zhang, X.; Qiang, Z.; Luo, D.; Shi, H. Dual-attention transformer and discriminative flow for industrial visual anomaly detection. IEEE Trans. Autom. Sci. Eng. 2023, 1–15, early access.
  36. Dalmaz, O.; Yurt, M.; Çukur, T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans. Med. Imaging 2022, 41, 2598–2614.
  37. Xie, Y.; Zhang, J.; Xia, Y.; van den Hengel, A.; Wu, Q. ClusTR: Exploring Efficient Self-Attention via Clustering for Vision Transformers. arXiv 2022, arXiv:2208.13138.
  38. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  39. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv 2020, arXiv:1703.10593.
  40. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
  41. Van Den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1747–1756.
  42. Abbasnejad, M.E.; Dick, A.; van den Hengel, A. Infinite variational autoencoder for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  43. Lin, S.; Zhang, M.; Cheng, X.; Shi, L.; Gamba, P.; Wang, H. Dynamic low-rank and sparse priors constrained deep autoencoders for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–18.
  44. Cheng, X.; Zhang, M.; Lin, S.; Li, Y.; Wang, H. Deep Self-Representation Learning Framework for Hyperspectral Anomaly Detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–16.
  45. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551.
  46. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  47. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  48. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  49. Haque, M.; Lim, H.; Kang, D. Object Detection Based on VGG with ResNet Network. In Proceedings of the 2019 International Conference on Electronics, Information, and Communication (ICEIC), Auckland, New Zealand, 22–25 January 2019; pp. 1–3.
  50. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
  51. Kateb, Y.; Meglouli, H.; Khebli, A. Coronavirus Diagnosis Based on Chest X-ray Images and Pre-Trained DenseNet-121. Rev. D’Intell. Artif. 2023, 37, 23.
  52. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  53. Çınar, A.; Yıldırım, M.; Eroğlu, Y. Classification of pneumonia cell images using improved ResNet50 model. Trait. Signal 2021, 38, 165–173.
  54. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–23 June 2018; pp. 4510–4520.
  55. Kaya, Y.; Gürsoy, E. A MobileNet-based CNN model with a novel fine-tuning mechanism for COVID-19 infection detection. Soft Comput. 2023, 27, 5521–5535.
  56. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
  57. Naseer, I.; Akram, S.; Masood, T.; Rashid, M.; Jaffar, A. Lung cancer classification using modified U-Net based lobe segmentation and nodule detection. IEEE Access 2023, 11, 60279–60291.
  58. Morris, D.; Joppa, L. Challenges for the computer vision community. In Conservation Technology; Serge, A.W., Alex, K.P., Eds.; Oxford Academic: Oxford, UK, 2021; pp. 225–238.