1. Introduction
Industry 4.0 characterizes the transformation from traditional automation to engineered cyber-physical systems with human-like intelligence, including sensing, problem solving, and other cognitive capabilities. This places current Artificial Intelligence (AI) at the center of Industry 4.0. Over the last decade (2013–2023), the leading branch of AI has been deep learning (DL)
[1][2]. DL is a key subfield of machine learning (ML) characterized by the layered structure of artificial neural networks (ANNs). Each layer extracts useful knowledge for making decisions or predictions in self-supervised, semi-supervised, supervised, and unsupervised learning problems
[3][4][5]. In this sense, DL has become the main driver of the current AI hype. Moreover, DL has achieved tremendous success in a wide range of tasks that have historically been extremely difficult for computers
[1], leading to more AI models with human-level performance, as pointed out in
This allows manufacturing companies to make intelligent use of the large volumes of data generated in the industrial business environment
[6].
Currently, many industry sectors are undergoing a profound transformation and integrating DL models into their solutions
[7][8][9]. For example, DL is used in the nuclear energy industry to predict cracking in hazardous areas. In the agriculture industry
[10], DL is used to analyze historical rainfall patterns, wind direction, and atmospheric pressure to predict storms and river water levels
[11]. In the manufacturing industry, it is used for predictive maintenance. In the food industry, it is used to understand current consumer preferences and behavior
[12]. In the automotive sector
[13], DL is used to guide autonomous vehicles
[14]. In the medical industry, it is used to diagnose and predict illnesses
[15][16][17]. This increase in popularity is mainly due to three factors: (1) the rapid evolution of highly parallel hardware, (2) the development of open-source platforms for ML, and (3) the predominance of DL models in terms of accuracy and their flexibility to represent the world with concepts ranging from the simplest to the most complex. All of this is powered by vast amounts of data
[18][19].
Given the popularity of DL, specialized hardware has become necessary because general-purpose processors are not designed for training and running inference with DL models. Graphics Processing Units (GPUs) have emerged as the ideal complement to Central Processing Units (CPUs) for bringing intelligence to applications. In addition, neural network accelerators based on Field-Programmable Gate Arrays (FPGAs) are favored over CPUs because they accelerate computation by mapping calculations onto parallel hardware
[20]. Furthermore, the cloud has become mainstream and inexpensive. Consequently, industry, academia, and government are making data that was previously stored locally available to the research community. Because DL models are complex and require large amounts of data, the availability of data sets significantly impacts AI research: scientists can now access large, diverse data sets to train complex DL models. As a result, problems such as detecting cancer or predicting rainfall can be solved quickly and accurately thanks to high-performance computers and vast data. In addition, the available open-source platforms allow models to be designed, modified, and tested quickly, and help with deployment
[2][21].
Despite this importance, DL models are often treated as black boxes whose components, or building blocks, are unclear and difficult to understand. Much of the research is based on the direct application of deep networks that have already been developed: authors rarely consider modifications or discuss new network designs for a specific type of input signal. Furthermore, explanations of the building blocks needed to design new models or modify existing ones are limited, shallow, and scattered across the literature.
2. Deep Learning Building Blocks
In the past, computer applications were developed from a single data-processing perspective. Today, applications have shifted toward ML, with DL as one of the leading research trends, whose success is largely due to convolutional neural networks (CNNs). DL algorithms can be classified into three groups: (i) recurrent neural networks (RNNs), (ii) multilayer perceptrons (MLPs), and (iii) CNNs.
RNNs are neural networks whose hidden states feed back into the network, allowing past outputs to be used as inputs. A particular type of RNN is the long short-term memory (LSTM) network, which handles the vanishing gradient problem that plain RNNs face (a minimal example is sketched at the end of this paragraph)
[22]. RNNs are commonly used for sequential data or time series, as in language translation
[23], natural language processing (NLP)
[24], speech recognition
[25], and image captioning
[26]. Moreover, RNNs are included in popular applications such as Siri
[27] and Google Translate
[28]. Currently, some challenges
[29] that NLP is addressing with DL are (a) contextual words and homonyms, for example, the same word can have different meanings depending on the context of a sentence; (b) irony and sarcasm, for example, a sentence that communicates the opposite of what is literally said; (c) ambiguity in phrases that can have two or more interpretations; (d) errors in text and speech, for example, misspelled words; (e) slang and colloquialisms; and (f) domain-specific language, for example, the model used to process a legal document differs from that used to process a healthcare document.
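To make the recurrent building block concrete, the short sketch below shows how an LSTM layer can summarize a sequence for a downstream prediction. This is a minimal illustration, assuming PyTorch as the framework (the entry does not prescribe one); the layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Toy LSTM model: the hidden state carries information across time steps."""
    def __init__(self, input_size=8, hidden_size=32, num_classes=2):
        super().__init__()
        # LSTM gating mitigates the vanishing-gradient problem of plain RNNs.
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, features); h_n is the hidden state after the last step.
        output, (h_n, c_n) = self.lstm(x)
        # Use the final hidden state as a summary of the whole sequence.
        return self.fc(h_n[-1])

model = SequenceClassifier()
x = torch.randn(4, 20, 8)   # batch of 4 sequences, 20 time steps, 8 features each
logits = model(x)
print(logits.shape)          # torch.Size([4, 2])
```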
An MLP consists of fully connected (FC) layers with nonlinear activation functions that can separate nonlinearly separable data. MLPs have been successfully used in tasks such as pattern classification, recognition, prediction, and approximation. Transformers are among the newest and most powerful MLP-based architectures; they track relationships in sequential data, such as words, to learn the context and meaning of the data
[30].
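The following minimal sketch, assuming PyTorch, illustrates the MLP building block: stacked FC layers whose nonlinear activations make it possible to separate the classic nonlinearly separable XOR data. All hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# Stacked FC layers with nonlinear activations; without the activations the
# stack would collapse into a single linear map and could not separate XOR.
mlp = nn.Sequential(
    nn.Linear(2, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 1),                  # one logit for binary classification
)

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR inputs
y = torch.tensor([[0.], [1.], [1.], [0.]])                   # XOR labels
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(mlp.parameters(), lr=0.01)

for _ in range(2000):                 # a short loop suffices for this toy problem
    opt.zero_grad()
    loss = loss_fn(mlp(x), y)
    loss.backward()
    opt.step()

print(torch.sigmoid(mlp(x)).round())  # typically recovers 0, 1, 1, 0
```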
Vision transformers (ViTs) are the transformer version for computer vision applications [31][32]. In ViTs, images are split into patches, which are serialized into vectors and mapped to a smaller dimension. The resulting sequence of vectors is then processed by a transformer (the patch-embedding step is sketched at the end of this paragraph). These algorithms have been employed for autonomous driving
[33], image classification, object detection, image segmentation
[34], video deepfake detection
[31], anomaly detection
[35], image synthesis
[36], and cluster analysis [37].
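As an illustration of the patch-splitting step described above, the sketch below (assuming PyTorch; patch size and embedding dimension are illustrative) turns an image into the sequence of patch embeddings that a transformer encoder would then process.

```python
import torch
import torch.nn as nn

def image_to_patch_embeddings(img, patch=16, dim=64):
    """Split an image into non-overlapping patches, flatten each patch,
    and project it to a lower-dimensional embedding (the ViT input step)."""
    B, C, H, W = img.shape
    # unfold extracts the patch grid along the height and width dimensions
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.contiguous().view(B, C, -1, patch, patch)
    patches = patches.permute(0, 2, 1, 3, 4).flatten(2)  # (B, n_patches, C*patch*patch)
    proj = nn.Linear(C * patch * patch, dim)  # in a real ViT this is a trained module
    return proj(patches)                      # (B, n_patches, dim)

tokens = image_to_patch_embeddings(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 64]); a transformer encoder consumes this
```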
Generative adversarial networks (GANs) are MLP extensions of stacked FC layers that generate new data with the same statistics as the training set. A GAN consists of two parts: a generator and a discriminator. The generator models the training data distribution and provides a compressed image representation, and the discriminator is a binary classifier that decides between real and fake (a minimal sketch follows this discussion)
[38]. There are several applications of GANs. For example, CycleGAN is a GAN architecture trained with an unsupervised image-translation technique to learn transformations between images of different styles
[39]. StyleGAN is a GAN built from a stack of FC layers that generates high-resolution images: the initial layers generate low-resolution images, and later layers refine the resolution
[40]. PixelRNN is an auto-regressive generative model capable of learning an explicit data distribution, whereas GANs learn implicit probability distributions [41].
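To make the generator and discriminator roles described above concrete, here is a minimal FC GAN sketch, assuming PyTorch; the dimensions are illustrative, and the alternating adversarial training loop is omitted.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784   # e.g., 28x28 images flattened into vectors

# Generator: maps latent noise to a synthetic sample.
G = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Tanh(),
)

# Discriminator: binary classifier deciding between real and fake.
D = nn.Sequential(
    nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),           # one logit: real vs. fake
)

z = torch.randn(8, latent_dim)   # a batch of noise vectors
fake = G(z)                      # after training, these mimic the data statistics
decision = D(fake)               # logits the adversarial loss would use
```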
Autoencoders (AEs) and variational autoencoders (VAEs) are generative models explicitly designed to capture the probability distribution of a training set and generate new samples
[42]. An encoder is made of FC layers whose dimension decreases with depth. This compresses the input data into an encoded representation, often several orders of magnitude smaller than the input, that forms the latent space of variables. A decoder takes a representation from the latent space and decompresses it, with the dimension of the FC layers increasing toward the output (a minimal sketch follows this paragraph). When combined with prior knowledge, AE-based models have proven successful for anomaly detection in hyperspectral images. For example, in
[43], a dynamic low-rank and sparse prior-constrained model was developed that combines a linear low-rank model, a sparse model, and a nonlinear deep AE to detect anomalies and to extract discriminative features between background and anomaly in complex scenes. In
[44], a deep self-representation learning framework for hyperspectral anomaly detection was proposed. The model integrates the prior knowledge of robust principal component analysis (PCA) and the local spatial information into the AE model for a result that outperforms state-of-the-art methods.
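The following minimal sketch, assuming PyTorch, shows the encoder/decoder structure described above: FC layers shrink toward the latent space and then expand back toward the input dimension. The dimensions are illustrative; training would minimize the reconstruction error.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """FC autoencoder: the encoder shrinks the dimension toward the latent
    space; the decoder mirrors it back toward the input dimension."""
    def __init__(self, in_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),       # compressed latent representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent variables
        return self.decoder(z)   # reconstruction of the input

ae = Autoencoder()
x = torch.randn(4, 784)
x_hat = ae(x)
# Training minimizes the reconstruction error, e.g. nn.MSELoss()(x_hat, x);
# for anomaly detection, samples with a large error are flagged as anomalous.
```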
CNNs are stacks of layers composed of the convolution of trainable filters with the input signal or receptive field, followed by pooling and activation layers (a minimal sketch is given at the end of this paragraph). The filtering extracts features from previous layers to form feature maps. CNNs are used for classification and prediction in computer vision tasks. For example, the LeNet-5
[45][46], the AlexNet
[47], the VGG-16
[48][49], the DenseNet121
[50][51], the ResNet50
[52][53], and the MobileNet-V2
[54][55] have been used for classification tasks, while the U-Net
[56][57] has been used for semantic segmentation problems. At present, some challenges addressed by computer vision
[58] through DL are (a) image and video synthesis to create realistic images and videos for content creation and entertainment, (b) image style transfer to merge the artistic style of one image with the content of another, (c) text-to-image synthesis to extract meaning from the text description and convert it into an image for image editing, (d) enhancing the capabilities of autonomous vehicles to more precisely handle difficult driving scenarios, (e) detecting early signs of diseases before the symptoms appear, (f) identifying suspicious behavior or objects more accurately for security purposes, and (g) making DL models more interpretable, especially in applications where human lives, safety, and ethics are involved.
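As a concrete illustration of the convolution, activation, and pooling stack, the sketch below loosely follows the LeNet-5 pattern mentioned above. It assumes PyTorch, and the layer sizes are illustrative rather than taken from any of the cited architectures.

```python
import torch
import torch.nn as nn

# Minimal CNN in the LeNet-5 spirit: trainable convolution filters produce
# feature maps, followed by activation and pooling; FC layers then classify.
cnn = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 6 trainable 5x5 filters -> 6 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling halves the spatial resolution
    nn.Conv2d(6, 16, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 10),        # 10-class output, e.g., digit labels
)

x = torch.randn(1, 1, 28, 28)         # one grayscale 28x28 image
print(cnn(x).shape)                    # torch.Size([1, 10])
```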
In summary, it is worth noting that works in the literature have addressed only one side, focusing on a single application or topic, such as reviews of CNN architectures. These works do not provide a complete understanding of DL topics, such as the concepts and mathematics behind the building blocks used to develop an architecture.
This entry is adapted from the peer-reviewed paper 10.3390/math12020296