Embedded machine learning (EML) can be applied in areas such as accurate computer vision, reliable speech recognition, innovative healthcare, robotics, and more. However, efficiently implementing ML algorithms for embedded applications remains a critical challenge. Machine learning algorithms are generally computationally and memory intensive, making them unsuitable for resource-constrained environments such as embedded and mobile devices. To efficiently implement these compute- and memory-intensive algorithms within the embedded and mobile computing space, innovative optimization techniques are required at both the algorithm and hardware levels.
1. Introduction
Machine learning is a branch of artificial intelligence that describes techniques through which systems learn from available data and make intelligent decisions. Machine learning techniques can be classified into three major groups, namely supervised learning, unsupervised learning, and reinforcement learning, as described in Table 1. In supervised learning, a model learns from labeled data; in unsupervised learning, hidden patterns are discovered from unlabeled data; and in reinforcement learning, a system learns from its immediate environment through trial and error [1][2][3]. The process of learning is referred to as the training phase of the model and is often carried out on computer architectures with high computational resources, such as multiple GPUs. After learning, the trained model is used to make intelligent decisions on new data. This process is referred to as the inference phase of the implementation. Inference is often intended to be carried out on user devices with low computational resources, such as IoT and mobile devices.
Table 1. Machine learning techniques.
| Supervised Learning (Classification) | Supervised Learning (Regression) | Unsupervised Learning (Clustering) | Reinforcement Learning |
| --- | --- | --- | --- |
| SVM | SVR | HMM | Genetic Algorithms |
| Naïve Bayes | Linear Regression | GMM | Estimated Value Functions |
| k-NN | Decision Trees | k-means | Simulated Annealing |
| Logistic Regression | ANN | DNN | |
| Discriminant Analysis | Ensemble Methods | | |
| DNN | DNN | | |
In recent times, machine learning techniques have found useful applications in various research areas, particularly in embedded computing systems. In this research, we surveyed recent literature concerning machine learning techniques implemented within resource-scarce environments, such as mobile and other IoT devices, between 2014 and 2021. We present the results of this survey in Table 2. Our survey revealed that of all available machine learning techniques, SVMs, GMMs, DNNs, k-NNs, HMMs, decision trees, logistic regression, k-means, and naïve Bayes are the techniques most commonly adopted for embedded and mobile applications. Naïve Bayes and decision trees have low computation and memory costs and thus do not require innovative optimizations, as pointed out by Sayali and Channe [37]. Logistic regression is computationally even cheaper than naïve Bayes and decision trees [38]. HMMs, k-NNs, SVMs, GMMs, and DNNs, however, are computationally and memory intensive and hence require novel optimization techniques before they can be efficiently squeezed into resource-limited environments. We have thus limited our focus to these compute-intensive ML models and discuss state-of-the-art optimization techniques through which these algorithms may be efficiently implemented within resource-constrained environments.
Table 2. Machine Learning Techniques in Resource-Constrained Environments.
| Reference | ML Method | Embedded/Mobile Platform | Application | Year |
| --- | --- | --- | --- | --- |
| [4] | SVM | ARMv7, IBM PPC440 | Network Configuration | 2015 |
| [5] | DNN | FPGA Zedboard with 2 ARM Cortex Cores | Character Recognition | 2015 |
| [6] | DNN | Xilinx FPGA board | Image classification | 2016 |
| [7] | LSTM RNN | Zynq 7020 FPGA | Character Prediction | 2016 |
| [8] | CNN | VC707 Board with Xilinx FPGA chip | Image Classification | 2015 |
| [9] | GMM | Raspberry Pi | Integer processing | 2014 |
| [10] | k-NN, SVM | Mobile Device | Fingerprinting | 2014 |
| [11] | k-NN | Mobile Device | Fingerprinting | 2014 |
| [12] | k-NN, GMM | Mobile Device | Mobile Device Identification | 2015 |
| [13] | SVM | Xilinx Virtex 7 XC7VX980 FPGA | Histopathological image classification | 2015 |
| [14] | HMM | Nvidia Kepler | Speech Recognition | 2015 |
| [15] | Logistic Regression | Smart band | Stress Detection | 2015 |
| [16] | k-means | Smartphone | Indoor Localization | 2015 |
| [17] | Naïve Bayes | AVR ATmega-32 | Home Automation | 2015 |
| [18] | k-NN | Smartphone | Image Recognition | 2015 |
| [19] | Decision Tree | Mobile Device | Health Monitoring | 2015 |
| [20] | GMM | FRDM-K64F equipped with ARM Cortex-M4F core | IoT sensor data analysis | 2016 |
| [21] | CNN | FPGA Xilinx Zynq ZC706 Board | Image Classification | 2016 |
| [22] | CNN | Mobile Device | Mobile Sensing | 2016 |
| [23] | SVM | Mobile Device | Fingerprinting | 2016 |
| [24] | k-NN, SVM | Mobile Device | Fingerprinting | 2016 |
| [25] | k-NN | Xilinx Virtex-6 FPGA | Image Classification | 2016 |
| [26] | HMM | Arduino UNO | Disease detection | 2016 |
| [27] | Logistic Regression | Wearable Sensor | Stress Detection | 2016 |
| [28] | Naïve Bayes | Smartphone | Health Monitoring | 2016 |
| [29] | Naïve Bayes | Mobile Devices | Emotion Recognition | 2016 |
| [30] | k-NN | Smartphone | Data Mining | 2016 |
| [31] | HMM | Smartphone Sensors | Activity Recognition | 2017 |
| [32] | DNN | Smartphone | Face detection, activity recognition | 2017 |
| [33] | CNN | Mobile Device | Image classification | 2017 |
| [34] | SVM | Mobile Device | Mobile Device Identification | 2017 |
| [35] | SVM | Jetson-TK1 | Healthcare | 2017 |
| [36] | SVM, Logistic Regression | Arduino UNO | Stress Detection | 2017 |
| [37] | Naïve Bayes | Smartphone | Emotion Recognition | 2017 |
| [38] | k-means | Smartphones | Safe Driving | 2017 |
| [39] | HMM | Mobile Device | Health Monitoring | 2017 |
| [40] | k-NN | Arduino UNO | Image Classification | 2017 |
| [41] | SVM | Wearable Device (nRF51822 SoC+BLE) | Battery Life Management | 2018 |
| [42] | SVM | Zybo Board with Z-7010 FPSoC | Face Detection | 2018 |
| [43] | CNN | Raspberry Pi + Movidius Neural Compute Stick | Vehicular Edge Computing | 2018 |
| [44] | CNN | Jetson TX2 | Image Classification | 2018 |
| [45] | HMM | Smartphone | Healthcare | 2018 |
| [46] | k-NN | Smartphone | Health Monitoring | 2019 |
| [47] | Decision Trees | Arduino UNO | Wound Monitoring | 2019 |
| [48] | RNN | ATmega640 | Smart Sensors | 2019 |
| [49] | SVM, Logistic Regression, k-means, CNN | Raspberry Pi | Federated Learning | 2019 |
| [50] | DNN | Raspberry Pi | Transient Reduction | 2020 |
| [51] | MLP | Embedded SoC (ESP4ML) | Classification | 2020 |
| [52] | HMM | Smartphone | Indoor Localization | 2020 |
| [53] | k-NN | Smartphone | Energy Management | 2020 |
| [54] | ANN, Decision Trees | Raspberry Pi | Classification and Regression | 2021 |
2. Challenges and Optimization Opportunities in Embedded Machine Learning
Embedded computing systems are generally limited in terms of available computational power and memory. Furthermore, they are required to consume very low power and to meet real-time constraints. Thus, for computationally intensive machine learning models to be executed efficiently in the embedded systems space, appropriate optimizations are required at both the hardware architecture and algorithm levels [55][56]. In this section, we survey optimization methods that tackle bottlenecks in power consumption, memory footprint, latency, throughput, and accuracy loss.
2.1. Power Consumption
The total energy consumed by an embedded computing application is the sum of the energy required to fetch data from the available memory storage and the energy required to perform the necessary computation in the processor.
Table 3 shows the energy required to perform different operations in an ASIC. It can be observed from Table 3 that the energy required to fetch data from on-chip SRAM is much lower than the energy required to fetch data from off-chip DRAM, and it is minimal when data are kept in the register files. From this insight, we can conclude that computation should be done as close to the processor as possible to save energy. However, this is a bottleneck because the standard size of on-chip memory available in embedded architectures is very small compared to the size of deep learning models [57]. Algorithm-based optimization techniques for model compression, such as parameter pruning, sparsity, and quantization, may be applied to address this challenge [58]. Also, hardware design-based optimizations such as tiling and data reuse may be utilized [8]. The next section expatiates on some of these optimization methods in further detail. Furthermore, most machine learning models, especially deep learning models, require huge numbers of multiply-and-accumulate (MAC) operations for effective training and inference. Figure 1 describes the power consumed by the MAC unit as a function of the bit precision adopted by the system. We may observe that the higher the number of bits, the higher the power consumed. Thus, to reduce the power consumed during computation, reduced bit precision arithmetic and data quantization may be utilized [59].
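To make this concrete, the sketch below shows one common algorithm-level technique, post-training symmetric quantization to 8-bit integers, so that the MAC operations run in cheap integer arithmetic rather than 32-bit floating point. This is a minimal NumPy illustration with made-up layer dimensions, not an implementation from any of the surveyed works; per-channel scales and activation calibration are omitted for brevity.

```python
import numpy as np

def quantize_symmetric(w, n_bits=8):
    """Map float values to signed n-bit integers with a single scale factor."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax          # per-tensor scale (per-channel is also common)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Toy fully connected layer y = W @ x (dimensions are hypothetical).
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=128).astype(np.float32)

Wq, w_scale = quantize_symmetric(W)
xq, x_scale = quantize_symmetric(x)

# MACs now run in integer arithmetic; a single rescale recovers the float result.
y_int = Wq.astype(np.int32) @ xq.astype(np.int32)
y_approx = y_int * (w_scale * x_scale)

print("max abs error vs. float32:", float(np.max(np.abs(W @ x - y_approx))))
```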

Figure 1. This graph describes the energy consumption and prediction accuracy of a DNN as a function of the arithmetic precision adopted for a single MAC unit in a 45 nm CMOS process [57]. It may be deduced from the graph that lower precisions consume less power than higher precisions, with little or no loss in prediction accuracy. However, when the precision is reduced below a particular threshold (16-bit floating point), the accuracy of the model is greatly affected. Thus, quantization may be performed to conserve energy, but quantizing below 16-bit floating point may require retraining and fine-tuning to restore the accuracy of the model.
Table 3. Energy consumption (pJ) of performing various operations.
| Operation | Energy (pJ) |
| --- | --- |
| 8 bit int ADD | 0.03 |
| 16 bit int ADD | 0.05 |
| 32 bit int ADD | 0.1 |
| 16 bit float ADD | 0.4 |
| 32 bit float ADD | 0.9 |
| 8 bit MULT | 0.2 |
| 32 bit MULT | 3.1 |
| 16 bit float MULT | 1.1 |
| 32 bit float MULT | 3.7 |
| 32 bit SRAM READ | 5.0 |
| 32 bit DRAM READ | 640 |
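To illustrate how the costs in Table 3 combine, the following back-of-envelope sketch estimates the energy of a small fully connected layer when its weights are re-fetched from off-chip DRAM for every MAC versus when they are read from on-chip SRAM. The layer size is hypothetical, and the model ignores caches, burst transfers, and the one-time cost of filling the SRAM; it is only meant to show that data movement, not arithmetic, dominates.

```python
# Back-of-envelope energy estimate (picojoules) using the per-operation costs
# from Table 3. The workload below is hypothetical.
ENERGY_PJ = {
    "fp32_add": 0.9,
    "fp32_mult": 3.7,
    "sram_read_32b": 5.0,
    "dram_read_32b": 640.0,
}

n_macs = 256 * 256  # hypothetical fully connected layer: 256 inputs x 256 outputs
compute = n_macs * (ENERGY_PJ["fp32_mult"] + ENERGY_PJ["fp32_add"])

# Case 1: every weight is fetched from off-chip DRAM for each MAC.
dram_bound = compute + n_macs * ENERGY_PJ["dram_read_32b"]
# Case 2: weights are held in on-chip SRAM and read from there instead
# (ignoring the one-time cost of staging them into SRAM).
sram_bound = compute + n_macs * ENERGY_PJ["sram_read_32b"]

print(f"weights in DRAM: {dram_bound / 1e6:.2f} uJ")
print(f"weights in SRAM: {sram_bound / 1e6:.2f} uJ")
```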
2.2. Memory Footprint
The available on-chip and off-chip memory in embedded systems is very limited compared to the size of ML parameters (synapses and activations) [60]. Thus, storing model parameters and activations within this constrained memory is a bottleneck. Network pruning (removing redundant parameters) [58] and data quantization (reducing the number of bits used to represent model parameters) [59] are the primary optimization techniques adopted to significantly compress the overall model so that it can fit into the standard memory sizes of embedded computers.
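As a rough illustration of how pruning shrinks the memory footprint, the sketch below applies magnitude-based pruning to a hypothetical weight matrix and compares dense storage against a simple value-plus-index sparse layout. Real deployments would typically use formats such as CSR and would fine-tune the pruned model to recover accuracy.

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.8):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)  # hypothetical layer weights

W_pruned, mask = prune_by_magnitude(W, sparsity=0.8)

dense_bytes = W.size * 4      # dense storage: 32 bits per weight
nnz = int(mask.sum())         # weights that survived pruning
sparse_bytes = nnz * (4 + 2)  # 32-bit value + 16-bit index per surviving weight

print(f"kept {nnz}/{W.size} weights: "
      f"{dense_bytes / 1024:.0f} KiB dense -> {sparse_bytes / 1024:.0f} KiB sparse")
# In practice, the pruned model is fine-tuned afterwards to recover lost accuracy.
```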
2.3. Latency and Throughput Concerns
Embedded systems are required to meet real-time deadlines. Thus, latency and overall throughput can be a major concern, as an inability to meet these tight constraints could sometimes result in devastating consequences. The parameters of deep learning models are very large and are often stored off-chip or on external SD cards, which introduces latency concerns. Latency results from the time required to fetch model parameters from off-chip DRAM or external SD cards before the appropriate computation can be performed on them [58]. Thus, keeping the parameters as close as possible to the computation unit through tiling, data reuse, and hardware-oriented direct memory access (DMA) optimization techniques reduces latency and thereby enables high computation speed [61]. In addition, because ML models require a high level of parallelism for efficient performance, throughput is a major issue. Memory throughput can be optimized by introducing pipelining [5].
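A software analogue of tiling and data reuse is sketched below: a blocked matrix multiply in which each tile of the operands is loaded once into a small local buffer and reused across an entire output block, which is the access pattern a DMA-fed on-chip buffer would exploit. The matrix sizes and tile size are arbitrary illustrative choices.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: each tile of A and B is 'loaded' once into a small
    local buffer and reused for a whole output block, mimicking on-chip buffering
    and data reuse on an accelerator."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            acc = np.zeros((min(tile, n - i0), min(tile, m - j0)), dtype=A.dtype)
            for p0 in range(0, k, tile):
                a_blk = A[i0:i0 + tile, p0:p0 + tile]  # stand-in for a DMA into a local buffer
                b_blk = B[p0:p0 + tile, j0:j0 + tile]
                acc += a_blk @ b_blk                   # both tiles are reused many times here
            C[i0:i0 + tile, j0:j0 + tile] = acc
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(96, 80)), rng.normal(size=(80, 64))
assert np.allclose(tiled_matmul(A, B), A @ B)
```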
2.4. Prediction Accuracy
Although deep learning models are tolerant of low bit precision [62], reducing the bit precision below a certain threshold can significantly affect the prediction accuracy of these models and introduce considerable errors, which could be costly for the embedded application. To address the errors that model compression techniques such as reduced precision or quantization introduce, the compressed model can be retrained or fine-tuned to restore its prediction accuracy [57][58][63][64].
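The interplay between precision and accuracy can be reproduced in miniature with the following self-contained sketch: a tiny logistic-regression model is trained on synthetic data, its weights are quantized to progressively fewer bits, and the resulting accuracy is measured. All data, dimensions, and hyperparameters here are hypothetical; the point is only that accuracy typically holds at moderate precision and degrades below some threshold, at which stage retraining or fine-tuning would be applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary classification task (all data hypothetical).
X = rng.normal(size=(1000, 8))
true_w = rng.normal(size=8)
y = (X @ true_w + 0.1 * rng.normal(size=1000) > 0).astype(np.float64)

# Train a small logistic-regression model in full precision.
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid
    w -= 0.1 * X.T @ (p - y) / len(y)     # gradient step on the logistic loss

def quantize(v, n_bits):
    """Uniform symmetric quantization of a weight vector, de-quantized for evaluation."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(v)) / qmax
    return np.round(v / scale) * scale

def accuracy(weights):
    return float(np.mean(((X @ weights) > 0) == y))

print(f"full-precision accuracy: {accuracy(w):.3f}")
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {accuracy(quantize(w, bits)):.3f}")
# A model that degrades at low precision would, in practice, be retrained or
# fine-tuned with quantization in the loop to recover the lost accuracy.
```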
2.5. Some Hardware-Oriented and Algorithm-Based Optimization Techniques
Hardware acceleration units may be designed using custom FPGAs or ASICs to achieve low latency and high throughput. These designs may optimize data access from external memory and/or introduce an efficient pipeline structure using buffers to increase the throughput of the architecture. In sum, some hardware-based optimization techniques are presented in this section to guide computer architects in designing and developing highly efficient acceleration units for high performance.
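One such technique, double buffering (a ping-pong buffer feeding a pipeline), can be mimicked in software as sketched below: a producer thread stands in for a DMA engine filling the next tile while the consumer computes on the current one, hiding transfer latency behind computation. The tile sizes, delays, and kernel are made up for illustration; a real accelerator would implement this with on-chip BRAM buffers and a hardware DMA controller.

```python
import queue
import threading
import time

import numpy as np

# A producer thread stands in for a DMA engine: it fetches the next tile while
# the consumer computes on the current one. The bounded queue of size 2 plays
# the role of a ping-pong (double) buffer.
tiles = queue.Queue(maxsize=2)
N_TILES = 8

def dma_producer():
    rng = np.random.default_rng(0)
    for _ in range(N_TILES):
        time.sleep(0.01)                      # pretend off-chip transfer latency
        tiles.put(rng.normal(size=(64, 64)))  # blocks if both buffers are already full
    tiles.put(None)                           # sentinel: no more tiles

threading.Thread(target=dma_producer, daemon=True).start()

results = []
while (tile := tiles.get()) is not None:
    results.append(float(np.linalg.norm(tile @ tile.T)))  # stand-in for the real kernel

print(f"processed {len(results)} tiles")
```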