Embedded machine learning (EML) can be applied in areas such as accurate computer vision, reliable speech recognition, innovative healthcare, robotics, and more. However, efficiently implementing ML algorithms within embedded applications remains a critical challenge. Machine learning algorithms are generally computationally and memory intensive, making them unsuitable for resource-constrained environments such as embedded and mobile devices. To implement these compute- and memory-intensive algorithms efficiently within the embedded and mobile computing space, innovative optimization techniques are required at both the algorithm and hardware levels.
1. Introduction
Machine learning is a branch of artificial intelligence that describes techniques through which systems learn from available data and make intelligent decisions. Machine learning techniques can be classified into three major groups: supervised learning, unsupervised learning, and reinforcement learning, as described in Table 1. In supervised learning, a system learns from labeled data; in unsupervised learning, hidden patterns are discovered in unlabeled data; and in reinforcement learning, a system learns from its immediate environment through trial and error [1][2][3]. The process of learning is referred to as the training phase of the model and is often carried out on computer architectures with high computational resources, such as multi-GPU servers. After learning, the trained model is then used to make intelligent decisions on new data; this is referred to as the inference phase of the implementation. Inference is often intended to be carried out on user devices with low computational resources, such as IoT and mobile devices.
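The split between a resource-rich training phase and a lightweight on-device inference phase can be sketched as follows. This is an illustrative toy example (the model, data, and function names are not from the text): a small logistic-regression model is trained offline, and only its learned parameters are shipped to the device, where prediction is a handful of multiplies and adds.

```python
import numpy as np

# --- Training phase: runs offline on a resource-rich machine (e.g., a GPU server) ---
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))               # toy feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy labels

w, b = np.zeros(2), 0.0
for _ in range(500):                        # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

# --- Deployment: only the learned parameters ship to the device ---
params = {"w": w.tolist(), "b": b}          # e.g., serialized into firmware

# --- Inference phase: cheap, suitable for an embedded/mobile device ---
def predict(x, params):
    z = sum(wi * xi for wi, xi in zip(params["w"], x)) + params["b"]
    return 1 if z > 0 else 0

print(predict([1.0, 1.0], params))    # positive-sum input -> class 1
print(predict([-1.0, -1.0], params))  # negative-sum input -> class 0
```

Note that the inference function needs no training data and no NumPy at all, which is exactly what makes this phase deployable on constrained hardware.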
Table 1. Machine learning techniques.

| Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|
| Classification and regression: SVM, SVR, HMM, Naïve Bayes, Linear Regression, Logistic Regression, k-NN, Decision Trees, ANN, DNN, Discriminant Analysis, Ensemble Methods | Clustering: k-means, GMM | Network configuration: Genetic Algorithms, Estimated Value Functions, Simulated Annealing, DNN |
In recent times, machine learning techniques have been finding useful applications in various research areas, and particularly in embedded computing systems. In this research, we surveyed recent literature on machine learning techniques implemented within resource-scarce environments, such as mobile and other IoT devices, between 2014 and 2020, and we present the results in Table 2. The survey revealed that, of all available machine learning techniques, k-NNs, HMMs, SVMs, GMMs, DNNs, decision trees, logistic regression, k-means, and naïve Bayes are the techniques most commonly adopted for embedded and mobile applications. Naïve Bayes and decision trees have low complexity in terms of computation and memory costs and thus do not require innovative optimizations, as pointed out by Sayali and Channe [37]. Logistic regression is computationally cheaper still [38]. HMMs, k-NNs, SVMs, GMMs, and DNNs, however, are computationally and memory intensive and hence require novel optimization techniques before they can be efficiently squeezed into resource-limited environments. We have thus limited our focus to these compute-intensive ML models and discuss state-of-the-art optimization techniques through which they may be efficiently implemented within resource-constrained environments.
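The complexity contrast can be made concrete with a toy comparison (all data and sizes here are illustrative, not from the survey): a Gaussian naïve Bayes classifier stores only per-class statistics, whereas k-NN must keep the entire training set on the device and scan all of it at every query.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 8))          # toy training set: 10k samples, 8 features
y = (X.sum(axis=1) > 0).astype(int)

# Gaussian naive Bayes: on-device state is just per-class means, variances, and
# priors -- O(classes x features) numbers, independent of training-set size.
state = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9,
             float(np.mean(y == c)))
         for c in (0, 1)}

def nb_predict(x):
    def log_post(c):
        mu, var, prior = state[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (x - mu) ** 2 / var)
    return max((0, 1), key=log_post)

# k-NN: the "model" is the whole training set -- O(samples x features) memory,
# and every single query computes a distance to all stored samples.
def knn_predict(x, k=5):
    dist = np.linalg.norm(X - x, axis=1)
    return int(round(y[np.argsort(dist)[:k]].mean()))

nb_bytes = sum(mu.nbytes + var.nbytes + 8 for mu, var, _ in state.values())
knn_bytes = X.nbytes + y.nbytes
print(nb_bytes, "bytes of naive Bayes state vs", knn_bytes, "bytes of k-NN state")
```

Even in this small setting the k-NN state is orders of magnitude larger, which is why such models need pruning, quantization, or approximate-search optimizations before they fit into resource-limited environments.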
Table 2. Machine Learning Techniques in Resource-Constrained Environments.

| Reference | ML Method | Embedded/Mobile Platform | Application | Year |
|---|---|---|---|---|
| [4] | SVM | ARMv7, IBM PPC440 | Classification | 2015 |
| [5] | DNN | FPGA Zedboard with 2 ARM Cortex Cores | Character Recognition | 2015 |
| [6] | DNN | Xilinx FPGA board | Image Classification | 2016 |
| [7] | LSTM RNN | Zynq 7020 FPGA | Character Prediction | 2016 |
| [8] | CNN | VC707 Board with Xilinx FPGA chip | Image Classification | 2015 |
| [9] | GMM | Raspberry Pi | Integer processing | 2014 |
| [10] | k-NN, SVM | Mobile Device | Fingerprinting | 2014 |
| [11] | k-NN | Mobile Device | Fingerprinting | 2014 |
| [12] | k-NN, GMM | Mobile Device | Mobile Device Identification | 2015 |
| [13] | SVM | Xilinx Virtex 7 XC7VX980 FPGA | Histopathological Image Classification | 2015 |
| [14] | HMM | Nvidia Kepler | Speech Recognition | 2015 |
| [15] | Logistic Regression | Smart band | Stress Detection | 2015 |
| [16] | k-means | Smartphone | Indoor Localization | 2015 |
| [17] | Naïve Bayes | AVR ATmega-32 | Home Automation | 2015 |
| [18] | k-NN | Smartphone | Image Recognition | 2015 |
| [19] | Decision Tree | Mobile Device | Health Monitoring | 2015 |
| [20] | GMM | FRDM-K64F equipped with ARM Cortex-M4F core | IoT Sensor Data Analysis | 2016 |
| [21] | CNN | FPGA Xilinx Zynq ZC706 Board | Image Classification | 2016 |
| [22] | CNN | Mobile Device | Mobile Sensing | 2016 |
| [23] | SVM | Mobile Device | Fingerprinting | 2016 |
| [24] | k-NN, SVM | Mobile Device | Fingerprinting | 2016 |
| [25] | k-NN | Xilinx Virtex-6 FPGA | Image Classification | 2016 |
| [26] | HMM | Arduino UNO | Disease Detection | 2016 |
| [27] | Logistic Regression | Wearable Sensor | Stress Detection | 2016 |
| [28] | Naïve Bayes | Smartphone | Health Monitoring | 2016 |
| [29] | Naïve Bayes | Mobile Devices | Emotion Recognition | 2016 |
| [30] | k-NN | Smartphone | Data Mining | 2016 |
| [31] | HMM | Smartphone Sensors | Activity Recognition | 2017 |
| [32] | DNN | Smartphone | Face Detection, Activity Recognition | 2017 |
| [33] | CNN | Mobile Device | Image Classification | 2017 |
| [34] | SVM | Mobile Device | Mobile Device Identification | 2017 |
| [35] | SVM | Jetson-TK1 | Healthcare | 2017 |
| [36] | SVM, Logistic Regression | Arduino UNO | Stress Detection | 2017 |
| | k-means | Smartphones | Safe Driving | 2017 |
| [37] | Naïve Bayes | Smartphone | Emotion Recognition | 2017 |
| [38][39] | HMM | Mobile Device | Health Monitoring | 2017 |
| [40] | k-NN | Arduino UNO | Image Classification | 2017 |
| [41] | SVM | Wearable Device (nRF51822 SoC+BLE) | Battery Life Management | 2018 |
| [42] | SVM | Zybo Board with Z-7010 FPSoC | Face Detection | 2018 |
| [43] | CNN | Raspberry Pi + Movidius Neural Compute Stick | Vehicular Edge Computing | 2018 |
| [44] | CNN | Jetson TX2 | Image Classification | 2018 |
| [45] | HMM | Smartphone | Healthcare | 2018 |
| [46] | k-NN | Smartphone | Health Monitoring | 2019 |
| [47] | Decision Trees | Arduino UNO | Wound Monitoring | 2019 |
| [48] | RNN | ATmega640 | Smart Sensors | 2019 |
| [49] | SVM, Logistic Regression, k-means, CNN | Raspberry Pi | Federated Learning | 2019 |
| [50] | DNN | Raspberry Pi | Transient Reduction | 2020 |
| [51] | MLP | Embedded SoC (ESP4ML) | Classification | 2020 |
| [52] | HMM | Smartphone | Indoor Localization | 2020 |
| [53] | k-NN | Smartphone | Energy Management | 2020 |
| [54] | ANN, Decision Trees | Raspberry Pi | Classification and Regression | 2021 |
2. Challenges and Optimization Opportunities in Embedded Machine Learning
Embedded computing systems are generally limited in terms of available computational power and memory. Furthermore, they are required to consume very low power and to meet real-time constraints. Thus, for computationally intensive machine learning models to execute efficiently in the embedded systems space, appropriate optimizations are required at both the hardware architecture and algorithm levels
[55][56]. In this section, we survey optimization methods that tackle bottlenecks in power consumption, memory footprint, latency, throughput, and accuracy loss.
2.1. Power Consumption
The total energy consumed by an embedded computing application is the sum of the energy required to fetch data from the available memory storage and the energy required to perform the necessary computation in the processor.
Table 3 shows the energy required to perform different operations in an ASIC. It can be observed from
Table 3 that fetching data from on-chip SRAM costs much less energy than fetching from off-chip DRAM, and that operating out of the register files costs the least. From this insight, we can conclude that computation should be done as close to the processor as possible to save energy. However, this is a bottleneck because the on-chip memory typically available in embedded architectures is very small compared to the size of deep learning models
[57]. Algorithmic-based optimization techniques for model compression such as parameter pruning, sparsity, and quantization may be applied to address this challenge
[58]. Also, hardware design-based optimizations such as tiling and data reuse may be utilized
[8]. The next section discusses some of these optimization methods in further detail. Furthermore, most machine learning models, especially deep learning models, require huge numbers of multiply-and-accumulate (MAC) operations for effective training and inference.
Figure 1 describes the power consumed by the MAC unit as a function of the bit precision adopted by the system. We may observe that the higher the number of bits, the higher the power consumed. Thus, to reduce the power consumed during computation, reduced bit precision arithmetic and data quantization may be utilized
[59].
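As a back-of-the-envelope illustration of why reduced precision saves energy, the per-operation costs from Table 3 can be used to estimate the arithmetic energy of a layer's MAC operations (one multiply plus one accumulate-add each). The layer size below is an arbitrary assumption for the sketch; the energy values are the Table 3 figures.

```python
# Energy per operation in pJ, taken from Table 3 (45 nm ASIC estimates).
ENERGY_PJ = {
    "8b int":    {"add": 0.03, "mult": 0.2},
    "32b int":   {"add": 0.1,  "mult": 3.1},
    "16b float": {"add": 0.4,  "mult": 1.1},
    "32b float": {"add": 0.9,  "mult": 3.7},
}

def mac_energy_uj(num_macs, precision):
    """Arithmetic-only energy of num_macs multiply-accumulates, in microjoules."""
    e = ENERGY_PJ[precision]
    return num_macs * (e["add"] + e["mult"]) * 1e-6  # pJ -> uJ

macs = 100_000_000  # hypothetical layer with 100M MACs
for p in ENERGY_PJ:
    print(f"{p}: {mac_energy_uj(macs, p):.1f} uJ")
```

On these numbers, moving the same layer from 32-bit float to 8-bit integer arithmetic cuts MAC energy by a factor of 20, before even counting the reduced memory-fetch energy.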

Figure 1. This graph describes the energy consumption and prediction accuracy of a DNN as a function of the arithmetic precision adopted for a single MAC unit in a 45 nm CMOS process
[57]. It may be deduced from the graph that lower numeric precisions consume less power than higher precisions, with little or no loss in prediction accuracy. However, we can observe that when precision is reduced below a particular threshold (16-bit float), the accuracy of the model is greatly affected. Thus, quantization may be performed successfully to conserve energy, but quantizing below 16-bit float may require retraining and fine-tuning to restore the accuracy of the model.
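A minimal sketch of the kind of post-training quantization discussed above, assuming simple uniform symmetric int8 quantization of a weight tensor (the tensor shape and scale scheme are illustrative assumptions; real deployments add calibration and, below 16-bit float, the retraining/fine-tuning step mentioned in the caption):

```python
import numpy as np

def quantize_int8(w):
    """Uniformly map float weights to int8 with a single per-tensor scale."""
    scale = np.max(np.abs(w)) / 127.0        # symmetric range [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)  # toy weight tensor

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"storage: {w.nbytes} -> {q.nbytes} bytes (4x smaller)")
print(f"max abs error: {np.max(np.abs(w - w_hat)):.6f}")  # bounded by ~scale/2
```

The 4x storage saving directly reduces both the model's memory footprint and, per Table 3, the energy of every weight fetch; the rounding error is what the retraining/fine-tuning step is meant to absorb when precision drops further.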
Table 3. Energy consumption (pJ) of performing operations.

| Operation | Energy (pJ) |
|---|---|
| 8 bit int ADD | 0.03 |
| 16 bit int ADD | 0.05 |
| 32 bit int ADD | 0.1 |
| 16 bit float ADD | 0.4 |
| 32 bit float ADD | 0.9 |
| 8 bit MULT | 0.2 |
| 32 bit MULT | 3.1 |
| 16 bit float MULT | 1.1 |
| 32 bit float MULT | 3.7 |
| 32 bit SRAM READ | 5.0 |