The convergence of these approaches holds tremendous potential. For instance, combining CNNs, RNNs, and GANs for real-time video analysis, or fusing multi-modal data with transfer learning, can address complex robotic vision challenges. The promise lies in the thoughtful integration of these approaches into holistic solutions that empower robots to effectively perceive, understand, and interact with their environments. By combining these architectures, robotic vision systems can leverage the strengths of each to improve object detection, tracking, and the understanding of complex visual scenes in dynamic environments (see Table 1).
Table 1. Combined approaches in robotic vision.

| Approach | Strategy | Benefits |
|---|---|---|
| CNN-RNN fusion [14] | Utilizes CNNs for initial image feature extraction and integrates RNNs to process temporal data. | Improved object tracking by capturing both spatial and temporal features; enhanced understanding of dynamic scenes by combining spatial and temporal context. |
| GAN-based data augmentation [15] | Applies GANs to generate synthetic data, diversifying training datasets. | Diversified training datasets through added synthetic data; enhanced robustness from training on varied simulated environments. |
| Hybrid CNN-LSTM models [16] | Combine CNNs for static feature extraction with LSTMs for sequential understanding. | Improved object recognition by capturing both static and sequential features; enhanced tracking in dynamic scenes by modeling both spatial and temporal aspects. |
| Triplet network with GANs [17] | Implements GANs to generate realistic variations of images and uses a triplet network (an embedding CNN) to enhance similarity comparisons. | Improved recognition through realistic image variations; better discrimination of similar objects under varied conditions, facilitated by the triplet network. |
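To make the triplet-network idea concrete, the sketch below shows the triplet margin loss on toy embedding vectors. This is a minimal, framework-free illustration; the embedding values and margin are made-up assumptions, not taken from the cited work, and a real system would obtain the embeddings from a trained CNN.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on embedding vectors (minimal sketch).

    Encourages the anchor-positive distance to be smaller than the
    anchor-negative distance by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings standing in for CNN outputs.
a = np.array([1.0, 0.0])           # anchor image
p = np.array([0.9, 0.1])           # same object, e.g., a GAN-perturbed view
n = np.array([0.0, 1.0])           # a different object

print(triplet_loss(a, p, n))       # 0.0: this triplet is already well separated
```

During training, GAN-generated variations can serve as hard positives, forcing the embedding network to map realistic appearance changes of the same object close together.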
3.4. Big Data, Federated Learning, and Vision
Big data and federated learning play significant roles in advancing the field of computer vision. Big data provides a wealth of diverse visual information, which is essential for training deep learning models that power computer vision applications. These datasets enable more accurate object recognition, image segmentation, and scene understanding.
Federated learning, on the other hand, enhances privacy and efficiency. It allows multiple devices to collaboratively train models without sharing sensitive data. In computer vision, this means that the collective intelligence of various sources can be used while preserving data privacy, making it a game-changer for applications like surveillance, healthcare, and autonomous vehicles or drones.
3.4.1. Big Data
Big data refers to vast and complex datasets arising from diverse origins and applications, such as social media, sensors, and cameras. Within machine vision, big data proves invaluable for pattern recognition, offering a wealth of information in the form of images, videos, text, and audio.
The advantages of big data are numerous: it can facilitate the creation of more accurate and resilient pattern recognition models by supplying ample samples and variations; it can display latent patterns and insights inaccessible to smaller datasets; and it can support pattern recognition tasks necessitating multiple modalities or domains. However, big data also has certain drawbacks: it can present challenges in data collection, storage, processing, analysis, and visualization; it can create ethical and legal concerns surrounding data privacy, security, ownership, and quality; and it can introduce noise, bias, or inconsistency that may impede the performance and reliability of pattern recognition models.
Big data and machine vision find many applications. In athlete training, they aid behavior recognition: by combining machine vision with big data, the actions of athletes can be analyzed using cameras, providing valuable information for training and performance improvements [18].
In image classification, spatial pyramids can enhance the bag-of-words approach: for category-level image classification, spatial pyramids based on 3D scene geometry have been proposed to improve classification accuracy [19]. Machine vision-driven big data analysis can also improve speed and precision in micro-image surface defect detection, or be used to create intelligent guidance systems in large exhibition halls, enhancing the visitor experience. Data fusion techniques with redundant sensors have been used to boost robotic navigation: big data and AI have been applied to optimize communication and navigation within robotic swarms in complex environments, and in robotic platforms for navigation and object tracking using redundant sensors and Bayesian fusion approaches [20]. Additionally, the combination of big data analysis and robotic vision has been used to develop intelligent calculation methods and devices for human health assessment and monitoring [21].
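The Bayesian fusion of redundant sensors mentioned above can be illustrated with its simplest case: combining two independent Gaussian measurements of the same quantity by precision weighting. The sensor readings and variances below are invented for illustration only.

```python
import numpy as np

def fuse_gaussian(mu1, var1, mu2, var2):
    """Bayesian fusion of two independent Gaussian measurements.

    The fused mean is the precision-weighted average of the inputs;
    the fused variance is always smaller than either input variance.
    """
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

# Two redundant range sensors observing the same obstacle (metres).
mu, var = fuse_gaussian(2.00, 0.04, 2.10, 0.01)
print(mu, var)  # fused estimate leans toward the more precise sensor
```

With these toy values the fused estimate is 2.08 m with variance 0.008, i.e., closer to the low-noise sensor and more certain than either sensor alone; the same principle underlies Kalman-style fusion on robotic platforms.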
3.4.2. Federated Learning
Federated learning, a distributed machine learning technique, facilitates the collaborative training of a shared model among multiple devices or clients while preserving the confidentiality of their raw data. In the context of machine vision, federated learning proves advantageous when dealing with sensitive or dispersed data across various domains or locations. Federated learning offers several benefits: it can safeguard client data privacy and security by keeping data local; it can minimize communication and computation costs by aggregating only model updates; and it can harness the diversity and heterogeneity of client data to enhance model generalization. Nonetheless, federated learning entails certain drawbacks: it may encounter challenges in the coordination, synchronization, aggregation, and evaluation of model updates; it may be subject to communication delays or failures induced by network bandwidth limitations or connectivity issues; and it may confront obstacles in model selection, optimization, or regularization due to non-IID or imbalanced data. Here, "IID" stands for "independent and identically distributed", the statistical assumption that all samples are drawn from the same underlying distribution; client data in federated settings frequently violate this assumption.
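The non-IID data problem mentioned above is commonly simulated in experiments by skewing each class's samples unevenly across clients with a Dirichlet distribution. The sketch below is a generic simulation device of this kind; the function name, parameters, and dataset are illustrative assumptions, not a protocol from any cited paper.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with non-IID label skew.

    Smaller `alpha` -> more skewed (more non-IID) label distributions;
    large `alpha` approaches an IID split.
    """
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Fraction of this class assigned to each client.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for c, part in enumerate(np.split(idx, cuts)):
            shards[c].extend(part.tolist())
    return shards

labels = np.repeat(np.arange(3), 100)        # 3 classes, 100 samples each
shards = dirichlet_partition(labels, n_clients=4, alpha=0.5)
print([len(s) for s in shards])              # uneven, label-skewed shards
```

Every sample lands on exactly one client, but the per-client label histograms differ sharply, which is what stresses aggregation algorithms in practice.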
Federated learning can be used to improve the accuracy of machine vision models. It enables training a machine learning model in a distributed manner using local data collected by client devices, without exchanging raw data among clients [22]. This approach is effective in selecting relevant data for the learning task: only a subset of the data is likely to be relevant, whereas the rest may have a negative impact on model training. By selecting the data with high relevance, each client can use only the selected subset in the federated learning process, resulting in improved model accuracy compared to training with all data [23]. Additionally, federated learning can handle real-time data generated at the edge without consuming valuable network transmission resources, making it suitable for various real-world embedded systems [24].
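The aggregation step at the heart of this training scheme can be sketched in a few lines. Below is a minimal FedAvg-style weighted average over flattened parameter vectors; the client updates and dataset sizes are toy values, and a real deployment would exchange full model states over a network rather than small arrays.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """FedAvg-style aggregation: average client model parameters,
    weighted by each client's local dataset size.

    Only parameter vectors travel to the server; the raw local data
    never leave the clients.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    stacked = np.stack(client_params)         # (n_clients, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Three clients with different amounts of local data.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
global_model = fedavg(updates, client_sizes=[10, 10, 20])
print(global_model)  # -> [3.5 4.5]
```

The size weighting means a client holding twice as much data pulls the global model twice as hard, which is also where non-IID shards and partial participation complicate convergence.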
LEAF is a benchmarking framework for learning in federated settings. It includes open-source federated datasets, an evaluation framework, and reference implementations. The goal of LEAF is to provide realistic benchmarks for developments in federated learning, meta-learning, and multi-task learning, capturing the challenges and intricacies of practical federated environments [25].
Federated learning (FL) offers several potential benefits for machine vision applications. First, FL allows multiple actors to collaborate on the development of a single machine learning model without sharing data, addressing concerns such as data privacy and security [26]. Second, FL enables the training of algorithms without transferring data samples across decentralized edge devices or servers, reducing the burden on edge devices and improving computational efficiency [27]. Additionally, FL can be used to train vision transformers (ViTs) through a federated knowledge distillation training algorithm called FedVKD, which reduces the edge-computing load and improves performance in vision tasks [28].
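The knowledge-distillation idea behind methods like FedVKD can be illustrated with the generic distillation loss: the KL divergence between temperature-softened teacher and student predictions. This is a schematic of the general technique, not the exact objective of the cited paper, and the logits below are invented values.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between temperature-softened teacher and
    student distributions: the student is pushed to mimic the
    teacher's full output distribution, not just its top label.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

t = [4.0, 1.0, 0.5]   # logits from a large server-side teacher
s = [3.5, 1.2, 0.4]   # logits from a small on-device student
print(distillation_loss(t, s))  # small but nonzero: student is close
```

A higher temperature T flattens both distributions, exposing the teacher's "dark knowledge" about relative class similarities, which is what makes distillation attractive for shrinking heavy ViT backbones onto edge devices.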
Finally, FL algorithms such as FedAvg and SCAFFOLD can be enhanced using momentum, leading to improved convergence rates and performance even under varying data heterogeneity and partial client participation [29]. The authors of [30] introduced personalized federated learning (pFL) and demonstrated its application in tailoring models for diverse users within a decentralized system; they also employed the Context Optimization (CoOp) method for fine-tuning pre-trained vision-language models.