Machine Learning for Multimedia and Edge Information Processing: Comparison
Please note this is a comparison between Version 1 by Anthony Chukwunonso Mmonyi and Version 2 by Beatrix Zheng.

Advances in artificial intelligence (AI) and machine learning, the wide availability of mobile devices and Internet technologies, and a growing focus on multimedia data sources and information processing have led to the emergence of new paradigms for multimedia and edge AI information processing, particularly for urban and smart city environments. Compared to cloud information processing approaches, where the data are collected and sent to a centralized server for processing, the edge information processing paradigm distributes the tasks to multiple devices which are close to the data source. Edge information processing techniques and approaches are well suited to current technologies for the Internet of Things (IoT) and autonomous systems, although many challenges remain to be addressed.

  • multimedia processing
  • edge multimedia
  • intelligence edge
  • edge AI
  • edge computing
  • edge multimedia analytics

1. Multimedia Streaming on Intelligence Edge

Streaming involves the continuous transmission of multimedia files in bit-sized flows from client servers to users, allowing content consumption without the need to establish permanent storage spaces for transmitted data. Video streaming has become a major source of internet traffic, placing heavy demand on the edge network infrastructure to provide storage capacity capable of managing upload and download operations for an ever-increasing number of users. Additionally, there is a growing presence of Internet of Things (IoT) devices operating over sensor networks, constantly transmitting multimedia data obtained from sensor nodes designed to capture varying physical, chemical and statistical properties. According to data obtained from the Cisco Visual Networking Index for 2016 to 2021, an estimated 41 exabytes of data were transmitted on a monthly basis at initial measurements and this was projected to increase to 77 exabytes by 2022, with between 79 and 82% of this traffic made up of video data. This highlights the critical role played by multimedia streaming in bandwidth efficiency in mobile edge computing (MEC). This section examines the work relating to the application of intelligent solutions for the optimized streaming of multimedia content over content distribution networks and similar platforms within edge networks.
The streaming of videos across the edge infrastructure demands large bandwidth allocations. Some service providers manage this by establishing fair usage policies which manage the user experience reactively based on historic data. Considerations for real-time bandwidth sharing to optimize the allocation across a section of users with video quality adaptations were explored by Chang et al., 2019 [1][49]. The research applied a Deep Q-learning approach to inform a bandwidth sharing policy operated within an edge network simulation. The MEC server utilized in the experiments was simulated using LTE mobile cellular network software based on the Amarisoft EPC Suite and eNB as representative edge nodes operating on separate physical machines. The experiment established two scenarios, one with a single user and one with two users, modelled to observe how bandwidth allocation would be executed with respect to quality of experience and fairness of allocation among users. Information from the MEC’s Radio Network Information system was extracted to establish quality of experience metrics for performance evaluation.
The authors adopted Deep Q-learning as an alternative to standard Q-learning to overcome the requirement for large training data sizes. Because the Dynamic Adaptive Streaming over HTTP standard (MPEG-DASH) only specifies media presentation formats, it leaves room for flexible adaptation logic. Deep Q-learning provides a more adaptive option for video fragmentation than the comparatively rigid client-based logic previously used to coordinate HTTP Adaptive Streaming (HAS). It leverages a neural network to establish reliable data sets similar to Q-tables, which provide information on actions and rewards over a range of variables that are monitored as performance measures in this methodology. In the research setup, requests for video content unavailable within the internal edge-supported video caches are forwarded to an external video content server. Alternatively, these requests could be redirected to a supporting edge-assisted video adaptation application. These systems are capable of providing users with the multimedia data appropriate to the initial query, while tailoring the user experience to match the video quality based on the limitations imposed by the streaming policy. The research challenge, then, is to establish a policy that provides users with the best perceived experiential quality in terms of video quality, time required for downloads and adaptability within the network.
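To make the Q-table idea concrete, the following is a minimal sketch of the tabular Q-learning update that Deep Q-learning approximates with a neural network. It is not the implementation of Chang et al.; the bitrate ladder `BITRATES_KBPS`, the state labels and the hyperparameters `alpha`, `gamma` and `epsilon` are assumptions chosen purely for illustration.

```python
import random

# Hypothetical bitrate ladder; each index is one action the agent can take.
BITRATES_KBPS = [500, 1500, 3000]


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Best value reachable from the next state (unseen pairs default to 0).
    best_next = max(q.get((next_state, a), 0.0) for a in range(len(BITRATES_KBPS)))
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(BITRATES_KBPS))
    return max(range(len(BITRATES_KBPS)), key=lambda a: q.get((state, a), 0.0))


if __name__ == "__main__":
    q_table = {}
    rng = random.Random(0)
    q_update(q_table, "low_bandwidth", 0, reward=1.0, next_state="low_bandwidth")
    print(choose_action(q_table, "low_bandwidth", epsilon=0.0, rng=rng))
```

A deep Q-network replaces the dictionary lookup with a learned function of the state, which is what allows it to cope with state spaces too large to enumerate in a table.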
The utilization of the experience replay mechanism in the training process provides the reliability described within existing datasets. This is credited to the development of a heuristic born of multiple instances within the network. The experience captured includes the initial state, which refers to the network state before any action is taken. The next variable captured is the action, which establishes the bitrate for the video segment intended for access and download. The reward variable provides feedback on how effective the action was in providing minimum bitrate deviation per user. Finally, a variable is captured to represent the new state of the network. The fairness index measured the bitrate deviation per client. In the two-client scenario, Jain’s fairness index is applied, which compares the differences between the adaptation solutions in bitrate delivery. The results from the experiment were considered against two client-based adaptation logic tools: Buffer-Based Adaptation (BBA) and Rate-Based Adaptation (RBA). These were compared against the quality of experience results obtained from Deep Q-learning. For BBA in the single-user scenario, switching buffer rates causes frequent oscillations in bitrate even when download conditions are relatively stable. RBA by design neglects buffer occupancy in providing adaptations, leading to a situation where the selected video quality exists as a function of the bandwidth available. The observed high buffer rates are not utilized to optimize the process, creating missed opportunities in situations with fluctuating bandwidth. Dash.js creates a high average video quality with a low switching frequency. This creates an experimental situation where a low bandwidth is selected for the entire process even in the presence of a greater available allocation.
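The four-part experience described above (state, action, reward, new state) is the unit stored by an experience replay buffer. The sketch below shows one common way to build such a buffer; the class name `ReplayBuffer`, the capacity and the uniform sampling strategy are illustrative assumptions, not details reported in the study.

```python
import random
from collections import deque, namedtuple

# One stored transition: network state, chosen bitrate, feedback, resulting state.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer that discards the oldest experience when full."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a uniform random mini-batch without replacement."""
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=3, seed=0)
    for step in range(5):
        buf.push(state=step, action=step % 2, reward=-abs(step - 2), next_state=step + 1)
    print(len(buf), buf.sample(2))
```

Sampling past transitions at random breaks the correlation between consecutive network states, which is the usual motivation for replay in Deep Q-learning.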
For the two-user situation, dash.js produces a fierce rivalry for bandwidth in which one user benefits from a higher bandwidth allocation and a higher average throughput to the detriment of the competing user.
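The degree of imbalance between two competing clients can be quantified with Jain’s fairness index, J = (Σx)² / (n · Σx²), which equals 1 when all users receive the same rate and 1/n when a single user receives everything. The function below is a small sketch of that standard formula; the function name and the convention of returning 1.0 when every rate is zero are assumptions made here.

```python
def jain_fairness(rates):
    """Return Jain's fairness index for a list of per-user bitrates."""
    if not rates:
        raise ValueError("at least one rate is required")
    total = sum(rates)
    sum_of_squares = sum(r * r for r in rates)
    if sum_of_squares == 0:
        # Assumed convention: nobody receives anything, so nobody is favoured.
        return 1.0
    return (total * total) / (len(rates) * sum_of_squares)


if __name__ == "__main__":
    print(jain_fairness([2.0, 2.0]))  # equal shares between two users
    print(jain_fairness([4.0, 0.0]))  # one user takes all the bandwidth
```

In the two-user case the index therefore ranges from 0.5 (one client starved) to 1.0 (perfectly even delivery).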
Zhou et al. (2020) [2][50] captured the enormous computational demands created by 3D video requirements among an increasing number of users. Their research developed a quality of experience model that relies on actor-critic deep reinforcement learning to adapt video renderings reactively, improving video playback buffer times and bandwidth distribution. A resource allocation model (RAM) is hinged on a software-defined network (SDN)-managed mobile edge computing architecture. The work in [3][51] discussed several techniques involving AI to manage SDN networks and proposed an SDN management system powered by artificial intelligence (AI), termed SDNMS-PAI, for handling end-to-end networks. The SDN has the function of establishing control of the server resource allocation and separately allocating the resources necessary for data processing requirements. The RAM is implemented at the edge layer, where 3D video playbacks are cached for onward transmission to users in the form of video blocks.
The authors identified video blocks as sections of the frame-by-frame video, each accounting for one second of playtime. The choice of video block rates to implement is influenced by the system’s need to operate a Dynamic Adaptive Streaming over HTTP (DASH) protocol. Hence, the performance of future video files is reliant on the playback statistics computed. The resource allocation model benefits from caching operations within the MEC along several edge servers which constitute the overall network. Working with the SDN, the authors suggested a method of allocating MEC resources over the network which is supported by buffers to optimize the 3D video user experience. Caches provide optimized video block transmission, while tiling operations, which stitch video blocks together in parts, rely on edge computing resources. The quality of experience model (QoEM) is based on an improvement of the resolution allocation to the Head-Mounted Display (HMD) viewport, which is responsible for the transmission speeds of 3D videos. The HMD viewport tiles require equal tile rates to mitigate observable screen fragmentation during display. The higher resolution in the HMD viewport is complemented by a reduction in the allocation outside the viewport, with tiles in this region allocated a non-zero rate. Allocation rates are modelled using a Markov decision process (MDP) which optimizes quality of experience. An actor-critic deep reinforcement learning tool is deployed at this point to predict and adapt viewports and the bandwidth of future videos. Additional tools applied in this work include Long Short-Term Memory (LSTM) and fully convolutional (FC) networks responsible for providing resolution accuracy. The performance of the methodology was evaluated using model predictive controls (MPCs) and Deep Q-Network (DQN) to perform a comparative analysis on four QoE targets.
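The two tile-rate constraints described above (equal rates inside the viewport, a reduced but non-zero rate outside it) can be illustrated with a simple budget split. This is only a sketch under stated assumptions: the function `allocate_tile_rates`, the fixed `floor_rate` for out-of-viewport tiles and the single bandwidth `budget` are hypothetical simplifications and do not reproduce the MDP-based allocation of Zhou et al.

```python
def allocate_tile_rates(num_tiles, viewport, budget, floor_rate):
    """Split a bandwidth budget across tiles.

    Tiles outside the viewport get a small non-zero floor rate; the
    remaining budget is shared equally by the viewport tiles.
    """
    viewport = set(viewport)
    if not viewport:
        raise ValueError("viewport must contain at least one tile")
    if any(t < 0 or t >= num_tiles for t in viewport):
        raise ValueError("viewport tile index out of range")
    if floor_rate <= 0:
        raise ValueError("floor_rate must be positive (non-zero rate outside viewport)")

    outside = num_tiles - len(viewport)
    per_viewport_tile = (budget - outside * floor_rate) / len(viewport)
    if per_viewport_tile < floor_rate:
        raise ValueError("budget too small to favour the viewport")

    return [per_viewport_tile if t in viewport else floor_rate
            for t in range(num_tiles)]


if __name__ == "__main__":
    # Six tiles, tiles 0 and 1 visible, 10 Mbit/s budget, 0.5 Mbit/s floor.
    print(allocate_tile_rates(6, {0, 1}, budget=10.0, floor_rate=0.5))
```

In a learned scheme, the viewport set would come from a predictor and the budget from a bandwidth forecast rather than being passed in as constants.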
Luo et al. (2019) [4][52] used similar inputs to Zhou et al., 2020 [2][50], differentiated by the objective of proffering a solution for energy management and quality of user experience in cases where video streaming is required over software-defined mobile networks (SDMN) existing on mobile edge computing resources. This is achieved by establishing variables within two optimization problems based on a constrained Markov decision process (CMDP) and a Markov decision process (MDP). The optimization problems are solved by applying a model-free deep reinforcement learning method, the asynchronous advantage actor-critic (A3C) algorithm. The subsequent analysis and adaptation describe video buffer rates, adaptive bitrate (ABR) streaming, edge caching, video transcoding and transmission. The Lyapunov technique was used to address the challenge created by the application of CMDP, which applies a one-period drift-plus-penalty. This creates a requirement to resolve an isolated deterministic problem period by period to create an accurate representation of conditions. The substitution of the one-period drift-plus-penalty with the T-period drift-plus-penalty provides a global solution to the CMDP problem. The streaming profile considers a downlink case involving video transmission within a mobile network with multiple base stations serving a large user demographic. With each request from the edge base station, a discrete time Markov chain (DTMC) is used to model changes in the state of the channel, which depend on the transition probability. Also important for the achievement of the anticipated QoE are the buffer rates, which smooth out variations in bitrates. The research approach to buffer sizing is modelled to match demands by mobile devices using the concept of the minimum performance deterioration tolerable at different buffer levels.
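As an illustration of how a discrete time Markov chain can model a changing channel, the sketch below simulates a two-state chain and computes its long-run state shares. The two states (`"good"` and `"bad"`), the transition probabilities and both function names are assumptions for this example; the channel model of Luo et al. may use more states and different parameters.

```python
import random


def simulate_channel(transition, start, steps, rng):
    """Simulate a DTMC; transition[s] maps each next state to its probability."""
    state = start
    path = [state]
    for _ in range(steps):
        row = transition[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
        path.append(state)
    return path


def stationary_two_state(p_good_to_bad, p_bad_to_good):
    """Closed-form stationary distribution of a two-state chain."""
    total = p_good_to_bad + p_bad_to_good
    return {"good": p_bad_to_good / total, "bad": p_good_to_bad / total}


if __name__ == "__main__":
    matrix = {
        "good": {"good": 0.9, "bad": 0.1},
        "bad": {"good": 0.3, "bad": 0.7},
    }
    path = simulate_channel(matrix, "good", steps=10, rng=random.Random(1))
    print(path)
    print(stationary_two_state(0.1, 0.3))
```

The stationary distribution gives the fraction of time the channel spends in each state over a long session, which is useful when sizing buffers for the worst sustained condition.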
The need for indexing video tiles to provide an adaptation based on bandwidth requirement makes a case for the presence of a software-defined controller. The SDN is utilized for QoE adaptation as the bitrate allocation to streamed videos will require constant adjustments to meet the demands of changes in resolution within video content. Beyond quality adaptations, the segmentation of video tiles will be decided by the SDN controller, which has the responsibility of assigning the computational resources required for transcoding video files from one virtual machine to the final mobile device. In establishing the quality of experience, the authors selected performance metrics including the time average bitrate, which captures the normalized time average of the bitrate for each segment. Moreover, the time average instability index is measured to depict user perception of the changing bitrates brought on by intelligent SDN adaptations.
Machine learning in this research served to define an optimal policy for each scenario of bandwidth demand and allocation utilizing limited learning data. For the simulation of the process, the open-source machine learning library PyTorch was used to implement the actor-critic deep reinforcement learning. An MDP and a non-MDP optimization solution were proffered to identify the better performing of the two methodologies. The utilization of caches was found to speed up learning rates, whilst the best performing adaptations were found with up to 50 segments cached in a setup that involved 20 base stations with three mobile devices per station. It was further observed that as the number of mobile devices per base station increased, the maximum power consumed at each base station was stretched, leading to major service degradation.
Dai et al. (2021) [5][53] raised the consideration of multimedia streaming components required for effective communication among vehicles within an Internet of Vehicles (IoV) network. The requirement for constant multimedia streams to be maintained with vehicles in constant motion, creating a dynamic demand, differentiated the identified problem from previous considerations. The authors considered streaming for heterogeneous IoV with an adaptive-bitrate-based (ABR) multimedia streaming design which operates over an MEC network. Utilizing roadside units as mount points for edge devices, the bandwidth allocation per vehicle can be determined with priority placed on quality of experience for each user. This means that as multimedia streaming takes place, intelligent systems are required to guarantee the quality of the received multimedia segments while minimizing the lags created by bandwidth limitations. Their adaptive-quality-based chunk selection (AQCS) algorithm provides opportunities to monitor and synthesize service quality, playback time and freezing delays within the setup. Other critical quality of experience factors responsible for service performance and freezing delays were synthesized using the joint resource optimization (JRO) problem developed in the research.
Multimedia segmentation and the allocation of bitrate in their methodology employed Deep Q-learning (DQN) and a multi-armed bandit (MAB) algorithm, both being reinforcement learning-based methodologies. The application of MAB was noted to create a consequent loss of convergence speed and clock speed. Using Q-tables and a gradient-based Q-function, Deep Q-learning is expected to provide better results through rehearsed data-driven processes. The multimedia data were hosted on the cloud layer. Multimedia files consisting of varying quality levels to reflect the predictive adaptation of DQN are duplicated at the MEC layer to support ABR. The multi-armed bandit algorithm provides decision making and Q-function updates which are made up of the streaming history and rewards for bitrate adaptations. Follow-up action provided by Deep Q-learning establishes the representation of system state, experience replay, a loss function and a reward function, which leads to a performance index for comparison to other methodologies. Using a traffic simulator alongside a scheduling and optimization module, real-time trace data were obtained from vehicles within Chengdu city in China over a 16 km² area from the resource-based Simulation of Urban Mobility (SUMO).
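To show how a multi-armed bandit can drive bitrate decisions from streaming history and rewards, the following sketch uses UCB1, one well-known bandit strategy. The source does not state which bandit variant was used, so UCB1 is a substitute chosen here for clarity; the class name, the three arms and the reward values are likewise hypothetical.

```python
import math


class UCB1:
    """UCB1 bandit: each arm stands for one candidate bitrate level."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # how often each arm was played
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            value + math.sqrt(2.0 * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean keeps the streaming history in constant memory.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    mean_reward = [0.2, 0.5, 0.9]  # assumed quality reward per bitrate level
    bandit = UCB1(len(mean_reward))
    for _ in range(200):
        arm = bandit.select()
        bandit.update(arm, mean_reward[arm])
    print(bandit.counts)
```

The bandit keeps no notion of system state, which is one reason a state-aware method such as Deep Q-learning is paired with it in the work described above.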
The authors noted that the JRO exists in a novel space; hence, a combination of methodologies was utilized to create a suitable comparison. These comprised a classical cache algorithm and two adaptive-streaming algorithms made up of a Markov Distribution Process and a Rate Adaptation responsible for chunk transmission. These algorithms were tested with different bandwidth requirements across the multi-armed bandit, DQN, adaptive-quality-based chunk selection (AQCS) and least frequently used (LFU) algorithms. Five scenarios were simulated with different traffic conditions, observing averages and standard deviations for vehicle number and dwelling time. The results captured the effects of traffic workload on the performance of the algorithms. Of the five tools considered, the combination of DQN and AQCS was found to perform best in managing average service quality (ASQ) and minimizing the average freezing delay (AFD) simultaneously.
Deep Q-networks find additional applications in managing streaming multimedia data for autonomous vehicles, as captured in research presented by Park et al., 2020 [6][54]. The research addressed the challenge of establishing reliable video streaming in fast-moving autonomous vehicles and proposed a combined Mobile Edge Computing and DQN-driven solution. Their design consisted of two DQN-based decision support applications, with one dedicated to the offloading decision algorithm and the other charged with the data compression decision algorithm. Autonomous vehicles benefit from the operation of a large number of digital cameras fitted at differing locations and responsible for image capturing and processing. This function has high requirements for speed to support the decision-making processes that influence the safety of the vehicles. Caching along the MEC supported by 5G technologies provides reasonable support. However, the method proposed seeks to achieve a greater bandwidth efficiency which promotes video offloading and compression operations for fast streaming, as represented in Figure 13.
Figure 13. The DQN-based offloading and compression decision process in Autonomous Vehicles [6] (Park et al., 2020).
Due to limitations in server capacity, internal policies within the MEC are required to influence the multimedia offloading decision. Deep Q-learning has been found to be a tool capable of offering maximized reward for offload functions. Some of these advantages are credited to the layered structure of the DQN, which is able to learn from small sections of agent data. Assessment of the offloading and compression decisions is performed in terms of state, action and reward. The state of the offloading decision is expressed by the vehicle’s capacity and standBy Q capacity, while that of the compression decision is represented by standBy Q and MEC capacity. Offloading delays and energy consumption indicate how rewarding the offloading decision was. The data quality and waiting delay establish the reward mechanism for the compression decision. The outcomes of the performance appraisal show that DQN, as in many other processes, quickens offloading and compression in autonomous vehicles operating in highly dynamic environments.
In Ban et al., 2020 [7][55], the authors developed a 360-degree (virtual reality) video streaming service which employs deep reinforcement learning for prediction and allocation of streaming resources. Their scheme addressed the problem of multi-user live VR video streaming in edge networks. To deliver consistent video quality across all users, the server requires higher bitrates to cope with the data sizes related to delivery of VR videos due to their spherical nature. The system utilized the Mean Field Actor-Critic (MFAC) algorithm to enable the servers to collaborate and distribute video segments on request to maximize the general quality of experience while reducing bandwidth utilization. The deployment of an edge cache network enables multiple users to be served concurrently. The utilization of an edge-assisted framework helps to minimize congestion on the backhaul network. The clients change their tile rates to improve both the quality of experience and the total bandwidth requirement by communicating over several edge servers. The authors used a Long Short-Term Memory (LSTM) network to forecast users’ future bandwidth and viewing activities in order to adapt to the dynamic network and playback settings.
The authors utilized a multi-agent deep reinforcement learning (MADRL) model to tackle the problem associated with high-dimensional distributive collaboration and to study the optimal rate allocation scheme. The objective of the virtual reality video streaming scheme concentrates on four aspects, namely, average quality, temporal viewing variance, playback delay and bandwidth consumption. The performance evaluation of MA360 with 48 users on different live videos was executed over three experiments labelled video number 1 to 3. From the evaluation, as the video number increased, the normalized quality of experience remained fixed for all methods while the download traffic increased. The MA360 scheme could be easily transferred to present streaming systems with variable user and video numbers. Simulations carried out on data derived from actual events were used to establish a comparison between MA360 and some state-of-the-art streaming methods such as Standard DASH algorithm (SDASH), Leverages LR (LRTile), ECache and Pytheas. The results showed that MA360 improved the total quality of experience and reduced bandwidth consumption. They also showed that MA360 exceeded the performance of current state-of-the-art schemes under different network circumstances.

2. Multimedia Edge Caching and AI

Caching has become an integral part of computer networks globally. The need for the short-term storage of transient data, together with the exponentially growing traffic created by multimedia files such as video, images and other data types existing in virtual and local servers, has made caching a necessity. In caching, subsets of data are maintained in a storage location within close proximity to the user to eliminate repeated query operations channeled to the main source, such as a cloud storage resource. The challenge that exists in this space involves deciding on what to cache, identifying where data are required and managing caching resources in a way that reflects the need and storage capacity to obtain trade-off benefits. This section examines different research efforts targeted at the application of machine learning techniques within edge networks to identify and predict multimedia caching opportunities.
Edge networks have evolved to utilize in-network data caches, which can be in the form of user equipment and in some cases base station installations, to manage the latency in backhaul created by distant central cloud storage resources (Wang et al., 2017) [8][56]. Content Distribution Networks (CDN) are notably the first instance where cache deployments were recognized, and they eventually became major contributors to 5G networks and possibly future deployments, with mobile network operator-managed local infrastructure offering greater caching capacity and higher backhaul performance marked by improved coverage (Wang et al., 2019 [9][57]; Yao et al., 2019 [10][58]). The application of machine learning in Mobile Edge Caching and other radio network instances promotes the predictive capacity within caching layers. This requires a capture of future data demand, leading to a reduced need for backhaul interaction for content access. In one study on the caching of videos within CDNs in MEC, Zhang et al., 2019 [11][59] applied a variant of recurrent neural networks which utilizes a deep Long Short-Term Memory network cell (LSTM-C) as a means of cache prediction and content update in a CDN, to optimize video caching in streaming. The methodology reveals improvement on previously existing caching algorithms such as first in first out (FIFO) and least recently used (LRU), among others.
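For reference, the least recently used (LRU) baseline mentioned above can be sketched in a few lines. This shows only the classical eviction rule that learned predictors are compared against; the class name `LRUCache`, the `fetch` callback standing in for a backhaul request, and the hit/miss counters are illustrative assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with hit and miss counters."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def request(self, key, fetch):
        """Return the cached value, calling fetch(key) on a miss."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = fetch(key)  # stands in for a trip over the backhaul
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    for video_id in ["a", "b", "a", "c", "b"]:
        cache.request(video_id, fetch=str.upper)
    print(cache.hits, cache.misses, list(cache.items))
```

A FIFO cache differs only in that a hit does not refresh an entry's position, so removing the `move_to_end` call turns this sketch into the FIFO baseline.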
Shuja et al. (2021) [12][60] presented a review of several intelligent data cache methods in edge networks. The review considered the role of constantly evolving IoT and other multimedia devices, which create a demand for low latency bandwidth supply capable of handling the loading of backhaul networks. It comprehensively covered several machine learning variations and developed a taxonomy (shown in Figure 24) which accounted for applicable machine learning techniques, caching strategies and edge networks, and how they work together to address the challenge of what, when and where to cache data. The benefits of this methodology were observed in the architecture of 5G technologies, where leveraging millimeter-wave (mmWave), ultra-reliable low latency communication (URLLC), edge computing and data caching has greatly improved peak data rates for uplink and downlink processes. Beyond these tools, there is a need for increased efficiency in the management of limited network resources, which may be achieved by network traffic prediction, the utilization of routing algorithms and the reduction of network congestion. The availability of large data sets and computing resources in edge computing promotes the incorporation of various implementations of machine learning based on the unique efficiencies associated with them.
Figure 24. ML-Edge-Caching Taxonomy [12] (Shuja et al., 2021).
The increased performance capability of local devices and localized storage capacities provides an opportunity for caching without infrastructure in edge networks. Yao et al. (2019) [10][58] highlighted the extent of the impact caches have on backhaul links by examining the caching process to identify challenges occurring within its four phases. The architecture of mobile edge caches is greatly influenced by the unique interactions shared by various caching options and the predominant problems experienced within the requesting, exploration, delivery and update phases. The full array of in-network edge cache options identified in the research included user equipment (UE), base stations with differing capacity variations, baseband unit pools, Cloud Radio Access Networks and mobile network infrastructure, and established joint multi-tier caching infrastructure.
Said et al., 2018 [13][61] researched the application of the Clustering Coefficient based Genetic Algorithm (CC-GA) for community detection with device-to-device communication integration. The machine learning cluster capability provides proactive cache opportunities which outperform reactive caching in terms of captured overall user experience. Adopting an edge network architecture that involves multi-layer caching has been shown to reduce backhaul load (Sutton, 2018) [14][62]. A problem with the performance of backhaul networks is the requirement for repeat downloads of redundant multimedia data, which creates requests that are repetitive in nature. This leads to a backhaul loaded with redundant content requests. These challenges have been met with several alternative optimizations, ranging from a proactive time-based content distribution network setup, which offers transit linkage during periods of predicted congestion (Muller et al., 2016) [15][63], to reactive content caching, as shown in Figure 35 and Figure 46.
Figure 35. Reactive caching [12] (Shuja et al., 2021).
Figure 46. Proactive caching [12] (Shuja et al., 2021).
The anticipative and responsive cache options utilizing machine learning tools are largely affected by privacy policies, insight restrictions and complex user preference mapping. The identification of edge-specific trends can achieve multi-process, data-based user profiling in edge networks. Shuja et al. (2021) [12][60] established that the limits to machine learning in identifying user clusters depend largely on cache policy restrictions. Liu and Yang, 2019 [16][64] showed how deep reinforcement learning may be applied in proactive content caching on a deep Q-network. This outcome was achieved by applying learning derived from implemented recommendation policies, expressed as two reinforcement learning problems run on a double deep Q-network. Wang et al., 2020 [17][65] also considered a Q-learning network to provide a model solution that offers flexible integrated multimedia caching between user equipment and network operator facilities in a heterogeneous nodal setting. In this method, federated deep reinforcement learning is used to reactively enhance the Q-learning network through a multistage modelled system involving popularity prediction, device-to-device sharing within physical and social domains, and enhanced delay and transition models.
A popular theme in edge network content caches using machine learning is the application of reinforcement learning to create a multi-solution approach for caching requirements. Research on proactive content caching based on predicted popularity has been carried out by Doan et al., 2018 [18][66] and Thar et al., 2018 [19][67], with the former considering extracted raw video data mapped into G-clusters and analyzed by a predictor based on a convolutional neural network learning model to determine how much the content features deviate from a predefined ideal. Thar et al., 2018 [19][67] and Masood et al., 2021 [20][68] approached the challenge by utilizing deep learning to predict popularity scores. The former research assigned class labels to content, while the latter applied a regression-based approach in its predictive functions. Based on the predictive machine learning model, content is then dispatched along the edge network to be cached at locations determined by popularity scores. Liu et al., 2020 [21][69] adopted a similar approach but went beyond content popularity by applying privacy-preserving federated K-means training to determine the appropriateness of content distribution along the edge network.
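Once a model has produced popularity scores, the placement step described above can be as simple as filling each edge cache with the highest-scoring items that fit. The sketch below assumes the scores are already available; the function name `plan_proactive_cache`, the greedy fill rule and the example scores and sizes are hypothetical and are not taken from any of the cited studies.

```python
def plan_proactive_cache(popularity, sizes, capacity):
    """Pick content to prefetch, most popular first, within a storage budget.

    popularity: dict mapping content id to predicted popularity score
    sizes:      dict mapping content id to storage size
    capacity:   total storage available at the edge node
    """
    chosen = []
    used = 0
    # Sort by descending score; ties are broken by id for determinism.
    for content_id in sorted(popularity, key=lambda c: (-popularity[c], c)):
        size = sizes[content_id]
        if used + size <= capacity:
            chosen.append(content_id)
            used += size
    return chosen


if __name__ == "__main__":
    predicted = {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.8}  # assumed model output
    size_gb = {"a": 4, "b": 2, "c": 1, "d": 5}
    print(plan_proactive_cache(predicted, size_gb, capacity=7))
```

Greedy filling is not optimal in general, since this is a knapsack problem, but it keeps the link between predicted popularity and placement easy to follow.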
In more complex optimization situations such as those observed in IoT communications, Xiang et al., 2019 [22][70] expressed a reactive methodology for caching within fog radio access networks (F-RANs) which utilized a deep reinforcement learning algorithm to prioritize user demands and allocate network resources. The methodology promoted core efficiency and transmission efficiency by slicing the network to cater to user categories as prioritized by the machine learning tool. In another work, Sun et al., 2018 [23][71] presented a reactive intelligent caching method combining Dynamic Adaptive Streaming over HTTP (DASH), made popular by YouTube, with Deep Q-learning for improved predictive efficiencies in video caching. The combination of both tools on a Mobile Edge Network can create an adaptive video caching service that responds reactively to changes along several variables identified by Deep Q-learning. The authors identified the impact of buffer time losses within user equipment (UE) on overall perceived backhaul delays in video streaming within Mobile Edge Computing (MEC) caching schemes. On the network side, the loading effect on the backhaul is managed by fragmenting video files into bit-sized data streams, with the information relevant for decoding captured within a media presentation description (MPD). High density traffic along the backhaul network informs the intelligent caching along nodes referred to as agents within the network. A proactive application of deep learning was presented by Masood et al., 2021 [20][68], who established a regression-based deep learning implementation on MEC storage devices which enabled video content prediction and mapping to multiple base stations across the edge network for caching purposes. Table 12 shows the evaluated caching research areas as a representative measure of achievable objectives hidden within various deployments of machine learning in edge network caching.
Mobile Edge Computing that utilizes local user caches in some cases requires access to personal data in order to provide shared data caches that promote content availability for users within the boundaries of an edge network. One such case was investigated by Dai et al., 2020 [24][72], who captured the multimedia sharing challenges within vehicular edge computing (VEC). In this methodology, MEC base stations were adopted as verifiers of multimedia data obtained from vehicles, which constitute the caching providers within the VEC network. The research highlighted how concerns around privacy protection reduce the willingness of users to have their data cached in VEC. To combat this challenge, blockchain-driven permission systems were employed to ensure content is securely cached. This is achieved by users operating dynamic wallet addresses which leverage the blockchain properties of anonymity, decentralization, and immutability. Content caching is optimized by the application of deep reinforcement learning (DRL), which manages caching operations despite the changing wireless channels created by vehicle mobility. Similar to other caching methods, this eases backhaul network traffic, utilizing vehicle-to-vehicle communication to reduce the demand associated with large multimedia transmissions. Their work investigated a cache requester and block verifier architecture in a grid modelled on Manhattan, utilizing data from 4.5 million Uber pick-ups in New York City. The performance of deep reinforcement learning was evaluated against greedy content caching and random content caching, with the results plotted as the cumulative average reward for all requests against the number of episodes initiated by caching requesters. The research showed that an increased number of caching requesters was associated with a higher reward within a VEC.
Another work by Li et al., (2019) [25][73] looked into cooperative edge caching as a means of eliminating redundant multimedia fetching protocols from base stations in MEC.

3. Multimedia Services for Edge AI

Multimedia service describes the interaction of voice, data, video and image in dual or multiple configurations taking place at the same time between the parties involved in some form of communication. Multimedia services can exist either as distributed or interactive services. Quality of Experience (QoE) is an important factor in multimedia services for users. Roy et al., (2020) [26][74] proposed a mobile multimedia service driven by artificial intelligence in MEC, with the objective of achieving a high quality of experience. The authors proposed an artificial intelligence-based method which utilizes meta-heuristic Binary Particle Swarm Optimization (BPSO) to obtain a high-performing solution. To manage and optimize nodes, an edge orchestrator (EO) managed by a mobile network operator (MNO) makes use of statistical relationships derived from the nodal data and a mobility prediction model for planning multimedia service. The design assigns to each edge node an edge server operating virtual machines responsible for managing user queries, creating multiple miniature data processing units. The EO has a controlling role over the edge server, making use of three database modules (movement data, contextual database, and trajectory and edge database). The Path Oriented Proactive Placement (POPP) presents a twofold problem relating to the quality of user experience and the minimization of deployment cost in multimedia service delivery. The proposed POPP provides an intelligent interaction along the prediction path and optimizes Quality of Experience and cost reduction in real-time data processes.
The authors integrated the analysis of probabilistic relationships derived from historical movement data to predict and compensate for errors in the path prediction model. The work was implemented in CloudSim and compared with other existing works. The results indicated that POPP exceeded previously existing work in QoE performance, achieving satisfaction levels between 15% and 25% above deployments with similar objectives. The authors provided a computational model that expressed the POPP problem and developed a solution based on binary particle swarm optimization (BPSO) capable of managing the service placement requirements. Figure 57 shows the framework of the computational model utilized by the POPP system.
Figure 57. Computational framework of the POPP system [26] (Roy et al., 2020).
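As an illustration of how a binary particle swarm search can be applied to a placement decision of this kind, the sketch below implements a generic BPSO (sigmoid-based bit sampling) on a toy cost function. It is not the formulation of Roy et al.; the visit probabilities, the deployment cost, the swarm parameters and the names `bpso` and `placement_cost` are assumptions made for this example.

```python
import math
import random


def bpso(cost, n_bits, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic binary PSO: each bit is sampled from sigmoid(velocity)."""
    rng = random.Random(seed)
    xs = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vs = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_cost = [cost(x) for x in xs]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                vs[i][d] = max(-6.0, min(6.0, vs[i][d]))  # clamp the velocity
                prob_one = 1.0 / (1.0 + math.exp(-vs[i][d]))
                xs[i][d] = 1 if rng.random() < prob_one else 0
            c = cost(xs[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = xs[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = xs[i][:], c
    return gbest, gbest_cost


# Toy placement problem (hypothetical): bit d = 1 places a service instance
# on edge node d. visit_prob[d] is the predicted chance that the user's path
# passes node d; a missed node costs its visit probability (QoE penalty) and
# every deployed instance costs DEPLOY_COST.
visit_prob = [0.9, 0.8, 0.1, 0.7, 0.05, 0.6]
DEPLOY_COST = 0.3


def placement_cost(x):
    missed = sum(p for p, bit in zip(visit_prob, x) if not bit)
    return missed + DEPLOY_COST * sum(x)


best_x, best_c = bpso(placement_cost, n_bits=len(visit_prob))
print(best_x, round(best_c, 3))
```

The toy cost is separable, so the best placement simply deploys on every node whose visit probability exceeds the deployment cost; this makes it easy to check that the swarm finds a sensible answer before moving to a realistic cost model.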
Wang et al., (2020) [17][65] developed an intelligent Deep Reinforcement Learning (DRL) edge-assisted crowdcast framework called DeepCast, which examines the total amount of viewing data to make smart decisions that deliver personalized Quality of Experience at a minimized system cost. Crowdcast enables viewers to watch and interact with the broadcaster and other viewers in a live video program. This interaction is completed in the same channel. Broadcasters use many platforms in crowdcast services to stream their own content to viewers; such platforms include YouTube Gaming and Twitch.tv. Crowdcast faces the challenges of poor Quality of Experience and high service cost due to three major features of a crowdcast service, namely the crowdcast platforms, content preferences and the rich interaction between the viewers and the broadcaster.
DeepCast, as proposed by the authors, combines cloud, Content Distribution Network (CDN) and MEC for crowdcasting applications. With the help of DRL, DeepCast identifies the appropriate approach for the allocation of viewers and for transcoding on the edge server. The underlying process is data-driven and depends on the identification of complex trends among components in real-world datasets. To train the process, DRL is applied to trace-based experiments. The real-world datasets applied in this process were obtained from inke.tv, based in China, with a viewership of up to 7.3 million users daily in 2016, and from twitch.tv in the USA. The data fields captured to represent the users' datasets consist of the viewer and channel ID, network type, location, and viewing duration. Moreover, viewers' interaction information, such as records of web application traffic, online exchanges and broadcasters' channel content, collected from 300 well-known Twitch.tv channels over two months, was analyzed. In this framework, the responsibility of establishing a connection to the cloud server rests with the broadcaster, creating a link that supports the streaming of raw data. Streamed data are then encoded and compressed into chunks with multiple bitrates, which are conveyed to the content distribution network server. The DRL tool performs the function of allocating content with different bitrates to the relevant edge servers based on the QoE policy established from training. The results from the evaluation of the DeepCast system showed an effective improvement in the average personalized Quality of Experience over the cloud CDN method. The authors cited a cost reduction of between 16.7 and 36% which was achievable by the implementation of the model. In conclusion, the utilization of the edge servers in DeepCast can satisfy viewers' personalized and heterogeneous QoE demands.
When there is a need to offload storage and computing resources to the network edge, the network is faced with problems such as latency and underutilized bandwidth. Guo et al., (2019) [27][75] proposed an approach utilizing Deep-Q-Network-based multimedia multi-service quality of service optimization for mobile edge computing systems. The authors investigated a multi-service situation in MEC systems. The MEC offers three multimedia services to edge users: streaming, buffered streaming and low-latency enhanced mobile broadband (eMBB) applications. The packet scheduling method and quality of service model in the mobile edge computing system were analyzed. Whenever a packet must be mapped to a quality of service flow, the scheduler is required to prioritize the matching of available resources with quality-of-service characteristics. The 5G quality of service model enables packets from different multimedia applications to be mapped into different Quality of Service flows in accordance with their quality-of-service requirements. As a solution, a QoS maximization problem was formulated by which the requirements for scheduling the limited radio resources can be defined and computed. The 5G quality of service (QoS) model was used to satisfy various QoS conditions in several service cases. Each quality of service flow was processed individually, with the same QoS flow allocated to packets with similar requirements. A reinforcement-based deep Q-learning method was utilized to allocate dynamic radio resources. The performance of the Deep Reinforcement Learning framework was monitored using the properties of state space, action space, state performance and reward function. A simulation was performed, and the results indicated that the Deep-Q-Network-based algorithm performed better than the other resource allocation algorithms.
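The mapping of packets with similar requirements into a shared QoS flow, followed by priority-ordered scheduling of a limited number of radio resources, can be sketched as follows. This is a static rule-based illustration rather than the Deep-Q-Network scheduler of the cited work, and the flow table, delay bounds and priorities are hypothetical values, not standardized 5G QoS identifiers.

```python
from collections import defaultdict

# Hypothetical flow table: (flow name, delay bound in ms, priority).
# A lower priority number is served first.
FLOWS = [
    ("low_latency_embb", 10, 1),
    ("streaming", 100, 2),
    ("buffered_streaming", 300, 3),
]


def map_to_flow(delay_budget_ms):
    """Return the least strict flow whose delay bound still meets the budget."""
    eligible = [f for f in FLOWS if f[1] <= delay_budget_ms]
    if not eligible:
        return FLOWS[0][0]  # fall back to the tightest flow
    return max(eligible, key=lambda f: f[1])[0]


def schedule(packets, resource_blocks):
    """Group packets by flow, then grant resource blocks in priority order."""
    by_flow = defaultdict(list)
    for pkt_id, budget_ms in packets:
        by_flow[map_to_flow(budget_ms)].append(pkt_id)
    granted = []
    for name, _, _ in sorted(FLOWS, key=lambda f: f[2]):
        for pkt_id in by_flow[name]:
            if len(granted) < resource_blocks:
                granted.append(pkt_id)
    return granted


# (packet id, delay budget in ms), illustrative values only.
packets = [("p1", 300), ("p2", 10), ("p3", 120), ("p4", 15)]
print(schedule(packets, resource_blocks=3))
```

A learning-based scheduler would replace the fixed priority order with a policy trained on a reward that reflects the QoS satisfaction of each flow.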
Huo et al., (2020) [28][76] proposed an energy-efficient model for resource allocation in edge networks applying deep reinforcement learning. Their case study considered multimedia broadband services in the mobile network and addressed the challenge of the inefficient allocation of resources, including bandwidth and energy consumption. Energy consumption for the system takes the form of transmission energy and basic energy, which are both required to support the network flow. A simulation of the proposed work was carried out on three base stations with a significant number of active users. Four user configurations were considered, with three, four, five and six users, respectively. The obtained results verified the effectiveness of the DRL-based scheme and its usefulness in catering to mobile user requirements while outperforming competing methods in energy-efficient resource allocation.
Wu et al., (2021) [29][77] researched video service enhancement strategies that guarantee that video coding rates are fairly distributed over user devices. The research considered video coding rates under the constraints of statistical delay and limited edge caching capacity. For content delivery to match the quality-of-service requirements, two methods were highlighted. The first was content caching, which ensures the content is as close to the user as possible. The second was video delivery, which achieves optimized performance through a sequenced scheduling of users based on an optimization policy established to improve the network. Both methods performed well in differing scenarios, and several studies have attempted to hybridize them to derive the combined benefits. The authors proposed a combined human–artificial intelligence approach capable of improving caching hit rates by more accurately predicting video cache requirements within the MEC network. The artificial intelligence component is responsible for learning user interest, movie attributes and ratings in terms of low-order and high-order features. This capability is made possible by the joint functioning of the factorization machine (FM) model and the multi-layer perceptron (MLP) model. This information concerning user preferences and behavior is adopted by a socially aware model that takes individual preferences and models them into groups depicting a demographic of users with similar interests.
The video delivery policy is founded on the user's interest prediction and edge caching decisions. The optimization problem, which is posed by limited caching resources within the MEC and by video coding rates, is modelled by first identifying the delay violation probability. This gives rise to an analytically derived statistical delay guarantee model with a dual bisection exploration scheme to guide service delivery. The solution to the modelled optimization problem yields video coding rates that outperform other user-based logic methodologies. Coupled with the predictive competence of the hybrid human-machine intelligence, video caching complements the adopted service delivery method to create a more reliable bandwidth allocation structure. To test the suitability of the proposed method, data were obtained from MovieLens, a web-based recommendation service which recommends movies to users based on their previous interests. The results from the observed service simulations showed that increases in video coding rates were met with corresponding changes in maximum delay tolerance and in the probability of delay violations exceeding QoS stipulations. A reduced constraint on delay violations made allowance for higher video coding rates, and it was shown that a reduced constraint creates convergence in video coding rates as they approach the mean channel capacity.
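The role of a bisection search in finding the highest coding rate that still respects a delay violation constraint can be illustrated with a small sketch. The delay model used here is a stand-in (the sojourn-time tail of an M/M/1 queue), not the statistical delay guarantee model derived in the cited work, and it uses a single bisection rather than the dual scheme mentioned above; the function names and all numeric values are assumptions.

```python
import math


def violation_prob(rate, capacity, d_max):
    """Stand-in delay model: M/M/1 sojourn-time tail,
    P(delay > d_max) = exp(-(capacity - rate) * d_max) for rate < capacity."""
    if rate >= capacity:
        return 1.0
    return math.exp(-(capacity - rate) * d_max)


def max_rate(capacity, d_max, eps, tol=1e-6):
    """Bisection for the largest rate with violation probability <= eps."""
    lo, hi = 0.0, capacity
    if violation_prob(lo, capacity, d_max) > eps:
        return None  # even a zero rate cannot meet the constraint
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if violation_prob(mid, capacity, d_max) <= eps:
            lo = mid  # constraint met: try a higher rate
        else:
            hi = mid  # constraint violated: back off
    return lo


# Illustrative units: rates in segments per second, d_max in seconds.
strict = max_rate(capacity=10.0, d_max=2.0, eps=0.01)
relaxed = max_rate(capacity=10.0, d_max=2.0, eps=0.10)
print(round(strict, 3), round(relaxed, 3))
```

Because the violation probability in this toy model grows monotonically with the rate, bisection is sufficient, and relaxing the allowed violation probability yields a higher admissible rate that moves closer to the channel capacity, in line with the trend reported above.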

4. Hardware and Devices for Multimedia on Edge Intelligence

Edge computing has great applications for multimedia technology. Graphics processing units (GPUs), high-end Field Programmable Gate Arrays (FPGAs) and Tensor processing units (TPU) are some of the multimedia edge AI computing devices/platforms [30][78]. This section provides a discussion on hardware and devices for multimedia on edge intelligence. The reader can refer to the survey papers in [31][32][79,80] for further works on GPU and FPGA-embedded intelligence systems. Edge-based hardware devices for the deployment of AI and machine learning can be classified into the following types: (1) Application-Specific Integrated Circuit (ASICs) Chips—ASICs for AI applications are designed specifically to execute machine/deep learning algorithms and have the advantages of being compact in size with low power consumption. Some examples of ASICs for AI are the ShiDianNao [33][81] and Google TPU. (2) Graphics Processing Units (GPUs)—GPUs have the advantages of being able to perform massive parallel processing to increase the throughput and are able to achieve a higher computational performance for AI algorithms/modules compared to conventional microprocessor/CPU-based architectures.
Some examples of GPUs for AI are the Nvidia Jetson and Xavier architectures [34][82]; (3) Field-Programmable Gate Array (FPGA)—FPGAs have the advantages of being reconfigurable to give flexibility to implement custom AI architectures with lower energy consumption and higher security. An example of an FPGA device which is commonly used for AI acceleration is the Xilinx ZYNQ7000; and (4) Neuromorphic chips—these brain-inspired chips have the advantages of accelerating neural network architectures with low energy consumption. An example of a neuromorphic device which is commonly used for AI acceleration is the Intel Loihi [35][83]. It should be noted that neuromorphic approaches may utilize algorithms (e.g., spiking neural networks) which are different from conventional AI approaches.
There are various ways or modes in which edge AI models can be deployed as discussed by the authors of [36][84]: (1) Edge-Based Mode—in this mode, the edge AI device receives and sends the data to the edge server to perform the inference/prediction processing and returns the results to the edge AI device. This mode has the advantage that the edge server contains the centralized inference model for ease of deployment but has the disadvantages of latency depending on the network bandwidth. (2) Device-Based mode—in this mode, the edge AI device retrieves the inference model from the edge server and performs the prediction/inference task locally. This mode has the advantage that the inference processing does not rely on the network bandwidth but has the disadvantage of having a higher computational and memory requirement on the edge device. (3) Edge-Device mode—in this mode, the inference model is partitioned into multiple parts depending on the current factors such as network bandwidth and server workload. The information processing task is then shared between the edge device and the edge server. The mode has the advantages of flexibility and dynamic resource management. (4) Edge-Cloud mode—this mode has similarities with the edge-device mode when the edge device is highly resource constrained.
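The trade-offs between the four deployment modes can be summarized as a simple decision rule. The sketch below is only one way such a rule might look; the function `choose_mode`, its parameters and all thresholds are hypothetical and are not taken from [36][84].

```python
def choose_mode(device_gflops, model_gflops, bandwidth_mbps, server_load,
                min_bandwidth_mbps=20.0, max_server_load=0.8,
                weak_device_ratio=0.1):
    """Pick a deployment mode from coarse resource indicators.

    All thresholds are illustrative assumptions.
    """
    # The device can run the whole model locally: no reliance on the network.
    if device_gflops >= model_gflops:
        return "device-based"
    # Good link and a lightly loaded server: send data, infer on the server.
    if bandwidth_mbps >= min_bandwidth_mbps and server_load <= max_server_load:
        return "edge-based"
    # Highly constrained device: partition between edge server and cloud.
    if device_gflops < weak_device_ratio * model_gflops:
        return "edge-cloud"
    # Otherwise share the model between the device and the edge server.
    return "edge-device"


print(choose_mode(device_gflops=2.0, model_gflops=5.0,
                  bandwidth_mbps=5.0, server_load=0.3))
```

In practice the partition point in the edge-device and edge-cloud modes would itself be chosen dynamically from the current bandwidth and server workload, which this sketch does not model.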
The authors in [37][85] considered artificial intelligence (AI) applications for the industrial Internet of Things (IIoT) and presented a discussion on edge AI technology. The work proposed a shared active transfer learning (SATL) design intended to address the open difficulties of training and testing edge AI applications for IIoT frameworks. The work began with a briefing on smart edge AI, which is a mix of AI and edge computing, with an emphasis on model training for IIoT applications. The suggested SATL design focused on three edge AI concerns: (1) customization; (2) adaptability; and (3) privacy preservation, addressed by the use of active learning (AL), transfer learning (TL), and federated learning (FL), respectively. AL customizes the AI scheme by adjusting the number of labeled samples based on the task requirements. TL improves responsiveness by allowing the scheme to adapt smartly to a new learning task, and FL ensures privacy by using a shared training approach in which the devices do not exchange their data. According to simulation data, SATL attains superior precision with a smaller number of connected edge nodes, and the precision remains high even when the number of training samples is significantly reduced. When compared with alternative state-of-the-art techniques, the SATL model's training procedure took much less time.

4.1. GPU-Based Edge Hardware, Systems and Devices

Graphics Processing Units (GPUs) are high-speed graphics rendering processors with many parallel cores, typically numbering from hundreds to thousands. They provide high-performance computing and, in comparison to CPUs, have a bigger size and a higher power consumption. GPUs are highly suited for AI tasks due to their large number of small cores, which allows for both neural network training and AI inference. Civerchia et al. [38][86] demonstrated the efficiency of 5G-based low-latency remote control and image processing using AI and GPUs to drive SuperDroid robots. The images captured by the SuperDroid robot rover are sent to the image identification scheme through the 5G network. The output of the image processing application is provided to a remote control application, which relays the instruction to the robot rover via the 5G network data plane. Two deployments of the image recognition application were investigated. One runs on the central processing unit of a mini personal computer, while the other runs on the Jetson Nano GPU. The qualitative performance criterion is the rover's ability to complete the slalom and cross the finish line without hitting any cones. Image transfer across the entire virtualized 5G network, image processing with the image processing software, the control decision and the activation of the robot controls all form part of the control chain in both deployments. Quantitative estimates of the delay components of the two service chains, such as the round-trip time from the rover to the NGC N6 interface measured with user-space probes such as ping, showed that the proposed design is efficient, as shown in Figure 68.
Figure 68. (a) Surveillance camera distribution view. (b) Illustration of one master-slave pair [34][82].

4.2. FPGA-Based Edge Hardware, Systems and Devices

Field Programmable Gate Arrays (FPGAs) are made up of a matrix of programmable logic containing configurable logic blocks (CLBs) coupled via configurable interconnects. FPGAs can be reconfigured to satisfy specific application or feature demands. The hardware allows engineers with programming experience to reprogram the device whenever the need arises. FPGAs are the best option when a large degree of flexibility is required. The System-on-Chip FPGA (SoC FPGA) is a popular FPGA implementation approach that combines programmable logic with processor cores (e.g., ARM, MIPS). The work in [39][87] presented an FPGA accelerator for a streaming classification model based on streaming linear classifiers for continuous deep learning analysis (CLDA). Xilinx Vitis 2020.1 and C++ HLS were the building blocks of the project. The design targets the Xilinx ZCU102 kit at a clock speed of 200 MHz. The hardware is controlled by an ARM processor-based host program. Results obtained on the CORe50 dataset showed that the proposed optimization solution leads to significant reductions in latency, resource and energy usage. In all CLDA variants, the FPGA architecture beats the Nvidia Jetson TX1 GPU, with latency reduced by factors of four and five over the GPU for CLDA Plastic Cov. The design can perform class-incremental lifelong learning for object categorization when paired with a frozen Convolutional Neural Network.

4.3. ASIC-Based Edge Hardware, Systems and Devices

Application-Specific Integrated Circuits (ASICs) are specialized logic designs that use a custom circuit library and offer low power consumption, high speed and a tiny footprint. ASICs are recommended for devices that will be produced in very high volumes because they are time-consuming to design and more expensive than other solutions [40][88]. Fuketa and Uchiyama [41][89] discussed custom AI chips that support power-efficient computing and the development of AI computing systems that use parallelism to accelerate neural network processing. Chips known as cloud AI chips are used for both training and inference with deep neural network (DNN) models where the required processing capacity is very high. The researchers described the architecture of edge and cloud AI chips and presented the tools that developers can use to create them. The researchers focused on image recognition tasks, which are common in CPS applications such as autonomous driving and factory automation, with the aim of reducing computing precision from 32-bit floating-point (FP32) precision to enhance energy economy. The work used 16-bit FP (FP16) precision for training, which GPUs also support. The DNN model is made up of multiple layers of neural networks stacked on top of each other. The weighted sum of the input activations is used to produce the output activations.
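The effect of reduced precision on the weighted sum of input activations can be examined with a short, library-free sketch. It uses the half-precision format code of Python's standard `struct` module to round values to FP16; the helper names, weights and activations are illustrative assumptions, and real accelerators typically accumulate at a higher precision than is shown here.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def weighted_sum(weights, activations, bias=0.0, quantize=lambda v: v):
    """Weighted sum of input activations, optionally rounding every
    operand and every partial sum with `quantize`."""
    acc = quantize(bias)
    for w, a in zip(weights, activations):
        acc = quantize(acc + quantize(w) * quantize(a))
    return acc


# Illustrative weights and activations for a single neuron.
weights = [0.1, -0.25, 0.5]
activations = [1.0, 2.0, 3.0]

full = weighted_sum(weights, activations)                    # double precision
half = weighted_sum(weights, activations, quantize=to_fp16)  # FP16 rounding
print(full, half, abs(full - half))
```

The small gap between the two results is the rounding error that is traded for lower energy and memory cost when precision is reduced.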
The application of deep learning for video analytics is disadvantaged by high computational overhead, as noted by the authors of [42][90]. The authors addressed the challenge by proposing FastVA, a framework that integrates deep learning-based video analytics with a neural processing unit (NPU) and edge processing on mobile devices. Based on the accuracy and timing requirements of the mobile application, the project looked into several problems: (1) maximum accuracy, where the purpose is to maximize accuracy within a limited time; (2) maximum utility, where the purpose is to maximize utility, defined as a weighted function of accuracy and processing time; and (3) minimum energy, where the goal is to reduce energy usage under time and accuracy constraints. To address these problems, the authors determined when to offload the work and when to employ the NPU. Their method is based on the network condition, the NPU's unique properties and the optimization objective, and it was formulated as an integer programming problem solved with a heuristics-based scheme. To demonstrate its effectiveness, FastVA was implemented on smartphones and evaluated extensively.
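The maximum-accuracy problem can be pictured, per frame, as choosing the most accurate processing option that still meets the time limit. The sketch below is a simplified per-frame rule, not the integer programming formulation or heuristic of the FastVA authors; the option names, accuracies, timings and frame size are assumptions.

```python
def offload_time_ms(frame_kb, bandwidth_kbps, server_ms):
    """Upload time plus server inference time, in milliseconds."""
    return frame_kb * 8 / bandwidth_kbps * 1000 + server_ms


def pick_option(options, deadline_ms):
    """Max-accuracy rule: best accuracy among options meeting the deadline."""
    feasible = [o for o in options if o["time_ms"] <= deadline_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda o: o["accuracy"])["name"]


def build_options(bandwidth_kbps):
    # Hypothetical profile: the NPU is fast but less accurate, while the
    # edge server is more accurate but pays for the upload.
    return [
        {"name": "npu", "accuracy": 0.70, "time_ms": 40.0},
        {"name": "offload", "accuracy": 0.90,
         "time_ms": offload_time_ms(frame_kb=100, bandwidth_kbps=bandwidth_kbps,
                                    server_ms=30.0)},
    ]


print(pick_option(build_options(bandwidth_kbps=8000), deadline_ms=150))
print(pick_option(build_options(bandwidth_kbps=2000), deadline_ms=150))
```

With a fast link the upload fits within the deadline and offloading wins on accuracy; with a slow link only the NPU meets the deadline, which is the kind of network-dependent switch described above.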

4.4. TPU-Based Edge Hardware, Systems and Devices

The Tensor Processing Unit (TPU) is a machine learning engine with a specific function. It is a processing IC created by Google to handle TensorFlow neural network computing. TPUs are application-specific integrated circuits (ASICs) that are used to accelerate specific machine learning tasks by placing processing elements (small digital signal processors (DSPs) with inbuilt memory) on a common framework and allowing them to communicate and transport data between them. The study in [43][91] analyzed low-power computing architectures built into ML-specific hardware in the context of Chinese handwriting recognition. The NVIDIA Jetson AGX Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures were tested for performance. The streaming latency of AlexNet and a bespoke version of GoogLeNet for optical character recognition were compared. Many architectures are not especially optimized for these models because they are custom-made and not commonly utilized. The AGX's massively parallel architecture allowed it to perform best on the AlexNet model, which requires more memory. The TPU's neural network-optimized architecture allowed it to perform best on the smaller-memory GoogLeNet model while avoiding high-end memory access penalties. Furthermore, because of its closely connected, ML-focused architecture, the NCS2 had the best average throughput across both models, demonstrating its strong adaptation properties. TPU devices have been shown to work very well on the GoogLeNet model and in the intermediate state, according to the authors.