Efficient Approximate Analytics of Georeferenced Big Data: History

The unprecedented availability of sensor networks and GPS-enabled devices has caused the accumulation of voluminous georeferenced data streams. These data streams offer an opportunity to derive valuable insights and facilitate decision making for urban planning.

  • approximate query processing
  • Spark Streaming
  • stratified sampling
  • spatial sampling
  • Douglas–Peucker

1. Introduction

The unprecedented reliance on the ubiquitous Internet of Things (IoT) in all aspects of our lives has caused the accumulation of massive amounts of multidimensional and heterogeneous data that are mostly georeferenced [1,2,3,4]. To extract insightful knowledge from such an abundance of data, big data processing frameworks such as Apache Spark have been rapidly adopted.
In particular, with most IoT data arriving in streams, considerable effort in the relevant literature has focused on designing and promoting a constellation of tools known as data stream processing systems (SPSs) [5]. Those systems typically search for a good trade-off between two indispensable QoS requirements specified in service-level agreements (SLAs): low latency and high accuracy. Because those requirements normally conflict, where improving one naturally degrades the other, a variety of SPSs characterized by different trade-offs and based on approximate computation has recently emerged [3,4,6,7]. Approximate computing is driven by the fact that users are typically satisfied with approximate results and are willing to accept tiny, error-bounded losses in accuracy in exchange for lower latency [4,6]. Numerous methods are applied in approximate computing; however, they are mostly based on either selecting subsets of the arriving data streams or discarding (also known as shedding) extra data loads that would otherwise introduce additional latency. Sampling is, by and large, one of the most widely adopted methods for such purposes [3].
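As a concrete example of sampling over a stream, the following is a minimal sketch of reservoir sampling in plain Python, which keeps a fixed-size uniform random sample of an unbounded stream; the function name and parameters are illustrative and not taken from any of the cited systems:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Maintain a uniform random sample of size k over an unbounded stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # new item kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: approximate the mean of a simulated sensor stream from a small sample.
stream = (random.gauss(50.0, 10.0) for _ in range(100_000))
sample = reservoir_sample(stream, k=1_000)
print(sum(sample) / len(sample))            # close to the true mean of ~50
```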
Embracing randomness has been the defining factor for most sampling methods designed in the relevant literature so far. This is attributed, in part, to the fact that application scenarios were commonly known to generate data exhibiting normal distributions [3]. However, things have changed dramatically with the emergence of smart environments (e.g., smart cities) and big data scenarios. Those scenarios currently generate big data characterized by high skewness and thus non-uniform distributions. In short, this means that relying on random sampling methods does not perform well when it comes to approximating geo-statistics (e.g., the average speed of a vehicle moving along a specific trajectory in a city).
Online sampling that is aware of the geographical characteristics of big data has been demonstrated to be computationally expensive and could cause SPSs to halt during high spikes in data arrival rates [6]. This problem is exacerbated by the fact that well-performing spatial sampling algorithms have to be based on advanced data structures, as opposed to simple random sampling. In fact, they are mostly based on stratification, which divides the data into groups corresponding to the real geometrical areas where they were collected, so that roughly equal numbers of data objects can be drawn from each area independently.
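A minimal sketch of grid-based stratified spatial sampling in plain Python is shown below; the fixed grid, the function name, and the parameters are illustrative assumptions rather than a specific algorithm from the cited works:

```python
import random
from collections import defaultdict

def grid_stratified_sample(points, cell_size, per_stratum, rng=random.Random(0)):
    """Group (lon, lat, value) records by grid cell and draw a roughly equal-sized
    random sample from each cell, so sparse areas are not drowned out."""
    strata = defaultdict(list)
    for lon, lat, value in points:
        cell = (int(lon // cell_size), int(lat // cell_size))   # stratum key
        strata[cell].append((lon, lat, value))
    sample = []
    for cell, records in strata.items():
        k = min(per_stratum, len(records))
        sample.extend(rng.sample(records, k))
    return sample

# Example: synthetic GPS readings with a skewed spatial distribution.
pts = [(random.uniform(0, 1) ** 3 * 10, random.uniform(0, 10), random.gauss(40, 5))
       for _ in range(10_000)]
print(len(grid_stratified_sample(pts, cell_size=1.0, per_stratum=20)))
```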

2. Spatial Approximate Query Processing

Massive amounts of geotagged data streams are continuously hitting database management systems (DBMSs), with arrival rates fluctuating such that, in peak hours, the data size typically overwhelms those systems and far exceeds their computational capacity [3,6,7]. Seeking to derive insights from such data [6], DBMSs need to manage and store them efficiently. In such cases where size is overwhelming, systems resort to either elasticity or adaptivity [14]. In the former, systems overprovision or deprovision resources depending on the fluctuation in data arrival rates, whereas in the latter, systems resort to applying approximate query processing (AQP). The disadvantage of elasticity is that it keeps the online DBMS busy with reconfiguration, whereas those resources could be better allocated to data stream processing instead.
The typical form of adaptivity is represented by AQP. This entails reducing the size of the arriving data in a way that enables the system to operate gracefully in response to fluctuations in data arrival rates. The most widely adopted method for AQP in the literature is sampling, which works by selecting a representative subset of the arrived data in a way that guarantees bounded errors in exchange for significant gains in processing time and reduced latency. AQP is highly sought after in the state of the art simply because a tiny loss in accuracy in exchange for a significant speedup is desirable and acceptable to decision makers in smart city scenarios [4].
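To make the notion of bounded error concrete, the sketch below estimates an aggregate (the mean) from a uniform sample and attaches a standard confidence-interval half-width; this is a textbook estimator, not the error model of any particular system discussed here:

```python
import math
import random

def approximate_mean(sample, confidence_z=1.96):
    """Estimate the population mean from a uniform sample with a ~95% CI."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    half_width = confidence_z * math.sqrt(var / n)         # CI half-width
    return mean, half_width

population = [random.gauss(30.0, 8.0) for _ in range(1_000_000)]
sample = random.sample(population, 5_000)                  # ~0.5% of the data
est, err = approximate_mean(sample)
print(f"average speed ~= {est:.2f} +/- {err:.2f}")
```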
Authors in [7] proposed an AQP system for online data stream processing. Their system is composed of three main components: data learning, a dynamic sampling strategy, and error control. Their sampling strategy is based on stratification and binary-tree division. They split arriving data streams into tree-based nodes (nodes are analogous to partitions in distributed parallel data processing terms). The other component is a loop-feedback-based error controller that computes the sample sizes adaptively for subsequent processing cycles. The shortcomings of this framework include the fact that the stratification methodology is expensive, as it is tree-based. Also, the system is unaware of spatial characteristics and is not reliable for online spatial data processing.
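The loop-feedback idea can be illustrated generically as follows: after each processing cycle, the observed error is compared with the error target, and the sample size for the next cycle is scaled accordingly. This is a simplified, hypothetical controller, not the one implemented in [7]:

```python
def next_sample_size(current_size, observed_error, target_error,
                     min_size=100, max_size=100_000, damping=0.5):
    """Loop-feedback rule: grow the sample when the observed error exceeds
    the target, shrink it when there is error budget to spare."""
    if observed_error <= 0:
        return max(min_size, int(current_size * (1 - damping)))
    ratio = observed_error / target_error
    # Damped multiplicative update to avoid oscillating sample sizes.
    adjusted = current_size * (1 + damping * (ratio - 1))
    return int(min(max_size, max(min_size, adjusted)))

# Example: error twice the target -> the next cycle samples ~50% more data.
print(next_sample_size(current_size=2_000, observed_error=0.10, target_error=0.05))
```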
In the same vein, SnappyData [15] has been designed to operate over Apache Spark [16] and incorporates AQP for data stream processing and analytics. It features data structure modules that balance latency and accuracy QoS SLA targets. However, it is not optimized for spatial AQP and data analytics. It does not host a spatial-aware sampler.
Authors in [17] have designed Taster, which encapsulates a few components, including a collector that employs synopsis reduction techniques as a front stage to select a subset of the data. In addition, it hosts a tuner and a cost planner. The cost planner produces on-the-fly approximate execution plans that should meet the accuracy/latency targets of an SLA; each plan is run against a synopsis (e.g., samples or sketches) that either persists in a data warehouse or is drawn from an online data stream. The tuner then selects the plan that optimizes performance, and the system uses that plan to capture the synopsis (persisted on disk or taken from online data streams) and computes approximate results from it to answer the query posed by the user. This system, however, is not designed with spatial data processing in mind; it also regularly stores synopses on disk, which adds extra overhead.
Stock versions of those AQP systems do not include out-of-the-box support for geospatial data processing, which leaves handling those logistics to the presentation layer, thus burdening front-end developers and distracting them from the main task of developing dependable spatial data stream analytics. This is so because geospatial data are parametrized and represented as pairs of coordinates (typically longitudes and latitudes). As a result, spatial data lose their spatial characteristics while moving through a network, and restoring them to their multidimensional shape is a computationally expensive task that normally involves costly spatial join processing [18,19]. In conclusion, for a data stream processing system to be able to work with big geospatial data at scale with QoS guarantees, it must incorporate support for geospatial data processing by design and be intrinsically aware of other spatial characteristics. It should also support interactive and proactive mechanisms that respond to sudden spikes in data arrival rates in a manner that guarantees, to a good extent, the stability of the system. It should be noted, however, that online spatial sampling is challenging, basically because of the multidimensionality of the data. Stock versions of state-of-the-art data stream processing engines do not currently offer spatial awareness and AQP with QoS guarantees together [3,6,19,20].
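As a rough illustration of why re-attaching polygon context to raw coordinate pairs is expensive, every incoming point must be tested against candidate polygons, for instance with a ray-casting point-in-polygon test as in the naive sketch below (no spatial index; real systems would prune candidates with R-trees or similar structures; all names are illustrative):

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count crossings of a horizontal ray with polygon edges.
    polygon is a list of (lon, lat) vertices; O(#vertices) per point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):                       # edge straddles the ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Naive spatial join: each point is matched against every zone polygon.
def tag_points(points, zones):
    return [(p, name) for p in points
            for name, poly in zones.items() if point_in_polygon(p[0], p[1], poly)]

zones = {"downtown": [(0, 0), (0, 2), (2, 2), (2, 0)]}
print(tag_points([(1.0, 1.0), (3.0, 3.0)], zones))         # only the first point matches
```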
Within the same consortium, ApproxHadoop [21] features a two-level sampling mechanism atop Apache Hadoop, which drops tuples to reduce the data size and, in turn, the I/O overhead. Nevertheless, it relies on an offline profiling method for reconfiguring the latency/accuracy estimates and is therefore hardly applicable to data streaming workloads.
Other big data systems, such as Simba [22] and SpatialHadoop [23], employ sampling for spatial data partitioning in the Cloud. However, they do not work for dynamic spatial data stream processing scenarios, where data arrival rates fluctuate periodically.
Also, EXPLORA [24] has been designed to support AQP and geo-visualization of spatiotemporal data at scale with QoS guarantees. It hosts a sampling method that selects representative samples of the arriving data online, aiming to speed up processing while keeping accuracy in check. However, it does not host a latency-aware controller that monitors fluctuations in data arrival rates and reacts proactively.
Also, spatial-aware approximate query processing (SAQP) has been employed to integrate calibrated, granular, heterogeneous, spatially tagged pollution and mobility data for joint analytics at scale with QoS guarantees, as in [25]. The authors designed EMDI (Environmental Mobility Data Integrator) for integrating mobility data with environmental data, e.g., pollution, climatological, and meteorological data, for complex unified analytics with QoS guarantees. It features a sampler that draws data from both sources for unified AQP analytics.
However, those systems do not take advantage of line simplification algorithms. As stipulated in Tobler’s first law of geography, objects that are closer in real geometries are more related, a fact whose usefulness has been widely demonstrated in the literature [3,4,6,26,27]. Moreover, most analytics do not consider spatial objects on the outskirts of cities; they focus on city centers and livable areas. Line simplification algorithms are therefore indispensable for efficient, timely analytics of big georeferenced data with QoS guarantees, i.e., for reducing latency while keeping accuracy in check.

3. Line Generalization Algorithms

Vector line generalization algorithms can be divided into several categories, as discussed in [9]: independent point algorithms (such as the nth point routine and the routine of random selection of points), local processing routines (such as the routine of distance between points and the perpendicular distance routine), unconstrained extended local processing (such as the Reumann–Witkam routine [10] and the sleeve-fitting polyline simplification algorithm [11]), constrained extended local processing (such as the Opheim and Lang simplification algorithms [12]), and global routines (such as the Douglas–Peucker algorithm [8] and the algorithm by Visvalingam and Whyatt [13]).

Independent point algorithms work by grouping points on a line into independent sets of consecutive points and then retaining random points. Local processing routines eliminate points whose radial or perpendicular distance is less than a user-supplied threshold (tolerance bandwidth) while retaining points whose distance exceeds that threshold. From the family of unconstrained extended local processing, the Reumann–Witkam routine [10] passes a rectangular strip stepwise through all segments of the original line; the start and end points within the strip are retained, while in-between points are eliminated. The sleeve-fitting polyline simplification algorithm works similarly, except that the strip, known as a sleeve, is passed through the segments of the original line. The Opheim simplification algorithm, from the category of constrained extended local processing, works like the unconstrained extended local processing routines but applies additional minimum and maximum constraints within a search region (similar to strips and sleeves): points within the minimum tolerance, or within the search region bounded by the maximum tolerance, are removed, while the other points of the original line are retained. The Lang simplification algorithm works similarly to the Opheim algorithm, also using search regions, with the additional condition that the distance of intermediate points to the segment connecting the start and end points of the region must not exceed a user-supplied tolerance for those in-between points to be retained.
Global routines, e.g., DP, take the entire line into consideration while processing, in contrast to the other four categories, which consider segments of the line stepwise from the start point to the end point. Visvalingam–Whyatt, from this category, works by eliminating the original point with the minimum effective area (the triangular area formed by a point and its two immediate neighbors). The DP algorithm performs favorably for tiny weeding (minor simplification), while the Visvalingam–Whyatt algorithm wins at eliminating entire shapes (e.g., caricatural generalization) [9].
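For reference, a compact recursive formulation of the DP algorithm on planar coordinates is sketched below in plain Python; production implementations typically use iterative variants and dedicated spatial libraries:

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line supporting segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    if (ax, ay) == (bx, by):
        return math.hypot(px - ax, py - ay)
    num = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
    return num / math.hypot(bx - ax, by - ay)

def douglas_peucker(points, tolerance):
    """Keep the point farthest from the chord; recurse only if it exceeds tolerance."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > tolerance:
        left = douglas_peucker(points[:index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right            # avoid duplicating the split point
    return [points[0], points[-1]]

line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(line, tolerance=1.0))
```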
Extensive experiments conducted in [9] show that the Douglas–Peucker algorithm produces the most accurate generalization in terms of measures such as mean distance and mean dissimilarity (i.e., it performs best with respect to shape distortion).

4. Applications of Line Simplification in Approximate Geospatial Analysis

Line generalization has been widely adopted in several aspects of complex geospatial analysis, including trajectory compression. For example, a recent work by [28] parallelized several trajectory compression algorithms atop Apache Spark, including one based on Douglas–Peucker. In that setting, the algorithm is applied to trajectories formed from GPS points: trajectories are polylines obtained by connecting the points, parts of those polylines are cut, and the corresponding points are discarded accordingly. The process does not involve reducing a polygon; thus, no spatial join is involved.
In the same vein, ref. [29] designed a novel geospatial-aware big data analytics framework that features trajectory compression using DP line simplification as an integral part of its operation. The same applies to the work of [30], where the authors proposed an Enhanced Douglas–Peucker (EDP) algorithm that applies a group of enhanced spatial–temporal constraints (ESTC) while simplifying trajectory data, aiming primarily to preserve a few spatial–temporal features while applying line simplification for compressing trajectories.
Within the same consortium, ref. [31] designed a novel method to uncover hotspot routes in mobility data by employing massive taxi trajectory data. To reduce the burden on storage and processing resources, they employed the DP algorithm as a quick, approximate filter at the front stage of their system to reduce the number of trajectory points passed to clustering downstream.
Also, variations on the plain application of the DP algorithm are available. For example, ref. [32] proposed a velocity-preserving, DP-based trajectory simplification algorithm, which divides the source trajectory into sub-trajectories based on their average velocity such that the velocity of every sub-trajectory differs significantly from the others. Each sub-trajectory is assigned a different threshold based on these velocities; thereafter, each sub-trajectory is simplified independently using the plain DP algorithm, and all simplified sub-trajectories are finally merged into a single simplified trajectory.
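The general idea described in [32] can be roughly illustrated as follows; the split rule, the tolerance-assignment formula, and all names below are illustrative assumptions, and the plain DP step is represented by whatever simplification routine is passed in (e.g., the `douglas_peucker` sketch shown earlier):

```python
def split_by_velocity(trajectory, velocity_jump=2.0):
    """Split a trajectory of (x, y, t) fixes into sub-trajectories whenever the
    speed between consecutive fixes changes by more than velocity_jump."""
    segments, current = [], [trajectory[0]]
    prev_speed = None
    for (x1, y1, t1), (x2, y2, t2) in zip(trajectory, trajectory[1:]):
        speed = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 / max(t2 - t1, 1e-9)
        if prev_speed is not None and abs(speed - prev_speed) > velocity_jump:
            segments.append(current)
            current = [(x1, y1, t1)]        # sub-trajectories share boundary points
        current.append((x2, y2, t2))
        prev_speed = speed
    segments.append(current)
    return segments

def simplify_by_velocity(trajectory, simplify, base_tolerance=1.0):
    """Assign each sub-trajectory its own tolerance (illustrative rule only),
    simplify each one independently, then merge the simplified pieces."""
    merged = []
    for seg in split_by_velocity(trajectory):
        speeds = [((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 / max(t2 - t1, 1e-9)
                  for (x1, y1, t1), (x2, y2, t2) in zip(seg, seg[1:])]
        mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
        tol = base_tolerance / (1.0 + mean_speed)          # smaller tolerance when fast
        simplified = simplify([(x, y) for x, y, _ in seg], tol)
        merged.extend(simplified if not merged else simplified[1:])  # drop shared point
    return merged

# A trivial stand-in simplifier (keep only endpoints); in practice one would pass
# a plain DP routine such as the douglas_peucker sketch from the previous section.
endpoints_only = lambda pts, tol: [pts[0], pts[-1]]
traj = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (2, 5, 3), (2, 10, 4), (2, 15, 5)]
print(simplify_by_velocity(traj, simplify=endpoints_only))
```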
In addition, ref. [33] designed an adaptive DP-based algorithm with automatic thresholding for AIS-based vessel trajectory compression. Another DP-based algorithm was designed by [34] for compressing massive automatic identification system (AIS) ocean datasets. Also in oceanography, refs. [35,36,37] designed DP-based algorithms for the compression of AIS ocean ship data.
Another line of application of line simplification algorithms for scale-aware geospatial data analytics is big georeferenced data visualization. In this direction, ref. [38] applied a DP-based algorithm to reduce the georeferenced data to be visualized at scale with QoS guarantees. Authors in [39] propose a novel algorithm, termed locally adaptive line simplification (LOCALIS), for GPU-based visualization of geographic polyline data at scale with quality guarantees. Similar work appears in [40], where a Level-of-Detail (LoD) visualization method for polygonal data was designed, incorporating a polygon simplification method based on the DP algorithm. Ref. [41] applies the DP algorithm to generate trajectory data for thematic heatmaps at the city scale for tourist activity analysis. Similarly, authors in [42] designed an approach for the efficient generation of heatmaps using DP-based methods. Also, ref. [43] designed RectMap by combining DP-based simplification with a gridding algorithm to generate simplified versions of plain reference maps.
The picture that emerges from the recent state of the art is the following: these systems mostly run on beefed-up servers and are not deployed in parallel. They are also not combined with stratified-like spatial sampling, and they are mostly applied to simplifying polylines, not polygons. Simplifying polygons has a great impact on the processing and high-quality analytics of big geospatial data streams for several reasons, including the fact that georeferenced sensor data are parametrized and typically need to be joined with polygons for insightful analytics, which adds extra I/O and computational overhead. Sending those polygons to Edge nodes in Edge–Cloud deployments designed for spatial data analytics also adds extra layers of network overhead that slow down overall performance. Furthermore, federated learning is a branch of machine learning (ML) that parallelizes ML algorithms so that they run on Edge devices near the data; being able to join spatial data quickly with polygon data on those devices can therefore provide significant speedups.

This entry is adapted from the peer-reviewed paper 10.3390/s23198178
