Others

Big Data Concepts

The entries consolidates the main concepts about Big Data modeling and management. The characterization and the definitions about structured, semi-structured and unstructured data.

A Brief History of Big Data

The production and processing of large volumes of data began to be of interest to researchers many years ago. By 1944, estimations for the size of libraries, which increased rapidly every year, were made in American universities [3]. In 1997, at the Institute of Electrical and Electronics Engineers (IEEE) Conference on Visualization, the term “Big Data” was used for the first time during the presentation of a study about large datasets’ visualization [4].

Big Data is the buzzword of recent years, that is, a fashionable expression in information systems. The general population relates the term Big Data to its literal meaning of large volumes of data. However, Big Data is a generic term used to refer to large and complex datasets that arise from the combination of famous Big Data V’s that characterize it [5].

Big Data Characterization

As mentioned before, Big Data does not refer only to high volumes of data to be processed. At the beginning of the Big Data studies, their volume, velocity and variety were considered as fundamental characteristics, which were known as the three Vs of Big Data. After advances in the research, new Vs, such as value and veracity, were established. Currently, there are authors who propose up to 42 characteristics needed to consider data as Big Data, therefore, they define 42 Vs for Big Data [6]. For the purposes of our study, we will mention only ten Vs of Big Data, that are presented in a scientific study [7]. Table 1 summarizes each characteristic, along with a brief description.

Table 1. Ten Vs Big Data.

Characteristic

Brief Description

Volume

Large data sets

Velocity

High data generation rate

Variety

Different type of data formats

Variability

Consistent data

Viscosity

Data velocity variations

Virality

Data transmission rate

Veracity

Accuracy of data

Validity

Assessment of data

Visualization

Data symbolization

Value

Useful data to retrieve info

Volume and Velocity

To deal with the Volume and Velocity characteristics of Big Data, ecosystems and architectural solutions, such as lambda and kappa, have been created. Both architectures propose a structure of layers to process Big Data; the main difference between them is that lambda proposes a layer for batch data processing and another for streaming data, while kappa proposes a single layer for both batch and streaming processing [8]. This SLR focuses on data modeling, a concept related to the Variety characteristic, which is explained next.

Variety

Variety is a characteristic referring to the different types of data and the categories and management of a big data repository. As per this characteristic, Big Data has been classified into structured, semi-structured and unstructured data [9,10]. The next subsections explain in detail each data type.

Structured Data

In Big Data, structured data are represented in tabular form, in spreadsheets or relational databases [10]. To deal with this type of data, widely developed and known technologies and techniques are used. However, according to the report presented by the CISCO company, this type of data only constituted 10% of all existing data in 2014 [11]. Therefore, it is very important to analyze the 90% of the remaining data, corresponding to the semi-structured data and unstructured data that will be described below.

Semi-Structured Data

Semi-structured data are considered to be data that do not obey a formal structure, such as a relational database model. However, they present an internal organization that facilitates its processing; for instance, servers’ logs in comma-separated values (csv) format, documents in eXtensible Markup Language (XML) format, JavaScript Object Notation (JSON) and Binary JSON (BSON) and so forth. Some authors may consider XML and JSON as structured [10].

Unstructured Data

Unstructured data are considered those that have either no predefined schema or no organization in their structure [12]. Within this type of data are text documents, emails, sensor data, audio files, images files, video files, data from websites, chats, electronic health records, social media data and spatio-temporal data, among others [9]. According to CISCO, the volume of unstructured data between 2017 and 2022 is expected to increase up to twelvefold [13].

To support the Variety, Volume and Velocity of Big Data, non-relational, distributed and open source data storage systems have been created. These systems include horizontal scalability, linearization, high availability and fault tolerance. Usually, these databases are known as NoSQL.

The article has been published on 10.3390/su12020634

Cite this article

Diana, Martinez-Mosquera. Big Data Concepts, Encyclopedia, 2020, v1, Available online: https://encyclopedia.pub/556