Big data mining (BDM) is an approach that applies data mining or extraction techniques to large datasets or volumes of data. It focuses on retrieving relevant, in-demand information (or patterns) and thus on extracting the value hidden in data of immense volume. BDM draws on the conventional notion of data mining but also incorporates aspects of big data, i.e., it makes it possible to acquire useful information from databases or data streams that are huge in terms of the big data “V’s”, such as volume, velocity, and variety.
In a world overflowing with information, which is particularly evident in information societies with access to the Internet, the term “big data” is indispensable and interconnected. Nowadays, data are being sent to the global network not only by people who do it consciously and manually (e.g., via social networks or e-mails), but also by all kinds of sensors and with the use of cloud computing. It is a Web 3.0 domain of which big data is one of the main pillars.[1]
Big data, as a nascent concept, has had a rather turbulent history of definition attempts, and Gandomi and Haider made an effort to organize the definitions.[2] The common thread in all definitions is the perception that big data, which can be viewed as a ‘new era’ of the data-driven paradigm, has opened up new possibilities for improved decision support.[3] Big data is not only vast and dynamic; analyzing and processing it also requires cutting-edge technologies.[4] Big data is distinguished from the traditional notion of “data” because, due to its sheer size, it cannot be processed and managed with conventional data mining tools.[5][6]
The concept of big data mining (BDM) is closely associated with the conventional notion of data mining. The two concepts differ mainly in the methods of obtaining data, not in the idea itself. BDM makes it possible to obtain useful information from databases or data streams that are huge in terms of the big data “V’s”, such as volume, velocity, and variety.[7] The main functions of data mining in general are descriptive functions (such as clustering, association, and pattern mining) and predictive functions (such as classification, time series analysis, etc.). These functions (enablers) differ only slightly from each other, mainly in their temporal reference: descriptive functions mainly concern the present and settled dependencies, whereas predictive functions mainly concern the future and tentative dependencies. Once (big) data have been collected, they must be analyzed to extract the information hidden in them, and for this it is crucial to use big data analytics (BDA) tools. BDA was created in response to the need to analyze vast volumes of quickly collected complex data. As a result, data acquisition and processing occur at a high pace, which is impossible to achieve with conventional computational methods.[8] Big data analytics, being a derivative of big data, can also be described with the big data “V” characteristics. The final and pivotal step of the whole BDA process is “action on insight”, as Akhtar et al. [9] claim. The use and implementation of BDA, one of the most important factors in generating meaningful insights for decision-making,[10][11] is crucial to extracting value from the multitude of data being obtained. The organizational capability to handle BDA has recently become a mainstream way to create value.[12] It should be noted, however, that the development of BDA potential in organizations can be held back by a lack of IT infrastructure, data storage facilities, and organizational strategy.[13]
To extract the hidden value of customer insights, big data (along with derived approaches, such as big data mining or, more broadly, big data analytics) comes in handy. Having presented the historical and theoretical background regarding the values of customer insights, it is possible to explore this topic more extensively, paying attention, e.g., to the practical use of big data for customer insights in organizations.
Big data‑enabled societies, particularly those built on the foundations of the digital economy, are capable of opening new perspectives for organizations striving to get to know their customers better.[14] The enormous volumes of data deriving from a variety of sources make it possible to analyze a greater number of dimensions describing customers than before. This is mostly due to new data sources, such as social media. Thanks to its characteristics, big data offers enormous possibilities for gaining new insights.[15][16] These new insights are not limited to customer-centric decision-making processes but affect the whole operating space of an organization. Put simply, the insights from big data, if properly used, may contribute to value generation, as well as to innovation and competitive advantage.[17][18] Specifically, big data analytics is considered together with data mining issues.[19] For example, Xindong Wu et al. [20] propose a big data processing model from the data mining perspective. They point out that mining big data is data-driven and demand-driven. This context of data mining is present in many big data analytics definitions. In the paper of Mohsenian-Rad et al. [21] this type of analytics is described as the process of uncovering hidden patterns, unknown correlations, irregularities, and other data-driven intelligence. Data mining tasks related to big data analytics also encompass text mining (for sentiment analysis) and social media analytics (for community detection or social influence analysis).[2] The latter in particular may be of paramount importance in modern dynamic operational environments, because it empowers organizations to perform so-called situational data analytics instead of, or at least together with, classic static data analytics of transactional or enterprise data.[22]
Generating insights from big data is a process consisting of two main activities: data management and data analytics. The former encompasses data acquisition, extraction, cleaning, integration, and representation. The latter consists of data modeling, analysis, and interpretation.[2] Data mining techniques are among the most widely used in big data analytics. Hence it can be assumed that the challenges in BDA also concern big data mining, even if this is not stated explicitly. In very general terms, the first and foremost challenge of big data analytics is to generate business value.[23] It is also one of the ultimate goals of BDA and big data mining. The other goals are the provision of competitive advantage[24][25] and the generation of new business ideas from big data insights. Obviously, the quality of insights results from a proper orchestration of big data-related resources, that is, data, technology, processes, and people within the framework of the organization.[26] Human skills and organizational culture are as important as the technological dimensions of BDA in providing valuable results for the success of the organization.[17][27] The overall success of big data analytics therefore depends on such factors as top management support, organizational change, technical infrastructure, the data science skillset, data availability and quality, and data security and privacy.[28] Conceptually, the overall big data challenges can be summarized as presented in Figure 1.
Figure 1. Conceptual classification of big data challenges. Source: based on [29].
In the big data lifecycle, the first group of challenges concerns the characteristics of big data itself (the “Vs”), which in turn affect big data preprocessing, e.g., integration, cleansing, and transformation. At the data processing stage, the typical tasks of analysis, modeling, mining, etc. must be adjusted to properly address the challenges of the first stage. Also, at the stage of presenting the results, the graphical methods must be able to cope with visualizing a huge amount of big data analysis results. The big data management stage extends over all other stages and is associated with challenges such as privacy, security, data ownership, and other ethical issues.[30][31][32][33][34] Among the features determining the quality of data, information, and insights, completeness, accuracy, and currency are cited as the most significant.[35] The last feature in particular is a challenge when applied to big data analysis and mining. Not only do the data arrive as a stream or flow (the velocity dimension of big data), but they must also be analyzed and mined in real time to provide value to organizations. We will cover this temporal challenge later.
With data mining (DM) as one of the most important elements of big data analytics, it is not surprising that DM software is one of the most appreciated of the various analytical tools used for BDA.[36] However, even the best software will not produce valuable results from garbage data. Hence, the first group of big data mining challenges includes data inconsistency and incompleteness, scalability, timeliness, and data security. Challenges also concern data capture, storage, searching, sharing, analysis, and visualization.[37] There is common agreement that before mining the data it is mandatory to consider such issues as the validity and reliability of data. The larger the quantity of data, the bigger the challenge. As the amount of data grows, discovering dependencies and valuable patterns becomes extremely difficult.[15][38] The next group of challenges is associated with the mining process itself. The algorithms and techniques used for “classic” data mining in, e.g., data warehouses are sometimes not suited to huge amounts of constantly incoming big data. This is because traditional data mining approaches start with a centralized data repository able to store and process data. With the prodigious size and variety characterizing big data, such a centralized approach may not be feasible.
There is a strong need for more distributed approaches capable of mining huge amounts of unstructured data.[39] Some other challenges include, e.g., the lack of large-scale data representation (for mining purposes), the lack of effective and efficient online large-scale machine learning techniques, and the lack of data confidentiality mechanisms.[40] Challenges also concern mining algorithms, which must deal with sparse, uncertain, incomplete, complex, and dynamic data.[20] Also, the constant inflow of data to be mined can be recognized as a major challenge, as many mining algorithms do not provide proper sequences or patterns.[41] Some of the proposals to overcome this obstacle include, e.g., incremental pattern mining and cluster analysis, in which the discovered patterns and clusters are incrementally augmented with updated information,[42] post-processing enhancements of mined patterns [43] and special spatio-temporal representations of data for further mining.[40] Stages such as data cleaning, integration, ranking, and querying are often considered sources of “algorithmic bias”.[44][45] Reasoning about them, as well as mitigating inequity upstream of the final data analysis phase, is potentially more impactful.[45]
However, it becomes obvious that for valuable insights from big data mining it is essential to consider time-related issues. As Xindong Wu et al. [20] point out, in a dynamic world the data and information representing interesting features of an enterprise’s environment evolve. Hence, while mining useful patterns from big data, it is indispensable to consider these changes. However, it seems symptomatic that only a small number of big data analytics definitions even mention the question of dynamics. For example, Mikalef et al. [24], while presenting sample definitions of BDA, include only two that address the dynamic dimension of BDA. A challenge in big data mining hence arises: how to deal with the dynamic/temporal aspects of the realm described by big data. One way to do so is to implement agile big data analytics. BDA is seen as a ‘bridging’ instrument in the development of software applications using agile methods.[46] Agility is achieved by creating a data infrastructure enabling identification and evaluation of various big data sources.[25] There are also approaches focused on big data stream processing that enable a flexible mining solution.[39] However, these solutions are insufficient when it comes to real-time data processing.[47] Real-time big data analytics presents another challenge related to big data mining which must be considered.[30] Many of the phenomena of interest to the organization are represented as time series;[48] this applies to, e.g., sentiment analysis or users’ website activities. But many other phenomena are too intricate to be represented this way. Knowledge coming from an organization’s environment evolves very quickly because of a constant inflow of data and information. Big data mining in real time may improve decision-making processes in organizations because it would enable dealing with real-time uncertainties.[16][49] The time dimension of big data is reflected in the speed of their inflow.
This causes big data to be transient, which implies the need to mine them as and when they are generated.[50] The timeliness of data analysis and mining is the next challenge, tightly linked with the challenge of dealing with the temporal dimension of big data. This timeliness challenge is discussed in more detail in [37].
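The idea of mining transient data as they arrive can be illustrated with a sliding time window. The following is a minimal sketch and is not taken from any of the cited works: the class name `SlidingWindowCounter`, the `horizon` parameter, and the sample items are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts items seen within the last `horizon` time units of a stream.

    Assumption: timestamps are fed in non-decreasing order.
    """

    def __init__(self, horizon):
        self.horizon = horizon
        self.window = deque()    # (timestamp, item) pairs, oldest first
        self.counts = Counter()  # item -> occurrences inside the window

    def add(self, timestamp, item):
        """Record a new observation and drop the ones that became too old."""
        self.window.append((timestamp, item))
        self.counts[item] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Remove observations that are `horizon` or more time units old.
        while self.window and self.window[0][0] <= now - self.horizon:
            _, old_item = self.window.popleft()
            self.counts[old_item] -= 1
            if self.counts[old_item] == 0:
                del self.counts[old_item]

    def top(self, n=1):
        """Return the n most frequent items in the current window."""
        return self.counts.most_common(n)


counter = SlidingWindowCounter(horizon=10)
for ts, item in [(0, "a"), (5, "b"), (9, "b"), (12, "a")]:
    counter.add(ts, item)
print(counter.top(1))  # the observation at time 0 has already expired
```

Because expired observations are discarded, memory use depends on the window length rather than on the total length of the stream, which is one simple way to cope with the velocity dimension.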
The most intuitive way to deal with the temporal aspects of big data mining is to treat the data inflow as a set of events. This is quite natural because events are the building blocks of organizations’ surroundings; hence they need to be represented and mined during big data analytics. The big data mining process should therefore focus on events inferred from the massive volume of data. It is thus clear that big data mining is closely related to events.[51] The subsequent big data mining challenges may be formulated as: the challenge of event capturing and representation for further analytics, the challenge of constructing temporal big data mining algorithms, and the challenge of representing temporal features of the mined knowledge. The events and temporal information in big data should be identified, the temporal relations among events should be found and represented, and event-based information retrieval and analytics should be performed. Wang et al. [40] proposed a big data temporal analytics solution, but only for texts, leaving aside the many other forms of data that make up the variety feature of big data.
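Representing events and the temporal relations among them can be sketched in a few lines of code. The example below is an illustrative assumption, not the representation used in any of the cited systems: the `Event` record, the `relation` function, and the coarse relation labels are hypothetical. A point event is modeled as an interval whose start and end coincide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical event record; a point event has start == end."""
    name: str
    start: float  # e.g. seconds since some reference time
    end: float

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("event must not end before it starts")


def relation(a: Event, b: Event) -> str:
    """Return a coarse temporal relation of event a with respect to event b."""
    if a.end < b.start:
        return "before"
    if b.end < a.start:
        return "after"
    if a.start == b.start and a.end == b.end:
        return "equal"
    if b.start <= a.start and a.end <= b.end:
        return "during"
    if a.start <= b.start and b.end <= a.end:
        return "contains"
    return "overlaps"


def related_pairs(events):
    """Label each pair of events (in list order) with its temporal relation."""
    return [
        (a.name, relation(a, b), b.name)
        for i, a in enumerate(events)
        for b in events[i + 1:]
    ]


events = [
    Event("campaign", 0, 10),   # interval event
    Event("complaint", 3, 3),   # point event
    Event("purchase", 12, 12),  # point event
]
for triple in related_pairs(events):
    print(triple)
```

Once relations such as these are made explicit, they can serve as input to event-based retrieval or pattern mining. A richer relation set, such as the thirteen relations of Allen's interval algebra, would distinguish more cases than this sketch does.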
Another approach to temporal big data has been proposed in the work of Singh et al. [42], where a frequent pattern mining and cluster analysis model is applied to constantly incoming data. The model encompasses a progressive and incremental update of mined patterns and clusters with new information, and newly discovered patterns and clusters are incrementally added to the existing ones. However, events are not addressed in this approach and there is a lack of temporal representation. In fact, the model concentrates on time series instead of event sequences. An answer to the challenge of temporal big data mining has been proposed in [40][52]. Both approaches consider Complex Event Processing (CEP) systems as a solution. CEP systems are particularly useful for real-time analytics and stream reasoning. These solutions differ in time representation: some are based on point structures, while others are based on intervals (cf. [53]). CEP systems also differ in their complexity and in orientation: computation-oriented vs. detection-oriented ones.[40] A variation of event processing systems has been proposed in [39], namely Semantic Complex Event Processing augmented with an agent that dynamically builds an ontology which can then be queried temporally. However, even mining systems based on event processing are not yet capable of mining causality relations [41], which would contain a lot of useful information on complex phenomena in an organization’s environment.
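The general idea of incrementally updating mined patterns with newly arriving data can be sketched as follows. This is a deliberately simplified illustration and not the model of Singh et al. [42]: it counts every itemset up to a fixed size exhaustively, and the class `IncrementalPatternMiner`, its parameters, and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


class IncrementalPatternMiner:
    """Keeps running support counts so new batches update, not recompute."""

    def __init__(self, min_support, max_size=2):
        self.min_support = min_support  # minimum fraction of transactions
        self.max_size = max_size        # largest itemset size that is counted
        self.counts = Counter()         # itemset (sorted tuple) -> occurrences
        self.n_transactions = 0

    def update(self, batch):
        """Fold a new batch of transactions into the existing counts."""
        for transaction in batch:
            items = sorted(set(transaction))
            self.n_transactions += 1
            for size in range(1, self.max_size + 1):
                for itemset in combinations(items, size):
                    self.counts[itemset] += 1

    def frequent_patterns(self):
        """Return itemsets whose support currently meets the threshold."""
        if self.n_transactions == 0:
            return {}
        return {
            itemset: count / self.n_transactions
            for itemset, count in self.counts.items()
            if count / self.n_transactions >= self.min_support
        }


miner = IncrementalPatternMiner(min_support=0.5)
miner.update([["milk", "bread"], ["milk"]])
print(sorted(miner.frequent_patterns()))  # patterns after the first batch

miner.update([["bread", "eggs"], ["eggs"]])
print(sorted(miner.frequent_patterns()))  # the pair (bread, milk) drops out
```

Each call to `update` touches only the new batch, so earlier data never need to be re-read. The price is that counts for all candidate itemsets are kept in memory, which a practical stream miner would bound with pruning or approximate counting.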
Another group of approaches to analyzing streaming and/or temporal big data is built upon so-called ontology-based data access (OBDA). OBDA originates from Semantic Web analytics, and its core feature lies in separating the conceptual and database levels of data.[22] Unfortunately, OBDA itself does not adapt to changes in data sources. The W3C standardized an ontology language and a query language for OBDA, OWL 2 QL and SPARQL,[54] but these solutions essentially do not handle temporal big data. Incorporating complex temporal information into OBDA, together with the ability to process heterogeneous data, poses a serious challenge.[55] A temporal OBDA is therefore required. There are various ways to develop such a temporal version of OBDA:
The advantage of all the above solutions lies in the direct incorporation of the time dimension into analytics. On the other hand, their main weakness in the context of big data mining concerns the nature of the data and analytics. All the above solutions are directed towards relational/structured data and queries, and do not deal with any data mining tasks. Hence, they cannot be considered satisfactory for temporal big data mining. The resulting challenge concerns augmenting the existing big data mining models, methods, and algorithms with explicit temporal expressions and with ways of handling them in order to mine temporal big data.
Summing up, we note several challenges for big data mining, especially in the context of customer insights. These are:
All these challenges constitute important and promising research areas, but, as shown, the most important and challenging issue concerns incorporating an explicit notion of time into the representation and mining procedures of big data. There is a strong need to express the temporal dimension of big data, and in big data itself, using more complex temporal representations than event calculus. There is also a need to represent the causality of phenomena, to discover changes in the phenomena depicted by big data, and to mine useful temporal patterns in order to gain deep insights into the way the world around organizations evolves.
We have focused on the challenges associated with big data mining, which is a specific subarea of BDA. Obviously, the broader BDA field also faces several challenges. These are primarily challenges related to big data’s volume characteristics. The large sample size may result in several biases, e.g., sampling error, measurement error, aggregation error, etc.[60] Sampling error in particular may result in highly biased data. Researchers have shown examples of such biased data collected from social networks. While gathering such data, it may be erroneously assumed that social media users are representative of the population,[61] while many social groups are excluded from using social media. For example, digitally excluded people (due to age, education, or low socioeconomic status) may not be represented in the retrieved big data sample.[61][62][63] A noticeable bias in big data may also result from gender and race issues.[64][65][66] All these challenges should be kept in mind when addressing the question of big data analytics; however, they are beyond the scope of this entry.
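The effect of assuming that social media users represent the whole population can be made concrete with a small calculation. The sketch below is illustrative only: the function `coverage_bias` and the numbers used are hypothetical and do not come from the cited studies. It compares the rate of some attribute in the full population with the rate that a sample drawn only from online users would approach, however large that sample is.

```python
def coverage_bias(share_online, rate_online, rate_offline):
    """Compare the population rate with the rate seen among online users only.

    share_online: fraction of the population that uses social media
    rate_online:  fraction of online users having some attribute
    rate_offline: fraction of offline people having the same attribute
    """
    if not 0.0 < share_online <= 1.0:
        raise ValueError("share_online must be in (0, 1]")
    population_rate = (
        share_online * rate_online + (1 - share_online) * rate_offline
    )
    observed_rate = rate_online  # what a social-media-only sample converges to
    return population_rate, observed_rate, observed_rate - population_rate


# Hypothetical numbers: half of the population is online, and the attribute
# is three times as common among online users as among offline people.
population, observed, bias = coverage_bias(0.5, 0.75, 0.25)
print(population, observed, bias)  # 0.5 0.75 0.25
```

The bias term does not depend on the sample size, which is why collecting more social media data does not by itself correct for the groups that are missing from it.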