Interpretable decision tree model based on C4.5 capable of seamlessly using numerical, categorical, sequential, and time series information for classification purposes.
J48SS is a decision tree learner based on WEKA's J48 (a Java implementation of C4.5). The algorithm is capable of naturally exploiting categorical and numerical attributes as well as sequential and time series data during the same execution cycle. The resulting decision tree models are intuitively interpretable, meaning that a domain expert may easily read and validate them.
Temporal data plays an important role in the extraction of information in many applications, and it comes in at least two flavours: it can be represented either by a discrete sequence of finite-domain values (e.g., a sequence of purchases) or by a real-valued time series (e.g., a stock price history). Sometimes, temporal information is complemented by other, "static" kinds of data, which can be numerical or categorical. This is the case, for example, with the medical history of a patient, which may include: (i) categorical attributes, with information about gender or smoking habits; (ii) numerical attributes, with information about age and weight; (iii) time series data, tracking blood pressure over several days; and (iv) discrete sequences, describing which symptoms the patient has experienced and which medications have been administered. Such a heterogeneous set of information may be useful, among other things, for classification purposes, such as trying to determine the disease affecting the patient.
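As a rough illustration of the four kinds of attributes listed above, a single patient instance could be laid out as follows. This is a hypothetical sketch in Python: the class name, the field names, and the values are invented for illustration and do not reflect the input format that J48SS actually expects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PatientRecord:
    """One classification instance mixing static and temporal attributes.

    All field names are hypothetical and chosen only for illustration.
    """
    # (i) categorical, static attributes
    gender: str
    smoker: str
    # (ii) numerical, static attributes
    age: float
    weight_kg: float
    # (iii) real-valued time series, e.g. one blood pressure reading per day
    blood_pressure: List[float]
    # (iv) discrete sequence of finite-domain values, in order of occurrence
    events: List[str]
    # class label the model should predict
    diagnosis: str


# An invented example instance.
record = PatientRecord(
    gender="F",
    smoker="no",
    age=54.0,
    weight_kg=68.5,
    blood_pressure=[128.0, 131.0, 140.0, 137.0, 125.0],
    events=["headache", "ibuprofen", "dizziness"],
    diagnosis="hypertension",
)
```

The point of the layout is that all four kinds of attributes sit side by side in the same instance, rather than being split across separately preprocessed datasets.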
Another use case is that of phone call classification in contact centers: a conversation may be characterised by sequential data (for instance, textual data obtained from the call recordings), one or more time series (keeping track of the volume over time), and a set of categorical or numerical attributes (reporting, e.g., information pertaining to the speakers, or the kind of call). Unfortunately, different kinds of data typically require different preprocessing techniques and classification algorithms to be managed properly, which usually means that the complexity of the analysis grows with the heterogeneity of the data. Moreover, since multiple algorithms have to be combined to produce a final classification, the final model may lack interpretability. This is a fundamental problem in domains where understanding and validating the classification process is as important as the accuracy of the classification itself, e.g., production business systems and life-critical medical applications.
J48SS relies on sequential pattern mining to extract meaningful information from discrete sequences. For time series data, it makes use of shapelets, which are extracted by means of a multi-objective evolutionary algorithm. Both tasks are performed during the learning phase of the algorithm, at each node of the decision tree.
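To make the two kinds of temporal tests more concrete, the following minimal Python sketch shows how a sequential pattern and a shapelet could each be turned into a yes/no test on an instance. It is not taken from the J48SS code base. The function names are hypothetical, and two details are assumptions, since the text above does not specify them: a pattern is matched as a subsequence in which gaps are allowed, and the distance between a time series and a shapelet is the minimum Euclidean distance over all windows of the shapelet's length. The extraction step itself (pattern mining and the evolutionary search) is not shown.

```python
import math
from typing import Sequence


def contains_pattern(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order, gaps allowed."""
    remaining = iter(sequence)
    # Each `item in remaining` consumes the iterator up to the first match,
    # so later pattern items can only match later positions.
    return all(item in remaining for item in pattern)


def shapelet_distance(series: Sequence[float], shapelet: Sequence[float]) -> float:
    """Minimum Euclidean distance between `shapelet` and any equally long window."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        dist = math.sqrt(
            sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        )
        best = min(best, dist)
    return best


def shapelet_split(series: Sequence[float], shapelet: Sequence[float],
                   threshold: float) -> bool:
    """Binary test on the shapelet distance: True if within `threshold`."""
    return shapelet_distance(series, shapelet) <= threshold


# Invented example data.
events = ["fever", "cough", "aspirin", "rash"]
has_fever_then_aspirin = contains_pattern(events, ["fever", "aspirin"])
has_aspirin_then_fever = contains_pattern(events, ["aspirin", "fever"])

readings = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
peak_distance = shapelet_distance(readings, [2.0, 3.0, 2.0])
near_peak = shapelet_split(readings, [2.0, 3.0, 2.0], threshold=0.5)
```

Both tests produce a Boolean outcome per instance, which is what would allow them to sit in the same tree as ordinary splits on categorical and numerical attributes.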
The algorithm has been evaluated on two distinct experimental tasks.
The first is in the business speech analytics setting. The starting point was the observation that conversation transcripts may be viewed as a kind of sequential data, where phrases correspond to sequences that can be managed by our algorithm. The proposed solution has been tested on a set of recorded agent-side outbound call conversation transcripts produced in the context of a wide-ranging survey campaign run by Gap S.r.l.u., an Italian business process outsourcer specialised in contact center services. The task consisted of detecting relevant phrases in such transcripts by determining the presence or absence of a predefined set of tags that carry semantic content. The ability to tag phone conversations may greatly help the company in its goal of introducing a reliable speech analytics infrastructure with a real operational impact on business processes, one that may help in the assessment of the training level of human employees or in the assignment of a quality score to the calls. The importance of analysing conversational data in contact centers is highlighted by several studies, based on the fact that the core part of the business still focuses on the management of oral interactions. Other benefits that speech analytics can deliver in such a domain include the tracking down of problematic calls and, more generally, the development of end-to-end solutions to conversation analysis.
The second experiment focused on time series, which play a major role in many domains. For instance, in economics they can be used for stock market price analysis, in healthcare they may allow one to predict the arrival rate of patients in emergency rooms, and in geophysics they convey important information about the evolution of ocean water temperature. Specifically, we evaluated the proposed algorithm against a selection of well-known UCR time series datasets, focusing on the task of time series classification: given a training dataset of labeled time series, the goal is to devise a model capable of labelling new, previously unseen instances. Several techniques have been used in the past for such a purpose, including support vector machines and neural networks. Nonetheless, whenever interpretability has to be taken into account, decision trees are still the most common choice.
In all considered cases, the experimental results allowed us to conclude that J48SS is capable of achieving competitive classification performance on both sequence and time series data. Moreover, as a further advantage over previous methods, the trees built by the proposed algorithm are easily interpretable, and powerful enough to effectively mix decision splits based on several kinds of attributes. Finally, such flexibility allows one to reduce the data preparation effort.
In addition, some exploratory results suggest that J48SS ensembles might perform better than those proposed in previous studies, while also being smaller. We are aware that some recently presented approaches achieve a higher accuracy on time series classification than our solution. However, these typically lack two distinctive features of J48SS: the ability to seamlessly handle mixed kinds of data, which, again, reduces the data preparation effort, and the intuitive interpretability of the generated models.
We conclude with a description of the archetypal kind of dataset to which J48SS should be applied. The datasets that we used for the experimental tasks did not take full advantage of the capabilities of the new algorithm, as they consisted of sequences or time series only, and all static attributes were simply derived from them. In order to fully exploit the potential of J48SS, a dataset should contain sequential and/or time series data, and it should include meaningful static (numerical or categorical) attributes that are largely independent of the temporal ones. The latter condition means that static attributes should not simply summarise information already contained in the sequences or time series, but should add information of their own. Such a dataset has proven difficult to find in the literature, perhaps because of the lack, until now, of an algorithm capable of training a meaningful model on it. Nonetheless, as mentioned in the introduction, we believe that datasets from the medical domain may have all the desired properties.