Parallel K-Means Algorithms in Java: History

K-means is a well-known clustering algorithm, often used for its simplicity and efficiency. It suffers, though, from computational problems when dealing with large datasets with many dimensions and a large number of clusters.

  • parallel algorithms
  • multi-core machines
  • K-means clustering
  • Java

1. Introduction

The K-Means algorithm [1,2] is very often used in unsupervised clustering applications such as data mining, pattern recognition, image processing, medical informatics, genome analysis and so forth. This is due to its simplicity and its linear complexity O(NKT), where N is the number of data points in the dataset, K is the number of assumed clusters and T is the number of iterations needed for convergence. Fundamental properties and limitations of K-Means have been deeply investigated by many theoretical/empirical works reported in the literature. Specific properties, though, such as ensuring accurate clustering solutions through sophisticated initialization methods [3], can be difficult to achieve in practice when large datasets are involved. More generally, dealing with a huge number of data records, possibly with many dimensions, creates computational problems for K-Means, which motivates a parallel organization/execution of the algorithm [4] either on a shared-nothing architecture, that is a distributed multi-computer context, or on a shared-memory multiprocessor. Examples of the former case include message-passing solutions, e.g., based on MPI [5], with a master/slave organization [6], or based on the functional framework MapReduce [7,8]. For the latter case, notable experiments have been conducted using OpenMP [9] or the GPU architecture [10]. Distributed solutions for K-Means can handle very large datasets, which can be decomposed and mapped onto the local memory of the various computing nodes. Shared-memory solutions, on the other hand, assume the dataset is allocated in memory, either monolithically or in sub-blocks that are operated on separately, in parallel, by multiple computing units.

2. K-Means

A dataset X={x1,x2,…,xN} with N data points (records) is considered. Each data point has D coordinates (number of features or dimensions). Data points must be partitioned into K clusters so as to ensure similarity among the points of a same cluster and dissimilarity among points belonging to different clusters. Every cluster Ck, 1≤k≤K, is represented by its centroid point ck, e.g., its mean or central point. The goal of K-Means is to assign points to clusters so as to minimize the sum of squared distances (SSD) objective function:

$$\mathrm{SSD}=\sum_{k=1}^{K}\sum_{i=1}^{n_k} d(x_i,c_k)^2$$

where nk is the number of points of X assigned to cluster Ck and d(xi,ck) denotes the Euclidean distance between xi and ck. The problem is NP-hard and, in practice, it is approximated by heuristic algorithms. The standard method of Lloyd's K-Means [14] is articulated in the iterative steps shown in Figure 1.

Figure 1. Sequential operation of K-Means.
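
For concreteness, the assignment and update steps sketched in Figure 1 can be written in plain sequential Java as follows. This is only an illustrative sketch (method names such as assign and update are hypothetical), not the code of the original paper.

// Illustrative sequential sketch of the assignment and update steps of Figure 1.
// Points and centroids are plain double[D] arrays; all names are hypothetical.
static double squaredDistance(double[] a, double[] b) {
    double s = 0.0;
    for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
    return s;
}

// Step 2: assign every point to the cluster of its nearest centroid.
static int[] assign(double[][] points, double[][] centroids) {
    int[] membership = new int[points.length];
    for (int i = 0; i < points.length; i++) {
        int best = 0; double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            double dist = squaredDistance(points[i], centroids[k]);
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        membership[i] = best;
    }
    return membership;
}

// Step 3: recompute each centroid as the mean of the points assigned to it
// (empty clusters keep their old centroid here, for simplicity).
static double[][] update(double[][] points, int[] membership, double[][] oldCentroids) {
    int K = oldCentroids.length, D = points[0].length;
    double[][] sums = new double[K][D];
    int[] counts = new int[K];
    for (int i = 0; i < points.length; i++) {
        counts[membership[i]]++;
        for (int d = 0; d < D; d++) sums[membership[i]][d] += points[i][d];
    }
    for (int k = 0; k < K; k++) {
        if (counts[k] == 0) { sums[k] = oldCentroids[k].clone(); continue; }
        for (int d = 0; d < D; d++) sums[k][d] /= counts[k];
    }
    return sums;
}
// Steps 2 and 3 are repeated until the centroids (or the memberships) stop changing.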

Several initialization methods for step 1 in Figure 1, either stochastic or deterministic (see also later in this section), have been considered and studied in the literature, see e.g., [3,15,16,17,18]; each of them influences the evolution and the accuracy of K-Means. Often, also because of the complexity of some initialization algorithms, the random method is adopted, which initializes the centroids as randomly selected points of the dataset, although this can make K-Means converge toward a local minimum.
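
A minimal sketch of the random initialization just mentioned, assuming the dataset is held as a double[N][D] array (the method name randomInit is hypothetical):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Step 1 (random method): pick K distinct points of the dataset as the initial centroids.
static double[][] randomInit(double[][] points, int K, Random rnd) {
    List<double[]> shuffled = new ArrayList<>(Arrays.asList(points));
    Collections.shuffle(shuffled, rnd);
    double[][] centroids = new double[K][];
    for (int k = 0; k < K; k++) centroids[k] = shuffled.get(k).clone();
    return centroids;
}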

3. Parallel K-Means in Java

3.1. Supporting K-Means by Streams

A first solution for hosting parallel K-Means in Java is based on the use of streams, lambda expressions and a functional programming style [12], available since Java 8. Streams are views (not copies) of collections (e.g., lists) of objects, which make it possible to express a fluent style of operations (method calls). Each operation works on a stream, transforms each object according to a lambda expression, and returns a new stream, ready to be handled by the next operation, and so forth. In a fluent code segment, only the execution of the terminal operation actually triggers the execution of the intermediate operations.

Figure 4. A map/reduce schema for K-Means based on Java streams.
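
As an illustration of the map/reduce schema of Figure 4, the assignment step can be phrased as a parallel stream pipeline: every point index is mapped to the id of its nearest centroid (a lambda expression) and the results are reduced (collected) into per-cluster groups; collect is the terminal operation that actually triggers the pipeline. The code below is only a simplified sketch, not the code of the paper; nearestCentroid reuses the hypothetical squaredDistance helper of the earlier sketch.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Map/reduce-style assignment with Java parallel streams: map each point to the id of
// its nearest centroid, then reduce (group) the points by cluster id.
static Map<Integer, List<double[]>> assignWithStreams(double[][] points, double[][] centroids) {
    return IntStream.range(0, points.length)
            .parallel()                                         // run the pipeline on multiple cores
            .boxed()
            .collect(Collectors.groupingBy(
                    i -> nearestCentroid(points[i], centroids), // map step (lambda expression)
                    Collectors.mapping(i -> points[i],          // reduce step: gather the points
                                       Collectors.toList())));  // of each cluster into a list
}

static int nearestCentroid(double[] p, double[][] centroids) {
    int best = 0; double bestDist = Double.MAX_VALUE;
    for (int k = 0; k < centroids.length; k++) {
        double dist = squaredDistance(p, centroids[k]);
        if (dist < bestDist) { bestDist = dist; best = k; }
    }
    return best;
}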

3.2. Actor-Based K-Means Using Theatre

Another solution for serial/parallel K-Means was built on top of the Theatre actor system. Theatre is both a formal modelling language [20] and an implementation framework in Java. It addresses the modelling, analysis and synthesis of parallel/distributed time-dependent systems, such as cyber-physical systems with strict timing constraints [20,21].
A key difference from the classical actor computational model [22,23] is the adoption of a (transparent) reflective control-based layer which can reason on a time notion (real-time or simulated-time) or on no time (for concurrent/parallel systems), and which regulates the ultimate delivery order of the asynchronously exchanged messages among actors, an order which in [22] is truly non-deterministic.
Theatre is characterized by a lightweight and totally lock-free architecture. The design purposely restricts the use of threads to “programming-in-the-large” units only, to be allocated on the available cores. The goal is to favor time predictability, as well as the development of high-performance parallel applications [13].

3.2.1. The Parallel Programming Model of Theatre

A system consists of a federation of computing nodes (theatres) which can be allocated to distinct cores of a multi-core machine. A theatre node is mapped onto a separate thread and is organized into three layers (see also [13]): (1) a transport layer, used for sending/receiving inter-theatre messages; (2) a control layer, which provides the basic services for the scheduling/dispatching of messages; (3) an application layer, which is a collection of local business actors.
Both intra-theatre and inter-theatre communications (message exchanges) are enabled. In addition, actors can be moved from one theatre to another for configuration/load-balancing purposes.
Within a same theatre, actors execute according to a cooperative concurrency model, ensured by message interleaving, that is, by dispatching and executing one message at a time. Actors have no internal thread; therefore, they are at rest until a message arrives. A message is the unit of scheduling and dispatching. Message processing is atomic and can be neither pre-empted nor suspended. In other terms, the messages of any actor naturally execute in mutual exclusion.
Actor executions (i.e., message processing) in distinct theatres can effectively occur in a time-overlapped way, that is, truly in parallel. Because of the lock-free schema adopted by Theatre, data shared among actors assigned to distinct theatres/cores should be avoided to prevent data inconsistencies. Sharing data among the actors of a same theatre, though, is absolutely safe due to the adopted cooperative concurrency schema.
A time server component, attached to a selected theatre, is transparently contacted (through hidden control messages [13]) by the control layers of the various computing nodes, so as to regularly update the global time notion of a Theatre system. In a pure concurrent system, the time server can be exploited to detect the termination condition of the parallel application, which occurs when all the control layers have no pending messages to process and there are no in-transit messages across theatres.
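
The cooperative, one-message-at-a-time semantics described above can be conveyed by the following deliberately simplified sketch of a theatre's control loop. All names are hypothetical; the sketch omits the transport layer, the time server and inter-theatre messages, and it does not reproduce the actual Theatre API.

import java.util.ArrayDeque;

// Hypothetical sketch: the control layer of one theatre, running on its own thread,
// dispatches the pending messages of its local actors one at a time.
interface Actor { void handle(Object msg); }

final class ControlLayerSketch {
    private static final class Pending {
        final Actor target; final Object msg;
        Pending(Actor target, Object msg) { this.target = target; this.msg = msg; }
    }

    private final ArrayDeque<Pending> mailbox = new ArrayDeque<>();

    // Asynchronous send: the message is only enqueued, the sender never blocks.
    void send(Actor target, Object msg) { mailbox.addLast(new Pending(target, msg)); }

    // Dispatch loop: one message at a time, so message processing is atomic and the
    // handlers of the local actors naturally execute in mutual exclusion.
    void controlLoop() {
        while (!mailbox.isEmpty()) {
            Pending p = mailbox.removeFirst();
            p.target.handle(p.msg);
        }
        // In a complete system, global termination (no pending and no in-transit
        // messages in any theatre) would be detected with the help of the time server.
    }
}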

3.2.2. Mapping Parallel K-Means on Theatre

The simplified UML diagram of Figure 5 shows some of the Java classes developed to support parallel (but also serial) K-Means using Theatre.
Figure 5. Supporting classes for Parallel K-Means based on Theatre.
As indicated in Figure 5, and as also shown by the message exchanges abstracted in Figure 6, the Theatre-based K-Means adopts a master/worker organization (see later in this section for further details).
Figure 6. Master/worker organization of Theatre-based K-means.
 
Figure 7. Configuration program for parallel K-Means based on Theatre.
Figure 8. The Worker actor class.
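
Since the Worker class of Figure 8 is not reproduced in this excerpt, the following hypothetical sketch only conveys the per-iteration work abstracted in Figure 6: each worker owns a slice of the dataset, computes partial per-cluster sums and counts for the current centroids, and returns them to the master, which adds up the partial results of all workers and divides each sum by its count to obtain the new centroids. Class and method names are illustrative and do not correspond to the actual Theatre classes.

// Hypothetical worker-side computation for the master/worker schema of Figure 6.
final class WorkerSketch {
    private final double[][] slice;                 // this worker's portion of the dataset

    WorkerSketch(double[][] slice) { this.slice = slice; }

    // Reacts to a (conceptual) "classify" message carrying the current centroids and
    // produces the partial result to be sent back to the master.
    PartialResult classify(double[][] centroids) {
        int K = centroids.length, D = centroids[0].length;
        double[][] sums = new double[K][D];
        long[] counts = new long[K];
        for (double[] p : slice) {
            int k = nearest(p, centroids);
            counts[k]++;
            for (int d = 0; d < D; d++) sums[k][d] += p[d];
        }
        return new PartialResult(sums, counts);
    }

    private static int nearest(double[] p, double[][] centroids) {
        int best = 0; double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            double s = 0.0;
            for (int d = 0; d < p.length; d++) { double diff = p[d] - centroids[k][d]; s += diff * diff; }
            if (s < bestDist) { bestDist = s; best = k; }
        }
        return best;
    }

    // Partial sums/counts of one worker; the master accumulates these over all workers
    // and divides each accumulated sum by its total count to obtain the new centroids.
    static final class PartialResult {
        final double[][] sums; final long[] counts;
        PartialResult(double[][] sums, long[] counts) { this.sums = sums; this.counts = counts; }
    }
}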

This entry is adapted from the peer-reviewed paper 10.3390/a15040117
