StreamKM++

StreamKM++

StreamKM++ Clustering

Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. StreamKM++: A Clustering Algorithm for Data Streams. In Proceedings of the 12th Workshop on Algorithm Engineering and Experiments (ALENEX 2010)

General Description

StreamKM++ computes a small weighted sample of the data stream,called the coreset of the data stream. A new data structure called coreset tree is developed in order to significantly speed up the time necessary for sampling non-uniformly during the coreset construction. After the coreset is extracted from the data stream, a weighted k-means algorithm is applied on the coreset to get the final clusters for the original stream data.

Implementation

In StreamDM, three classes deal with the implementation of the StreamKM++ algorithm: StreamKM, BucketManager, and TreeCoreset.

BucketManager and TreeCoreset are the data structures to deal with the coreset tree construction process. For the TreeCoreset class no parameters are required; its behaviour is controlled by the BucketMaanger. The options for BucketManager are:

  • Stream data window size parameter (-n), which controls the stream data size (100000 by default), this value is inherited from the parameters of class StreamKM;
  • Coreset size parameter (-maxsize) which controls the number of examples in the coreset for the stream data (10000 by default).

StreamKM implements the main algorithm, and is controlled by the following parameters:

  • Number of clusters parameter (-k), which controls how many clusters are output by the offline k-means algorithm (10 by default);
  • k-means iterations parameter (-i), which controls for how many iterations the k-means algorithm iterates (1000 by default);
  • Coreset size parameter(-s), which controls the number of examples in the coreset for the StreamKM++ algorithm;
  • Windows size parameter(-w), which controls how many examples in the stream data are used for coreset extraction for the StreamKM++ algorithm .