Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. StreamKM++: A Clustering Algorithm for Data Streams. In Proceedings of the 12th Workshop on Algorithm Engineering and Experiments (ALENEX 2010)
StreamKM++ computes a small weighted sample of the data stream,called the coreset of the data stream. A new data structure called coreset tree is developed in order to significantly speed up the time necessary for sampling non-uniformly during the coreset construction. After the coreset is extracted from the data stream, a weighted k-means algorithm is applied on the coreset to get the final clusters for the original stream data.
In StreamDM, three classes deal with the implementation of the StreamKM++
TreeCoreset are the data structures to deal with the
coreset tree construction process. For the
TreeCoreset class no parameters are
required; its behaviour is controlled by the
BucketMaanger. The options for
- Stream data window size parameter (-n), which controls the stream data size
(100000 by default), this value is inherited from the parameters of class
- Coreset size parameter (-maxsize) which controls the number of examples in the coreset for the stream data (10000 by default).
StreamKM implements the main algorithm, and is controlled by the following
- Number of clusters parameter (-k), which controls how many clusters are output by the offline k-means algorithm (10 by default);
- k-means iterations parameter (-i), which controls for how many iterations the k-means algorithm iterates (1000 by default);
- Coreset size parameter(-s), which controls the number of examples in the coreset for the StreamKM++ algorithm;
- Windows size parameter(-w), which controls how many examples in the stream data are used for coreset extraction for the StreamKM++ algorithm .