CluStream
CluStream Clustering
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu. A Framework for Clustering Evolving Data Streams. VLDB 2003: 81-92
General Description
The CluStream method is a method of clustering data streams, based on the concept of microclusters. Microclusters are data structures which summarize a set of instances from the stream, and is composed of a set of statistics which are easily updated and allow fast analysis.
CluStream has two phases. In the online phase, a set of microclusters are kept in main memory; each instance coming from the input stream can then be either appended to an existing microcluster or created as a new microcluster. Space for the new microcluster is created either by deleting a microcluster (by analyzing its expiration timestamp) or by merging the two closest microclusters. The offline phase will apply a weighted k-means algorithm on the microclusters, to obtain the final clusters from the stream.
Implementation
In StreamDM, two classes deal with the implementation of the CluStream
algorithm: MicroClusters
and Clustream
.
MicroClusters
is the main data structure keeping the online microclusters
updated. It is controlled by the following options:
- Horizon parameter (-h), which controls the size of horizon window in seconds (1000 by default), which controls which microclusters become expired;
- Cluster radius multiplier parameter (-r), which controls how large the radius of a microcluster is to allow new instances to be appended (2 by default); and
- m-value parameter (-m) which controls the number of standard deviations from the mean timestamp in a microcluster, for expiry analysis (100 by default).
Clustream
implements the main algorithm, and is controlled by the following
parameters:
- Initial buffer parameter (-b), which controls the initial buffer from which microclusters are created (1000 by default);
- Number of microclusters parameter (-m), which controls how many microclusters are kept (100 by default);
- Number of clusters parameter (-k), which controls how many clusters are output by the offline k-means algorithm (10 by default); and
- k-means iterations parameter (-i), which controls for how many iterations the k-means algorithm iterates (1000 by default).