Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu. A Framework for Clustering Evolving Data Streams. VLDB 2003: 81-92
The CluStream method is a method of clustering data streams, based on the concept of microclusters. Microclusters are data structures which summarize a set of instances from the stream, and is composed of a set of statistics which are easily updated and allow fast analysis.
CluStream has two phases. In the online phase, a set of microclusters are kept in main memory; each instance coming from the input stream can then be either appended to an existing microcluster or created as a new microcluster. Space for the new microcluster is created either by deleting a microcluster (by analyzing its expiration timestamp) or by merging the two closest microclusters. The offline phase will apply a weighted k-means algorithm on the microclusters, to obtain the final clusters from the stream.
In StreamDM, two classes deal with the implementation of the CluStream
MicroClusters is the main data structure keeping the online microclusters
updated. It is controlled by the following options:
- Horizon parameter (-h), which controls the size of horizon window in seconds (1000 by default), which controls which microclusters become expired;
- Cluster radius multiplier parameter (-r), which controls how large the radius of a microcluster is to allow new instances to be appended (2 by default); and
- m-value parameter (-m) which controls the number of standard deviations from the mean timestamp in a microcluster, for expiry analysis (100 by default).
Clustream implements the main algorithm, and is controlled by the following
- Initial buffer parameter (-b), which controls the initial buffer from which microclusters are created (1000 by default);
- Number of microclusters parameter (-m), which controls how many microclusters are kept (100 by default);
- Number of clusters parameter (-k), which controls how many clusters are output by the offline k-means algorithm (10 by default); and
- k-means iterations parameter (-i), which controls for how many iterations the k-means algorithm iterates (1000 by default).