Big Data Stream Learning
Big Data stream learning is more challenging than batch or offline learning, since the data distribution may change over the lifetime of the stream. Moreover, each example arriving on a stream can be processed only once, or must be summarized with a small memory footprint, and the learning algorithms must be very efficient.
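The single-pass, small-memory constraint can be illustrated with Welford's online algorithm, which summarizes a stream of values into a running mean and variance using constant memory; the class and method names below are illustrative, not part of StreamDM.

```scala
// Single-pass summary of a numeric stream: each example is observed
// exactly once, and only O(1) state (count, mean, M2) is retained.
final class RunningStats {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0

  // Incorporate one example from the stream (Welford's update).
  def observe(x: Double): Unit = {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
  }

  def count: Long = n
  def average: Double = mean
  // Unbiased sample variance of everything seen so far.
  def variance: Double = if (n > 1) m2 / (n - 1) else 0.0
}
```

A stream learner applies the same discipline: update the model from each example, then discard the example.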
Spark Streaming is an extension of the core Spark API that enables stream processing from a variety of sources. Spark is an extensible and programmable framework for massive distributed processing of datasets, which it represents as Resilient Distributed Datasets (RDDs). Spark Streaming receives input data streams and divides the data into batches, which are then processed by the Spark engine to generate the results.
Spark Streaming data is organized into DStreams (discretized streams), each represented internally as a sequence of RDDs.
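As a rough sketch of this model, the skeleton below uses the standard Spark Streaming API to read text from a socket and process each micro-batch as an RDD; the master setting, hostname, port, and batch interval are placeholder values, and running it requires a Spark runtime and a live socket source.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    // Two local threads: one receives the stream, one processes it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
    // Group incoming data into 1-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // The DStream of lines is internally a sequence of RDDs,
    // one RDD per micro-batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      println(s"examples in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```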
In this first release of StreamDM, we have implemented:
We have implemented the following data generators:
We have also implemented SampleDataWriter, which can invoke the data generators to create sample data for simulation or testing.
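To give a feel for how a generator and writer fit together, here is a hypothetical sketch in plain Scala; the class names, constructor parameters, and output format below are illustrative assumptions and do not reflect the actual StreamDM SampleDataWriter API.

```scala
import java.io.PrintWriter
import scala.util.Random

// Hypothetical generator producing labeled examples as CSV lines
// ("label,f1,f2,..."), loosely in the spirit of a hyperplane generator.
class ToyHyperplaneGenerator(numFeatures: Int, seed: Long) {
  private val rnd = new Random(seed)
  private val weights = Array.fill(numFeatures)(rnd.nextDouble())

  def next(): String = {
    val features = Array.fill(numFeatures)(rnd.nextDouble())
    val score = weights.zip(features).map { case (w, f) => w * f }.sum
    // Label by which side of the hyperplane the example falls on.
    val label = if (score >= weights.sum / 2) 1 else 0
    (label.toString +: features.map(_.toString)).mkString(",")
  }
}

// Hypothetical writer: dumps n generated examples to a file.
object ToySampleWriter {
  def write(path: String, gen: ToyHyperplaneGenerator, n: Int): Unit = {
    val out = new PrintWriter(path)
    try (1 to n).foreach(_ => out.println(gen.next()))
    finally out.close()
  }
}
```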
In the next releases we plan to add:
- Classification: Random Forests
- Regression: Hoeffding Regression Tree, Bagging, Random Forests
- Clustering: Clustree, DenStream
- Frequent Itemset Miner: IncMine, IncSecMine
For a quick introduction to running StreamDM, refer to the Getting Started document. The StreamDM Programming Guide presents a detailed view of StreamDM. The full API documentation can be consulted here.