StreamDM Quick Start Guide

This document provides a quick entry point for users who want to run a StreamDM task. We describe how to compile StreamDM and how to run an example task.

The basic requirement for running the example in this document is a Spark 1.4 installation. The example works best on a Linux/Unix machine.

Compiling The Code

In the main folder of StreamDM, the following command generates the package needed to run StreamDM:

sbt package
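
The packaged jar is written to sbt's default output directory, typically under target/scala-<Scala version>/; the exact file name depends on the build configuration, so check the build output, for example:

  ls target/scala-*/*.jar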

Running The Task

The task run in this example is EvaluatePrequential. By default, the task connects to a socket open on localhost at port 9999, which sends dense instances as a stream. A linear binary classifier is then trained using StochasticGradientDescent, and the predictions are evaluated by outputting the confusion matrix.
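
If no process is already serving instances on port 9999, one simple way to provide a test stream is to pipe a text file of instances through netcat; this is only a sketch, where ../data/mydata is assumed to be a file in the dense instance format expected by the task, and the exact netcat flags may vary between netcat variants:

  nc -lk 9999 < ../data/mydata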

The example can be run from the command line as follows:

  • In the terminal, use the provided spark script to run the task (after modifying the SPARK_HOME variable with the folder of your Spark installation):

    cd scripts
    ./spark.sh

  • It is possible to add command-line options to specify the task, learner, and evaluation parameters:

    cd scripts
    ./spark.sh "EvaluatePrequential -l (SGDLearner -l 0.01 -o LogisticLoss -r ZeroRegularizer) -s (FileReader -k 100 -d 60 -f ../data/mydata)"

  • [Optional] It is advisable to separate the standard output and the error output, for better readability:

    cd scripts
    ./spark.sh 1>results.txt 2>debug.log

The standard output will contain a confusion matrix aggregating the prediction results for every Spark RDD in the stream.
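
With the outputs separated as in the optional step above, the aggregated results can be followed as they are written, for example:

  tail -f results.txt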

Sample data can be generated with the SampleDataWriter task, which supports four data generators:

  • In the terminal, use the provided sample data generator script to generate sample data (after modifying the SPARK_HOME variable with the folder of your Spark installation and setting the correct jar files):

    cd scripts
    ./generateData.sh "FileWriter -n 1000 -f ../sampledata/mysampledata -g (HyperPlaneGenerator -k 100 -f 10)"
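
The generated file can then be used as input to the FileReader stream reader from the earlier example; the parameter values below are only illustrative:

  cd scripts
  ./spark.sh "EvaluatePrequential -l (SGDLearner -l 0.01 -o LogisticLoss -r ZeroRegularizer) -s (FileReader -k 100 -d 60 -f ../sampledata/mysampledata)"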