StreamDM Quick Start Guide
This document is a quick entry point for users who want to run a StreamDM task. It describes how to compile StreamDM and how to run an example task.
The basic requirement for running the example in this document is an installation of Spark 1.4. The example works best on a Linux/Unix machine.
Compiling The Code
In the main folder of StreamDM, the following command generates the package needed to run StreamDM:
sbt package
Running The Task
The task run in this example is EvaluatePrequential. By default, the task connects to a socket open on localhost at port 9999, which sends dense instances as a stream. A linear binary classifier is then trained using StochasticGradientDescent, and the predictions are evaluated by outputting the confusion matrix.
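To make the idea of prequential ("test-then-train") evaluation concrete, here is a minimal sketch in shell and awk. It is not StreamDM's implementation: the function name prequential_sgd, the input format (one instance per line as "<label> <x1>,<x2>,...", with label 0 or 1), and the clamping of the score are all assumptions made for illustration. Each instance is first used to test the current model, then to update it with one SGD step on the logistic loss, and the counts are reported as a confusion matrix.

```shell
# Prequential evaluation sketch: test on each instance first, then train on it.
# Assumed input format (hypothetical, not StreamDM's documented format):
#   <label> <x1>,<x2>,...   with label 0 or 1
# Optional first argument: learning rate (defaults to 0.01).
prequential_sgd() {
  awk -v lr="${1:-0.01}" '
  {
    label = $1 + 0
    n = split($2, x, ",")

    # Test step: predict with the current weights before using the label.
    z = b
    for (i = 1; i <= n; i++) z += w[i] * x[i]
    # Clamp the score so that exp() stays in range.
    if (z > 30) z = 30
    if (z < -30) z = -30
    p = 1 / (1 + exp(-z))
    pred = (p >= 0.5) ? 1 : 0
    count[label "," pred]++

    # Train step: one SGD update on the logistic loss, gradient (p - label) * x.
    g = p - label
    for (i = 1; i <= n; i++) w[i] -= lr * g * x[i]
    b -= lr * g
  }
  END {
    # Confusion matrix cells, keyed as "trueLabel,predictedLabel".
    printf "TP=%d FN=%d FP=%d TN=%d\n",
      count["1,1"], count["1,0"], count["0,1"], count["0,0"]
  }'
}

# Usage: pipe instances into the function, optionally passing a learning rate.
#   printf '1 1\n0 -1\n1 1\n0 -1\n' | prequential_sgd 1
```

Because every prediction is made before the model sees the label, the confusion matrix reflects performance on unseen data without needing a separate test set.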
The example can be run from the command line:
- In the terminal, use the provided spark script to run the task (after modifying the SPARK_HOME variable with the folder of your Spark installation):
cd scripts
./spark.sh
- It is possible to add command-line options to specify task, learner, and evaluation parameters:
cd scripts
./spark.sh "EvaluatePrequential -l (SGDLearner -l 0.01 -o LogisticLoss -r ZeroRegularizer) -s (FileReader -k 100 -d 60 -f ../data/mydata)"
- [Optional] It is advisable to separate the standard and the error output, for better readability:
cd scripts
./spark.sh 1>results.txt 2>debug.log
The standard output will contain a confusion matrix aggregating the prediction results for every Spark RDD in the stream.
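When passing task options, the whole option string has to reach the script as a single argument, because it contains spaces and parentheses that the shell would otherwise interpret. The sketch below builds that string from two values. The function name build_task and its two parameters are hypothetical helpers; the option letters are taken from the example in this guide, written with plain ASCII hyphens.

```shell
# Build the EvaluatePrequential option string from a learning rate and an input file.
# build_task is a hypothetical helper; the options mirror the example in this guide.
build_task() {
  rate="$1"
  file="$2"
  printf 'EvaluatePrequential -l (SGDLearner -l %s -o LogisticLoss -r ZeroRegularizer) -s (FileReader -k 100 -d 60 -f %s)' "$rate" "$file"
}

# Usage (the double quotes keep spaces and parentheses inside one argument):
#   cd scripts
#   ./spark.sh "$(build_task 0.01 ../data/mydata)" 1>results.txt 2>debug.log
```

Without the surrounding double quotes, the shell would split the string on spaces and try to interpret the parentheses itself.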
Sample data can be generated with SampleDataWriter, which provides four data generators:
- In the terminal, use the provided sample data generator script to generate sample data (after modifying the SPARK_HOME variable with the folder of your Spark installation and setting the correct jar files):
cd scripts
./generateData.sh "FileWriter -n 1000 -f ../sampledata/mysampledata -g (HyperPlaneGenerator -k 100 -f 10)"
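Both scripts depend on SPARK_HOME pointing at a working Spark installation, so a quick check of the path before editing it into the scripts can save a confusing failure later. The sketch below is one possible check. The function name check_spark_home is hypothetical, and it assumes that a usable installation contains an executable bin/spark-submit.

```shell
# Pre-flight check: succeeds only if the given directory looks like a Spark installation.
# Assumption: a usable installation ships an executable bin/spark-submit.
check_spark_home() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "No Spark folder given" >&2
    return 1
  fi
  if [ ! -x "$dir/bin/spark-submit" ]; then
    echo "No executable bin/spark-submit under: $dir" >&2
    return 1
  fi
  echo "Using Spark at: $dir"
}

# Usage: pass the folder you intend to use as SPARK_HOME.
#   check_spark_home /path/to/spark && echo "safe to put this path in the scripts"
```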