SADIT stands for Systematic Anomaly Detection of Internet Traffic. The motivation of SADIT is to make it easy to compare and validate Internet anomaly detection algorithms. It addresses this problem from the following two perspectives:
- Facilitating the data generation
- Providing a standard library of anomaly detection algorithms
If you are a researcher interested in Internet Anomaly Detection, we strongly encourage you to implement your algorithms following the APIs and data format of SADIT so that you can easily compare your methods with the existing algorithms in SADIT. Your help will be highly appreciated if you can contribute your own algorithm(s) to the algorithms library of SADIT. Feel free to contact us if you have any questions.
If you are interested in our recent publications (see below) on network anomaly detection and want to use them as references, please cite the repository SADIT/GAD together with:
- Wrap the Detector Module as Sub-Package GAD (General Anomaly Detector): If you are only interested in detection, rather than in both simulated network traffic data generation and anomaly detection, then you may only need to install GAD.
- More Flexible Configuration Script: You can set
parameters in a separate configuration script and specify it with
option.
- Generation of Traffic for Dynamic Network: The distribution of flow traffic and the arrival rate can change with time.
- Robust Anomaly Detection Method: A new anomaly detection algorithm that can work robustly in dynamic network environment has been added.
- Faster Data Access Speed: Using
to store data instead of a list of lists, which accelerates data processing significantly.
- Separable Check Data and Reference Data Files
- Better Structure of Classes
Version 1.0 is the result of a major refactoring of Version 0.0. The refactoring makes the code more scalable and less error-prone.
- Object-Oriented Programming Paradigm: The Configure module and Detector module have been rewritten in an object-oriented style. In Version 0.0, all modules depended on the global settings file, which made the code more vulnerable to bugs. In this version, only a few scripts depend on it. Classes are widely used to reduce the need to pass parameters around; where parameter passing is required, well-defined structures are used.
- More Flexible Experiments: A new folder holds the different experiments. You can write your own experiment scripts and put them in this folder.
- Better Sensitivity Analysis: In Version 0.0, sensitivity analysis was done by changing the global settings file and rerunning the simulation. Since that file is an ordinary Python module, changing it during a run is not a good idea. In this version, a special experiment is designed to support sensitivity analysis.
SADIT consists of two parts. The first part is a collection of anomaly detection algorithms. The second part is a labeled flow record generator. The following sections will describe them accordingly.
All the detection algorithms are located in the ROOT/Detector/gad/Detector folder.
- contains two SVM-based anomaly detection algorithms:
- SVM Temporal Detector
- SVM Flow-by-Flow Detector
- contains two anomaly detection algorithms (model-free and model-based) based on the Large Deviation Theory.
- contains an algorithm that works robustly under dynamic network environment.
Labeled Flow Records Generator consists of a Configurer and a Simulator. The Simulator part is essentially a revised fs-simulator, developed by researchers at UW Madison. The Configurer first generates a flow specification (DOT format) file with certain types of anomalies, and then the Simulator will generate flow records with labels.
The Configurer generates the corresponding DOT file according to descriptions of user behaviour. The important concepts in Configurer are as follows:
Generator: Descriptions of a certain type of flow traffic. For example, Harpoon generator represents harpoon flows.
Behaviour: Descriptions of temporal pattern. There are three typical types of behaviours:
- Normal behaviour is described by start time and duration.
- I.I.D. behaviour has a list of possible states, but one state will be selected as the current state every t seconds according to a certain probability distribution.
- Markovian behaviour assumes that states at different times are not independent and identically distributed, but form a Markov process.
Modulator: Combination of Behaviour and Generator; i.e., descriptions of generator behaviours. There are three types of modulators, corresponding to the three behaviours described above.
Node: Host in the network, with attributes.
Edge: Channel connecting two network nodes, with attributes.
Network: A collection of network nodes and edges.
Anomaly: Descriptions of anomalies. When an anomaly is injected into the network, some attributes in the network (Node, Edge) will be changed.
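The difference between the I.I.D. and Markovian behaviours described above can be illustrated with a small sketch. This is not the Configurer's code: the function names, the state names, and the probabilities are all made up for illustration, and each list element stands for the state chosen for one interval of t seconds.

```python
import random


def iid_states(states, probs, n_steps, rng):
    """I.I.D. behaviour: every step draws a state independently from `probs`."""
    return [rng.choices(states, weights=probs, k=1)[0] for _ in range(n_steps)]


def markov_states(states, transition, n_steps, rng, start=0):
    """Markovian behaviour: the next state depends only on the current one.

    transition[i][j] is the probability of moving from state i to state j.
    """
    current = start
    path = []
    for _ in range(n_steps):
        path.append(states[current])
        current = rng.choices(range(len(states)), weights=transition[current], k=1)[0]
    return path


# Hypothetical two-state example; the numbers are illustrative only.
rng = random.Random(0)
states = ["normal", "heavy"]
iid_path = iid_states(states, [0.9, 0.1], 10, rng)
markov_path = markov_states(states, [[0.95, 0.05], [0.30, 0.70]], 10, rng)
```

In the I.I.D. case the chance of being in the "heavy" state is the same at every step, while in the Markovian case a "heavy" step makes another "heavy" step more likely.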
The Simulator is essentially a revised version of fs-simulator. We have added support for exporting anomalous flows (with label information).
SADIT can be installed on Linux, Mac OS X, and Windows (through Cygwin) with Python 2.7. However, we strongly recommend a Debian-based OS, e.g., Ubuntu 12.04, 14.04, or 16.04, for which we have prepared a one-command installation script. We also recommend using Anaconda2 as the Python environment, since conda makes it easy to manage external packages.
To be specific, if you are working on Ubuntu, then proceed as follows:
- Change the working directory to where you want to install SADIT, create a new folder
, and then type:
$ git clone --recursive sadit/
- Change the working directory to be
, and then type:
sadit/install$ sudo sh
For general installation instructions, see the Installation section of the older version of README.
NOTE: We have encountered cases where a user still cannot run the sim experiments after completing Step 2 above. A possible cause is that one or more packages that fs-simulator depends on are still missing. In such cases, you might need to add them manually to the Simulator folder. As a straightforward alternative, you could replace your Simulator folder with the Simulator folder contained in this release.
First, you need to specify the environment variable SADIT_ROOT in Bash. Assuming again that you are working on Ubuntu, change the working directory to your home folder, open the file .profile there, and add the following line:
export SADIT_ROOT='<path_of_your_sadit_installation>'
For instance, my sadit installation path is /home/jzh/Dropbox/Git/sadit, so at the end of the file .profile I added
export SADIT_ROOT='/home/jzh/Dropbox/Git/sadit'
Alternatively, you can enter the same command (export SADIT_ROOT=<path_of_your_sadit_installation>) in the terminal each time you are asked to.
Now you are ready to use SADIT. From now on, assume sadit is your working directory.
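If you write your own scripts around SADIT, it can help to fail early when SADIT_ROOT is missing. The following is a minimal sketch of such a check; the helper name resolve_sadit_root is hypothetical and is not part of SADIT.

```python
import os


def resolve_sadit_root(environ=None):
    """Return the absolute SADIT root, or raise if SADIT_ROOT is unset.

    `environ` defaults to os.environ; passing a plain dict makes testing easy.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("SADIT_ROOT", "").strip()
    if not root:
        raise RuntimeError(
            "SADIT_ROOT is not set; export it in ~/.profile or in the current shell"
        )
    return os.path.abspath(os.path.expanduser(root))
```

Calling resolve_sadit_root() at the top of a script gives a clear error message instead of a confusing failure later on.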
To get the general help message, type:
sadit$ ./cmdsadit -h
Then, you will see the following:
usage: cmdsadit [--profile PROFILE] [-h] [experiment]
positional arguments:
experiment type ./cmdsadit <exper> -h for help of an experiment;
available experiments are [botnet_detection_eval | detect
| detectbatch | detectcompare | detectrealtime | eval |
guitoposim | sim | timebased_botnet_detection_eval]
optional arguments:
--profile PROFILE profile the program
-h, --help print help message and exit
The positional argument specifies the experiment you want to execute. An experiment is a subcommand that provides a specific piece of functionality.
Some of the available experiments are explained as follows:
detect: Run detection on the flow records data specified by a command-line option.
detectbatch: Run detection algorithms with all combinations of parameters and output the results to a folder; helps to select the optimal parameters.
detectcompare: Run several detection algorithms and save the intermediate results; can also load results computed before and show comparison figure.
eval: Evaluation of the detection algorithm (calculate fpr, fnr, and plot the ROC curve).
guitoposim: Simulate using network topology created by GUI topology editor.
sim: Simulate and generate flow records.
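For readers unfamiliar with the quantities that the eval experiment reports, here is a minimal sketch of how a false positive rate (fpr), a false negative rate (fnr), and ROC points can be computed from ground-truth labels and detector scores. It illustrates the definitions only and is not SADIT's implementation; the function names and the sample data are made up.

```python
def confusion_rates(labels, alarms):
    """Return (fpr, fnr) for boolean ground-truth labels and detector alarms."""
    fp = sum(1 for y, a in zip(labels, alarms) if not y and a)
    fn = sum(1 for y, a in zip(labels, alarms) if y and not a)
    negatives = sum(1 for y in labels if not y)
    positives = len(labels) - negatives
    fpr = fp / float(negatives) if negatives else 0.0
    fnr = fn / float(positives) if positives else 0.0
    return fpr, fnr


def roc_points(labels, scores, thresholds):
    """One (fpr, tpr) point per threshold; an alarm is raised when score > threshold."""
    points = []
    for th in thresholds:
        fpr, fnr = confusion_rates(labels, [s > th for s in scores])
        points.append((fpr, 1.0 - fnr))
    return points


# Illustrative data: True marks a window that really contains an anomaly.
labels = [False, False, True, True, False]
scores = [0.1, 0.4, 0.9, 0.3, 0.2]
rates = confusion_rates(labels, [s > 0.35 for s in scores])
curve = roc_points(labels, scores, [0.0, 0.35, 1.0])
```

Sweeping the threshold from high to low traces the ROC curve from (0, 0) towards (1, 1).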
To see the help message of a method for an experiment, type:
sadit$ ./cmdsadit <exper> -m <method> -h
For instance, if you type:
sadit$ ./cmdsadit detect -m mfmb -h
then you will see the following help message:
usage: cmdsadit [-h] [-c CONFIG] [--logging LOGGING] [-d DATA] [-m METHOD]
[--help_method HELP_METHOD] [--data_type DATA_TYPE]
[--feature_option FEATURE_OPTION]
[--export_flows EXPORT_FLOWS] [--pic_name PIC_NAME]
[--pic_show] [--csv CSV]
optional arguments:
-h, --help print help message
-c CONFIG, --config CONFIG
--logging LOGGING logging level. See
-d DATA, --data DATA --data [filename] will simply detect the flow file,
simulator will not run in this case
-m METHOD, --method METHOD
--method [method] will specify the method to use.
Avaliable options are: ['gen_fb_mb': FBAnoDetector
model free and model based together, will be faster
then run model free | 'robust': RobustDetector Robust
Detector is designed for dynamic network environment |
'2w': TwoWindowAnoDetector Two Window Stochastic
Anomaly Detector. | 'speriod': PeriodStaticDetector
Reference Empirical Measure is calculated by
periodically selection. | 'mb': ModelBaseAnoDetector
Model based approach, use Markovian Assumption |
'gen_fb_mf': FBAnoDetector model free and model based
together, will be faster then run model free |
'two_win': TwoWindowAnoDetector Two Window Stochastic
Anomaly Detector. | 'mf': ModelFreeAnoDetector Model
Free approach, use I.I.D Assumption | 'mfmb':
FBAnoDetector model free and model based together,
will be faster then run model free | 'period':
PeriodStoDetector Stochastic Detector Designed to
Detect Anomaly when the]. If you want to compare the
results of several methods, simple use / as seperator,
for example [gen_fb_mb/robust/2w/speriod/mb/gen_fb_mf/
--help_method HELP_METHOD
print the detailed help message for a method.
Avaliable method [gen_fb_mb | robust | 2w | speriod |
mb | gen_fb_mf | two_win | mf | mfmb | period]
--data_type DATA_TYPE
--specify the type of the data you use, the availiable
option are: ['fs': MEM_FS Data generated by `fs-
simulator | 'xflow': MEM_Xflow Data generated by xflow
tool. | 'pt': PT_Data Pytables format. (HDF5 format).
| 'pcap2netflow': MEM_Pcap2netflow Data generated
pcap2netflow, (the | 'Sperotto': SperottoIPOM Data
File wrapper for SperottoIPOM2009 format. | 'csv':
CSVFile | 'flow_exporter': MEM_FlowExporter Data
generated FlowExporter. It is a simple tool to convert
pcap to]
--feature_option FEATURE_OPTION
specify the feature option. feature option is a
dictionary describing the quantization level for each
feature. You need at least specify 'cluster' and
'dist_to_center'. Note that, the value of 'cluster' is
the cluster number. The avaliability of other features
depend on the data handler.
--export_flows EXPORT_FLOWS
specify the file name of exported abnormal flows.
Default is not export
--pic_name PIC_NAME picture name for the detection result
--pic_show whether to show the picture after finishing running
--csv CSV the path of the file to save plots a text output
usage: cmdsadit [-h] [--interval INTERVAL] [--win_size WIN_SIZE]
[--win_type WIN_TYPE] [--max_detect_num MAX_DETECT_NUM]
[--normal_rg NORMAL_RG] [--hoeff_far HOEFF_FAR]
[--entropy_th ENTROPY_TH] [--enable_sanov] [--lw LW]
optional arguments:
-h, --help show this help message and exit
--interval INTERVAL interval between two consequent detection
--win_size WIN_SIZE window_size
--win_type WIN_TYPE window type 'flow'|'time'
--max_detect_num MAX_DETECT_NUM
max detection number, useful for debug
--normal_rg NORMAL_RG
normal range, when it is none, use the whole data as
the norminal data set
--hoeff_far HOEFF_FAR
false alarm rate for hoeffding rule, if this parameter
is set while entropy_th parameter is not set, will
calculate threshold according to hoeffding rule.
Increase hoeff_far will decrease threshold
--entropy_th ENTROPY_TH
entropy threshold to determine the anomaly, has higher
priority than hoeff_far
--enable_sanov whether or not to use Sanov's theorem to estimate the
--lw LW line width of the plot
Whenever you are not sure which options you can set, add -h to the end of the command and execute it; the corresponding help message will be printed.
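To give some intuition for the hoeff_far and entropy_th options listed in the help message above, the sketch below compares the empirical distribution of a window of quantized flows with a reference distribution using relative entropy, and raises an alarm when the divergence exceeds a threshold. The threshold formula -log(far)/n is the standard leading-order large-deviations approximation and is an assumption here; SADIT's exact rule may include additional terms. All function names are hypothetical.

```python
import math
from collections import Counter


def empirical_measure(symbols, alphabet):
    """Fraction of the window occupied by each quantized symbol."""
    counts = Counter(symbols)
    n = float(len(symbols))
    return [counts[a] / n for a in alphabet]


def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats; eps guards against q_i = 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def hoeffding_threshold(false_alarm_rate, window_size):
    """Leading-order threshold -log(far) / n (assumed form, see the text above)."""
    return -math.log(false_alarm_rate) / window_size


# Illustrative data: symbols are quantization levels of some flow feature.
alphabet = [0, 1, 2]
reference = empirical_measure([0, 0, 1, 0, 2, 0, 1, 0], alphabet)
window = [2, 2, 1, 2, 2, 2, 0, 2]
divergence = relative_entropy(empirical_measure(window, alphabet), reference)
is_anomalous = divergence > hoeffding_threshold(0.1, len(window))
```

With this form, increasing the allowed false alarm rate lowers the threshold, which matches the behaviour described for hoeff_far in the help message.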
Example Commands:
sadit$ ./cmdsadit sim -c ./Example/
sadit$ ./cmdsadit sim -c ./Example/
Example commands:
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --method='mfmb' --pic_show
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --method='mfmb' --pic_show
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --method='mfmb' --pic_show --lw 3
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --m mb --pic_show --lw 1
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --m mf --pic_show --hoeff_far 0.5
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --m mf --pic_show --hoeff_far 0.1
./cmdsadit detect -c ./Example/ -d ./test/n0_flow.txt --m mfmb --pic_show --hoeff_far 0.6
NOTE: Before running these commands, you may need to change the ROOT variable accordingly in the configuration files. You may also need to check the existence and the actual path of the data file to be imported.
Use the Labeled Flow Records Generated by fs-simulator
The generated flows will be saved in the ROOT/Simulator folder. Flow file names end with _flow.txt; for example, n0_flow.txt contains the network flows through node 0. Files starting with abnormal_ contain the corresponding exported abnormal flows.
A typical line is:
text-export n0 1348412129.925416 1348412129.925416 1348412130.070733> tcp 0x0 n1 5 4215 FSA
Line format:
prefix node-name time flow_start_time flow_end_time src_ip:src_port->dst_ip:dst_port protocol payload destination-name flow-size(in packets) flow-size(in bytes) protocol-flags
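The line format above can be parsed with a few lines of Python. The sketch below is not SADIT's own loader; the FlowRecord type and the parse_flow_line function are hypothetical names, and the IP addresses and ports in the sample line are made up for illustration.

```python
from collections import namedtuple

FlowRecord = namedtuple("FlowRecord", [
    "prefix", "node", "time", "start_time", "end_time",
    "src_ip", "src_port", "dst_ip", "dst_port",
    "protocol", "payload", "dst_node", "packets", "size_bytes", "flags",
])


def parse_flow_line(line):
    """Split one whitespace-separated flow line into a typed FlowRecord."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 fields, got %d" % len(fields))
    (prefix, node, time, start, end, endpoints,
     proto, payload, dst_node, pkts, size, flags) = fields
    # Endpoints look like src_ip:src_port->dst_ip:dst_port.
    src, dst = endpoints.split("->")
    src_ip, src_port = src.rsplit(":", 1)
    dst_ip, dst_port = dst.rsplit(":", 1)
    return FlowRecord(prefix, node, float(time), float(start), float(end),
                      src_ip, int(src_port), dst_ip, int(dst_port),
                      proto, payload, dst_node, int(pkts), int(size), flags)


# Sample line with made-up addresses and ports.
sample = ("text-export n0 1348412129.925416 1348412129.925416 1348412130.070733 "
          "10.0.0.1:80->10.0.0.2:51224 tcp 0x0 n1 5 4215 FSA")
record = parse_flow_line(sample)
```

Using a namedtuple keeps the fields accessible by name, e.g. record.size_bytes or record.end_time - record.start_time for the flow duration.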
After finishing your detection algorithm(s), the last thing you need to do is add the corresponding class name to detector_map in ROOT/Detector/gad/Detector/. After that you will be able to use your detection algorithm(s). You can use the detectcompare experiment to compare them with other algorithms, or the eval experiment to evaluate them. You can also implement a new experiment to play with your new algorithm(s).
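The registration step can be pictured as adding an entry to a dictionary that maps method names to detector classes. The sketch below uses a stand-in dictionary named detector_map and a toy detector; the detector's interface (a constructor taking a threshold and a detect method) is an assumption for illustration and will differ from what SADIT actually expects.

```python
class ThresholdDetector(object):
    """Hypothetical detector: flags values above a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def detect(self, values):
        return [v > self.threshold for v in values]


# Stand-in for SADIT's detector_map: method name -> detector class.
detector_map = {}
detector_map["threshold"] = ThresholdDetector


def build_detector(name, **kwargs):
    """Look up a detector class by method name and instantiate it."""
    try:
        cls = detector_map[name]
    except KeyError:
        raise ValueError("unknown method %r; available: %s"
                         % (name, sorted(detector_map)))
    return cls(**kwargs)


alarms = build_detector("threshold", threshold=3).detect([1, 5, 2, 7])
```

A registry of this kind is what lets a command-line option such as -m select a detector by name.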
SADIT supports not only the text output format of fs-simulator, but also several other types of flow data. The data wrapper classes are defined in the sadit.Detector.gad.Detector.Data module, and the handler classes are located in the sadit.Detector.gad.Detector.DataHandler module.
In order to use data in a new format, you need to implement two new classes:
First, a data class that satisfies Data interface (
, Line 9). Namely, such a class has to at least provide the following three functions:
- a row-slicing method.
- get_where: get the range of rows that satisfy a criterion.
- get_min_max: get the min and max values of a certain feature over a certain range.
The package already includes several data classes; in some cases, you can reuse them:
- a base class for disk file data.
- MEM_FS: disk file generated by fs-simulator.
- MEM_FlowExport: disk file generated by the FlowExport tool.
- MySQLDatabase: base class for data in disk file.
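As a rough picture of what a data class has to provide, here is a minimal in-memory sketch with the three functions described above. It is not taken from SADIT: the class name InMemoryData, the method name get_rows (used here for row slicing), and all signatures are assumptions, so check the actual Data interface before implementing your own class.

```python
class InMemoryData(object):
    """Toy data class holding rows as tuples plus a list of field names."""

    def __init__(self, field_names, rows):
        self.field_names = list(field_names)
        self.rows = list(rows)

    def get_rows(self, fields=None, rg=None):
        """Row slicing: rows in the index range rg = (start, end), limited to `fields`."""
        start, end = rg if rg is not None else (0, len(self.rows))
        fields = self.field_names if fields is None else fields
        idx = [self.field_names.index(f) for f in fields]
        return [tuple(row[i] for i in idx) for row in self.rows[start:end]]

    def get_where(self, field, low, high):
        """Index range of rows with low <= row[field] < high.

        Assumes the rows are sorted by `field` (e.g. by start time).
        """
        col = self.field_names.index(field)
        hits = [i for i, row in enumerate(self.rows) if low <= row[col] < high]
        return (hits[0], hits[-1] + 1) if hits else (0, 0)

    def get_min_max(self, fields, rg=None):
        """Per-field minimum and maximum over the given row range."""
        cols = list(zip(*self.get_rows(fields, rg)))
        return [min(c) for c in cols], [max(c) for c in cols]


data = InMemoryData(["start_time", "flow_size"],
                    [(0.0, 10), (1.0, 40), (2.0, 25), (3.0, 5)])
window = data.get_where("start_time", 1.0, 3.0)
sizes = data.get_rows(["flow_size"], window)
bounds = data.get_min_max(["flow_size"], window)
```

A detector working on time windows would typically call get_where to find the rows of a window and then get_rows to fetch the features it needs.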
Second, a data handler class that implements data preprocessing, e.g., quantization:
- a handler that quantizes the input data.
- IPHanlder: for logs with IP addresses. It first clusters the IPs and then replaces each IPv4 address with (cluster label, distance to cluster center).
Then you just need to add your data_handler
defined in ROOT/Detector/gad/Detector/
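Quantization, the preprocessing step mentioned above, maps each feature value to one of a small number of levels. The following is a minimal sketch using equal-width bins, where each value is replaced by the midpoint of its bin. It is an illustration under that assumption, not the scheme SADIT's handlers necessarily use, and the function name quantize is hypothetical.

```python
def quantize(values, min_value, max_value, levels):
    """Map each value to the midpoint of one of `levels` equal-width bins."""
    if levels < 1:
        raise ValueError("levels must be >= 1")
    width = (max_value - min_value) / float(levels)
    if width == 0:
        # Degenerate range: every value collapses to the single level.
        return [float(min_value)] * len(values)
    quantized = []
    for v in values:
        # Clamp so that values at or beyond the range edges fall in the end bins.
        bin_index = max(0, min(int((v - min_value) / width), levels - 1))
        quantized.append(min_value + (bin_index + 0.5) * width)
    return quantized


# Illustrative flow sizes quantized into two levels over the range [0, 10].
levels_2 = quantize([0, 3, 9, 10], 0, 10, 2)
```

The number of levels per feature plays the role of the quantization level that the --feature_option dictionary describes.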
I have recorded several step-by-step video tutorials for SADIT 1.0. The usage of SADIT 1.1 is a little different, but I think these videos will still be useful. I will record new videos for the latest version of SADIT when I have time.
If you have no access to YouTube, you can download the videos (all in AVI format) from the following link (hosted on Google Drive).
Please see the LICENSE
