This is the replication package for the paper titled "Improving State-of-the-art Compression Techniques for Log Management Tools".
LogBlock is a log preprocessing tool that improves the compression of small blocks of log data. Modern log management tools usually split log data into small blocks to improve query performance. As shown in the following table, different log management tools adopt different block sizes. On such small blocks, LogBlock achieves a better compression ratio than direct compression or traditional log preprocessing tools, which achieve good compression ratios on large log files.
We include sample logs in this repo for evaluation purposes. To access the full dataset, please contact Loghub.
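To illustrate why small blocks are harder to compress than a whole log file, the sketch below compares compressing a log in one pass with compressing it in independent fixed-size blocks. It is not part of LogBlock: the synthetic log lines, the 4096-byte block size, the use of `zlib`, and the helper names `compression_ratio` and `blockwise_ratio` are all assumptions chosen for illustration.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size for one-pass compression."""
    return len(data) / len(zlib.compress(data, 9))


def blockwise_ratio(data: bytes, block_size: int) -> float:
    """Compress each fixed-size block independently and return the overall ratio."""
    compressed_total = sum(
        len(zlib.compress(data[i:i + block_size], 9))
        for i in range(0, len(data), block_size)
    )
    return len(data) / compressed_total


# Synthetic log lines, made up for this illustration only.
lines = [
    f"2024-01-01 00:00:{i % 60:02d} INFO worker-{i % 7} finished task {i}\n"
    for i in range(20000)
]
log = "".join(lines).encode()

whole = compression_ratio(log)      # the compressor sees the whole file
small = blockwise_ratio(log, 4096)  # each block starts with no shared history

print(f"whole file: {whole:.2f}x, 4 KB blocks: {small:.2f}x")
```

Each independently compressed block pays its own header cost and cannot reuse repetition seen in earlier blocks, which is the gap that preprocessing for small blocks tries to close.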
The repository contains a framework for evaluating different log preprocessing approaches. We take the following approaches into consideration.
- LogBlock - Reduce repetitiveness through preprocessing heuristics.
- LogZip - Extract repetitive templates and variables through iterative clustering. Please check the full paper for more details: Logzip: Extracting Hidden Structures via Iterative Clustering for Log Compression.
- Cowic - Compress log entries with a pre-trained compression model. Please check the full paper for more details: Cowic: A Column-Wise Independent Compression for Log Stream Analysis.
- LogArchive (not taken into comparison) - Cluster log messages according to text similarity, then compress them. For more details, please check: Adaptive log compression for massive log data.
We do not include the source code of Logzip, Cowic and LogArchive due to copyright reasons.
To evaluate these approaches with our framework, clone and compile these tools first.
Start from here to evaluate the compression performance of each approach on small logs.
During execution, the randomly truncated log blocks are saved under the 'temp' folder; the preprocessed data is recorded under the 'data' folder; and the compression performance results are saved under the 'result' folder.
All of these folders will be created at runtime.
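The folder layout described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the helpers `prepare_dirs` and `random_block`, the file name `block_0.log`, and the choice to truncate by line count are assumptions. The demo runs inside a temporary directory so it leaves nothing behind.

```python
import os
import random
import tempfile


def prepare_dirs(root: str) -> dict:
    """Create the 'temp', 'data' and 'result' folders under root if missing."""
    paths = {name: os.path.join(root, name) for name in ("temp", "data", "result")}
    for path in paths.values():
        os.makedirs(path, exist_ok=True)
    return paths


def random_block(lines: list, block_lines: int, rng: random.Random) -> list:
    """Return a contiguous run of block_lines lines starting at a random offset."""
    start = rng.randrange(0, max(1, len(lines) - block_lines + 1))
    return lines[start:start + block_lines]


with tempfile.TemporaryDirectory() as root:
    dirs = prepare_dirs(root)
    rng = random.Random(0)  # fixed seed so the sampled block is reproducible

    # Placeholder log content, made up for this illustration only.
    lines = [f"line {i}\n" for i in range(1000)]
    block = random_block(lines, 100, rng)

    # Hypothetical file name for the truncated block.
    block_path = os.path.join(dirs["temp"], "block_0.log")
    with open(block_path, "w") as f:
        f.writelines(block)

    created = sorted(os.listdir(root))

print(created)
```

Using `exist_ok=True` lets the same script run repeatedly without failing when the folders already exist.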