samp is a simple command-line program that randomly samples lines from standard input. It can
be used to trim down large newline-delimited files for testing or other purposes.
To get started, install samp from source:
# Directly from GitHub
cargo install --git https://github.com/jerluc/samp.git
# Or from local source
git clone https://github.com/jerluc/samp.git && cd samp/ && cargo install --path .
To use samp:
Usage: samp [-r <ratio>] [-s <seed>]
Sample stdin
Options:
-r, --ratio sample ratio
-s, --seed seed string
--help display usage information
For example, here's how you can randomly sample ~10% of your computer's dictionary file:
cat /usr/share/dict/words | samp -r 0.1
And here's how you can randomly sample ~5% of "War and Peace" using a reproducible text seed:
# Save sample to file
curl -s https://www.gutenberg.org/cache/epub/2600/pg2600.txt | samp -r 0.05 -s tolstoy > wp.txt
# Save second sample to another file
curl -s https://www.gutenberg.org/cache/epub/2600/pg2600.txt | samp -r 0.05 -s tolstoy > wp2.txt
diff wp.txt wp2.txt
# No differences!
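To illustrate why a fixed seed string yields identical samples, here is a minimal Rust sketch of seeded per-line sampling. It is not samp's actual implementation: the names seed_from_str, next_u64, next_f64 and sample_lines are hypothetical, and hashing the seed with DefaultHasher and drawing numbers with SplitMix64 are assumptions chosen only to keep the example free of external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, Write};

/// Hypothetical helper: derive a 64-bit generator state from a seed string.
/// DefaultHasher::new() uses fixed keys, so the same string gives the same
/// state within one Rust release (the algorithm may change between releases).
fn seed_from_str(seed: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    hasher.finish()
}

/// One SplitMix64 step: advance the state and return a pseudo-random u64.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Uniform f64 in [0, 1), built from the top 53 bits of the next u64.
fn next_f64(state: &mut u64) -> f64 {
    (next_u64(state) >> 11) as f64 / (1u64 << 53) as f64
}

/// Assumed sampling rule: keep each line independently with probability `ratio`.
fn sample_lines<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    ratio: f64,
    seed: &str,
) -> io::Result<()> {
    let mut state = seed_from_str(seed);
    for line in input.lines() {
        let line = line?;
        if next_f64(&mut state) < ratio {
            writeln!(output, "{}", line)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // In-memory stand-in for stdin so the sketch runs without piped input.
    let input = "alpha\nbeta\ngamma\ndelta\nepsilon\n";
    let mut out = Vec::new();
    sample_lines(input.as_bytes(), &mut out, 0.5, "tolstoy")?;
    print!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```

Because the generator state depends only on the seed string, and one number is drawn per input line, two runs over the same input with the same seed keep exactly the same lines. Under this rule the ratio is a per-line probability, so the output size is only approximately ratio times the input size.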
I basically had two motivations in creating this software:
- I often find myself working with very large, newline-delimited data; I use samp to randomly down-sample this data for running various tests
- I wanted an excuse to practice some more Rust :)
When contributing to this repository, please follow the steps below:
- Fork the repository
- Submit your patch in one commit, or a series of well-defined commits
- Submit your pull request and make sure you reference the issue you are addressing
See LICENSE