This Python script provides an easy and parameterizable way of defining typical DVC (sub-)stages for:
- data preprocessing
- data transformation
- data splitting
- data validation
To get a local copy up and running, follow the steps below. DVC-Stage depends on the following packages:
pandas>=0.20.*
dvc>=2.12.*
pyyaml>=5
This package is available on PyPI. You can install it and all of its dependencies using pip:
pip install dvc-stage
DVC-Stage works on top of two files: dvc.yaml and params.yaml. They are expected to be at the root of an initialized DVC project. From there you can execute dvc-stage -h to see the available commands, or dvc-stage get-config STAGE to generate the DVC stage definition from the params.yaml file. The tool generates the respective YAML, which you can then paste manually into the dvc.yaml file. Existing stages can afterwards be updated in place using dvc-stage update-stage STAGE.
Stages are defined inside params.yaml using the following schema:
STAGE_NAME:
  load: {}
  transformations: []
  validations: []
  write: {}
The load and write sections both require the YAML keys path and format to read and save data, respectively.
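As a rough illustration of this requirement, the sketch below checks an already parsed stage definition for the two required keys. The helper name missing_io_keys, the file paths, and the csv format value are assumptions made for this example and are not part of DVC-Stage itself.

```python
# Keys that the load and write sections must both provide (per the text above).
REQUIRED_IO_KEYS = ("path", "format")


def missing_io_keys(stage_config):
    """Return {section: [missing keys]} for the load and write sections.

    Hypothetical helper for illustration; DVC-Stage may validate differently.
    """
    problems = {}
    for section in ("load", "write"):
        section_config = stage_config.get(section) or {}
        missing = [key for key in REQUIRED_IO_KEYS if key not in section_config]
        if missing:
            problems[section] = missing
    return problems


# Hypothetical stage definition, as it might look after parsing params.yaml.
example_stage = {
    "load": {"path": "data/raw.csv", "format": "csv"},
    "transformations": [{"id": "transpose"}],
    "validations": [],
    "write": {"path": "data/out.csv"},  # "format" is deliberately left out
}

print(missing_io_keys(example_stage))  # -> {'write': ['format']}
```

An empty result means both sections carry path and format.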
The transformations and validations sections require a sequence of functions to apply, where transformations return data and validations return a truth value (derived from data). Functions are defined by the key id and can be either:
- Methods defined on pandas DataFrames, e.g.
  transformations:
    - id: transpose
- Functions imported from any Python module, e.g.
  transformations:
    - id: custom
      description: duplicate rows
      import_from: demo.duplicate
- Functions predefined by DVC-Stage, e.g.
  validations:
    - id: validate_pandera_schema
      schema:
        import_from: demo.get_schema
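To make the import_from entries more concrete, here is a minimal sketch of how a dotted path such as demo.duplicate could be resolved to a callable using the standard library. The helper resolve_import is hypothetical; DVC-Stage's actual lookup may work differently. A standard-library function stands in for the demo module so the snippet runs on its own.

```python
import importlib


def resolve_import(dotted_path):
    """Resolve 'package.module.attribute' to the attribute object.

    Hypothetical helper for illustration only.
    """
    module_name, _, attribute = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Stand-in for an entry such as `import_from: demo.duplicate`.
sqrt = resolve_import("math.sqrt")
print(sqrt(16.0))  # -> 4.0
```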
When writing a custom function, you need to make sure the function gracefully handles data being None, which is required for type inference. Data is passed as the first argument. Further arguments can be provided as additional keys, as shown above for validate_pandera_schema, where schema is passed as the second argument to the function.
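A custom transformation following these rules might look like the sketch below. It is only an illustration: returning None when data is None is one assumed way of handling that case gracefully, the times parameter is hypothetical, and it is assumed that extra keys from params.yaml reach the function as keyword arguments.

```python
def duplicate(data, times=2):
    """Stack a pandas DataFrame on top of itself `times` times.

    `data` is passed as the first argument. It may be None, in which case
    None is returned instead of raising (assumed behaviour, see lead-in).
    `times` is a hypothetical extra argument supplied via an additional key.
    """
    if data is None:
        return None
    # Select every row position `times` times: 0..n-1, 0..n-1, ...
    positions = list(range(len(data))) * times
    return data.iloc[positions]
```

Saved as duplicate in a module named demo, such a function would be referenced with id: custom and import_from: demo.duplicate, as in the list above.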
A working demonstration can be found at examples/.
Any contributions are greatly appreciated! If you have a question or an issue, or would like to contribute, please read our contributing guidelines.
Distributed under the GNU General Public License v3.
Marcel Arpogaus - [email protected] (encrypted with ROT13)
Project Link: https://github.com/MArpogaus/dvc-stage
Parts of this work have been funded by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety due to a decision of the German Federal Parliament (AI4Grids: 67KI2012A).