Data Prep Kit Release notes

Release 0.2.1 - 9/24/2024

General

Bug fixes across the repo
Added AI Alliance RAG demo, tutorials, notebooks, and tips for running on Google Colab
Added new transforms and a single package for transforms, published to PyPI
Improved CI/CD with targeted workflow triggered on specific changes to specific modules
New enhancements for cutting a release

data-prep-toolkit libraries (python, ray, spark)

Restructure the repository to distinguish/separate runtime libraries
Split data-processing-lib/ray into python and ray
Spark runtime
Updated pyarrow version
Defined the required transform() method as abstract in AbstractTableTransform
Enabled configuration of the makefile to use src or pypi for data-prep-kit library dependencies
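
To illustrate what making transform() abstract means for transform authors, here is a minimal sketch. The class names and the simplified signature (rows as a list of dictionaries rather than a pyarrow table) are assumptions made for illustration and are not the library's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchTableTransform(ABC):
    """Hypothetical stand-in for an abstract table transform base class."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    @abstractmethod
    def transform(
        self, table: list[dict[str, Any]]
    ) -> tuple[list[dict[str, Any]], dict[str, Any]]:
        """Return the transformed rows plus a metadata dictionary."""


class NoopSketchTransform(SketchTableTransform):
    """Concrete subclass: it must implement transform() to be instantiable."""

    def transform(self, table):
        # Pass the rows through unchanged and report how many were seen.
        return table, {"nrows": len(table)}


def can_instantiate(cls) -> bool:
    """Report whether cls can be constructed with an empty config."""
    try:
        cls({})
        return True
    except TypeError:
        # Raised by ABC when an abstract method is left unimplemented.
        return False
```

With this pattern, a subclass that forgets to implement transform() fails when it is constructed rather than later, when the runtime first calls the method.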

KFP Workloads

Add a configurable timeout before destroying the deployed Ray cluster.

Transforms

Added 7 new transforms, including language identification, profiler, repo level ordering, doc quality, pdf2parquet, HTML2Parquet and PII Transform
Added ededup python implementation and incremental ededup
Added fuzzy floating point comparison
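
As a rough sketch of what a fuzzy floating point comparison can look like, the helper below compares nested values with a tolerance using the standard library's math.isclose. The function name fuzzy_equal and the default tolerances are assumptions chosen for illustration; this is not the toolkit's implementation.

```python
import math


def fuzzy_equal(a, b, rel_tol=1e-9, abs_tol=1e-12) -> bool:
    """Compare two values, treating floats as equal when within a tolerance."""
    if isinstance(a, float) or isinstance(b, float):
        # Only compare numerically when both sides are numbers.
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        return False
    if isinstance(a, dict) and isinstance(b, dict):
        # Same keys, and every value matches recursively.
        return a.keys() == b.keys() and all(
            fuzzy_equal(a[k], b[k], rel_tol, abs_tol) for k in a
        )
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        # Same length, and every element matches recursively.
        return len(a) == len(b) and all(
            fuzzy_equal(x, y, rel_tol, abs_tol) for x, y in zip(a, b)
        )
    # Everything else falls back to exact equality.
    return a == b
```

A comparison like this is useful when expected and actual outputs differ only by floating point rounding, for example `0.1 + 0.2` versus `0.3`.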

Release 0.2.0 - 6/27/2024

General

Many bug fixes across the repo, plus the following specifics.
Enhanced CI/CD and makefile improvements, including definition of top-level targets (clean, set-versions, build, publish, test)
Automation of release process branch/tag management
Documentation improvements

data-prep-toolkit libraries (python, ray, spark)

Split libraries into 3 runtime-specific implementations
Fix missing final count of processed and add percentages
Improved fault tolerance in python and ray runtimes
Report global DataAccess retry metric
Support for binary data transforms
Updated Ray version to 2.24
Updated PyArrow version to 16.1.0

KFP Workloads

Add KFP V2 support
Create a distinct (timestamped) execution.log file for each retry
Support for multiple inputs/outputs
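
One way to give each retry its own timestamped log file is sketched below. The helper name execution_log_path and the file name pattern are assumptions for illustration only; the format actually used by the KFP workflows is not specified in these notes.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def execution_log_path(log_dir: Path, now: Optional[datetime] = None) -> Path:
    """Build a log file path whose name embeds a UTC timestamp (hypothetical format)."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Second-level resolution keeps names readable; retries that start
    # in different seconds therefore get distinct files.
    stamp = now.strftime("%Y%m%dT%H%M%S")
    return log_dir / f"execution_{stamp}.log"
```

Because the timestamp is part of the file name, a retry does not overwrite the log of the previous attempt, provided the attempts do not start within the same second.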

Transforms

Added language/lang_id - detects language in documents
Added universal/profiler - counts words/tokens in documents
Converted ingest2parquet tool to transform named code2parquet
Split transforms, as appropriate, into python, ray and/or spark.
Added spark implementations of filter, doc_id and noop transforms.
Switch from using requirements.txt to pyproject.toml file for each transform runtime
Repository restructured to move kfp workflow definitions to associated transform project directory

Release 0.1.1 - 5/24/2024

Release 0.1.0 - 5/15/2024

Release 0.1.0 - 5/08/2024