This is a prototype library currently under heavy development. It does not yet have stable releases and will likely change significantly in backward-incompatible ways until the alpha release (targeting early 2022). If you have suggestions on the API or use cases you would like covered, please open a GitHub issue. We would love to hear your thoughts and feedback.
TorchArrow is a torch.Tensor-like Python DataFrame library for data preprocessing in deep learning. It supports multiple execution runtimes and Arrow as a common format.
It plans to provide:
- A Python DataFrame library focusing on streaming-friendly APIs for data preprocessing in deep learning
- Seamless handoff to PyTorch or other model authoring frameworks, such as Tensor collation and easy plugging into PyTorch DataLoader and DataPipes
- Zero copy for external readers via the Arrow in-memory columnar format
- Support for multiple execution runtimes
- High-performance C++ UDF support with vectorization
You will need Python 3.7 or later. We also highly recommend installing a Miniconda environment.
First, set up an environment. If you are using conda, create a conda environment:
conda create --name torcharrow python=3.7
conda activate torcharrow
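To confirm that the activated environment meets the Python requirement stated above, a small check like the following may help. This is a minimal sketch using only the standard library; it is not part of TorchArrow, and the helper name `meets_minimum_python` is hypothetical.

```python
import sys

# Minimum Python version stated in this README.
MIN_PYTHON = (3, 7)


def meets_minimum_python(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum_python() else "older than required"
    print(f"Python {current}: {status}")
```

Run it inside the activated environment so that it reports the interpreter conda selected rather than the system one.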
Follow the instructions in this Colab notebook
Experimental nightly binaries for macOS (requires macOS SDK >= 10.15) and Linux (requires glibc >= 2.17) are available for Python 3.7, 3.8, and 3.9 and can be installed as pip wheels:
pip install --pre torcharrow -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
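After the install finishes, one way to sanity-check the result is to ask Python whether the package can be found and, on Linux, which glibc version the interpreter reports. The sketch below uses only the standard library (`importlib.util.find_spec` and `platform.libc_ver`); the helpers `is_installed` and `version_at_least` are hypothetical names introduced for illustration, and the check does not cover the macOS SDK requirement.

```python
import importlib.util
import platform


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


def version_at_least(version, minimum):
    """Compare dotted numeric versions, for example '2.31' against '2.17'."""
    def parse(text):
        # Keep only purely numeric components; anything else is ignored.
        return tuple(int(part) for part in text.split(".") if part.isdigit())

    return parse(version) >= parse(minimum)


if __name__ == "__main__":
    print("torcharrow found:", is_installed("torcharrow"))

    # libc_ver() returns empty strings when glibc is not detected
    # (for example on macOS), so the glibc check is skipped there.
    libc_name, libc_version = platform.libc_ver()
    if libc_name == "glibc" and libc_version:
        ok = version_at_least(libc_version, "2.17")
        print(f"glibc {libc_version} meets the 2.17 requirement:", ok)
```

Versions are compared as tuples of integers rather than as strings, because a plain string comparison would rank "2.5" above "2.17".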
If you are installing from source, you will need Python 3.7 or later and a C++17 compiler.
git clone --recursive https://github.com/facebookresearch/torcharrow
cd torcharrow
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive
On macOS
Homebrew is required to install development tools on macOS.
# Install dependencies from Brew
brew install --formula ninja cmake ccache icu4c boost gflags glog libevent
# Build and install other dependencies
scripts/build_mac_dep.sh ranges_v3 fmt double_conversion folly re2
On Ubuntu (20.04 or later)
# Install dependencies from APT
apt install -y g++ cmake ccache ninja-build checkinstall \
libssl-dev libboost-all-dev libdouble-conversion-dev libgoogle-glog-dev \
libgflags-dev libevent-dev libre2-dev
# Build and install folly and fmt
scripts/setup-ubuntu.sh
For local development, you can build in debug mode:
DEBUG=1 python setup.py develop
Then run the unit tests with:
python -m unittest -v
To build and install TorchArrow in release mode:
python setup.py install
This 10-minute tutorial provides a short introduction to TorchArrow, and you can also try it in this Colab. More documents on advanced topics are coming soon!
An example of integrating a TorchRec-based training loop with TorchArrow's on-the-fly preprocessing is available here. More examples are coming soon!
We hope to sufficiently expand the library, harden APIs, and gather feedback to enable an alpha release (early 2022).
TorchArrow is BSD licensed, as found in the LICENSE file.