Short benchmark for Arrow's `read_csv`
I made this repo after experiencing low read speeds (around 0.5 GiB/s) on real-world CSVs.
- Generates a big CSV with many string, float, and null columns, using joblib for parallelization,
- Puts the CSV into a `BytesIO` object,
- Calls `pyarrow.csv.read_csv` a few times on the CSV bytes.
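
For reference, a minimal sketch of that flow is below. The column counts, row counts, and helper names are illustrative assumptions, not the repo's exact script.

```python
import io
import random
import time

from joblib import Parallel, delayed
import pyarrow.csv


def make_chunk(n_rows: int, n_cols: int, seed: int) -> str:
    """Build one CSV chunk mixing string, float, and null cells."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        cells = []
        for c in range(n_cols):
            if rng.random() < 0.1:
                cells.append("")                         # null cell
            elif c % 2 == 0:
                cells.append(f"str_{rng.randint(0, 9999)}")  # string column
            else:
                cells.append(f"{rng.random():.6f}")          # float column
        rows.append(",".join(cells))
    return "\n".join(rows) + "\n"


def build_csv_bytes(n_chunks: int = 48, n_rows: int = 100_000, n_cols: int = 60) -> io.BytesIO:
    """Generate chunks in parallel with joblib and concatenate them into a BytesIO."""
    header = ",".join(f"col{c}" for c in range(n_cols)) + "\n"
    chunks = Parallel(n_jobs=-1)(
        delayed(make_chunk)(n_rows, n_cols, seed) for seed in range(n_chunks)
    )
    return io.BytesIO((header + "".join(chunks)).encode())


if __name__ == "__main__":
    data = build_csv_bytes()
    for _ in range(3):
        data.seek(0)                                     # rewind before each read
        start = time.perf_counter()
        table = pyarrow.csv.read_csv(data)
        print(f"{table.num_rows} rows in {time.perf_counter() - start:.2f}s")
```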
The Dockerfile sets up a minimal container for running the benchmark.
Running this on Azure, on a Standard E48s_v3 machine (48 vCPUs, 384 GiB memory) with Ubuntu 18.04, otherwise idle apart from this benchmark, consistently shows speeds below 1 GiB/s, and often below 0.5 GiB/s.
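
The GiB/s figures here are, as I read them, simply buffer size divided by wall-clock `read_csv` time; a small helper (my own naming, not from the repo) makes the conversion explicit:

```python
def throughput_gib_per_s(n_bytes: int, seconds: float) -> float:
    """Throughput in GiB/s, with 1 GiB = 2**30 bytes."""
    return n_bytes / seconds / 2**30


# Example: a 12 GiB buffer parsed in 30 s comes out to 0.4 GiB/s,
# i.e. below the 0.5 GiB/s mark mentioned above.
print(throughput_gib_per_s(12 * 2**30, 30.0))  # 0.4
```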
Included in the repo are profiling dumps, made manually with py-spy. I started them 5 seconds after the beginning of each `read_csv`, and stopped them after about 15 seconds. This was always more than 5 seconds before the `read_csv` finished.
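
The profiles in this repo were captured manually, but the procedure can be approximated with a sketch like the one below; the py-spy flags, output filename, and CSV path are assumptions on my part, and attaching may require ptrace permission on Linux.

```python
import io
import os
import subprocess
import threading

import pyarrow.csv


def profile_later(delay_s: float = 5.0, duration_s: int = 15) -> None:
    """Attach py-spy to this process after delay_s seconds, sampling for duration_s seconds."""
    cmd = [
        "py-spy", "record",
        "--pid", str(os.getpid()),
        "--duration", str(duration_s),
        "--output", "read_csv_profile.svg",
    ]
    threading.Timer(delay_s, subprocess.run, args=(cmd,)).start()


if __name__ == "__main__":
    with open("big.csv", "rb") as f:       # hypothetical path to the generated CSV
        data = io.BytesIO(f.read())
    profile_later()
    pyarrow.csv.read_csv(data)             # sampled by py-spy while it runs
```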
If the profiles are to be trusted, considerable time is spent in the shared pointer's locking machinery. As for the reading of the bytes, I'm not sure what that involves or why it takes so much time.