Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github		.github
doc		doc
vinum		vinum
.gitattributes		.gitattributes
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.rst		README.rst
requirements-dev.txt		requirements-dev.txt
setup.cfg		setup.cfg
setup.py		setup.py
versioneer.py		versioneer.py

Repository files navigation

Vinum

Vinum is a SQL processor written in pure Python, designed for data analysis workflows and in-memory analytics. Conceptually, Vinum's design goal is to provide deeper integration of Python data analysis tools such as Numpy, Pandas or in general any Python code with the SQL language. Key features include native support of vectorized Numpy and Python functions as UDFs in SQL queries.

Key Features:

Natively supports vectorized Numpy and Python functions inside of SELECT, WHERE, GROUP BY, HAVING and ORDER BY clauses. All the numpy functions are available by default via the 'np.*' namespace.
Written in pure Python and built from ground up on top of Apache Arrow and Numpy.
Apache Arrow provides the foundation for "moving" data and enables minimal overhead for transferring data to and from Numpy and Pandas.
Designed for in-memory analytics workflows, based on columnar memory layout.

Design

Vinum's query planner compiles SQL SELECT statement into a DAG of vectorized Arrow and Numpy operators and therefore integration with Numpy, Arrow or native Python functions comes naturally. In Vinum, all Numpy functions are first class citizens and can be used inside of SELECT, WHERE, GROUP BY, HAVING and ORDER BY clauses.

Below is an example of a possible simplified query plan.

Install

pip install vinum

Examples

Query python dict

Create a Table from a python dict and return result of the query as a Pandas DataFrame.

>>> import vinum as vn
>>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'")
   value    np.log
0  300.1  5.704116
1  880.0  6.779922

Query pandas dataframe

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC')
   col1  col2
0     3    17
1     2    13

Query csv

>>> import vinum as vn
>>> tbl = vn.read_csv('test.csv')
>>> res_tbl = tbl.sql('SELECT * FROM t WHERE fare > 5 LIMIT 3')
>>> res_tbl.to_pandas()
   id                            ts        lat        lng  fare
0   1   2010-01-05 16:52:16.0000002  40.711303 -74.016048  16.9
1   2  2011-08-18 00:35:00.00000049  40.761270 -73.982738   5.7
2   3   2012-04-21 04:30:42.0000001  40.733143 -73.987130   7.7

Compute Euclidean distance with numpy functions

Use any numpy functions via the 'np.*' namespace.

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with vectorized UDF

Register UDF performing vectorized operations on Numpy arrays.

>>> import vinum as vn
>>> vn.register_numpy('distance',
...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, distance(x, y) AS dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with python UDF

Register Python lambda function as UDF.

>>> import math
>>> import vinum as vn
>>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677

Group by z-score

>>> import numpy as np
>>> import vinum as vn
>>> def z_score(x: np.ndarray):
...     """Compute Standard Score"""
...     mean = np.mean(x)
...     std = np.std(x)
...     return (x - mean) / std
...
>>> vn.register_numpy('score', z_score)
>>> tbl = vn.read_csv('test.csv')
>>> tbl.sql_pd('select int(score(fare)) AS bucket, avg(fare), count(*) '
...            'FROM t GROUP BY bucket ORDER BY bucket')
   bucket        avg  count
0       0   8.111630     92
1       1  19.380000      3
2       2  27.433333      3
3       3  34.670000      1
4       6  58.000000      1

Documentation

What Vinum is not

Vinum is not a Database Management System, there are no plans to support INSERT or UPDATE statements, as well as MVCC.

Dependencies

Future plans

Performance improvements.
Support sub-queries and JOINs.
Parallel execution.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vinum

Key Features:

Design

Install

Examples

Query python dict

Query pandas dataframe

Query csv

Compute Euclidean distance with numpy functions

Compute Euclidean distance with vectorized UDF

Compute Euclidean distance with python UDF

Group by z-score

Documentation

What Vinum is not

Dependencies

Future plans

About

Releases 7

Languages

License

dmitrykoval/vinum

Folders and files

Latest commit

History

Repository files navigation

Vinum

Key Features:

Design

Install

Examples

Query python dict

Query pandas dataframe

Query csv

Compute Euclidean distance with numpy functions

Compute Euclidean distance with vectorized UDF

Compute Euclidean distance with python UDF

Group by z-score

Documentation

What Vinum is not

Dependencies

Future plans

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases 7

Languages