A library for processing s3select queries and executing them on CSV files (current phase)

leonid-s-usov/ceph-s3select
s3select


s3select is an S3 request type that enables a client to push an SQL statement (per the AWS s3select specification) down into Ceph storage.
s3select is an implementation of the push-down paradigm.
The push-down paradigm is about moving ("pushing") the operation close to the data.
This is contrary to what is commonly done, i.e. moving the data to the "place" of the operation.
In a big-data ecosystem, this makes a big difference.
Without push-down, executing "select sum(x + y) from s3object where a + b > c" requires fetching the entire object to the client side, and only then running the operation in an analytic application.
With push-down (s3select), the entire operation is executed on the server side, and only the result is returned to the client.
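As an illustration of what the server-side engine computes (not the engine itself), the same filter-and-aggregate can be sketched with standard tools over a toy CSV whose column order is assumed to be x,y,a,b,c:

```shell
# Hypothetical object with assumed column order x,y,a,b,c.
printf '%s\n' '1,2,10,5,20' '3,4,30,5,20' > /tmp/s3object.csv

# Equivalent of: select sum(x + y) from s3object where a + b > c
# Only the second row satisfies 30 + 5 > 20, so the sum is 3 + 4 = 7.
awk -F, '$3 + $4 > $5 { s += $1 + $2 } END { print s }' /tmp/s3object.csv
# -> 7
```

With s3select, this computation happens next to the data; only the single-number result crosses the network.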

Analyzing huge amounts of cold/warm data without moving or converting it


S3 storage is reliable, efficient, and cheap, and it already holds a huge number of objects. Many of them are CSV, JSON, or Parquet objects containing vast amounts of data to analyze.
An ETL pipeline may convert these objects into Parquet and then run queries on the converted objects.
But that comes at an expensive price: downloading all of these objects close to the analytic application.


The s3select engine, residing on the S3 storage itself, can do these jobs for many use cases, saving time and resources.

The s3select engine stands on its own


The engine resides in a dedicated GitHub repo, and it is also capable of executing SQL statements on standard input or on files residing on a local file system.
Users may clone and build this repo and execute various SQL statements via the CLI.
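A minimal build sketch, assuming the standard CMake workflow and this fork's repository path; consult the repository itself for the authoritative steps and prerequisites:

```shell
# Assumed workflow: clone the repo and build out-of-tree with CMake.
# Directory layout and targets may differ; check the repo's own instructions.
git clone https://github.com/leonid-s-usov/ceph-s3select.git
cd ceph-s3select
mkdir build && cd build
cmake .. && make
# The demo CLI and test suite are then expected under ./example and ./test.
```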

A Docker image containing a development environment

The quickest way to get started is the following container. It already contains the cloned repo, enabling code review and modification.

Running the s3select container image

sudo docker run -w /s3select -it galsl/ubunto_arrow_parquet_s3select:dev

Running the Google Test suite, which contains hundreds of queries

./test/s3select_test

Running SQL statements using CLI on standard input

./example/s3select_example is a small demo app that lets you run queries on a local file or on standard input. For example, the following runs the engine on standard input:

seq 1 1000 | ./example/s3select_example -q 'select count(0) from stdin;'

SQL statement on ps command output (standard input)

ps -ef | tr -s ' ' | CSV_COLUMN_DELIMETER=' ' CSV_HEADER_INFO= ./example/s3select_example -q 'select PID,CMD from stdin where PPID="1";'
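The `tr -s ' '` stage is what makes this work: ps pads its columns with runs of spaces, and squeezing each run down to a single space turns the output into space-delimited rows that match the one-character delimiter passed via CSV_COLUMN_DELIMETER. A standalone illustration with a made-up ps-style line:

```shell
# Squeeze repeated spaces so each field is separated by exactly one space.
echo 'root        1        0  0 10:00 ?  00:00:01 /sbin/init' | tr -s ' '
# -> root 1 0 0 10:00 ? 00:00:01 /sbin/init
```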

SQL statement processed by the container; the input data is piped into the container.

seq 1 1000000 | sudo docker run -w /s3select -i galsl/ubunto_arrow_parquet_s3select:dev bash -c "./example/s3select_example -q 'select count(0) from stdin;'"

Running SQL statements using CLI on local file

It is possible to run a query on a local file, as follows.

./example/s3select_example -q 'select count(0) from /full/path/file_name;'
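A self-contained way to try this: generate a sample file first, then cross-check the count with standard tools. The s3select_example path assumes the demo binary was built as described above.

```shell
# Generate a 100-row sample file (hypothetical path).
seq 1 100 > /tmp/sample.csv

# Assumes the demo binary was built; the guard skips the call if it is absent.
if [ -x ./example/s3select_example ]; then
  ./example/s3select_example -q 'select count(0) from /tmp/sample.csv;'
fi

# Cross-check with standard tools: both should report 100 rows.
wc -l < /tmp/sample.csv
# -> 100
```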

SQL statement processed by the container; the input data is mapped into the container's file system.

sudo docker run -w /s3select -v /home/gsalomon/work:/work -it galsl/ubunto_arrow_parquet_s3select:dev bash -c "./example/s3select_example -q 'select count(*) from /work/datatime.csv;'"
