Skip to content

samay-sharma/cstore_fdw

 
 

Repository files navigation

cstore_fdw

Build Status Coverage

This extension implements a columnar store for PostgreSQL. Columnar stores provide notable benefits for analytic use-cases where data is loaded in batches.

The extension uses the Optimized Row Columnar (ORC) format for its data layout. ORC improves upon the RCFile format developed at Facebook, and brings the following benefits:

  • Compression: Reduces in-memory and on-disk data size by 2-4x. Can be extended to support different codecs.
  • Column projections: Only reads column data relevant to the query. Improves performance for I/O bound queries.
  • Skip indexes: Stores min/max statistics for row groups, and uses them to skip over unrelated rows.

Further, we used the Postgres foreign data wrapper APIs and type representations with this extension. This brings:

  • Support for 40+ Postgres data types. The user can also create new types and use them.
  • Statistics collection. PostgreSQL's query optimizer uses these stats to evaluate different query plans and pick the best one.
  • Simple setup. Create foreign table and copy data. Run SQL.

Building

cstore_fdw depends on protobuf-c for serializing and deserializing table metadata. So we need to install these packages first:

# Fedora 17+, CentOS, and Amazon Linux
sudo yum install protobuf-c-devel

# Ubuntu 10.4+
sudo apt-get install protobuf-c-compiler
sudo apt-get install libprotobuf-c0-dev

# Mac OS X
brew install protobuf-c

Note. In CentOS 5 and 6, you may need to install or update EPEL 5 or EPEL 6 repositories. See [this page] (http://www.rackspace.com/knowledge_center/article/installing-rhel-epel-repo-on-centos-5x-or-6x) for instructions.

Note. In Amazon Linux, EPEL 6 repository is installed by default, but it is not enabled. See these instructions for how to enable it.

Once you have protobuf-c installed on your machine, you are ready to build cstore_fdw. For this, you need to include the pg_config directory path in your make command. This path is typically the same as your PostgreSQL installation's bin/ directory path. For example:

PATH=/usr/local/pgsql/bin/:$PATH make
sudo PATH=/usr/local/pgsql/bin/:$PATH make install

Note. cstore_fdw requires PostgreSQL 9.3 or 9.4. It doesn't support earlier versions of PostgreSQL.

Usage

Before using cstore_fdw, you need to add it to shared_preload_libraries in your postgresql.conf and restart Postgres:

shared_preload_libraries = 'cstore_fdw'    # (change requires restart)

The following parameters can be set on a cstore foreign table object.

  • filename: The absolute path to the location for storing table data. Before creating your columnar tables, you may want to choose and create a directory to keep your cstore files.
  • compression: The compression used for compressing value streams. Valid options are none and pglz. The default is none.
  • stripe_row_count: Number of rows per stripe. The default is 150000. Reducing this decreases the amount memory used for loading data and querying, but also decreases the performance.
  • block_row_count: Number of rows per column block. The default is 10000. cstore_fdw compresses, creates skip indexes, and reads from disk at the block granularity. Increasing this value helps with compression and results in fewer reads from disk. However, higher values also reduce the probability of skipping over unrelated row blocks.

You can use PostgreSQL's COPY command to load or append data into the table. You can use PostgreSQL's ANALYZE table_name command to collect statistics about the table. These statistics help the query planner to help determine the most efficient execution plan for each query.

Note. We currently don't support updating table using INSERT, DELETE, and UPDATE commands.

Example

As an example, we demonstrate loading and querying data to/from a column store table from scratch here. Let's start with downloading and decompressing the data files.

wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

gzip -d customer_reviews_1998.csv.gz
gzip -d customer_reviews_1999.csv.gz

Then, let's log into Postgres, and run the following commands to create a column store foreign table:

-- load extension first time after install
CREATE EXTENSION cstore_fdw;

-- create server object
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- create foreign table
CREATE FOREIGN TABLE customer_reviews
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)
SERVER cstore_server
OPTIONS(filename '/usr/local/pgsql/cstore/customer_reviews.cstore',
        compression 'pglz');

Note. Make sure that you have created and set the permissions for the directory where you are storing the cstore files (In this case, it is /usr/local/pgsql/cstore/).

Next, we load data into the table:

COPY customer_reviews FROM '/home/user/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews FROM '/home/user/customer_reviews_1999.csv' WITH CSV;

Note. If you are getting ERROR: cannot copy to foreign table "customer_reviews" when trying to run the COPY commands, double check that you have added cstore_fdw to shared_preload_libraries in postgresql.conf and restarted Postgres.

Next, we collect data distribution statistics about the table. This is optional, but usually very helpful:

ANALYZE customer_reviews;

Finally, let's run some example SQL queries on the column store table.

-- Find all reviews a particular customer made on the Dune series in 1998.
SELECT
    customer_id, review_date, review_rating, product_id, product_title
FROM
    customer_reviews
WHERE
    customer_id ='A27T7HVDXA3K2A' AND
    product_title LIKE '%Dune%' AND
    review_date >= '1998-01-01' AND
    review_date <= '1998-12-31';

-- Do we have a correlation between a book's title's length and its review ratings?
SELECT
    width_bucket(length(product_title), 1, 50, 5) title_length_bucket,
    round(avg(review_rating), 2) AS review_average,
    count(*)
FROM
   customer_reviews
WHERE
    product_group = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;

Usage with CitusDB

The example above illustrated how to load data into a PostgreSQL database running on a single host. However, sometimes your data is too large to analyze effectively on a single host. CitusDB is a product built by Citus Data that allows you to run a distributed PostgreSQL database to analyze your data using the power of multiple hosts. CitusDB is based on a modern PostgreSQL version and allows you to easily install PostgreSQL extensions and foreign data wrappers, including cstore_fdw. For an example of how to use cstore_fdw with CitusDB see the CitusDB documentation.

Copyright

Copyright (c) 2014 Citus Data, Inc.

This module is free software; you can redistribute it and/or modify it under the Apache v2.0 License.

For all types of questions and comments about the wrapper, please contact us at engage @ citusdata.com.

About

Fast columnar store for analytics with PostgreSQL

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 99.0%
  • C++ 1.0%