Move scan-query from a contrib extension into core. (apache#4751)
* Move scan-query from a contrib extension into core.

Based on a proposal at: https://groups.google.com/d/topic/druid-development/ME_OatUDnbk/discussion

This patch also adds support for virtual columns to the Scan query,
and updates Druid SQL to use Scan instead of Select.

This patch also makes some behavioral changes to handling of the __time
column. In particular, it is now returned as "__time" rather than
"timestamp"; it is no longer included if you do not specifically ask for
it in your "columns"; and it is returned as a long rather than a string.

Users can revert time handling to the legacy extension behavior by
setting "legacy" : true in their queries, or setting the property
druid.query.scan.legacy = true. This is meant to provide a migration
path for users that were formerly using the contrib extension.

* Adjustments from review.

* Add back Select query.

* Adjust SQL docs.

* Restore SelectQuery link.
gianm authored and jihoonson committed Sep 13, 2017
1 parent 587f180 commit 2ce8123
Showing 29 changed files with 739 additions and 536 deletions.
2 changes: 0 additions & 2 deletions distribution/pom.xml
@@ -232,8 +232,6 @@
<argument>-c</argument>
<argument>io.druid.extensions.contrib:druid-redis-cache</argument>
<argument>-c</argument>
<argument>io.druid.extensions.contrib:scan-query</argument>
<argument>-c</argument>
<argument>io.druid.extensions.contrib:sqlserver-metadata-storage</argument>
<argument>-c</argument>
<argument>io.druid.extensions.contrib:statsd-emitter</argument>
1 change: 0 additions & 1 deletion docs/content/development/extensions.md
@@ -70,7 +70,6 @@ All of these community extensions can be downloaded using *pull-deps* with the c
|statsd-emitter|StatsD metrics emitter|[link](../development/extensions-contrib/statsd.html)|
|kafka-emitter|Kafka metrics emitter|[link](../development/extensions-contrib/kafka-emitter.html)|
|druid-thrift-extensions|Support thrift ingestion |[link](../development/extensions-contrib/thrift.html)|
|scan-query|Scan query|[link](../development/extensions-contrib/scan-query.html)|

## Promoting Community Extension to Core Extension

@@ -31,8 +31,11 @@ There are several main parts to a scan query:
|columns|A String array of dimensions and metrics to scan. If left empty, all dimensions and metrics are returned.|no|
|batchSize|How many rows buffered before return to client. Default is `20480`|no|
|limit|How many rows to return. If not specified, all rows will be returned.|no|
|legacy|Return results consistent with the legacy "scan-query" contrib extension. Defaults to the value set by `druid.query.scan.legacy`, which in turn defaults to false. See [Legacy mode](#legacy-mode) for details.|no|
|context|An additional JSON Object which can be used to specify certain flags.|no|

## Example results

The format of the result when resultFormat equals to `list`:

```json
@@ -154,4 +157,19 @@ The format of the result when resultFormat equals to `compactedList`:
The biggest difference between select query and scan query is that, scan query doesn't retain all rows in memory before rows can be returned to client.
It will cause memory pressure if too many rows required by select query.
Scan query doesn't have this issue.
Scan query can return all rows without issuing another pagination query, which is extremely useful when query against historical or realtime node directly.
Scan query can return all rows without issuing another pagination query, which is extremely useful when query against historical or realtime node directly.

## Legacy mode

The Scan query supports a legacy mode designed for protocol compatibility with the former scan-query contrib extension.
In legacy mode you can expect the following behavior changes:

- The __time column is returned as "timestamp" rather than "__time". This will take precedence over any other column
you may have that is named "timestamp".
- The __time column is included in the list of columns even if you do not specifically ask for it.
- Timestamps are returned as ISO8601 time strings rather than integers (milliseconds since 1970-01-01 00:00:00 UTC).

Legacy mode can be triggered either by passing `"legacy" : true` in your query JSON, or by setting
`druid.query.scan.legacy = true` on your Druid nodes. If you were previously using the scan-query contrib extension,
the best way to migrate is to activate legacy mode during a rolling upgrade, then switch it off after the upgrade
is complete.
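
For illustration, a Scan query that opts into legacy mode on a per-request basis might look like the sketch below. The datasource name, interval, and column names are placeholders and do not come from this commit; `columns`, `resultFormat`, `limit`, and `legacy` are the properties described in the table above.

```json
{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "intervals": ["2013-01-01/2013-01-02"],
  "columns": ["page", "added"],
  "resultFormat": "list",
  "limit": 3,
  "legacy": true
}
```

Following the rules listed above, with `"legacy" : true` each returned event would be expected to carry a `timestamp` field holding an ISO8601 string, even though `__time` is not listed in `columns`. With `"legacy" : false`, the same query would return only `page` and `added`, and `__time` would appear (as a long, under the name `__time`) only if it were added to `columns`.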
8 changes: 8 additions & 0 deletions docs/content/querying/select-query.md
@@ -2,6 +2,7 @@
layout: doc_page
---
# Select Queries

Select queries return raw Druid rows and support pagination.

```json
@@ -19,6 +20,13 @@ Select queries return raw Druid rows and support pagination.
}
```

<div class="note info">
Consider using the [Scan query](scan-query.html) instead of the Select query if you don't need pagination, and you
don't need the strict time-ascending or time-descending ordering offered by the Select query. The Scan query returns
results without pagination, and offers "looser" ordering than Select, but is significantly more efficient in terms of
both processing time and memory requirements. It is also capable of returning a virtually unlimited number of results.
</div>

There are several main parts to a select query:

|property|description|required?|
4 changes: 3 additions & 1 deletion docs/content/querying/sql.md
@@ -256,7 +256,9 @@ converted to zeroes).

## Query execution

Queries without aggregations will use Druid's [Select](select-query.html) native query type.
Queries without aggregations will use Druid's [Scan](scan-query.html) or [Select](select-query.html) native query types.
Scan is used whenever possible, as it is generally higher performance and more efficient than Select. However, Select
is used in one case: when the query includes an `ORDER BY __time`, since Scan does not have a sorting feature.
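
As a rough illustration of the rule above, a non-aggregating statement with no ORDER BY, such as `SELECT page, added FROM wikipedia WHERE __time >= TIMESTAMP '2013-01-01 00:00:00' AND __time < TIMESTAMP '2013-01-02 00:00:00' LIMIT 10`, would be expected to run as a Scan query along these lines. This is a hand-written sketch, not planner output: the table and column names are hypothetical, and the planner may set additional fields that are not shown here.

```json
{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "intervals": ["2013-01-01T00:00:00.000Z/2013-01-02T00:00:00.000Z"],
  "columns": ["page", "added"],
  "limit": 10
}
```

Adding `ORDER BY __time` to the same statement would, per the text above, cause it to be planned as a Select query instead, since Scan does not sort.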

Aggregation queries (using GROUP BY, DISTINCT, or any aggregation functions) will use one of Druid's three native
aggregation query types. Two (Timeseries and TopN) are specialized for specific types of aggregations, whereas the other
1 change: 1 addition & 0 deletions docs/content/toc.md
@@ -34,6 +34,7 @@ layout: toc
* [DataSource Metadata](/docs/VERSION/querying/datasourcemetadataquery.html)
* [Search](/docs/VERSION/querying/searchquery.html)
* [Select](/docs/VERSION/querying/select-query.html)
* [Scan](/docs/VERSION/querying/scan-query.html)
* Components
* [Datasources](/docs/VERSION/querying/datasource.html)
* [Filters](/docs/VERSION/querying/filters.html)
63 changes: 0 additions & 63 deletions extensions-contrib/scan-query/pom.xml

This file was deleted.

This file was deleted.

This file was deleted.

1 change: 0 additions & 1 deletion pom.xml
@@ -134,7 +134,6 @@
<module>extensions-contrib/virtual-columns</module>
<module>extensions-contrib/thrift-extensions</module>
<module>extensions-contrib/ambari-metrics-emitter</module>
<module>extensions-contrib/scan-query</module>
<module>extensions-contrib/sqlserver-metadata-storage</module>
<module>extensions-contrib/kafka-emitter</module>
<module>extensions-contrib/redis-cache</module>
3 changes: 3 additions & 0 deletions processing/src/main/java/io/druid/query/Query.java
@@ -27,6 +27,7 @@
import io.druid.query.filter.DimFilter;
import io.druid.query.groupby.GroupByQuery;
import io.druid.query.metadata.metadata.SegmentMetadataQuery;
import io.druid.query.scan.ScanQuery;
import io.druid.query.search.search.SearchQuery;
import io.druid.query.select.SelectQuery;
import io.druid.query.spec.QuerySegmentSpec;
@@ -46,6 +47,7 @@
@JsonSubTypes.Type(name = Query.SEARCH, value = SearchQuery.class),
@JsonSubTypes.Type(name = Query.TIME_BOUNDARY, value = TimeBoundaryQuery.class),
@JsonSubTypes.Type(name = Query.GROUP_BY, value = GroupByQuery.class),
@JsonSubTypes.Type(name = Query.SCAN, value = ScanQuery.class),
@JsonSubTypes.Type(name = Query.SEGMENT_METADATA, value = SegmentMetadataQuery.class),
@JsonSubTypes.Type(name = Query.SELECT, value = SelectQuery.class),
@JsonSubTypes.Type(name = Query.TOPN, value = TopNQuery.class),
@@ -58,6 +60,7 @@ public interface Query<T>
String SEARCH = "search";
String TIME_BOUNDARY = "timeBoundary";
String GROUP_BY = "groupBy";
String SCAN = "scan";
String SEGMENT_METADATA = "segmentMetadata";
String SELECT = "select";
String TOPN = "topN";
@@ -22,6 +22,7 @@
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import io.druid.guice.annotations.ExtensionPoint;
import io.druid.java.util.common.Cacheable;
import io.druid.query.lookup.LookupExtractionFn;
import io.druid.query.lookup.RegisteredLookupExtractionFn;

@@ -57,16 +58,8 @@
* regular expression with a capture group. When the regular expression matches the value of a dimension,
* the value captured by the group is used for grouping operations instead of the dimension value.
*/
public interface ExtractionFn
public interface ExtractionFn extends Cacheable
{
/**
* Returns a byte[] unique to all concrete implementations of DimExtractionFn. This byte[] is used to
* generate a cache key for the specific query.
*
* @return a byte[] unit to all concrete implements of DimExtractionFn
*/
public byte[] getCacheKey();

/**
* The "extraction" function. This should map an Object into some String value.
* <p>