Move scan-query from a contrib extension into core. (apache#4751)

* Move scan-query from a contrib extension into core.

Based on a proposal at: https://groups.google.com/d/topic/druid-development/ME_OatUDnbk/discussion

This patch also adds support for virtual columns to the Scan query,
and updates Druid SQL to use Scan instead of Select.

This patch also makes some behavioral changes to handling of the __time
column. In particular, it is now returned as "__time" rather than
"timestamp"; it is no longer included if you do not specifically ask for
it in your "columns"; and it is returned as a long rather than a string.

Users can revert time handling to the legacy extension behavior by
setting "legacy" : true in their queries, or setting the property
druid.query.scan.legacy = true. This is meant to provide a migration
path for users that were formerly using the contrib extension.

* Adjustments from review.

* Add back Select query.

* Adjust SQL docs.

* Restore SelectQuery link.
vogali · Sep 13, 2017 · 2ce8123
1 parent 587f180
commit 2ce8123
Showing 29 changed files with 739 additions and 536 deletions.
diff --git a/distribution/pom.xml b/distribution/pom.xml
@@ -232,8 +232,6 @@
                                         <argument>-c</argument>
                                         <argument>io.druid.extensions.contrib:druid-redis-cache</argument>
                                         <argument>-c</argument>
-                                        <argument>io.druid.extensions.contrib:scan-query</argument>
-                                        <argument>-c</argument>
                                         <argument>io.druid.extensions.contrib:sqlserver-metadata-storage</argument>
                                         <argument>-c</argument>
                                         <argument>io.druid.extensions.contrib:statsd-emitter</argument>

diff --git a/docs/content/development/extensions.md b/docs/content/development/extensions.md
@@ -70,7 +70,6 @@ All of these community extensions can be downloaded using *pull-deps* with the c
 |statsd-emitter|StatsD metrics emitter|[link](../development/extensions-contrib/statsd.html)|
 |kafka-emitter|Kafka metrics emitter|[link](../development/extensions-contrib/kafka-emitter.html)|
 |druid-thrift-extensions|Support thrift ingestion |[link](../development/extensions-contrib/thrift.html)|
-|scan-query|Scan query|[link](../development/extensions-contrib/scan-query.html)|
 
 ## Promoting Community Extension to Core Extension
 

diff --git a/...elopment/extensions-contrib/scan-query.md → docs/content/querying/scan-query.md b/...elopment/extensions-contrib/scan-query.md → docs/content/querying/scan-query.md
@@ -31,8 +31,11 @@ There are several main parts to a scan query:
 |columns|A String array of dimensions and metrics to scan. If left empty, all dimensions and metrics are returned.|no|
 |batchSize|How many rows buffered before return to client. Default is `20480`|no|
 |limit|How many rows to return. If not specified, all rows will be returned.|no|
+|legacy|Return results consistent with the legacy "scan-query" contrib extension. Defaults to the value set by `druid.query.scan.legacy`, which in turn defaults to false. See [Legacy mode](#legacy-mode) for details.|no|
 |context|An additional JSON Object which can be used to specify certain flags.|no|
 
+## Example results
+
 The format of the result when resultFormat equals to `list`:
 
 ```json
@@ -154,4 +157,19 @@ The format of the result when resultFormat equals to `compactedList`:
 The biggest difference between select query and scan query is that, scan query doesn't retain all rows in memory before rows can be returned to client.  
 It will cause memory pressure if too many rows required by select query.  
 Scan query doesn't have this issue.  
-Scan query can return all rows without issuing another pagination query, which is extremely useful when query against historical or realtime node directly.
+Scan query can return all rows without issuing another pagination query, which is extremely useful when query against historical or realtime node directly.
+
+## Legacy mode
+
+The Scan query supports a legacy mode designed for protocol compatibility with the former scan-query contrib extension.
+In legacy mode you can expect the following behavior changes:
+
+- The __time column is returned as "timestamp" rather than "__time". This will take precedence over any other column
+you may have that is named "timestamp".
+- The __time column is included in the list of columns even if you do not specifically ask for it.
+- Timestamps are returned as ISO8601 time strings rather than integers (milliseconds since 1970-01-01 00:00:00 UTC).
+
+Legacy mode can be triggered either by passing `"legacy" : true` in your query JSON, or by setting
+`druid.query.scan.legacy = true` on your Druid nodes. If you were previously using the scan-query contrib extension,
+the best way to migrate is to activate legacy mode during a rolling upgrade, then switch it off after the upgrade
+is complete.
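
The three legacy-mode differences listed in the hunk above can be illustrated with a small client-side sketch. This is not code from the patch: the class `LegacyScanRow` and its method `toLegacy` are hypothetical names, rows are assumed to arrive as plain maps, and `java.time.Instant` is used for the ISO8601 string, so the exact string formatting may differ from what Druid itself emits.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Druid: reshapes a non-legacy Scan row
// into the shape described under "Legacy mode".
public class LegacyScanRow {
    static final String TIME_COLUMN = "__time";
    static final String LEGACY_TIME_COLUMN = "timestamp";

    public static Map<String, Object> toLegacy(Map<String, Object> row) {
        Object time = row.get(TIME_COLUMN);
        if (!(time instanceof Long)) {
            // Assumption: without a long __time value there is nothing to convert.
            return new LinkedHashMap<>(row);
        }
        Map<String, Object> legacy = new LinkedHashMap<>();
        // Legacy rows carry the time as an ISO8601 string under "timestamp".
        legacy.put(LEGACY_TIME_COLUMN, Instant.ofEpochMilli((Long) time).toString());
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            String name = entry.getKey();
            // Skip "__time" (renamed above) and any user column named "timestamp",
            // because the converted time column takes precedence over it.
            if (!name.equals(TIME_COLUMN) && !name.equals(LEGACY_TIME_COLUMN)) {
                legacy.put(name, entry.getValue());
            }
        }
        return legacy;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("__time", 1505260800000L); // milliseconds since the epoch
        row.put("page", "Druid");
        System.out.println(toLegacy(row));
    }
}
```

A client that previously parsed output from the contrib extension could apply a conversion like this after legacy mode is switched off, rather than rewriting its row handling immediately.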
diff --git a/docs/content/querying/select-query.md b/docs/content/querying/select-query.md
@@ -2,6 +2,7 @@
 layout: doc_page
 ---
 # Select Queries
+
 Select queries return raw Druid rows and support pagination.
 
 ```json
@@ -19,6 +20,13 @@ Select queries return raw Druid rows and support pagination.
  }
 ```
 
+<div class="note info">
+Consider using the [Scan query](scan-query.html) instead of the Select query if you don't need pagination, and you
+don't need the strict time-ascending or time-descending ordering offered by the Select query. The Scan query returns
+results without pagination, and offers "looser" ordering than Select, but is significantly more efficient in terms of
+both processing time and memory requirements. It is also capable of returning a virtually unlimited number of results.
+</div>
+
 There are several main parts to a select query:
 
 |property|description|required?|

diff --git a/docs/content/querying/sql.md b/docs/content/querying/sql.md
@@ -256,7 +256,9 @@ converted to zeroes).
 
 ## Query execution
 
-Queries without aggregations will use Druid's [Select](select-query.html) native query type.
+Queries without aggregations will use Druid's [Scan](scan-query.html) or [Select](select-query.html) native query types.
+Scan is used whenever possible, as it is generally higher performance and more efficient than Select. However, Select
+is used in one case: when the query includes an `ORDER BY __time`, since Scan does not have a sorting feature.
 
 Aggregation queries (using GROUP BY, DISTINCT, or any aggregation functions) will use one of Druid's three native
 aggregation query types. Two (Timeseries and TopN) are specialized for specific types of aggregations, whereas the other

diff --git a/docs/content/toc.md b/docs/content/toc.md
@@ -34,6 +34,7 @@ layout: toc
   * [DataSource Metadata](/docs/VERSION/querying/datasourcemetadataquery.html)
   * [Search](/docs/VERSION/querying/searchquery.html)
   * [Select](/docs/VERSION/querying/select-query.html)
+  * [Scan](/docs/VERSION/querying/scan-query.html)
   * Components
     * [Datasources](/docs/VERSION/querying/datasource.html)
     * [Filters](/docs/VERSION/querying/filters.html)

diff --git a/extensions-contrib/scan-query/pom.xml b/extensions-contrib/scan-query/pom.xml
diff --git a/extensions-contrib/scan-query/src/main/java/io/druid/query/scan/ScanQueryDruidModule.java b/extensions-contrib/scan-query/src/main/java/io/druid/query/scan/ScanQueryDruidModule.java
diff --git a/...ntrib/scan-query/src/main/resources/META-INF/services/io.druid.initialization.DruidModule b/...ntrib/scan-query/src/main/resources/META-INF/services/io.druid.initialization.DruidModule
diff --git a/pom.xml b/pom.xml
@@ -134,7 +134,6 @@
         <module>extensions-contrib/virtual-columns</module>
         <module>extensions-contrib/thrift-extensions</module>
         <module>extensions-contrib/ambari-metrics-emitter</module>
-        <module>extensions-contrib/scan-query</module>
         <module>extensions-contrib/sqlserver-metadata-storage</module>
         <module>extensions-contrib/kafka-emitter</module>
         <module>extensions-contrib/redis-cache</module>

diff --git a/processing/src/main/java/io/druid/query/Query.java b/processing/src/main/java/io/druid/query/Query.java
@@ -27,6 +27,7 @@
 import io.druid.query.filter.DimFilter;
 import io.druid.query.groupby.GroupByQuery;
 import io.druid.query.metadata.metadata.SegmentMetadataQuery;
+import io.druid.query.scan.ScanQuery;
 import io.druid.query.search.search.SearchQuery;
 import io.druid.query.select.SelectQuery;
 import io.druid.query.spec.QuerySegmentSpec;
@@ -46,6 +47,7 @@
     @JsonSubTypes.Type(name = Query.SEARCH, value = SearchQuery.class),
     @JsonSubTypes.Type(name = Query.TIME_BOUNDARY, value = TimeBoundaryQuery.class),
     @JsonSubTypes.Type(name = Query.GROUP_BY, value = GroupByQuery.class),
+    @JsonSubTypes.Type(name = Query.SCAN, value = ScanQuery.class),
     @JsonSubTypes.Type(name = Query.SEGMENT_METADATA, value = SegmentMetadataQuery.class),
     @JsonSubTypes.Type(name = Query.SELECT, value = SelectQuery.class),
     @JsonSubTypes.Type(name = Query.TOPN, value = TopNQuery.class),
@@ -58,6 +60,7 @@ public interface Query<T>
   String SEARCH = "search";
   String TIME_BOUNDARY = "timeBoundary";
   String GROUP_BY = "groupBy";
+  String SCAN = "scan";
   String SEGMENT_METADATA = "segmentMetadata";
   String SELECT = "select";
   String TOPN = "topN";

diff --git a/processing/src/main/java/io/druid/query/extraction/ExtractionFn.java b/processing/src/main/java/io/druid/query/extraction/ExtractionFn.java
@@ -22,6 +22,7 @@
 import com.fasterxml.jackson.annotation.JsonSubTypes;
 import com.fasterxml.jackson.annotation.JsonTypeInfo;
 import io.druid.guice.annotations.ExtensionPoint;
+import io.druid.java.util.common.Cacheable;
 import io.druid.query.lookup.LookupExtractionFn;
 import io.druid.query.lookup.RegisteredLookupExtractionFn;
 
@@ -57,16 +58,8 @@
  * regular expression with a capture group.  When the regular expression matches the value of a dimension,
  * the value captured by the group is used for grouping operations instead of the dimension value.
  */
-public interface ExtractionFn
+public interface ExtractionFn extends Cacheable
 {
-  /**
-   * Returns a byte[] unique to all concrete implementations of DimExtractionFn.  This byte[] is used to
-   * generate a cache key for the specific query.
-   *
-   * @return a byte[] unit to all concrete implements of DimExtractionFn
-   */
-  public byte[] getCacheKey();
-
   /**
    * The "extraction" function.  This should map an Object into some String value.
    * <p>