Support ingestion of long/float dimensions (apache#3966)
* Support ingestion for long/float dimensions

* Allow non-arrays for key components in indexing type strategy interfaces

* Add numeric index merge test, fixes

* Docs for numeric dims at ingestion

* Remove unused import

* Adjust docs, add aggregate on numeric dims tests

* remove unused imports

* Throw exception for bitmap method on numerics

* Move typed selector creation to DimensionIndexer interface

* unused imports

* Fix

* Remove unused DimensionSpec from indexer methods, check for dims first in inc index storage adapter

* Remove spaces
jon-wei authored and fjy committed Mar 1, 2017
1 parent 5ccfdcc commit a08660a
Showing 39 changed files with 1,773 additions and 145 deletions.
@@ -165,7 +165,7 @@ public void setup(BitmapIterationBenchmark state)

/**
* Benchmark of cumulative cost of construction of an immutable bitmap and then iterating over it. This is a pattern
* from realtime nodes, see {@link io.druid.segment.StringDimensionIndexer#fillBitmapsFromUnsortedEncodedArray}.
* from realtime nodes, see {@link io.druid.segment.StringDimensionIndexer#fillBitmapsFromUnsortedEncodedKeyComponent}.
* However this benchmark is yet approximate and to be improved to better reflect actual workloads of realtime nodes.
*/
@Benchmark
68 changes: 66 additions & 2 deletions docs/content/ingestion/index.md
@@ -36,7 +36,32 @@ An example dataSchema is shown below:
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensions": [
"page",
"language",
"user",
"unpatrolled",
"newPage",
"robot",
"anonymous",
"namespace",
"continent",
"country",
"region",
"city",
{
"type": "long",
"name": "countryNum"
},
{
"type": "float",
"name": "userLatitude"
},
{
"type": "float",
"name": "userLongitude"
}
],
"dimensionExclusions" : [],
"spatialDimensions" : []
}
@@ -169,10 +194,49 @@ handle all formatting decisions on their own, without using the ParseSpec.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| dimensions | JSON String array | The names of the dimensions. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as dimension columns. | yes |
| dimensions | JSON array | A list of [dimension schema](#dimension-schema) objects or dimension names. Providing a name is equivalent to providing a String-typed dimension schema with the given name. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as String-typed dimension columns. | yes |
| dimensionExclusions | JSON String array | The names of dimensions to exclude from ingestion. | no (default == []) |
| spatialDimensions | JSON Object array | An array of [spatial dimensions](../development/geo.html) | no (default == []) |

#### Dimension Schema
A dimension schema specifies the type and name of a dimension to be ingested.

For example, the following `dimensionsSpec` section from a `dataSchema` ingests one column as Long (`countryNum`), two columns as Float (`userLatitude`, `userLongitude`), and the other columns as Strings:

```json
"dimensionsSpec" : {
"dimensions": [
"page",
"language",
"user",
"unpatrolled",
"newPage",
"robot",
"anonymous",
"namespace",
"continent",
"country",
"region",
"city",
{
"type": "long",
"name": "countryNum"
},
{
"type": "float",
"name": "userLatitude"
},
{
"type": "float",
"name": "userLongitude"
}
],
"dimensionExclusions" : [],
"spatialDimensions" : []
}
```


## GranularitySpec

The default granularity spec is `uniform`, and can be changed by setting the `type` field.
11 changes: 10 additions & 1 deletion docs/content/ingestion/schema-design.md
@@ -12,14 +12,21 @@ of OLAP data.
For more detailed information:

* Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.
* Dimensions are fields that can be filtered on or grouped by. They are always either single Strings or arrays of Strings.
* Dimensions are fields that can be filtered on or grouped by. They are always single Strings, arrays of Strings, single Longs, or single Floats.
* Metrics are fields that can be aggregated. They are often stored as numbers (integers or floats) but can also be stored as complex objects like HyperLogLog sketches or approximate histogram sketches.

Typical production tables (or datasources as they are known in Druid) have fewer than 100 dimensions and fewer
than 100 metrics, although, based on user testimony, datasources with thousands of dimensions have been created.

Below, we outline some best practices with schema design:

## Numeric dimensions

If a column should be ingested as a numeric-typed dimension (Long or Float), its type must be specified in the `dimensions` section of the `dimensionsSpec`. If the type is omitted, Druid will ingest the column as the default String type.

See [Dimension Schema](../ingestion/index.html#dimension-schema) for more information.


## High cardinality dimensions (e.g. unique IDs)

In practice, we see that exact counts for unique IDs are often not required. Storing unique IDs as a column will kill
@@ -77,6 +84,8 @@ a dimension that has been excluded, or a metric column as a dimension. It should
these segments will be slightly larger than if the list of dimensions was explicitly specified in lexicographic order. This limitation
does not impact query correctness - just storage requirements.

Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.

## Including the same column as a dimension and a metric

One workflow with unique IDs is to be able to filter on a particular ID, while still being able to do fast unique counts on the ID column.
@@ -133,6 +133,8 @@ public void testFullOnSelect()
ScanResultValue.timestampKey,
"market",
"quality",
"qualityLong",
"qualityFloat",
"qualityNumericString",
"placement",
"placementish",
@@ -141,9 +143,7 @@ public void testFullOnSelect()
"index",
"indexMin",
"indexMaxPlusTen",
"quality_uniques",
"qualityLong",
"qualityFloat"
"quality_uniques"
);
ScanQuery query = newTestQuery()
.intervals(I_0112_0114)
82 changes: 45 additions & 37 deletions processing/src/main/java/io/druid/segment/DimensionHandler.java
@@ -50,10 +50,15 @@
*
* The EncodedType and ActualType are Comparable because columns used as dimensions must have sortable values.
*
* @param <EncodedType> class of the encoded values
* @param <ActualType> class of the actual values
* @param <EncodedType> class of a single encoded value
* @param <EncodedKeyComponentType> A row key contains a component for each dimension; this parameter specifies the
*                                  class of this dimension's key component. A column type that supports multivalue rows
*                                  should use an array type (Strings would use int[]). Column types without multivalue
*                                  row support should use single objects (e.g., Long, Float).
* @param <ActualType> class of a single actual value
*/
public interface DimensionHandler<EncodedType extends Comparable<EncodedType>, EncodedTypeArray, ActualType extends Comparable<ActualType>>
public interface DimensionHandler
<EncodedType extends Comparable<EncodedType>, EncodedKeyComponentType, ActualType extends Comparable<ActualType>>
{
/**
* Get the name of the column associated with this handler.
@@ -66,12 +71,12 @@ public interface DimensionHandler<EncodedType extends Comparable<EncodedType>, E


/**
* Creates a new DimensionIndexer, a per-dimension object responsible for processing ingested rows in-memory, used by the
* IncrementalIndex. See {@link DimensionIndexer} interface for more information.
* Creates a new DimensionIndexer, a per-dimension object responsible for processing ingested rows in-memory, used
* by the IncrementalIndex. See {@link DimensionIndexer} interface for more information.
*
* @return A new DimensionIndexer object.
*/
DimensionIndexer<EncodedType, EncodedTypeArray, ActualType> makeIndexer();
DimensionIndexer<EncodedType, EncodedKeyComponentType, ActualType> makeIndexer();


/**
@@ -88,7 +93,7 @@ public interface DimensionHandler<EncodedType extends Comparable<EncodedType>, E
* @return A new DimensionMergerV9 object.
*/
DimensionMergerV9<EncodedTypeArray> makeMerger(
DimensionMergerV9<EncodedKeyComponentType> makeMerger(
IndexSpec indexSpec,
File outDir,
IOPeon ioPeon,
@@ -98,8 +103,8 @@ DimensionMergerV9<EncodedTypeArray> makeMerger(


/**
* Creates a new DimensionMergerLegacy, a per-dimension object responsible for merging indexes/row data across segments
* and building the on-disk representation of a dimension. For use with IndexMerger only.
* Creates a new DimensionMergerLegacy, a per-dimension object responsible for merging indexes/row data across
* segments and building the on-disk representation of a dimension. For use with IndexMerger only.
*
* See {@link DimensionMergerLegacy} interface for more information.
*
@@ -111,7 +116,7 @@ DimensionMergerV9<EncodedTypeArray> makeMerger(
* @return A new DimensionMergerLegacy object.
*/
DimensionMergerLegacy<EncodedTypeArray> makeLegacyMerger(
DimensionMergerLegacy<EncodedKeyComponentType> makeLegacyMerger(
IndexSpec indexSpec,
File outDir,
IOPeon ioPeon,
@@ -120,53 +125,55 @@ DimensionMergerLegacy<EncodedTypeArray> makeLegacyMerger(
) throws IOException;

/**
* Given an array representing a single set of row value(s) for this dimension as an Object,
* return the length of the array after appropriate type-casting.
* Given a key component representing a single set of row value(s) for this dimension as an Object,
* return the length of the key component after appropriate type-casting.
*
* For example, a dictionary encoded String dimension would receive an int[] as an Object.
* For example, a dictionary encoded String dimension would receive an int[] as input to this method,
* while a Long numeric dimension would receive a single Long object (no multivalue support).
*
* @param dimVals Array of row values
* @param dimVals Values for this dimension from a row
* @return Size of dimVals
*/
int getLengthFromEncodedArray(EncodedTypeArray dimVals);
int getLengthOfEncodedKeyComponent(EncodedKeyComponentType dimVals);


/**
* Given two arrays representing sorted encoded row value(s), return the result of their comparison.
* Given two key components representing sorted encoded row value(s), return the result of their comparison.
*
* If the two arrays have different lengths, the shorter array should be ordered first in the comparison.
* If the two key components have different lengths, the shorter component should be ordered first in the comparison.
*
* Otherwise, this function should iterate through the array values and return the comparison of the first difference.
* Otherwise, this function should iterate through the key components and return the comparison of the
* first difference.
*
* @param lhs array of row values
* @param rhs array of row values
* For dimensions that do not support multivalue rows, lhs and rhs can be compared directly.
*
* @return integer indicating comparison result of arrays
* @param lhs key component from a row
* @param rhs key component from a row
*
* @return integer indicating comparison result of key components
*/
int compareSortedEncodedArrays(EncodedTypeArray lhs, EncodedTypeArray rhs);
int compareSortedEncodedKeyComponents(EncodedKeyComponentType lhs, EncodedKeyComponentType rhs);


/**
* Given two arrays representing sorted encoded row value(s), check that the two arrays have the same encoded values,
* or if the encoded values differ, that they translate into the same actual values, using the mappings
* provided by lhsEncodings and rhsEncodings (if applicable).
* Given two key components representing sorted encoded row value(s), check that the two key components
* have the same encoded values, or if the encoded values differ, that they translate into the same actual values,
* using the mappings provided by lhsEncodings and rhsEncodings (if applicable).
*
* If validation fails, this method should throw a SegmentValidationException.
*
* Used by IndexIO for validating segments.
*
* See StringDimensionHandler.validateSortedEncodedArrays() for a reference implementation.
* See StringDimensionHandler.validateSortedEncodedKeyComponents() for a reference implementation.
*
* @param lhs array of row values
* @param rhs array of row values
* @param lhs key component from a row
* @param rhs key component from a row
* @param lhsEncodings encoding lookup from lhs's segment, null if not applicable for this dimension's type
* @param rhsEncodings encoding lookup from rhs's segment, null if not applicable for this dimension's type
*
* @return integer indicating comparison result of arrays
*/
void validateSortedEncodedArrays(
EncodedTypeArray lhs,
EncodedTypeArray rhs,
void validateSortedEncodedKeyComponents(
EncodedKeyComponentType lhs,
EncodedKeyComponentType rhs,
Indexed<ActualType> lhsEncodings,
Indexed<ActualType> rhsEncodings
) throws SegmentValidationException;
@@ -186,15 +193,16 @@ void validateSortedEncodedArrays(


/**
* Given a subcolumn from getSubColumn, and the index of the current row, retrieve a row as an array of values.
* Given a subcolumn from getSubColumn, and the index of the current row, retrieve a dimension's values
* from a row as an EncodedKeyComponentType.
*
* For example:
* - A String-typed implementation would read the current row from a DictionaryEncodedColumn as an int[].
* - A long-typed implementation would read the current row from a GenericColumn and return the current row as a long[].
* - A long-typed implementation would read the current row from a GenericColumn and return a Long.
*
* @param column Column for this dimension from a QueryableIndex
* @param currRow The index of the row to retrieve
* @return The row from "column" specified by "currRow", as an array of values
* @return The key component for this dimension from the current row of the column.
*/
Object getRowValueArrayFromColumn(Closeable column, int currRow);
EncodedKeyComponentType getEncodedKeyComponentFromColumn(Closeable column, int currRow);
}
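
To make the revised type parameters concrete, below is a minimal sketch of how a single-valued Long dimension might parametrize this interface. It is an illustration under assumptions, not the `LongDimensionHandler` added by this commit: the class name is hypothetical, and only the two methods whose contracts are spelled out in the javadoc above are filled in.

```java
package io.druid.segment;

// Illustrative sketch only (hypothetical class, not part of this commit).
// For a single-valued Long dimension, EncodedType, EncodedKeyComponentType and
// ActualType all collapse to Long: there is no dictionary encoding and no
// multivalue row support, so the key component is a single Long rather than an array.
public abstract class LongDimensionHandlerSketch implements DimensionHandler<Long, Long, Long>
{
  @Override
  public int getLengthOfEncodedKeyComponent(Long dimVals)
  {
    // A single-valued dimension always contributes exactly one value to the row key.
    return 1;
  }

  @Override
  public int compareSortedEncodedKeyComponents(Long lhs, Long rhs)
  {
    // No multivalue rows, so the key components are compared directly rather than
    // element by element as a String handler would compare int[] arrays.
    return lhs.compareTo(rhs);
  }

  // makeIndexer(), makeMerger(), makeLegacyMerger(), validateSortedEncodedKeyComponents()
  // and getEncodedKeyComponentFromColumn() are left abstract here; a real handler must
  // implement them as described in the interface javadoc.
}
```

By contrast, the String handler described in the javadoc uses an int[] key component, so its comparison and length methods have to walk the dictionary-encoded array rather than operate on a single value.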
@@ -64,6 +64,14 @@ public static DimensionHandler getHandlerFromCapabilities(
return new StringDimensionHandler(dimensionName, multiValueHandling);
}

if (capabilities.getType() == ValueType.LONG) {
return new LongDimensionHandler(dimensionName);
}

if (capabilities.getType() == ValueType.FLOAT) {
return new FloatDimensionHandler(dimensionName);
}

// Return a StringDimensionHandler by default (null columns will be treated as String typed)
return new StringDimensionHandler(dimensionName, multiValueHandling);
}
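
For context, a rough usage sketch of this dispatch follows. The enclosing class and exact parameter list are not shown in this excerpt, so the details below (the `DimensionHandlerUtils` class name, the argument order, and passing null for the multi-value handling strategy) are assumptions for illustration.

```java
import io.druid.segment.DimensionHandler;
import io.druid.segment.DimensionHandlerUtils;
import io.druid.segment.column.ColumnCapabilitiesImpl;
import io.druid.segment.column.ValueType;

public class HandlerDispatchSketch
{
  public static void main(String[] args)
  {
    // Capabilities describing a LONG-typed dimension column.
    ColumnCapabilitiesImpl capabilities = new ColumnCapabilitiesImpl().setType(ValueType.LONG);

    // Assumed signature: (dimension name, column capabilities, multi-value handling).
    // Multi-value handling only applies to String dimensions, so null is passed here.
    DimensionHandler handler =
        DimensionHandlerUtils.getHandlerFromCapabilities("countryNum", capabilities, null);

    // With this commit, a LONG column should yield a LongDimensionHandler instead of
    // falling through to the String default.
    System.out.println(handler.getClass().getSimpleName());
  }
}
```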