Merge pull request apache#2625 from gianm/clarify-parser-docs
Clarify parser docs.
fjy committed Mar 10, 2016
2 parents 68cffe1 + a2b1652 commit cf3965c
Showing 2 changed files with 43 additions and 14 deletions.
53 changes: 41 additions & 12 deletions docs/content/ingestion/index.md
@@ -81,15 +81,15 @@ If `type` is not included, the parser defaults to `string`.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `string`. | no |
| parseSpec | JSON Object | Specifies the format of the data. | yes |
| type | String | This should say `string` in general, or `hadoopyString` when used in a Hadoop indexing job. | no |
| parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data. | yes |

### Protobuf Parser

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `protobuf`. | no |
| parseSpec | JSON Object | Specifies the format of the data. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |

### Avro Stream Parser

@@ -99,7 +99,7 @@ This is for realtime ingestion. Make sure to include `druid-avro-extensions` as
|-------|------|-------------|----------|
| type | String | This should say `avro_stream`. | no |
| avroBytesDecoder | JSON Object | Specifies how to decode bytes to Avro record. | yes |
| parseSpec | JSON Object | Specifies the format of the data. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |

For example, using Avro stream parser with schema repo Avro bytes decoder:

@@ -117,7 +117,11 @@ For example, using Avro stream parser with schema repo Avro bytes decoder:
"url" : "${YOUR_SCHEMA_REPO_END_POINT}",
}
},
"parseSpec" : <standard_druid_parseSpec>
"parseSpec" : {
"type": "timeAndDims",
"timestampSpec": <standard timestampSpec>,
"dimensionsSpec": <standard dimensionsSpec>
}
}
```

@@ -157,7 +161,7 @@ This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `avro_hadoop`. | no |
| parseSpec | JSON Object | Specifies the format of the data. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |
| fromPigAvroStorage | Boolean | Specifies whether the data file is stored using AvroStorage. | no (default == false) |

For example, using Avro Hadoop parser with custom reader's schema file:
@@ -170,7 +174,11 @@ For example, using Avro Hadoop parser with custom reader's schema file:
"dataSource" : "",
"parser" : {
"type" : "avro_hadoop",
"parseSpec" : <standard_druid_parseSpec>
"parseSpec" : {
"type": "timeAndDims",
"timestampSpec": <standard timestampSpec>,
"dimensionsSpec": <standard dimensionsSpec>
}
}
},
"ioConfig" : {
@@ -192,18 +200,24 @@ For example, using Avro Hadoop parser with custom reader's schema file:

### ParseSpec

ParseSpecs serve two purposes:

- The String Parser uses them to determine the format (e.g. JSON, CSV, TSV) of incoming rows.
- All Parsers use them to determine the timestamp and dimensions of incoming rows.

If `format` is not included, the parseSpec defaults to `tsv`.

#### JSON ParseSpec

Use this with the String Parser to load JSON.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| format | String | This should say `json`. | no |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [Flattening JSON](./flatten-json.html) for more info. | no |
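
As a rough illustration of the fields in the table above, a JSON parseSpec might look like the following. The column names (`timestamp`, `page`, `language`, `user`) are hypothetical, and the bodies of `timestampSpec` and `dimensionsSpec` assume the usual `column`/`format` and `dimensions` fields; the optional `flattenSpec` is omitted.

```json
{
  "format": "json",
  "timestampSpec": {
    "column": "timestamp",
    "format": "iso"
  },
  "dimensionsSpec": {
    "dimensions": ["page", "language", "user"]
  }
}
```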



#### JSON Lowercase ParseSpec

This is a special variation of the JSON ParseSpec that lower cases all the column names in the incoming JSON data. This parseSpec is required if you are updating to Druid 0.7.x from Druid 0.6.x, are directly ingesting JSON with mixed case column names, do not have any ETL in place to lower case those column names, and would like to make queries that include the data you created using 0.6.x and 0.7.x.
@@ -214,9 +228,10 @@ This is a special variation of the JSON ParseSpec that lower cases all the colum
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |


#### CSV ParseSpec

Use this with the String Parser to load CSV. Strings are parsed using the net.sf.opencsv library.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| format | String | This should say `csv`. | yes |
@@ -225,7 +240,10 @@ This is a special variation of the JSON ParseSpec that lower cases all the colum
| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default == ctrl+A) |
| columns | JSON array | Specifies the columns of the data. | yes |
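
A minimal sketch of a CSV parseSpec, assuming a hypothetical input file whose columns are, in order, `timestamp`, `page`, `language`, and `added`. Because CSV rows carry no field names, `columns` is what maps each position to a name, and `timestampSpec` and `dimensionsSpec` then refer to those names.

```json
{
  "format": "csv",
  "columns": ["timestamp", "page", "language", "added"],
  "timestampSpec": {
    "column": "timestamp",
    "format": "iso"
  },
  "dimensionsSpec": {
    "dimensions": ["page", "language"]
  }
}
```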

#### TSV ParseSpec
#### TSV / Delimited ParseSpec

Use this with the String Parser to load any delimited text that does not require special escaping. By default,
the delimiter is a tab, so this will load TSV.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
@@ -236,7 +254,18 @@ This is a special variation of the JSON ParseSpec that lower cases all the colum
| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default == ctrl+A) |
| columns | JSON String array | Specifies the columns of the data. | yes |
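
For comparison, a tab-separated version of the same idea could be sketched as below. The column names are again hypothetical. No delimiter is set, so the default tab applies, and `format` is spelled out even though the parseSpec is described above as defaulting to `tsv`.

```json
{
  "format": "tsv",
  "columns": ["timestamp", "page", "language", "added"],
  "timestampSpec": {
    "column": "timestamp",
    "format": "iso"
  },
  "dimensionsSpec": {
    "dimensions": ["page", "language"]
  }
}
```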

### Timestamp Spec
#### TimeAndDims ParseSpec

Use this with non-String Parsers to provide them with timestamp and dimensions information. Non-String Parsers
handle all formatting decisions on their own, without using the ParseSpec.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| format | String | This should say `timeAndDims`. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
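
A possible shape for a non-String parser using this parseSpec is sketched below, with the `avro_hadoop` parser as the example and hypothetical column names. It follows the field table above, which names the key `format`; the Avro examples earlier in this diff write `"type": "timeAndDims"` instead, so it is worth checking which key your Druid version accepts.

```json
{
  "type": "avro_hadoop",
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "timestamp",
      "format": "iso"
    },
    "dimensionsSpec": {
      "dimensions": ["page", "language"]
    }
  }
}
```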

### TimestampSpec

| Field | Type | Description | Required |
|-------|------|-------------|----------|
@@ -20,7 +20,7 @@
package io.druid.data.input;

import io.druid.data.input.impl.DimensionsSpec;
import io.druid.data.input.impl.JSONParseSpec;
import io.druid.data.input.impl.TimeAndDimsParseSpec;
import io.druid.data.input.impl.TimestampSpec;
import org.joda.time.DateTime;
import org.junit.Test;
@@ -57,7 +57,7 @@ public void testParse() throws Exception

//configure parser with desc file
ProtoBufInputRowParser parser = new ProtoBufInputRowParser(
new JSONParseSpec(
new TimeAndDimsParseSpec(
new TimestampSpec("timestamp", "iso", null),
new DimensionsSpec(Arrays.asList(DIMENSIONS), Arrays.<String>asList(), null)
),