Update Spark ingestion instructions
velvia committed Jan 30, 2016
1 parent 778d48f commit 325cb1d
Showing 1 changed file with 13 additions and 9 deletions.

README.md
@@ -152,6 +152,8 @@ You may specify a function, or computed column, for use with any key column. Th
| :-------- | :-------------- | :---------- |
| string | returns a constant string value | `:string /0` |
| getOrElse | returns default value if column value is null | `:getOrElse columnA ---` |
+| round | rounds down a numeric column. Useful for bucketing by time or bucketing numeric IDs. | `:round timestamp 10000` |
+| stringPrefix | takes the first N chars of a string; good for partitioning | `:stringPrefix token 4` |
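For illustration only (not part of this commit), here is a minimal sketch of how these computed columns might plug into the write options shown later in this diff. The dataset name and columns (`readings`, `eventId`, `timestamp`, `token`) are hypothetical:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical example: bucket rows into segments by rounding a numeric
// timestamp, and partition by the first four characters of a token column.
// `df` is assumed to be an existing DataFrame with these columns.
df.write.format("filodb.spark").
  option("dataset", "readings").
  option("row_keys", "eventId").
  option("segment_key", ":round timestamp 10000").
  option("partition_keys", ":stringPrefix token 4").
  mode(SaveMode.Overwrite).save()
```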
### FiloDB vs Cassandra Data Modelling
@@ -261,7 +263,7 @@ You can follow along using the [Spark Notebook](http://github.com/andypetrella/s
Or you can start a spark-shell locally,
```bash
-bin/spark-shell --jars ../FiloDB/spark/target/scala-2.10/filodb-spark-assembly-0.1-SNAPSHOT.jar --packages com.databricks:spark-csv_2.10:1.2.0 --driver-memory 3G --executor-memory 3G
+bin/spark-shell --jars ../FiloDB/spark/target/scala-2.10/filodb-spark-assembly-0.2-SNAPSHOT.jar --packages com.databricks:spark-csv_2.10:1.2.0 --driver-memory 3G --executor-memory 3G
```

Loading a CSV file from Spark:
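The CSV-loading code itself is collapsed in this diff view; a minimal sketch of what it typically looks like with the spark-csv package loaded above. The file path is a placeholder:

```scala
// Sketch only: read a GDELT CSV into a DataFrame using spark-csv 1.2.0
// (pulled in via --packages above). `sqlContext` comes from spark-shell.
val csvDF = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("/path/to/GDELT_sample.csv")
```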
@@ -279,18 +281,18 @@ import org.apache.spark.sql.SaveMode

scala> csvDF.write.format("filodb.spark").
         option("dataset", "gdelt").
-        option("sort_column", "GLOBALEVENTID").
+        option("row_keys", "GLOBALEVENTID").
+        option("segment_key", ":round GLOBALEVENTID 10000").
         option("partition_keys", ":getOrElse MonthYear -1").
         mode(SaveMode.Overwrite).save()
```

-Or, specifying the `partition_column`,
+Above, we partition the GDELT dataset by MonthYear, creating roughly 72 partitions for 1979-1984 (six years × 12 months), with the unique GLOBALEVENTID used as a row key. We group every 10000 eventIDs into a segment using the convenient `:round` computed column. You could, of course, use multiple columns for the partition or row keys. For example, to partition by country code and year instead:

```scala
scala> csvDF.write.format("filodb.spark").
         option("dataset", "gdelt").
-        option("sort_column", "GLOBALEVENTID").
-        option("partition_column", "MonthYear").
+        option("row_keys", "GLOBALEVENTID").
+        option("segment_key", ":round GLOBALEVENTID 10000").
+        option("partition_keys", ":getOrElse Actor2CountryCode NONE,:getOrElse Year -1").
         mode(SaveMode.Overwrite).save()
```

Note that for efficient columnar encoding, wide rows with fewer partition keys are better for performance.
@@ -313,7 +315,7 @@ For an example, see the [StreamingTest](spark/src/test/scala/filodb.spark/Stream
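The streaming test itself is not shown in this diff; a rough sketch of the general pattern with Spark 1.x Streaming, writing each micro-batch with the same options as above. The DStream, its record type, and the use of SaveMode.Append are all assumptions, not the actual StreamingTest code:

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.dstream.DStream

// Assumed record type convertible to a DataFrame via sqlContext.implicits.
case class Event(GLOBALEVENTID: Int, MonthYear: Int)

def ingest(stream: DStream[Event], sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  stream.foreachRDD { rdd =>
    // Write each micro-batch to FiloDB with the same key options as above.
    rdd.toDF().write.format("filodb.spark").
      option("dataset", "gdelt").
      option("row_keys", "GLOBALEVENTID").
      option("segment_key", ":round GLOBALEVENTID 10000").
      option("partition_keys", ":getOrElse MonthYear -1").
      mode(SaveMode.Append).save()
  }
}
```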
Start Spark-SQL:

```bash
-bin/spark-sql --jars path/to/FiloDB/spark/target/scala-2.10/filodb-spark-assembly-0.1-SNAPSHOT.jar
+bin/spark-sql --jars path/to/FiloDB/spark/target/scala-2.10/filodb-spark-assembly-0.2-SNAPSHOT.jar
```

@@ -328,6 +330,8 @@ Create a temporary table using an existing dataset,
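The DDL itself is collapsed in this view; a sketch using the standard Spark 1.x data source syntax, which is assumed to apply here. In the spark-sql shell you would run the bare statement; via Scala it looks like:

```scala
// Sketch: register the existing FiloDB dataset "gdelt" as a temporary table.
sqlContext.sql(
  """CREATE TEMPORARY TABLE gdelt
     USING filodb.spark
     OPTIONS (dataset "gdelt")""")
```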

Then, start running SQL queries!

+NOTE: The above syntax should also work with remote SQL clients like beeline / spark-beeline. Just run the Hive ThriftServer that comes with Spark (note that not all Spark distributions include it; you may need to build it yourself).

### Querying Datasets

Now do some queries, using the DataFrame DSL:
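The example queries are truncated in this view; a couple of hedged samples using the GDELT columns from earlier, reading the dataset back through the generic data source API (assumed to be supported by the FiloDB data source):

```scala
// Sketch: load the FiloDB dataset as a DataFrame and query it with the DSL.
val gdelt = sqlContext.read.format("filodb.spark").option("dataset", "gdelt").load()

gdelt.select("GLOBALEVENTID", "MonthYear").limit(10).show()
gdelt.groupBy("MonthYear").count().orderBy("MonthYear").show()
```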
