Spark2Elasticsearch

Spark Library for Bulk Loading into Elasticsearch

Requirements

Spark2Elasticsearch supports Spark 1.4 and above.

| Spark2Elasticsearch Version | Elasticsearch Version |
| --- | --- |
| `2.0.X` | `2.0.X` |
| `2.1.X` | `2.1.X` |

Downloads

SBT

libraryDependencies += "com.github.jparkie" %% "spark2elasticsearch" % "2.0.2"

Or:

libraryDependencies += "com.github.jparkie" %% "spark2elasticsearch" % "2.1.2"

Add the following resolvers if needed:

resolvers += "Sonatype OSS Releases" at "https://oss.sonatype.org/content/repositories/releases"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

Maven

<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>spark2elasticsearch_2.10</artifactId>
  <version>x.y.z-SNAPSHOT</version>
</dependency>

Spark2Elasticsearch is also planned to be published on Spark Packages: http://spark-packages.org/

Features

Uses the Elasticsearch Java API with a TransportClient to bulk load data from a DataFrame into Elasticsearch.

Usage

Bulk Loading into Elasticsearch

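// Import the following to have access to the `bulkLoadToEs()` function.
import com.github.jparkie.spark.elasticsearch.sql._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Reuse the running SparkContext and SQLContext if they already exist.
val sparkConf = new SparkConf()
val sc = SparkContext.getOrCreate(sparkConf)
val sqlContext = SQLContext.getOrCreate(sc)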

val df = sqlContext.read.parquet("<PATH>")

// Specify the `index` and the `type` to write.
df.bulkLoadToEs(
  esIndex = "twitter",
  esType = "tweets"
)

For more details, refer to SparkEsDataFrameFunctions.scala.

Configurations

When adding configurations through spark-submit, prefix property names with `spark.`.
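
As a rough illustration of the prefixing rule, the sketch below builds `--conf` arguments for spark-submit from unprefixed property names. `SparkSubmitConf` and `toSubmitArgs` are hypothetical names used only for this example and are not part of Spark2Elasticsearch.

```scala
// Hypothetical helper, not part of Spark2Elasticsearch.
object SparkSubmitConf {
  // Turns (property, value) pairs into spark-submit `--conf` arguments,
  // adding the `spark.` prefix to any property name that lacks it.
  def toSubmitArgs(props: Seq[(String, String)]): Seq[String] =
    props.flatMap { case (key, value) =>
      val prefixed = if (key.startsWith("spark.")) key else s"spark.$key"
      Seq("--conf", s"$prefixed=$value")
    }
}
```

For example, `SparkSubmitConf.toSubmitArgs(Seq("es.nodes" -> "localhost:9300"))` yields `--conf` followed by `spark.es.nodes=localhost:9300`; the host here is a placeholder.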

SparkEsMapperConf

For more details, refer to SparkEsMapperConf.scala.

| Property Name | Default | Description |
| --- | --- | --- |
| `es.mapping.id` | None | The document field/property name containing the document id. |
| `es.mapping.parent` | None | The document field/property name containing the document parent. To specify a constant, use the `<CONSTANT>` format. |
| `es.mapping.version` | None | The document field/property name containing the document version. To specify a constant, use the `<CONSTANT>` format. |
| `es.mapping.version.type` | None | Indicates the type of versioning used. See http://www.elastic.co/guide/en/elasticsearch/reference/2.0/docs-index_.html#_version_types. If `es.mapping.version` is undefined (default), its value is unspecified. If `es.mapping.version` is specified, its value becomes `external`. |
| `es.mapping.routing` | None | The document field/property name containing the document routing. To specify a constant, use the `<CONSTANT>` format. |
| `es.mapping.ttl` | None | The document field/property name containing the document time-to-live. To specify a constant, use the `<CONSTANT>` format. |
| `es.mapping.timestamp` | None | The document field/property name containing the document timestamp. To specify a constant, use the `<CONSTANT>` format. |
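
The default described for `es.mapping.version.type` can be read as a small rule. The sketch below is only an illustration of that rule over a plain `Map`, not the library's implementation; `VersionTypeDefault` and `resolve` are hypothetical names, and the assumption that an explicitly set `es.mapping.version.type` takes precedence is mine.

```scala
// Hypothetical illustration, not the library's code.
object VersionTypeDefault {
  // Returns the effective version type for a set of mapper properties.
  def resolve(conf: Map[String, String]): Option[String] =
    conf.get("es.mapping.version.type").orElse {
      // Assumption: with no explicit type, a configured version field
      // implies external versioning; otherwise the type stays unspecified.
      conf.get("es.mapping.version").map(_ => "external")
    }
}
```

Under these assumptions, an empty configuration resolves to `None`, and a configuration that only sets `es.mapping.version` resolves to `Some("external")`.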

SparkEsTransportClientConf

For more details, refer to SparkEsTransportClientConf.scala.

| Property Name | Default | Description |
| --- | --- | --- |
| `es.nodes` | Required | The minimum set of hosts to connect to when establishing a client. Entries are comma-separated; host and port within an entry are colon-separated. |
| `es.port` | 9300 | The port to connect to when establishing a client. |
| `es.cluster.name` | None | The name of the Elasticsearch cluster to connect to. |
| `es.client.transport.sniff` | None | If set to true, the client discovers the addresses of other nodes to connect to. |
| `es.client.transport.ignore_cluster_name` | None | Set to true to ignore cluster name validation of connected nodes. |
| `es.client.transport.ping_timeout` | 5s | The time to wait for a ping response from a node. |
| `es.client.transport.nodes_sampler_interval` | 5s | How often to sample / ping the nodes listed and connected. |
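
To make the `es.nodes` format concrete, here is a minimal parsing sketch. It is not the library's parser: `EsNodes`, `Node`, and `parse` are hypothetical names, and falling back to the `es.port` default of 9300 when an entry has no port is an assumption. IPv6 literals are not handled.

```scala
// Hypothetical illustration, not the library's parser.
object EsNodes {
  final case class Node(host: String, port: Int)

  // Splits a comma-separated list of `host` or `host:port` entries.
  def parse(esNodes: String, defaultPort: Int = 9300): Seq[Node] =
    esNodes.split(",").toSeq.map(_.trim).filter(_.nonEmpty).map { entry =>
      entry.split(":", 2) match {
        case Array(host, port) => Node(host, port.toInt)
        // Assumption: an entry without a port uses the default port.
        case Array(host)       => Node(host, defaultPort)
      }
    }
}
```

With this sketch, `EsNodes.parse("es1:9301, es2")` gives one node on port 9301 and one on port 9300; the host names are placeholders.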

SparkEsWriteConf

For more details, refer to SparkEsWriteConf.scala.

| Property Name | Default | Description |
| --- | --- | --- |
| `es.batch.size.entries` | 1000 | The number of IndexRequests to batch in one request. |
| `es.batch.size.bytes` | 5 | The maximum size in MB of a batch. |
| `es.batch.concurrent.request` | 1 | The number of concurrent requests in flight. |
| `es.batch.flush.timeout` | 10 | The maximum time in seconds to wait while closing a BulkProcessor. |
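
The two batch size limits can be pictured with a small simulation. This is a sketch under stated assumptions, not the library's or Elasticsearch's code: `BatchSketch`, `Limits`, and `batches` are hypothetical names, a batch is assumed to close as soon as either limit is reached, and 1 MB is taken as 1024 * 1024 bytes.

```scala
// Hypothetical simulation of count-based and size-based batching.
object BatchSketch {
  // Defaults mirror es.batch.size.entries (1000) and es.batch.size.bytes (5 MB).
  final case class Limits(maxEntries: Int = 1000, maxBytes: Long = 5L * 1024 * 1024)

  // Groups document sizes (in bytes) into batches, closing a batch once
  // either the entry count or the accumulated byte size reaches its limit.
  def batches(docSizes: Seq[Long], limits: Limits = Limits()): Seq[Seq[Long]] = {
    val done = Vector.newBuilder[Seq[Long]]
    var current = Vector.empty[Long]
    var bytes = 0L
    for (size <- docSizes) {
      current :+= size
      bytes += size
      if (current.size >= limits.maxEntries || bytes >= limits.maxBytes) {
        done += current
        current = Vector.empty
        bytes = 0L
      }
    }
    if (current.nonEmpty) done += current // remaining partial batch
    done.result()
  }
}
```

Raising either limit produces fewer, larger requests; lowering it produces more, smaller ones.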

Documentation

Scaladocs are currently unavailable.
