
Commit

addressed all current feedback including: removed unused images, corrected missing TMs, removed unnecessary newline changes, deleted outdated FAQ entry
kywe665 authored and ashvina committed Mar 8, 2024
1 parent 24bd0c9 commit f28c50a
Showing 23 changed files with 47 additions and 438 deletions.
34 changes: 9 additions & 25 deletions README.md
@@ -14,7 +14,6 @@ of a few interfaces, which we believe will facilitate the expansion of supported
future.

# Building the project and running tests.

1. Use Java 11 for building the project. If you are using some other Java version, you can
   use [jenv](https://github.com/jenv/jenv) to manage multiple Java versions locally.
2. Build the project using `mvn clean package`. Use `mvn clean package -DskipTests` to skip tests while building.
Expand All @@ -23,17 +22,14 @@ future.
4. Similarly, use `mvn clean verify` or `mvn verify` to run integration tests.
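For reference, the build commands above in one place:

```shell
# Build the project (add -DskipTests to skip running tests while building)
mvn clean package

# Run integration tests
mvn clean verify
```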

# Style guide

1. We use [Maven Spotless plugin](https://github.com/diffplug/spotless/tree/main/plugin-maven) and
[Google java format](https://github.com/google/google-java-format) for code style.
2. Use `mvn spotless:check` to find code style violations and `mvn spotless:apply` to fix them. The code style check is
   tied to the compile phase by default, so code style violations will lead to build failures.
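The two Spotless commands side by side:

```shell
# Report code style violations
mvn spotless:check

# Reformat the code in place to fix violations
mvn spotless:apply
```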

# Running the bundled jar

1. Get a pre-built bundled jar or create the jar with `mvn install -DskipTests`.
2. Create a yaml file that follows the format below:

```yaml
sourceFormat: HUDI
targetFormats:
@@ -53,12 +49,10 @@ datasets:
- tableBasePath: abfs://[email protected]/multi-partition-dataset
tableName: multi_partition_dataset
```
- `sourceFormat` is the format of the source table that you want to convert
- `targetFormats` is a list of formats you want to create from your source tables
- `tableBasePath` is the basePath of the table
- `tableDataPath` is an optional field specifying the path to the data files. If not specified, the tableBasePath will be used. For Iceberg source tables, you will need to specify the `/data` path.
- `namespace` is an optional field specifying the namespace of the table and will be used when syncing to a catalog.
- `partitionSpec` is a spec that allows us to infer partition values. This is only required for Hudi source tables. If
the table is not partitioned, leave it blank. If it is partitioned, you can specify a spec with a comma separated list
@@ -72,10 +66,8 @@ datasets:
- `HOUR`: same as `YEAR` but with hour granularity
- `format`: if your partition type is `YEAR`, `MONTH`, `DAY`, or `HOUR`, specify the format for the date string as it
appears in your file paths
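As an illustration only — assuming the `field:type:format` layout implied by the options above, with a hypothetical column name and date format — a spec for a table partitioned by a date column could look like:

```yaml
datasets:
  - tableBasePath: s3://my-bucket/path/to/hudi-table   # placeholder path
    tableName: my_partitioned_table
    # field:type:format — a date partition column rendered as yyyy-MM-dd in file paths
    partitionSpec: created_date:DAY:yyyy-MM-dd
```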

3. The default implementations of table format clients can be replaced with custom implementations by specifying a
client configs yaml file in the format below:

```yaml
# sourceClientProviderClass: The class name of a table format's client factory, where the client is
# used for reading from a table of this format. All user configurations, including hadoop config
@@ -85,49 +77,41 @@ datasets:
# used for writing to a table of this format.
# configuration: A map of configuration values specific to this client.
tableFormatsClients:
  HUDI:
    sourceClientProviderClass: io.onetable.hudi.HudiSourceClientProvider
  DELTA:
    targetClientProviderClass: io.onetable.delta.DeltaClient
    configuration:
      spark.master: local[2]
      spark.app.name: onetableclient
```

4. A catalog can be used when reading and updating Iceberg tables. The catalog can be specified in a yaml file and
passed in with the `--icebergCatalogConfig` option. The format of the catalog config file is:

```yaml
catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
key1: value1
key2: value2
```

5. Run with `java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--clientsConfig clients.yaml] [--icebergCatalogConfig catalog.yaml]`.
   The bundled jar includes Hadoop dependencies for AWS, Azure, and GCP. Authentication for AWS is done with
   `com.amazonaws.auth.DefaultAWSCredentialsProviderChain`. To override this setting, specify a different implementation
   with the `--awsCredentialsProvider` option.
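For example, a run that supplies a Hadoop config and overrides the AWS credentials provider might look like the following (the provider class name here is a placeholder, not a real implementation):

```shell
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar \
  --datasetConfig my_config.yaml \
  --hadoopConfig hdfs-site.xml \
  --awsCredentialsProvider com.example.MyCredentialsProvider
```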

# Contributing

## Setup

For setting up the repo on IntelliJ, open the project and change the Java version to Java 11 in File -> Project Structure.
![img.png](style/IDE.png)

Have you found a bug, or do you have a cool idea that you want to contribute to the project? Please file a GitHub issue [here](https://github.com/onetable-io/onetable/issues)

## Adding a new target format

Adding a new target format requires a developer to
implement [TargetClient](./api/src/main/java/io/onetable/spi/sync/TargetClient.java). Once you have implemented that
interface, you can integrate it into the [OneTableClient](./core/src/main/java/io/onetable/client/OneTableClient.java).
If you think others may find that target useful, please raise a Pull Request to add it to the project.
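As a rough structural sketch only — the interface and method names below are hypothetical placeholders, not the actual `TargetClient` API, so consult the interface source for the real contract:

```java
import java.util.List;

// Placeholder stand-in for the real io.onetable.spi.sync.TargetClient contract.
interface ExampleTargetClient {
  void beginSync(String tableName);
  void syncSchema(String schemaJson);
  void syncFiles(List<String> dataFilePaths);
  void completeSync();
}

// Skeleton target for a hypothetical "MYFORMAT" table format.
class MyFormatTargetClient implements ExampleTargetClient {
  @Override public void beginSync(String tableName) { /* open a commit/transaction in the target format */ }
  @Override public void syncSchema(String schemaJson) { /* translate and persist the schema */ }
  @Override public void syncFiles(List<String> dataFilePaths) { /* register data files in the target table's metadata */ }
  @Override public void completeSync() { /* finalize the new snapshot */ }
}
```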

## Overview of the sync process

![img.png](assets/images/sync_flow.jpg)
2 changes: 1 addition & 1 deletion website/blog/OneTable-is-now-Apache-XTable.md
@@ -1,6 +1,6 @@
---
title: "OneTable is now “Apache XTable™ (Incubating)”"
excerpt: "XTable is now Incubating in the Apache Software Foundation"
excerpt: "Apache XTable™ (Incubating) is now Incubating in the Apache Software Foundation"
author: Dipankar Mazumdar, JB Onofré
category: blog
image: /images/blog/XTable/xtable-cover.png
2 changes: 1 addition & 1 deletion website/docs/athena.md
@@ -4,7 +4,7 @@ title: "Amazon Athena"
---

# Querying from Amazon Athena
To read a Apache XTable™ synced target table (regardless of the table format) in Amazon Athena,
To read an Apache XTable™ synced target table (regardless of the table format) in Amazon Athena,
you can create the table either by:
* Using a DDL statement as mentioned in the following AWS docs:
* [Example](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html#querying-hudi-in-athena-creating-hudi-tables) for Hudi
14 changes: 7 additions & 7 deletions website/docs/biglake-metastore.md
@@ -7,7 +7,7 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Syncing to BigLake Metastore
This document walks through the steps to register a XTable synced Iceberg table in BigLake Metastore on GCP.
This document walks through the steps to register an Apache XTable™ (Incubating) synced Iceberg table in BigLake Metastore on GCP.

## Pre-requisites
1. Source (Hudi/Delta) table(s) already written to Google Cloud Storage.
@@ -19,21 +19,21 @@ This document walks through the steps to register a XTable synced Iceberg table
3. To ensure that the Storage Account API's caller (your service account used by XTable) has the
necessary permissions to write log/metadata files in GCS, ask your administrator to grant [Storage Object User](https://cloud.google.com/storage/docs/access-control/iam-roles) (roles/storage.objectUser)
access to the service account.
4. If you're running XTable outside GCP, you need to provide the machine access to interact with BigLake and GCS.
4. If you're running Apache XTable™ (Incubating) outside GCP, you need to provide the machine access to interact with BigLake and GCS.
To do so, store the permissions key for your service account in your machine using
```shell
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_key.json
```
5. Clone the XTable [repository](https://github.com/apache/incubator-xtable) and create the
5. Clone the Apache XTable™ (Incubating) [repository](https://github.com/apache/incubator-xtable) and create the
`utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](/docs/setup)
6. Download the [BigLake Iceberg JAR](gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar) locally.
XTable requires the JAR to be present in the classpath.
Apache XTable™ (Incubating) requires the JAR to be present in the classpath.
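For example, assuming the Google Cloud SDK's `gsutil` is installed, the JAR can be copied locally with:

```shell
gsutil cp gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar .
```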

## Steps
:::danger Important:
Currently BigLake Metastore is only accessible through Google's
[BigLake Rest APIs](https://cloud.google.com/bigquery/docs/reference/biglake/rest), and as such
XTable requires you to setup the below items prior to running sync on your source dataset.
Apache XTable™ (Incubating) requires you to setup the below items prior to running sync on your source dataset.
* BigLake Catalog
* BigLake Database
:::
@@ -114,7 +114,7 @@ catalogOptions:
warehouse: gs://path/to/warehouse
```

From your terminal under the cloned XTable directory, run the sync process using the below command.
From your terminal under the cloned Apache XTable™ (Incubating) directory, run the sync process using the below command.

```shell md title="shell"
java -cp utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar:/path/to/downloaded/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar io.onetable.utilities.RunSync --datasetConfig my_config.yaml --icebergCatalogConfig catalog.yaml
@@ -127,7 +127,7 @@ to interpret the data as an Iceberg table.
:::

### Validating the results
Once the sync succeeds, XTable would have written the table directly to BigLake Metastore.
Once the sync succeeds, Apache XTable™ (Incubating) would have written the table directly to BigLake Metastore.
We can use `Try this method` option on Google's REST reference docs for
[`projects.locations.catalogs.databases.tables.get`](https://cloud.google.com/bigquery/docs/reference/biglake/rest/v1/projects.locations.catalogs.databases.tables/get)
method to view the created table.
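For instance, assuming `gcloud` is configured and you substitute your own project, location, catalog, database, and table identifiers, the same lookup can be scripted against the REST endpoint referenced above:

```shell
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://biglake.googleapis.com/v1/projects/MY_PROJECT/locations/MY_LOCATION/catalogs/MY_CATALOG/databases/MY_DATABASE/tables/MY_TABLE"
```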
4 changes: 2 additions & 2 deletions website/docs/demo/docker.md
@@ -17,8 +17,8 @@ This demo was tested in both x86-64 and AArch64 based macOS operating systems
:::

## Setting up Docker cluster
After cloning the XTable repository, change directory to `demo` and run the `start_demo.sh` script.
This script builds XTable jars required for the demo and then spins up docker containers to start a Jupyter notebook
After cloning the Apache XTable™ (Incubating) repository, change directory to `demo` and run the `start_demo.sh` script.
This script builds Apache XTable™ (Incubating) jars required for the demo and then spins up docker containers to start a Jupyter notebook
with Scala interpreter, Hive Metastore, Presto and Trino.

```shell md title="shell"
14 changes: 7 additions & 7 deletions website/docs/hms.md
@@ -16,14 +16,14 @@ This document walks through the steps to register an Apache XTable™ (Incubatin
2. A compute instance where you can run Apache Spark. This can be your local machine, docker,
or a distributed system like Amazon EMR, Google Cloud's Dataproc, Azure HDInsight etc.
This is a required step to register the table in HMS using a Spark client.
3. Clone the XTable [repository](https://github.com/apache/incubator-xtable) and create the
3. Clone the XTable™ (Incubating) [repository](https://github.com/apache/incubator-xtable) and create the
`utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](/docs/setup)
4. This guide also assumes that you have configured the Hive Metastore locally or on EMR/Dataproc/HDInsight
   and that it is already running.

## Steps
### Running sync
Create `my_config.yaml` in the cloned XTable directory.
Create `my_config.yaml` in the cloned Apache XTable™ (Incubating) directory.
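The tabbed examples below give the exact per-format configuration; as a minimal, hypothetical sketch of the overall shape (paths and table names are placeholders):

```yaml
sourceFormat: HUDI          # or DELTA / ICEBERG
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/path/to/source-table
    tableName: my_table
```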

<Tabs
groupId="table-format"
@@ -86,7 +86,7 @@ datasets:
* ADLS - `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`
:::

From your terminal under the cloned XTable directory, run the sync process using the below command.
From your terminal under the cloned Apache XTable™ (Incubating) directory, run the sync process using the below command.
```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
```
@@ -97,7 +97,7 @@ directory with relevant metadata files that help query engines to interpret the
:::

### Register the target table in Hive Metastore
Now you need to register the XTable synced target table in Hive Metastore.
Now you need to register the Apache XTable™ (Incubating) synced target table in Hive Metastore.

<Tabs
groupId="table-format"
@@ -137,7 +137,7 @@ if you have your source table in S3/GCS/ADLS i.e.


Now you will be able to query the created table directly as a Hudi table from the same `spark` session or
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the XTable synced tables on
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on
[Presto](/docs/presto) or [Trino](/docs/trino) query engines for more information.

```sql md title="sql"
@@ -171,7 +171,7 @@ if you have your source table in S3/GCS/ADLS i.e.
:::

Now you will be able to query the created table directly as a Delta table from the same `spark` session or
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the XTable synced tables on
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on
[Presto](/docs/presto) or [Trino](/docs/trino) query engines for more information.

```sql md title="sql"
@@ -211,7 +211,7 @@ in S3/GCS/ADLS i.e.
:::

Now you will be able to query the created table directly as an Iceberg table from the same `spark` session or
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the XTable synced tables on
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on
[Presto](/docs/presto) or [Trino](/docs/trino) query engines for more information.

```sql md title="sql"
8 changes: 4 additions & 4 deletions website/docs/how-to.md
@@ -33,7 +33,7 @@ history to enable proper point in time queries.
* Google Cloud Storage by following the steps
[here](https://cloud.google.com/iam/docs/keys-create-delete#creating)

For the purpose of this tutorial, we will walk through the steps to using XTable locally.
For the purpose of this tutorial, we will walk through the steps to using Apache XTable™ (Incubating) locally.

## Steps

@@ -348,7 +348,7 @@ Authentication for GCP requires service account credentials to be exported. i.e.
`export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_key.json`
:::

In your terminal under the cloned XTable directory, run the below command.
In your terminal under the cloned Apache XTable™ (Incubating) directory, run the below command.

```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
@@ -359,9 +359,9 @@ At this point, if you check your local path, you will be able to see the necessary
commit history, partitions, and column stats that help query engines to interpret the data in the target table format.

## Conclusion
In this tutorial, we saw how to create a source table and use XTable to create the metadata files
In this tutorial, we saw how to create a source table and use Apache XTable™ (Incubating) to create the metadata files
that can be used to query the source table in different target table formats.

## Next steps
Go through the [Catalog Integration guides](/docs/catalogs-index) to register the XTable synced tables
Go through the [Catalog Integration guides](/docs/catalogs-index) to register the Apache XTable™ (Incubating) synced tables
in different data catalogs.
4 changes: 2 additions & 2 deletions website/docs/setup.md
@@ -4,7 +4,7 @@ This page covers the essential steps to setup Apache XTable™ (incubating) in y

## Pre-requisites
1. Building the project requires Java 11 and Maven to be setup and configured using PATH or environment variables.
2. Clone the XTable project GitHub [repository](https://github.com/apache/incubator-xtable) in your environment.
2. Clone the Apache XTable™ (Incubating) project GitHub [repository](https://github.com/apache/incubator-xtable) in your environment.

## Steps
#### Building the project
@@ -22,5 +22,5 @@ mvn clean package -DskipTests
For more information on the steps, follow the project's GitHub [README.md](https://github.com/apache/incubator-xtable/blob/main/README.md)

## Next Steps
See the [Quickstart](/docs/how-to) guide to learn to use XTable to add interoperability between
See the [Quickstart](/docs/how-to) guide to learn to use Apache XTable™ (Incubating) to add interoperability between
different table formats.
2 changes: 1 addition & 1 deletion website/docs/spark.md
@@ -7,7 +7,7 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Querying from Apache Spark
To read a Apache XTable™ (Incubating) synced target table (regardless of the table format) in Apache Spark locally or on services like
To read an Apache XTable™ (Incubating) synced target table (regardless of the table format) in Apache Spark locally or on services like
Amazon EMR, Google Cloud's Dataproc, Azure HDInsight, or Databricks, you do not need additional jars or configs
other than what is needed by the respective table formats.
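As a minimal, hypothetical sketch (the path is a placeholder), a Delta target synced this way reads from a Spark shell like any other Delta table:

```scala
// Replace the path with your own synced table location.
val df = spark.read.format("delta").load("s3://my-bucket/path/to/synced-table")
df.show()
```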

