
Commit

addressed all current feedback including: removed unused images, corrected missing TMs, removed unnecessary newline changes, deleted outdated FAQ entry
kywe665 authored and ashvina committed Mar 8, 2024
1 parent 24bd0c9 commit f28c50a
Showing 23 changed files with 47 additions and 438 deletions.
34 changes: 9 additions & 25 deletions README.md
@@ -14,7 +14,6 @@ of a few interfaces, which we believe will facilitate the expansion of supported
future.

# Building the project and running tests.

1. Use Java 11 for building the project. If you are using some other Java version, you can
   use [jenv](https://github.com/jenv/jenv) to manage multiple Java versions locally.
2. Build the project using `mvn clean package`. Use `mvn clean package -DskipTests` to skip tests while building.
Expand All @@ -23,17 +22,14 @@ future.
4. Similarly, use `mvn clean verify` or `mvn verify` to run integration tests.
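For reference, the build commands above in one place:

```shell
# Build the project (add -DskipTests to skip running tests while building)
mvn clean package

# Run integration tests
mvn clean verify
```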

# Style guide

1. We use [Maven Spotless plugin](https://github.com/diffplug/spotless/tree/main/plugin-maven) and
[Google java format](https://github.com/google/google-java-format) for code style.
2. Use `mvn spotless:check` to find code style violations and `mvn spotless:apply` to fix them. The code style check is
   tied to the compile phase by default, so code style violations will lead to build failures.
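The two Spotless commands side by side:

```shell
# Report code style violations
mvn spotless:check

# Reformat the code in place to fix violations
mvn spotless:apply
```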

# Running the bundled jar

1. Get a pre-built bundled jar or create the jar with `mvn install -DskipTests`.
2. Create a yaml file that follows the format below:

```yaml
sourceFormat: HUDI
targetFormats:
@@ -53,12 +49,10 @@ datasets:
- tableBasePath: abfs://[email protected]/multi-partition-dataset
tableName: multi_partition_dataset
```
- `sourceFormat` is the format of the source table that you want to convert
- `targetFormats` is a list of formats you want to create from your source tables
- `tableBasePath` is the basePath of the table
- `tableDataPath` is an optional field specifying the path to the data files. If not specified, the tableBasePath will be used. For Iceberg source tables, you will need to specify the `/data` path.
- `namespace` is an optional field specifying the namespace of the table and will be used when syncing to a catalog.
- `partitionSpec` is a spec that allows us to infer partition values. This is only required for Hudi source tables. If
the table is not partitioned, leave it blank. If it is partitioned, you can specify a spec with a comma separated list
@@ -72,10 +66,8 @@ datasets:
- `HOUR`: same as `YEAR` but with hour granularity
- `format`: if your partition type is `YEAR`, `MONTH`, `DAY`, or `HOUR`, specify the format for the date string as it
appears in your file paths
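As an illustration only — assuming the `field:type:format` layout implied by the options above, with a hypothetical column name and date format — a spec for a table partitioned by a date column could look like:

```yaml
datasets:
  - tableBasePath: s3://my-bucket/path/to/hudi-table   # placeholder path
    tableName: my_partitioned_table
    # field:type:format — a date partition column rendered as yyyy-MM-dd in file paths
    partitionSpec: created_date:DAY:yyyy-MM-dd
```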

3. The default implementations of table format clients can be replaced with custom implementations by specifying a
client configs yaml file in the format below:

```yaml
# sourceClientProviderClass: The class name of a table format's client factory, where the client is
# used for reading from a table of this format. All user configurations, including hadoop config
@@ -85,49 +77,41 @@ datasets:
# used for writing to a table of this format.
# configuration: A map of configuration values specific to this client.
tableFormatsClients:
  HUDI:
    sourceClientProviderClass: io.onetable.hudi.HudiSourceClientProvider
  DELTA:
    targetClientProviderClass: io.onetable.delta.DeltaClient
    configuration:
      spark.master: local[2]
      spark.app.name: onetableclient
```

4. A catalog can be used when reading and updating Iceberg tables. The catalog can be specified in a yaml file and
passed in with the `--icebergCatalogConfig` option. The format of the catalog config file is:

```yaml
catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
key1: value1
key2: value2
```

5. Run with `java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml [--hadoopConfig hdfs-site.xml] [--clientsConfig clients.yaml] [--icebergCatalogConfig catalog.yaml]`.
   The bundled jar includes Hadoop dependencies for AWS, Azure, and GCP. Authentication for AWS is done with
   `com.amazonaws.auth.DefaultAWSCredentialsProviderChain`. To override this setting, specify a different implementation
   with the `--awsCredentialsProvider` option.
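For example, a run that supplies a Hadoop config and overrides the AWS credentials provider might look like the following (the provider class name here is a placeholder, not a real implementation):

```shell
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar \
  --datasetConfig my_config.yaml \
  --hadoopConfig hdfs-site.xml \
  --awsCredentialsProvider com.example.MyCredentialsProvider
```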

# Contributing

## Setup

For setting up the repo on IntelliJ, open the project and change the Java version to Java 11 in File -> Project Structure.
![img.png](style/IDE.png)

Have you found a bug, or do you have a cool idea that you want to contribute to the project? Please file a GitHub issue [here](https://github.com/onetable-io/onetable/issues)

## Adding a new target format

Adding a new target format requires a developer to
implement [TargetClient](./api/src/main/java/io/onetable/spi/sync/TargetClient.java). Once you have implemented that
interface, you can integrate it into the [OneTableClient](./core/src/main/java/io/onetable/client/OneTableClient.java).
If you think others may find that target useful, please raise a Pull Request to add it to the project.
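As a rough structural sketch only — the interface and method names below are hypothetical placeholders, not the actual `TargetClient` API, so consult the interface source for the real contract:

```java
import java.util.List;

// Placeholder stand-in for the real io.onetable.spi.sync.TargetClient contract.
interface ExampleTargetClient {
  void beginSync(String tableName);
  void syncSchema(String schemaJson);
  void syncFiles(List<String> dataFilePaths);
  void completeSync();
}

// Skeleton target for a hypothetical "MYFORMAT" table format.
class MyFormatTargetClient implements ExampleTargetClient {
  @Override public void beginSync(String tableName) { /* open a commit/transaction in the target format */ }
  @Override public void syncSchema(String schemaJson) { /* translate and persist the schema */ }
  @Override public void syncFiles(List<String> dataFilePaths) { /* register data files in the target table's metadata */ }
  @Override public void completeSync() { /* finalize the new snapshot */ }
}
```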

## Overview of the sync process

![img.png](assets/images/sync_flow.jpg)
2 changes: 1 addition & 1 deletion website/blog/OneTable-is-now-Apache-XTable.md
@@ -1,6 +1,6 @@
---
title: "OneTable is now “Apache XTable™ (Incubating)”"
excerpt: "XTable is now Incubating in the Apache Software Foundation"
excerpt: "Apache XTable™ (Incubating) is now Incubating in the Apache Software Foundation"
author: Dipankar Mazumdar, JB Onofré
category: blog
image: /images/blog/XTable/xtable-cover.png
2 changes: 1 addition & 1 deletion website/docs/athena.md
@@ -4,7 +4,7 @@ title: "Amazon Athena"
---

# Querying from Amazon Athena
To read a Apache XTable™ synced target table (regardless of the table format) in Amazon Athena,
To read an Apache XTable™ synced target table (regardless of the table format) in Amazon Athena,
you can create the table either by:
* Using a DDL statement as mentioned in the following AWS docs:
* [Example](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html#querying-hudi-in-athena-creating-hudi-tables) for Hudi
14 changes: 7 additions & 7 deletions website/docs/biglake-metastore.md
@@ -7,7 +7,7 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Syncing to BigLake Metastore
This document walks through the steps to register a XTable synced Iceberg table in BigLake Metastore on GCP.
This document walks through the steps to register an Apache XTable™ (Incubating) synced Iceberg table in BigLake Metastore on GCP.

## Pre-requisites
1. Source (Hudi/Delta) table(s) already written to Google Cloud Storage.
@@ -19,21 +19,21 @@ This document walks through the steps to register a XTable synced Iceberg table
3. To ensure that the Storage Account API's caller (your service account used by XTable) has the
necessary permissions to write log/metadata files in GCS, ask your administrator to grant [Storage Object User](https://cloud.google.com/storage/docs/access-control/iam-roles) (roles/storage.objectUser)
access to the service account.
4. If you're running XTable outside GCP, you need to provide the machine access to interact with BigLake and GCS.
4. If you're running Apache XTable™ (Incubating) outside GCP, you need to provide the machine access to interact with BigLake and GCS.
To do so, store the permissions key for your service account in your machine using
```shell
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_key.json
```
5. Clone the XTable [repository](https://github.com/apache/incubator-xtable) and create the
5. Clone the Apache XTable™ (Incubating) [repository](https://github.com/apache/incubator-xtable) and create the
`utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](/docs/setup)
6. Download the [BigLake Iceberg JAR](gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar) locally.
XTable requires the JAR to be present in the classpath.
Apache XTable™ (Incubating) requires the JAR to be present in the classpath.
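For example, assuming the Google Cloud SDK's `gsutil` is installed, the JAR can be copied locally with:

```shell
gsutil cp gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar .
```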

## Steps
:::danger Important:
Currently BigLake Metastore is only accessible through Google's
[BigLake Rest APIs](https://cloud.google.com/bigquery/docs/reference/biglake/rest), and as such
XTable requires you to setup the below items prior to running sync on your source dataset.
Apache XTable™ (Incubating) requires you to setup the below items prior to running sync on your source dataset.
* BigLake Catalog
* BigLake Database
:::
@@ -114,7 +114,7 @@ catalogOptions:
warehouse: gs://path/to/warehouse
```

From your terminal under the cloned XTable directory, run the sync process using the below command.
From your terminal under the cloned Apache XTable™ (Incubating) directory, run the sync process using the below command.

```shell md title="shell"
java -cp utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar:/path/to/downloaded/biglake-catalog-iceberg1.2.0-0.1.0-with-dependencies.jar io.onetable.utilities.RunSync --datasetConfig my_config.yaml --icebergCatalogConfig catalog.yaml
@@ -127,7 +127,7 @@ to interpret the data as an Iceberg table.
:::

### Validating the results
Once the sync succeeds, XTable would have written the table directly to BigLake Metastore.
Once the sync succeeds, Apache XTable™ (Incubating) would have written the table directly to BigLake Metastore.
We can use `Try this method` option on Google's REST reference docs for
[`projects.locations.catalogs.databases.tables.get`](https://cloud.google.com/bigquery/docs/reference/biglake/rest/v1/projects.locations.catalogs.databases.tables/get)
method to view the created table.
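For instance, assuming `gcloud` is configured and you substitute your own project, location, catalog, database, and table identifiers, the same lookup can be scripted against the REST endpoint referenced above:

```shell
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://biglake.googleapis.com/v1/projects/MY_PROJECT/locations/MY_LOCATION/catalogs/MY_CATALOG/databases/MY_DATABASE/tables/MY_TABLE"
```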
4 changes: 2 additions & 2 deletions website/docs/demo/docker.md
@@ -17,8 +17,8 @@ This demo was tested in both x86-64 and AArch64 based macOS operating systems
:::

## Setting up Docker cluster
After cloning the XTable repository, change directory to `demo` and run the `start_demo.sh` script.
This script builds XTable jars required for the demo and then spins up docker containers to start a Jupyter notebook
After cloning the Apache XTable™ (Incubating) repository, change directory to `demo` and run the `start_demo.sh` script.
This script builds Apache XTable™ (Incubating) jars required for the demo and then spins up docker containers to start a Jupyter notebook
with Scala interpreter, Hive Metastore, Presto and Trino.

```shell md title="shell"
14 changes: 7 additions & 7 deletions website/docs/hms.md
@@ -16,14 +16,14 @@ This document walks through the steps to register an Apache XTable™ (Incubatin
2. A compute instance where you can run Apache Spark. This can be your local machine, docker,
or a distributed system like Amazon EMR, Google Cloud's Dataproc, Azure HDInsight etc.
This is a required step to register the table in HMS using a Spark client.
3. Clone the XTable [repository](https://github.com/apache/incubator-xtable) and create the
3. Clone the XTable™ (Incubating) [repository](https://github.com/apache/incubator-xtable) and create the
`utilities-0.1.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](/docs/setup)
4. This guide also assumes that you have configured the Hive Metastore locally or on EMR/Dataproc/HDInsight
   and that it is already running.

## Steps
### Running sync
Create `my_config.yaml` in the cloned XTable directory.
Create `my_config.yaml` in the cloned Apache XTable™ (Incubating) directory.
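The tabbed examples below give the exact per-format configuration; as a minimal, hypothetical sketch of the overall shape (paths and table names are placeholders):

```yaml
sourceFormat: HUDI          # or DELTA / ICEBERG
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/path/to/source-table
    tableName: my_table
```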

<Tabs
groupId="table-format"
@@ -86,7 +86,7 @@ datasets:
* ADLS - `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>`
:::

From your terminal under the cloned XTable directory, run the sync process using the below command.
From your terminal under the cloned Apache XTable™ (Incubating) directory, run the sync process using the below command.
```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
```
@@ -97,7 +97,7 @@ directory with relevant metadata files that help query engines to interpret the
:::

### Register the target table in Hive Metastore
Now you need to register the XTable synced target table in Hive Metastore.
Now you need to register the Apache XTable™ (Incubating) synced target table in Hive Metastore.

<Tabs
groupId="table-format"
@@ -137,7 +137,7 @@ if you have your source table in S3/GCS/ADLS i.e.


Now you will be able to query the created table directly as a Hudi table from the same `spark` session or
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the XTable synced tables on
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on
[Presto](/docs/presto) or [Trino](/docs/trino) query engines for more information.

```sql md title="sql"
@@ -171,7 +171,7 @@ if you have your source table in S3/GCS/ADLS i.e.
:::

Now you will be able to query the created table directly as a Delta table from the same `spark` session or
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the XTable synced tables on
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on
[Presto](/docs/presto) or [Trino](/docs/trino) query engines for more information.

```sql md title="sql"
@@ -211,7 +211,7 @@ in S3/GCS/ADLS i.e.
:::

Now you will be able to query the created table directly as an Iceberg table from the same `spark` session or
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the XTable synced tables on
using query engines like `Presto` and/or `Trino`. Check out the guides for querying the Apache XTable™ (Incubating) synced tables on
[Presto](/docs/presto) or [Trino](/docs/trino) query engines for more information.

```sql md title="sql"
8 changes: 4 additions & 4 deletions website/docs/how-to.md
@@ -33,7 +33,7 @@ history to enable proper point in time queries.
* Google Cloud Storage by following the steps
[here](https://cloud.google.com/iam/docs/keys-create-delete#creating)

For the purpose of this tutorial, we will walk through the steps to using XTable locally.
For the purpose of this tutorial, we will walk through the steps to using Apache XTable™ (Incubating) locally.

## Steps

@@ -348,7 +348,7 @@ Authentication for GCP requires service account credentials to be exported. i.e.
`export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_key.json`
:::

In your terminal under the cloned XTable directory, run the below command.
In your terminal under the cloned Apache XTable™ (Incubating) directory, run the below command.

```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
@@ -359,9 +359,9 @@ At this point, if you check your local path, you will be able to see the necessary
commit history, partitions, and column stats that help query engines to interpret the data in the target table format.

## Conclusion
In this tutorial, we saw how to create a source table and use XTable to create the metadata files
In this tutorial, we saw how to create a source table and use Apache XTable™ (Incubating) to create the metadata files
that can be used to query the source table in different target table formats.

## Next steps
Go through the [Catalog Integration guides](/docs/catalogs-index) to register the XTable synced tables
Go through the [Catalog Integration guides](/docs/catalogs-index) to register the Apache XTable™ (Incubating) synced tables
in different data catalogs.
4 changes: 2 additions & 2 deletions website/docs/setup.md
@@ -4,7 +4,7 @@ This page covers the essential steps to setup Apache XTable™ (incubating) in y

## Pre-requisites
1. Building the project requires Java 11 and Maven to be setup and configured using PATH or environment variables.
2. Clone the XTable project GitHub [repository](https://github.com/apache/incubator-xtable) in your environment.
2. Clone the Apache XTable™ (Incubating) project GitHub [repository](https://github.com/apache/incubator-xtable) in your environment.

## Steps
#### Building the project
@@ -22,5 +22,5 @@ mvn clean package -DskipTests
For more information on the steps, follow the project's GitHub [README.md](https://github.com/apache/incubator-xtable/blob/main/README.md)

## Next Steps
See the [Quickstart](/docs/how-to) guide to learn to use XTable to add interoperability between
See the [Quickstart](/docs/how-to) guide to learn to use Apache XTable™ (Incubating) to add interoperability between
different table formats.
2 changes: 1 addition & 1 deletion website/docs/spark.md
@@ -7,7 +7,7 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Querying from Apache Spark
To read a Apache XTable™ (Incubating) synced target table (regardless of the table format) in Apache Spark locally or on services like
To read an Apache XTable™ (Incubating) synced target table (regardless of the table format) in Apache Spark locally or on services like
Amazon EMR, Google Cloud's Dataproc, Azure HDInsight, or Databricks, you do not need additional jars or configs
other than what is needed by the respective table formats.
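As a minimal, hypothetical sketch (the path is a placeholder), a Delta target synced this way reads from a Spark shell like any other Delta table:

```scala
// Replace the path with your own synced table location.
val df = spark.read.format("delta").load("s3://my-bucket/path/to/synced-table")
df.show()
```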

