Merge pull request dbt-labs#2537 from dbt-labs/ly-docs-databricks
dbt-databricks migration doc
nghi-ly authored Dec 22, 2022
2 parents 824a819 + d30a74f commit 0a43f94
Showing 1 changed file with 61 additions and 41 deletions.
title: "Migrating from dbt-spark to dbt-databricks"
id: "migrating-from-spark-to-databricks"
---

You can [migrate your projects](#migrate-your-dbt-projects) from using the `dbt-spark` adapter to using the [dbt-databricks adapter](https://github.com/databricks/dbt-databricks). In collaboration with dbt Labs, Databricks built this adapter using dbt-spark as the foundation and added some critical improvements. With it, you get an easier setup, requiring only three inputs for authentication, and more features such as support for [Unity Catalog](https://www.databricks.com/product/unity-catalog).

## Simpler authentication

Previously, you had to provide a `cluster` or `endpoint` ID which was hard to parse from the `http_path` that you were given. Now, it doesn't matter if you're using a cluster or an SQL endpoint because the [dbt-databricks setup](/reference/warehouse-setups/databricks-setup) requires the _same_ inputs for both. All you need to provide is:
- hostname of the Databricks workspace
- HTTP path of the Databricks SQL warehouse or cluster
- appropriate credentials
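
For example, a minimal `dbt-databricks` target in your `profiles.yml` only needs these inputs. This is a sketch rather than the full reference; the profile name, schema, host, HTTP path, and token below are placeholders:

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: my_schema                               # default schema dbt builds into
      host: dbc-l33t-nwb.cloud.databricks.com         # hostname of the Databricks workspace
      http_path: /sql/1.0/endpoints/8657cad335ae63e3  # HTTP path of the SQL warehouse or cluster
      token: [my_secret_token]                        # personal access token
```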

## Better defaults

The `dbt-databricks` adapter provides better defaults than `dbt-spark` does. The defaults help optimize your workflow so you can get the fast performance and cost-effectiveness of Databricks. They are:

- The dbt models use the [Delta](https://docs.databricks.com/delta/index.html) table format. You can remove any declared configurations of `file_format = 'delta'` since they're now redundant.
- Accelerate your expensive queries with the [Photon engine](https://docs.databricks.com/runtime/photon.html).
- The `incremental_strategy` config is set to `merge`.

With dbt-spark, however, the default for `incremental_strategy` is `append`. If you want to continue using `incremental_strategy=append`, you must set this config explicitly on your incremental models. If you already specified `incremental_strategy=merge` on your incremental models, you don't need to change anything when moving to dbt-databricks, but you can remove the config since it's now redundant. Read [About incremental_strategy](/docs/build/incremental-models#about-incremental_strategy) to learn more.
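
For example, if some of your incremental models relied on the old dbt-spark `append` default, one way to keep that behavior is to set the config for those models in `dbt_project.yml`. This is a sketch; the project name and folder are placeholders:

```yaml
# dbt_project.yml
models:
  my_project:
    events:
      # keep the previous dbt-spark default for the incremental models in models/events
      +incremental_strategy: append
      # a `+file_format: delta` line here is now redundant with dbt-databricks and can be removed
```

You can also set this per model with `{{ config(incremental_strategy='append') }}` at the top of the model's SQL file.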

For more information on defaults, see [Caveats](/reference/warehouse-setups/databricks-setup#caveats).

## Pure Python

If you use dbt Core, you no longer have to download an independent driver to interact with Databricks. The connection information is all embedded in a pure-Python library called `databricks-sql-connector`.

## Migrate your dbt projects

In both dbt Core and dbt Cloud, you can migrate your projects to the Databricks-specific adapter from the generic Apache Spark adapter.

### Prerequisites

- Your project must be compatible with dbt 1.0 or greater. Refer to [Upgrading to v1.0](/guides/migration/versions/upgrading-to-v1.0) for details. For the latest version of dbt, refer to [Upgrading to v1.3](/guides/migration/versions/upgrading-to-v1.3).
- For dbt Cloud, you need administrative (admin) privileges to migrate dbt projects.

<!-- tabs for dbt Cloud and dbt Core -->
<Tabs>

<TabItem value="cloud" label="dbt Cloud">

The migration to the `dbt-databricks` adapter from `dbt-spark` shouldn't cause any downtime for production jobs. dbt Labs recommends that you schedule the connection change when usage of the IDE is light to avoid disrupting your team.

To update your Databricks connection in dbt Cloud:

1. Select **Account Settings** in the main navigation bar.
2. On the **Projects** tab, find the project you want to migrate to the dbt-databricks adapter.
3. Click the hyperlinked Connection for the project.
4. Click the "Edit" button in the top right corner.
5. Select Databricks for the warehouse
6. Select Databricks (dbt-databricks) for the adapter and enter:
1. the `hostname`
2. the `http_path`
3. optionally the catalog name
7. Click save.
4. Click **Edit** in the top right corner.
5. Select **Databricks** for the warehouse.
6. Select **Databricks (dbt-databricks)** for the adapter and enter the:
1. `hostname`
2. `http_path`
3. (optional) catalog name
7. Click **Save**.

Everyone in your organization who uses dbt Cloud must refresh the IDE before starting work again. It should refresh in less than a minute.

#### About your credentials

When you update the Databricks connection in dbt Cloud, your team will not lose their credentials. This makes migrating easier since it only requires you to delete the Databricks connection and re-add the cluster or endpoint information.

If you're already successfully connected to Databricks using the `dbt-spark` ODBC method, the following credentials won't be lost:

- The credentials you supplied to dbt Cloud to connect to your Databricks workspace.
- The personal access tokens your team added in their dbt Cloud profile so they can develop in the IDE for a given project.
- The access token you added for each deployment environment so dbt Cloud can connect to Databricks during production jobs.

</TabItem>

<TabItem value="core" label="dbt Core">

To migrate your dbt Core projects to the `dbt-databricks` adapter from `dbt-spark`, you:
1. Install the [dbt-databricks adapter](https://github.com/databricks/dbt-databricks) in your environment
1. Update your Databricks connection by modifying your `target` in your `~/.dbt/profiles.yml` file

Anyone who's using your project must also make these changes in their environment.

</TabItem>

</Tabs>

<!-- End tabs for dbt Cloud and dbt Core -->

### Examples

You can use the following examples of the `profiles.yml` file to see the authentication setup with `dbt-spark` compared to the simpler setup with `dbt-databricks` when connecting to an SQL endpoint. A cluster example would look similar.


An example of what authentication looks like with `dbt-spark`:

<File name='~/.dbt/profiles.yml'>

```yaml
your_profile_name:
host: dbc-l33t-nwb.cloud.databricks.com
endpoint: 8657cad335ae63e3
token: [my_secret_token]

```

</File>

An example of how much simpler authentication is with `dbt-databricks`:

<File name='~/.dbt/profiles.yml'>

```yaml
your_profile_name:
token: [my_secret_token]
```
</File>
