
Commit d5a4d12

Authored by sungchun12 (Sung Won Chung) and elliotgunn

Update README to be ergonomic and excite new users (datafold#816)

* update draft
* elliot's edits
* remove kafka
* less awkward than the center
* elliot's edits
* added docs link
* Scope the utility to single players
* swapped order

---------

Co-authored-by: Sung Won Chung <[email protected]>
Co-authored-by: elliotgunn <[email protected]>
1 parent d05de0f commit d5a4d12

File tree

1 file changed: +86 additions, -60 deletions

README.md

Lines changed: 86 additions & 60 deletions
@@ -9,79 +9,61 @@ data-diff: Compare datasets fast, within or across SQL databases
 </h2>
 <br>

-> [Make sure to join us at our virtual hands-on lab series where our team walks through live how to get set-up with it!](https://www.datafold.com/virtual-hands-on-lab)
+> [Join our live virtual lab series to learn how to set it up!](https://www.datafold.com/virtual-hands-on-lab)

-# Use Cases
+# What's a Data Diff?
+A data diff is the value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
+
+There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.

-## Data Migration & Replication Testing
-Compare source to target and check for discrepancies when moving data between systems:
-- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
-- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
-- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
+# Use Cases

+### Data Migration & Replication Testing
+data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:
+- **Migrating** to a new data warehouse (e.g., Oracle -> Snowflake)
+- **Converting SQL** to a new transformation framework (e.g., stored procedures -> dbt)
+- Continuously **replicating data** from an OLTP database to OLAP data warehouse (e.g., MySQL -> Redshift)

-## Data Development Testing
-Test SQL code and preview changes by comparing development/staging environment data to production:
-1. Make a change to some SQL code
+### Data Development Testing
+When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:
+1. Make a change to your SQL code
 2. Run the SQL code to create a new dataset
-3. Compare the dataset with its production version or another iteration
+3. Compare this dataset with its production version or other iterations

+# dbt Integration
 <p align="left">
 <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
-</p>
-
-<details>
-<summary> data-diff integrates with dbt Core to seamlessly compare local development to production datasets
+</p>

-</summary>
+data-diff integrates with [dbt Core](https://github.com/dbt-labs/dbt-core) to seamlessly compare local development to production datasets.

-![data-development-testing](docs/development_testing.png)
+Learn more about how data-diff works with dbt:
+* Read our docs to get started with [data-diff & dbt](https://docs.datafold.com/development_testing/cli) or :eyes: **watch the [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+* dbt Cloud users should check out [Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+* Get support from the dbt Community Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU)

-</details>

-> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+# Getting Started

-:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+### ⚡ Validating dbt model changes between dev and prod
+Looking to use data-diff in dbt development?

-**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
-
-Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
-
-
-# How it works
+Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.

-When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
+Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/cli) to get started with a development testing workflow!
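A minimal sketch of what that development testing loop can look like from the command line, assuming a dbt project that is already configured for data-diff as described in the linked docs (the `--dbt` flag is the one referenced later in this README):

```bash
# Build the models you changed into your development schema
dbt run

# Then compare the dev results against production
# (requires the dbt integration setup described in the docs linked above)
data-diff --dbt
```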

-## `joindiff`
-- Recommended for comparing data within the same database
-- Uses the outer join operation to diff the rows as efficiently as possible within the same database
-- Fully relies on the underlying database engine for computation
-- Requires both datasets to be queryable with a single SQL query
-- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
+### 🔀 Compare data tables between databases
+1. Install `data-diff` with adapters

-## `hashdiff`
-- Recommended for comparing datasets across different databases
-- Can also be helpful in diffing very large tables with few expected differences within the same database
-- Employs a divide-and-conquer algorithm based on hashing and binary search
-- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
-- Time complexity approximates COUNT(*) operation when there are few differences
-- Performance degrades when datasets have a large number of differences
-
-More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)
-
-# Get started
-
-## Validating dbt model changes between dev and prod
-⚡ Looking to use `data-diff` in dbt development? Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/how_it_works) to get started!
-
-## Compare data tables between databases
-🔀 To compare data between databases, install `data-diff` with specific database adapters, e.g.:
+To compare data between databases, install `data-diff` with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:

 ```
 pip install data-diff 'data-diff[postgresql,snowflake]' -U
 ```

-Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
+2. Run `data-diff` with connection URIs
+
+Then, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

 ```bash
 data-diff \
@@ -93,8 +75,9 @@ data-diff \
 -c <columns to compare> \
 -w <filter condition>
 ```
+3. Set up your configuration

-Run `data-diff` with a `toml` configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:
+You can use a `toml` configuration file to run your `data-diff` job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:

 ```toml
 ## DATABASE CONNECTION ##
@@ -103,7 +86,6 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
 # filepath = "datafold_demo.duckdb" # local duckdb file example
 # filepath = "md:" # default motherduck connection example
 filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
-database = "datafold_demo"

 [database.snowflake_connection]
 driver = "snowflake"
@@ -132,8 +114,12 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c

 verbose = false
 ```
+4. Run your `data-diff` job
+
+Make sure to export relevant environment variables as needed. For example, we compare data based on the earlier configuration:

 ```bash
+
 # export relevant environment variables, example below
 export motherduck_token=<MOTHERDUCK_TOKEN>

@@ -148,11 +134,13 @@ data-diff --conf datadiff.toml \
 + 1, returned
 ```

-Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
+5. Review the output

+After running your data-diff job, review the output to identify and analyze differences in your data.

-# Supported databases
+Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
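As a concrete illustration of the steps above with the placeholders filled in, a run might look like the sketch below; every connection URI, table, and column name here is hypothetical:

```bash
# Hypothetical connection details, tables, and columns -- substitute your own
data-diff \
  "postgresql://user:password@localhost:5432/analytics" public.orders \
  "snowflake://user:password@acme/ANALYTICS/PUBLIC?warehouse=COMPUTE_WH&role=ANALYST" ORDERS \
  -k order_id \
  -c amount \
  -w "created_at > '2023-01-01'"
```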

+# Supported databases

 | Database | Status | Connection string |
 |---------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
@@ -161,8 +149,8 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
 | Snowflake | 🟢 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
 | BigQuery | 🟢 | `bigquery://<project>/<dataset>` |
 | Redshift | 🟢 | `redshift://<username>:<password>@<hostname>:5439/<database>` |
-| DuckDB | 🟢 | `duckdb://<dbname>@<filepath>` |
-| MotherDuck | 🟢 | `duckdb://<dbname>@<filepath>` |
+| DuckDB | 🟢 | `duckdb://<filepath>` |
+| MotherDuck | 🟢 | `duckdb://<filepath>` |
 | Oracle | 🟡 | `oracle://<username>:<password>@<hostname>/servive_or_sid` |
 | Presto | 🟡 | `presto://<username>:<password>@<hostname>:8080/<database>` |
 | Databricks | 🟡 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
@@ -172,8 +160,7 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
 | ElasticSearch | 📝 | |
 | Planetscale | 📝 | |
 | Pinot | 📝 | |
-| Druid | 📝 | |
-| Kafka | 📝 | |
+| Druid | 📝 | | |
 | SQLite | 📝 | |

 * 🟢: Implemented and thoroughly tested.
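For the updated DuckDB/MotherDuck entries above, a diff between two local DuckDB files using the `duckdb://<filepath>` form could look like the sketch below; the file paths and table name are hypothetical:

```bash
# Hypothetical DuckDB files and table; uses the duckdb://<filepath> connection-string form
data-diff \
  "duckdb://./dev.duckdb" orders \
  "duckdb://./prod.duckdb" orders \
  -k order_id
```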
@@ -189,9 +176,48 @@ Your database not listed here?

 <br>

+# How it works
+
+`data-diff` efficiently compares data using two modes:
+
+**joindiff**: Ideal for comparing data within the same database, utilizing outer joins for efficient row comparisons. It relies on the database engine for computation and has consistent performance.
+
+**hashdiff**: Recommended for comparing datasets across different databases or large tables with minimal differences. It uses hashing and binary search, capable of diffing data across distinct database engines.
+
+<details>
+<summary>Click here to learn more about joindiff and hashdiff</summary>
+
+### `joindiff`
+* Recommended for comparing data within the same database
+* Uses the outer join operation to diff the rows as efficiently as possible within the same database
+* Fully relies on the underlying database engine for computation
+* Requires both datasets to be queryable with a single SQL query
+* Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
+
+### `hashdiff`:
+* Recommended for comparing datasets across different databases
+* Can also be helpful in diffing very large tables with few expected differences within the same database
+* Employs a divide-and-conquer algorithm based on hashing and binary search
+* Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
+* Time complexity approximates COUNT(*) operation when there are few differences
+* Performance degrades when datasets have a large number of differences
+
+</details>
+<br>
+
+For detailed algorithm and performance insights, explore [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md), or head to our docs to [learn more about how Datafold diffs data](https://docs.datafold.com/data_diff/how-datafold-diffs-data).
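The mode can also be chosen explicitly from the command line. The sketch below assumes the `--algorithm`, `--bisection-factor`, and `--bisection-threshold` options exposed by the CLI (verify against `data-diff --help` for your installed version); the connection details and tuning values are illustrative only:

```bash
# Force hashdiff across two engines and tune its divide-and-conquer search:
# --bisection-factor  = number of segments checked per pass
# --bisection-threshold = segment size below which rows are downloaded and compared directly
data-diff \
  "postgresql://user:password@localhost:5432/analytics" public.orders \
  "snowflake://user:password@acme/ANALYTICS/PUBLIC?warehouse=COMPUTE_WH" ORDERS \
  -k order_id \
  --algorithm hashdiff \
  --bisection-factor 32 \
  --bisection-threshold 16384

# Within a single database, --algorithm joindiff pushes the comparison down as one outer-join query
```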
+
+
+# data-diff OSS & Datafold Cloud
+data-diff is an open source utility for running stateless diffs on your local computer for a great single player experience.
+
+Scale up with [Datafold Cloud](https://www.datafold.com/) to make data diffing a company-wide experience to both supercharge your data diffing CLI experience (ex: data-diff --dbt --cloud) and run diffs manually in the UI. This includes [column-level lineage](https://www.datafold.com/column-level-lineage), [CI testing](https://docs.datafold.com/deployment_testing/how_it_works/), and diff history.
+
 ## Contributors

-We thank everyone who contributed so far!
+We thank everyone who contributed so far!
+
+We'd love to see your face here: [Contributing Instructions](CONTRIBUTING.md)

 <a href="https://github.com/datafold/data-diff/graphs/contributors">
 <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
