Update README to be ergonomic and excite new users (datafold#816)
* update draft
* elliot's edits
* remove kafka
* less awkward than the center
* elliot's edits
* added docs link
* Scope the utility to single players
* swapped order
---------
Co-authored-by: Sung Won Chung <[email protected]>
Co-authored-by: elliotgunn <[email protected]>
README.md
Lines changed: 86 additions & 60 deletions
Original file line number
Diff line number
Diff line change
@@ -9,79 +9,61 @@ data-diff: Compare datasets fast, within or across SQL databases
9
9
</h2>
10
10
<br>
11
11
12
-
> [Make sure to join us at our virtual hands-on lab series where our team walks through live how to get set-up with it!](https://www.datafold.com/virtual-hands-on-lab)
12
+
> [Join our live virtual lab series to learn how to set it up!](https://www.datafold.com/virtual-hands-on-lab)
13
13
14
-
# Use Cases
14
+
# What's a Data Diff?
15
+
A data diff is a value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.
16
+
17
+
There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.
15
18
16
-
## Data Migration & Replication Testing
17
-
Compare source to target and check for discrepancies when moving data between systems:
18
-
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
19
-
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
20
-
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
19
+
# Use Cases
21
20
21
+
### Data Migration & Replication Testing
22
+
data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:
23
+
- **Migrating** to a new data warehouse (e.g., Oracle -> Snowflake)
24
+
- **Converting SQL** to a new transformation framework (e.g., stored procedures -> dbt)
25
+
- Continuously **replicating data** from an OLTP database to OLAP data warehouse (e.g., MySQL -> Redshift)
22
26
23
-
## Data Development Testing
24
-
Test SQL code and preview changes by comparing development/staging environment data to production:
25
-
1. Make a change to some SQL code
27
+
### Data Development Testing
28
+
When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:
29
+
1. Make a change to your SQL code
26
30
2. Run the SQL code to create a new dataset
27
-
3. Compare the dataset with its production version or another iteration
31
+
3. Compare this dataset with its production version or other iterations
* Read our docs to get started with [data-diff & dbt](https://docs.datafold.com/development_testing/cli) or :eyes: **watch the [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
42
+
* dbt Cloud users should check out [Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
43
+
* Get support from the dbt Community Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU)
39
44
40
-
</details>
41
45
42
-
> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
### ⚡ Validating dbt model changes between dev and prod
49
+
Looking to use data-diff in dbt development?
45
50
46
-
**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
47
-
48
-
Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
49
-
50
-
51
-
# How it works
51
+
Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.
52
52
53
-
When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
53
+
Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/cli) to get started with a development testing workflow!
54
54
55
-
## `joindiff`
56
-
- Recommended for comparing data within the same database
57
-
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
58
-
- Fully relies on the underlying database engine for computation
59
-
- Requires both datasets to be queryable with a single SQL query
60
-
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
55
+
### 🔀 Compare data tables between databases
56
+
1. Install `data-diff` with adapters
61
57
62
-
## `hashdiff`
63
-
- Recommended for comparing datasets across different databases
64
-
- Can also be helpful in diffing very large tables with few expected differences within the same database
65
-
- Employs a divide-and-conquer algorithm based on hashing and binary search
66
-
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
67
-
- Time complexity approximates COUNT(*) operation when there are few differences
68
-
- Performance degrades when datasets have a large number of differences
69
-
70
-
More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)
71
-
72
-
# Get started
73
-
74
-
## Validating dbt model changes between dev and prod
75
-
⚡ Looking to use `data-diff` in dbt development? Head over to [our `data-diff` + `dbt` documentation](https://docs.datafold.com/development_testing/how_it_works) to get started!
76
-
77
-
## Compare data tables between databases
78
-
🔀 To compare data between databases, install `data-diff` with specific database adapters, e.g.:
58
+
To compare data between databases, install `data-diff` with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:
Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
64
+
2. Run `data-diff` with connection URIs
65
+
66
+
Then, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
85
67
86
68
```bash
87
69
data-diff \
@@ -93,8 +75,9 @@ data-diff \
93
75
-c <columns to compare> \
94
76
-w <filter condition>
95
77
```
78
+
3. Set up your configuration
96
79
97
-
Run `data-diff` with a `toml` configuration file. In the following example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:
80
+
You can use a `toml` configuration file to run your `data-diff` job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:
98
81
99
82
```toml
100
83
## DATABASE CONNECTION ##
@@ -103,7 +86,6 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
103
86
# filepath = "datafold_demo.duckdb" # local duckdb file example
104
87
# filepath = "md:" # default motherduck connection example
105
88
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
106
-
database = "datafold_demo"
107
89
108
90
[database.snowflake_connection]
109
91
driver = "snowflake"
@@ -132,8 +114,12 @@ Run `data-diff` with a `toml` configuration file. In the following example, we c
132
114
133
115
verbose = false
134
116
```
117
+
4. Run your `data-diff` job
118
+
119
+
Make sure to export relevant environment variables as needed. For example, we compare data based on the earlier configuration:
135
120
136
121
```bash
122
+
137
123
# export relevant environment variables, example below
@@ -172,8 +160,7 @@ Check out [documentation](https://docs.datafold.com/reference/open_source/cli) f
172
160
| ElasticSearch | 📝 ||
173
161
| Planetscale | 📝 ||
174
162
| Pinot | 📝 ||
175
-
| Druid | 📝 ||
176
-
| Kafka | 📝 ||
163
+
| Druid | 📝 |||
177
164
| SQLite | 📝 ||
178
165
179
166
* 🟢: Implemented and thoroughly tested.
@@ -189,9 +176,48 @@ Your database not listed here?
189
176
190
177
<br>
191
178
179
+
# How it works
180
+
181
+
`data-diff` efficiently compares data using two modes:
182
+
183
+
**joindiff**: Ideal for comparing data within the same database. It uses outer joins for efficient row comparisons, relies on the database engine for computation, and its performance is largely independent of the number of differences.
184
+
185
+
**hashdiff**: Recommended for comparing datasets across different databases, or very large tables with few expected differences. It uses hashing and binary search, and can diff data across distinct database engines.
186
+
187
+
<details>
188
+
<summary>Click here to learn more about joindiff and hashdiff</summary>
189
+
190
+
### `joindiff`
191
+
* Recommended for comparing data within the same database
192
+
* Uses the outer join operation to diff the rows as efficiently as possible within the same database
193
+
* Fully relies on the underlying database engine for computation
194
+
* Requires both datasets to be queryable with a single SQL query
195
+
* Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
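
The outer-join idea described above can be sketched in a few lines. This is only an illustration, not data-diff's actual implementation: it uses Python's built-in `sqlite3` module, the table names `prod` and `dev`, the columns `id` and `value`, and the helper `join_diff` are all hypothetical, and the full outer join is emulated with two `LEFT JOIN`s so that it also runs on older SQLite versions.

```python
import sqlite3


def join_diff(conn, table_a, table_b, key="id", column="value"):
    """Return (key, value_in_a, value_in_b) for every row that differs.

    Illustration only: identifiers are interpolated directly into the SQL,
    so pass trusted table and column names.
    """
    query = f"""
        -- rows in A that are missing from B or have a different value
        SELECT a.{key}, a.{column}, b.{column}
        FROM {table_a} AS a
        LEFT JOIN {table_b} AS b ON a.{key} = b.{key}
        WHERE b.{key} IS NULL OR a.{column} IS NOT b.{column}
        UNION ALL
        -- rows that exist only in B
        SELECT b.{key}, NULL, b.{column}
        FROM {table_b} AS b
        LEFT JOIN {table_a} AS a ON a.{key} = b.{key}
        WHERE a.{key} IS NULL
        ORDER BY 1
    """
    return conn.execute(query).fetchall()


# Hypothetical demo data: two tables living in the same database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE dev (id INTEGER PRIMARY KEY, value TEXT);
    INSERT INTO prod VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO dev VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

diff = join_diff(conn, "prod", "dev")
for row in diff:
    print(row)
```

The whole comparison is a single SQL query, which is why this approach needs both tables to be reachable from one database and why its cost tracks the cost of the join rather than the number of differing rows.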
196
+
197
+
### `hashdiff`
198
+
* Recommended for comparing datasets across different databases
199
+
* Can also be helpful in diffing very large tables with few expected differences within the same database
200
+
* Employs a divide-and-conquer algorithm based on hashing and binary search
201
+
* Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
202
+
* Time complexity approximates COUNT(*) operation when there are few differences
203
+
* Performance degrades when datasets have a large number of differences
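
The divide-and-conquer idea described above can be illustrated with a minimal in-memory sketch. It is not data-diff's implementation: the two "tables" are plain Python dicts keyed by integer primary key, and the names `checksum`, `hash_diff`, and `threshold` are hypothetical. In a real cross-database diff, each checksum would be computed by the database itself so that only hashes travel over the network.

```python
import hashlib


def checksum(rows, lo, hi):
    """Hash all (key, value) pairs with lo <= key < hi, in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}:{rows[key]};".encode())
    return digest.hexdigest()


def hash_diff(a, b, lo, hi, threshold=2):
    """Return the keys in [lo, hi) whose rows differ between dicts a and b."""
    if checksum(a, lo, hi) == checksum(b, lo, hi):
        return []  # the whole segment matches, so skip it
    if hi - lo <= threshold:
        # Segment is small enough: compare its rows directly.
        return [k for k in range(lo, hi) if a.get(k) != b.get(k)]
    # Checksums differ: split the key range in half and recurse.
    mid = (lo + hi) // 2
    return hash_diff(a, b, lo, mid, threshold) + hash_diff(a, b, mid, hi, threshold)


# Hypothetical demo data: 16 rows, one changed and one missing in the target.
source = {k: f"row{k}" for k in range(16)}
target = dict(source)
target[5] = "changed"
del target[11]

differences = hash_diff(source, target, 0, 16)
print(differences)
```

When few rows differ, most segments are dismissed after a single checksum comparison. When many rows differ, nearly every segment has to be split down to the threshold, which matches the performance caveat in the list above.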
204
+
205
+
</details>
206
+
<br>
207
+
208
+
For detailed algorithm and performance insights, see the [technical explanation](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md), or head to our docs to [learn more about how Datafold diffs data](https://docs.datafold.com/data_diff/how-datafold-diffs-data).
209
+
210
+
211
+
# data-diff OSS & Datafold Cloud
212
+
data-diff is an open source utility for running stateless diffs on your local computer, built for a great single-player experience.
213
+
214
+
Scale up with [Datafold Cloud](https://www.datafold.com/) to make data diffing a company-wide experience: supercharge your data diffing CLI experience (e.g., `data-diff --dbt --cloud`) and run diffs manually in the UI. This includes [column-level lineage](https://www.datafold.com/column-level-lineage), [CI testing](https://docs.datafold.com/deployment_testing/how_it_works/), and diff history.
215
+
192
216
## Contributors
193
217
194
-
We thank everyone who contributed so far!
218
+
We thank everyone who contributed so far!
219
+
220
+
We'd love to see your face here: [Contributing Instructions](CONTRIBUTING.md)