This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Merge pull request #159 from snaplet/peterp-fix-subset-docs
peterp fix subset docs
peterp authored May 9, 2023
2 parents b649b70 + c47a46e commit 7cc133c
Showing 4 changed files with 22 additions and 125 deletions.
2 changes: 1 addition & 1 deletion docs/03-getting-started/03-data-operations.md
@@ -4,7 +4,7 @@ Snaplet has four operations for manipulating the data in a snapshot:

- **Transform:** Make existing data suitable for development by transforming the original value into a new one
- **Exclude:** Remove data in specific schemas and tables
- **Reduce (Subset):** Capture a subset of data whilst keeping referential integrity intact
- **Sample (Subset):** Capture a sample of data whilst keeping referential integrity intact
- **Generate:** Seed values when you don't have any data

These operations are defined as code via config files and JavaScript functions.
2 changes: 1 addition & 1 deletion docs/04-references/data-operations/01-overview.md
@@ -10,7 +10,7 @@ Snaplet has four operations for manipulating the data in a snapshot:

- **Transform:** Make existing data suitable for development by transforming the original value into a new one
- **Exclude:** Remove data in specific tables
- **Reduce (Subset):** Capture a subset of data whilst keeping referential integrity intact
- **Sample (Subset):** Capture a sample of data whilst keeping referential integrity intact
- **Generate:** Seed values when you don't have any data

These operations are defined as code via config files and JavaScript functions.
139 changes: 18 additions & 121 deletions docs/04-references/data-operations/04-reduce.md
@@ -1,19 +1,12 @@
# Subset data

:::note Experimental

This is a preview feature. We would love your [feedback](https://app.snaplet.dev/chat)!

:::

# Sample (subset) data

Capturing a snapshot of a large database in its entirety can be lengthy, and ultimately unnecessary, as only a representative sample of the data is typically needed to code against.

Snaplet can be configured to capture a subset of data during the snapshot process, reducing the snapshot's size and the subsequent time spent uploading and downloading snapshots.

## Getting started

To reduce the size of your next snapshot and get a small, representative sample of your database, add the `subset` object to your `transform.ts file`.
To reduce the size of your next snapshot and get a small, representative sample of your database, export the `subset` object from your `transform.ts` file.

An example of a `transform.ts` file with a basic `subset` config:

@@ -26,7 +19,7 @@ export const config: Transform = () => {

export const subset = {
enabled: true,
version: "2", // the latest version
version: "3", // the latest version
targets: [
{
table: "public.User",
@@ -38,30 +31,30 @@ export const subset = {

```
When `snaplet snapshot capture` is run against the above example config, the following will happen:
* The `User` table is subset to roughly 5% of its original size.
* Related rows in related tables connected to the `User` table via foreign key relationships are included in the new snapshot, and are similarly subset.
* As `keepDisconnectedTables` is set to `true`, any tables not connected to the `User` table via foreign key relationships will be included in the new snapshot, but **won't** be subset.
* The `User` table is sampled to roughly 5% of its original size.
* Related rows in related tables connected to the `User` table via foreign key relationships are included in the new snapshot.
* As `keepDisconnectedTables` is set to `true`, any tables not connected to the `User` table via foreign key relationships will be included in the new snapshot, and **won't** be sampled.
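
Putting those points together, the relevant part of `transform.ts` might look like the following. This is a sketch mirroring the example above, not a full config; the `public.User` table name and the 5% figure come from that example, so adjust both for your own schema:

```typescript
// Minimal subset config sketch matching the behaviour described above.
// The table name (public.User) and percentage are illustrative.
export const subset = {
  enabled: true, // sampling runs during `snaplet snapshot capture`
  version: "3", // the latest version
  targets: [
    {
      table: "public.User", // starting point for the sample
      percent: 5, // keep roughly 5% of the rows
    },
  ],
  // Tables with no foreign key path to public.User are kept in full.
  keepDisconnectedTables: true,
};
```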

## Configuring Subsetting
## Configuring sampling

Various commands permit more granular control over subsetting. <!-- Chat to us [on Discord](https://app.snaplet.dev/chat) if your use case isn't supported.-->
Various commands permit more granular control over sampling. Chat to us [on Discord](https://app.snaplet.dev/chat) if your use case isn't supported.

### Enabled (enabled: boolean)
When set to true, subsetting will occur during `snaplet snapshot capture`.
When set to true, sampling will occur during `snaplet snapshot capture`.

### Targets (targets: array)
The first table defined in `targets` is the starting point of subsetting. Subsetting specifics are controlled by the `percent` (or `rowLimit`), `where` and `orderBy` properties.
The first table defined in `targets` is the starting point of sampling. Sampling specifics are controlled by the `percent` (or `rowLimit`), `where` and `orderBy` properties.

Subset traverses tables related to the `target` table and selects all the rows that are connected to the `target` table via foreign key relationship. This process is repeated for each `target` table. At least one `target` must be defined.
Sampling traverses tables related to the `target` table and selects all rows connected to it via foreign key relationships. This process is repeated for each `target` table. At least one `target` must be defined.

Each `target` requires:
* A `table` name
* One or more of the following subsetting properties:
* One or more of the following sampling properties:
* `percent` (percent of rows captured: number)
* `rowLimit` (limit on the number of rows captured: number)
* `where` (filter by string: string)

Optionally, you can also define an `orderBy` property to sort the rows before subsetting.
Optionally, you can also define an `orderBy` property to sort the rows before sampling.
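
As a single-target illustration of these properties together, a sketch might look like this. The `deletedAt` and `createdAt` columns are hypothetical, and writing the `where` and `orderBy` values as raw SQL fragments is an assumption here — check the exact syntax against your Snaplet version:

```typescript
// Hypothetical single-target sample: filter, order, then take ~10%.
// Column names (deletedAt, createdAt) are illustrative only.
export const subset = {
  enabled: true,
  version: "3",
  targets: [
    {
      table: "public.User",
      percent: 10, // sample roughly 10% of the matching rows
      where: '"User"."deletedAt" IS NULL', // filter rows before sampling
      orderBy: '"User"."createdAt" DESC', // sort before sampling (assumed syntax)
    },
  ],
};
```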

Here is an example of a config with multiple targets:

@@ -97,122 +90,26 @@ In this example a snapshot would be created with 5% of the rows in the User tabl

When set to true, all tables (with all their data) that are not connected via foreign key relationships to the tables defined in `targets` will be included in the snapshot. When set to false, those disconnected tables will be excluded from the snapshot.
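
For instance, to drop everything that is not reachable from a target (the table name is illustrative):

```typescript
// Sketch: only public.Organization and tables reachable from it via
// foreign keys end up in the snapshot; everything else is excluded.
export const subset = {
  enabled: true,
  version: "3",
  targets: [{ table: "public.Organization", percent: 5 }],
  keepDisconnectedTables: false,
};
```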

### Excluding tables from subset
### Excluding tables from sample

To exclude specific tables from the snapshot, see the [exclude](docs/04-references/data-operations/03-exclude.md) documentation.

:::note A note on subset precision
:::note A note on sample precision

Note that the `precent` / `rowLimit` specified in the subset config may not be exact. The actual row count of the data is affected by the relationships between the tables. As such, a 5% subset specified against a specific table may ultimately include slightly more than 5% of the actual database.
Note that the `percent` / `rowLimit` specified in the sample config may not be exact. The actual row count of the data is affected by the relationships between the tables. As such, a 5% sample specified against a specific table may ultimately include slightly more than 5% of the actual database.

:::

:::note Limitations

When subsetting we calculate which rows to copy and keep a reference to them in memory. This means that there is a limit to the number of rows that we can store: The more rows you have in your subset, the more memory will be consumed. Currently the CLI is limited to 2GB. This is temporary issue which will be resolved in Q1 2023.
When sampling we calculate which rows to copy and keep a reference to them in memory. This means there is a limit to the number of rows we can store: the more rows in your sample, the more memory is consumed. Currently the CLI is limited to 2GB. This is a temporary issue which will be resolved in Q1 2023.

Until then:
- If you are using UUIDs as primary keys (foreign keys), you have a row limit of roughly 1 million rows (or one large table of 12 million rows) on a 2GB system.
- If you are using integers (int/bigint) as primary keys, you can have roughly 4 million rows (or one large table of 48 million rows) on a 2GB system.

Lots of assumptions are made here; this will vary drastically with your specific database design. Chat to us on [Discord](https://app.snaplet.dev/chat) and we will help you figure out what your limit is.

If you see this error: `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory ` you have reached your limit. Try and make your subset smaller by reducing the `percent` or `rowLimit` or by setting `keepDisconnectedTables` to false.

:::

<!--
Subset version 2 doesn't have the custom foreignKeys option yet. (Issue: https://linear.app/snaplet/issue/S-288/subset-version-2-custom-forgeinkeys-in-config)
### Foreign keys (foreignKeys: array, optional)
We use foreign keys to traverse the database when creating a sample. We use all non-nullable foreign keys and detect nullable foreign keys that will not cause a circular reference. The nullable foreign keys can be manually overridden with the `foreignKeys` property.
The foreignKeys property is an array of objects with the following properties:
* `table` - the table name
* `column` - the column name
* `targetTable` - the target table name
* `targetColumn` - the target column name
Here is a hypothetical example of a transform.ts file with a subset config that uses the foreignKeys property (the option is not released yet, so treat this shape as an assumption):
```ts
// Hypothetical - the foreignKeys option is not yet available.
export const subset = {
  enabled: true,
  version: "2",
  targets: [{ table: "public.User", percent: 5 }],
  foreignKeys: [
    {
      table: "public.Post",
      column: "authorId",
      targetTable: "public.User",
      targetColumn: "id",
    },
  ],
};
``` -->



---
# Subsetting (version 1) **DEPRECATED**:

:::note A note on this documentation

This reference is provided for legacy Snaplet users who may be using the previous version of subsetting, which was configured via the `subsetting.json` file and is now deprecated.
If you see this error: `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory` you have reached your limit. Try to make your sample smaller by reducing the `percent` or `rowLimit`, or by setting `keepDisconnectedTables` to false.

:::

Here is a basic example of the `subsetting.json` file:

```json
{
"enabled": true,
"initial_targets": [
{
"table": "public.Organization",
"row_limit": 100
}
],
"keep_disconnected_tables": true,
}
```
In this config we limited the table "Organization" in the "public" schema to 100 rows.

To test your new subset configuration locally run `snaplet snapshot capture`.

## Reference

### Enabled (enabled: boolean)
When set to true, subsetting will occur during `snaplet snapshot capture`.

### Initial Targets (initial_targets)
Targets (tables) specify the specifics of the subset. Subsetting starts at the first `initial_targets` entry, so at least one target must be specified.

The target requires:
* a table name (`table`: string)
* a percentage (`percent`: number) or a limit on the rows (`row_limit`: number)

Optional:
* a where clause (`where`: string)

Note that for the first target the where clause can be used to reduce the subset, but for the following targets the where clause will in most cases increase the size of the subset. Let's look at an example to showcase this.

Example `subsetting.json` file:
```json
{
"enabled": true,
"initial_targets": [
{
"table": "public.Organization",
"percent": 10,
"where": "\"Organization\".\"id\" > 300"
},
{
"table": "public.User",
"percent": 10,
"where": "\"User\".\"lastName\" = 'Lee'"
}
],
"keep_disconnected_tables": true
}
```
In this example we select 10% of the rows in the Organization table, but only where the id is larger than 300.
* If we originally have 100 Organizations and more than 10 of them have an id larger than 300, we end up with a subset of 10 Organizations.
* If only 5 Organizations have an id larger than 300, the subset contains just those 5 Organizations.

Things get more complicated with the next target. Say each Organization has an administrator (User) associated with it; the Organization table then has a foreign key pointing to User. When we selected the Organization rows we also had to pull in all the associated Users. So by the time we move on to the next target (User) there are already users in the subset, and we cannot remove them without breaking the foreign key constraints. Thus we add to the subset all users whose lastName is equal to "Lee".

### Disconnected tables (keep_disconnected_tables: boolean)

In your database there may be tables that have no relationship to the specified `initial_targets`. You can choose to either keep them in the snapshot (`keep_disconnected_tables: true`) or exclude them from it (`keep_disconnected_tables: false`).

4 changes: 2 additions & 2 deletions sidebars.js
@@ -19,14 +19,14 @@ module.exports = {
id: "getting-started/configuration",
label: "Configuration",
},
{ type: "doc", id: "getting-started/restoring", label: "Restoring" },
{
type: "doc",
id: "getting-started/data-operations",
label: "Data operations",
},
{ type: "doc", id: "getting-started/capturing", label: "Capturing" },
{ type: "doc", id: "getting-started/sharing", label: "Sharing" },
{ type: "doc", id: "getting-started/restoring", label: "Restoring" },
{
type: "doc",
id: "getting-started/what-is-next",
@@ -62,7 +62,7 @@ module.exports = {
{
type: "doc",
id: "references/data-operations/reduce",
label: "Subset (Reduce) 🐥",
label: "Sample (Reduce) 🐥",
},
{
type: "doc",
