Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
112 changes: 112 additions & 0 deletions src/5-levels-of-data.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# 5 Levels of data reusability

Not all data are created equal.
There are notable differences in how much you can do with data and how much effort it takes.
The more reusable data is, the easier it will be to use it as a developer, researcher or other type of data user.
Re-useability is about being able to transform, sort, query, serialize, modify, render and audit data without requiring too much work.

_This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata.info/en/)_.

## Level 0: proprietary data

If you don't give others the _rights_ to read, use or modify your data, it's reusability is zero.

That's why it's important to have a _license_ that allow others to use your data, like the [Open Database License](https://opendatacommons.org/licenses/odbl/summary/).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop the trailing dot here (or capitalize first letter in following line).

or one of the Creative Commons licenses.

It's also important to use _open formats_ (such as `CSV`, `JSON` or `PNG`), instead of _proprietary formats_ (tied to specific vendors, such as `PSD` or `RAR`).


## Level 1: unstructured data

_Examples: images, videos, plain text_

Unstructured data is the least usable.
Humans can read it, and AI / Machine Learning systems can draw more conclusions from it then ever,
but it's hard to build an actual application or graphic from only unstructured data.

```
Hi! I'm Joep, I'm born in 1991.
```

## Level 2: structured data

_Examples: CSV, XML, JSON, TOML, EXCEL_

Structured data can be read by machines, and this allows us to do all sorts of useful things.
We can _query_, _sort_ and _filter_.
But still, this type of data often requires human input when it needs to be processed.
And we don't have guarantees about which fields will be filled, or what their datatypes are.
One time, a `birthYear` can be a string, and the next time it can be a number.
Data can be _structured_, but still _unpredictable_.

```json
{
"name": "Joep",
"birthYear": 1991
}
```

If we want predictability, we need to make it _type-safe_.

## Level 3: type-safe data

_Examples: SQL + DB SCHEMA, JSON + JSON schema, XSD + XML, RDF + SHACL, In-memory data in type-safe programming languages_

Type-safe data means that every value of the data has an explicit datatype.
It is _strongly typed_ and has a clear _schema_ that describes which properties you can expect in a Resource.
This means that someone re-using type-safe data can know for certain that it conforms to a specification, a set of rules.
The shape of the data is _predictable_.
This predictability means that developers can safely re-use it in their system without worrying about missing fields or datatype errors.

Lots of software has _internal_ type safety, especially if you use type-safe programming languages like Typescript, Kotlin or Rust.
However, when the data _leaves the system_, a lot of type related data is lost.
Even if this schema related information is described, the schema itself is often not machine-readable.
The best way to have type-safe data, is to describe the schema in a machine-readable format.

In SQL, we can use a DB schema. In JSON, we can add a JSON Schema file. For XML, we have XSD.

In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) describe the required datatypes, which helps developers when re-using data understand what they can expect from a value.

```json
{
"https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
"https://atomicdata.dev/properties/name": "Joep",
"https://atomicdata.dev/properties/birthYear": 1991,
"https://atomicdata.dev/properties/worksOn": "Atomic Data",
}
```

## Level 4: browsable data

_Examples: Atomic Data, properly hosted RDF_

If your data is _connected_ to other pieces of machine-readable dat, is becomes browsable, similar to how websites link to each other.
This effectively creates a _web of data_, and allows for a whole new way to think about the internet.
This is what allows decentralized applications, true data ownership, and a new set of applications.

```json
{
"https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
"https://atomicdata.dev/properties/name": "Joep",
"https://atomicdata.dev/properties/birthYear": 1991,
"https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev",
}
```

## Level 5: verifiable data

_Examples: Atomic Data + Atomic Commits_

When your data is _verifiable_, other people can verify who created it and modified it.
They can use cryptography to validate signatures, which proves that one person or machine created a piece of data.

```json
{
"https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
"https://atomicdata.dev/properties/name": "Joep",
"https://atomicdata.dev/properties/birthYear": 1991,
"https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev",
"https://atomicdata.dev/properties/previousCommit": "https://atomicdata.dev/commits/EF18751AE781",
}
```