From d7537e438c32aa13c0e4feb5d5212a20c7465ca4 Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Wed, 2 Mar 2022 23:12:50 +0100 Subject: [PATCH 1/8] #103 WIP 5 levels --- src/5-levels-of-data.md | 67 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) create mode 100644 src/5-levels-of-data.md diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md new file mode 100644 index 0000000..d917e83 --- /dev/null +++ b/src/5-levels-of-data.md @@ -0,0 +1,67 @@ +# 5 Levels of data usability + +Not all data are created equal. +There are notable differences in how much you can do with data, how flexible it is. +The more usable data is, the easier it will be to re-use it for developer, researcher or other type of data user. + +_This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata.info/en/)_. + +## Level 1: unstructured data + +_Examples: images, videos, plain text_ + +Unstructured data is the least usable. +Humans can read it, and AI / Machine Learning systems can draw more conclusions from it then ever, +but it's hard to build an actual application or graphic from only unstructured data. + +``` +Hi! I'm Joep, I'm born in 1991. +``` + +## Level 2: structured data + +_Examples: CSV, XML, JSON, TOML, EXCEL_ + +Structured data can be read by machines, and this allows us to do all sorts of useful things. +We can _query_, _sort_ and _filter_. +But still, this type of data often requires human input when it needs to be processed. +A human needs to make + + +- Requires human interpretation +- No semantic definitions of what properties represent +- Can be readed by machines if mapped correctly +- Often requires handling invalid data + +```json +{ + "name": "Joep", + "birthYear": "" +} +``` + +## Level 3: type-safe data + +_Examples: SQL + DB SCHEMA, JSON + JSON schema, XSD + XML, RDF + SHACL_ + +Type-safe data means that every value of the data has an explicit datatype, and that these datatypes can be constrained. 
+This means that someone re-using this data can know for certain that it conforms to a certain specification, a set of rules. +The shape of the data is predictable. + + +```json +{ + "https://atomicdata.dev/properties/name": "Joep", + "https://atomicdata.dev/properties/birthYear": 1991 +} +``` + +## Level 4: browsable data + +_Examples: Atomic Data_ + +If your data is _connected_ to other pieces of machine-readable dat, is becomes browsable, similar to how websites link to each other. +This effectively creates a _web of data_, and allows for a whole new way to think about the internet. +This is what allows decentralized applications, true data ownership, and a new set of applications. + +- Is connected to other pieces of machine verifiable data From 2a6bb6628acee9fb31ff5ddd01de0ae22608e180 Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 12:35:35 +0100 Subject: [PATCH 2/8] #103 add verifiable data --- src/5-levels-of-data.md | 61 ++++++++++++++++++++++++++++++----------- 1 file changed, 45 insertions(+), 16 deletions(-) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index d917e83..e52f0fa 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -1,8 +1,8 @@ -# 5 Levels of data usability +# 5 Levels of data reusability Not all data are created equal. There are notable differences in how much you can do with data, how flexible it is. -The more usable data is, the easier it will be to re-use it for developer, researcher or other type of data user. +The more reusable data is, the easier it will be to use it as a developer, researcher or other type of data user. _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata.info/en/)_. @@ -25,34 +25,41 @@ _Examples: CSV, XML, JSON, TOML, EXCEL_ Structured data can be read by machines, and this allows us to do all sorts of useful things. We can _query_, _sort_ and _filter_. 
But still, this type of data often requires human input when it needs to be processed. -A human needs to make - - -- Requires human interpretation -- No semantic definitions of what properties represent -- Can be readed by machines if mapped correctly -- Often requires handling invalid data +And we don't have guarantees about which fields will be filled, or what their datatypes are. +One time, a `birthYear` can be a string, and the next time it can be a number. +Data can be _structured_, but still _unpredictable_. ```json { "name": "Joep", - "birthYear": "" + "birthYear": 1991 } ``` +If we want predictability, we need to make it _type-safe_. + ## Level 3: type-safe data -_Examples: SQL + DB SCHEMA, JSON + JSON schema, XSD + XML, RDF + SHACL_ +_Examples: SQL + DB SCHEMA, JSON + JSON schema, XSD + XML, RDF + SHACL, In-memory data in type-safe programming langauges_ -Type-safe data means that every value of the data has an explicit datatype, and that these datatypes can be constrained. -This means that someone re-using this data can know for certain that it conforms to a certain specification, a set of rules. -The shape of the data is predictable. +Type-safe data means that every value of the data has an explicit datatype. +It is _strongly typed_ and has a clear _schema_ that describes which properties you can expect in a Resource. +This means that someone re-using type-safe data can know for certain that it conforms to a specification, a set of rules. +The shape of the data is _predictable_. +This predictability means that developers can safely re-use it in their system without worrying about missing fields or datatype errors. +Lots of software has _internal_ type safety, especially if you use type-safe programming langauges like Typescript, Kotlin or Rust. +However, when the data _leaves the system_, a lot of type related data is lost. +Even if this schema related information is described, the schema itself is often not machine-readable. 
+The best way to have type-safe data, is to describe the schema in a machine-readable format. + +In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) describe the datatypes, which helps developers when re-using data. ```json { "https://atomicdata.dev/properties/name": "Joep", - "https://atomicdata.dev/properties/birthYear": 1991 + "https://atomicdata.dev/properties/birthYear": 1991, + "https://atomicdata.dev/properties/worksOn": "Atomic Data", } ``` @@ -64,4 +71,26 @@ If your data is _connected_ to other pieces of machine-readable dat, is becomes This effectively creates a _web of data_, and allows for a whole new way to think about the internet. This is what allows decentralized applications, true data ownership, and a new set of applications. -- Is connected to other pieces of machine verifiable data +```json +{ + "https://atomicdata.dev/properties/name": "Joep", + "https://atomicdata.dev/properties/birthYear": 1991, + "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev", +} +``` + +## Level 5: verifiable data + +_Examples: Atomic Data + Atomic Commits_ + +When your data is _verifiable_, other people can verify who created it and modified it. +They can use cryptography to validate signatures, which proves that one person or machine created a piece of data. 
+ +```json +{ + "https://atomicdata.dev/properties/name": "Joep", + "https://atomicdata.dev/properties/birthYear": 1991, + "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev", + "https://atomicdata.dev/properties/previousCommit": "https://atomicdata.dev/commits/EF18751AE781", +} +``` From 9a4cf796dd5e64d151523b4a05fac3284760aea4 Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 12:36:41 +0100 Subject: [PATCH 3/8] #103 add class to type-safety --- src/5-levels-of-data.md | 1 + 1 file changed, 1 insertion(+) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index e52f0fa..fcefd59 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -57,6 +57,7 @@ In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) des ```json { + "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"], "https://atomicdata.dev/properties/name": "Joep", "https://atomicdata.dev/properties/birthYear": 1991, "https://atomicdata.dev/properties/worksOn": "Atomic Data", From 5226f11c237f5876462c2f3f8f35a7136d41f792 Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 13:30:40 +0100 Subject: [PATCH 4/8] #103 add level 0, proprietary data --- src/5-levels-of-data.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index fcefd59..1f56179 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -1,11 +1,18 @@ # 5 Levels of data reusability Not all data are created equal. -There are notable differences in how much you can do with data, how flexible it is. +There are notable differences in how much you can do with data and how much effort it takes. The more reusable data is, the easier it will be to use it as a developer, researcher or other type of data user. _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata.info/en/)_. 
+## Level 0: proprietary data + +If you don't give others the _rights_ to read, use or modify your data, it's reusability is zero. +That's why it's important to have licences that allow others to use data. +It's also important to use _open formats_, intead of _proprietary formats_. +Creative Commons licenses are a great way to clearly communicate that your data is meant to be re-used. + ## Level 1: unstructured data _Examples: images, videos, plain text_ @@ -53,7 +60,9 @@ However, when the data _leaves the system_, a lot of type related data is lost. Even if this schema related information is described, the schema itself is often not machine-readable. The best way to have type-safe data, is to describe the schema in a machine-readable format. -In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) describe the datatypes, which helps developers when re-using data. +In SQL, we can use a DB schema. In JSON, we can add a JSON Schema file. For XML, we have XSD. + +In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) describe the required datatypes, which helps developers when re-using data understand what they can expect from a value. 
```json { @@ -74,6 +83,7 @@ This is what allows decentralized applications, true data ownership, and a new s ```json { + "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"], "https://atomicdata.dev/properties/name": "Joep", "https://atomicdata.dev/properties/birthYear": 1991, "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev", From 5c8040d187e180b276332c2f11d1cf2c37a22cee Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 13:55:59 +0100 Subject: [PATCH 5/8] #103 CC fix --- src/5-levels-of-data.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index 1f56179..9b1301f 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -10,8 +10,8 @@ _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata If you don't give others the _rights_ to read, use or modify your data, it's reusability is zero. That's why it's important to have licences that allow others to use data. -It's also important to use _open formats_, intead of _proprietary formats_. -Creative Commons licenses are a great way to clearly communicate that your data is meant to be re-used. +It's also important to use _open formats_ (such as `CSV`, `JSON` or `PNG`), intead of _proprietary formats_ (tied to specific vendors, such as `PSD` or `RAR`). +Creative Commons licenses are great to clearly communicate _if_, and if so then _how_, your data is permitted to be re-used. ## Level 1: unstructured data @@ -75,7 +75,7 @@ In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) des ## Level 4: browsable data -_Examples: Atomic Data_ +_Examples: Atomic Data, propertly hosted RDF_ If your data is _connected_ to other pieces of machine-readable dat, is becomes browsable, similar to how websites link to each other. This effectively creates a _web of data_, and allows for a whole new way to think about the internet. 
From b1d053c9ac68fd6cfc50d0158feeacc76a621863 Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 14:20:33 +0100 Subject: [PATCH 6/8] #103 open database license --- src/5-levels-of-data.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index 9b1301f..9d4bd4b 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -9,9 +9,13 @@ _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata ## Level 0: proprietary data If you don't give others the _rights_ to read, use or modify your data, it's reusability is zero. -That's why it's important to have licences that allow others to use data. + +That's why it's important to have a _licence_ that allow others to use your data. +A good choice for a permissive option is the [Open Database License](https://opendatacommons.org/licenses/odbl/summary/). +Creative Commons licenses are also good options to clearly communicate _if_, and if so then _how_, your data is permitted to be re-used. + It's also important to use _open formats_ (such as `CSV`, `JSON` or `PNG`), intead of _proprietary formats_ (tied to specific vendors, such as `PSD` or `RAR`). -Creative Commons licenses are great to clearly communicate _if_, and if so then _how_, your data is permitted to be re-used. 
+ ## Level 1: unstructured data @@ -99,6 +103,7 @@ They can use cryptography to validate signatures, which proves that one person o ```json { + "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"], "https://atomicdata.dev/properties/name": "Joep", "https://atomicdata.dev/properties/birthYear": 1991, "https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev", From dbfc51167e9ca23b390039696b6106a0e8b6eebe Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 14:37:35 +0100 Subject: [PATCH 7/8] Spell, re-use --- src/5-levels-of-data.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index 9d4bd4b..9b61e9d 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -3,6 +3,7 @@ Not all data are created equal. There are notable differences in how much you can do with data and how much effort it takes. The more reusable data is, the easier it will be to use it as a developer, researcher or other type of data user. +Reusability is about being able to transform, sort, query, serialize, modify, render and audit data without requiring too much work. _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata.info/en/)_. @@ -10,11 +11,11 @@ _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata If you don't give others the _rights_ to read, use or modify your data, it's reusability is zero. -That's why it's important to have a _licence_ that allow others to use your data. +That's why it's important to have a _license_ that allow others to use your data. A good choice for a permissive option is the [Open Database License](https://opendatacommons.org/licenses/odbl/summary/). Creative Commons licenses are also good options to clearly communicate _if_, and if so then _how_, your data is permitted to be re-used.
-It's also important to use _open formats_ (such as `CSV`, `JSON` or `PNG`), intead of _proprietary formats_ (tied to specific vendors, such as `PSD` or `RAR`). +It's also important to use _open formats_ (such as `CSV`, `JSON` or `PNG`), instead of _proprietary formats_ (tied to specific vendors, such as `PSD` or `RAR`). ## Level 1: unstructured data @@ -51,7 +52,7 @@ If we want predictability, we need to make it _type-safe_. ## Level 3: type-safe data -_Examples: SQL + DB SCHEMA, JSON + JSON schema, XSD + XML, RDF + SHACL, In-memory data in type-safe programming langauges_ +_Examples: SQL + DB SCHEMA, JSON + JSON schema, XSD + XML, RDF + SHACL, In-memory data in type-safe programming languages_ Type-safe data means that every value of the data has an explicit datatype. It is _strongly typed_ and has a clear _schema_ that describes which properties you can expect in a Resource. @@ -59,7 +60,7 @@ This means that someone re-using type-safe data can know for certain that it con The shape of the data is _predictable_. This predictability means that developers can safely re-use it in their system without worrying about missing fields or datatype errors. -Lots of software has _internal_ type safety, especially if you use type-safe programming langauges like Typescript, Kotlin or Rust. +Lots of software has _internal_ type safety, especially if you use type-safe programming languages like Typescript, Kotlin or Rust. However, when the data _leaves the system_, a lot of type related data is lost. Even if this schema related information is described, the schema itself is often not machine-readable. The best way to have type-safe data, is to describe the schema in a machine-readable format. 
@@ -79,7 +80,7 @@ In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) des ## Level 4: browsable data -_Examples: Atomic Data, propertly hosted RDF_ +_Examples: Atomic Data, properly hosted RDF_ If your data is _connected_ to other pieces of machine-readable dat, is becomes browsable, similar to how websites link to each other. This effectively creates a _web of data_, and allows for a whole new way to think about the internet. From 46ecf1f7aae40cc0faccf86240af802810e7bc8a Mon Sep 17 00:00:00 2001 From: Joep Meindertsma Date: Thu, 3 Mar 2022 16:23:49 +0100 Subject: [PATCH 8/8] #103 less politics --- src/5-levels-of-data.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/src/5-levels-of-data.md b/src/5-levels-of-data.md index 9b61e9d..6a35d6a 100644 --- a/src/5-levels-of-data.md +++ b/src/5-levels-of-data.md @@ -11,9 +11,8 @@ _This list is inspired by Tim Berners-Lee's [5-star open data](https://5stardata If you don't give others the _rights_ to read, use or modify your data, it's reusability is zero. -That's why it's important to have a _license_ that allow others to use your data. -A good choice for a permissive option is the [Open Database License](https://opendatacommons.org/licenses/odbl/summary/). -Creative Commons licenses are also good options to clearly communicate _if_, and if so then _how_, your data is permitted to be re-used. +That's why it's important to have a _license_ that allow others to use your data, like the [Open Database License](https://opendatacommons.org/licenses/odbl/summary/). +or one of the Creative Commons licenses. It's also important to use _open formats_ (such as `CSV`, `JSON` or `PNG`), instead of _proprietary formats_ (tied to specific vendors, such as `PSD` or `RAR`).