dag-cbor: cleanup, reference RFC 8949, include @ipld/dag-cbor

ipld · Jan 28, 2021 · d83d9b8 · d83d9b8
1 parent 193f1c1
commit d83d9b8
Showing 1 changed file with 62 additions and 50 deletions.
diff --git a/block-layer/codecs/dag-cbor.md b/block-layer/codecs/dag-cbor.md
@@ -6,28 +6,30 @@
 * [Links](#links)
 * [Map Keys](#map-keys)
 * [Strictness](#strictness)
- * [Floating Point Encoding (Unresolved)](#floating-point-encoding-unresolved)
 * [Implementations](#implementations)
  * [JavaScript](#javascript)
  * [Go](#go)
  * [Java](#java)
 * [Limitations](#limitations)
- * [JavaScript](#javascript-1)
+ * [JavaScript Numbers](#javascript-numbers)
 
 DAG-CBOR supports the full [IPLD Data Model].
 
-DAG-CBOR uses the [Concise Binary Object Representation (CBOR)] data format, which natively supports all [IPLD Data Model Kinds].
+DAG-CBOR uses the [Concise Binary Object Representation (CBOR)] data format, defined by [RFC 8949] (formerly [RFC 7049]), which natively supports all [IPLD Data Model Kinds].
 
 ## Format
 
 The CBOR IPLD format is called DAG-CBOR to disambiguate it from regular CBOR. Most simple CBOR objects are valid DAG-CBOR. The primary differences are:
- * tag `42` interpreted as CIDs, no other tags are supported
- * maps may only be keyed by strings
- * additional strictness requirements are applied to ensure canonical data encoding forms
+
+ * Tag `42` interpreted as CIDs, no other tags are supported.
+ * Maps may only be keyed by strings.
+ * Additional strictness requirements are applied to ensure canonical data encoding forms. See [Strictness](#strictness) below.
 
 ## Links
 
-As with all IPLD formats, DAG-CBOR must be able to encode [Links]. In DAG-CBOR, links are the binary form of a [CID] encoded using the raw-binary identity [Multibase]. That is, the Multibase identity prefix (`0x00`) is prepended to the binary form of a CID and this new byte array is encoded into CBOR as a byte-string (major type 2), with the tag `42`.
+In DAG-CBOR, [Links] are the binary form of a [CID] encoded using the raw-binary identity [Multibase]. That is, the Multibase identity prefix (`0x00`) is prepended to the binary form of a CID and this new byte array is encoded into CBOR as a byte-string (major type 2), and associated with CBOR tag `42`.
+
+Tag `42` is associated in the [CBOR Tags Registry] as "IPLD content identifier" and is further defined in [IPLD content identifiers (CIDs) in CBOR].
 
 The inclusion of the Multibase prefix exists for historical reasons and the identity prefix *must not* be omitted.
 
@@ -37,93 +39,103 @@ In DAG-CBOR, map keys must be strings, as defined by the [IPLD Data Model]. Othe
 
 ## Strictness
 
-DAG-CBOR requires that there exist a single, canonical way of encoding any given object, and that encoded forms contain no superfluous data that may be ignored or lost in a round-trip decode/encode.
+DAG-CBOR requires that there exist a single, canonical way of encoding any given set of data, and that encoded forms contain no superfluous data that may be ignored or lost in a round-trip decode/encode.
 
 Therefore the DAG-CBOR codec must:
 
 1. Use no tags other than the CID tag (`42`). A valid DAG-CBOR encoder must not encode using any additional tags and a valid DAG-CBOR decoder must reject objects containing additional tags as invalid.
- * This includes any of the initial values of the tag registry in [section 2.4 of the CBOR specification], such as dates, bignums, bigfloats, URIs, regular expressions and other complex, or simple values whether or not they map to the [IPLD Data Model].
-2. The only usable major type 7 minor types are those for encoding Floats (`25`, `26`, `27`), True (`20`), False (`21`) and Null (`22`).
- * "Simple values" are not supported. This includes all registered or unregistered simple values that are encoded with a major type 7.
- * Undefined (`23`) is not supported.
-3. Use the canonical CBOR encoding defined by the suggestions in [section 3.9 of the CBOR specification]. A valid DAG-CBOR decoder should reject objects not following these restrictions as invalid. Specifically:
+ * This includes any of the well defined tag numbers listed [section 3.4 of RFC 8949], such as dates, bignums, bigfloats, URIs, regular expressions and other complex, or simple values whether or not they map to the [IPLD Data Model].
+2. Use the "Deterministically Encoded CBOR" rule suggestions defined in [section 4.2 of RFC 8949] except for map key ordering, which follow the original rules as defined in [section 3.9 of RFC 7049]. Therefore, a valid DAG-CBOR encoder should produce encoded forms that adhere to the following rules, and a valid DAG-CBOR decoder should reject encoded forms not adhering to the following rules:
  * Integer encoding must be as short as possible.
  * The expression of lengths in major types 2 through 5 must be as short as possible.
  * The expression of tag numbers (specifically only `42`) must be as short as possible for major type 6. Therefore, for valid DAG-CBOR, the only tag token that can appear is `0xd82a` - where `0xd8` is "major type 6 with 8-bit integer to follow" and `0x2a` is the number `42`.
- * The keys in every map must be sorted lowest value to highest. Sorting is performed on the bytes of the representation of the keys.
+ * The keys in every map must be sorted length-first by the byte representation of the string keys, where:
  - If two keys have different lengths, the shorter one sorts earlier;
  - If two keys have the same length, the one with the lower value in (byte-wise) lexical order sorts earlier.
  * Indefinite-length items are not supported, only definite-length items are usable. This includes strings, bytes, lists and maps. The "break" token is also not supported.
-4. Encode and decode a single top-level CBOR object and not allow back-to-back concatenated objects, as suggested by [section 3.1 of the CBOR specification] for _streaming applications_. All bytes of an encoded DAG-CBOR object must decode to a single object. Extraneous bytes, whether valid or invalid CBOR, should fail validation.
-5. Floating point values are always encoded in 64-bit, double-precision form, regardless of whether they can be represented as half (16) or single (32) precision.
-6. IEEE 754 special values `NaN`, `Infinity` and `-Infinity` should not be accepted as they do not appear in the IPLD Data Model. Therefore, tokens `0xf97c00` (`Infinity`), `0xf97e00` (`NaN`) and `0xf9fc00` (`-Infinity`) and their 32-bit and 64-bit variants, should not appear, or be accepted in DAG-CBOR binary form.
+3. The only usable major type 7 minor types are those for encoding Floats (minors `25`, `26`, `27`), True (minor `20`), False (minor `21`) and Null (minor `22`).
+ * [Simple Values] other than True, False and Null are not supported. This includes all registered or unregistered simple values that are encoded with a major type 7 other than True, False and Null.
+ * Undefined (minor `23`) is not supported as it is not part of the [IPLD Data Model].
+4. Floating point values must always encoded in 64-bit, double-precision form, regardless of whether they can be represented as half (16) or single (32) precision.
+5. [IEEE 754] special values `NaN`, `Infinity` and `-Infinity` must not be accepted as they do not appear in the [IPLD Data Model]. Therefore, tokens `0xf97c00` (`Infinity`), `0xf97e00` (`NaN`) and `0xf9fc00` (`-Infinity`), their 16-bit, 32-bit and 64-bit variants, and any other [IEEE 754] byte layout that is interpreted as these values, should not appear, or be accepted in DAG-CBOR binary form.
+6. Encode and decode must operate on a single top-level CBOR object. Back-to-back concatenated objects are not allowed or supported, as suggested by [section 5.1 of RFC 8949] for _streaming applications_. All bytes of an encoded DAG-CBOR object must decode to a single object. Extraneous bytes included in an IPLD block, whether valid or invalid CBOR, must not be accepted as valid DAG-CBOR.
 
 ## Implementations
 
 ### JavaScript
 
-[dag-cbor], used by [ipld] and [@ipld/block] adheres to this specification, with the following caveats:
+**[@ipld/dag-cbor]**, for use with [multiformats] adheres to this specification, with the following caveats:
+ * Complete strictness is not yet enforced on decode. Specifically: correct map ordering is not enforced and floats that are not encoded as 64-bit are not rejected.
+ * [`BigInt`] is accepted along with `Number` for encode, but the smallest-possible rule is followed when encoding. When decoding integers outside of the JavaScript "safe integer" range, a [`BigInt`] will be used.
+
+The legacy **[ipld-dag-cbor]** implementation adheres to this specification, with the following caveats:
 
- * Strictness is not yet enforced on decode, blocks encoded that don't follow the strictness rules are not rejected
- * Floating point values are encoded as their smallest form rather than always double-precision.
- * Many additional object types outside of the Data Model are currently accepted for encoding.
- * IEEE 754 special values `NaN`, `Infinity` and `-Infinity` are accepted for decode and encode.
+ * Strictness is not enforced on decode; blocks encoded that do not follow the strictness rules are not rejected.
+ * Floating point values are encoded as their smallest form rather than always 64-bit.
+ * Many additional object types outside of the Data Model are currently accepted for decode and encode, including `undefined`.
+ * [IEEE 754] special values `NaN`, `Infinity` and `-Infinity` are accepted for decode and encode.
+ * Integers outside of the JavaScript "safe integer" range will use the third-party [bignumber.js] library to represent their values.
+
+Note that inability to clearly differentiate between integers and floats in JavaScript may cause problems with round-trips of floating point values. See the [IPLD Data Model] and the discussion on [Limitations](#limitations) below for further discussion on JavaScript numbers and recommendations regarding the use of floats.
 
 ### Go
 
-[ipld-cbor] and [ipld-prime] adhere to this specification, with the following caveats:
+**[go-ipld-cbor]** and **[go-ipld-prime]** adhere to this specification, with the following caveats:
 
- * Strictness is not yet enforced on decode, blocks encoded that don't follow the strictness rules are not rejected
- * IEEE 754 special values `NaN`, `Infinity` and `-Infinity` are accepted for decode and encode.
+ * Strictness is not enforced on decode; blocks encoded that do not follow the strictness rules are not rejected.
+ * [IEEE 754] special values `NaN`, `Infinity` and `-Infinity` are accepted for decode and encode.
 
 ### Java
 
-[java ipld from Peergos](https://github.com/Peergos/Peergos/tree/master/src/peergos/shared/cbor) adhere to this specification, with the following caveats:
+[Java IPLD from Peergos] adheres to this specification, with the following caveats:
 
- * Strictness is not yet enforced on decode, blocks encoded that don't follow the strictness rules are not rejected
- * Floats are disabled
+ * Strictness is not enforced on decode; blocks encoded that do not follow the strictness rules are not rejected.
+ * Floats are disabled.
 
 ## Limitations
 
-### JavaScript
+### JavaScript Numbers
 
 Users of DAG-CBOR that expect their data may be consumed or produced by JavaScript at some point should be aware of limitations that the language imposes on its use of DAG-CBOR, specifically concerning numbers.
 
 All JavaScript numbers, both floating point and integer, (using the [`Number`] primitive wrapper) are represented internally as 64-bit [IEEE 754] floating-point values (i.e. double-precision). Some implications within JavaScript of this design choice are:
 
  * There is no clear differentiation between a pure integer type and a floating-point number where a developer may wish to have such a differentiation.
  * By convention, JavaScript engines and developers usually omit the decimal point when representing whole numbers, simulating integers where the number is not actually stored as an integer.
- * There are limits on maximum and minimum safe integer sizes representable in JavaScript that are more constrained than those of languages where there are 64-bit integer types. Numbers outside of the range of `Number.MAX_SAFE_INTEGER` (`2`<sup>`53`</sup>` - 1`) and `Number.MIN_SAFE_INTEGER` (`-(2`<sup>`53`</sup>` - 1)`) cannot be safely manipulated or inspected as they incur rounding effects imposed by the IEEE 754 representation.
+ * There are limits on maximum and minimum safe integer sizes representable in JavaScript that are more constrained than those of languages where there are 64-bit integer types. Numbers outside of the range of `Number.MAX_SAFE_INTEGER` (`2`<sup>`53`</sup>` - 1`) and `Number.MIN_SAFE_INTEGER` (`-(2`<sup>`53`</sup>` - 1)`) cannot be safely manipulated or inspected as they incur rounding effects imposed by the [IEEE 754] representation.
  * Native bit-wise operations on "integers" are not able to be performed outside of the 32-bit range; larger numbers will be truncated.
 
-The current CBOR encoder/decoder used by the primary JavaScript DAG-CBOR implementation uses the [bignumber.js] library to handle large numbers in some cases, although reliance on its wrapper type is not recommended by DAG-CBOR users.
+[@ipld/dag-cbor] supports [`BigInt`] for values outside of the safe integer range, while the legacy [ipld-dag-cbor] uses the third-party [bignumber.js] library to handle these values.
 
 The implications for DAG-CBOR of these limitaitons are:
 
- * Any `Number` serialized by the JavaScript CBOR encoder relies on a whole-number check (e.g. `x % 1 === 0`) to determine whether it should be encoded as an integer or a float.
- * Any float deserialized by the JavaScript CBOR decoder that does not have a fractional component will be indistinguishable from an integer to a JavaScript program.
- * Any `Number` greater than `Number.MAX_SAFE_INTEGER` or less than `Number.MIN_SAFE_INTEGER` cannot be properly inspected for its whole-number status and is therefore encoded by the JavaScript CBOR encoder as float regardless of whether it is a whole-number or has a fractional component.
- * Any integer deserialized by the JavaScript CBOR decoder greater than `Number.MAX_SAFE_INTEGER` or less than `Number.MIN_SAFE_INTEGER` will be returned as a bignumber.js wrapper type, which may be unexpected to users and have unexpected effects on downstream code.
-
-A new [BigInt] built-in type is currently being adopted across JavaScript engines. Once support is widely available, it is expected that this type will assist with some of these challenges.
+ * Any integer deserialized by the JavaScript CBOR decoder greater than `Number.MAX_SAFE_INTEGER` or less than `Number.MIN_SAFE_INTEGER` will be returned as a [`BigInt`] from [@ipld/dag-cbor] or a [bignumber.js] wrapper type from [ipld-dag-cbor], which may be unexpected to users and have unexpected effects on downstream code.
+ * Any `Number` serialized by the JavaScript CBOR encoder relies on a whole-number check (i.e. `Number.isInteger()`, roughly `x % 1 === 0`) to determine whether it should be encoded as an integer or a float.
+ * Any float deserialized by the JavaScript CBOR decoder that does not have a fractional component will be indistinguishable from an integer to a JavaScript program and may not round-trip to the same bytes if originally produced by non-JavaScript code.
+ * Any `Number` greater than `Number.MAX_SAFE_INTEGER` or less than `Number.MIN_SAFE_INTEGER` cannot be properly inspected for its whole-number status and is therefore encoded by the JavaScript CBOR encoder as float regardless of whether it is a whole-number or has a fractional component. [`BigInt`] should be used for [@ipld/dag-cbor] when dealing with integers outside of the safe range to ensure proper handling.
 
 [IPLD Data Model]: ../../data-model-layer/data-model.md
-[Concise Binary Object Representation (CBOR)]: https://tools.ietf.org/html/rfc7049
+[Concise Binary Object Representation (CBOR)]: https://cbor.io/
+[RFC 8949]: https://tools.ietf.org/html/rfc8949
+[RFC 7049]: https://tools.ietf.org/html/rfc7049
 [IPLD Data Model Kinds]: ../../data-model-layer/data-model.md#kinds
 [Links]: ../../data-model-layer/data-model.md#link-kind
-[CIDs]: ../CID.md
+[CID]: ../CID.md
 [Multibase]: https://github.com/multiformats/multibase
-[section 2.4 of the CBOR specification]: https://tools.ietf.org/html/rfc7049#section-2.4
-[section 3.9 of the CBOR specification]: https://tools.ietf.org/html/rfc7049#section-3.9
-[section 3.1 of the CBOR specification]: https://tools.ietf.org/html/rfc7049#section-3.1
-[borc]: https://github.com/dignifiedquire/borc
-[dag-cbor]: https://github.com/ipld/js-ipld-dag-cbor/
-[refmt]: https://github.com/polydawn/refmt/
-[ipld-cbor]: https://github.com/ipfs/go-ipld-cbor
-[ipld-prime]: http://github.com/ipld/go-ipld-prime
-[ipld]: https://github.com/ipld/js-ipld
-[@ipld/block]: https://github.com/ipld/js-block
+[CBOR Tags Registry]: https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml
+[IPLD content identifiers (CIDs) in CBOR]: https://github.com/ipld/cid-cbor/
+[section 3.4 of RFC 8949]: https://tools.ietf.org/html/rfc8949#section-3.4
+[section 4.2 of RFC 8949]: https://tools.ietf.org/html/rfc8949#section-4.2
+[section 3.9 of RFC 7049]: https://tools.ietf.org/html/rfc7049#section-3.9
+[Simple Values]: https://tools.ietf.org/html/rfc8949#section-2.1
+[section 5.1 of RFC 8949]: https://tools.ietf.org/html/rfc8949#section-5.1
+[@ipld/dag-cbor]: https://github.com/ipld/js-dag-cbor/
+[multiformats]: https://github.com/multiformats/js-multiformats/
+[`BigInt`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt
+[ipld-dag-cbor]: https://github.com/ipld/js-ipld-dag-cbor/
+[go-ipld-cbor]: https://github.com/ipfs/go-ipld-cbor
+[go-ipld-prime]: http://github.com/ipld/go-ipld-prime
+[Java IPLD from Peergos]: https://github.com/Peergos/Peergos/tree/master/src/peergos/shared/cbor
 [`Number`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number
 [IEEE 754]: https://en.wikipedia.org/wiki/Floating-point_arithmetic
 [bignumber.js]: https://github.com/MikeMcl/bignumber.js
-[BigInt]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt