First draft of the upgrade doc for nom 4

passy · Jan 16, 2018 · 7a046da
1 parent e45d205
commit 7a046da
Showing 1 changed file with 221 additions and 0 deletions.
diff --git a/doc/upgrading_to_nom_4.md b/doc/upgrading_to_nom_4.md
@@ -0,0 +1,221 @@
+% Upgrading to nom 4.0
+
+# Upgrading to nom 4.0
+
+nom 4.0 is a nearly complete rewrite of nom's internal structures, along with a cleanup of many parsers and combinators whose semantics were unclear. Upgrading from previous nom versions can require a lot of changes, especially if you have many unit tests, but most of those changes are straightforward.
+
+## Changes in internal structures
+
+Previous versions of nom all generated parsers with the following signature:
+
+```rust,ignore
+fn parser(input: I) -> IResult<I,O> { ... }
+```
+
+With the following definition for `IResult`:
+
+```rust,ignore
+pub enum IResult<I,O,E=u32> {
+  /// remaining input, result value
+  Done(I,O),
+  /// indicates the parser encountered an error. E is a custom error type you can redefine
+  Error(Err<E>),
+  /// Incomplete contains a Needed, an enum that can represent a known quantity of input data, or unknown
+  Incomplete(Needed)
+}
+
+pub enum Needed {
+  /// needs more data, but we do not know how much
+  Unknown,
+  /// contains the required total data size
+  Size(usize)
+}
+
+// if the "verbose-errors" feature is not active
+pub type Err<E=u32> = ErrorKind<E>;
+
+// if the "verbose-errors" feature is active
+pub enum Err<P,E=u32>{
+  /// An error code, represented by an ErrorKind, which can contain a custom error code represented by E
+  Code(ErrorKind<E>),
+  /// An error code, and the next error
+  Node(ErrorKind<E>, Vec<Err<P,E>>),
+  /// An error code, and the input position
+  Position(ErrorKind<E>, P),
+  /// An error code, the input position and the next error
+  NodePosition(ErrorKind<E>, P, Vec<Err<P,E>>)
+}
+```
+
+The new design uses the `Result` type from the standard library:
+
+```rust,ignore
+pub type IResult<I, O, E = u32> = Result<(I, O), Err<I, E>>;
+
+pub enum Err<I, E = u32> {
+  /// There was not enough data
+  Incomplete(Needed),
+  /// The parser had an error (recoverable)
+  Error(Context<I, E>),
+  /// The parser had an unrecoverable error
+  Failure(Context<I, E>),
+}
+
+pub enum Needed {
+  /// needs more data, but we do not know how much
+  Unknown,
+  /// contains the required additional data size
+  Size(usize)
+}
+
+// if the "verbose-errors" feature is not active
+pub enum Context<I, E = u32> {
+  Code(I, ErrorKind<E>),
+}
+
+// if the "verbose-errors" feature is active
+pub enum Context<I, E = u32> {
+  Code(I, ErrorKind<E>),
+  List(Vec<(I, ErrorKind<E>)>),
+}
+```
+
+With this new design, the `Incomplete` case is now part of the error case, and we get a `Failure`
+case representing an unrecoverable error (combinators like `alt!` will not try another branch).
+The "verbose" error management is now a truly additive feature above the "simple" one (adding a
+case to an enum). Error management types also get smaller and more efficient. We can now return
+the related input as part of the error in all cases.
+
+All of this will likely not affect your existing parsers, but it will require changes to the
+surrounding code that manipulates parser results.
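
To make the difference between `Error` and `Failure` concrete, here is a minimal, self-contained sketch. It does not use nom: `Needed`, `Err`, `IResult`, `alt2`, `tag` and `fatal` are simplified local stand-ins written for illustration (no `ErrorKind`, no `Context`), and the real `alt!` macro is implemented differently.

```rust
// Local stand-ins that mirror the shape of the nom 4 types quoted above.
// Assumption: the error payload is just the input, instead of Context<I, E>.
#[derive(Debug, PartialEq)]
pub enum Needed {
    Unknown,
    Size(usize),
}

#[derive(Debug, PartialEq)]
pub enum Err<I> {
    Incomplete(Needed),
    Error(I),   // recoverable: another branch may still succeed
    Failure(I), // unrecoverable: stop trying alternatives
}

pub type IResult<I, O> = Result<(I, O), Err<I>>;

/// Hypothetical two-branch alternative. Only a recoverable `Error`
/// from `first` falls through to `second`; everything else is returned as-is.
pub fn alt2<'a, O>(
    input: &'a str,
    first: impl Fn(&'a str) -> IResult<&'a str, O>,
    second: impl Fn(&'a str) -> IResult<&'a str, O>,
) -> IResult<&'a str, O> {
    match first(input) {
        Err(Err::Error(_)) => second(input),
        other => other,
    }
}

/// Matches a literal prefix, returning a recoverable error otherwise.
pub fn tag<'a>(t: &'static str) -> impl Fn(&'a str) -> IResult<&'a str, &'a str> {
    move |input: &'a str| {
        if input.starts_with(t) {
            Ok((&input[t.len()..], &input[..t.len()]))
        } else {
            Err(Err::Error(input))
        }
    }
}

/// Always fails unrecoverably, carrying the input where it failed.
pub fn fatal(input: &str) -> IResult<&str, &str> {
    Err(Err::Failure(input))
}
```

In this sketch, `alt2("bcd", tag("a"), tag("b"))` succeeds through the second branch, while `alt2("bcd", fatal, tag("b"))` returns the `Failure` without ever trying `tag("b")`.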
+
+## Replacing parser result matchers
+
+Whenever you use pattern matching on the result of a parser, or compare it to another parser
+result (like in a unit test), you will have to perform the following changes:
+
+For the correct result case:
+
+```rust,ignore
+IResult::Done(i, o)
+
+// becomes
+
+Ok((i, o))
+```
+
+For the error case (note that argument position for `error_position` and other such macros was changed
+to match the rest of the code):
+
+```rust,ignore
+IResult::Error(error_position!(ErrorKind::OneOf, input)),
+
+// becomes
+
+Err(Err::Error(error_position!(input, ErrorKind::OneOf)))
+```
+
+For the `Incomplete` case:
+
+```rust,ignore
+IResult::Incomplete(Needed::Size(1))
+
+// becomes
+
+Err(Err::Incomplete(Needed::Size(1)))
+```
+
+For pattern matching, you now need to handle the `Failure` case as well, which works like the error
+case:
+
+```rust,ignore
+match result {
+  Ok((remaining, value)) => { ... },
+  Err(Err::Incomplete(needed)) => { ... },
+  Err(Err::Error(e)) | Err(Err::Failure(e)) => { ... }
+}
+```
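
A compilable version of such a match might look like the sketch below. It uses local stand-ins instead of nom's types so that it is self-contained: `Needed` and `Err` only mirror the shape shown earlier, the `String` error payload replaces `Context`, and `describe` is a hypothetical helper.

```rust
// Local stand-ins; with nom 4 you would import nom's own `Err` and `Needed`.
#[derive(Debug, PartialEq)]
pub enum Needed {
    Unknown,
    Size(usize),
}

#[derive(Debug, PartialEq)]
pub enum Err<E> {
    Incomplete(Needed),
    Error(E),
    Failure(E),
}

/// Hypothetical helper that turns a parser result into a message.
/// Assumption: the parser yields a `u32` and reports errors as `String`.
pub fn describe(result: Result<(&str, u32), Err<String>>) -> String {
    match result {
        Ok((remaining, value)) => format!("parsed {} with {:?} left over", value, remaining),
        Err(Err::Incomplete(Needed::Size(n))) => format!("need {} more bytes", n),
        Err(Err::Incomplete(Needed::Unknown)) => "need more data".to_string(),
        // Error and Failure carry the same payload, so one arm can handle both.
        Err(Err::Error(e)) | Err(Err::Failure(e)) => format!("parse error: {}", e),
    }
}
```

Code that needs to tell recoverable and unrecoverable errors apart would split the last arm in two instead of merging them.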
+
+## Errors on `Incomplete` data size calculation
+
+In previous versions, `Needed::Size(sz)` indicated the total needed data size (counting the actual input).
+Now it only indicates the additional data needed, so the values you match against in tests will change.
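
As a rough aid for updating expected values in tests, the relationship described above can be written as a one-line conversion. `additional_needed` is a hypothetical helper, not part of nom, and it assumes the old value really was a total that included the bytes already available; check the result against what your parser actually returns.

```rust
/// Converts an old-style total size into the additional size described above.
/// `available` is the number of bytes the parser already had when it
/// returned `Incomplete`. Saturates at zero instead of underflowing.
pub fn additional_needed(total: usize, available: usize) -> usize {
    total.saturating_sub(available)
}
```

Under that assumption, a test that used to expect `Needed::Size(4)` for a parser given 1 byte would now expect `Needed::Size(3)`.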
+
+## New trait for input types
+
+nom allows input types other than `&[u8]` and `&str`, as long as they implement a set of traits
+that are used everywhere in nom. This version introduces the `AtEof` trait:
+
+```rust
+pub trait AtEof {
+  fn at_eof(&self) -> bool;
+}
+```
+
+This trait allows the input value to indicate whether there can be more input coming later (buffering
+data from a file, or waiting for network data).
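
For a custom input type, an implementation could look like the following sketch. The trait is redeclared locally so the example compiles on its own, and `Chunk` with its `last` flag is a hypothetical type invented for illustration. A real custom input type would also have to implement the other input traits nom requires, which are not shown here.

```rust
// Redeclared locally to keep the example self-contained;
// with nom 4 you would implement nom's own `AtEof` trait instead.
pub trait AtEof {
    fn at_eof(&self) -> bool;
}

/// Hypothetical input type: a slice of buffered bytes plus a flag that
/// the reader sets once it knows no more data will arrive.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Chunk<'a> {
    pub data: &'a [u8],
    pub last: bool,
}

impl<'a> AtEof for Chunk<'a> {
    fn at_eof(&self) -> bool {
        // More input may follow unless this is the final chunk.
        self.last
    }
}
```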
+
+## Dealing with `Incomplete` usage
+
+nom's parsers are designed to work around streaming issues: if there is not enough data to decide, a
+parser will return `Incomplete` instead of returning a partial value that might be false.
+
+As an example, if you want to parse alphabetic characters then digits, when you get the whole input
+`abc123;`, the parser will return `abc` for alphabetic characters, and `123` for the digits, and `;`
+as remaining input.
+
+But if you get that input in chunks, like `ab` then `c123;`, the alphabetic characters parser will
+return `Incomplete`, because it does not know if there will be more matching characters afterwards.
+If it returned `ab` directly, the digit parser would fail on the rest of the input, even though the
+input had the valid format.
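
The chunked scenario above can be reproduced with a small hand-written parser. This is only a sketch of the streaming behaviour, not nom's implementation: `Err` is a reduced local stand-in with no payloads, and `alpha` is a hypothetical function.

```rust
// Reduced local stand-in for the error type; payloads are omitted.
#[derive(Debug, PartialEq)]
pub enum Err {
    Incomplete,
    Error,
}

/// Hypothetical streaming parser for one or more ASCII letters.
/// Returns `(remaining, matched)` on success, like a nom parser would.
pub fn alpha(input: &str) -> Result<(&str, &str), Err> {
    match input.find(|c: char| !c.is_ascii_alphabetic()) {
        // The first character is not a letter: nothing matched.
        Some(0) => Err(Err::Error),
        // A non-letter was seen, so the run of letters is known to be over.
        Some(end) => Ok((&input[end..], &input[..end])),
        // The input ended while still matching: more letters could follow.
        None => Err(Err::Incomplete),
    }
}
```

Here `alpha("abc123;")` yields `abc` with `123;` remaining, while `alpha("ab")` yields `Incomplete` because the next chunk might start with more letters.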
+
+For some users, though, the input will never be partial (everything could be loaded in memory at once),
+and the solution in nom 3 and before was to wrap parts of the parsers with the `complete!()` combinator,
+which transforms `Incomplete` into `Error`.
+
+nom 4 is much stricter about the behaviour with partial data, but provides better tools to deal with it.
+Thanks to the new `AtEof` trait for input types, nom now provides the `CompleteByteSlice(&[u8])` and
+`CompleteStr(&str)` input types, for which the `at_eof()` method always returns true.
+With these types, there is no need to put a `complete!()` combinator everywhere; you can just apply
+those types like this:
+
+```rust,ignore
+named!(parser<&str,ReturnType>, ... );
+
+// becomes
+
+named!(parser<CompleteStr,ReturnType>, ... );
+```
+
+```rust,ignore
+named!(parser<&str,&str>, ... );
+
+// becomes
+
+named!(parser<CompleteStr,CompleteStr>, ... );
+```
+
+```rust,ignore
+named!(parser, ... );
+
+// becomes
+
+named!(parser<CompleteByteSlice,CompleteByteSlice>, ... );
+```
+
+And as an example, for a unit test:
+
+```rust,ignore
+assert_eq!(parser("abcd123"), Ok(("123", "abcd")));
+
+// becomes
+
+assert_eq!(parser(CompleteStr("abcd123")), Ok((CompleteStr("123"), CompleteStr("abcd"))));
+```
+
+These types allow you to correctly handle cases like text formats for which there might be a last
+empty line or not, as seen in [one of the examples](https://github.com/Geal/nom/blob/87d837006467aebcdb0c37621da874a56c8562b5/tests/multiline.rs).
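
The idea behind these wrapper types can be sketched without nom. In the example below, `Streaming` and `Complete` are hypothetical local wrappers (they are not nom's `CompleteStr`), `AtEof` is redeclared locally, `Err` is a reduced stand-in, and `alpha` is a hand-written parser that consults `at_eof()` when the input runs out. nom's own parsers may make this decision differently.

```rust
pub trait AtEof {
    fn at_eof(&self) -> bool;
}

/// Hypothetical wrapper: more data may always arrive later.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Streaming<'a>(pub &'a str);

/// Hypothetical wrapper: the input is known to be whole.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Complete<'a>(pub &'a str);

impl<'a> AtEof for Streaming<'a> {
    fn at_eof(&self) -> bool { false }
}
impl<'a> AtEof for Complete<'a> {
    fn at_eof(&self) -> bool { true }
}
impl<'a> AsRef<str> for Streaming<'a> {
    fn as_ref(&self) -> &str { self.0 }
}
impl<'a> AsRef<str> for Complete<'a> {
    fn as_ref(&self) -> &str { self.0 }
}

#[derive(Debug, PartialEq)]
pub enum Err {
    Incomplete,
    Error,
}

/// Parses one or more ASCII letters, returning `(remaining, matched)`.
pub fn alpha<I: AtEof + AsRef<str>>(input: &I) -> Result<(&str, &str), Err> {
    let s: &str = input.as_ref();
    match s.find(|c: char| !c.is_ascii_alphabetic()) {
        Some(0) => Err(Err::Error),
        Some(end) => Ok((&s[end..], &s[..end])),
        // Input exhausted and nothing more can come: accept what matched.
        None if input.at_eof() && !s.is_empty() => Ok((&s[s.len()..], s)),
        // Input exhausted, nothing more can come, and nothing matched.
        None if input.at_eof() => Err(Err::Error),
        // Input exhausted but more may follow.
        None => Err(Err::Incomplete),
    }
}
```

With the streaming wrapper, `abcd` alone is `Incomplete`; with the complete wrapper, the same text parses successfully with nothing remaining, which is the behaviour that previously required `complete!()`.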
+
+## Producers and consumers
+
+Producers and consumers were removed in nom 4. That feature was too hard to integrate into code that
+deals with IO.
+