How to recover original text when a row fails decoding? #188

seanf · 2019-06-25T23:12:20Z

As a simplified example, let's say I'm parsing a file like this, which contains five lines but only three records (with two fields: a String and an Int):

abc,123
def,456
ghi,"j
k
l"

Lines 0 and 1 parse okay, so I process them immediately.

The last three lines constitute one record because of the quotes, and parsing fails because j\nk\nl is not a valid number. So I want to extract that record for separate, manual processing. To do that, I need a verbatim copy of the line(s) which couldn't be parsed:

ghi,"j
k
l"

Alternatively, the cell might represent an enum (really case objects, used to encode a record type), with a CellDecoder which decodes one of N strings into one of N enums. But if an unexpected String shows up, I need to save the entire record for later processing.

I'm dealing with a case like this right now, but I've had to make the assumption that there are never any quotes (or any multi-line records). Thus I can split the file/stream into lines and parse the lines one at a time, converting each line to Either[String, MyRow] (where a String is a bad line, and MyRow is a successfully parsed record). But this code could be simpler if the parser simply returned an error which included the offending line(s), plus it would work with quotes and multi-line records.
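For illustration, the line-at-a-time workaround described above might look roughly like the sketch below. It uses only the Scala standard library rather than kantan.csv, and `MyRow`, `LineByLine`, `parseLine`, and `parseAll` are hypothetical names. It assumes no quoted fields and no multi-line records, which is exactly the limitation in question.

```scala
import scala.util.Try

// Hypothetical record type: one String field and one Int field.
final case class MyRow(name: String, value: Int)

object LineByLine {
  // Left keeps the verbatim bad line; Right holds the parsed record.
  // Assumes no quoting, so a plain split on ',' is enough.
  def parseLine(line: String): Either[String, MyRow] =
    line.split(",", -1) match {
      case Array(name, value) =>
        Try(value.toInt).toOption.map(MyRow(name, _)).toRight(line)
      case _ =>
        Left(line) // wrong number of fields
    }

  // Splits on line breaks, so a quoted multi-line record would be torn apart.
  def parseAll(input: String): List[Either[String, MyRow]] =
    input.split("\r?\n").toList.filter(_.nonEmpty).map(parseLine)
}
```

Because each bad line is returned untouched in a `Left`, it can be stored as-is, but only as long as every record really does fit on one line.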

Related: #183

Note that a CellDecoder[Either[String, MyCell]] won't work, because I need the entire line (or multi-line record) to reprocess it.

A RowDecoder[Either[String, MyRow]] is closer, but it looks like RowDecoders only receive a Seq[String], and I need the entire original line (or lines) as a String, unaltered. Trying to convert the Seq[String] back to a String is bound to involve some loss (e.g. if there were extra fields at the end, or a trailing comma).


nrinaudo · 2019-06-26T00:47:05Z

Can you explain why you’d want a String rather than a CSV row (a Seq[String])? I think the rest of your issue is now clear, but I don’t understand why you’d want to re-implement the csv parsing logic yourself.

seanf · 2019-07-26T06:24:06Z

@nrinaudo I wouldn't want to re-implement the parsing as such, but if the row processing has gone terribly wrong for whatever reason, I want to save the record in its original, unaltered form. This way, after my code (or kantan.csv config) is fixed/updated, the data can be reparsed and reprocessed from the beginning.

I'm planning to send these unhandled records to a kind of dead letter queue, and move them back to the input queue when the code is ready. So I need a String which contains one or more CSV records, since that's what the input queue carries. And when things are going wrong, I don't want to introduce any unnecessary changes by converting String to Seq[String] and back to String.
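One possible interim workaround, which is not a kantan.csv feature, is to split the input into verbatim records before decoding, so the raw text is still at hand when a record fails. The sketch below is a minimal version under stated assumptions: the quote character is `"`, records end with `\n`, and escaped quotes are written as `""`. `RawRecords`, `split`, `decodeKeepingRaw`, and the `decode` parameter are all hypothetical names. `decode` stands in for whatever single-record decoder is in use.

```scala
object RawRecords {
  // Splits CSV text into verbatim records. A newline inside double quotes
  // does not end a record. An escaped quote ("") toggles the flag twice,
  // so the quoted state is unchanged after it.
  def split(input: String): List[String] = {
    val records  = List.newBuilder[String]
    val current  = new StringBuilder
    var inQuotes = false
    var i        = 0
    while (i < input.length) {
      val c = input.charAt(i)
      if (c == '"') inQuotes = !inQuotes
      if (c == '\n' && !inQuotes) {
        records += current.toString // record separator: emit, do not keep it
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    if (current.nonEmpty) records += current.toString // no trailing newline
    records.result()
  }

  // Decodes each raw record; on failure, keeps the unaltered raw text.
  def decodeKeepingRaw[A](input: String)(decode: String => Option[A]): List[Either[String, A]] =
    split(input).map(raw => decode(raw).toRight(raw))
}
```

Every `Left` then carries the record exactly as it appeared in the input, quotes and embedded newlines included, so it could be sent to a dead letter queue and replayed later. The cost is that quote handling is scanned twice, once here and once by the real parser, and the two must agree on the quoting rules.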

seanf mentioned this issue Jun 25, 2019

Include the line number of an error when parsing #183

Open
