How to recover original text when a row fails decoding? #188

Open
seanf opened this issue Jun 25, 2019 · 2 comments

Comments

seanf commented Jun 25, 2019

As a simplified example, let's say I'm parsing a file like this, which contains five lines but only three records (with two fields: a String and an Int):

abc,123
def,456
ghi,"j
k
l"

Lines 0 and 1 parse okay, so I process them immediately.

The last three lines constitute one record because of the quotes, and parsing fails because j\nk\nl is not a valid number. So I want to extract that record for separate, manual processing. To do that, I need a verbatim copy of the line(s) which couldn't be parsed:

ghi,"j
k
l"
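To make "a verbatim copy of the line(s)" concrete, here is a minimal sketch of a splitter that cuts CSV text into records while keeping each record's raw text. It is plain Scala and not part of kantan.csv; the name `RawRecords` is hypothetical, and it assumes `\n` line endings and RFC 4180 style double-quote quoting.

```scala
object RawRecords {
  // Splits CSV text into records, keeping each record's text verbatim.
  // A line break ends a record only when we are outside double quotes.
  // An escaped quote ("") toggles the flag twice, so it needs no special case.
  def split(input: String): Vector[String] = {
    val records  = Vector.newBuilder[String]
    val current  = new StringBuilder
    var inQuotes = false

    input.foreach {
      case '"' =>
        inQuotes = !inQuotes
        current.append('"')
      case '\n' if !inQuotes =>
        records += current.toString
        current.clear()
      case c =>
        current.append(c)
    }
    // Keep a final record that is not terminated by a line break.
    if (current.nonEmpty) records += current.toString
    records.result()
  }
}
```

With the five-line example above, this yields three strings, the last of which is the quoted three-line record exactly as it appeared in the input.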

Alternatively, the cell might represent an enum (really case objects, used to encode a record type), with a CellDecoder which decodes one of N strings into one of N enums. But if an unexpected String shows up, I need to save the entire record for later processing.

I'm dealing with a case like this right now, but I've had to make the assumption that there are never any quotes (or any multi-line records). Thus I can split the file/stream into lines and parse the lines one at a time, converting each line to Either[String, MyRow] (where a String is a bad line, and MyRow is a successfully parsed record). But this code could be simpler if the parser simply returned an error which included the offending line(s), plus it would work with quotes and multi-line records.
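The workaround described here might look roughly like the sketch below. It is an illustration only: `MyRow`, `parseLine` and `parseAll` are hypothetical names, the decoding is hand-written instead of going through kantan.csv, and it relies on the same assumption that the input has no quotes and no multi-line records.

```scala
import scala.util.Try

object LineAtATime {
  // Hypothetical record type: the issue only says "a String and an Int".
  final case class MyRow(name: String, value: Int)

  // Left holds the unmodified bad line, Right the decoded record.
  def parseLine(line: String): Either[String, MyRow] =
    line.split(",", -1) match {
      case Array(name, value) =>
        Try(value.toInt).toOption.map(MyRow(name, _)).toRight(line)
      case _ =>
        Left(line)
    }

  // Naive split on line breaks: this is the step that breaks quoted, multi-line records.
  def parseAll(input: String): List[Either[String, MyRow]] =
    input.split("\\r?\\n").toList.filter(_.nonEmpty).map(parseLine)
}
```

Fed the five-line example from the top of this issue, `parseAll` returns two `Right` values followed by three separate `Left` values (`ghi,"j`, `k` and `l"`) instead of one bad record, which is the limitation in question.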

Related: #183

Note that a CellDecoder[Either[String, MyCell]] won't work, because I need the entire line (or multi-line record) to reprocess it.

A RowDecoder[Either[String, MyRow]] is closer, but it looks like RowDecoders only receive a Seq[String], and I need the entire original line (or lines) as a String, unaltered. Trying to convert the Seq[String] back to a String is bound to involve some loss (e.g. if there were extra fields at the end, or a trailing comma).
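A small illustration of that loss, under stated assumptions: `render` is a hypothetical minimal-quoting writer (not kantan.csv's), and `cells` is what an RFC 4180 parser would be expected to produce for `original`. Even a well-behaved writer cannot know that the source quoted a field unnecessarily.

```scala
object LossyRoundTrip {
  // Quote a cell only when it needs it, as a typical CSV writer would.
  def render(cells: Seq[String]): String =
    cells.map { c =>
      if (c.exists(ch => ch == ',' || ch == '"' || ch == '\n'))
        "\"" + c.replace("\"", "\"\"") + "\""
      else c
    }.mkString(",")

  // Raw text with an unnecessarily quoted field and a trailing comma.
  val original: String = "\"abc\",123,"

  // Assumed parser output for that text: the quotes are gone.
  val cells: Seq[String] = Seq("abc", "123", "")

  // Re-serialised text: equivalent as CSV, but not the same String.
  val rebuilt: String = render(cells)
}
```

Here `rebuilt` and `original` decode to the same cells but differ as strings, so the rebuilt text is not the record that actually arrived.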

@nrinaudo
Owner

Can you explain why you’d want a String rather than a CSV row (a Seq[String])? I think the rest of your issue is now clear, but I don’t understand why you’d want to re-implement the CSV parsing logic yourself.

@seanf
Author

seanf commented Jul 26, 2019

@nrinaudo I wouldn't want to re-implement the parsing as such, but if the row processing has gone terribly wrong for whatever reason, I want to save the record in its original, unaltered form. This way, after my code (or kantan.csv config) is fixed/updated, the data can be reparsed and reprocessed from the beginning.

I'm planning to send these unhandled records to a kind of dead letter queue, and move them back to the input queue when the code is ready. So I need a String which contains one or more CSV records, since that's what the input queue carries. And when things are going wrong, I don't want to introduce any unnecessary changes by converting String to Seq[String] and back to String.
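The routing described here could be sketched as follows, assuming the raw text of each record is already available as a `String`. All names (`DeadLetterRouting`, `MyRow`, `decode`, `route`) are hypothetical, and `decode` is a hand-written stand-in for whatever the CSV library would do.

```scala
import scala.util.Try

object DeadLetterRouting {
  // Hypothetical record type, matching the String/Int example in this issue.
  final case class MyRow(name: String, value: Int)

  // Stand-in decoder for illustration only; a real one would delegate to the CSV library.
  def decode(raw: String): Option[MyRow] =
    raw.split(",", -1) match {
      case Array(name, value) => Try(value.toInt).toOption.map(MyRow(name, _))
      case _                  => None
    }

  // Returns (dead letters as untouched raw text, successfully decoded rows).
  def route(rawRecords: Seq[String]): (Seq[String], Seq[MyRow]) = {
    val results = rawRecords.map(raw => decode(raw).toRight(raw))
    val dead    = results.collect { case Left(raw) => raw }
    val good    = results.collect { case Right(row) => row }
    (dead, good)
  }
}
```

Because the dead letters are the original strings and not re-serialised cells, they can go back onto the input queue unchanged once the decoder is fixed.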
