Better inline charset handling for get_html #172

LetThereBeDwight · 2024-10-01T17:15:43Z

Closes #171.

Changes proposed in this pull request

Aligning get_html matches with get_text to ensure that we match on the same types of content type specifications between the two, except for the obvious actual content type part differences.

… specifications when calling get_html, and align array charset match strategy with get_text

andrewtimberlake · 2024-10-01T17:26:20Z

This change shouldn’t be needed, see my comment on #171

LetThereBeDwight · 2024-10-01T17:45:51Z

I've commented as well - the issue isn't with building it this way, the issue is in parsing it when it comes in this way via raw spec.

andrewtimberlake · 2024-10-01T18:55:06Z

Please give us an example in raw spec that causes this problem. The parser should break it into the content type and params

LetThereBeDwight · 2024-10-01T20:06:03Z

Okay apologies I now see what was happening (kinda) - I'm actually on a different version (0.3.1) since that's the latest in hex.

When testing like this against 0.3.1 in test/mail/parsers/rfc_2822_test.exs (apologies for the long string, using the multi-line string resulted in different parser results with really bad headers - this is what I get directly when pulling the email from s3).

  test "content-type html with explicit charset" do
    message =
      parse_email("Return-Path: <[email protected]>\r\nDate: Tue, 09 Jul 2024 12:53:58 +0000 (UTC)\r\nFrom: \"from name\" <[email protected]>\r\nReply-to: \"reply to name\" <[email protected]>\r\nTo: [email protected]\r\nSubject: SR0012084728 - RFQ\r\nMIME-version: 1.0\r\nContent-type: text/html; charset=UTF-8\r\nContent-transfer-encoding: quoted-printable\r\nReturn-Path: [email protected]\r\n\r\n<html><head></head><body><p>Test</p></body></html>")

    IO.inspect message, label: "Message"
    IO.inspect message.headers["content-type"], label: "Content-Type"
  end

We get the output:

Message: %Mail.Message{
  headers: %{
    "content-transfer-encoding" => "quoted-printable",
    "content-type" => "text/html; charset=UTF-8",
    "date" => ~U[2024-07-09 12:53:58Z],
    "from" => {"from name", "[email protected]"},
    "mime-version" => "1.0",
    "reply-to" => "\"reply to name\" <[email protected]>",
    "return-path" => "[email protected]",
    "subject" => "SR0012084728 - RFQ",
    "to" => ["[email protected]"]
  },
  body: "<html><head></head><body><p>Test</p></body></html>",
  parts: [],
  multipart: false
}
Content-Type: "text/html; charset=UTF-8"

SO - apologies for mislabling the version, which made me then go and try to update and realize that 0.4.0 isn't actually on Hex - is there a timeline for that or have you stopped pushing to hex entirely?

Granted I wouldn't have noticed these differences between get_text and get_html otherwise BUT I still maintain that those two parsers should align in any case, and the initial version of using put_content_type is also valid and generates a valid Mail.Message and email.

Understood if you don't wanna take on this PR as a result as there are other facilities here that work around the issue, but I do pretty firmly believe those two matches should align.

TLDR:

My apologies on not recognizing the version I had in the system I was using that ran into this issue.
When will 0.4.0 be pushed to hex (if ever)?
Do you believe that it's valid for the functions get_html and get_text to be misaligned in the way they match content type specifications? If not, this PR is valid in it's current state and that usage of put_content_type shows the issue in a rather valid way to my eye. If so, then feel free to close.

andrewtimberlake · 2024-10-02T07:48:05Z

Thanks @LetThereBeDwight
I have published 0.4.0
You are right, get_text was inconsistent and I’ve changed that in #174 along with a number of doctests to check for consistency in our examples

Dwight Bell added 3 commits October 1, 2024 09:56

ignore tool versions

97b9d46

HTML content type match widening and test to allow for inline charset…

2f7465f

… specifications when calling get_html, and align array charset match strategy with get_text

more descriptive test name

5a68b1b

LetThereBeDwight closed this Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better inline charset handling for get_html #172

Better inline charset handling for get_html #172

LetThereBeDwight commented Oct 1, 2024

andrewtimberlake commented Oct 1, 2024

LetThereBeDwight commented Oct 1, 2024 •

edited

Loading

andrewtimberlake commented Oct 1, 2024

LetThereBeDwight commented Oct 1, 2024 •

edited

Loading

andrewtimberlake commented Oct 2, 2024

Better inline charset handling for get_html #172

Better inline charset handling for get_html #172

Conversation

LetThereBeDwight commented Oct 1, 2024

Changes proposed in this pull request

andrewtimberlake commented Oct 1, 2024

LetThereBeDwight commented Oct 1, 2024 • edited Loading

andrewtimberlake commented Oct 1, 2024

LetThereBeDwight commented Oct 1, 2024 • edited Loading

andrewtimberlake commented Oct 2, 2024

LetThereBeDwight commented Oct 1, 2024 •

edited

Loading

LetThereBeDwight commented Oct 1, 2024 •

edited

Loading