Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better inline charset handling for get_html #172

Closed
wants to merge 3 commits into from

Conversation

LetThereBeDwight
Copy link

Closes #171.

Changes proposed in this pull request

Aligning get_html matches with get_text to ensure that we match on the same types of content type specifications between the two, except for the obvious actual content type part differences.

Dwight Bell added 3 commits October 1, 2024 09:56
… specifications when calling get_html, and align array charset match strategy with get_text
@andrewtimberlake
Copy link
Collaborator

This change shouldn’t be needed, see my comment on #171

@LetThereBeDwight
Copy link
Author

LetThereBeDwight commented Oct 1, 2024

I've commented as well - the issue isn't with building it this way, the issue is in parsing it when it comes in this way via raw spec.

@andrewtimberlake
Copy link
Collaborator

Please give us an example in raw spec that causes this problem. The parser should break it into the content type and params

@LetThereBeDwight
Copy link
Author

LetThereBeDwight commented Oct 1, 2024

Okay apologies I now see what was happening (kinda) - I'm actually on a different version (0.3.1) since that's the latest in hex.

When testing like this against 0.3.1 in test/mail/parsers/rfc_2822_test.exs (apologies for the long string, using the multi-line string resulted in different parser results with really bad headers - this is what I get directly when pulling the email from s3).

  test "content-type html with explicit charset" do
    message =
      parse_email("Return-Path: <[email protected]>\r\nDate: Tue, 09 Jul 2024 12:53:58 +0000 (UTC)\r\nFrom: \"from name\" <[email protected]>\r\nReply-to: \"reply to name\" <[email protected]>\r\nTo: [email protected]\r\nSubject: SR0012084728 - RFQ\r\nMIME-version: 1.0\r\nContent-type: text/html; charset=UTF-8\r\nContent-transfer-encoding: quoted-printable\r\nReturn-Path: [email protected]\r\n\r\n<html><head></head><body><p>Test</p></body></html>")

    IO.inspect message, label: "Message"
    IO.inspect message.headers["content-type"], label: "Content-Type"
  end

We get the output:

Message: %Mail.Message{
  headers: %{
    "content-transfer-encoding" => "quoted-printable",
    "content-type" => "text/html; charset=UTF-8",
    "date" => ~U[2024-07-09 12:53:58Z],
    "from" => {"from name", "[email protected]"},
    "mime-version" => "1.0",
    "reply-to" => "\"reply to name\" <[email protected]>",
    "return-path" => "[email protected]",
    "subject" => "SR0012084728 - RFQ",
    "to" => ["[email protected]"]
  },
  body: "<html><head></head><body><p>Test</p></body></html>",
  parts: [],
  multipart: false
}
Content-Type: "text/html; charset=UTF-8"

SO - apologies for mislabling the version, which made me then go and try to update and realize that 0.4.0 isn't actually on Hex - is there a timeline for that or have you stopped pushing to hex entirely?

Granted I wouldn't have noticed these differences between get_text and get_html otherwise BUT I still maintain that those two parsers should align in any case, and the initial version of using put_content_type is also valid and generates a valid Mail.Message and email.

Understood if you don't wanna take on this PR as a result as there are other facilities here that work around the issue, but I do pretty firmly believe those two matches should align.

TLDR:

  1. My apologies on not recognizing the version I had in the system I was using that ran into this issue.
  2. When will 0.4.0 be pushed to hex (if ever)?
  3. Do you believe that it's valid for the functions get_html and get_text to be misaligned in the way they match content type specifications? If not, this PR is valid in it's current state and that usage of put_content_type shows the issue in a rather valid way to my eye. If so, then feel free to close.

@andrewtimberlake
Copy link
Collaborator

Thanks @LetThereBeDwight
I have published 0.4.0
You are right, get_text was inconsistent and I’ve changed that in #174 along with a number of doctests to check for consistency in our examples

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mail.get_html fails with inline charset specifications
2 participants