parserOption to keep HTML Entities #140

Closed
@JpEncausse

Hello,
I'm parsing RSS feeds from a TheOldReader shared feed. It seems they rewrite each RSS entry and rebuild the RSS items. However, HTML content that was stored in content:encoded ends up entity-encoded inside <description> tags:

An item from Ars Technica:

<description>&lt;div&gt;
&lt;figure&gt;
  &lt;img src="https://cdn.arstechnica.net/wp-content/uploads/2024/04/image-2-800x533.jpeg" alt="Image of a chip with a device on it that is shaped like two triangles connected by a bar."&gt;
      &lt;p&gt;&lt;a href="https://cdn.arstechnica.net/wp-content/uploads/2024/04/image-2-scaled.jpeg"&gt;Enlarge&lt;/a&gt; / Quantinuum's H2 "racetrack" quantum processor. (credit: Quantinuum)&lt;/p&gt;  &lt;/figure&gt;
&lt;div&gt;&lt;a&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;On Tuesday, Microsoft made a series of announcements related to its Azure Quantum Cloud service. Among them was a demonstration of logical operations using the largest number of error-corrected qubits yet.&lt;/p&gt;
&lt;p&gt;"&lt;a href="https://arstechnica.com/science/2024/04/quantum-error-correction-used-to-actually-correct-errors/"&gt;Since April&lt;/a&gt;, we've tripled the number of logical qubits here," said Microsoft Technical Fellow Krysta Svore. "So we are accelerating toward that hundred-logical-qubit capability." The company has also lined up a new partner in the form of Atom Computing, which uses neutral atoms to hold qubits and has already demonstrated hardware with over 1,000 hardware qubits.&lt;/p&gt;
&lt;p&gt;Collectively, the announcements are the latest sign that quantum computing has emerged from its infancy and is rapidly progressing toward the development of systems that can reliably perform calculations that would be impractical or impossible to run on classical hardware. We talked with people at Microsoft and some of its hardware partners to get a sense of what's coming next to bring us closer to useful quantum computing.&lt;/p&gt;
&lt;/div&gt;&lt;p&gt;&lt;a href="https://arstechnica.com/?p=2048754#p3"&gt;Read 20 remaining paragraphs&lt;/a&gt; | &lt;a href="https://arstechnica.com/?p=2048754&amp;amp;comments=1"&gt;Comments&lt;/a&gt;&lt;/p&gt;</description>

When I decode the RSS item with FeedParser, all the HTML entities in the description field are stripped. Which option should I set to avoid stripping the HTML entities? I would like to either:

  • convert them back to HTML in a string
  • or keep them as is

I tried normalization = false, but it fails and parses nothing.
Thanks
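
For reference, the two outcomes asked for above can be sketched outside of FeedParser. This is a minimal illustration in Python using only the standard library; it is not FeedParser's API and does not answer which parser option applies. The `RSS_ITEM` sample is a hypothetical, shortened item modeled on the excerpt above, and the helper names `description_as_html` and `description_escaped` are made up for this sketch.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, shortened RSS item modeled on the Ars Technica sample above.
RSS_ITEM = (
    "<item><description>"
    "&lt;p&gt;&lt;a href=\"https://arstechnica.com/?p=2048754&amp;amp;comments=1\"&gt;"
    "Comments&lt;/a&gt;&lt;/p&gt;"
    "</description></item>"
)


def description_as_html(item_xml: str) -> str:
    """Outcome 1: return the description converted back to an HTML string.

    An XML parser decodes exactly one level of entity escaping, so the
    element text is already the original HTML markup.
    """
    item = ET.fromstring(item_xml)
    return item.findtext("description", default="")


def description_escaped(item_xml: str) -> str:
    """Outcome 2: return the description with its entities kept as in the feed.

    The parsed text is re-escaped; quote=False leaves double quotes alone,
    which matches how the sample feed writes attribute values.
    """
    return html.escape(description_as_html(item_xml), quote=False)


if __name__ == "__main__":
    print(description_as_html(RSS_ITEM))
    print(description_escaped(RSS_ITEM))
    # If a string is still entity-escaped after parsing, html.unescape
    # decodes it in one call.
    print(html.unescape("&lt;p&gt;Hi&lt;/p&gt;"))
```

Note that the link in the sample contains `&amp;amp;`: one level of escaping belongs to the XML and one to the HTML inside it, so after parsing the HTML string still carries `&amp;` in the `href`, which is correct HTML.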
