parserOption to keep HTML Entities #140

Closed
JpEncausse opened this issue Sep 10, 2024 · 3 comments

Comments

@JpEncausse

Hello,
I'm parsing RSS feeds from a TheOldReader shared feed. TheOldReader seems to rewrite each RSS entry and rebuild the RSS items, but for HTML content stored in content:encoded it entity-encodes the text and puts it into <description> tags:

An item from Ars Technica:

<description>&lt;div&gt;
&lt;figure&gt;
  &lt;img src="https://cdn.arstechnica.net/wp-content/uploads/2024/04/image-2-800x533.jpeg" alt="Image of a chip with a device on it that is shaped like two triangles connected by a bar."&gt;
      &lt;p&gt;&lt;a href="https://cdn.arstechnica.net/wp-content/uploads/2024/04/image-2-scaled.jpeg"&gt;Enlarge&lt;/a&gt; / Quantinuum's H2 "racetrack" quantum processor. (credit: Quantinuum)&lt;/p&gt;  &lt;/figure&gt;
&lt;div&gt;&lt;a&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;On Tuesday, Microsoft made a series of announcements related to its Azure Quantum Cloud service. Among them was a demonstration of logical operations using the largest number of error-corrected qubits yet.&lt;/p&gt;
&lt;p&gt;"&lt;a href="https://arstechnica.com/science/2024/04/quantum-error-correction-used-to-actually-correct-errors/"&gt;Since April&lt;/a&gt;, we've tripled the number of logical qubits here," said Microsoft Technical Fellow Krysta Svore. "So we are accelerating toward that hundred-logical-qubit capability." The company has also lined up a new partner in the form of Atom Computing, which uses neutral atoms to hold qubits and has already demonstrated hardware with over 1,000 hardware qubits.&lt;/p&gt;
&lt;p&gt;Collectively, the announcements are the latest sign that quantum computing has emerged from its infancy and is rapidly progressing toward the development of systems that can reliably perform calculations that would be impractical or impossible to run on classical hardware. We talked with people at Microsoft and some of its hardware partners to get a sense of what's coming next to bring us closer to useful quantum computing.&lt;/p&gt;
&lt;/div&gt;&lt;p&gt;&lt;a href="https://arstechnica.com/?p=2048754#p3"&gt;Read 20 remaining paragraphs&lt;/a&gt; | &lt;a href="https://arstechnica.com/?p=2048754&amp;amp;comments=1"&gt;Comments&lt;/a&gt;&lt;/p&gt;</description>

When I decode the RSS item with FeedParser, all the HTML entities in the description field are stripped. Which option should I set to avoid stripping the HTML entities? Either of these would work for me:

  • Convert them back to HTML in a string
  • or keep them as is

I tried normalization = false, but it fails and parses nothing.
Thanks

@ndaidong
Collaborator

Can you share your code? It works for me when I disable normalization.
For more complicated use cases, you can use getExtraEntryFields to get exactly what you need.

Screenshot from 2024-09-11 23-20-04

@JpEncausse
Author

JpEncausse commented Sep 12, 2024

My TheOldReader feed is here:
https://theoldreader.com/profile/JpEncausse.rss

The code:

  • If I set normalization to false, it fails. My code also has to handle about 100 other RSS feeds that currently work correctly, and I'm afraid they would break if I set it to false.
  • As a workaround, I put in a very ugly hack that works.
msg.overwatch.feed.options = { 
    'descriptionMaxLen': 0,
    'normalization' : true,
    'xmlParserOptions': { 
        processEntities: false, 
        htmlEntities: false,

        // UGLY HACK ------------------------------------------
        tagValueProcessor: (tagName, tagValue, jPath, hasAttributes, isLeafNode) => {
            if (tagName === 'description') { return tagValue.replace(/&/g, '§') }
            return tagValue;
        }
    }
}

if (source.ExtraField){
    let extrafields = source.ExtraField.split(',')
    msg.overwatch.feed.options.getExtraEntryFields = (entry) => {
        let extraContent = {}
        for (let extra of extrafields){
            extra = extra.trim();
            extraContent[extra] = entry[extra]
        }
        return extraContent
    }
}

try {
    msg.payload = await FeedExtractor.extract(source.FeedURL, msg.overwatch.feed.options,  {
        'headers': { 'User-Agent': 'Mozilla/5.0 (compatible; Node-RED; AI)' }
    })

    // UGLY HACK ------------------------------------------
    if (msg.payload.entries){
        for (let e of msg.payload.entries){
            if (e.description) { 
                e.description = he.decode(e.description.replace(/§/g,'&'))
            }
        }
    }

} catch(ex){
    msg.overwatch.feed.lastError = "Feed Parsing Exception: " + ex
    node.warn({msg : "Feed Parsing Exception", ex})
    return [msg, undefined];
}
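
The placeholder round trip in the workaround above can be pulled out into small helpers, which makes it testable without fetching a feed. This is only a minimal sketch and not part of feed-extractor: `protectEntities`, `restoreEntities`, and `decodeBasicEntities` are hypothetical names, and the small decoder handles just five named entities plus numeric references, whereas the code above relies on `he.decode` for the full set.

```javascript
// Hypothetical helpers; none of these names come from feed-extractor or he.
const PLACEHOLDER = '§'

// Swap '&' for a placeholder so entity processing cannot touch the text.
const protectEntities = (text) => text.replaceAll('&', PLACEHOLDER)

// Put the ampersands back once parsing is done.
const restoreEntities = (text) => text.replaceAll(PLACEHOLDER, '&')

// Only the five basic named entities are covered in this sketch.
const NAMED = new Map([
  ['lt', '<'], ['gt', '>'], ['amp', '&'], ['quot', '"'], ['apos', "'"],
])

// Decode exactly one level of entities (named or numeric).
const decodeBasicEntities = (text) =>
  text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X'
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10)
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match
    }
    // Unknown names are left as they are.
    return NAMED.has(body) ? NAMED.get(body) : match
  })

const raw = '&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;'
const safe = protectEntities(raw) // value handed to the parser
const html = decodeBasicEntities(restoreEntities(safe))
console.log(html) // <p>Tom &amp; Jerry</p>
```

One caveat with this approach: a description that already contains a literal `§` would be corrupted on the way back, so a less common placeholder may be safer.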

@ndaidong
Collaborator

ndaidong commented Sep 16, 2024

Sorry, I was offline for a few days.

This feed's content is quite standard. With normalization disabled, the fully decoded description is returned as well.
However, because this feed differs from the other 100 feeds you are working with, you still need some way to choose the right function to process it.

if (isTheOldReader(feedUrl)) {
 doSpecialFeedHandler(feedUrl)
} else {
 doRegularFeedHandler(feedUrl)
}

Here is my tested solution, which you can adapt to your workflow:

import { extract } from '@extractus/feed-extractor'

const isTheOldReader = (rss) => {
  return rss.startsWith('https://theoldreader.com/')
}

const doRegularFeedHandler = (rss) => {
  return extract(rss)
}

const doSpecialFeedHandler = async (rss) => {
  const result = await extract(rss, {
    getExtraEntryFields: (feedEntry) => {
      const { description } = feedEntry
      return {
        description,
      }
    },
  })
  return result
}

const runParse = async (rss) => {
  return isTheOldReader(rss) ? doSpecialFeedHandler(rss) : doRegularFeedHandler(rss)
}

const feedUrl = 'https://theoldreader.com/profile/JpEncausse.rss'

const data = await runParse(feedUrl)
console.log(data)
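
If the feed list might contain `http://` or `www.` variants of the same address, the prefix check could be replaced by a hostname comparison. The following is a sketch under that assumption; `isTheOldReaderHost` is a hypothetical name, not something exported by feed-extractor.

```javascript
// Hypothetical alternative to isTheOldReader that compares hostnames
// instead of matching a fixed URL prefix.
const isTheOldReaderHost = (rss) => {
  try {
    const { hostname } = new URL(rss)
    return hostname === 'theoldreader.com' || hostname.endsWith('.theoldreader.com')
  } catch {
    // Not a valid absolute URL, so treat it as a regular feed.
    return false
  }
}

console.log(isTheOldReaderHost('https://theoldreader.com/profile/JpEncausse.rss')) // true
console.log(isTheOldReaderHost('https://arstechnica.com/feed/')) // false
```

It could be dropped into `runParse` in place of `isTheOldReader` without changing the rest of the flow.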

Screenshot from 2024-09-16 15-03-20
