Link text is a powerful signal for relevance.

Current code can already extract the text. The main issue is that link text is external to the page: it has to be collected and inverted (grouped by target URL) before we index the page, if we want to keep a single indexing pass.
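For illustration, the inversion amounts to emitting `(target_url, anchor_text)` pairs while parsing each page, so the pairs can later be grouped by target. A minimal sketch in Python — the `extract_anchor_texts` name and the use of BeautifulSoup are illustrative assumptions, not our actual extraction code:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_anchor_texts(page_url, html):
    """Hypothetical sketch: emit (target_url, anchor_text) pairs for one page.

    The pairs are keyed by the *target*, which is why they have to be
    grouped (inverted) before the target page itself can be indexed.
    """
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        if text:
            yield urljoin(page_url, a["href"]), text
```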
A couple options I see:
1. Begin our Spark indexing pipeline with a job that gathers the top N link texts for every page/domain, then, in the same job, iterate over the WARC files again and fetch each page's link texts from the Spark RDDs (see the sketch after this list).
2. Same as above, but instead of keeping the link texts in Spark RDDs, store them in a large key/value DB (`target_url => link_texts`) from which our current index process would fetch them. Storing them in a permanent database would let us use them elsewhere, but is obviously more complex.
3. Index like we do currently, then do a second indexing pass just for link text. That would mean an update to each Elasticsearch document, which is a costly operation.
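A rough PySpark sketch of the first option, under stated assumptions: `load_pages()` stands in for however we read `(url, html)` records out of the WARC files, `extract_anchor_texts` is the sketch above, and the top-N cutoff is arbitrary:

```python
from collections import Counter

from pyspark import SparkContext

TOP_N = 20  # assumed cap on anchor texts kept per target page

sc = SparkContext(appName="link-text-inversion")

# `load_pages` is a stand-in for the real WARC reader, not code from this repo.
pages = sc.parallelize(load_pages())  # RDD of (url, html)

def top_n_texts(texts):
    """Keep the N most frequent anchor texts pointing at a target."""
    return [text for text, _ in Counter(texts).most_common(TOP_N)]

# First pass: invert the links into target_url => top-N anchor texts.
anchor_texts = (
    pages
    .flatMap(lambda rec: extract_anchor_texts(rec[0], rec[1]))
    .groupByKey()
    .mapValues(top_n_texts)
)

# Second pass, same job: walk the pages again and attach the texts.
# Pages nobody links to get an empty list.
docs = (
    pages
    .leftOuterJoin(anchor_texts)
    .map(lambda kv: {"url": kv[0],
                     "html": kv[1][0],
                     "anchor_texts": kv[1][1] or []})
)
```

The second option would replace the `leftOuterJoin` with lookups against the key/value store; either way, the join is the only place where link data crosses page boundaries.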
The good news is that, unlike PageRank, this doesn't need to be an iterative graph operation: we should be fine for now (or ever) with one level of transmission, i.e. each page only receives the text of the links pointing directly at it.
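Concretely, one level of transmission means a single join is enough, with no fixed-point iteration. A sketch of what feeding the joined docs into Elasticsearch might then look like — the `pages` index name and `anchor_texts` field are assumptions, not our current mapping:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def to_action(doc):
    # Single indexing pass: the anchor texts ride along in the same
    # document, so no costly second-pass update is needed.
    return {
        "_index": "pages",                        # assumed index name
        "_id": doc["url"],
        "_source": {
            "url": doc["url"],
            "content": doc["html"],
            "anchor_texts": doc["anchor_texts"],  # assumed field name
        },
    }

# `docs` is the joined RDD from the sketch above.
helpers.bulk(es, (to_action(d) for d in docs.toLocalIterator()))
```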