Index link text #7

sylvinus · 2016-02-20T02:51:42Z

Link text is a powerful signal for relevance.

The current code can already extract the text. The main issue is that link text is external to the page it describes: it has to be gathered from the linking pages (inverted) before we index the target page if we want to keep a single indexing pass.

A few options I see:

Begin our Spark indexing pipeline by gathering a list of the top N link texts for every page/domain, and then in the same job iterate over the WARC files again, fetching the link text from Spark RDDs.
Same as above, but instead of keeping the link texts in Spark RDDs, store them in a large key/value db (target_url=>link_texts), from which they would be fetched by our current index process. Storing them in a permanent database would allow us to use them elsewhere but is obviously more complex.
Index as we do currently, then do a second indexing pass just for link text. That would mean updating each Elasticsearch document, which is a costly operation.
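
To make the inversion step shared by the first two options concrete, here is a minimal sketch in plain Python rather than Spark. The `top_link_texts` function, the `(source_url, target_url, text)` triple layout, and the lowercase/whitespace normalization are all assumptions made for illustration, not what the current code does. In Spark the same grouping would likely be a keyed aggregation over an RDD of `(target_url, text)` pairs.

```python
from collections import Counter, defaultdict


def top_link_texts(links, n=3):
    """Invert (source_url, target_url, text) triples into
    target_url -> list of the n most frequent link texts.

    Hypothetical helper: names and normalization are illustrative only.
    """
    counts = defaultdict(Counter)
    for _source_url, target_url, text in links:
        # Collapse whitespace and lowercase so trivial variants count together.
        normalized = " ".join(text.lower().split())
        if normalized:
            counts[target_url][normalized] += 1
    return {
        target: [text for text, _count in counter.most_common(n)]
        for target, counter in counts.items()
    }


# Made-up sample links, as they might come out of a first pass over the WARC files.
links = [
    ("http://a.example/", "http://c.example/", "Python docs"),
    ("http://b.example/", "http://c.example/", "python  docs"),
    ("http://b.example/", "http://c.example/", "click here"),
    ("http://a.example/", "http://d.example/", "Spark guide"),
]
inverted = top_link_texts(links, n=1)
```

The resulting mapping has the `target_url=>link_texts` shape described in the second option, so it could either stay in memory for the same job or be written to a key/value store.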

The good news is that, unlike PageRank, it doesn't need to be a graph operation. We should be fine for now (or maybe forever) with one level of transmission.
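
A rough illustration of what one level of transmission means at index time: each page only receives the text of links that point directly at it, so a single lookup per page is enough and nothing has to be iterated over the link graph. The `build_document` function and the `link_text` field name are hypothetical, not part of the current index schema.

```python
def build_document(url, body, link_texts_by_target):
    """Attach incoming link texts to a page in one lookup.

    Hypothetical sketch: no propagation beyond direct links, so there is
    no iteration over the graph as PageRank would need.
    """
    return {
        "url": url,
        "body": body,
        # Pages with no known incoming links get an empty list.
        "link_text": link_texts_by_target.get(url, []),
    }


# Made-up inverted mapping (target_url -> link texts) built before indexing.
link_texts_by_target = {"http://c.example/": ["python docs", "click here"]}

doc_with_links = build_document("http://c.example/", "page body", link_texts_by_target)
doc_without_links = build_document("http://e.example/", "other body", link_texts_by_target)
```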

sylvinus added the hard, needs discussion, and spark labels on Feb 22, 2016
