Index link text #7

Open
sylvinus opened this issue Feb 20, 2016 · 0 comments

Link text is a powerful signal for relevance.

The current code can already extract link text. The main issue is that link text is external to the page it describes: it lives on the pages that link to it. The link data therefore has to be inverted (grouped by target URL) before we index the target page if we want to keep a single indexing pass.

A few options I see:

  • Begin our Spark indexing pipeline by gathering a list of the top N link texts for every page/domain, and then in the same job iterate over the WARC files again, fetching the link text from Spark RDDs.
  • Same as above, but instead of keeping the link texts in Spark RDDs, store them in a large key/value db (target_url=>link_texts), from which they would be fetched by our current index process. Storing them in a permanent database would allow us to use them elsewhere but is obviously more complex.
  • Index as we do currently, then do a second indexing pass just for link text. That would require updating each Elasticsearch document, which is a costly operation.

The good news is that, unlike PageRank, this doesn't need to be a graph operation. We should be fine for now (or ever) with one level of transmission: a page only receives the text of links that point directly at it.
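
The inversion step shared by the first two options could look roughly like the sketch below. It is plain Python rather than Spark, and everything in it is an assumption for illustration: the `top_link_texts` name, the `(source_url, target_url, link_text)` tuple shape, and the whitespace/lowercase normalization are hypothetical, not the project's actual code. In the Spark pipeline the same grouping would be a by-key aggregation on the target URL.

```python
from collections import Counter, defaultdict


def top_link_texts(links, n=3):
    """Invert extracted links into target_url => top-n link texts.

    `links` is an iterable of (source_url, target_url, link_text) tuples.
    This tuple shape is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for _source, target, text in links:
        # Collapse whitespace and lowercase so near-identical texts are merged.
        normalized = " ".join(text.split()).lower()
        if normalized:  # skip links with empty text
            counts[target][normalized] += 1
    # One grouping pass is enough: no iteration over the link graph.
    return {
        target: [text for text, _count in counter.most_common(n)]
        for target, counter in counts.items()
    }


links = [
    ("a.example/", "x.example/", "Example"),
    ("b.example/", "x.example/", "example "),
    ("c.example/", "x.example/", "click here"),
    ("a.example/", "y.example/", "   "),
    ("a.example/", "z.example/", "Z  site"),
]
print(top_link_texts(links, n=1))
```

The resulting mapping is what would either stay in memory for the second iteration over the WARC files (first option) or be written to a key/value store (second option).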
