abstract_graph() for wikipedia hyperlinks. #12

Open
karlrohe opened this issue Jun 15, 2020 · 7 comments

karlrohe commented Jun 15, 2020

It would be really cool to sample the Wikipedia hyperlink graph.

Wikipedia requests "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia."
https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

So, would it be ok if we limited the number of pages downloaded? I don't know what a good number is. Is 50k too high?

Alternatively, that link above describes how one can download the data in bulk.

alexpghayes (Collaborator) commented

aPPR probably isn't fast enough to worry about this too much, but either way, Wikimedia has an open API that we could just use to pull data (including SPARQL, a SQL-like query language for knowledge graphs!!). See https://github.com/bearloga/WikidataQueryServiceR for some details.
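
For context, a minimal sketch of what a WikidataQueryServiceR call looks like; the specific SPARQL query is illustrative only (it asks Wikidata for ten items that are instances of "programming language"), not something taken from this thread or from aPPR.

library(WikidataQueryServiceR)

# illustrative query: ten items that are instances of "programming language" (wd:Q9143)
sparql <- '
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q9143 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
'

query_wikidata(sparql)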

@bearloga is there any easy sample code we could riff off of to (locally, not globally) find all pages linked from a given wikipedia page?

bearloga commented Jun 17, 2020

@alexpghayes thanks for the shoutout and ping! :D

Here are some possible options, assuming you mean Wikipedia pages linked to from a given Wikipedia page (as opposed to external links in References sections):

  • the Wikipedia clickstream dumps, which are monthly files of (referrer, resource) click counts, and
  • the MediaWiki API, wrapped in R by packages like WikipediR.

But yes, please don't just download a bunch of Wikipedia articles with a crawler. The Wikimedia Foundation is a non-profit organization with strict privacy & security policies, so we maintain our own data centers and do not rely on external CDNs like Cloudflare to distribute the burden of hosting and serving free knowledge.

Hope that helps!

Edit: pyWikiMM seems interesting/promising

karlrohe (Author) commented

Whoa. I thought the clickstream was page views (node counts), but that is another data set.

The clickstream is actual clicks (edge counts). That is amazing. Last month was less than 500 MB for English. Totally doable, and actually more/better/interesting-er than simply hyperlinks.
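
As a rough sketch of what reading one month of the English clickstream dump might look like with readr: the URL pattern and the column names (prev, curr, type, n) are assumptions based on the public dumps at https://dumps.wikimedia.org/other/clickstream/ and should be checked against the month you actually want.

library(readr)

# one month of (referrer, resource, link type, click count) rows for English Wikipedia
url <- "https://dumps.wikimedia.org/other/clickstream/2020-05/clickstream-enwiki-2020-05.tsv.gz"
clicks <- read_tsv(
  url,
  col_names = c("prev", "curr", "type", "n"),
  col_types = "ccci",
  quote = ""
)

head(clicks)  # weighted edges: clicks from `prev` to `curr`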

However, for simplicity, what about WikipediR::page_links?

bearloga commented Jun 17, 2020

Oh! Yeah, totally! WikipediR::page_links would be great. Internally it calls the MediaWiki API, which is much better than web-scraping.

A few recommendations:

  • Specify namespaces = 0 to restrict the returned links to the (Article) namespace
  • If fetching links recursively to grow the graph outward from the source node, combine the children with | to limit the number of individual API requests (per etiquette guidelines), for example:
library(WikipediR)
linx <- page_links(
  "en", "wikipedia",
  page = "Aaron Halfaker|Hadley Wickham",
  namespaces = 0
)

linx$query$pages will be a list with 2 elements, one for each article. As an example, the result can be made into a tibble with:

library(purrr)

edges <- map_dfr(
  linx$query$pages,
  function(page) {
    tibble::tibble(source = page$title, target = map_chr(page$links, ~ .x$title))
  }
)
edges
source          target
Aaron Halfaker  ACM Digital Library
Aaron Halfaker  Arnnon Geshuri
Aaron Halfaker  Artificial intelligence
...             ...
Hadley Wickham  Tidy data
Hadley Wickham  Tidyverse
Hadley Wickham  University of Auckland
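
As a small follow-on (not part of the original reply): an edge list like the one above can be turned into a graph object with igraph, assuming the map_dfr() result was saved as edges as in the snippet above.

library(igraph)

# directed graph from the (source, target) edge list assembled above
g <- graph_from_data_frame(edges, directed = TRUE)
g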

I don't think page_links accepts more than 50 titles at a time, though. Also, depending on the character length of the titles, concatenating too many may hit the URI length limit; I think <2000 characters is the rule of thumb.

  • It may be tempting to parallelize the process, but per the etiquette guidelines, "making your requests in series rather than in parallel, by waiting for one request to finish before sending a new request, should result in a safe request rate" (a sketch combining these recommendations follows below)
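
Putting those recommendations together, here is a minimal sketch (not from the original comment): it batches a hypothetical character vector titles into groups of at most 50, joins each group with |, and makes the requests serially with a short pause between them.

library(WikipediR)

titles <- c("Aaron Halfaker", "Hadley Wickham", "Tidyverse")  # hypothetical input vector
batches <- split(titles, ceiling(seq_along(titles) / 50))     # at most 50 titles per request

results <- vector("list", length(batches))
for (i in seq_along(batches)) {
  results[[i]] <- page_links(
    "en", "wikipedia",
    page = paste(batches[[i]], collapse = "|"),
    namespaces = 0
  )
  Sys.sleep(1)  # wait for one request to finish, then pause briefly before the next
}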

@karlrohe
Copy link
Author

This is super helpful. Thank you @bearloga !

alexpghayes commented Jun 17, 2020

Thanks @bearloga!! Karl, as a side note, all of the aPPR internals request serially.

karlrohe (Author) commented

I assumed that we requested serially. That's good.
