abstract_graph() for wikipedia hyperlinks. #12
Comments
@bearloga is there any easy sample code we could riff off of to (locally, not globally) find all pages linked from a given Wikipedia page?
@alexpghayes thanks for the shoutout and ping! :D Here are some possible options, assuming you mean Wikipedia pages linked to from a given Wikipedia page (as opposed to external links in References sections):
But yes, please don't just download a bunch of Wikipedia articles with a crawler. The Wikimedia Foundation is a non-profit organization with strict privacy & security policies, so we maintain our own data centers and do not rely on external CDNs like Cloudflare to distribute the burden of hosting and serving free knowledge. Hope that helps! Edit: pyWikiMM seems interesting/promising
Whoa. I thought clickstream was page views (node counts)... but that is another data set. Clickstream is actual clicks (edge counts). That is amazing. Last month was less than 500 MB for English. Totally doable, and actually more/better/interesting-er than simply hyperlinks. However, for simplicity, what about WikipediR::page_links()?
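For reference, here is a minimal sketch (in R) of turning one monthly clickstream dump into a weighted edge list. The file name and month in the URL are assumptions; check https://dumps.wikimedia.org/other/clickstream/ for what is actually published. The dumps have no header row, and the columns are prev (referrer), curr (target page), type, and n (click count).

library(readr)

# Assumed URL pattern for a monthly English clickstream dump; adjust the month
# to one listed at https://dumps.wikimedia.org/other/clickstream/
dump_url <- "https://dumps.wikimedia.org/other/clickstream/2023-01/clickstream-enwiki-2023-01.tsv.gz"
dump_file <- tempfile(fileext = ".tsv.gz")
download.file(dump_url, dump_file, mode = "wb")

# Read the tab-separated dump; there is no header row
clicks <- read_tsv(
  dump_file,
  col_names = c("prev", "curr", "type", "n"),
  col_types = "ccci"
)

# Keep only internal article-to-article clicks to get a weighted edge list
edges <- clicks[clicks$type == "link", c("prev", "curr", "n")]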
Oh! Yeah, totally! A few recommendations:
library(WikipediR)
library(purrr)

# Request the main-namespace pages linked to from each listed article
linx <- page_links(
  "en", "wikipedia",
  page = "Aaron Halfaker|Hadley Wickham",
  namespaces = 0
)

# Flatten the API response into a source -> target edge list
map_dfr(
  linx$query$pages,
  function(page) {
    tibble::tibble(
      source = page$title,
      target = map_chr(page$links, ~ .x$title)
    )
  }
)
I don't think it accepts more than 50 at a time, though. Also, depending on the character length of the titles, concatenating too many may hit the URI length limit. I think <2000 characters is the rule of thumb.
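One way around both limits is to batch the titles and issue one request per batch, serially. A minimal sketch, assuming all_titles is a character vector of article titles; the batch_links() helper is hypothetical, not part of WikipediR:

library(WikipediR)
library(purrr)

batch_links <- function(all_titles, batch_size = 50) {
  # Split the titles into groups of at most batch_size (shrink batch_size if
  # long titles push the concatenated request past the URI length limit)
  batches <- split(all_titles, ceiling(seq_along(all_titles) / batch_size))

  map_dfr(batches, function(titles) {
    res <- page_links(
      "en", "wikipedia",
      page = paste(titles, collapse = "|"),
      namespaces = 0
    )
    # Flatten each batch's response into a source -> target edge list
    map_dfr(res$query$pages, function(page) {
      tibble::tibble(
        source = page$title,
        target = map_chr(page$links, ~ .x$title)
      )
    })
  })
}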
This is super helpful. Thank you, @bearloga!
Thanks @bearloga!! Karl, as a side note, all of the aPPR internals request serially.
I assumed that we requested serially. That's good.
It would be really cool to sample the Wikipedia hyperlink graph.
Wikipedia requests "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia."
https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
So, would it be ok if we limited the number of pages downloaded? I don't know what a good number is. Is 50k too high?
Alternatively, that link above describes how one can download the data in bulk.
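For what it's worth, a capped crawl could look something like the sketch below: a breadth-first expansion from a seed article that stops after max_pages requests, built on WikipediR::page_links(). The get_links() and crawl_wikipedia() helpers are hypothetical illustrations, not existing package functions, and the 500-page default is arbitrary.

library(WikipediR)
library(purrr)

# Hypothetical helper: main-namespace articles linked to from a single page
get_links <- function(title) {
  res <- page_links("en", "wikipedia", page = title, namespaces = 0)
  map_chr(res$query$pages[[1]]$links, ~ .x$title)
}

# Hypothetical bounded crawl: breadth-first from `seed`, at most `max_pages` requests
crawl_wikipedia <- function(seed, max_pages = 500) {
  visited <- character(0)
  queue <- seed
  edges <- list()

  while (length(queue) > 0 && length(visited) < max_pages) {
    page <- queue[[1]]
    queue <- queue[-1]
    if (page %in% visited) next
    visited <- c(visited, page)

    targets <- get_links(page)
    edges[[page]] <- tibble::tibble(source = page, target = targets)

    # Enqueue unvisited neighbours; requests are made one at a time (serially)
    queue <- c(queue, setdiff(targets, visited))
  }

  dplyr::bind_rows(edges)
}

# e.g. crawl_wikipedia("Hadley Wickham", max_pages = 50)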