title | subtitle | author | job | framework | revealjs | highlighter | hitheme | widgets | mode | knit
---|---|---|---|---|---|---|---|---|---|---
Scraping with R | Test subtitle | Eugene Pyatigorsky | Ahalogy | revealjs | | highlight.js | Github | | selfcontained | slidify::knit2slides
June 21, 2016
**Eugene Pyatigorsky**
This presentation and supporting materials are available at:
https://github.com/epspi/Rscraping
- Overview of packages
- A look at how to scrape
- Working example
- Best practices
--- .chapter
--- &vertical
Most of the work will be done by Hadley's package rvest
- Inspired by Python's beautifulsoup
- Extracts elements from the DOM using CSS selectors or XPath
.fragment e.g. rvest::html_table()
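
As a minimal sketch of both extraction styles, the snippet below parses a small inline HTML table (made-up data standing in for a downloaded page) and pulls values out with a CSS selector and with XPath. The table contents and the variable names are assumptions for illustration only.

```r
library(rvest)  # re-exports read_html() and the %>% pipe

# Inline HTML standing in for a real page (hypothetical data)
page <- read_html('
  <table id="groups">
    <tr><th>group</th><th>city</th></tr>
    <tr><td>R Users</td><td>Cincinnati</td></tr>
    <tr><td>R Users</td><td>Dayton</td></tr>
  </table>')

# CSS selector: grab every <table> and convert each one to a data frame
tables <- page %>% html_nodes("table") %>% html_table()

# XPath: second cell of every row that has <td> cells
cities <- page %>% html_nodes(xpath = "//td[2]") %>% html_text()
```

`html_table()` returns one data frame per matched table, so `tables[[1]]` is the table above.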
httr is (Hadley's) wrapper for curl
- Really useful for making customized calls to APIs
- Can also be used for writing your own API wrapper packages
.fragment e.g. httr::GET("some_endpoint", config)
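
A customized call might look roughly like the sketch below. The endpoint (the public httpbin.org echo service), the user-agent string, and the helper name `fetch_json` are assumptions chosen for illustration; swap in whatever API you are actually calling.

```r
library(httr)

# Hypothetical endpoint, used here only as a stand-in for a real API
endpoint <- "https://httpbin.org/get"

# Illustrative helper: GET with query parameters, a user agent and a timeout,
# then fail loudly on HTTP errors and parse the JSON body
fetch_json <- function(url, query = list()) {
  resp <- GET(url, query = query, user_agent("Rscraping-demo"), timeout(10))
  stop_for_status(resp)
  content(resp, as = "parsed", type = "application/json")
}

# Preview the URL that would be requested, without touching the network
preview <- modify_url(endpoint, query = list(q = "Cincinnati R users group"))

# result <- fetch_json(endpoint, list(q = "Cincinnati R users group"))
```

The network call is left commented out so the sketch can be sourced offline.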
--- .chapter
--- &vertical
library(rvest)  # also provides read_html() and the %>% pipe
lnk <- 'http://www.bing.com/search?q=Cincinnati+R+users+group&go=Submit&qs=n&form=QBLH&pq=cincinnati+r+users+g&sc=0-20&sp=-1&sk=&cvid=4A13A7CB066B419B9F7BD75777D68F09'
# Result titles are links inside <h2> elements
read_html(lnk) %>%
  html_nodes("h2 a") %>%
  html_text()
## [1] "Cincinnati UC Users Group (Cincinnati, OH) - Meetup"
## [2] "Local R User Group Directory - Revolutions"
## [3] "New R User Group in Cincinnati / Dayton - Revolutions"
## [4] "Cincinnati Sharepoint User Group - Facebook"
## [5] "Cincinnati .Net Users Group"
## [6] "CincyPowerShell | PowerShell Community Groups"
## [7] "Reinaldo R. - Cincinnati UC Users Group (Cincinnati, OH ..."
## [8] "Group: Cincinnati |Tableau Support Community"
- `#` for "id="
- `.` for "class="
.fragment OR you can use SelectorGadget for Chrome
https://chrome.google.com/webstore/detail/selectorgadget/
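
To make the `#` and `.` selector prefixes concrete, here is a small sketch run against inline HTML rather than a live page. The markup, ids, and class names are invented for the example.

```r
library(rvest)

# Made-up markup: one container with an id, headings with classes
page <- read_html('
  <div id="results">
    <h2 class="title"><a href="/a">First hit</a></h2>
    <h2 class="title"><a href="/b">Second hit</a></h2>
    <h2 class="ad"><a href="/c">Sponsored</a></h2>
  </div>')

# "#results a": links anywhere inside the element with id="results"
by_id <- page %>% html_nodes("#results a") %>% html_text()

# ".title a": links inside any element with class="title"
by_class <- page %>% html_nodes(".title a") %>% html_text()

# Tag and class combined, extracting an attribute instead of the text
links <- page %>% html_nodes("h2.title a") %>% html_attr("href")
```

Here `by_id` matches all three links, while `by_class` and `links` skip the heading with class "ad".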
--- .chapter
--- &vertical
Cincinnati Foreclosures - A Real Estate Scraper
--- .chapter
--- &vertical
Use APIs instead of scraping whenever possible. There isn't a lot of documentation for rvest, and cookie-based authentication can be tricky.
- The real power of `R` and `rvest` shines when used with `shiny` (npi).
- Put your scraping code in a standalone R script and automate with `cron`.
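
A standalone script for cron could be organized along these lines. This is only a sketch: the file name `scrape.R`, the paths in the crontab line, the helper names, and the inline HTML are all assumptions, and a real script would call `read_html()` on the target URL instead.

```r
#!/usr/bin/env Rscript
# scrape.R (hypothetical name). Example crontab entry, daily at 06:00:
#   0 6 * * * Rscript /path/to/scrape.R >> /path/to/scrape.log 2>&1
library(rvest)

# Pull the result titles out of a parsed page
scrape_titles <- function(page) {
  page %>% html_nodes("h2 a") %>% html_text()
}

# Write one dated CSV per run and return its path
save_results <- function(titles, dir = tempdir()) {
  out <- file.path(dir, paste0("titles_", format(Sys.Date(), "%Y%m%d"), ".csv"))
  stamp <- rep(as.character(Sys.time()), length(titles))
  write.csv(data.frame(scraped_at = stamp, title = titles), out, row.names = FALSE)
  out
}

# Inline HTML keeps the sketch offline; replace with read_html(<your URL>)
page <- read_html('<h2><a href="#">First result</a></h2><h2><a href="#">Second result</a></h2>')
out_file <- save_results(scrape_titles(page))
```

Writing a dated file per run keeps earlier scrapes intact, which helps when a site changes its layout and a run silently returns nothing.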