Repository for the presentation by Fanny Franchini at the online R for HTA 2021 conference
Please note that this presentation is aimed at beginners.
Health Technology Assessment (HTA) and Health Economic (HE) analyses rely partly on data hosted across multiple websites curated by different government bodies. As a result, there is no unified repository containing all the information necessary for data mining and subsequent analyses.
Web scraping is a technique for automatically extracting information from websites. Scrapers work by parsing the page source code to retrieve programmatically specified elements. This workshop aims to introduce participants to scraping HTML-based websites in R.
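As a minimal sketch of what "parsing the page source to retrieve specified elements" can look like with the rvest package, the snippet below parses a small inline HTML fragment instead of a live website. The page content, the class name `item`, and the variable names are all hypothetical and chosen only for illustration; they are not taken from the repository scripts.

```r
library(rvest)

# A tiny inline page standing in for a real website (hypothetical content).
page <- minimal_html('
  <h1>Example listing</h1>
  <ul>
    <li class="item"><a href="/item/a">Item A</a></li>
    <li class="item"><a href="/item/b">Item B</a></li>
  </ul>
')

# Select every link inside a list item with class "item" via a CSS selector.
links <- html_elements(page, "li.item a")

item_names <- html_text2(links)         # visible text of each link
item_paths <- html_attr(links, "href")  # value of each href attribute

print(item_names)
print(item_paths)
```

On a real site, `minimal_html()` would typically be swapped for `read_html()` with the page URL, and the CSS selector would be adapted to the structure of that page.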
In the case study presented, we scrape the Pharmaceutical Benefits Scheme (PBS) website to produce a structured data frame containing all drugs listed by the Australian TGA, along with their restrictions of use, doses, current unit costs, and historical costs.
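To illustrate the general idea of turning scraped HTML into a structured data frame, here is a small sketch using `html_table()` from rvest. The table, its column names, and its values are invented placeholders, not PBS data, and this is not necessarily how `pbs_scraper.R` proceeds.

```r
library(rvest)

# Hypothetical table standing in for a scraped page (placeholder values only).
page <- minimal_html('
  <table id="listing">
    <tr><th>Drug</th><th>Restriction</th><th>Price</th></tr>
    <tr><td>Drug A</td><td>Unrestricted</td><td>$12.50</td></tr>
    <tr><td>Drug B</td><td>Authority required</td><td>$1,040.00</td></tr>
  </table>
')

# Locate the table by its id and convert it to a rectangular structure.
tbl <- html_table(html_element(page, "table#listing"))

# Prices arrive as text; strip currency symbols and thousands separators.
tbl$Price <- as.numeric(gsub("[$,]", "", tbl$Price))

# Convert to a base R data frame for downstream analyses.
drugs <- as.data.frame(tbl)
print(drugs)
```

Cleaning columns right after extraction keeps the resulting data frame ready for analysis, since values scraped from web pages are usually character strings.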
The repository contains two R scripts.
- The first one, `pbs_scraper.R`, is the script used for this case study, i.e. scraping the PBS.
- The second one, `function_scraper.R`, contains the functions used by the first script.
Please head over here to access the slide deck: https://fannychini.github.io/
HTML basics : introduction to web structure @ Mozilla
HTML elements : complement to the above @ W3Schools
Scraping with R : rvest package homepage
Scraping etiquette : polite package homepage
CSS selectors : CSS selectors @ Interneting Is Hard
Scraping JavaScript : RSelenium when rvest is not enough
Please get in touch with any questions or suggestions: [email protected]