boilerpipeR is an R package that provides an interface to boilerpipe, a Java library written by Christian Kohlschütter [1]. It supports generic extraction of the main text content from HTML files, removing ads, sidebars, and headers from the HTML source. The extraction heuristics of boilerpipe perform robustly across a wide range of website templates.
To install the latest version from CRAN, run:
install.packages("boilerpipeR")
Using the devtools package, you can install the latest development version of boilerpipeR from GitHub with:
library(devtools)
install_github("mannau/boilerpipeR")
Windows users need to use the following command to install from GitHub:
library(devtools)
install_github("mannau/boilerpipeR", args = "--no-multiarch")
To download a web page and extract its main text, e.g. from a post on the RStudio blog, you can use the following commands:
library(boilerpipeR)
url <- "http://blog.rstudio.org/2014/05/09/reshape2-1-4/"
maintext <- ArticleExtractor(url, asText = FALSE)
cat(maintext)
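The call above passes a URL together with asText = FALSE. As a complementary sketch, the same extractor can be applied to HTML that is already held in memory, assuming that asText = TRUE makes the extractor treat its first argument as HTML text rather than as a URL. The HTML snippet and the variable names below are made up for illustration, and the text actually returned depends on boilerpipe's heuristics.

```r
library(boilerpipeR)

# Hypothetical HTML document, constructed only for this illustration
article <- paste(
  rep("This sentence stands in for the body of a longer article.", 10),
  collapse = " "
)
html <- paste0(
  "<html><head><title>Example page</title></head><body>",
  "<div><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<p>", article, "</p>",
  "<div>Copyright notice and other footer links</div>",
  "</body></html>"
)

# Assumption: asText = TRUE means 'html' is the markup itself, not a URL
maintext <- ArticleExtractor(html, asText = TRUE)
cat(maintext)
```

This form is useful when the page has already been fetched by other means, for example when processing a local archive of HTML files.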
[1] Christian Kohlschütter, Exploiting Links and Text Structure on the Web — A Quantitative Approach to Improving Search Quality, PhD Thesis
boilerpipe and boilerpipeR are both released under the Apache License, Version 2.0.