
boilerpipeR


boilerpipeR is an R package that provides an interface to boilerpipe, a Java library written by Christian Kohlschütter [1]. It extracts the main text content from HTML files, stripping ads, sidebars and headers from the HTML source. The extraction heuristics in boilerpipe perform robustly across a wide range of website templates.

Install

To install the latest release from CRAN, run:

install.packages("boilerpipeR")

To install the latest development version of boilerpipeR from GitHub, use the devtools package:

library(devtools)
install_github("mannau/boilerpipeR")

Windows users need to use the following command to install from GitHub:

library(devtools)
install_github("mannau/boilerpipeR", args = "--no-multiarch")
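After installing, it can be useful to confirm that the package actually loads, since it wraps a Java library and loading may fail if Java is not available on the machine. The following is a minimal sketch using only base R functions; the variable name `ok` is illustrative.

```r
# requireNamespace() returns FALSE instead of raising an error
# when the package (or something it needs at load time) is missing.
ok <- requireNamespace("boilerpipeR", quietly = TRUE)

if (ok) {
  message("boilerpipeR ",
          as.character(utils::packageVersion("boilerpipeR")),
          " loaded successfully")
} else {
  message("boilerpipeR could not be loaded; check the installation and Java setup")
}
```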

Usage

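To download a page and extract its main text, for example from the RStudio blog, use the following commands. Setting asText = FALSE tells the extractor to treat its first argument as a URL to fetch rather than as HTML content: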

library(boilerpipeR)

url <- "http://blog.rstudio.org/2014/05/09/reshape2-1-4/"
maintext <- ArticleExtractor(url, asText = FALSE)
cat(maintext)
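The call above fetches the page from a URL. When the HTML is already in memory, the same extractor can presumably be given the markup directly by passing asText = TRUE. The sketch below assumes this and uses a made-up sample page (the `html` and `paragraph` variables are hypothetical); it also assumes the package's DefaultExtractor accepts the same arguments as ArticleExtractor. The requireNamespace() guard only keeps the snippet from failing where the package is not installed.

```r
# Hypothetical sample page: navigation and footer surround a short article.
paragraph <- paste(
  "This release fixes several bugs and improves performance.",
  "Most users will not need to change any existing code.",
  "The full list of changes is available in the release notes."
)
html <- paste0(
  "<html><head><title>Sample post</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<div id='content'><h1>Sample post</h1>",
  "<p>", paragraph, "</p>",
  "<p>", paragraph, "</p></div>",
  "<div id='footer'>Copyright notice</div>",
  "</body></html>"
)

if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # asText = TRUE: treat `html` as markup, not as a URL to download.
  article <- boilerpipeR::ArticleExtractor(html, asText = TRUE)
  generic <- boilerpipeR::DefaultExtractor(html, asText = TRUE)

  # Print both results to compare what each extractor keeps.
  cat("ArticleExtractor:\n", article, "\n", sep = "")
  cat("DefaultExtractor:\n", generic, "\n", sep = "")
}
```

Which extractor works best depends on the page layout, so comparing the output of more than one on a sample of your own pages is a reasonable first step.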

References

[1] Christian Kohlschütter, Exploiting Links and Text Structure on the Web — A Quantitative Approach to Improving Search Quality, PhD Thesis

License

boilerpipe and boilerpipeR are both released under the Apache License, Version 2.0.

About

Interface to the boilerpipe Java library by Christian Kohlschütter (http://code.google.com/p/boilerpipe/)
