A Ruby library providing a DSL for describing custom web scrapers.
PhantomJS: for installation, please refer to the PhantomJS download page.
git clone https://github.com/OPLZZ/scrapybara.git
cd scrapybara
Then install the required rubygems:
bundle install
The examples directory contains a few annotated examples:
- This definition goes through the http://www.job-it.cz portal, which implements the RDFa markup described in Recipe for marking-up job posting in RDFa, converts the identified RDFa markup into JSON-LD, and prints the job posting title and employment type.
- This definition prints the first two pages of comments for the articles listed on the main page of http://zpravy.idnes.cz
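
To illustrate the last step of the first example, the sketch below reads a JSON-LD document and prints the job posting title and employment type. It uses only Ruby's standard `json` library and not the Scrapybara DSL. The sample document and the helper name `job_summary` are hypothetical; `title` and `employmentType` are properties of the schema.org `JobPosting` type.

```ruby
require 'json'

# Hypothetical JSON-LD for a single job posting (sample data, not taken
# from the job-it.cz portal).
SAMPLE_JSON_LD = <<~JSON
  {
    "@context": "http://schema.org",
    "@type": "JobPosting",
    "title": "Ruby Developer",
    "employmentType": "full-time"
  }
JSON

# Returns "title (employment type)" for a JSON-LD JobPosting string.
# Missing properties are reported as "unknown".
def job_summary(json_ld)
  posting = JSON.parse(json_ld)
  title = posting.fetch('title', 'unknown')
  type  = posting.fetch('employmentType', 'unknown')
  "#{title} (#{type})"
end

puts job_summary(SAMPLE_JSON_LD)
```

Real JSON-LD output may nest postings inside `@graph` or use full IRIs as keys, so an actual definition would likely need more careful handling than this flat lookup.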
Each definition can be executed with the scrapybara binary from the bin directory.
For example:
bundle exec ./bin/scrapybara examples/job-it.rb
If you run a definition in interactive mode, code execution pauses in a Pry session at each #fetch and #extract part of the definition. While execution is paused, you can inspect and debug anything that a Pry session allows.
To continue, press CTRL+D.
To exit, type exit! and press Enter.
bundle exec ./bin/scrapybara examples/job-it.rb --interactive
You can run each definition in a browser by overriding the Capybara driver from the command line.
bundle exec ./bin/scrapybara examples/job-it.rb --driver selenium
You can combine the previous options, for example interactive mode with the selenium driver.
bundle exec ./bin/scrapybara examples/job-it.rb --driver selenium --interactive
You can enable debug for interactive mode to add even more breakpoints while the definition is executed:
bundle exec ./bin/scrapybara examples/job-it.rb --debug --interactive
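
For readers curious how flags such as `--driver`, `--interactive`, and `--debug` can be handled, here is a minimal sketch using Ruby's standard `OptionParser`. It is not Scrapybara's actual implementation: the method name `parse_cli`, the option hash layout, and the `nil` default for the driver are all assumptions made for illustration.

```ruby
require 'optparse'

# Parses arguments of the shape used in the commands above:
#   <definition file> [--driver NAME] [--interactive] [--debug]
# Illustrative only; not Scrapybara's real option handling.
def parse_cli(argv)
  # Assumed defaults: a nil driver stands for "keep whatever driver the
  # definition itself configures".
  options = { driver: nil, interactive: false, debug: false }

  parser = OptionParser.new do |opts|
    opts.on('--driver NAME', 'Capybara driver to use') do |name|
      options[:driver] = name.to_sym
    end
    opts.on('--interactive', 'Pause in a Pry session') do
      options[:interactive] = true
    end
    opts.on('--debug', 'Add more breakpoints') do
      options[:debug] = true
    end
  end

  # permute accepts options before or after the positional argument and
  # returns the remaining non-option arguments without modifying argv.
  rest = parser.permute(argv)
  options.merge(definition: rest.first)
end

p parse_cli(%w[examples/job-it.rb --driver selenium --interactive])
```

With a driver symbol in hand, a runner could then switch Capybara over with something like `Capybara.current_driver = options[:driver]`, provided a driver of that name is registered.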
## Funding

The project No. CZ.1.04/5.1.01/77.00440 was funded by the European Social Fund through the Operational Programme Human Resources and Employment and by the state budget of the Czech Republic.