Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark

pedroalvesbatista/spookystuff

 
 

The latest documentation has moved to:

http://tribbloid.github.io/spookystuff/

SpookyStuff

... is a scalable query engine for web scraping/data integration/acceptance QA. The goal is to allow the Web to be queried and ETL'ed like a relational database.

SpookyStuff is the fastest big data collection engine in history, with a speed record of querying 330404 dynamic pages per hour on 300 cores.
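To put the quoted record in more familiar units, the short sketch below breaks the hourly figure down into per-second and per-core rates. It is plain arithmetic on the two numbers stated above (330404 pages per hour, 300 cores); the function name `throughput` and the dictionary keys are illustrative and not part of SpookyStuff. It assumes the load was spread evenly across cores, which the source does not state.

```python
# Figures quoted in the README above.
PAGES_PER_HOUR = 330_404
CORES = 300


def throughput(pages_per_hour: int, cores: int) -> dict[str, float]:
    """Break an hourly page count down into per-second and per-core rates.

    Assumes an even spread of work across cores (an illustrative simplification).
    """
    pages_per_second = pages_per_hour / 3600
    return {
        # Aggregate rate across the whole cluster.
        "pages_per_second": pages_per_second,
        # Pages handled by one core in one hour.
        "pages_per_core_hour": pages_per_hour / cores,
        # Average wall-clock time one core spends per page.
        "seconds_per_page_per_core": cores / pages_per_second,
    }


rates = throughput(PAGES_PER_HOUR, CORES)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Under that even-spread assumption, the record works out to roughly 92 pages per second for the cluster, or about 1,100 pages per core per hour.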

Build Status

branch \ profile | scala-2.11 | scala-2.12
master | Codeship Status for tribbloid/spookystuff | CI

Join the chat at https://gitter.im/tribbloid/spookystuff

SpookyStuff-UAV (alpha component)

... allows the same engine to be used to control a swarm of aerial robots for photogrammetry and data acquisition. It is still a work in progress; please refer to this proposal for a feature and implementation overview.

Build Status

branch \ profile | scala-2.11 | scala-2.12
master | Build Status | -

Join the chat at https://gitter.im/spookystuff-UAV/Lobby

Powered by

Core: Apache Spark, Apache Maven, JSoup, Apache Tika
Web: Yourkit Java Profiler, PhantomJS/GhostDriver, Selenium
UAV: MAVLink

License

Copyright © 2014 by Peng Cheng @tribbloid, Sandeep Singh @techaddict, Terry Lin @ithinkicancode, Long Yao @l2yao and contributors.

Published under the ASF License; see LICENSE.
