Krawl

An automated web crawler that collects hotel reviews from TripAdvisor.com and stores them in MongoDB.

Installation

1 Install common-node

  sudo npm -g install common-node

2 Install PhantomJS (http://phantomjs.org/)

  sudo npm -g install phantomjs

3 Install nodejs-legacy (required by common-node)

  sudo apt-get install nodejs-legacy
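
Before moving on, it can help to confirm that the tools installed above are actually on your PATH. The snippet below is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the command names `common-node`, `phantomjs` and `node` are assumptions about what the three packages install.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Krawl).
# Prints the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (command names are assumed, adjust to your system):
#   missing_tools common-node phantomjs node
```

If the example call prints nothing, all three commands were found; any name it prints is a tool that still needs to be installed.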

Usage

Edit src/0-add-city.js
Run ./0-add-cities.sh to add city data
Run ./1-collect-city-hotels.sh to collect hotel URLs
Run ./2-collect-hotel-reviews.sh to collect review URLs (you may need to run it more than once until all pages have been processed).
Run ./3-get-reviews-html.sh to download the HTML content of reviews (you may need to run it more than once until all reviews have been processed).
Run ./4-get-blocked-review-html.sh to download the HTML content of reviews that were blocked in the previous step (note that this script runs using nodejs, not common-node).
Finally, run ./5-process-review-html.sh (edit success_status_code and fail_status_code before running the script)
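
The steps above can be chained in a small wrapper. This is only a sketch under stated assumptions: `run_step` and `run_pipeline` are hypothetical names, the scripts are assumed to exit with a non-zero status on failure, and the pass count of 3 for steps 2 and 3 is an arbitrary guess, since the scripts do not document how to tell when all pages have been processed. Step 5 is left out on purpose because it needs manual edits first.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with Krawl).

# Run one script a fixed number of times; stop at the first failing pass.
# $1 = script path, $2 = number of passes (default 1).
run_step() {
    script=$1
    passes=${2:-1}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "[$script] pass $i of $passes"
        "$script" || { echo "[$script] failed on pass $i" >&2; return 1; }
        i=$((i + 1))
    done
}

# Run steps 0 to 4 in order from the repository root.
# The pass count 3 is an assumption, not a documented value.
run_pipeline() {
    run_step ./0-add-cities.sh &&
    run_step ./1-collect-city-hotels.sh &&
    run_step ./2-collect-hotel-reviews.sh 3 &&
    run_step ./3-get-reviews-html.sh 3 &&
    run_step ./4-get-blocked-review-html.sh
}
```

Call `run_pipeline` after sourcing the file, check the collected data to decide whether more passes are needed, then edit success_status_code and fail_status_code and run ./5-process-review-html.sh by hand.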
