An automatic Web crawler which collects hotel reviews from TripAdvisor.com and stores data in MongoDB.
1. Install common-node (https://www.npmjs.com/package/common-node):
sudo npm -g install common-node
2. Install PhantomJS (http://phantomjs.org/):
sudo npm -g install phantomjs
3. Install nodejs-legacy (required by common-node):
sudo apt-get install nodejs-legacy
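Before running the crawler, it can help to confirm that the three installs above actually put their commands on the PATH. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper, and the command names `node`, `common-node` and `phantomjs` are assumptions based on the package names above.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands are on PATH.
# Returns 0 if all are found, non-zero if any is missing.
check_tools() {
    missing=""
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing="$missing $tool"
        fi
    done
    # The function's exit status reflects whether anything was missing.
    [ -z "$missing" ]
}

# Assumed command names for the packages installed in steps 1 to 3.
check_tools node common-node phantomjs ||
    echo "Install the missing tools before running the crawler."
```

If a tool is reported missing right after installation, the global npm bin directory is probably not on the PATH.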
- Edit src/0-add-city.js
- Run ./0-add-cities.sh to add city data.
- Run ./1-collect-city-hotels.sh to collect hotel URLs.
- Run ./2-collect-hotel-reviews.sh to collect review URLs (you may need to run it more than once until all pages have been processed).
- Run ./3-get-reviews-html.sh to download the HTML content of reviews (you may need to run it more than once until all reviews have been processed).
- Run ./4-get-blocked-review-html.sh to download the HTML content of reviews which were blocked in the previous step (note this runs using nodejs, not common-node).
- Finally, run ./5-process-review-html.sh (edit success_status_code and fail_status_code before running the script).
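The steps above can be chained in a small wrapper. This is only a sketch under stated assumptions: `run_step`, `run_all` and the `PASSES` variable are hypothetical names that do not exist in the repository, and repeating steps 2 and 3 a fixed number of times does not guarantee that every page or review has been processed. It also assumes that src/0-add-city.js and the status codes for step 5 have already been edited by hand.

```shell
#!/bin/sh
# Hypothetical knob: how many times to repeat steps 2 and 3.
PASSES="${PASSES:-3}"

# Run one script a given number of times, stopping at the first failure.
run_step() {
    script="$1"
    passes="${2:-1}"
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "== $script (pass $i of $passes)"
        "$script" || return 1
        i=$((i + 1))
    done
}

# Run every crawler step in the order described above.
run_all() {
    run_step ./0-add-cities.sh &&
    run_step ./1-collect-city-hotels.sh &&
    run_step ./2-collect-hotel-reviews.sh "$PASSES" &&
    run_step ./3-get-reviews-html.sh "$PASSES" &&
    run_step ./4-get-blocked-review-html.sh &&
    run_step ./5-process-review-html.sh
}

# Only start the pipeline when the step scripts are in the current directory.
if [ -x ./0-add-cities.sh ]; then
    run_all
fi
```

Saved under a name of your choice in the repository directory, it could be invoked as `PASSES=5 sh run-all.sh` to repeat the two collection steps five times each.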