Spiderman

A simple web crawler that crawls a website n links deep and calculates the number of unique rendered words found on each page and in total.

One-time setup

Install Gumbo (https://github.com/google/gumbo-parser):

  git clone https://github.com/google/gumbo-parser.git
  cd gumbo-parser
  ./autogen.sh
  ./configure
  make
  sudo make install

On macOS with Homebrew, run this instead:

  brew install gumbo-parser

Clone the Spiderman repo:

  git clone https://github.com/JacyGao/spiderman.git
  cd spiderman

To compile Spiderman, do:

  tools/all.sh

To run Spiderman, do:

  ./a.out {url} {depth}

For example:

  ./a.out http://www.ea.com 1
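
This README does not spell out how the depth argument or the word count behaves, so the following is only a rough sketch of one plausible reading: a breadth-first crawl that stops following links beyond the given depth and collects the set of visible words per page and overall. It is not Spiderman's actual code. It is written in Python with the standard library's `html.parser` rather than Gumbo, and `PAGES`, `PageParser`, and `crawl` are hypothetical names. The in-memory `PAGES` dictionary stands in for real HTTP fetches so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "/": '<script>var hidden = 1;</script>'
         '<p>Hello world hello</p><a href="/a">A</a>',
    "/a": '<p>World of spiders</p><a href="/b">B</a><a href="/">home</a>',
    "/b": '<p>Too deep</p>',
}


class PageParser(HTMLParser):
    """Collects visible words and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = set()
        self.links = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Assumption: a "word" is a lowercase run of letters, digits, or apostrophes.
        if not self._skip:
            self.words.update(re.findall(r"[a-z0-9']+", data.lower()))


def crawl(start, max_depth, fetch=PAGES.get):
    """Breadth-first crawl; returns ({url: unique words}, all unique words)."""
    per_page = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown or unreachable page
            continue
        parser = PageParser()
        parser.feed(html)
        per_page[url] = parser.words
        if depth < max_depth:  # only follow links while under the depth limit
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    total = set().union(*per_page.values())
    return per_page, total


if __name__ == "__main__":
    per_page, total = crawl("/", 1)
    for url, words in per_page.items():
        print(url, len(words))
    print("total", len(total))
```

Under this reading, a depth of 1 visits the start page and the pages it links to directly, while `/b` (two links away) is never fetched. Script contents are skipped because they are not rendered, and the `seen` set keeps the link back to `/` from causing a revisit.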
