Bullfinch

Bullfinch is an extremely simple web crawler written in Rust for learning purposes.

Use

To run the command-line client:

cargo run --bin bf-client -- -v links http://www.google.com 1

To run the web service:

cargo run --bin bf-server

The server will bind to 127.0.0.1:8088 and expose the following endpoints:

  • /api/v1/domains: list the registered domains that are being crawled
  • /api/v1/links/{domain_id}: get the links found for a given domain id
  • /api/v1/num/{domain_id}: get the number of visited pages for a given domain id

Example use:

curl -i -H "Accept: application/json" "127.0.0.1:8088/api/v1/links/1"

There is a helper script at misc/client.py that makes requests to these endpoints.

Example

use bullfinch::error::BfError;
use bullfinch::Crawler;

fn main() -> Result<(), BfError> {
    let u = "https://www.thetimes.co.uk/";
    let mut crawler = Crawler::new(u)?;
    crawler.crawl_depth = 1;
    crawler.verbose_log = true;
    // start() does not block; crawling happens on worker threads
    crawler.start();
    // Give the crawler some time to do its work
    std::thread::sleep(std::time::Duration::from_secs(1));
    println!(
        "Found {} unique links (at a depth level {}).",
        crawler.visited.lock().unwrap().len(),
        crawler.crawl_depth
    );
    Ok(())
}

Architecture

Current

Currently there is a single binary that serves as the command-line interface (CLI).

The web interface is also implemented as a single binary, using actix_web.

Visited links are stored in a HashSet, although a Bloom filter might be a better choice[1]. The main thread checks whether each encountered link has already been visited; if not, the link is sent to the worker threads. Communication between the threads is implemented with crossbeam-channel. An alternative approach would have been an Arc<Mutex<>>-protected queue to which new links are appended for fetching. A minimal sketch of this dispatch pattern follows.
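
The sketch below illustrates that dispatch pattern only; it is not Bullfinch's actual code. The worker count, the URLs, and the stubbed-out fetching are assumptions made for the example.

use std::collections::HashSet;
use std::thread;

use crossbeam_channel::unbounded;

fn main() {
    // Channel carrying URLs from the main thread to the workers.
    let (tx, rx) = unbounded::<String>();
    let mut visited: HashSet<String> = HashSet::new();

    // Spawn a few worker threads; fetching and parsing are stubbed out here.
    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = rx.clone();
            thread::spawn(move || {
                // The loop ends once every sender has been dropped.
                for url in rx.iter() {
                    println!("worker {id}: would fetch {url}");
                    // In the real crawler the worker would fetch the page,
                    // extract its links, and report new ones back.
                }
            })
        })
        .collect();

    // Main thread: only enqueue links that have not been seen before.
    for link in ["https://example.com/a", "https://example.com/a", "https://example.com/b"] {
        if visited.insert(link.to_string()) {
            tx.send(link.to_string()).unwrap();
        }
    }

    drop(tx); // close the channel so the workers exit
    for w in workers {
        w.join().unwrap();
    }
}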

Planned

In the future it will roughly look like this (planned architecture diagram):

Improvements

Immediate issues that should be addressed:

  • Adhere to the robots exclusion standard (robots.txt)
  • Logging: use slog instead of println, with verbosity levels
  • Error handling: close to none at the moment; we need to define our own error type and wrap all other errors[2]
  • Persistence: serialize visited links and save them to disk (see the sketch after this list)
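
As a rough starting point for the persistence item, the sketch below serializes the visited set to JSON with serde_json and reloads it. The file name and helper functions are hypothetical and not part of the current code.

use std::collections::HashSet;
use std::fs::File;
use std::io::{BufReader, BufWriter};

// Write the visited-links set to a JSON file.
fn save_visited(visited: &HashSet<String>, path: &str) -> std::io::Result<()> {
    let file = File::create(path)?;
    serde_json::to_writer(BufWriter::new(file), visited)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
    Ok(())
}

// Read a previously saved set back from disk.
fn load_visited(path: &str) -> std::io::Result<HashSet<String>> {
    let file = File::open(path)?;
    serde_json::from_reader(BufReader::new(file))
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))
}

fn main() -> std::io::Result<()> {
    let mut visited = HashSet::new();
    visited.insert("https://example.com/".to_string());
    save_visited(&visited, "visited.json")?;
    let restored = load_visited("visited.json")?;
    println!("restored {} links", restored.len());
    Ok(())
}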

Footnotes

  1. http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/

  2. https://blog.burntsushi.net/rust-error-handling/
