Req plugins to support common crawling functions
Add the dependencies to your `mix.exs`:

```elixir
def deps do
  [
    {:req_crawl, "~> 0.2.0"},
    {:saxy, "~> 1.5"} # Optional, needed to use ReqCrawl.Sitemap
  ]
end
```
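Then fetch the dependencies with `mix deps.get`.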
A Req plugin to parse robots.txt files.

You can attach this plugin to any `%Req.Request{}` you use for a crawler, and it will only run against URLs with a path of `/robots.txt`.
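A minimal sketch of attaching the plugin and fetching a robots.txt file. The `ReqCrawl.Robots.attach/1` call assumes the plugin follows the common Req plugin convention of exposing an `attach/1` function:

```elixir
req =
  Req.new(base_url: "https://example.com")
  # ReqCrawl.Robots.attach/1 is assumed here; adjust if the plugin exposes a different entry point.
  |> ReqCrawl.Robots.attach()

# The plugin only runs for requests whose path is /robots.txt.
{:ok, response} = Req.request(req, url: "/robots.txt")
```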
It outputs a map with the following fields:

- `:errors` - A list of any errors encountered during parsing
- `:sitemaps` - A list of the sitemaps
- `:rules` - A map of the rules, with User-Agents as the keys and maps with the following fields as the values:
  - `:allow` - A list of the allowed paths
  - `:disallow` - A list of the disallowed paths
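The parsed map can then be read back from the response. The `:crawl_robots` private key below is an assumption, chosen by analogy with the `:crawl_sitemap` key documented for the sitemap plugin:

```elixir
# :crawl_robots is an assumed key, by analogy with :crawl_sitemap; check the plugin docs.
robots = response.private[:crawl_robots]

# Disallowed paths for the wildcard User-Agent.
robots.rules
|> Map.get("*", %{})
|> Map.get(:disallow, [])
```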
Gathers all URLs from a Sitemap or SitemapIndex according to the specification described at https://sitemaps.org/protocol.html.

Supports the following formats:

- `.xml` (for sitemap and sitemapindex)
- `.txt` (for sitemap)
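A minimal sketch of fetching a sitemap with the plugin attached; as above, `ReqCrawl.Sitemap.attach/1` assumes the usual Req `attach/1` convention, and the sitemap URL is only an example:

```elixir
req =
  Req.new()
  # Saxy must be in your deps for the XML formats (see the installation section above).
  |> ReqCrawl.Sitemap.attach()

{:ok, response} = Req.request(req, url: "https://example.com/sitemap.xml")
```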
Outputs a 2-tuple of `{type, urls}`, where `type` is one of `:sitemap` or `:sitemapindex` and `urls` is a list of URL strings extracted from the body.
Output is stored in the `Req.Response` in the private field under the `:crawl_sitemap` key.
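For example, the extracted URLs can be read back out of the response and branched on by type:

```elixir
case response.private[:crawl_sitemap] do
  {:sitemap, urls} ->
    # A plain sitemap: urls are page URLs ready to crawl.
    urls

  {:sitemapindex, urls} ->
    # A sitemap index: urls point at nested sitemaps to fetch next.
    urls
end
```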