forked from spatie/crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

SERFF/crawler

🕸 Easily crawl the web using PHP 🕷

This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to crawl multiple urls concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, Chrome and Puppeteer power this feature.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Psr\Http\Message\UriInterface $url
 */
public function willCrawl(UriInterface $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Psr\Http\Message\UriInterface $url
 * @param \Psr\Http\Message\ResponseInterface $response
 * @param \Psr\Http\Message\UriInterface $foundOn
 */
public function hasBeenCrawled(UriInterface $url, $response, ?UriInterface $foundOn = null);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
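
As an illustration, here is a minimal sketch of an observer that collects every url it is told about. The class name UrlCollectingObserver and its $crawledUrls property are made up for this example; only the three method signatures come from the interface above. It assumes the package is installed via Composer and autoloaded.

```php
<?php

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObserver;

// Hypothetical observer, not part of the package.
class UrlCollectingObserver implements CrawlObserver
{
    /** @var string[] urls reported through hasBeenCrawled() */
    public $crawledUrls = [];

    public function willCrawl(UriInterface $url)
    {
        // Nothing to prepare in this sketch.
    }

    public function hasBeenCrawled(UriInterface $url, $response, ?UriInterface $foundOn = null)
    {
        // $response is untyped in the interface, so this sketch does not rely on it.
        $this->crawledUrls[] = (string) $url;
    }

    public function finishedCrawling()
    {
        echo count($this->crawledUrls) . " urls crawled\n";
    }
}
```

An instance of this class could then be passed to setCrawlObserver in the snippet at the top of this section.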

Using multiple observers

You can set multiple observers with setCrawlObservers:

Crawler::create()
    ->setCrawlObservers([
        <implementation of \Spatie\Crawler\CrawlObserver>,
        <implementation of \Spatie\Crawler\CrawlObserver>,
        ...
    ])
    ->startCrawling($url);

Alternatively you can set multiple observers one by one with addCrawlObserver:

Crawler::create()
    ->addCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->addCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->addCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

Executing JavaScript

By default the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

Crawler::create()
    ->executeJavaScript()
    ...

To get the body html after the JavaScript has been executed, this package depends on our Browsershot package, which uses Puppeteer under the hood. Here are some pointers on how to install it on your system.

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default the crawler will instantiate a new Browsershot instance. If you need a customized instance, you can pass your own using the setBrowsershot(Browsershot $browsershot) method.

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().

Filtering certain urls

You can tell the crawler not to visit certain urls by using the setCrawlProfile-function. That function expects an object that implements the Spatie\Crawler\CrawlProfile-interface:

/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;

This package comes with three CrawlProfiles out of the box:

  • CrawlAllUrls: this profile will crawl all urls on all pages including urls to an external site.
  • CrawlInternalUrls: this profile will only crawl the internal urls on the pages of a host.
  • CrawlSubdomainUrls: this profile will only crawl the internal urls of a host and the urls of its subdomains.
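
If none of these fit, you can write your own profile. The sketch below only allows urls on one host whose path starts with a given prefix. The class name PathPrefixProfile and its constructor arguments are hypothetical; only shouldCrawl comes from the interface above. It assumes the package is installed via Composer and autoloaded.

```php
<?php

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfile;

// Hypothetical profile, not part of the package.
class PathPrefixProfile implements CrawlProfile
{
    /** @var string */
    private $host;

    /** @var string */
    private $pathPrefix;

    public function __construct(string $host, string $pathPrefix)
    {
        $this->host = $host;
        $this->pathPrefix = $pathPrefix;
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // Plain string comparison: the host must match exactly
        // and the path must begin with the prefix.
        return $url->getHost() === $this->host
            && strpos($url->getPath(), $this->pathPrefix) === 0;
    }
}
```

You would pass it to the crawler with ->setCrawlProfile(new PathPrefixProfile('example.com', '/blog')). Note that a plain prefix check also matches paths such as /blogging, so tighten it if that matters for your site.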

Setting the number of concurrent requests

To improve the speed of the crawl the package concurrently crawls 10 urls by default. If you want to change that number you can use the setConcurrency method.

Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one

Setting the maximum crawl count

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the number of urls the crawler should crawl, you can use the setMaximumCrawlCount method.

// stop crawling after 5 urls

Crawler::create()
    ->setMaximumCrawlCount(5)

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the setMaximumDepth method.

Crawler::create()
    ->setMaximumDepth(2)

Setting the maximum response size

Most html pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler will only use responses that are smaller than 2 MB. If a response grows larger than 2 MB while it is being streamed, the crawler will stop streaming it and assume an empty response body.

You can change the maximum response size.

// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)

Using a custom crawl queue

When crawling a site the crawler will put urls to be crawled in a queue. By default this queue is stored in memory using the built in CollectionCrawlQueue.

When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueue\CrawlQueue-interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.

Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueue\CrawlQueue>)

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

To run the tests, you'll first have to start the included Node-based server in a separate terminal window.

cd tests/server
npm install
./start_server.sh

With the server running, you can start testing.

vendor/bin/phpunit

Security

If you discover any security related issues, please email [email protected] instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

Support us

Spatie is a webdesign agency based in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Does your business depend on our contributions? Reach out and support us on Patreon. All pledges will be dedicated to allocating workforce on maintenance and new awesome stuff.

License

The MIT License (MIT). Please see License File for more information.
