Add CRAWLER_DIRECTIVES.md and purge crawls documentation (#115)
- Add documentation for purge crawls to main `README`.
- Add `CRAWLER_DIRECTIVES.md`, which includes documentation about
managing crawl discovery and extraction through server or HTML changes.
- These docs are largely copied from official [optimizing web content
docs](https://www.elastic.co/guide/en/enterprise-search/current/crawler-content.html),
with relevant changes applied.
- Remove unused `SCHEDULING.md`. This was mistakenly added in a previous
PR. Scheduling is documented in the main `README.md`.
navarone-feekery authored Sep 3, 2024
1 parent dc58628 commit 2d29f08
Showing 3 changed files with 279 additions and 4 deletions.
31 changes: 27 additions & 4 deletions README.md
@@ -21,12 +21,35 @@ Indexing web content with the Open Crawler requires:

### Execution logic

Open Crawler runs crawl jobs on command based on config files in the `config` directory.
Crawler runs crawl jobs on command, based on config files in the `config` directory.
Each URL endpoint found during the crawl will result in one document to be indexed into Elasticsearch.
Crawler performs crawl jobs in a multithreaded environment, where one thread will be used to visit one URL endpoint.

Open Crawler performs crawl jobs in a multithreaded environment, where one thread will be used to visit one URL endpoint.
The crawl results from these are added to a pool of results.
These are indexed into Elasticsearch using the `_bulk` API once the pool reaches a configurable threshold.
Crawls are performed in two stages:

#### 1. Primary crawl

Starting from the URLs listed as `seed_urls`, Crawler begins crawling web content.
Each link Crawler encounters is added to the crawl queue, unless the link should be ignored due to [crawl rules](./docs/features/CRAWL_RULES.md) or [crawler directives](./docs/features/CRAWLER_DIRECTIVES.md).

The crawl results from visiting these webpages are added to a pool of results.
These are indexed into Elasticsearch using the `_bulk` API once the pool reaches the configured threshold.
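
As a rough sketch of what such a config file might contain (only `seed_urls` and the `config` directory are referenced above; the other field names are assumptions and may differ from the repository's reference config):

```yaml
# Hypothetical crawl config, e.g. config/my-crawl.yml
# Field names other than `seed_urls` are illustrative assumptions.
domains:
  - url: https://example.com
    seed_urls:
      - https://example.com/blog
      - https://example.com/docs
output_sink: elasticsearch    # send crawl results to Elasticsearch
output_index: my-crawl-index  # index that receives the documents
```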

#### 2. Purge crawl

After a primary crawl is completed, Crawler fetches every doc from the associated index that was not encountered during the primary crawl.
It does this by comparing the `last_crawled_at` date on each doc to the primary crawl's start time.
If `last_crawled_at` is earlier than the start time, the webpage was not updated during the primary crawl, so it is added to the purge crawl.

Crawler then re-crawls all of these webpages.
If a webpage is still accessible, Crawler will update its Elasticsearch doc.
A webpage can be inaccessible for any of the following reasons:

- Updated [crawl rules](./docs/features/CRAWL_RULES.md) in the configuration file that now exclude the URL
- Updated [crawler directives](./docs/features/CRAWLER_DIRECTIVES.md) on the server or webpage that now exclude the URL
- Non-`200` response from the webserver

At the end of the purge crawl, all docs in the index that were not updated during either the primary crawl or the purge crawl are deleted.
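
Conceptually, the stale-doc lookup resembles an Elasticsearch range query like the one below, sent to the index's `_search` endpoint (the timestamp stands in for the primary crawl's start time; this illustrates the comparison, not Crawler's actual implementation):

```json
{
  "query": {
    "range": {
      "last_crawled_at": {
        "lt": "2024-09-03T10:00:00Z"
      }
    }
  }
}
```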

### Setup

252 changes: 252 additions & 0 deletions docs/features/CRAWLER_DIRECTIVES.md
@@ -0,0 +1,252 @@
# Crawler Directives

Crawler directives are external features that the Open Crawler adheres to.
These typically require no change to the Open Crawler configuration; instead, they are set up directly on the webserver or webpage.

These techniques are similar to search engine optimization (SEO) techniques used for other web crawlers and robots.
For example, you can embed instructions for the web crawler within your HTML content.
You can also prevent the crawler from following links or indexing any content for certain webpages.
Use these tools to manage webpage discovery and content extraction.

**Discovery** concerns which webpages and files from crawled domains get indexed:

- [Canonical URL link tags](#canonical-url-link-tags)
- [Robots meta tags](#robots-meta-tags)
- [Nofollow links](#nofollow-links)
- [`robots.txt` files](#robotstxt-files)
- [Sitemaps](#sitemaps)

**Extraction** concerns how content is indexed and mapped to fields in Elasticsearch documents:

- [Data attributes for inclusion and exclusion](#data-attributes-for-inclusion-and-exclusion)

## HTML elements and attributes

The following sections describe crawler instructions you can embed within HTML elements and attributes.

### Canonical URL link tags

A canonical URL link tag is an HTML element you can embed within pages that duplicate the content of other pages.
The canonical URL link tag specifies the canonical URL for that content.

The canonical URL is stored on the document in the `url` field, while the `additional_urls` field contains all other URLs where the crawler discovered the same content.
If your site contains pages that duplicate the content of other pages, use canonical URL link tags to explicitly manage which URL is stored in the `url` field of the indexed document.

#### Template

```html
<link rel="canonical" href="{CANONICAL_URL}">
```

#### Example

```html
<link rel="canonical" href="https://example.com/categories/dresses/starlet-red-medium">
```
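
For illustration, a duplicate page that declares the canonical URL above might be indexed as a document along these lines (only the `url` and `additional_urls` field names come from this document; the values are hypothetical):

```json
{
  "url": "https://example.com/categories/dresses/starlet-red-medium",
  "additional_urls": [
    "https://example.com/catalog?item=starlet-red&size=medium"
  ]
}
```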

### Robots meta tags

Robots meta tags are HTML elements you can embed within pages to prevent the crawler from following links or indexing content.
These tags are related to [crawl rules](./CRAWL_RULES.md).

#### Template

```html
<meta name="robots" content="{DIRECTIVES}">
```

#### Supported directives

`noindex`

- The web crawler will not index the page
- If you want to index some, but not all, content on a page, see [data attributes for inclusion and exclusion](#data-attributes-for-inclusion-and-exclusion)

`nofollow`

- The web crawler will not follow links from the page
- The directive does not prevent the web crawler from indexing the page

#### Examples

```html
<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">
<meta name="robots" content="noindex, nofollow">
```

### Data attributes for inclusion and exclusion

Inject HTML data attributes into your webpages to instruct the web crawler to include or exclude particular sections from extracted content.
For example, use this feature to exclude navigation and footer content when crawling, or to exclude sections of content only intended for screen readers.

These attributes work as follows:

- For all pages that contain HTML tags with a `data-elastic-include` attribute, the crawler will only index content within those tags.
- For all pages that contain HTML tags with a `data-elastic-exclude` attribute, the crawler will skip those tags from content extraction.
- You can nest `data-elastic-include` and `data-elastic-exclude` tags.
- The web crawler will still crawl any links that appear inside excluded sections, as long as the configured crawl rules allow them.

#### Examples

Here's a simple content exclusion rule example:

```html
<body>
  <p>This is your page content, which will be indexed by the web crawler.</p>
  <div data-elastic-exclude>Content in this div will be excluded from the search index</div>
</body>
```

In this more complex example with nested exclusion and inclusion rules, the web crawler will only extract "test1 test3 test5 test7" from the page.

```html
<body>
  test1
  <div data-elastic-exclude>
    test2
    <p data-elastic-include>
      test3
      <span data-elastic-exclude>
        test4
        <span data-elastic-include>test5</span>
      </span>
    </p>
    test6
  </div>
  test7
</body>
```

### Nofollow links

Nofollow links are HTML links that instruct the crawler not to follow the URL.

The web crawler will not follow links that include `rel="nofollow"` (that is, it will not add such links to the crawl queue).
The link does not prevent the web crawler from indexing the page in which it appears.

#### Template

```html
<a rel="nofollow" href="{LINK_URL}">{LINK_TEXT}</a>
```

#### Example

```html
<a rel="nofollow" href="/admin/categories">Edit this category</a>
```

## Server-based configurations

The following sections describe crawler instructions you can include on your webserver.

### `robots.txt` files

> [!TIP]
> It is impossible to configure the web crawler to ignore or work around a domain’s `robots.txt` file.
> Remember this if you’re crawling a domain you don’t control.

A domain may have a `robots.txt` file.
This is a plain text file that provides instructions to web crawlers.
The instructions within the file, also called directives, communicate which paths within that domain are disallowed (and allowed) for crawling.

You can also use a `robots.txt` file to specify [sitemaps](#sitemaps) for a domain.

Most web crawlers automatically fetch and parse the `robots.txt` file for each domain they crawl.
If you already publish a `robots.txt` file for other web crawlers, be aware the web crawler will fetch this file and honor the directives within it.
You may want to add, remove, or update the `robots.txt` file for each of your domains.

#### Example

To add a `robots.txt` file to the domain `https://shop.example.com`:

1. Determine which paths within the domain you’d like to exclude.
2. Create a `robots.txt` file with the appropriate directives from the [robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard), for example:
   ```txt
   User-agent: *
   Disallow: /cart
   Disallow: /login
   Disallow: /account
   ```
3. Publish the file, with filename `robots.txt`, at the root of the domain: `https://shop.example.com/robots.txt`.

The next time the web crawler visits the domain, it will fetch and parse the `robots.txt` file.
The web crawler will crawl only those paths that are allowed by the [crawl rules](./CRAWL_RULES.md) and the directives within the `robots.txt` file for the domain.

#### Nonstandard extensions

Open Crawler supports some, but not all, [nonstandard extensions](https://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions) to the robots exclusion standard (see the example following the table):

| Directive | Support |
|-----------------------|---------------|
| Crawl-delay directive | Not supported |
| Sitemap directive | Supported |
| Host directive | Not supported |
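
For example, given the following (illustrative) `robots.txt`, the crawler would honor the `Disallow` and `Sitemap` lines but ignore the `Crawl-delay` line:

```txt
User-agent: *
Disallow: /cart
Crawl-delay: 10
Sitemap: https://shop.example.com/sitemap.xml
```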

### Sitemaps

A sitemap is an XML file, associated with a domain, that informs web crawlers about pages within that domain.
XML elements within the sitemap identify specific URLs that are available for crawling.
Each domain may have one or more sitemaps.

If you already publish sitemaps for other web crawlers, the web crawler can use the same sitemaps.
To make your sitemaps discoverable, specify them within `robots.txt` files.

Sitemaps are related to the `seed_urls` configuration field.
You can choose to submit URLs to the web crawler using sitemaps, seed URLs, or a combination of both.

Use sitemaps to inform the web crawler of pages you think are important, or pages that are isolated and not linked from other pages.
However, be aware the web crawler will visit only those pages from the sitemap that are allowed by the domain’s [crawl rules](./CRAWL_RULES.md) and [robots.txt](#robotstxt-files) file directives.

#### Sitemap discovery and management

To add a sitemap to a domain, you can specify it within a `robots.txt` file.
At the start of each crawl, the web crawler fetches and processes each domain’s `robots.txt` file and each sitemap specified within those `robots.txt` files.

#### Sitemap format and technical specification

The [sitemaps standard](https://www.sitemaps.org/index.html) defines the format and technical specification for sitemaps.
Refer to the standard for the required and optional elements, character escaping, and other technical considerations and examples.

The web crawler does not process optional metadata defined by the standard.
The web crawler extracts a list of URLs from each sitemap and ignores all other information.

There is no guarantee that pages (and their respective linked pages) will be indexed in the order they appear in the sitemap, because crawls are run asynchronously.

Ensure each URL within your sitemap matches the exact domain (defined here as scheme + host + port) for your site.
Different subdomains (like `www.example.com` and `blog.example.com`) and different schemes (like `http://example.com` and `https://example.com`) require separate sitemaps.

The web crawler also supports sitemap index files.
Refer to [using sitemap index files](https://www.sitemaps.org/protocol.html#index) within the sitemap standard for sitemap index file details and examples.
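
For reference, a minimal sitemap index file (with hypothetical child sitemap locations) uses `<sitemapindex>` and `<sitemap>` elements in place of `<urlset>` and `<url>`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://shop.example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://shop.example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```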

#### Manage sitemaps

To add a sitemap to the domain `https://shop.example.com`:

1. Determine which pages within the domain you’d like to include.
   - Ensure these paths are allowed by the domain’s crawl rules and the directives within the domain’s `robots.txt` file.
2. Create a sitemap file with the appropriate elements from the sitemap standard, for example:
   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
       <loc>https://shop.example.com/products/1/</loc>
     </url>
     <url>
       <loc>https://shop.example.com/products/2/</loc>
     </url>
     <url>
       <loc>https://shop.example.com/products/3/</loc>
     </url>
   </urlset>
   ```
3. Publish the file on your site, for example, at the root of the domain: `https://shop.example.com/sitemap.xml`.
4. Create or modify the `robots.txt` file for the domain, located at `https://shop.example.com/robots.txt`.
   - Anywhere within the file, add a Sitemap directive that provides the location of the sitemap. For instance:
     ```txt
     Sitemap: https://shop.example.com/sitemap.xml
     ```
5. Publish the new or updated `robots.txt` file.

The next time the web crawler visits the domain, it will fetch and parse the `robots.txt` file and the sitemap.
Empty file removed docs/features/SCHEDULING.md
