Try to avoid intermittent errors when checking links (#86)
* Avoid using the default HTTP transport

It's a global transport and could be changed by any dependency.

Signed-off-by: Douglas Camata <[email protected]>

* Add sensible timeouts and reduce parallelism

This is an effort to avoid random errors when checking a bunch of links
across many files:

- Considering this is a tool often run in resource-constrained
  environments, like GitHub Actions, the HTTP client timeout was bumped
  to 30 seconds.

- Colly was configured with a depth of 1, to avoid crawling too much
  information from the websites.

- The parallelism configuration of `colly` has been brought down from 100
  to 10. This setting means that at most 10 requests will be sent in
  parallel to the same matching domain. This way we are friendlier to the
  servers and hopefully they will return more responses with a 200 status
  code.

- Added a random delay of up to 1 second between new requests to
  matching domains, again to be friendlier to the servers.

Signed-off-by: Douglas Camata <[email protected]>

* Move away from `Dial` (it's deprecated)

`DialContext` is the new way to go.

Signed-off-by: Douglas Camata <[email protected]>

* Remove maxDepth and default timeout of 30s

Signed-off-by: Douglas Camata <[email protected]>

* Make a few HTTP options configurable

These options are:
* Colly's HTTP parallelism
* Colly's random delay between HTTP requests
* HTTP transport's max connections per host

* Fix MaxConnsPerHost usage

Signed-off-by: Douglas Camata <[email protected]>

* Update example validate configuration

Signed-off-by: Douglas Camata <[email protected]>

* Add documentation about default config values

Signed-off-by: Douglas Camata <[email protected]>

* Handle parallelism config more safely

Signed-off-by: Douglas Camata <[email protected]>

* Revert editor's autoformat

Bad editor!

Signed-off-by: Douglas Camata <[email protected]>
douglascamata authored Jul 12, 2022
1 parent e74bd1d commit d39d57f
Showing 4 changed files with 80 additions and 25 deletions.
14 changes: 12 additions & 2 deletions README.md
@@ -125,6 +125,10 @@ For example,
```yaml mdox-exec="cat examples/.mdox.validate.yaml"
version: 1
timeout: '1m'
parallelism: 100
host_max_conns: 2
random_delay: '1s'
validators:
- regex: '(^http[s]?:\/\/)(www\.)?(github\.com\/)bwplotka\/mdox(\/pull\/|\/issues\/)'
@@ -135,10 +139,16 @@ validators:
- regex: 'thanos\.io'
type: 'roundtrip'
```
As seen above, mdox supports passing an array of link validators with types and regexes. There are three types of validators,
As seen above, the mdox validate configuration supports a few parameters as well as an array of link validators with types and regexes. The supported configuration parameters are:
* `timeout`: The HTTP client's timeout. Defaults to "10s".
* `parallelism`: The maximum amount of concurrent HTTP requests. Defaults to 100.
* `host_max_conns`: The maximum amount of HTTP connections open per host. Defaults to 2.
* `random_delay`: A random delay between 0 and this value is added between requests. It takes values like "500ms", "1s", "1m", or "1m30s". Defaults to no delay.
There are three types of validators:
* `ignore`: This type of validator makes sure that `mdox` does not check links with provided regex. This is the most common use case.
* `githubPullsIssues`: This is a smart validator which only accepts a specific type of regex of the form `(^http[s]?:\/\/)(www\.)?(github\.com\/){ORG}\/{REPO}(\/pull\/|\/issues\/)`. It performs smart validation of GitHub PR and issue links by querying the GitHub API for the latest pull/issue number and matching the regex. This makes sure that mdox doesn't get rate limited by GitHub, even when checking a large number of GitHub links (which is pretty common in documentation)!
5 changes: 4 additions & 1 deletion examples/.mdox.validate.yaml
@@ -1,4 +1,8 @@
version: 1
timeout: '1m'
parallelism: 100
host_max_conns: 2
random_delay: '1s'

validators:
- regex: '(^http[s]?:\/\/)(www\.)?(github\.com\/)bwplotka\/mdox(\/pull\/|\/issues\/)'
@@ -9,4 +13,3 @@ validators:

- regex: 'thanos\.io'
type: 'roundtrip'

26 changes: 22 additions & 4 deletions pkg/mdformatter/linktransformer/config.go
@@ -20,10 +20,16 @@ import (
type Config struct {
Version int

Validators []ValidatorConfig `yaml:"validators"`
Timeout string `yaml:"timeout"`

timeout time.Duration
Validators []ValidatorConfig `yaml:"validators"`
Timeout string `yaml:"timeout"`
Parallelism int `yaml:"parallelism"`
// HostMaxConns has to be a pointer because a zero value means no limits
// and we have to tell apart 0 from not-present configurations.
HostMaxConns *int `yaml:"host_max_conns"`
RandomDelay string `yaml:"random_delay"`

timeout time.Duration
randomDelay time.Duration
}

type ValidatorConfig struct {
@@ -84,6 +90,18 @@ func ParseConfig(c []byte) (Config, error) {
}
}

if cfg.RandomDelay != "" {
var err error
cfg.randomDelay, err = time.ParseDuration(cfg.RandomDelay)
if err != nil {
return Config{}, errors.Wrap(err, "parsing random delay duration")
}
}

if cfg.Parallelism < 0 {
return Config{}, errors.New("parsing parallelism, must be >= 0")
}

if len(cfg.Validators) <= 0 {
return Config{}, errors.New("No validator provided")
}
60 changes: 42 additions & 18 deletions pkg/mdformatter/linktransformer/link.go
@@ -207,42 +207,66 @@ func NewValidator(ctx context.Context, logger log.Logger, linksValidateConfig []
}
}

linktransformerMetrics := newLinktransformerMetrics(reg)
collector := colly.NewCollector(colly.Async(), colly.StdlibContext(ctx))
transport := &http.Transport{
Proxy: http.ProxyFromEnvironment,
ForceAttemptHTTP2: true,
MaxIdleConns: 100,
IdleConnTimeout: 90 * time.Second,
TLSHandshakeTimeout: 10 * time.Second,
ResponseHeaderTimeout: 10 * time.Second,
ExpectContinueTimeout: 5 * time.Second,
DialContext: (&net.Dialer{
Timeout: 10 * time.Second,
KeepAlive: 30 * time.Second,
}).DialContext,
}
if config.HostMaxConns != nil {
transport.MaxConnsPerHost = *config.HostMaxConns
}
v := &validator{
logger: logger,
anchorDir: anchorDir,
validateConfig: config,
localLinks: map[string]*[]string{},
remoteLinks: map[string]error{},
c: colly.NewCollector(colly.Async(), colly.StdlibContext(ctx)),
c: collector,
destFutures: map[futureKey]*futureResult{},
l: &linktransformerMetrics{},
transportFn: func(url string) http.RoundTripper {
return http.DefaultTransport
l: linktransformerMetrics,
transportFn: func(u string) http.RoundTripper {
parsed, err := url.Parse(u)
if err != nil {
panic(err)
}
return promhttp.InstrumentRoundTripperCounter(
linktransformerMetrics.collyRequests,
promhttp.InstrumentRoundTripperDuration(
linktransformerMetrics.collyPerDomainLatency.MustCurryWith(prometheus.Labels{"domain": parsed.Host}),
transport,
),
)
},
}

v.l = newLinktransformerMetrics(reg)
v.transportFn = func(u string) http.RoundTripper {
parsed, err := url.Parse(u)
if err != nil {
panic(err)
}
return promhttp.InstrumentRoundTripperCounter(
v.l.collyRequests,
promhttp.InstrumentRoundTripperDuration(v.l.collyPerDomainLatency.MustCurryWith(prometheus.Labels{"domain": parsed.Host}), http.DefaultTransport),
)
}

// Set very soft limits.
// E.g github has 50-5000 https://docs.github.com/en/free-pro-team@latest/rest/reference/rate-limit limit depending
// on api (only search is below 100).
if config.Timeout != "" {
v.c.SetRequestTimeout(config.timeout)
}
if err := v.c.Limit(&colly.LimitRule{

limitRule := &colly.LimitRule{
DomainGlob: "*",
Parallelism: 100,
}); err != nil {
}
if config.Parallelism > 0 {
limitRule.Parallelism = config.Parallelism
}
if config.RandomDelay != "" {
limitRule.RandomDelay = config.randomDelay
}
if err := v.c.Limit(limitRule); err != nil {
return nil, err
}
v.c.OnRequest(func(request *colly.Request) {
