Result data omits links not matching crawlUrlfilter filter #51

geotheory · 2019-01-28T01:31:23Z

Thanks for this super useful package. I want to restrict the crawl to certain URL specifications, but capture all links on the crawled pages regardless of whether they match the filter. I can't get this to work in practice. An example:

library(Rcrawler)

Rcrawler(
  Website = "https://beta.companieshouse.gov.uk/company/02906991",
  no_cores = 4, no_conn = 4,
  NetworkData = TRUE, statslinks = TRUE,
  crawlUrlfilter = '02906991',  # only crawl URLs containing this company number
  saveOnDisk = FALSE
)

The page https://beta.companieshouse.gov.uk/company/02906991/officers (which is crawled) includes links such as https://beta.companieshouse.gov.uk/officers/... but these links do not appear in the results. E.g.:

library(stringr)  # provides str_subset() and re-exports %>%

NetwIndex %>% str_subset('uk/officers')
character(0)

Shouldn't these links be captured, since I have provided no dataUrlfilter argument? Or am I missing something here?
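
To make the expected behaviour concrete, here is a minimal base-R sketch of what I had in mind. It does not call Rcrawler and says nothing about how the package works internally. The link values and the names `found`, `to_crawl`, `nodes` and `edges` are hypothetical, and `EXAMPLE-ID` is a placeholder rather than a real officer ID.

```r
# Hypothetical links discovered on one crawled page (illustrative values only).
page <- "https://beta.companieshouse.gov.uk/company/02906991/officers"
found <- c(
  "https://beta.companieshouse.gov.uk/company/02906991/filing-history",
  "https://beta.companieshouse.gov.uk/officers/EXAMPLE-ID/appointments"
)
crawl_filter <- "02906991"

# Only links matching the filter would be queued for further crawling ...
to_crawl <- found[grepl(crawl_filter, found, fixed = TRUE)]

# ... but every discovered link would still be recorded as a node and an edge.
nodes <- unique(c(page, found))
edges <- data.frame(from = page, to = found, stringsAsFactors = FALSE)

# The officers link is absent from the crawl queue but present in the node index.
grep("uk/officers", to_crawl, value = TRUE)
grep("uk/officers", nodes, value = TRUE)
```

In other words, I would expect the filter to limit which pages get fetched, not which links get recorded from the pages that were fetched.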
