don't canonicalize URLs before passing them to URL_FINGERPRINT_FUNCTION #335

Open
kmike opened this issue Jul 17, 2018 · 1 comment

@kmike
Member

kmike commented Jul 17, 2018

Currently frontera passes a URL to URL_FINGERPRINT_FUNCTION that has already been canonicalized by w3lib's canonicalize_url function. If the API were changed to pass the raw URL instead, users could use canonicalize_url options like keep_fragments=True (which can be desirable, e.g. for Splash) or swap out the canonicalization implementation altogether. This change would be backwards incompatible, though if desired it could be made backwards compatible (by using a different setting, etc.).
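
To illustrate the proposal, here is a minimal sketch of a fingerprint function that receives the raw URL and decides for itself how to canonicalize it, keeping URL fragments so that two URLs differing only in their fragment get different fingerprints. The names simple_canonicalize and fragment_aware_fingerprint are hypothetical, and simple_canonicalize is only a standard-library stand-in for a real canonicalizer such as w3lib's canonicalize_url; it is not how frontera currently behaves.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url, keep_fragments=False):
    """Hypothetical stand-in canonicalizer (not w3lib's implementation).

    Lowercases the scheme and netloc, sorts query arguments, defaults an
    empty path to "/", and drops the fragment unless keep_fragments is True.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    fragment = parts.fragment if keep_fragments else ""
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, fragment)
    )


def fragment_aware_fingerprint(raw_url):
    """Hypothetical fingerprint function that expects a raw, non-canonicalized URL."""
    # Canonicalization happens here, under the user's control, so the
    # fragment can be preserved before hashing.
    canonical = simple_canonicalize(raw_url, keep_fragments=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

With this sketch, URLs that differ only in query argument order still share a fingerprint, while URLs that differ in their fragment do not. That only works if the function is handed the raw URL: once the fragment has been stripped upstream, it cannot be recovered.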

@sibiryakov
Member

> Currently frontera passes an URL to URL_FINGERPRINT_FUNCTION which is already canonicalized by w3lib's canonicalize_url function

Only if the URL comes from Scrapy's link extractor with canonicalisation enabled. There is also a create_request method, which is used when adding new seeds, generating new URLs in CS, and discovering URLs from sitemaps. In other words, this is not always true.

There is probably a better way to manage canonicalisation through the whole pipeline: a dedicated middleware. http://frontera.readthedocs.io/en/latest/topics/frontier-canonicalsolvers.html
