Fix Your Crawler
Every now and then I open the admin panel of my blog hosting and ban a few IPs (after I’ve tried messaging their abuse email, if I find one). It is always IPs that are generating tons of requests (and traffic) – most likely running some home-made crawler. In some cases the IPs belong to an actual service that captures and provides content, in other cases it’s just a scraper for unknown reasons.
I don’t want to ban IPs, especially because that same IP may be reassigned to a legitimate user (or network) in the future. But they are increasing my hosting usage, which in turn leads to the hosting provider suggesting an upgrade in the plan. And this is not about me, I’m just an example – tons of requests to millions of sites are … useless.
My advice (and plea) is this – please fix your crawlers. Or scrapers. Or whatever you prefer to call that thing that programmatically goes on websites and gets their content.
How? First, reuse an existing crawler. There’s no need to build something new (unless you have a very specific use case). A good intro and comparison can be seen here.
Second, make your crawler “polite” (the “politeness” property in the article above). Here’s a good overview of how to be polite, including respecting robots.txt. Existing implementations most likely have politeness options, but you may have to configure them.
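For illustration, here is a minimal sketch of a polite fetch using Python’s standard library, with a placeholder crawler name: it checks robots.txt before fetching and honors a Crawl-delay directive if the site declares one. A real crawler would cache the parsed robots.txt per host instead of re-fetching it on every request.

```python
import time
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse

# Placeholder identity -- use your own service name and info URL.
USER_AGENT = "MyCrawler/1.0 (+https://example.com/crawler-info)"

def fetch_politely(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt, so skip it

    # Honor the site's Crawl-delay if declared; otherwise wait a modest default.
    time.sleep(rp.crawl_delay(USER_AGENT) or 5)

    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```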
Here I’d suggest another option – set a dynamic crawl rate per website that depends on how often the content is updated. My blog updates 3 times a month – no need to crawl it more than once or twice a day. TechCrunch updates many times a day; it’s probably a good idea to crawl it way more often. I don’t have a formula, but you can come up with one that ends up crawling different sites with periods between 2 minutes and 1 day.
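For illustration, one possible heuristic: crawl at roughly half a site’s average interval between updates, clamped between 2 minutes and 1 day. The halving factor is arbitrary; tune it to your freshness needs.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(minutes=2)
MAX_INTERVAL = timedelta(days=1)

def crawl_interval(avg_time_between_updates: timedelta) -> timedelta:
    # Crawl at half the observed update interval, clamped to sane bounds.
    candidate = avg_time_between_updates / 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))

# A blog posting ~3 times a month (every ~10 days) gets capped at once a day.
print(crawl_interval(timedelta(days=10)))     # 1 day
# A news site posting every 20 minutes gets crawled every 10 minutes.
print(crawl_interval(timedelta(minutes=20)))  # 0:10:00
```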
Third, don’t “scrape” the content if a better protocol is supported. Many content websites have RSS – use that instead of the HTML of the page. If not, make use of sitemaps. If the WebSub protocol gains traction, you can avoid the crawling/scraping entirely and get notified of new content.
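For example, here is a short sketch that reads an RSS 2.0 feed with Python’s standard library instead of parsing the page HTML. The “/feed” URL below is an assumption – discover the real feed from the page’s &lt;link rel="alternate" type="application/rss+xml"&gt; tag or the site’s documentation.

```python
import urllib.request
import xml.etree.ElementTree as ET

USER_AGENT = "MyCrawler/1.0 (+https://example.com/crawler-info)"  # placeholder

def latest_items(feed_url):
    req = urllib.request.Request(feed_url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        tree = ET.parse(resp)
    # Standard RSS 2.0 layout: <rss><channel><item> with title/link/pubDate children.
    for item in tree.getroot().iter("item"):
        yield item.findtext("title"), item.findtext("link"), item.findtext("pubDate")

# "/feed" is an assumed path -- many blogs use it, but check the site first.
for title, link, published in latest_items("https://example-blog.com/feed"):
    print(published, title, link)
```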
Finally, make sure your crawler/scraper is identifiable by its User-Agent header. You can include your service name and a web address in it to make it easier for website owners to find you and complain in case you’ve misconfigured something.
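A hypothetical example of what such a User-Agent could look like; the name, version, URL and contact address are all placeholders.

```python
import urllib.request

# Everything in this string is a placeholder -- name, version, info URL, contact.
USER_AGENT = "MyCrawler/1.0 (+https://example.com/crawler-info; abuse@example.com)"

req = urllib.request.Request("https://example.org/", headers={"User-Agent": USER_AGENT})
with urllib.request.urlopen(req) as resp:
    html = resp.read()
```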
It’s also worth checking whether a service like import.io, ScrapingHub, WrapAPI or GetData fits your use case, instead of reinventing the wheel.
No matter what your use case or approach is, please make sure you don’t put unnecessary pressure on others’ websites.
Hey, thanks for mentioning my article in this post. I moved the blog to a separate domain of my own. Please change the link.
old https://medium.com/@pedchenko/comparison-of-open-source-web-crawlers-62a072308b53
new https://outsourceit.today/comparison-open-source-web-crawlers/
Thank you
Fixed, thanks.