How to handle connection problems during long scraping exercises?

Habiba · September 14, 2022, 8:44pm

There is a potential that the owner of the web blocked you.

These are 5 tips you might consider

1. IP Rotation

The number one way sites detect web scrapers is by examining their IP address, thus most of the web scraping without getting blocked is using a number of different IP addresses to avoid any one IP address from getting banned. To avoid sending all of your requests through the same IP address, you can use an IP rotation service like ScraperAPI or other proxy services in order to route your requests through a series of different IP addresses. This will allow you to scrape the majority of websites without issue.

2. Set a Real User Agent

User Agents are a special type of HTTP header that will tell the website you are visiting exactly what browser you are using. Some websites will examine User Agents and block requests from User Agents that don’t belong to a major browser. Most web scrapers don’t bother setting the User Agent and are therefore easily detected by checking for missing User Agents. Don’t be one of these developers! Remember to set a popular User Agent for your web crawler (you can find a list of popular User Agents here). For advanced users, you can also set your User Agent to the Googlebot User Agent since most websites want to be listed on Google and therefore let Googlebot through. It’s important to remember to keep the User Agents you use relatively up to date, every new update to Google Chrome, Safari, Firefox, etc. has a completely different user agent, so if you go years without changing the user agent on your crawlers, they will become more and more suspicious. It may also be smart to rotate between a number of different user agents so that there isn’t a sudden spike in requests from one exact user agent to a site (this would also be fairly easy to detect).

3. Set Other Request Headers

Real web browsers will have a whole host of headers set, any of which can be checked by careful websites to block your web scraper. In order to make your scraper appear to be a real browser, you can navigate to https://httpbin.org/anything, and simply copy the headers that you see there (they are the headers that your current web browser is using). Things like “Accept”, “Accept-Encoding”, “Accept-Language”, and “Upgrade-Insecure-Requests” being set will make your requests look like they are coming from a real browser so you won’t get your web scraping blocked. For example, the headers from the latest Google Chrome are:

“Accept”: “text/html,application/xhtml+xml,application/xml;q=0.9,image/webp, 

 

image/apng,*/*;q=0.8,application/signed-exchange;v=b3″, 

 

“Accept-Encoding”: “gzip”, 

 

“Accept-Language”: “en-US,en;q=0.9,es;q=0.8”, 

 

“Upgrade-Insecure-Requests”: “1”, 

 

“User-Agent”: “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36”

By rotating through a series of IP addresses and setting proper HTTP request headers (especially User Agents), you should be able to avoid being detected by 99% of websites.

4. Set Random Intervals In Between Your Requests

It is easy to detect a web scraper that sends exactly one request each second 24 hours a day! No real person would ever use a website like that, and an obvious pattern like this is easily detectable. Use randomized delays (anywhere between 2-10 seconds for example) in order to build a web scraper that can avoid being blocked. Also, remember to be polite, if you send requests too fast you can crash the website for everyone, if you detect that your requests are getting slower and slower, you may want to send requests more slowly so you don’t overload the web server (you’ll definitely want to do this to help frameworks like Scrapy avoid being banned).

5. Set a Referrer****strong text

The Referer header is an HTTP request header that lets the site know what site you are arriving from. Generally, it’s a good idea to set this so that it looks like you’re arriving from Google, you can do this with the header:

“Referer”: “https://www.google.com/”

You can also change this up for websites in different countries, for example, if you are trying to scrape a site in the UK, you might want to use “https://www.google.co.uk/” instead of “https://www.google.com/”. You can also look up the most common referrers to any site using a tool like https://www.similarweb.com, often this will be a social media site like Youtube or some social media sites. Setting this header, makes your request look even more authentic, as it appears to be traffic from a site that the webmaster would be expecting a lot of traffic to come from during normal usage.

Hyder_Zaidi · December 20, 2022, 10:50pm