In this article I would like to explore several popular methods of protecting website content from automatic scraping. Each method has its own pros and cons, so choose the one that fits your particular situation. None of them is a silver bullet: virtually every method can be bypassed, and I will mention those bypasses as well.
1. IP address ban
The easiest and most common way of detecting scraping attempts is to analyze the frequency and intervals of requests to the server. If an IP address sends too many requests, or sends them too often, it gets blocked, and to unblock it the user has to enter a CAPTCHA.
Note that with this type of protection you have to draw the line between the natural frequency and volume of requests and actual scraping attempts, so as not to block innocent users. This boundary is usually determined by analyzing the behavior of normal website visitors.
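For illustration, here is a minimal sketch of such a check in Python: a sliding-window request counter per IP address. The window size and request limit are hypothetical values that would have to be tuned against your real traffic.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds -- tune them against real user behaviour.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100

_requests_by_ip = defaultdict(deque)  # ip -> timestamps of recent requests

def is_suspicious(ip: str) -> bool:
    """Return True if this IP exceeded the allowed request rate."""
    now = time.time()
    window = _requests_by_ip[ip]
    window.append(now)
    # Drop timestamps that fell out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```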
Google is a good example of this method: it monitors the frequency of queries coming from a particular address and, when it looks suspicious, blocks the IP address with a warning and a suggestion to enter a CAPTCHA.
There are services (such as distilnetworks.com) that automatically monitor suspicious activity on your website and can even challenge a user with a CAPTCHA.
You can bypass this protection by using several proxy servers that hide the scraper's real IP address. Services like BestProxyAndVPN offer inexpensive proxies, while SwitchProxy is specifically designed for automatic scrapers and can handle heavy loads, though it is more expensive.
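A minimal sketch of what such a bypass looks like on the scraper's side, using the Python requests library and a hypothetical list of proxy addresses:

```python
import itertools
import requests

# Hypothetical proxy list -- in practice these come from a proxy provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url: str) -> str:
    """Fetch a page through the next proxy in the rotation."""
    proxy = next(proxy_pool)
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    response.raise_for_status()
    return response.text
```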
2. Using accounts
This protection method grants access to data only to authorized users. That makes it easier to monitor user behavior and to block suspicious accounts regardless of the IP address the client uses.
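As a rough sketch, account-level control might look like this in Python. The view threshold and the in-memory storage are assumptions made for the example; a real system would use persistent storage and many more behavioral signals.

```python
import time
from collections import defaultdict

# Hypothetical threshold: more page views per hour than a human plausibly makes.
MAX_VIEWS_PER_HOUR = 500

_views = defaultdict(list)      # account_id -> timestamps of recent views
_blocked_accounts = set()

def register_view(account_id: str) -> None:
    """Record a page view and block the account if it behaves like a bot."""
    now = time.time()
    history = [t for t in _views[account_id] if now - t < 3600]
    history.append(now)
    _views[account_id] = history
    if len(history) > MAX_VIEWS_PER_HOUR:
        _blocked_accounts.add(account_id)

def is_blocked(account_id: str) -> bool:
    return account_id in _blocked_accounts
```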
Facebook is a good example: it actively monitors the actions of its users and blocks suspicious ones.
This protection can be bypassed by creating (even automatically) a plethora of accounts (there are services that sell accounts for popular social networks, such as buyaccs.com and bulkaccounts.com). A substantial impediment to automatic account creation is requiring verification via a phone number and checking that the number is unique (the so-called Phone Verified Account, PVA). Although even this can be bypassed by buying a bunch of SIM cards.
3. Using CAPTCHA
This is also a common method of protecting data from scraping. Here the user is asked to solve a CAPTCHA in order to access the website's data. A substantial drawback of this method is that the user actually has to enter the CAPTCHA, which is why it works best for systems whose data is accessed through infrequent, isolated requests.
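To avoid annoying regular users, the CAPTCHA is usually triggered only after a certain activity threshold. Here is a minimal Flask sketch of that idea; the /captcha route, the captcha_passed flag, and the threshold are assumptions made for illustration.

```python
from flask import Flask, redirect, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # needed for session cookies

# Hypothetical threshold: challenge the visitor after this many requests.
CAPTCHA_AFTER_REQUESTS = 50

@app.before_request
def maybe_require_captcha():
    if request.path == "/captcha":
        return None  # never gate the CAPTCHA page itself
    session["requests"] = session.get("requests", 0) + 1
    if session["requests"] > CAPTCHA_AFTER_REQUESTS and not session.get("captcha_passed"):
        # The /captcha view (not shown) would verify the answer and set
        # session["captcha_passed"] = True.
        return redirect("/captcha")
```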
An example of using CAPTCHA for protection against automated queries is services that check a website's position in search results (such as http://smallseotools.com/keyword-position/).
CAPTCHA can be bypassed with recognition software and services, which fall into two categories: automatic recognition (OCR, such as GSA Captcha Breaker) and recognition by real people (when somewhere in India people process image recognition requests online, as with Bypass CAPTCHA). Human recognition is usually more reliable, but you pay for every single CAPTCHA rather than once, as when buying software.
4. Using complex JavaScript logic
Under this scenario the browser sends a special code (or several codes) along with each request to the server. The code is generated by complex logic written in JavaScript, and that logic is often obfuscated and spread across one or several of the JavaScript files the page loads.
A typical example of this protection method is Facebook.
This protection can be bypassed by using real browsers for scraping (for example, with the help of the Selenium or Mechanize libraries). This approach to protection also has an additional advantage: by executing JavaScript, the scraper reveals itself in the site's analytics (such as Google Analytics), which lets the webmaster notice the irregular activity.
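For example, a minimal Selenium sketch that lets a real (headless) Chrome execute the page's JavaScript before the scraper reads the result; the URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # run without opening a window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/protected-page")
    # The browser executes the obfuscated JavaScript and sends the expected
    # tokens itself; we simply read the rendered result.
    html = driver.page_source
finally:
    driver.quit()
```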
5. Dynamic modification of page structure
One of the effective ways to protect a website from scraping is to modify the page structure frequently. This applies not only to renaming classes and identifiers, but even to changing the element hierarchy. It makes writing a scraper much harder, but, on the other hand, it complicates the website's own code too.
Alternatively, these changes can be made manually once a month (or once every few months); this will still give scrapers a hard time.
This protection can be bypassed by writing a more flexible and 'smart' scraper or (if the changes are infrequent) simply by adjusting the scraper manually after each change.
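A 'smarter' scraper typically anchors itself to stable signals in the data (text labels, value patterns) rather than to class names or element paths. Here is a minimal sketch with BeautifulSoup, assuming the target value is a price-like string:

```python
import re
from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")

def extract_price(html: str):
    """Find a price by its textual pattern instead of a fragile CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=PRICE_RE)
    return PRICE_RE.search(node).group(0) if node else None
```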
6. Restricting the frequency of requests and the volume of downloadable data
This makes bulk data scraping very slow and therefore impractical. Restrictions should be set based on the needs of a typical user so as not to reduce the usability of the website.
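One common form of such a restriction is a per-account download quota. A minimal sketch of the idea in Python; the quota value and the in-memory counter are assumptions made for illustration.

```python
import datetime
from collections import defaultdict

# Hypothetical limit: a regular user rarely downloads more than this per day.
DAILY_BYTE_QUOTA = 50 * 1024 * 1024  # 50 MB

_usage = defaultdict(int)  # (account_id, date) -> bytes served today

def allow_response(account_id: str, size_bytes: int) -> bool:
    """Return False once the account exceeds its daily download quota."""
    key = (account_id, datetime.date.today())
    if _usage[key] + size_bytes > DAILY_BYTE_QUOTA:
        return False
    _usage[key] += size_bytes
    return True
```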
This protection can be bypassed by accessing the website from multiple IP addresses or accounts (simulating many users).
7. Displaying important data as images
This content protection method substantially complicates automatic data harvesting while keeping the data visually accessible to users. Emails and phone numbers are often rendered as images, and some websites even manage to replace random letters in the text with images. Nothing prevents owners from displaying the entire website content as graphics (whether Flash or HTML5), although this can substantially harm the site's indexing by search engines.
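For example, an e-mail address can be rendered server-side with Pillow. This is only a sketch; a production version would pick a proper font, size, and layout.

```python
from PIL import Image, ImageDraw, ImageFont

def text_to_image(text: str, path: str) -> None:
    """Render a sensitive string (e-mail, phone number) as a PNG image."""
    font = ImageFont.load_default()
    # Rough width estimate for the default bitmap font.
    img = Image.new("RGB", (8 * len(text) + 10, 20), "white")
    draw = ImageDraw.Draw(img)
    draw.text((5, 4), text, font=font, fill="black")
    img.save(path)

text_to_image("user@example.com", "email.png")
```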
The drawback of this method is not only that not all of the content will be indexed by search engines, but also that it prevents users from copying the data to the clipboard.
Bypassing this protection is very difficult; most likely you would have to resort to automatic or manual image recognition, just as with CAPTCHA.