Scraping Google Search (without getting caught)
Disclaimer: use the ideas here at your own risk.
The IP blocking problem
If you are into web scraping, you probably know that websites don’t like automated bots that visit just to gather information. They have systems that can figure out that your program is not an actual person and, after a burst of requests from your script, you usually get the dreaded HTTP 429 Too Many Requests error. This status code means that the server is rate limiting you: in practice, your IP address is blocked from querying the website for a certain amount of time. Your bot can go home and cry.
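To make the 429 case concrete, here is a minimal sketch of how a script might decide how long to pause after a response. It assumes the server may send the standard Retry-After header (either a number of seconds or an HTTP date); the function name retry_after_seconds and the 60-second default are illustrative choices, not something any particular site guarantees.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(status, headers, default=60.0, now=None):
    """Return how many seconds to pause, or 0.0 if we are not rate limited.

    `headers` is a plain dict here, so the key lookup is case sensitive;
    an HTTP client library would normally handle that for you.
    """
    if status != 429:
        return 0.0
    value = headers.get("Retry-After")
    if value is None:
        return default  # assumption: fall back to a fixed pause
    value = value.strip()
    if value.isdigit():
        return float(value)  # form 1: delay in seconds
    try:
        when = parsedate_to_datetime(value)  # form 2: HTTP date
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A caller would check the return value after each request and sleep (or give up and save its state) whenever it is greater than zero.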
Scraping Holy Grail
Google’s search engine is the obvious target for scraping. It gives you access to millions of URLs, ranked by relevance to your query. But the people at Google are very good at detecting bots, so it is a particularly difficult site to scrape. However, I present here a workaround that can get around Google’s barriers.
Winning your battles from the inside
The idea is very simple and can be stated in a few key points:
- If you scrape Google Search directly you will get caught after a while and will receive an HTTP 429 Error.
- Google App Engine lets you deploy services to the cloud. Instead of running on a specific host, they are dynamically distributed across containers.
- Each time you re-deploy your service, it lives in a different container with a different IP address.
- We can build a scraper and run it from App Engine. When we get caught we save the state, re-deploy our scraper, and keep scraping from a new container.
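The key points above boil down to a small control loop: keep a queue of pending queries, and when a block is detected, leave the failed query in the queue, re-deploy, and carry on. The sketch below only shows that logic. The names run_jobs, Blocked, fetch and redeploy are hypothetical: fetch stands for whatever performs one search and raises Blocked on an HTTP 429, and redeploy stands for whatever triggers a new deployment.

```python
from collections import deque


class Blocked(Exception):
    """Raised by the (hypothetical) fetch function when it sees HTTP 429."""


def run_jobs(queries, fetch, redeploy, max_redeploys=3):
    """Run every query, re-deploying whenever the scraper gets blocked.

    Returns (results, unfinished): the results gathered so far and the
    queries that were still pending when we stopped.
    """
    pending = deque(queries)  # saved state: queries not yet completed
    results = {}
    redeploys = 0
    while pending:
        query = pending[0]  # peek, so a blocked query is not lost
        try:
            results[query] = fetch(query)
        except Blocked:
            if redeploys >= max_redeploys:
                break  # give up; the caller can resume from `unfinished`
            redeploy()
            redeploys += 1
            continue  # retry the same query after the re-deploy
        pending.popleft()
    return results, list(pending)
```

Capping the number of re-deploys is a design choice of this sketch, so that a scraper that is blocked again immediately does not loop forever.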
OK, so how are we going to do this? We will follow a master-slave model for the architecture. The Slave will be deployed to the cloud and will run scraping jobs on demand. The Master will live on your local machine and will orchestrate the Slave (or Slaves, although we won't go that far) by sending it jobs.
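As a rough illustration of the contract between the two sides, the sketch below shows one possible job message and how the Slave might answer it. Everything here is an assumption made for the example: the Job fields, the JSON encoding, the handle_job function, and the search callable (which stands for the actual scraping code and is expected to return an HTTP status and a list of URLs).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Job:
    job_id: int
    query: str
    start: int = 0  # offset of the first result wanted (hypothetical field)


def encode_job(job):
    """Master side: serialize a job so it can be sent to the Slave."""
    return json.dumps(asdict(job))


def handle_job(payload, search):
    """Slave side: decode a job, run it, and report the outcome as JSON."""
    job = Job(**json.loads(payload))
    status, urls = search(job.query, job.start)
    # A 429 is reported as "blocked" so the Master knows to re-deploy.
    state = "blocked" if status == 429 else "done"
    return json.dumps({"job_id": job.job_id, "state": state, "urls": urls})
```

With a message like this, the Master only has to look at the state field of each reply to decide whether to send the next job or to re-deploy the Slave first.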