Scraping Google Search (without getting caught)

A scraping method resilient to IP blocking

Disclaimer: use the ideas here at your own risk.

If you are into web scraping you probably know that websites don’t like automated bots that pay a visit just to gather information. They have set up systems that can figure out that your program is not an actual person and, after a bunch of requests from your script, you usually get the dreaded HTTP 429 Too Many Requests error. This status code means that you have sent too many requests in a short time; in practice, your IP address is blocked from querying the website for a certain amount of time. Your bot can go home and cry.

Google’s search engine is the obvious target for scraping. It gives you access to millions of URLs, ranked by relevance to your query. But the folks at Google are very good at detecting bots, so it is a particularly difficult site to scrape. However, I present here a workaround that can bypass Google’s barriers.

Winning your battles from the inside

The idea is very simple and can be stated in a few key points:

  • If you scrape Google Search directly you will get caught after a while and will receive an HTTP 429 Error.
  • Google App Engine lets you deploy services to the cloud. Instead of running on a specific host, they are dynamically distributed across containers.
  • Each time you re-deploy your service it lives in a different container with a different IP address.
  • We can build a scraper and run it from App Engine. When we get caught we save the state, re-deploy our scraper, and keep scraping from a new container.
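
The last key point can be sketched in a few lines. This is only an illustration, not the code from the repo: scrape_until_blocked and the fetch callable are hypothetical names, and the sketch assumes the block surfaces as a urllib.error.HTTPError with status 429.

```python
from urllib.error import HTTPError


def scrape_until_blocked(queries, fetch):
    """Run fetch(query) for each query until an HTTP 429 shows up.

    Returns (results, remaining): what was scraped so far and the queries
    still pending, which is the state to hand over to the next container.
    """
    results = {}
    for i, query in enumerate(queries):
        try:
            results[query] = fetch(query)
        except HTTPError as err:
            if err.code == 429:
                # Caught: stop here and report where to resume from.
                return results, list(queries[i:])
            raise  # any other HTTP error is not a block, so let it surface
    return results, []
```

The query that triggered the 429 stays in the remaining list, so nothing is lost when the scraper moves to a new container.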
Image by author

System architecture

Ok, so how are we going to do this? We will follow a master-slave model for the architecture. The Slave will be deployed to the cloud and will run scraping jobs on demand. The Master will live on your local machine and will orchestrate the slave (or slaves, although we won't go that far) by sending it jobs.

When the slave eventually fails by receiving a Too Many Requests error, the master will kill it and deploy a new slave, which will continue the job. Since it will have a new IP, Google won’t recognize it as the previous scraper. (I just realized that all this might seem pretty obscure to people unfamiliar with the jargon. Please don’t get me wrong, and take a look at the master-slave model.)

Anyways, a picture is worth a thousand words.

Scraper architecture based on the master-slave model. Image by author

Hands-on implementation

First, we need to configure a Google Cloud project. Only then can we build the Master and the Slave.

I won't go into the details, but to get this going you first need to set up Google Cloud and do a couple of things:

Building the Slave

Our slave will use Flask as a backbone for communications and a Scraper Object for the heavy lifting. Both (Flask app and Scraper) will run in separate processes and communicate through a Pipe. The Flask app will use it to send the requested jobs to the scraper, and the scraper will answer with its state when it changes.
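
To make the two-process setup more concrete, here is a minimal sketch using only the standard library. It is not the code from the repo: scraper_loop and start_scraper are hypothetical names, the actual scraping is left as a placeholder, and using None as a shutdown signal is my own assumption.

```python
from multiprocessing import Pipe, Process


def scraper_loop(conn):
    """Scraper side: wait for jobs on the pipe and report state changes."""
    conn.send("idle")
    while True:
        job = conn.recv()      # blocks until the Flask side sends a job
        if job is None:        # assumption: None means "shut down"
            break
        conn.send("busy")
        # ... the real scraping for `job` would happen here ...
        conn.send("idle")


def start_scraper():
    """Flask side: spawn the scraper process and keep our end of the pipe."""
    parent_conn, child_conn = Pipe()
    proc = Process(target=scraper_loop, args=(child_conn,), daemon=True)
    proc.start()
    return proc, parent_conn
```

The Flask side would call start_scraper() once, push each job with parent_conn.send(job), and read state changes with parent_conn.poll() and parent_conn.recv().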

As you can see in the scrape method, the slave just uses a Scraper() object, which receives a job (a dict whose parameters are not fixed by the slave) and returns a DataFrame. This gives you a lot of freedom to implement the Scraper() the way you like. For the scraper I wrote as an example (here), a job is defined as something like this:

job = {"query": "Football", "start": "2020-01-01", "end": "2020-03-29"}

When my scraper receives this job, it will retrieve the URLs of the top 10 results for “football”, filtering by each date between “start” and “end”. That is, the top 10 for “2020-01-01”, the top 10 for “2020-01-02”… I use the “googlesearch” package, which comes in very handy.
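
The per-day expansion described above could look roughly like this. It is a sketch with a hypothetical helper name (daily_queries), and it assumes “start” and “end” are ISO dates and that both ends are included. The actual search call is left out on purpose, since it depends on the search package you use.

```python
from datetime import date, timedelta


def daily_queries(job):
    """Expand a job into one (query, day) pair per day, both ends included."""
    start = date.fromisoformat(job["start"])
    end = date.fromisoformat(job["end"])
    n_days = (end - start).days + 1
    return [(job["query"], start + timedelta(days=i)) for i in range(n_days)]
```

Each pair would then be turned into one date-filtered search for the top 10 results.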

Three routes are defined for the server:

  • /start : called by the master at the very beginning. It creates the Scraper child process, which runs in the background.
  • /job : passes jobs from master to slave with the parameters encoded in the URL.
  • /state : tells the master the current state of the slave (“idle”, “busy” or “scraping-detected”).
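
As an illustration of what “parameters encoded in the URL” means for /job, here is a standard-library sketch of both directions. The helper names (build_job_url, parse_job_url) are hypothetical; in the Flask app, the slave side would more likely read the same values from request.args.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_job_url(base_url, job):
    """Master side: encode the job dict as the query string of /job."""
    return f"{base_url}/job?{urlencode(job)}"


def parse_job_url(url):
    """Slave side: recover the job dict from the query string."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each job parameter appears once.
    return {key: values[0] for key, values in params.items()}
```

For the example job, build_job_url("http://localhost:8080", job) yields a URL ending in /job?query=Football&start=2020-01-01&end=2020-03-29.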

Building the Master

The main method in the Master class (no pun intended) is orchestrate(), which periodically checks the state of the slave with check_slave_state and sends a job if it is idle.

If the state is scraping-detected, then restart_machine() is called and the app gets re-deployed. The master keeps track of the job which caused the failure and re-sends it to the slave once it is running again.

If the Slave stops answering (no-answer), it is because it is being re-deployed.
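
The loop described above can be sketched as a plain function. This is a simplified illustration, not the Master class from the repo: the three callables stand in for check_slave_state, the /job request and restart_machine(), and the sketch assumes that the slave reports “busy” before the next poll and that restart blocks until the deployment has finished.

```python
import time


def orchestrate(jobs, check_state, send_job, restart,
                poll_seconds=5.0, sleep=time.sleep):
    """Feed jobs to the slave one at a time, re-deploying when it is blocked."""
    pending = list(jobs)
    in_flight = None                 # job currently assigned to the slave
    while pending or in_flight is not None:
        state = check_state()
        if state == "idle":
            in_flight = None         # the previous job, if any, has finished
            if pending:
                in_flight = pending.pop(0)
                send_job(in_flight)
        elif state == "scraping-detected":
            if in_flight is not None:
                pending.insert(0, in_flight)   # re-send the job that failed
                in_flight = None
            restart()
        # "busy" and "no-answer" (slave being deployed): just keep waiting
        sleep(poll_seconds)
```

Passing the callables in keeps the loop independent of HTTP and gcloud, which also makes it easy to try out against a fake slave.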

Ok, but show me the code

All right, you can take a look at my GitHub repo. To test it locally you just need to run:

  • pip install -r requirements.txt to install all the needed dependencies
  • python to run the master
  • gunicorn -b :8080 slave:app --timeout 360000 --preload to run the slave (this is the command used to launch the server in Google Cloud)

Remember to run master and slave from different terminal sessions. The master execution from the example looks like this:

Deploying the slave to Google Cloud

After you have tested everything locally, the final step is deploying the slave to App Engine from your terminal. First, you should have logged in to Google Cloud through the gcloud command-line tool (run gcloud init and select your project). Then, the magic line to deploy the slave is…

gcloud app deploy

This takes the configuration from app.yaml and uses it to create your instance. Check the file and modify it to fit your project.
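
Since the master has to trigger this same deployment on its own, a restart_machine() could be as simple as the sketch below. The signature and the deploy_command helper are my assumptions for illustration, not the repo's code; --quiet and --project are standard gcloud flags, with --quiet skipping the confirmation prompt so the call can run unattended.

```python
import subprocess


def deploy_command(app_yaml="app.yaml", project=None):
    """Build the gcloud invocation for a non-interactive deploy."""
    cmd = ["gcloud", "app", "deploy", app_yaml, "--quiet"]
    if project is not None:
        cmd.append(f"--project={project}")
    return cmd


def restart_machine(app_yaml="app.yaml", project=None):
    """Re-deploy the slave; blocks until gcloud returns, raises on failure."""
    subprocess.run(deploy_command(app_yaml, project), check=True)
```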

Hope you found this insightful and feel free to ask me any questions in the comments section or by email.
