Scraping Google Search (without getting caught)

A scraping method resilient to IP blocking

Juan Luis Ruiz-Tagle
5 min read · Nov 7, 2020

Disclaimer: use the ideas here at your own risk.

The IP blocking problem

If you are into web scraping, you probably know that websites don’t like automated bots that pay a visit just to gather information. They have systems in place that can tell your program is not an actual person, and after a bunch of requests from your script you usually get the dreaded HTTP 429 Too Many Requests error. It means that your IP address has been blocked from querying the website for a certain amount of time. Your bot can go home and cry.
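As a rough sketch of what that looks like from the client side (the URL is just a placeholder, and real sites may or may not send the Retry-After header assumed here):

    import time
    import requests

    URL = "https://example.com/some-page"  # placeholder target, not a real endpoint

    for attempt in range(5):
        response = requests.get(URL, timeout=10)
        if response.status_code == 429:
            # The server is rate limiting this IP; it may hint how long to wait
            # (assuming Retry-After is given in seconds).
            wait_seconds = int(response.headers.get("Retry-After", "60"))
            print(f"HTTP 429 received, backing off for {wait_seconds} seconds")
            time.sleep(wait_seconds)
            continue
        response.raise_for_status()
        print(f"Fetched {len(response.text)} bytes on attempt {attempt + 1}")
        break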

Scraping Holy Grail

Google’s search engine is the obvious target for scraping. It gives you access to millions of URLs, ranked by relevance to your query. But the folks at Google are very good at detecting bots, which makes it a particularly difficult site to scrape. However, I present here a workaround that can bypass Google’s barriers.
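Before getting to the workaround, this is roughly what the naive, direct approach looks like. Treat it as an illustrative sketch: Google’s result markup changes often, so the link extraction below is only a guess, and after enough calls this is exactly the kind of script that starts receiving HTTP 429 responses.

    import requests
    from bs4 import BeautifulSoup

    def google_search(query):
        """Fetch a Google results page directly and pull out the outbound links."""
        response = requests.get(
            "https://www.google.com/search",
            params={"q": query},
            headers={"User-Agent": "Mozilla/5.0"},  # look a bit less like a bot
            timeout=10,
        )
        response.raise_for_status()  # this is where HTTP 429 eventually shows up
        soup = BeautifulSoup(response.text, "html.parser")
        # Google's markup changes often and depends on the user agent, so this
        # simple filter is only a rough guess at where the result links live.
        return [a.get("href") for a in soup.find_all("a")
                if a.get("href", "").startswith("http")]

    print(google_search("web scraping")[:10])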

Winning your battles from the inside

The idea is very simple and can be stated in a few key points:

  • If you scrape Google Search directly you will get caught after a while and will receive an HTTP 429 Error.
  • Google App Engine lets you deploy services to the cloud. Instead of running them in a specific host, they are…
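Building on these key points, a minimal version of such a service might look like the sketch below; the Flask front end, file names, and route are illustrative assumptions rather than a definitive implementation. The premise is that the service is not tied to a single host, and hence not to the single fixed IP address your own machine would expose.

    # app.yaml - minimal App Engine configuration (Python runtime assumed)
    runtime: python39

    # main.py - a tiny Flask service that performs the search server-side
    import requests
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/search")
    def search():
        query = request.args.get("q", "")
        # The request to Google now leaves from App Engine's infrastructure,
        # not from the fixed IP address of your own machine.
        response = requests.get(
            "https://www.google.com/search",
            params={"q": query},
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=10,
        )
        return response.text, response.status_code

Deployed with gcloud app deploy (with flask and requests listed in requirements.txt), the service would answer at https://<your-project>.appspot.com/search?q=…, so your local script only ever talks to your own App Engine endpoint.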
