With the final season of Game of Thrones coming up, I wanted to do something fun. And by FUN, I mean writing a 3-part series on some cool things you can do with Python! Because why not?
Web Scraping in a nutshell
In this tutorial, we will be exploring web scraping. The big-picture process is:
- We’ll have Python visit a webpage.
- We’ll then parse that webpage with BeautifulSoup.
- We’ll then set up code to grab specific data.
- For example: You might want to grab all the h1 tags. Or all the links. Or in our case, all of the images on a page.
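Here is a rough sketch of those steps. To keep it runnable without a network call, it parses a small made-up page (`SAMPLE_HTML` is my own stand-in, not a real site); with a live page you would pass `requests.get(url).text` to BeautifulSoup instead.

```python
from bs4 import BeautifulSoup

# Made-up page standing in for a real response body
SAMPLE_HTML = """
<html><body>
  <h1>Winter is coming</h1>
  <a href="https://example.com/cast">Cast</a>
  <a href="/episodes">Episodes</a>
  <img src="https://example.com/jon.jpg" alt="Jon">
  <img src="/static/logo.png" alt="Logo">
</body></html>
"""

# Parse the HTML into a searchable tree
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Grab specific data: headings, link targets and image sources
headings = [h1.get_text(strip=True) for h1 in soup.find_all("h1")]
links = [a.get("href") for a in soup.find_all("a")]
images = [img.get("src") for img in soup.find_all("img")]

print(headings)
print(links)
print(images)
```

`find_all` is the current spelling of the `findAll` method; both names refer to the same search.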
Overall, a very simple process.
Except when it isn’t!
The challenge of Web Scraping for images
My goal was to take what I knew about scraping text content and apply it to grabbing images.
While web scraping for links, body text and headers is very straightforward, web scraping for images is significantly more complex. Let me explain.
As a web developer, you know that hosting MULTIPLE full-sized images on a single webpage will slow the whole page down. Instead, developers use thumbnails and only load the full-sized image when the thumbnail is clicked on.
For example: Imagine we had twenty 1 MB images on our web page. Upon landing, a visitor would have to download 20 MB worth of images! The more common method is to make twenty 10 KB thumbnail images. Now the payload is only 200 KB, or about 1/100 of the size!
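The arithmetic is easy to check in Python. This is just the example's numbers written out, assuming 1 MB = 1000 KB:

```python
# Sizes taken from the example, assuming 1 MB = 1000 KB
image_count = 20
full_size_kb = 1000   # one 1 MB image
thumbnail_kb = 10     # one 10 KB thumbnail

full_payload_kb = image_count * full_size_kb        # all full-sized images
thumbnail_payload_kb = image_count * thumbnail_kb   # all thumbnails

ratio = thumbnail_payload_kb / full_payload_kb
print(full_payload_kb, thumbnail_payload_kb, ratio)
```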
So what does this have to do with web scraping images and this tutorial?
It means it’s pretty difficult to write a generic block of code that works for every website. Websites implement all sorts of different ways to turn a thumbnail into a full-size image, which makes it a challenge to create a ‘one-size-fits-all’ model.
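To make that concrete, here is a sketch of two patterns a site might use, with a fallback. The markup, the `data-full` attribute and the `full_size_url` helper are all hypothetical; a real site will likely need its own rules.

```python
from bs4 import BeautifulSoup

# Made-up markup showing three different situations
SAMPLE_HTML = """
<a href="https://example.com/full/1.jpg"><img src="https://example.com/thumb/1.jpg"></a>
<img src="https://example.com/thumb/2.jpg" data-full="https://example.com/full/2.jpg">
<img src="https://example.com/only/3.jpg">
"""

def full_size_url(img):
    """Best guess at the full-size URL for one <img> tag (hypothetical rules)."""
    # Pattern 1: the full-size URL sits in a custom attribute on the tag
    if img.get("data-full"):
        return img["data-full"]
    # Pattern 2: the thumbnail is wrapped in a link to the full-size file
    parent_link = img.find_parent("a")
    if parent_link is not None and parent_link.get("href"):
        return parent_link["href"]
    # Fallback: use whatever the tag itself points at
    return img.get("src")

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
urls = [full_size_url(img) for img in soup.find_all("img")]
print(urls)
```

Each new site usually means adding or changing one of these rules, which is why a single generic scraper is hard to write.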
I’ll still teach what I learned. You’ll still gain a lot of skills from it. Just be aware that trying that code on other sites will require major modifications. Hurray for Zone of Proximal Development.
Python and Game of Thrones
The goal of this tutorial is to gather images of our favorite actors, which will let us do weird things like make a Teenage Crush Actor Collage that we can hang in our bedroom.
In order to gather those images, we’ll be using Python to do some web scraping. We’ll use the Requests library to visit a web page and the BeautifulSoup library to grab all the image tags from it.
Some use cases for Web Scraping:
- You can grab all the links on a web page.
- You can grab all the post titles within a forum.
- You can use it to grab the daily NASDAQ value without ever visiting the site.
- You can use it to download everything linked within a repo that doesn’t have a ‘Download All’ option.
Web Scraping allows you to automatically grab web content through Python.
NOTE: Many websites’ terms and conditions prohibit any web scraping of their data. Some develop APIs to allow you to tap into their data. Others do not. Additionally, be mindful that you are taking up their resources. So do one request at a time rather than opening lots of connections in parallel and grinding their site to a halt.
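As a minimal sketch of the "one request at a time" idea, the helper below walks a list of URLs in order and pauses between requests. The `fetch_one_at_a_time` and `fake_fetch` names are my own; in real use you might pass `requests.get` as the `fetch` argument.

```python
import time

def fetch_one_at_a_time(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL in order, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Be polite: wait before sending the next request
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Stand-in for a real download so the example runs offline
def fake_fetch(url):
    return "fetched " + url

demo = fetch_one_at_a_time(
    ["https://example.com/1.jpg", "https://example.com/2.jpg"],
    fake_fetch,
    delay_seconds=0,
)
print(demo)
```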
# Import the libraries needed
import requests
import time
from bs4 import BeautifulSoup

# The URL to scrape
url = 'https://www.popsugar.com/celebrity/Kit-Harington-Rose-Leslie-Cutest-Pictures-42389549?stream_view=1#photo-42389576'
#url = 'https://www.bing.com/images/search?q=jon+snow&FORM=HDRSC2'

# Connecting
response = requests.get(url)

# Grab the HTML and parse it with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# A loop to run through each image tag and download it
for i in range(len(soup.findAll('img'))):
    tag = soup.findAll('img')[i]
    link = tag['src']

    # Skip it if the link doesn't contain http
    if "http" in link:
        print("grabbed url: " + link)
        filename = str(i) + '.jpg'
        print("Download: " + filename)
        r = requests.get(link)
        open(filename, 'wb').write(r.content)
    else:
        print("grabbed url: " + link)
        print("skip")

    # Wait one second before the next request
    time.sleep(1)
Breaking down the code
Having Python Visit the Webpage
We start by importing the libraries needed, and then storing the webpage link into a variable.
- The Requests library is used to make all sorts of HTTP requests.
- The Time library is used to put a 1 second wait after each request. If we didn’t include that, the whole loop would fire off as fast as possible, which isn’t very friendly to the sites we are scraping from.
- The BeautifulSoup library is used to make exploring the DOM tree easier.
# Import the libraries needed
import requests
import time
from bs4 import BeautifulSoup

# The URL to scrape
url = 'https://www.popsugar.com/celebrity/Kit-Harington-Rose-Leslie-Cutest-Pictures-42389549?stream_view=1#photo-42389576'
#url = 'https://www.bing.com/images/search?q=jon+snow&FORM=HDRSC2'
Parse that webpage with BeautifulSoup
Next, we fetch the page with Requests and pass its HTML into BeautifulSoup.
# Connecting
response = requests.get(url)

# Grab the HTML and parse it with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
Grabbing the content
Finally, we use a loop to grab the content.
It starts with a FOR loop. BeautifulSoup does some cool filtering, where my code asks BeautifulSoup to find all the ‘img’ tags and store them in a temporary array. Then, the len function asks for the length of that array.
# A loop to run through each image tag and download it
for i in range(len(soup.findAll('img'))):
So in human words, if the array held 51 items, the code would look like
for i in range(51):
which makes i count from 0 up to 50.
Next, we’ll return to our soup object and do the real filtering.
    tag = soup.findAll('img')[i]
    link = tag['src']
Remember that we are in a for loop, so [i] represents a number.
So we are telling BeautifulSoup to findAll ‘img’ tags, store them in a temp array, and reference a specific index number based on where we are in the loop.
So instead of indexing a named array like allOfTheImages, we’re indexing the result of soup.findAll('img') directly, and then storing that one item in the tag variable.
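For comparison, here is a sketch of the named-array approach, where the document is searched once and the result is kept. The sample HTML is made up, and `all_of_the_images` is just the snake_case version of the name mentioned above.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a real page
SAMPLE_HTML = '<img src="a.jpg"><img src="b.jpg"><img src="c.jpg">'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Search the document once and keep the result
all_of_the_images = soup.find_all("img")

# enumerate hands back the index and the tag together
pairs = []
for i, tag in enumerate(all_of_the_images):
    pairs.append((i, tag["src"]))

print(pairs)
```

This avoids re-running the search on every pass through the loop, while `i` is still available for building filenames.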
The data in the tag variable will look something like:
<img src="smiley.gif" alt="Smiley face" height="42" width="42">
Which is why the next step is pulling out the ‘src’.
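A small sketch of reading attributes from that tag. Square brackets raise a `KeyError` when the attribute is missing, while `.get()` returns `None`, which can matter on pages where some `img` tags have no `src`. The `data-missing` name is just a placeholder for an absent attribute.

```python
from bs4 import BeautifulSoup

html = '<img src="smiley.gif" alt="Smiley face" height="42" width="42">'
tag = BeautifulSoup(html, "html.parser").img

src = tag["src"]                    # dictionary-style access
alt = tag.get("alt")                # same idea, but tolerant of missing keys
missing = tag.get("data-missing")   # attribute not present on this tag

print(src, alt, missing)
```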
Now we go to the final part of the loop: downloading the content.
There are a few odd design choices here that I want to point out.
- The IF statement is actually a hack I made for other sites I was testing. There were times when I was grabbing images that were part of the root site (like the favicon or the social media icons) that I didn’t want. Using the IF statement allowed me to ignore them.
- I also forced all the images to be .jpg. I could have written another chunk of IF statements to check the datatype, and then append the correct filetype. But that would have added a significant chunk of code and made this tutorial longer.
- I also added all the print commands. If you wanted to grab all the links of a webpage, or specific content, you can stop right here! You did it!
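If you do want to handle the first two points differently, here is one possible sketch using only the standard library. It resolves relative links against the page URL instead of skipping them, and reads the extension from the URL instead of forcing .jpg. The helper names, the extension list and the example URLs are my own assumptions, not part of the tutorial code.

```python
import os
from urllib.parse import urljoin, urlparse

# Extensions we are willing to trust; anything else falls back to .jpg
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def resolve_link(page_url, src):
    """Turn a src value (absolute or relative) into an absolute URL."""
    return urljoin(page_url, src)

def pick_extension(link, default=".jpg"):
    """Use the extension found in the URL path, else fall back to default."""
    path = urlparse(link).path  # ignores any ?query or #fragment
    extension = os.path.splitext(path)[1].lower()
    return extension if extension in KNOWN_EXTENSIONS else default

page_url = "https://example.com/gallery/page.html"  # made-up page
absolute = resolve_link(page_url, "/static/logo.PNG?v=2")
filename = "0" + pick_extension(absolute)
print(absolute, filename)
```

Going by the URL is only a guess at the real file type; checking the response's Content-Type header would be another option.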
The one thing I do want to point out is the requests.get(link) and the open(filename, 'wb').write(r.content) code.
requests.get fetches that link directly.
And open is a built-in Python function that opens ‘filename’ in ‘wb’ mode (write, binary), then writes the response’s content (which is the image) into that file.
Learn more about open here.
    # Skip it if the link doesn't contain http
    if "http" in link:
        print("grabbed url: " + link)
        filename = str(i) + '.jpg'
        print("Download: " + filename)
        r = requests.get(link)
        open(filename, 'wb').write(r.content)
    else:
        print("grabbed url: " + link)
        print("skip")

    # Wait one second before the next request
    time.sleep(1)
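A slightly more defensive way to write the file is a `with` block, which closes the file even if the write fails. This is a sketch rather than the code used above: `save_bytes` is a hypothetical helper, and the bytes are a stand-in for `r.content` so it runs without a download.

```python
import os
import tempfile

def save_bytes(filename, content):
    """Write raw bytes to filename; the with block closes the file for us."""
    with open(filename, "wb") as image_file:
        image_file.write(content)

# Stand-in bytes; in the tutorial this would be r.content
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"

# Write into a temporary folder so the example leaves no clutter behind
target = os.path.join(tempfile.mkdtemp(), "0.jpg")
save_bytes(target, fake_image)

# Read the file back in binary mode to confirm what was written
with open(target, "rb") as saved:
    round_trip = saved.read()

print(len(round_trip))
```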
Web scraping has a lot of useful applications.
This code won’t work right out of the box for most sites with images, but it can serve as a foundation for grabbing images from different sites.