In this article, we will scrape the top 100 movies from IMDb using the
Beautiful Soup library in Python.
Importing required libraries:
First, we import the libraries required to scrape the data.
import pandas as pd  # to create dataframe
import requests  # to send the request to the URL
from bs4 import BeautifulSoup  # to get the content in the form of HTML
import numpy as np  # to count the values (in our case)
Sending HTTP request to the website:
- In this step, we first assign the link of the website we want to scrape to a variable named url.
- We then send an HTTP GET request to the IMDb website and parse the HTML content of the response using BeautifulSoup, which lets us navigate the webpage and extract data from it.
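The two steps above can be sketched as follows. This is a minimal sketch, not necessarily the exact code used for this article: the URL is an assumed placeholder for the IMDb listing page, fetch_soup is a hypothetical helper name, and the User-Agent header and timeout are additions that the steps above do not mention.

```python
import requests
from bs4 import BeautifulSoup

# Assumed placeholder: substitute the IMDb listing URL you want to scrape.
url = "https://www.imdb.com/search/title/?groups=top_100"


def fetch_soup(page_url):
    """Send a GET request and parse the response body as HTML."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(page_url, headers=headers, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# soup = fetch_soup(url)  # uncomment to perform the real request
```

Wrapping the request in a function keeps the network call in one place, so the rest of the script can work on the returned soup object.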
Getting the "div" elements to extract data from:
- Next, the code uses BeautifulSoup to find and store all the HTML elements whose class is lister-item mode-advanced in the parsed webpage content.
- Each element with this class represents an individual movie entry on the IMDb page, so we collect these elements for further data extraction and processing.
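As an illustration of this step, the sketch below runs the lookup on a small hand-written HTML string instead of the live page, so the sample markup and the variable name movie_data are assumptions. It also assumes the page uses the lister-item mode-advanced class described above; if IMDb changes its markup, the selector has to change with it.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the downloaded page (illustrative only, not real IMDb HTML).
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
<div class="sidebar">not a movie</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A class_ string containing a space matches the exact class attribute value.
movie_data = soup.find_all("div", class_="lister-item mode-advanced")
print(len(movie_data))  # 2: the sidebar div is not matched
```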
Loop through, scrape the content and create a dataframe:
- In the loop section, the code extracts various movie details for each entry on the IMDb webpage, including movie name, year of release, runtime, IMDb rating, Metascore (if available), number of votes, gross collection (if available), movie description, director, and the list of stars.
- This data is collected in separate lists, one per field, and those lists are then used to create a pandas DataFrame.
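The loop can be sketched as shown below for a subset of the fields. Everything specific here is an assumption made for illustration: the sample HTML, the inner class names (lister-item-header, lister-item-year, runtime, ratings-imdb-rating, metascore), the column labels, and the DataFrame name movies_df. The point is the pattern: one list per field, with a missing value (np.nan) recorded when an optional field such as Metascore is absent.

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample entries (illustrative only); the second has no Metascore.
sample_html = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a>Movie A</a>
    <span class="lister-item-year">(1994)</span></h3>
  <span class="runtime">142 min</span>
  <div class="ratings-imdb-rating"><strong>9.3</strong></div>
  <span class="metascore">82</span>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a>Movie B</a>
    <span class="lister-item-year">(1972)</span></h3>
  <span class="runtime">175 min</span>
  <div class="ratings-imdb-rating"><strong>9.2</strong></div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_data = soup.find_all("div", class_="lister-item mode-advanced")

names, years, runtimes, ratings, metascores = [], [], [], [], []
for store in movie_data:
    names.append(store.h3.a.text)
    # Strip the surrounding parentheses from "(1994)".
    years.append(store.h3.find("span", class_="lister-item-year").text.strip("()"))
    runtimes.append(store.find("span", class_="runtime").text.replace(" min", ""))
    ratings.append(float(store.find("div", class_="ratings-imdb-rating").strong.text))
    # Metascore is optional, so fall back to NaN when the element is missing.
    meta = store.find("span", class_="metascore")
    metascores.append(float(meta.text) if meta else np.nan)

movies_df = pd.DataFrame({
    "Movie Name": names,
    "Year": years,
    "Runtime (min)": runtimes,
    "IMDb Rating": ratings,
    "Metascore": metascores,
})
print(movies_df)
```

The remaining fields (votes, gross collection, description, director, stars) would follow the same pattern, each with its own list and, where the field is optional, its own fallback.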
Saving data in an Excel file:
Finally, if you want to save your data in an Excel file, you can add the following line of code to the program:
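A minimal sketch of this step, assuming the DataFrame is called movies_df and using a hypothetical helper save_to_excel with an example file name. Writing .xlsx files through pandas also needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def save_to_excel(df, path):
    """Write the DataFrame to an .xlsx file without the index column."""
    # DataFrame.to_excel relies on an Excel engine such as openpyxl.
    df.to_excel(path, index=False)


# Stand-in for the scraped DataFrame (illustrative values only).
movies_df = pd.DataFrame({"Movie Name": ["Movie A"], "IMDb Rating": [9.3]})

# save_to_excel(movies_df, "top_100_movies.xlsx")  # uncomment to write the file
```

Passing index=False keeps the row numbers out of the spreadsheet, so the file contains only the scraped columns.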