Web Scraping
In this article, we will scrape the top 100 movies from IMDb using the Beautiful Soup
library in Python.
Importing required libraries:
First of all, we are going to import the libraries required to scrape the data.
import pandas as pd # to create dataframe
import requests # to send the request to the URL
from bs4 import BeautifulSoup # to get the content in the form of HTML
import numpy as np # to count the values (in our case)
Sending HTTP request to the website:
- In this step, we first assign the link of the website we want to scrape to a variable named url.
- We then send an HTTP GET request to the IMDb website and parse the HTML content of the response using BeautifulSoup, allowing us to scrape and manipulate the data on the webpage.
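The request-and-parse step described above might look roughly like the sketch below. The exact URL is not shown in this article, so the value of url here is an assumption, as are the helper names parse_html and fetch_soup, the request headers, and the timeout.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL: the article does not show the exact link, so treat this as a placeholder.
url = "https://www.imdb.com/search/title/?groups=top_100"


def parse_html(html):
    """Parse raw HTML (str or bytes) into a BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(page_url, timeout=10):
    """Send an HTTP GET request and return the parsed response (hypothetical helper)."""
    # Assumption: a browser-like User-Agent, since some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"}
    response = requests.get(page_url, headers=headers, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return parse_html(response.content)
```

Calling soup = fetch_soup(url) would then give the parsed page used in the following steps. Keeping the parsing in its own function makes it possible to try the later steps on a saved HTML file without sending a request.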
Get the required meaningful "div" element to extract data from:
- Next, the code uses BeautifulSoup to find and store all the HTML elements with the class lister-item mode-advanced from the parsed webpage content.
- This class represents individual movie entries on the IMDb page, and we are extracting these elements for further data extraction and processing.
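As a small illustration of this step, the sketch below runs the lookup on a hand-written HTML snippet instead of the live page. The snippet and the variable name movie_data are assumptions made for the example; only the class name lister-item mode-advanced comes from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the parsed IMDb page (not the real markup).
sample_html = """
<div class="lister-list">
  <div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
  <div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
  <div class="other">Not a movie</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A class string containing a space is matched against the exact class attribute value.
movie_data = soup.find_all("div", class_="lister-item mode-advanced")

print(len(movie_data))  # 2
```

Each element in movie_data is a Tag, so the details of one movie can be searched for inside it without touching the rest of the page.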
Loop through, scrape the content and create a dataframe:
- In the loop section, the code extracts various movie details for each entry on the IMDb webpage, including movie name, year of release, runtime, IMDb rating, Metascore (if available), number of votes, gross collection (if available), movie description, director, and the list of stars.
- This data is collected and stored in respective lists, and then we use those lists to create a Pandas DataFrame named movie_df.
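The loop could be sketched as follows. To keep the example runnable offline, it works on a simplified, hypothetical snippet with two entries, so the tag and class names inside each entry are assumptions and not necessarily IMDb's actual markup. It covers only some of the fields listed above (description, director, and stars would follow the same pattern) and shows how optional values such as Metascore and gross can be handled when they are missing.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified, hypothetical markup standing in for two movie entries.
sample_html = """
<div class="lister-item mode-advanced">
  <h3><a>Movie A</a><span class="lister-item-year">(1994)</span></h3>
  <span class="runtime">142 min</span>
  <div class="ratings-imdb-rating" data-value="9.3"></div>
  <span class="metascore">82</span>
  <span name="nv">2,500,000</span>
  <span name="nv">$28.34M</span>
</div>
<div class="lister-item mode-advanced">
  <h3><a>Movie B</a><span class="lister-item-year">(2008)</span></h3>
  <span class="runtime">152 min</span>
  <div class="ratings-imdb-rating" data-value="9.0"></div>
  <span name="nv">2,400,000</span>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(sample_html, "html.parser")
movie_name, year, runtime, rating, metascore, votes, gross = [], [], [], [], [], [], []

for entry in soup.find_all("div", class_="lister-item mode-advanced"):
    movie_name.append(entry.h3.a.get_text(strip=True))
    year.append(entry.find("span", class_="lister-item-year").get_text(strip=True).strip("()"))
    runtime.append(entry.find("span", class_="runtime").get_text(strip=True).replace(" min", ""))
    rating.append(float(entry.find("div", class_="ratings-imdb-rating")["data-value"]))
    # Metascore is optional, so fall back to None when the tag is absent.
    metascore.append(text_or_none(entry.find("span", class_="metascore")))
    # "name" is also a find_all() parameter, so the HTML attribute goes through attrs.
    nv = entry.find_all("span", attrs={"name": "nv"})
    votes.append(nv[0].get_text(strip=True))
    gross.append(nv[1].get_text(strip=True) if len(nv) > 1 else None)

movie_df = pd.DataFrame({
    "Movie Name": movie_name,
    "Year of Release": year,
    "Runtime (min)": runtime,
    "IMDb Rating": rating,
    "Metascore": metascore,
    "Votes": votes,
    "Gross": gross,
})
print(movie_df)
```

Appending None for a missing value keeps all lists the same length, which the DataFrame constructor requires when it is given a dictionary of lists.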
Saving data in an Excel file:
Finally, if you want to save your data in an Excel file, you can add the following line of code to the program (pandas needs the openpyxl package installed to write .xlsx files):
movie_df.to_excel("Top_100_IMDB_Movies.xlsx")