Web Scraping using Beautiful Soup

Web Scraping

Web scraping is a powerful tool for any data professional: with it, the entire internet becomes your database. This Python tutorial introduces the fundamentals of web scraping and shows how to parse a web page into a CSV data file using the Python package BeautifulSoup.

Many services augment their business data, or even build their entire business, by using web scraping. Companies can also scrape product reviews from places like Amazon to stay up to date with what customers are saying about their products.

The Code

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

URL to scrape from

In this example we scrape graphics cards from Newegg.com.
page_url = "..."  # URL of the Newegg listing page to scrape

Opens the connection and downloads the HTML page from the URL.

uClient = uReq(page_url)

Parses the HTML into a soup data structure, a tree of tags that can be traversed much like nested JSON.

page_soup = soup(uClient.read(), "html.parser")
uClient.close()
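The open, read, and close steps above can also be wrapped in a small helper. The sketch below is one possible shape, not the tutorial's own code: the function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # The with-block closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the response headers if there is one.
        # Falling back to UTF-8 is an assumption, not a server guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

With a helper like this, the parsing step would become `page_soup = soup(fetch_html(page_url), "html.parser")`. Network failures surface as `urllib.error.URLError`, which the caller can catch if the script should keep going.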

Finds each product from the store page.

containers = page_soup.find_all("div", {"class": "item-container"})
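To see what this lookup returns without hitting the network, it may help to run it on a small inline document. The markup below is invented for illustration and is much simpler than the real Newegg page; only the `item-container` class name comes from the tutorial, and the `item-title` and `sidebar` classes are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a store page; the real HTML differs.
SAMPLE_HTML = """
<div class="item-container"><a class="item-title">Card A</a></div>
<div class="item-container"><a class="item-title">Card B</a></div>
<div class="sidebar"><a>Not a product</a></div>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every <div> whose class attribute includes "item-container".
containers = page_soup.find_all("div", {"class": "item-container"})

# Each container is itself searchable, so lookups can be scoped to one product.
titles = [c.find("a", {"class": "item-title"}).text for c in containers]

print(len(containers), titles)
```

The sidebar `div` is skipped because its class does not match, which is why the loop later in the tutorial only ever sees product blocks.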

Name the output file to write to local disk.

out_filename = "graphics_cards.csv"
# header of csv file to be written
headers = "brand,product_name,shipping\n"

Opens file and writes headers.

f = open(out_filename, "w")
f.write(headers)

Loops over each product and grabs attributes about each product.

for container in containers:
  #Finds all link tags "a" from within the first div
  make_rating_sp = container.div.select("a")
  
  #Grabs the title from the image title attribute then does proper casing using .title()
  brand = make_rating_sp[0].img["title"].title()
  
  #Grabs the text within the third "a" tag (index 2) from the list of matches
  product_name = container.div.select("a")[2].text
  
  #Grabs the product shipping information by searching all "li" tags with the class "price-ship",
  #then strips surrounding white space with strip() and removes "$" and " Shipping"
  #if they exist, to leave just the number
  shipping = container.find_all("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")
  
  #Prints the dataset to console
  print("brand: " + brand + "\n")
  print("product_name: " + product_name + "\n")
  print("shipping: " + shipping + "\n")
  
  #Writes the dataset to file, using bare commas so the rows match the header format
  f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")
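The chained string calls used for the shipping field in the loop above are easier to follow when pulled out and tried on their own. The sketch below does that; the helper name `clean_shipping` and the sample strings are assumptions for illustration, not text captured from Newegg.

```python
def clean_shipping(raw_text):
    """Reduce a label such as '$4.99 Shipping' to '4.99' (hypothetical helper)."""
    # Same steps as the loop: trim whitespace, then drop "$" and " Shipping".
    return raw_text.strip().replace("$", "").replace(" Shipping", "")


print(clean_shipping("\n  $4.99 Shipping  \n"))  # → 4.99
print(clean_shipping("Free Shipping"))           # → Free
```

Note that the result is still a string. Converting it to a number would need extra handling for non-numeric labels such as the second example.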

Close the file

f.close()
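As an alternative to closing the file by hand, the file can be opened in a `with` block, which closes it automatically. Python's standard `csv` module also quotes any field that contains a comma, so product names would not need their commas replaced. The sketch below is a variation on the tutorial's approach rather than its own code; the rows are made up, and the file goes to a temporary directory only to keep the example tidy.

```python
import csv
import os
import tempfile

# Made-up rows for illustration; real values would come from the scraping loop.
rows = [
    ("Evga", "GeForce GTX 1080, 8GB", "Free"),
    ("Msi", "Radeon RX 580", "4.99"),
]

# The temporary directory and everything in it is removed when the block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "graphics_cards.csv")

    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["brand", "product_name", "shipping"])
        writer.writerows(rows)  # the comma in the first product name gets quoted

    # Reading the file back returns the original fields, comma included.
    with open(out_path, newline="", encoding="utf-8") as f:
        read_back = list(csv.reader(f))

print(read_back[1])
```

In the scraping script itself, the same pattern would mean creating the writer once before the loop and calling `writer.writerow([brand, product_name, shipping])` for each product.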