Originally published at: R Tutorial: Automated Web Scraping Using RVest
In this R tutorial, we show you how to use rvest to scrape the web automatically on a schedule, so you can analyze timely or frequently updated data.
There are many blogs and tutorials that teach you how to scrape data from a bunch of web pages once and then you’re done. But one-off web scraping is not useful for many applications that require sentiment analysis on recent or timely content, or capturing changing events and commentary, or analyzing trends in real time. As fun as it is to do an academic exercise of web scraping for one-off analysis on historical data, it is not useful when you want to work with timely or frequently updated data.
Suppose you would like to tap into news sources to analyze political events that are changing by the hour, along with people’s comments on these events. The events and comments could be analyzed to summarize the key discussions and debates in the comments, rate the overall sentiment of the comments, find the key themes in the headlines, see how events and commentary change over time, and more. To do this, you need a collection of recent political events or news, scraped every hour.
What we’ll do:
We’ll go through the process of writing standard web scraping commands in R, filtering timely data, analyzing or summarizing key information in the text, and sending an email alert of the results of your analysis. We’ll set up our script to run every hour so that text is scraped and analyzed periodically to capture changing events and commentary, or analyze trends in real time.
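To make the first two steps concrete, here is a minimal sketch of scraping and then filtering for recent items with rvest. It is an illustration under stated assumptions, not the script built later in this tutorial: the HTML snippet, the `.story` class, the `datetime` attribute, and the fixed `now` value are all hypothetical stand-ins so the example runs offline. A real script would call `read_html()` on the site's URL and use `Sys.time()`. Scheduling and the email alert are left out here.

```r
library(rvest)

# Hypothetical stand-in for a news page, so the sketch runs without a network.
# A real script would use read_html("<url of the site>") instead.
page <- minimal_html('
  <div class="story">
    <h2>Budget vote delayed</h2>
    <time datetime="2024-05-01T09:40:00Z"></time>
  </div>
  <div class="story">
    <h2>Committee hearing recap</h2>
    <time datetime="2024-05-01T06:15:00Z"></time>
  </div>
')

# Standard scraping commands: select each story, then pull one headline
# and one timestamp per story (html_element keeps them aligned).
stories   <- html_elements(page, ".story")
headlines <- html_text2(html_element(stories, "h2"))
stamps    <- html_attr(html_element(stories, "time"), "datetime")

# Parse the timestamps as UTC date-times.
published <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Fixed reference time for reproducibility (assumption); use Sys.time() in practice.
now <- as.POSIXct("2024-05-01 10:00:00", tz = "UTC")

# Filter timely data: keep only stories published within the last hour.
age_hours <- as.numeric(difftime(now, published, units = "hours"))
recent <- data.frame(headline = headlines, published = published)[age_hours <= 1, ]

print(recent)
```

With the sample data above, only the first story falls within the one-hour window, so `recent` holds a single row.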
Let’s go fetch your data!