A very basic web scraper that gathers financial data from a website and saves the results to CSV for further data analysis.
Learn the concepts of how to scrape data from websites and understand the different Python technologies available for web scraping.
- How to build a basic web scraper
- How to use the requests library to make HTTP requests
- How to save data to a CSV
BeautifulSoup is a very user-friendly Python library, but it does require additional libraries to work effectively (e.g. lxml).
BeautifulSoup is especially useful for quickly building small or one-off web scraping scripts (e.g. I used these skills to scrape all of the images from my old blog).
More complicated projects that require multi-page scraping etc. might be better built using Scrapy.
Step 1: Install Required Packages
All of the below (except for Python) come pre-installed with the Anaconda package, but they can be manually installed via pip if required.
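If you are not using Anaconda, the packages can be installed manually with pip. Note that the PyPI package for BeautifulSoup is named `beautifulsoup4` (it provides the `bs4` module):

```shell
# Install the scraping dependencies from PyPI
pip install requests beautifulsoup4 lxml
```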
Step 2: Become familiar with the DOM of the website
Access the webpage you want to scrape and inspect the DOM. Spend some time becoming familiar with the structure of the DOM and the CSS attributes of the content you want to scrape.
The example website I will scrape is the Yahoo Finance World Indices page.
Step 3: Create a Python script and import dependencies
Create a Python script with the .py extension (e.g. “webscraper.py”) and import the required dependencies.
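A minimal set of imports for this tutorial might look like the following (`csv` is in the standard library; `requests` and `bs4` were installed in Step 1):

```python
import csv                     # standard library, for writing the results file
import requests                # for making the HTTP request
from bs4 import BeautifulSoup  # for parsing the returned HTML
```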
Step 4: Parse the HTML into a BeautifulSoup object
Grab the website content and create a BeautifulSoup object.
Check your work so far by printing the object.
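As a sketch, fetching and parsing might look like this (the URL and the browser-like User-Agent header are assumptions; some sites refuse requests without such a header, and the live fetch is commented out so the script can be tested offline):

```python
import requests
from bs4 import BeautifulSoup

# URL of the page scraped in this tutorial (assumed)
url = "https://finance.yahoo.com/world-indices"

def get_soup(page_url):
    """Fetch a page and parse its HTML into a BeautifulSoup object."""
    response = requests.get(page_url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # fail loudly on HTTP errors
    return BeautifulSoup(response.text, "lxml")

# soup = get_soup(url)
# print(soup.prettify())  # check your work by printing the parsed object
```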
Step 5: Grab a single element from the soup object
Start small and grab a single element from the page. This way we can test that our code works correctly before scaling it up and grabbing multiple elements.
Use the .find() method to find and store the parent element with all of the content you want to scrape.
Ensure you are regularly testing your code by printing the results to avoid bugs in the final code.
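To illustrate, here is a sketch of `.find()` run against a small inline HTML fragment shaped like the Yahoo Finance table (the markup and class names are simplified assumptions, not the live page's exact structure):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one row of the World Indices table (assumption)
html = """
<table>
  <tr>
    <td class="text">S&amp;P 500</td>
    <td>4,100.00</td>
    <td>4,105.12</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the parent row, then the single element holding the index name
row = soup.find("tr")
index_name = row.find("td", class_="text")

print(index_name.text)  # → S&P 500
```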
Step 6: Grab multiple elements from the HTML object
Once we have established that we are correctly grabbing a single element from the page, we can then refactor our code and use the .find_all() method to grab all of the elements from the page.
- The singular .find() method is used when finding the index_name as it has a unique class name of class="text"
- The plural .find_all() method is used when finding all of the td elements, as there are multiple, non-unique td elements in each row containing financial data
- The close price data is stored in the 3rd td element of each row
- The printing of the results above is only done to test the code. In the next step we will store the results into a variable for writing to CSV.
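The bullet points above can be sketched like this, again against a simplified stand-in for the real table (on the live page the row structure and cell positions may differ):

```python
from bs4 import BeautifulSoup

# Two simplified rows standing in for the World Indices table (assumption)
html = """
<table>
  <tr><td class="text">S&amp;P 500</td><td>4,100.00</td><td>4,105.12</td></tr>
  <tr><td class="text">FTSE 100</td><td>7,500.00</td><td>7,512.34</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all("tr"):
    index_name = row.find("td", class_="text").text  # unique class, so .find()
    cells = row.find_all("td")                       # multiple non-unique td elements
    close_price = cells[2].text                      # close price in the 3rd td (index 2 here)
    print(index_name, close_price)                   # printed only to test the code
```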
Step 7: Export results to CSV
Create the CSV file and write the header row to it. This code needs to be included before the for loop.
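A sketch of the CSV export (the file name and column headers are assumptions; the key point is that the file is opened and the header row written before the scraping loop runs):

```python
import csv
from bs4 import BeautifulSoup

# Simplified stand-in for the scraped table (assumption)
html = """
<table>
  <tr><td class="text">S&amp;P 500</td><td>4,100.00</td><td>4,105.12</td></tr>
  <tr><td class="text">FTSE 100</td><td>7,500.00</td><td>7,512.34</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Open the file and write the header row BEFORE the for loop
with open("indices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index_name", "close_price"])  # assumed column headers
    for row in soup.find_all("tr"):
        name = row.find("td", class_="text").text
        close = row.find_all("td")[2].text
        writer.writerow([name, close])
```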
Step 8: Run the final script
Either run the script directly from Sublime (Command + B) or run the script from your terminal.
The final script I created does not exactly match the instructions above, as it includes refactoring to capture data from multiple fields, replace non-numerical data with zeros, etc.
The final script can be downloaded here.