Capture Data at a Speed That Is Humanly Impossible: A Step-by-Step Python Tutorial
Welcome to this exciting Python tutorial on web scraping! In this tutorial, we will learn how to extract and transform data from websites using Python. Web scraping allows us to gather information from web pages and store it in a structured format, such as a CSV file. We will be using the requests, BeautifulSoup, and pandas libraries to accomplish this task.
Step 1: Extracting Data
To begin, we need to extract data from a website. We will use the requests library to send an HTTP request to the website and retrieve its HTML content. Here’s the code snippet to extract data:
import requests
from bs4 import BeautifulSoup

def extract(url):
    # Send a browser-like User-Agent so the site does not reject the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # '#listings' is CSS id syntax, so look the container up by id rather than class
    listings = soup.find('div', id='listings')
    # find_all() takes tag names ('a'), not attribute names ('href');
    # return the listing links so transform() can iterate over them
    return listings.find_all('a')
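A common stumbling block here is mixing CSS selector syntax into `find()`. In BeautifulSoup, `find()` takes attribute values directly (`id='listings'`), while `select()` takes CSS selectors, where `#` means "id". The sketch below illustrates the difference on a small hypothetical HTML snippet (the markup and class names are made up for the demonstration, not taken from the real site):

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page standing in for the real listings markup
html = """
<div id="listings">
  <a href="/company/1"><b class="company_name">Acme Travel</b></a>
  <a href="/company/2"><b class="company_name">Globe Tours</b></a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() takes attribute values directly: id='listings', not the CSS form '#listings'
container = soup.find('div', id='listings')
links = container.find_all('a')  # find_all takes tag names; 'href' is an attribute, not a tag
print([a['href'] for a in links])  # ['/company/1', '/company/2']

# select() takes CSS selectors, where '#' does mean "match by id"
same_links = soup.select('div#listings a')
print(len(same_links))  # 2
```

Both calls return the same two anchors; the only difference is which query syntax each method expects.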
Step 2: Transforming Data
Once we have extracted the data, we need to transform it into a structured format. In this case, we will pull the name, address, phone, and tel fields from each item on the page. We also wrap the optional fields in error handling, so that a missing or renamed element does not crash the whole script mid-run. Here’s the code snippet to transform the data with peace of mind:
def transform(articles):
    for item in articles:
        # These are CSS selectors ('b.company_name', 'div.text.location', ...),
        # so select_one() is the right tool rather than find()
        name = item.select_one('b.company_name').text
        address = item.select_one('div.text.location').text.strip().replace('\n', '')
        try:
            phone = item.select_one('div.text.phone').text.strip()
        except AttributeError:
            # select_one() returns None when the element is missing; default to ''
            phone = ''
        try:
            tel = item.select_one('div.text').text.strip()
        except AttributeError:
            tel = ''
        business = {
            'name': name,
            'address': address,
            'phone': phone,
            'tel': tel
        }
        main_list.append(business)
    return
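To see why the try/except matters, here is a minimal sketch of the same pattern on two hypothetical listing items (the markup, class names, and values are invented for illustration). The second item has no phone element, which is exactly the case the error handling guards against:

```python
from bs4 import BeautifulSoup

# Two made-up listing items: the second one is missing its phone element
html = """
<div class="item"><b class="company_name">Acme Travel</b>
  <div class="text phone">011 555 0101</div></div>
<div class="item"><b class="company_name">Globe Tours</b></div>
"""
soup = BeautifulSoup(html, 'html.parser')

results = []
for item in soup.find_all('div', class_='item'):
    name = item.select_one('b.company_name').text
    try:
        # select_one() returns None when nothing matches, so .text raises AttributeError
        phone = item.select_one('div.text.phone').text.strip()
    except AttributeError:
        phone = ''  # fall back to an empty string instead of crashing
    results.append({'name': name, 'phone': phone})

print(results)
```

The loop finishes cleanly and the second record simply carries an empty phone field, rather than the whole run dying on the first incomplete listing.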
Step 3: Loading Data
Finally, we need to load the transformed data into a CSV file. We will use the pandas library to create a DataFrame and save it as a CSV file. Here’s the code snippet to load the data:
import pandas as pd

def load():
    df = pd.DataFrame(main_list)
    df.to_csv('travelagents-yellosa.csv', index=False)
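To check the load step end to end, we can build a DataFrame from a small hypothetical main_list (the values below are made up, shaped like the dictionaries transform() produces), write it out, and read it back:

```python
import pandas as pd

# A small hypothetical main_list, mirroring the shape transform() builds
main_list = [
    {'name': 'Acme Travel', 'address': '1 Main Rd', 'phone': '011 555 0101', 'tel': ''},
    {'name': 'Globe Tours', 'address': '2 High St', 'phone': '', 'tel': '021 555 0202'},
]

df = pd.DataFrame(main_list)
df.to_csv('travelagents-yellosa.csv', index=False)  # index=False drops the row-number column

# Read the file back to confirm the round trip preserved rows and columns;
# keep_default_na=False keeps empty fields as '' instead of NaN
check = pd.read_csv('travelagents-yellosa.csv', keep_default_na=False)
print(check.shape)          # (2, 4)
print(list(check.columns))  # ['name', 'address', 'phone', 'tel']
```

Passing index=False is worth the habit: without it, the row index is written as an unnamed extra column that reappears on every subsequent read.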
Putting It All Together
Now that we have defined the necessary functions, let’s put them together and run the web scraping process. We will iterate through multiple pages of the website and extract the data from each page. Here’s the code snippet to run the web scraping process:
import time

main_list = []
for x in range(1, 2):  # widen this range to scrape additional pages
    print(f'Getting page {x}')
    articles = extract(f'https://www.yellosa.co.za/category/travel-agents/{x}')
    transform(articles)
    time.sleep(5)  # pause between requests to avoid hammering the server
load()
print('Saved to CSV')
Conclusion
Congratulations! You have successfully completed this Python tutorial on web scraping. You have learned how to extract and transform data from websites using Python. Remember to always be mindful of the legality of web scraping and respect the website’s terms of service. Happy scraping!
