Scraping Flight Data with Python

Note:  The code for this project can be found in this GitHub repo.

I have been building a new project that requires flight prices.  I looked for APIs but couldn’t find any that were free.  However, I had a great experience using BeautifulSoup when scraping baseball data for my Fantasy Baseball auto-roster algorithm project, so I decided to gather my own data the same way.  To start, I chose a single airline: Southwest.  In this post I’ll step you through a couple of basics, including how to scrape when you need to submit form data.

My overall vision for the project is a service where users submit trips they are generally interested in and then get e-mails if the service finds a trip below a given price threshold.  So, for instance, maybe you live in Atlanta and are interested in weekend trips to Vegas if they’re less than $200.  The four main pieces are: a user interface for subscribing to trips; orchestrating the queries to get the data; analyzing the data to find suitable trips; and notifying the user when a trip is found.  This post focuses on querying the data.

Basic Scraping

The main code for scraping is in the src subdirectory of the repo above.  Almost all of the code in scraping.py is dedicated to manipulating the data to add structure to the HTML; the code that actually fetches the data is quite simple.  I’ll paste the critical parts here:

from datetime import datetime
import requests
from bs4 import BeautifulSoup

class SouthwestFlightData(object):
    def __init__(self, form_data):
        url = "https://www.southwest.com/flight/search-flight.html?preserveBugFareType=TRUE"
        response = requests.post(url, data=form_data)
        self.soup = BeautifulSoup(response.content, "lxml")


If you’re submitting data via a form you’ll need to use the requests package’s .post method.  You’ll then include the form’s data in a dictionary that looks like this:

{'form_variable1': value1, 'form_variable2': value2}


So how do you get the form variable names?  In Chrome, go to the website and fill out and submit the form like a normal user to load the page you want to scrape.  Then right-click and choose Inspect to open the developer console, select the Network tab, enter method:POST into the filter, select the request (search-flight.html in this case), choose the Headers tab, and scroll down to the Form Data section.  The screenshot below shows an example.

[Screenshot: Chrome developer console, Network tab, showing the Form Data for the search-flight.html POST request]


Only some of these fields are required.  You can dig into the HTML to find indicators of this, but I’ll leave that as an exercise for the reader :).  You of course need to assign values to these variables to complete your form dictionary, but that’s as easy as looking at the real form data and inferring what’s acceptable.  Sometimes it takes trial and error.
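For illustration, a minimal form dictionary might look like the following.  The field names and values here are hypothetical stand-ins; substitute the exact names you see in the Form Data panel, since the real form includes more fields.

# Hypothetical field names and values -- use what you observe in the
# Form Data panel for the actual request
form_data = {
    'originAirport': 'ATL',
    'destinationAirport': 'LAS',
    'outboundDateString': '06/02/2017',
    'returnDateString': '06/04/2017',
    'adultPassengerCount': '1',
    'twoWayTrip': 'true',
}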

We submit the URL and form data to the requests.post method to get the HTML response.  The response content is then used as the raw input to the BeautifulSoup parser, which lets us search for particular elements and extract their information.  For instance, to get all the flight numbers I have defined the following property:

@property
def flight_numbers(self):
    master_list = []
    # Scan each table row
    for row in self.soup.find_all('tr'):
        # New list to hold this route's flight numbers
        flight_list = []
        # Check whether the row has the relevant entry
        if row.find('a', class_='bugLinkText'):
            # Find all flights in the row
            for span in row.find_all('a', class_='bugLinkText'):
                flight_list.append(int(span.text.strip().split()[0]))
            master_list.append(flight_list)

    return master_list
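Putting it together, usage looks roughly like this, using the form_data dictionary sketched earlier:

flights = SouthwestFlightData(form_data)
print(flights.flight_numbers)
# e.g. [[1234], [567, 890]] -- one inner list per route, with multiple
# numbers when the route involves a layover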


This scans each table row, searches for a particular element (<a class="bugLinkText">), and then extracts the relevant information.  In this case a route can have multiple flight numbers due to layovers, so there’s an inner “for” loop to iterate over them.  Once you have the data you just need to clean it; here I had to strip whitespace and split off unnecessary text.  Figuring out the rules to extract the information you need is definitely the part that takes most of your time.  It can get pretty hacky.

The repo shows how to get the rest of the data, but I ultimately turn it into a pandas DataFrame object.
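As a rough sketch of that last step (the extra columns are hypothetical; the repo defines a property for each piece of data, analogous to flight_numbers above):

import pandas as pd

flights = SouthwestFlightData(form_data)
df = pd.DataFrame({
    'flight_numbers': flights.flight_numbers,
    # ...plus hypothetical companion properties, e.g.
    # 'price': flights.prices, 'depart_time': flights.depart_times
})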

Weekend Trip Searches

Now that I could programmatically query for specific trips, I wanted to automatically generate the queries that satisfy a particular use case.  Specifically, I wanted to search for trips between two cities that depart on a Friday and return on a Sunday.  This is implemented in the tripsearches.py file.

In general we just need to programmatically create form data and then pass it to the previous class to get a DataFrame of all the flight options.  In this particular case everything is specified except the dates, so there is some logic that finds all the possible Friday departures and Sunday returns, which results in several forms to use for the queries.
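A minimal sketch of that date logic, assuming we look a fixed number of weeks ahead (the function and its window are my own illustration, not necessarily the repo’s exact implementation):

from datetime import date, timedelta

def weekend_date_pairs(weeks_ahead=8):
    """Yield (friday, sunday) date pairs for the next several weekends."""
    # date.weekday(): Monday is 0, so Friday is 4
    today = date.today()
    friday = today + timedelta(days=(4 - today.weekday()) % 7)
    for _ in range(weeks_ahead):
        yield friday, friday + timedelta(days=2)
        friday += timedelta(weeks=1)

Each (friday, sunday) pair then fills in the departure and return date fields of a form dictionary, giving one query per weekend.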

Each of these trips returns several departure options and several return options, all with different prices.  However, we need all possible combinations of departure and return flights if we want the user to be able to specify an overall trip price maximum.  Joining the DataFrame to itself, using a unique trip identifier as the key, gives a DataFrame with every possible depart/return pairing.  This makes it easy to filter on a variety of constraints.
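In pandas that self-join is a one-liner.  Here is a sketch, assuming the combined DataFrame tags each row with a trip_id and a direction column (my column names, not necessarily the repo’s):

# Split the flight options into outbound and return legs
depart = df[df['direction'] == 'depart']
ret = df[df['direction'] == 'return']

# Self-join on the trip identifier to get every depart/return pairing
pairs = depart.merge(ret, on='trip_id', suffixes=('_depart', '_return'))

# Filtering on an overall trip price maximum is then a single expression
cheap = pairs[pairs['price_depart'] + pairs['price_return'] <= 200]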

That’s all for this post.  Feel free to play with the code or make suggestions on possible features for the service.  I’m currently working on the web interface, so I’ll try to find something worth writing about there.
