Extracting Live Location Data

May 17, 2026

As you may know from the main page of this Dissertation Diary, on May 12th I asked Claude to guide me on scraping live bus location data from the Bus Open Data Service (BODS). It instead pointed me to the work done by Open Innovations, who have been scraping the data in GTFS-RT format and archiving it publicly since June 18th, 2025. So all I needed to do was convert it into a retrospective GTFS format that I can plug into r5py, in order to do accessibility analysis based on actual bus journeys instead of timetabled schedules (GTFS-RT is not the same as GTFS! Read more here). Even that conversion has already been done by Open Innovations, who have shared their Python code via a separate public GitHub repository!


Get the Relevant Morning Peak Hour GTFS-RT Data
Firstly, I downloaded the GTFS-RT data for one specific day (September 17, 2025) from the Open Innovations archive. I chose this day because CfC's report indicated that they collected data on one specific midweek day in September 2025 (so I have three other choices, assuming that they meant a Wednesday). Apparently, GTFS-RT was scraped from BODS into the archive every 30 seconds, with the timestamp recorded in UTC as part of each zipped folder's name. So, to isolate the realtime data for the morning peak (so as to be as close to the CfC report as possible), I extracted the zipped folders with timestamps from 0600 to 0830 UTC, which corresponds to 0700 to 0930 in BST (the UK only leaves Summer Time in late October), into a new separate folder (let's call this folder gtfsrt_morning_17Sept).
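
A minimal sketch of this time-window filter might look like the following. It assumes, purely for illustration, that each archive name embeds its UTC timestamp in the form `YYYYMMDDTHHMMSS`; the actual naming scheme in the Open Innovations archive may differ, and the file names in the example are made up. BST is modelled as a fixed UTC+1 offset, which holds for a September date.

```python
import re
from datetime import datetime, time, timedelta, timezone

# ASSUMPTION: archive names contain a UTC timestamp such as 20250917T060030.
# Adjust the pattern to whatever the real archive uses.
STAMP = re.compile(r"(\d{8})T(\d{6})")
BST = timezone(timedelta(hours=1))  # British Summer Time is UTC+1


def utc_stamp(name):
    """Return the UTC datetime embedded in an archive name, or None."""
    match = STAMP.search(name)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def in_morning_peak(name, start=time(6, 0), end=time(8, 30)):
    """True if the archive's UTC time of day falls within [start, end]."""
    stamp = utc_stamp(name)
    return stamp is not None and start <= stamp.time() <= end


# Hypothetical file names, for illustration only.
names = [
    "gtfsrt_20250917T055930.zip",
    "gtfsrt_20250917T060000.zip",
    "gtfsrt_20250917T083000.zip",
    "gtfsrt_20250917T083030.zip",
]
selected = [n for n in names if in_morning_peak(n)]

for n in selected:
    local = utc_stamp(n).astimezone(BST)
    print(n, "->", local.strftime("%H:%M:%S"), "BST")
```

The selected names can then be copied into gtfsrt_morning_17Sept with, for example, `shutil.copy`.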


Inspect the GTFS-RT Data
Then, I adapted a Python script from the GitHub repo linked above, found in 'one_off_scripts' -> 'extract.py', to extract the BIN files from the timestamped zipped folders into another folder that contains all the unzipped BIN files (let's call this folder gtfsrt_morning_17Sept_unzipped). It is important to note that, at this point, the GTFS-RT data itself sits in those BIN files, which are not human-readable at all. Thus, I thought it would be wise to see what the GTFS-RT data looks like in a CSV file as a pre-processing step, using the script from 'pipelines' -> 'gtfsrt_to_csv.ipynb' with adaptations to match my local file structure. It was through inspecting the CSV that I discovered a small issue: BODS's GTFS-RT did not really capture the 'current_stop_sequence' and 'current_status' information of the buses. These two fields are crucial for converting GTFS-RT data into retrospective GTFS data via 'pipelines' -> 'gtfsrt2gtfs_interpolation.ipynb'.
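
One way to spot this kind of gap is to measure how often each column is actually filled in the CSV. The sketch below is only an illustration and is not taken from the repo: the column names other than 'current_stop_sequence' and 'current_status' are hypothetical, and the two sample rows are synthetic rather than real BODS data.

```python
import csv
import io


def fill_rates(csv_file, columns):
    """Return the share of rows with a non-empty value, per column."""
    reader = csv.DictReader(csv_file)
    total = 0
    filled = {column: 0 for column in columns}
    for row in reader:
        total += 1
        for column in columns:
            # Missing columns and blank cells both count as unfilled.
            if (row.get(column) or "").strip():
                filled[column] += 1
    return {c: (filled[c] / total if total else 0.0) for c in columns}


# Synthetic sample with hypothetical column names, for illustration only.
sample = io.StringIO(
    "vehicle_id,latitude,longitude,current_stop_sequence,current_status\n"
    "A1,53.48,-2.24,,\n"
    "B2,53.47,-2.23,,\n"
)
rates = fill_rates(sample, ["latitude", "current_stop_sequence", "current_status"])
print(rates)
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="")`.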


Convert GTFS-RT to Retrospective GTFS Data
Fortunately, the repo also contains `demo.ipynb` within 'pipelines', which implements the "older method for matching buses to the timetable using distance and bearing between stops and buses". This is what I adapted to convert GTFS-RT into retrospective GTFS data that I can then feed into r5py! As the project progresses over the coming months, I will create my own public GitHub repo where I share the adapted scripts and highlight the Python packages and related utils needed to run them. So far, I have carried out all these steps only to extract live location data for buses in Greater Manchester.
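
To give a feel for the two quantities that such a method relies on, here is a minimal sketch of the great-circle (haversine) distance and the initial compass bearing between a bus and a stop. This is not the code from `demo.ipynb`; the function names are hypothetical, the Earth is treated as a sphere with a mean radius of 6,371 km, and the coordinates in the example are arbitrary.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return math.degrees(math.atan2(y, x)) % 360


# Arbitrary example: a bus position and a stop slightly to its north-east.
bus = (53.4800, -2.2400)
stop = (53.4810, -2.2380)
print(round(haversine_m(*bus, *stop)), "m at bearing", round(bearing_deg(*bus, *stop)), "deg")
```

A matching step could then, for instance, prefer the nearest stop whose bearing from the bus roughly agrees with the bus's direction of travel, although the exact rule used in the notebook may differ.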


Caution Note
It is important to note that converting GTFS-RT into GTFS creates a 'feed_info.txt' file. This file contains information about when the GTFS schedule - retrospective or future - starts and ends, which is required for a GTFS feed to be run by the r5py package. 'demo.ipynb' builds this file from the scheduled GTFS data archived from BODS, and BODS defaults the end date of the schedule to a year beyond 2100. While this is not a problem for the validity of the GTFS per se, it is a problem specifically for r5py. Thus, before plugging the GTFS data - either retrospective or scheduled - into r5py, one has to change the 'feed_end_date' to a year before 2100 first!
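
A minimal sketch of that fix is shown below, assuming 'feed_info.txt' is a standard comma-separated GTFS file with dates written as YYYYMMDD. The function name, the cap of 20991231 and the sample row are all illustrative choices rather than anything prescribed by r5py or BODS.

```python
import csv
import io


def cap_feed_end_date(feed_info_text, cap="20991231"):
    """Return feed_info.txt content with any later feed_end_date replaced by cap."""
    reader = csv.DictReader(io.StringIO(feed_info_text))
    rows = list(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        end = row.get("feed_end_date") or ""
        # YYYYMMDD strings of equal length sort in date order.
        if end and end > cap:
            row["feed_end_date"] = cap
        writer.writerow(row)
    return out.getvalue()


# Hypothetical feed_info.txt content, for illustration only.
original = (
    "feed_publisher_name,feed_publisher_url,feed_lang,feed_start_date,feed_end_date\n"
    "Example Publisher,https://example.org,en,20250917,21001231\n"
)
patched = cap_feed_end_date(original)
print(patched)
```

To apply it to a real feed, read 'feed_info.txt' from the unzipped GTFS folder, write the patched text back, and re-zip the folder before loading it into r5py.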