For the research practicum of my Master's in Computer Science, we were challenged to use historical Dublin bus data to create a more accurate Dublin bus journey planner. This was a large project that we worked on exclusively for three months. It had challenges at every stage and was a great learning experience.
The first problem was parsing and understanding the historical Dublin bus data. There were more than 60 million rows of bus pings, and it was a huge challenge to determine which were spurious and which were relevant. The data was from 2012, so we also needed to understand how the routes had changed between then and now. I built a series of tools to process this data and isolate the individual journeys of each bus.
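To give a flavour of what isolating a journey involves, here is a minimal sketch, not the tooling we actually built. It assumes each ping has been reduced to a vehicle id, a timestamp in seconds and a stop id (the `Ping` fields are hypothetical), and it assumes that a long silence from a vehicle marks the end of a journey. The 10 minute threshold is an illustrative guess.

```python
from collections import defaultdict
from typing import NamedTuple


class Ping(NamedTuple):
    # Hypothetical, simplified ping record; the real rows had more columns.
    vehicle_id: str
    timestamp: int  # seconds
    stop_id: str


def split_journeys(pings, max_gap_s=600):
    """Group pings by vehicle, then cut each vehicle's pings into journeys
    wherever two consecutive pings are more than max_gap_s apart."""
    by_vehicle = defaultdict(list)
    for ping in pings:
        by_vehicle[ping.vehicle_id].append(ping)

    journeys = []
    for vehicle_pings in by_vehicle.values():
        vehicle_pings.sort(key=lambda p: p.timestamp)
        current = [vehicle_pings[0]]
        for prev, nxt in zip(vehicle_pings, vehicle_pings[1:]):
            if nxt.timestamp - prev.timestamp > max_gap_s:
                journeys.append(current)  # gap too long: close this journey
                current = []
            current.append(nxt)
        journeys.append(current)
    return journeys
```

With journeys in this shape, the last ping of a journey is simply `journey[-1]`.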
Below is a sample of the historical data with the last ping of the bus in a single journey highlighted.
Once we understood the system and had the data in a usable state, we looked at how to model it. After some testing, we decided to create a separate model for each route. We used arrival time as our target feature, and start time and weather as our input features.
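The one-model-per-route idea can be sketched as a grouping step that produces a separate training set for each route. This is an illustration only: the record keys (`route`, `start_time_s`, `rain_mm`, `arrival_time_s`) are hypothetical, and representing weather as a single rainfall number is an assumption made to keep the example short.

```python
from collections import defaultdict


def build_route_datasets(journeys):
    """Return {route: (features, targets)} so each route can get its own model.

    Each journey is a dict with the hypothetical keys "route",
    "start_time_s", "rain_mm" and "arrival_time_s".
    """
    datasets = defaultdict(lambda: ([], []))
    for journey in journeys:
        features, targets = datasets[journey["route"]]
        # Input features: start time and weather.
        features.append([journey["start_time_s"], journey["rain_mm"]])
        # Target feature: arrival time.
        targets.append(journey["arrival_time_s"])
    return dict(datasets)
```

Each `(features, targets)` pair can then be handed to whichever regression model is being trained for that route.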
Below we can see nine individual journeys as they travel along the route. The x-axis shows stop order and the y-axis shows time.
We then evaluated many different models using 10-fold cross-validation, and Gradient Boosting was the clear winner. Because the models would be deployed on a website and queried directly by users, it was important that they ran quickly and used as little memory as possible while still being accurate. Below are some of the results of the model testing.
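For readers unfamiliar with the evaluation procedure itself, 10-fold cross-validation can be sketched in plain Python as follows. This is a simplified illustration rather than our evaluation code: the `fit` and `predict` callables are hypothetical hooks for whichever model is being tested, mean absolute error is an assumed metric, and the mean predictor at the bottom is only a stand-in baseline, not Gradient Boosting.

```python
import random
from statistics import mean


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold.

    Returns one mean absolute error per fold. Requires len(X) >= k so that
    no fold is empty.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors = [abs(predict(model, X[i]) - y[i]) for i in held_out]
        scores.append(mean(errors))
    return scores


# Stand-in baseline: always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, x):
    return model
```

Comparing the per-fold scores of several candidates on the same folds is what lets one model be called a clear winner rather than a lucky one.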
Once we had our models, we used them to create an API. We used the Python Bottle framework for this because of its speed and simplicity. To route journeys between multiple bus routes, we used the Google Directions API to find the connections and then got the journey time for each connection from our API.
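The multi-route step can be sketched as chaining one prediction per leg. This is a simplified illustration under several assumptions: the legs are taken to be already extracted from the directions response into dicts with hypothetical keys, `predict_leg` is a stand-in for a request to our API, and waiting and walking time between connections are ignored.

```python
def total_journey_time(legs, depart_s, predict_leg):
    """Chain per-leg predictions and return the total travel time in seconds.

    legs: list of dicts with the hypothetical keys "route", "from_stop"
          and "to_stop", in travel order.
    predict_leg: callable(route, from_stop, to_stop, depart_s) -> seconds,
                 standing in for a call to the prediction API.
    """
    clock = depart_s
    for leg in legs:
        # Each leg departs when the previous one arrives.
        clock += predict_leg(leg["route"], leg["from_stop"], leg["to_stop"], clock)
    return clock - depart_s
```

Passing the running clock into each prediction matters because the models take start time as an input, so a later leg should be predicted for the time the passenger actually boards it.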
Our stack is outlined below.
While the results are not perfect and the Google Directions implementation needs some additional work, we were really happy with the experience of training, building and deploying the models. The project was a massive undertaking and taught me a huge amount along the way.