In the previous post I set the context of a project to monitor and predict the traffic over the border passages between Pays de Gex and Geneva. You can see it online here: http://fergonco.org/border-rampage/.
As one can see in the next data flow diagram, the project basically consists on gathering data from Geneva public transport (TPG) APIs, modelling the data, producing forecasts and drawing all this in a map:
In this post I will describe the process of gathering the data (in red).
As I said, Geneva public transport system, TPG, offers a REST API that allows querying the arrival time from their vehicles to each stop. Basically I am using two calls: GetAllNextDepartures and GetThermometer. The first one returns all the departures that take place from a specific stop to a specific destination. For example, this call:
will return all the departures of the line “Y” from Val-Thoiry (VATH) to Ferney-Voltaire:
<?xml version="1.0" encoding="UTF-8"?> <nextDepartures> <timestamp>2017-08-05T11:09:23+0200</timestamp> <departures> <departure> <departureCode>143732</departureCode> <timestamp>2017-08-05T16:24:23+0200</timestamp> <waitingTimeMillis>18940000</waitingTimeMillis> <waitingTime>&gt;1h</waitingTime> <reliability>T</reliability> </departure> <departure> <departureCode>143906</departureCode> <timestamp>2017-08-05T17:09:23+0200</timestamp> <waitingTimeMillis>21640000</waitingTimeMillis> <waitingTime>&gt;1h</waitingTime> <reliability>T</reliability> </departure> ... </departures> </nextDepartures>
The second call, GetThermometer, gives you the “thermometer” of one particular departureCode, which contains the expected arrival time of the vehicle to each stop (or the actual arrival time if the vehicle was already there):
<?xml version="1.0" encoding="UTF-8"?> <thermometer> <timestamp>2017-08-05T11:23:49+0200</timestamp> <lineCode>Y</lineCode> <destinationName>Ferney-Voltaire</destinationName> <destinationCode>FERNEY-VOLTAIRE</destinationCode> <steps> <step> <departureCode>143732</departureCode> <timestamp>2017-08-05T16:25:00+0200</timestamp> <stop> <stopCode>VATH</stopCode> <stopName>Val-Thoiry</stopName> </stop> <deviation>false</deviation> <reliability>T</reliability> <arrivalTime>301</arrivalTime> <visible>true</visible> </step> <step> <departureCode>143733</departureCode> <timestamp>2017-08-05T16:27:56+0200</timestamp> <stop> <stopCode>THGA</stopCode> <stopName>Thoiry-Gare</stopName> </stop> <deviation>false</deviation> <reliability>T</reliability> <arrivalTime>304</arrivalTime> <visible>true</visible> </step> <step> <departureCode>143734</departureCode> <timestamp>2017-08-05T16:29:20+0200</timestamp> <stop> <stopCode>THMA</stopCode> <stopName>Thoiry-Mairie</stopName> </stop> <deviation>false</deviation> <reliability>T</reliability> <arrivalTime>305</arrivalTime> <visible>true</visible> </step> ... </steps> </thermometer>
In order to obtain the arrival time of the vehicles to the stops, a process makes use of these calls in the following way:
This process was designed to be run indefinitely on the server. But of course the process would definitely crash at some point. How could I avoid taking a look at it every morning and, nevertheless, be sure it is working properly?
I used Monit to monitor the service. It allows to check different aspects of your system (process running, disk space, service output, etc.) and receive mail alerts when some conditions are met. With Monit it was a matter of minutes to setup an alert to be triggered if the service was not running.
But I didn’t want to check only that the service was running. I wanted to be sure the service was running and gathering data properly. The details are explained in a previous post.
This proved very useful to detect that the service was running but collapsed due to a memory leak, that databases can run out of space (see next) and that APIs can fail (see further down).
I had two episodes of this problem.
On the first one, due to a careless design of the database, I was storing more data than necessary each time a vehicle arrived to a stop. Which happens very frequently.
On the second one I just forgot the debug logs on, which are very verbose each time a vehicle arrives to a stop. So in few days the process filled up several Gigabytes. As a quick workaround I learned that it is possible to roll the log file in Docker with these options:
--log-opt max-size=50m --log-opt max-file=7
As said previously, the first design of the database was done without care. It was a first draft in a project that I thought was not going very far. At the end it did. And it was a huge mistake not to fix the database design at the earliest.
There were more than eight million records with redundant information. Due to the full disk episodes they were spread in three database backups, some of them overlapping.
When forecasting proved impossible with this data model it took me around ten days to unify the old backups and migrate them to the new, better model.
I wanted to draw the map based on the speed of the vehicles between stops. By querying the TPG API I could know when the vehicles arrive, which along with the length of the route between stops would allow to calculate speed. And speed is what I stored in the database.
Then I realized that the routes between some stops where wrong, and so where the speeds calculated between those stops. To correct the speeds I had to undo the speed calculation using the wrong route lengths, obtain the time between stops and calculate the speed again with the right route length. Very entertaining.
Then I realized that there were speeds equal to 3402km/h !! What happened? Probably some wrong data coming from the API or detecting the vehicle arrival. But, as I am processing the data before I save it on the database I cannot easily know the reason.
The current database design does not store speeds anymore, just the time between stops. The speed is calculated later, when the data is processed for visualization or for analysis, where I can use the route lengths that I know are true at the moment.
One day the data gathering service monitor alerted of data not being gathered. I saw in the logs that the request at 4am, when the process gets the planned schedule, had failed because the answer contained an empty response. I copy&pasted the request URL on the browser and I got an empty response again. At first I thought there was a problem with my API key. A bit later I tried again and it worked. Probably a temporal ban? Then it worked, then it didn’t. And so on without a recognizable pattern.
Then I run a script with 100 calls which yielded a ~50% success ratio. So basically the service was failing 50% of the time. Why? My hypothesis is that there was a load balancer with two machines behind, one of them being broken.
So I changed the API layer to retry N times before raising the failure to upper layers. Now the question was: Which N to choose?
The probability to get a failure in one call is 0.5. Two failures in a row 0.5*0.5=0.25 and so on. So if I want a failure to happen a 0.1% of the times I should find N so that:
0.5^n = 0.001
which I solve thus:
log(0.5^n) = log(0.001) n * log(0.5) = log(0.001) n = log(0.001)/log(0.5) n = 9.97
So retrying 10 times should get over the 50% API failure rate 99.9% of the times.
Despite of all these problems I think it was ok. Being a service that had to run constantly it could have turned into a nightmare but it didn’t. However I learnt some things: