A couple of weeks ago I was inspired to get a hold of Great Britain's STATS19 accident data record (lovingly held at the UK Data Service) in order to do some analysis regarding cycle accidents for a lecture I'm delivering to the uni's transport masters' students in March. I've used this data before when I did my Master's degree dissertation and essentially it is the content of the forms the police fill in when you have (or report) an accident. Short of the written statement component (which is not included in the data set for confidentiality reasons), the data is comprehensive and exhaustive. As such, it's one of those data sets you really need to know what question to ask before you start trying to crunch it.
To give some sense of scale, the data totals some 1.5Gb and that's simple tab/comma delimited text; nothing more. Even my quad-core, 16Gb, 4.5GHz i5 PC struggles with it as the software tools just aren't built for crunching that amount of data well in it's rawest state.
Unfortunately, truth be told, in the intervening weeks I've forgotten what it was specifically I wanted to use the data to demonstrate so when I finally got around to downloading it all yesterday, whilst I'd gotten familiar with the basics of SQL and put it all in a nicely accessible MS Access-linked database, I needed a question.
2,500 consecutive days have now passed without a fatal train crash* in GB... [*Except a few road/footpath users struck at level crossings.]— Joe Dunckley (@steinsky) December 28, 2013
and then asked:
I wonder how many of those days saw 0 fatal road crashes? I wonder if *any* of them did? Guess there must be 1 or 2 in the statistical noise— Joe Dunckley (@steinsky) December 28, 2013
This is a question we can answer.
Data Extraction and Excel-fu
A simple SQL query SELECTing accident IDs and dates for fatal accidents in Access gives us something we can export to Excel and work with.
So then, first we want a list of dates from the start to end of the data set. We can't quite do the last 2500 days as the data is only published for the previous year in the middle of the following so we only have data until end-2012. But this will be enough to make the point. Datesthen range from 01/01/1986 to 31/12/2012; some 9861 days.
A little bit of =FREQUENCY(...) array formula action and we get a per day frequency of fatal accidents. If you're not interested in the whole set, skip on. Here's some high level:
- Minimum fatal accidents per day: 0
- Maximum fatal accidents per day: 30
- Average fatal accidents per day: 9
- Total number of fatal accidents: 88581
Bit sobering that last one; nearly 90000 fatal accidents on GB roads 1986 to 2012 inclusive (Note GB as Northern Island does its own thing stats-wise). And note also that those are recorded accident counts. There are two issues with this: first, under-reporting, though this is likely to be marginal for fatal accidents given the involvement of the police in practically all such cases; but secondly, more than one person can die in a single fatal recorded accident, this is not likely to be insignificant. I could find that actual number by checking the casualty data, but thanks to the helpful wording of the question, this need not detain us here.
There's clearly other results of interest here but seeing as they aren't the point of this, I'll leave them save for the following two figures:
Time Series of Daily Recorded Fatal Accidents in Great Britain; 1986–2012
Rather than the usual downward pointing trend line, I like this graph as it shows the data in all its messy spiky glory. What can I say, I used to be an oceanographer; we like our time series. Already we can see a few days where the line touches 0 and that seems to happen more frequently as we come closer to present.
Frequency Histogram of Recorded Fatal Accidents per Day in Great Britain; 1986–2012
I like this second one because it looks like a Poisson Distribution. It is supposed to look like a Poisson Distribution. When we study or demonstrate Poisson Distributions, we always use accidents as an example. I love it when science works.
For reference, there are 30 days in the 1986–2012 time series upon which there were no fatals.
So, the original question relates to the last 2500 days. For the purpose of answering here, we are going to say 2500 days to end 2012. This then starts 27/02/2006. 2500 days to yesterday is 23/02/2007 (i.e. Grayrigg crash); there are no accident free days in the intervening period 06–07, but there may well have been in 2013. We can see that the roads seem to be getting safer over time so an earlier 2500 day period gives a greater chance of fewer fatal accident-free days and would make the point behind the question more forcefully. I.e. do we expect 2500 days on the roads to give any fatal accident-free days?
Well, for the 2500 days 27/02/2006 to 31/12/2012:
- Minimum fatal accidents per day: 0
- Maximum fatal accidents per day: 20
- Average fatal accidents per day: 6
- Total number of fatal accidents: 14768
The above two graphs now look like this (note the rescaled axes):
Time Series of Daily Recorded Fatal Accidents from 27/02/2006 to end 2012 for Great Britain
Frequency Histogram of Recorded Fatal Accidents per Day in Great Britain from 27/02/2006 to end 2012
So in answer to the original question: there are 25 days in this 2500 day period that have zero fatal accidents recorded; or on average, once every 100 days (3 or 4 per year). That's 25 of the total 30 fatal-free days within only the last 7 years of nearly 30; who says the country is going to hell?
Incidentally, based on those parameters, a Poisson Distribution predicts a zero fatal accident day only once every 368 days; i.e. one per year or about 6 or 7 total in this data set. I haven't done a test on it to see if this is significant in the statistical sense, maybe we have just been lucky and this is just noise. However, you can see from the first graph how things have clearly got better on the roads in the last three decades regards fatalities, despite the increase in traffic (and thus exposure) which has happened in the meantime. I'd like to think this is because of all the hard work being done by engineers to improve roads but really this is likely strongly correlated with improved vehicle safety. It is also not a trend borne out by pedestrian and cycle casualties, but that is a post for another time.
Data sources are linked at point of appearance above. Additionally, STATS19 data is available from: