In the first of a series of video blogs by DataSoc, we chat with Professor Jake Olivier, who is a researcher and lecturer at the UNSW School of Mathematics and Statistics, as well as the Deputy Director of the Transport and Road Safety Research Centre at UNSW. Join us as we find out more about his journey of studying and teaching mathematics, as well as his research in public road safety.
Julian: Hi everyone, I’m Julian and I’m with Gordon and we’re both from DataSoc and we’d like to introduce you today to Jake Olivier, who is a Professor of Statistics with the School of Mathematics and Statistics as well as Deputy Director of Transport and Road Safety Research at UNSW. Some of his other research interests include road safety and statistical methods for evaluating public health interventions. So with that said, could you just give us a brief overview of your career so far and your education and what led you up to this point? I know it’s a bit of a big question but -
Jake: Well, there are a lot of sub-questions in there. But if I start from the beginning I suppose I was originally – for one year I was an engineering student, then I realised I liked mathematics far more than engineering so I swapped to mathematics and I did pure mathematics as an undergraduate student and effectively as a masters student – in the US, we don’t have honours years – but I knew kind of early on that I wanted to do what might be considered applications of mathematics, and where I did my PhD, statistics was one of the two areas which could be considered applications of mathematics – the other one would have been graph theory, I suppose, or let’s say combinatorics – and so I decided to do statistics.
So I did my PhD in Statistics and right out of school I got a position at a medical university, working in biostatistics. And I probably spent the first year or so trying to learn what biostatistics was because I didn’t really have the background for that. So the chair of my department when I started there, he actually gave me a book on epidemiology and said, this is what you’re going to do for the rest of your life so you need to read that. I didn’t really understand what epidemiology was when I started. But having read it, at that point I had a pretty strong background in pure mathematics and statistics, so reading a book on epidemiology from a mathematics perspective is very straightforward, I think, usually. There are some very difficult concepts in epidemiology that are mathematical, but I find if you have enough of a background it usually isn’t difficult.
However, epi also has its own kind of constructs and things that are very specific to epidemiology that aren’t taught in mathematics or statistics or computing or anything else that might be considered analytics. Like, what is the difference between a case-control study and a cohort study, and why you shouldn’t be doing certain things in either of those types of studies. There are things underneath the mathematical argument that might not be obvious from a mathematical perspective.
So I found when I finished my undergraduate degree and started working in biostatistics, which I really didn’t know a lot about when I started, that I had to learn a lot in what I would say was a short amount of time. Short, as in probably two or three years. What was good, in a sense, for me, was that it prompted me to have this attitude where I was always learning things, always trying to do things in a sense better or trying to pick up new skills as I was going along, and I kind of have to do that even now.
Like in the courses that I teach – I started my PhD in 1995, back before R existed. I don’t really know anything about Python, by the way, so I don’t want to say anything and make myself sound stupid, but for me, for stats people back then, SAS was kind of the main thing, and you had to connect to SAS remotely because it was on some big mainframe computer in the middle of campus. This was back before PCs were – you couldn’t do as much with PCs back then. So R didn’t exist so I learnt how to do things in SAS. In medical stuff, SAS still dominates most places I think, but when I came to UNSW in 2008 I had to learn how to use R because I had to teach R, and I also had to learn Matlab because I had to teach Matlab. I still stuck with SAS primarily for the first years but then – I feel like primarily because of RStudio and tidyverse stuff, things that are related to that, it prompted me to give R more of a shot, I guess, and over the years I tried to introduce more and more R stuff into my courses. I kind of do a hybrid SAS-R from a computing perspective. I’d probably do R fully if ever I had to pay for SAS – maybe don’t tell SAS people that, but it’s very expensive to have SAS and not everyone has it, and R is free, obviously.
That doesn’t give an idea about why I do the research I do, I suppose, but that’s kind of the background, for me, education-wise and kind of at least the beginnings of my career.
Gordon: We can talk more about your research later on, but just to touch a bit on your experience in education at UNSW, we’re wondering if you have any specific teaching methodologies or principles as a teacher at school?
Jake: I mean, in terms of methodology, I feel like I – when I teach students, it’s very important that students are doing stuff while they’re trying to learn and so, irrespective of the course that I’m teaching, there’s a fair bit of computing involved in it. Because you have to have an understanding of – I mean, methods are important, and I do talk at length about the underlying theory that motivates or is the foundation of a method, but I feel there has to be some computing aspect, otherwise stats doesn’t really make sense to a lot of people. You have a bunch of Greek letters and likelihoods and all kinds of stuff on a board, but none of it really makes sense, even for me, until you actually have some data and you try to do something with it. So I’ve always got those kinds of things in my courses.
One of the things I learnt early on at UNSW was that if I put everything on the slides students didn’t really pay attention. I’ve tried over the years eliminating certain things and doing them in class, but now doing things virtually has made that very difficult. I do have an iPad with a pencil, and I tried writing on it but my handwriting is just awful. I’m ok on the chalkboard and the whiteboard but I’m not very good on the iPad. I’ve tried to just open up a Word file and type stuff out during lectures and that seems to be ok, but I feel like if I do that too much it just slows everything down. I really can’t wait to go back to teaching face to face – I’m not sure if you guys are the same, it’d be great once we knew. Active learning, I think, is really important in stats.
Gordon: Yeah, exactly.
Jake: It’s really important for a lot of things. I also think it’s really important that our stats students work on, not toy examples, in classes, but use real data to learn. I can simulate data all day long, it doesn’t really reflect the real world.
Gordon: So, would you say that some of these computing technologies that have come along over time have made stats more accessible to more students?
Jake: In a sense, yes. I mean, when I was in the US as an academic, I was the “property officer”. Which meant, once a year, someone from the States came and had to lay eyes on all of our pieces of equipment. Which is a bit, it’s a bit of a strange thing, but you know, the university’s paying for things, there’s accountability, you have to have accounts and audits and auditors and whatever. One of the curious things about that, so I started in 2003, was that the department I was in had a calculator, a hand-held calculator, it was about this big [gestures with hands to shape something approximately four handspans large] and it had tape in it, as in it printed stuff out, because it only had an LED screen, so it had to print out, and that’s how people used to do analysis of variance, or basic regression. It was this computer, and you had to type it all in to get the data in, right, and I know it could do one-way ANOVA, I don’t know if it could do two-way ANOVA, but, the curious thing – so every year I had to find that device, that no one had used in like two decades probably, and show it to some auditor to make sure we still had it, even though it’s never used.
So that’s how things were done in the 70s. Personal computing became more popular in the 80s, which made things a bit more accessible, but in the mid-90s I’d started my PhD and I was still having to use a PC to Telnet, which was before Secure Shell, to connect to the mainframe and do stuff that way. Which meant I had to Telnet and dump data to the computer and run code, and like, I didn’t have Internet at my apartment, so I’d be typing code in a Word file or some editor and I’d have no idea if the code worked. So yes, things are definitely more accessible now than they were back when I was a student. I didn’t even know what I was supposed to do, right, at that point I wasn’t trying to learn how to do stuff, it’s like I was just trying to implement stuff, and it was – it was impossible to do that. So yes, things are far better.
I don’t really like the R GUI – I don’t mind saying that, I hope I don’t offend anybody – but I hate the R GUI. It’s not really a GUI pipeline, but it’s fantastic that it’s free and open-source, that whole mentality is revolutionary I think in statistical computing and it propelled people to do lots and lots more instead of having everything closed-box. SAS is massively expensive. It’s not hyperbole, it is massively expensive. You would not have your own license, ever, as a student, it’s that expensive.
Julian: Yeah, I think that resonates a lot with me, someone who has to use STATA a lot for econometrics, very big – if you don’t know what STATA is it’s very similar.
Jake: I know.
Julian: Yeah, right, very big contrast for me who also uses Python with all – as you said, lots of open-source packages as well in that area, so very big contrast going from STATA to Python and back –
Jake: Yeah, I think, the normal equations for linear models – they’ve been around for a long time, right, it’s not that hard to program them on your own, right. As an example. So it’s not always clear to me why things are as expensive as they are. I mean, some biostats people did use to use STATA as well, it’s ok, I suppose, I have no idea how to use it. It’s like, I know how to program in SAS and R, a bit of Matlab, I used to use Maple, a long time ago, to do symbolic kind of stuff, and um, I’m not learning anything else. I hate to say – don’t quote me on that, like I would argue you should always be picking up new skills all the time but part of me about STATA is like I’m not doing that.
Julian: Yeah, [laughs] I know exactly what you mean.
Jake: Yeah, if it wasn’t for either – so we teach stats to the engineering students, we teach them using Matlab, if it wasn’t for that I think I’d have enough room in my brain for STATA. Or probably Python, maybe I’d go with Python at this point. Or Julia, I suppose. There’s a ton.
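Jake's aside that the normal equations for a linear model are not hard to program can be illustrated with a short sketch. The code below is not from the interview: the function name `fit_linear_model` and the toy data are assumptions made up for illustration. It solves (X'X)b = X'y by Gauss-Jordan elimination using only the Python standard library.

```python
def fit_linear_model(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    X is a list of rows (include a column of 1.0 for an intercept),
    y is a list of responses. Hypothetical helper, for illustration only.
    """
    n, p = len(X), len(X[0])
    # Build X'X (p x p) and X'y (length p).
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # Augmented matrix [X'X | X'y], reduced with partial pivoting.
    aug = [row + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; columns of X are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    # The left block is now diagonal, so each coefficient is rhs / diagonal.
    return [aug[r][p] / aug[r][r] for r in range(p)]


# Made-up data lying exactly on the line y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
coef = fit_linear_model(X, y)
print("intercept and slope:", coef)
```

In practice, statistical software usually solves least-squares problems with a QR decomposition rather than by forming X'X directly, because that is numerically more stable; the sketch only shows that the basic idea fits in a few lines.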
Julian: Yeah, lots of different programming languages out there, definitely. But shifting to a different category of questions, in terms of your research. Just wondering – so considering your work in the area of epidemiology and road safety and public health – just wondering what the role of statistical research is in policy design, and specifically what the process is from writing a paper to, sort of, advising on policy?
Jake: At the start one of the most difficult things about trying to do research that’s meant to, let’s just say evaluate policy – the policy may have already come in and existed, and we want to evaluate, was it a good idea. There are times when the policy doesn’t exist, but you want to do research on whether the policy should exist. You really have to start off with data, I suppose. Look, if a policy’s already been implemented, you’re not going to be able to collect your own data from before the policy. You need to have some sense of how things were prior to the policy being implemented, right? And if you, or somebody, hadn’t started a study prior to that, there is no pre-policy data, let’s say pre-legislation data. So you have to rely on routinely collected data.
Now, depending on where you are in the world, like Australia does a fairly good job with medical stuff, so hospital records around the country are all electronic, which is fantastic, death data is available electronically – that’s also fantastic – and the ABS collects population and migration data all the time so we know how many Australians there are at various places, in different states, and LGAs and other geographic divisions. But if that stuff doesn’t exist, then you really can’t get a good sense of what’s going on.
So you have to start off with good sources of data, and I’d say that’s why we need the ABS, we need other bodies in the government to routinely collect and import data. One of the things – so in road safety, one of the things I think that we do an awfully poor job of in Australia is collecting mobility data. As in, how do people get around? How often do they use the loo, do you walk to work, or walk to the grocery store, or the shops or whatever, do you ride a bike, or do you drive, or – how often – that needs to be separated out into different genders, sexes and also different age groups and different locations and other things like that, so we know that this is how people are getting around. It’s important, because it impels the politicians to spend money to support those modes of transport. So if tons of people are riding bikes all the time, why the hell are we building roads for cars, for example. But without that information, we don’t fully have an idea of it.
Now there’s bits of information that may exist at various times, like the City of Sydney might do some survey, and collect some information, the Melbourne, whatsit called, it’s like the inner city of Melbourne has an acronym, it’s either City of Melbourne or something like that, but they also do a bit of that, but there’s no Australia-wide data that does that. Not many countries do, apparently, but some countries collect mobility data quite effectively.
So you have to start off by having data to be able to do stuff, and honestly, let’s just say from an injury or road safety kind of perspective, there’s often not data there. Or the data that is there is not quite ideal. Now sometimes the data there is so bad you really just need to ignore it. It’s too bad, we can’t actually answer questions from this, we just need to push it aside. Not everyone is willing to do that unfortunately, but sometimes the data that is available just isn’t up to doing – to answering whatever research question you have in a reasonable way. Just because you’ve got numbers doesn’t mean you have valid information.
Gordon: I was wondering, leading off from that, what are some limitations you’ve encountered from using statistical methods in your research?
Jake: In epidemiology, we often deal with, pretty much always deal with observational data. So there is no randomised controlled data, there are no random samples of the population, and so having observational data automatically comes with certain kinds of weaknesses because of that. Confounding is a very difficult thing to understand conceptually, but it can explain why we may observe something to be either, let’s just say, a success or a failure, when it’s really because there’s some confounding that happened. Confounding often happens because we don’t randomly choose who to treat in observational studies.
For example, we have seatbelt laws, here and in other places. But when you get in a motor vehicle, no one is prompting you to put it on, right? So there are compliance issues there, where people will voluntarily choose not to do something, or choose to do something. We don’t have the ability to control that. But, having said that, we also can’t go to everyone’s house or whatever and make them put a seatbelt on when they get in a car. We can’t make people smoke or not smoke cigarettes, or use tobacco, or use alcohol, or other things like that. People will, voluntarily, do something silly, like drink a lot and get in a motor vehicle and drive. It’s illegal, but there’s no one who can automatically stop someone from doing it.
So observational data is always problematic for that reason. There are also other, less advertised things like what’s called regression to the mean, and that’s, if you have, say, before and after studies, things can get bad – so what happens sometimes in road safety and other policy kind of areas is that policies may come in because things get really bad. So there’s maybe a lot of crashes, like injuries, serious injuries or even fatalities. So if you think about a time series, things spike up and then the government says, oh, we need to do something about that, and they do something about it, and things go down.
Now they could go down because some new intervention was effective, but it could have just been that that time, when things looked really, really bad, was just in the far tail of some distribution, say the regular distribution that has been observed for some time now. Sometimes, we observe stuff in the tails of distributions. And when we do that, things will just automatically go back to normal. That’s where the idea of regression to the mean comes from. So things that are normal, and I don’t mean normal like in the normal distribution, but let’s just say typical, maybe, they seem atypical but it’s really just part of some difficult process where things go back to normal and everything looks like there was this big change. But it’s not because there really was a change, it’s from what’s called regression to the mean.
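The regression-to-the-mean effect Jake describes can be seen in a small simulation. This is a sketch under made-up assumptions, not an analysis from his research: 10,000 hypothetical sites have crash counts drawn from the same normal distribution (mean 100, standard deviation 10) in a "before" year and an "after" year, and nothing changes between the two years. The sites that looked worst "before" are flagged, as if a policy had been triggered by the spike.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

N_SITES = 10_000        # hypothetical number of road sites
MEAN, SD = 100.0, 10.0  # assumed stable process: same distribution both years

# Two independent years from the same process. No intervention happens.
before = [random.gauss(MEAN, SD) for _ in range(N_SITES)]
after = [random.gauss(MEAN, SD) for _ in range(N_SITES)]

# Flag the worst 5% of sites based on the "before" year only.
cutoff = sorted(before)[int(0.95 * N_SITES)]
flagged = [i for i in range(N_SITES) if before[i] >= cutoff]

before_mean = sum(before[i] for i in flagged) / len(flagged)
after_mean = sum(after[i] for i in flagged) / len(flagged)

print(f"flagged sites, before: {before_mean:.1f}")
print(f"flagged sites, after:  {after_mean:.1f}")
```

The flagged sites average well above 100 in the "before" year simply because they were selected for being extreme, and they fall back to roughly 100 in the "after" year even though no intervention took place. A naive before-and-after comparison would credit that drop to the policy.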
Gordon: Another thing about your research that we’d like to know I guess, is whether there’s any surprising or unexpected results that you’ve discovered?
Jake: So one of the things I’ve done for the last couple of years is gotten into what speed limits should be for motor vehicles travelling near – what should speed limits be around pedestrian areas. It’s clear that it shouldn’t be, say, 80km/h or 100km/h, if there are pedestrians around. But there is this fairly big debate worldwide as to what speed limits should be. The default speed limit in Australia tends to be 50km/h, so where there’s nothing posted, it’s 50km/h.
Now, in the City of Sydney they’ve adopted 40km/h. So in those areas – where the lines are marked off isn’t always 100% clear – but there is this idea that the limit in those areas where there may be a lot of pedestrians should be lower than 50. There’s still a question about what it should be. Should it be 30? A lot of European countries – sorry, cities in Europe, have 30km/h. So I got interested in that and did a systematic analysis where we collected studies that looked into pedestrians being hit by the front of a motor vehicle, and whether they died or not. Now, some studies had serious injuries or not, meaning the person didn’t die, but primarily we’re trying to base this off fatalities. And it turns out that if you’re hit by a motor vehicle travelling at 30km/h, our best guess is that you have about a 5% chance of dying.
No probability larger than 0% is an acceptable risk, but the only way to get to 0% is to have no motor vehicles – which some people would argue you should be doing, and they’re not necessarily wrong about that, but if we have motor vehicles and they’re allowed to be sort of near where pedestrians are, the speed at impact should be no more than 30km/h. Now, you could then kind of deduce that if the impact speed is 30 it means maybe we could set the speed limit to 40, assuming that if there was about to be a crash the car could slow down to at least 30 or less. Now, when getting into serious injuries, you would think that the probability of these things happening, the probability of fatalities would usually be less than the probability of serious injuries. But, given the data we have, it’s actually the other way around.
Now, I know why that’s happening. It’s not that you’re less likely to – so, if you’re hit by a vehicle, you’d think dying would have a lower probability than having an injury, just, a survivable injury. The issue is that when the data’s being collected, auditors have to go to a scene and use various measuring ways – they measure like length of skid marks that the tyres made, how far the body was thrown when hit by the vehicle and things like that: they use that to get the estimated impact speed. Now, if a car bumps into a pedestrian, maybe knocks them over but they get up and walk away, those never get audited. So the ones that are not very serious don’t exist in any datasets. The ones that are low speed, they don’t exist.
So there’s this problem of what I call sampling. There’s very little information at the lower speeds for injuries, and so only the really serious ones are being captured in various databases that could then be analysed. Because we often don’t collect information – we really don’t ever collect information on things that are minor. And if we don’t know what’s happening with minor injuries, we don’t really have a comparison to be able to accurately estimate what I call risk curves, which is the probability of having some event given an impact speed, say. But we don’t really have an idea about that unless we have – until we collect more data, basically, if that makes sense. It’s not that you’re less likely to die as opposed to have a serious injury, it’s actually the other way around, quite likely, I mean I’m 100% positive that it’s the other way around, but it’s because of the data collection not because of anything.
But trying to get governments and politicians to collect that data is a struggle. I’ve got a colleague at the – well, I guess he was at the World Bank, and they do a fair bit of road safety stuff. It took me a long time to convince him that’s what was happening, but they’re not going to invest any money into collecting better data.
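The data-collection problem Jake describes, where low-speed, minor collisions never reach any dataset, can be made concrete with a little arithmetic. The sketch below is illustrative only: the function `observed_risk` and the capture rates are hypothetical, and the 5% figure simply reuses Jake's best guess for a 30km/h impact as the assumed true risk.

```python
def observed_risk(true_risk, capture_fatal, capture_nonfatal):
    """Risk of death as it would appear in recorded data.

    true_risk:        assumed real probability of death at a given impact speed
    capture_fatal:    fraction of fatal collisions that end up in the database
    capture_nonfatal: fraction of non-fatal collisions that end up in the database
    All three inputs are hypothetical values chosen for illustration.
    """
    recorded_fatal = true_risk * capture_fatal
    recorded_nonfatal = (1.0 - true_risk) * capture_nonfatal
    return recorded_fatal / (recorded_fatal + recorded_nonfatal)


true_risk_30 = 0.05  # Jake's best-guess fatality risk at a 30km/h impact

# If every collision were recorded, the data would show the true risk.
full_capture = observed_risk(true_risk_30, 1.0, 1.0)

# Assume every death is investigated but only 1 in 5 non-fatal collisions is.
partial_capture = observed_risk(true_risk_30, 1.0, 0.2)

print(f"all collisions recorded:      {full_capture:.3f}")
print(f"minor collisions unrecorded:  {partial_capture:.3f}")
```

Under these assumed capture rates the recorded data would suggest a risk about four times the true one, which is why a risk curve estimated only from investigated collisions cannot be trusted at low speeds.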
Gordon: Just as a closing question then, do you have any general advice for students studying Maths, Statistics, Data Science, or any other fields that you’re interested in or involved with?
Jake: That’s a good question. I mean, data science didn’t exist when I was a student. I mean, the various aspects of it did, right, like we can think of data science or analytics as being this combination of various disciplines coming together. If I were starting out, I think I would tell myself maybe, um. Uh.
I’d say mathematics is important. Statistics is important. You need to have a good, solid understanding of those things. But also understanding how data is generated, not just I’m going to fit these very complicated neural networks or other kinds of really complicated – none of that really matters unless you get a really good sense about the content area itself. I think I’ve kind of said it before: just because you have a bunch of numbers doesn’t mean you have information that’s relevant to what you’re trying to do.
I think, one of the things that’s helped me in my career – look, at the minimum I had to have a good understanding of mathematics and statistics to do what I do, that is the – I think that’s obvious though. The thing that’s really helped me is getting a good understanding of the content area, and spending a lot of time doing so. I had professors tell me when I was a student, oh, you don’t really need an understanding of biology to do biostatistics, but I feel like I really should have done some minor in biology a long time ago to have a better understanding. I think I would have been a better statistician, data scientist, data analytics person had I had that information. I’m not – I had to learn epidemiology as part of my job, and I’m glad I did, and I wish I had that as a student. It’s something I was able to pick up, thankfully, but I had to have a good understanding of things like that, the content areas I was working in.
A lot of which was how data are generated, how hospital information is coded, for example, like the WHO publishes the International Classification of Diseases, so if you’ve ever been hospitalised, in Australia, pretty much anywhere in the world, whatever you had is a code. Probably multiple codes. Having an understanding of what those things are and how we can use databases to identify, like, these are all the patients who had multiple myeloma, for example, it’s a kind of cancer. Or we had to identify all the people who’ve been in a car crash who had a skull fracture, things like that.
So having an understanding of all these things is really important – I think, maybe as a general kind of ethos, I feel like you have to be an expert in lots and lots of different things. It’s not really a jack of all trades. I think if you’re just a jack of all trades, you don’t really – you don’t have enough substantive information in any one area, you kind of get lost in the mix, I think. You kind of need to have expertise in lots of different things to be a good data analyst, statistician, data scientist. And I’d also say, you kind of have to like always learning new things.
‘Cuz everything is always changing, right, like I remember back when I was a student, backwards elimination was like, oh that would be kind of an interesting thing to do to select a model, and then some people would say, two years after that, oh that’s a terrible idea because you don’t want the data to tell you what the model should be, and then over time, it seems like it pops up again and someone says, oh, well there’s a good reason to do this, and some people would say, it’s a terrible thing.
There’s this ongoing debate about – I’m having a discussion over email. It’s about whether we should be reporting a measure called the odds ratio, or whether we should be reporting relative risk. They’re both computed from effectively the same data, but some people say, well, relative risk is a measure that’s considered collapsible, and this is a great property to have, and then some people will say, well, I don’t care about collapsibility, it should be what’s called portable.
And so you’re getting all these broad discussions and so you think, well, these are things that have been settled, these are debates that we should have had 20 years ago, we should have settled on an answer. It turns out, in reality, we often don’t have 100% answers to everything. So it’s really good to have an understanding of what all these various things are, so when you’re out on your own and you don’t have someone who you can just ask, hey, can you tell me what the answers are, you have a sense about what the arguments are, and you can make good decisions for yourself.
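The odds ratio versus relative risk debate Jake mentions can be illustrated with a small worked example of collapsibility. The counts below are invented for illustration and do not come from any study: two equally sized strata, each with 100 exposed and 100 unexposed people, so the stratum variable is not a confounder. The helper names `relative_risk` and `odds_ratio` are likewise assumptions.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)


def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Odds of the event in the exposed group divided by odds in the unexposed."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed


# (events in exposed, number exposed, events in unexposed, number unexposed)
stratum_a = (80, 100, 50, 100)  # made-up high-risk stratum
stratum_b = (50, 100, 20, 100)  # made-up low-risk stratum
pooled = tuple(a + b for a, b in zip(stratum_a, stratum_b))

for name, table in [("stratum A", stratum_a),
                    ("stratum B", stratum_b),
                    ("pooled", pooled)]:
    print(f"{name}: RR = {relative_risk(*table):.3f}, "
          f"OR = {odds_ratio(*table):.3f}")
```

With these counts the odds ratio is 4 in both strata but smaller than 4 once the strata are pooled, even though there is no confounding. That is what non-collapsibility of the odds ratio means. The pooled relative risk, by contrast, lands between the two stratum-specific relative risks, behaving like a weighted average of them.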
Gordon: Alright, thank you so much for that, then, Jake. That just about wraps it up for the interview. It was really nice to hear about your experience and research interests, so thank you very much for your time.
Jake: No problem.
Julian: Yeah, thank you, thank you very much.
Jake: Yeah, no problem Julian.
Stay tuned for more upcoming video blogs with professors from the UNSW School of Mathematics and Statistics!