Data (Science) is everywhere and for everyone – but only if we are professional

In her inaugural lecture, Rachel Hilliam, Professor of Statistics in the OU’s Faculty of Science, Technology, Engineering and Mathematics, explored the role played by Data Science in ensuring that data is being used ethically, stored safely and analysed robustly.

In a world awash with data, who do we trust to collect, store and analyse our data? The last decade has seen the emergence of the Data Scientist, but what is Data Science? This lecture explored the rise of Data Science and how this has influenced Professor Hilliam’s career.

It also considered the question of whether Data Science is a new profession or just Statistics with a fancy title.

Watch the recording of Professor Rachel Hilliam’s inaugural lecture

Kevin: Good evening everybody. Thank you very much for coming along. I’m Kevin Shakesheff, I’m the Pro-Vice-Chancellor for Research and Innovation. One of the great pleasures in my role is to host our inaugural lectures. The inaugurals are a real celebration of academics who have reached the point in their careers where they have become Professors and it’s an opportunity for them to give a presentation. It’s lovely to see some family and friends as well as fellow colleagues from the OU at these events. Tonight after the presentation from Rachel we’ll have a chat so get ready with some nice difficult questions for Rachel at the end of the evening. So if you are on Twitter you are behind the times but keep going, we do like you to use social media whilst you are here tonight. Just let other people know. So we have got a hashtag #OUTalks. I don’t know what you do on Threads or Mastodon or all the other ones but please do put things on Twitter. It’s lovely to celebrate this event.

So I’m going to invite Rachel to the stage in a second. Let me say a few things about Rachel. She's Professor of Statistics in our School of Mathematics and Statistics. She's been with us since 2011. We were just discussing various roles in the Midlands where I'm also from, as well as her time as a medical statistician in the NHS. Rachel's played lots of roles within the OU and she's going to tell us all about her research this evening. She also, of course, has external roles, and is currently the Vice President at the Royal Statistical Society. So great to have our academics out there in these prominent roles. So on that I will invite Rachel and we're really looking forward to hearing all about your research.

Rachel: Thank you very much. It's sort of an exciting curtain event as you pass through there. So thank you very much for the introduction. Thank you very much to everybody who's come today as well. It's fantastic to see so many people here. I'm kind of aware that there's an awful lot of people both in this room and I know online as well who have been incredibly supportive to me over my whole career. I use the word ‘career’ lightly because I always think of what I've done as more of a random walk than a linear career path really. But as I say, for the next 40 minutes or so we're going to talk all things data science and data. But before we get onto that because there are so many people to thank and I'm not going to thank you all in person, but I might pretend to buy you a glass of wine at the end, although there's free wine out there. What I do want to do is just take a moment to thank 2 very special people.

So these are two people who I owe an awful lot to. Unfortunately neither of these people are any longer with us. The gentleman on the right hand side many of you in this room and online will know, Derek Goldrei, who was a massive figure in our School of Maths and Statistics, and I'll talk about him in a moment. The gentleman on the left there's probably fewer people than I can count on one hand in this room who will know who he is. So Maurice Sewell was my dad. He was basically the first person that I ever talked statistics with. So my dad was an incredibly patient person, as well as being a very blunt Yorkshireman. We used to go on quite long drives together because he used to take me round to various orchestras that I played in and we would always have conversations about statistics and mathematics. So as he was doing this taxi service he was also preparing me for this whole world that I now am very privileged to inhabit. He was also one of the first students at The Open University. So he was one of the people that The Open University was really designed for. So he left school, he went straight out to work in the steel industry, but he had a real love for learning. He was very proud of The Open University and everything that it stood for. One of the reasons he was so proud of The Open University is because having his degree, which was mostly in mathematics, with a bit of physics and computing thrown in, allowed him in the twilight years of his career to actually come out of industry and retrain to be a maths teacher, which he did right to the end of his life. So he was always incredibly grateful to the OU for having given him the ability to do that. Unfortunately he died far too young and he died a year before I became a tutor at the OU. So he never actually knew that I ended up working at the University at which he was so proud to be a student.
There's a bizarre link between these 2 men because Derek almost certainly gave those very early OU BBC2 programmes in the early morning, many of which I used to go down as a very small child and laugh in front of as my dad was no doubt trying very hard to study for his degree. So sorry about that. So we used to kind of laugh about that but Derek was an amazing person, he touched many people in this room. He was the most encouraging wonderful person that I think I've ever come across. I was very lucky to have him as my mentor when I first arrived in 2011 as a full-time member of staff, but he was always encouraging. He was the first person to celebrate any success you had. He was also the person to pick you up when things didn't go quite so well. He did that not just for me but for everybody and he leaves an enormous Derek-size hole, not just in maths and statistics at the OU, but in many communities. So I'd kind of like to dedicate my talk tonight to these two very special men.

So having successfully got through that we will now talk data science. So data science, many of you in this room potentially might not know what data science is so we're going to explore a little bit what data science is. We're going to explore some of the pitfalls alongside working as a data scientist and things that I think it's really important we should be thinking about. But before we do that, we're going to talk data first of all. So I want you to think in the last year about the amount of data that's been collected from you. So if we keep it simple first of all, maybe just think about your day to day. So your travel here, you'll have come by public transport or you would have driven your car down here. You'll have passed numerous cameras of one sort or another taking pictures of you. You'll have also potentially filled up with fuel for either yourself or your car, paying by card or some sort of other digital means. Everybody I'm suspecting in this room and also online as well has got their phone somewhere near them. Depending on what privacy settings you've got on your phone, it's not only recording all of your social interactions, but it's also recording where you are through GPS servers. You've probably got all sorts of interesting apps on there. I'm looking at my 2 boys. They've got all sorts of interesting apps on their phones. But at the very least you’ll have got some sort of health app on there no doubt recording the steps that you've taken today, the amount of sleep that you've had, your heart rate, all the rest of it. So all of that data is being collected about you all of the time, unless you're very careful about what you do with your particular settings. If we take it a little bit further let's think about the last 2 months. You've probably taken part in some sort of survey. It seems at the moment I can't drive past my local garage without them asking me whether I'm satisfied with some survey that they've provided for me. 
So at least you'll have ticked boxes like that.

It may be that you're interested in data much more. So it could be that the OU has really inspired you this year in terms of the Wild Isles programmes. When I talked to Alice earlier this year when we were discussing this lecture, Alice suddenly found that she was very excited to be a citizen scientist, and that she could actually be a data scientist because she participated in some of these things. So it could be that you are the sort of person that might like to get involved with this. So with Treezilla you can measure and monitor trees. With iSpot you can record nature and learn more about the nature around you. We've got various projects that sit on nQuire. Other non-OU platforms are available such as Zooniverse as well. So it could be that you're actually actively measuring things, contributing to these sites, and maybe analysing some of that data for yourself. So the thing is that data is everywhere. So it's in every single thing that we do in all of our walks of life, whatever it is that we're interested in. So my question to you is how does this actually make you feel? Does it make you feel excited because there's all sorts of possibilities with how we might use all of this data to help forward our society? Or does it make you slightly nervous to think about all of this data being held about you? So if we go for the excited option first of all, because I don't want you to go out of here feeling miserable. So if we go for excited, maybe think about that health app that you've got there which is recording, as I say, all of the ways that you live throughout the day. What about if that was linked into your medical records? So in your health records you've got various flags potentially for different conditions that you might be susceptible to. What if changes in your health regime together with those flags actually sent some warning signals and this was potentially linked into your GP and some proactive interventions could be put in place?
Does that make you feel excited? That's one example of that isn't it but there are many other examples of how we could link data together just in terms of things like child protection, and actually linking all of the different services that we've got to flag things early on. Of course if we do that then we have to think quite carefully about who owns that data, and how we're going to use that data in an ethical and responsible way. So that's one of the things that I want to talk about. So in order to make sense of that data, and really in all of those different places that that data is stored, we need to link it together and we need to look for patterns to be able to answer those questions that we're interested in.

So that's what our data scientists do. So when you ask about 20 different data scientists what data science is you will get 20 different answers. But they will all be based around something like these diagrams here. So a data scientist is basically, for the purposes of this evening we're going to keep it simple, somebody who can use the computer science and IT tools that we have to be able to collect and store these vast amounts of data, put it in the right sort of format so that we can then use our mathematical modelling and statistical modelling tools to be able to make these links and patterns that we're interested in to answer questions, importantly in whatever domain it is that we're interested in answering questions about. So we're going to think about data scientists as being people that can take data from lots of different places, put it all together, look for patterns in that data and links to be able to answer questions hopefully to make society a better place.

So if we think about data in that sort of way, doing that very much linking of our data together, I'm going to introduce you to somebody called John Graunt. So there are a whole lot of different occupations who claim John Graunt as their father. So who’s John Graunt? Well John Graunt published for a start this book with the extremely catchy title back in about 1662 I think it was. What John Graunt was interested in was how the plague worked. There were lists of parishioners across London and what they died of was collected in lots of different lists. What John Graunt was interested in was if you put all of those lists together could he actually look for patterns for how the plague actually worked. So not only did he do that, he started to find lots of different patterns in these lists of data. So for instance, he found that if you went from one year to the next, the death rate stayed fairly constant from certain conditions. So one of those being suicide, maybe not so surprising there, but also other diseases as well. Whereas there were certain diseases such as the plague, other diseases, such as smallpox and so on, where the death rate changed quite a lot. So this is just one example of some of the questions that John Graunt was interested in and some of the things that he was able to do. So I would argue that along with the epidemiologists, demographers, actuaries, all of those people who claim John Graunt, I would also say that we're going to claim him as one of the first data scientists as well, because he was doing exactly this idea of taking this large quantity of data, putting it together, and trying to come up with patterns to answer questions that he was interested in.

However, we've not been talking about data science since the 1600s so potentially the first time that we really came across the term ‘data science’ really in the public consciousness was this particular report from the Harvard Business Review. So the Harvard Business Review then. Data Scientist: The Sexiest Job of the 21st Century, so all of you that are data scientists out there, Sophie, for instance, do you feel sexy, as in the 21st century? Excellent, good. So what this report basically said was 3 things. It said that we didn't have people with the right set of skills to fill all of these different jobs, vacancies, that were going to be out there in data science. It also said that we didn't have university courses in data science, and that there was no clear structure for where these people fitted into an organisation. So this was kind of a call to action to the community at large to think about these 3 things. We had an explosion of online courses, particularly in the States in data science and there's been many of us since this point that have been really thinking about where does the data scientist fit into all of these different companies and places that want to employ people to really answer these questions about our data. So moving on from 2012, what happened during the next 10 years or so is that people realised that maybe it was a bit of a big ask to have one person that had all of these sets of skills, particularly as more things were being developed in each of these areas. So what we tend to find these days, it depends very much on the company, is that we have a data science group of people, and there are people that work across that entire spectrum of data science.

So I'm going to tell you a little bit about what the data science process actually consists of, and therefore where all these different people might fit into this data science process. So there are a lot of different ways of describing the process and this one is the OSEMN process. The only reason I like this is because I was told it rhymed with possum and also if you say it in the right sort of accent you can also say it as the ‘arsesome’ process as well, so that kind of appeals to me as well. So basically what the data science process does is we have a question in our particular domain that we're interested in and that kicks off this data science process. So we need to go and obviously obtain some data, that's not quite as easy as it might sound. We then go through the process of scrubbing the data, and I'm going to talk to you more about what scrubbing the data is because this is the part of the process which I think is particularly important and where we need to give particular thought to how we do this in a very ethical sort of way. We then explore the data using some nice whizzy graphics, you're all used to that during COVID, you had all of those graphics that you looked at and thought, yeah, that's meaningful. My boys got very, very upset with me when I used to shout at the screen during COVID. Then we do the exciting part, because I'm a statistician, we take all of that data and we make some sort of model. So as I say we're looking at links and being able to answer questions. We then interpret that, ideally so that not just the company can understand what it is we're trying to say, so we're taking those data insights and making them into something that the company can work with for actions, but also ideally so that that can be communicated out more broadly as well. Certainly, as Kevin said, I used to work in the NHS.
I was always very conscious in the NHS that I wasn't just the person that was doing the data analysis, but I was also the custodian of the results that were put out there as well and not to have those misused in the wrong type of way in conversations.
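
As a rough illustration of the five OSEMN steps described above (Obtain, Scrub, Explore, Model, Interpret), here is a minimal Python sketch. It is not from the lecture: the data, the function names and the choice of a simple per-region mean as the "model" are all assumptions made purely to show how the stages hand over to one another.

```python
from statistics import mean

# Hypothetical raw records: (region, satisfaction score or None if missing).
RAW = [("north", 4.0), ("north", None), ("south", 3.0), ("south", 5.0), ("north", 2.0)]

def obtain():
    """Obtain: a real project would query databases or APIs; here we return toy data."""
    return list(RAW)

def scrub(records):
    """Scrub: drop records with a missing score and report how many were dropped."""
    clean = [(region, score) for region, score in records if score is not None]
    return clean, len(records) - len(clean)

def explore(records):
    """Explore: group the scores by region so they can be inspected or plotted."""
    by_region = {}
    for region, score in records:
        by_region.setdefault(region, []).append(score)
    return by_region

def model(by_region):
    """Model: the simplest possible model, a mean score per region."""
    return {region: mean(scores) for region, scores in by_region.items()}

def interpret(estimates, dropped):
    """Interpret: turn the numbers into statements, including what was left out."""
    lines = [f"{region}: mean satisfaction {value:.1f}"
             for region, value in sorted(estimates.items())]
    lines.append(f"{dropped} record(s) had no score and were excluded")
    return lines

def run_pipeline():
    clean, dropped = scrub(obtain())
    return interpret(model(explore(clean)), dropped)

if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

Note that the interpretation step reports how many records were scrubbed out, which reflects the point made above about being a custodian of the results as well as the analyst.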

I'm going to take this data science process and split it into 2 broad bits there. So we're going to take what I as a statistician mostly work with which is the Explore, Model, Interpret end of this. So as I say, the exploring end is looking at that data, doing some initial analysis in there, making some nice, hopefully useful graphics out of that, and ideally using those graphics to try and really carry on the conversation within that domain, because it could well be that there are other questions that come out which we need to think about in a different sort of way and think about our data in a different sort of way as well.

So I'm aware not everyone's a statistician in the room. So when we're talking about modelling this, this could be something like a classification type of model where let's say we're taking a product and saying that we can classify that into whether it's popular or not. We might be doing some sort of clustering there, by clustering together, let's say, the purchasing history that people have. It could be that we're doing some sort of regression where let's say we're looking at something like satisfaction that people have in a particular product that could be based on a whole range of different things, such as the availability, whether it's reliable, right down to the colour of the packaging. So it's that sort of thing that we're looking at down there. As I say, the interpretation comes at the end and also all the way through that process.
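
To make the regression example a little more concrete, the sketch below fits a straight line by ordinary least squares relating a product's satisfaction score to a single reliability score. The numbers and variable names are invented for illustration, and a real analysis would use several predictors (availability, packaging colour and so on) and a statistics library rather than this hand-rolled fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted satisfaction for a given reliability score."""
    return intercept + slope * x

# Hypothetical data: reliability rating (1-5) and satisfaction rating (1-5).
reliability = [1, 2, 3, 4, 5]
satisfaction = [2.0, 2.5, 3.5, 4.0, 5.0]

slope, intercept = fit_line(reliability, satisfaction)

if __name__ == "__main__":
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
    print(f"predicted satisfaction at reliability 3: {predict(slope, intercept, 3):.2f}")
```

The slope is the estimated change in satisfaction for a one-point change in reliability, which is the kind of link between variables that the modelling step is looking for.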

We're going to label this part of the process as the data analytics part of that process. We're going to do that for a reason that you'll see in a moment. So that's when we have the data, doing all of that modelling to answer some sort of questions. The other part of that is the obtaining and scrubbing the data in the first place.

So obtaining the data, as I've alluded to there's data out there that's collected in a variety of different forms and it's also when we're obtaining that data, it's not just data that already exists, we might need to think about how we go out and collect more data as well. Because you've got so much of it and in lots of different places, you need some good computer science and IT knowledge in terms of knowing about distributed storage databases, how to link all of that together. So it's not quite as easy as just having a nice spreadsheet then press a button and away you go. There's much more to it than that. Whatever size your dataset is your dataset will be messy. That's just how it is in life. But now when we're talking about a dataset being messy, we're talking about the fact that we've got it in these very different formats. So we've got image data, we've got audio data, text data, as well as data that is nice numeric that we can just do something with. So we need tools to be able to put all of that data together in a format that we can actually work with. That's fine. We've got the tools to be able to do that. What's more of an issue is that our data is messy in such a way that we have lots of missing data. When I'm talking about missing data, I'm not just talking about the fact that there's particular questions on a survey that you haven't filled in for whatever reason, you've decided not to fill them in. I'm also talking about the data that is just not there because we've not collected it. So if we think about that simple example with Treezilla where people go out and measure their trees, there's going to be whole areas of the country where we haven't got data. That's not necessarily because there's no trees in that area, it could just be that people haven't been out and measured the trees in that area. So if we're making some sort of decisions based on that fact, we need to think very carefully about those places where we haven't got data. 
That's a simplistic example but you can extrapolate that out to think about areas of our community where if we're not collecting data about them, but we're making policy decisions for the whole of the UK, then actually those models may be slightly inaccurate, shall we say. So what I want us to think about in terms of ethics is not just whether we are working ethically with all of those privacy and GDPR laws, but actually whether we're thinking about the data we've got, and how we model that in a very ethical way. We need to do that really by thinking about some of the data that we might not have. So that's a quick basic romp through what a data scientist actually does, very simplified down there as well.
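
The Treezilla point about data that was never collected can be sketched in a few lines of Python. The region names and tree heights below are made up, and this is only one simple way of checking coverage: it counts records per region against a list of regions we expected to hear from, so that areas with no data are flagged explicitly rather than silently dropping out of any summary.

```python
# Hypothetical list of regions a national summary is supposed to cover.
REGIONS = ["north", "midlands", "south", "wales"]

# Hypothetical tree records: (region, measured height in metres).
RECORDS = [
    ("north", 12.0), ("north", 14.0),
    ("south", 20.0), ("south", 22.0), ("south", 24.0),
]

def coverage_report(regions, records):
    """Count how many records were collected in each expected region."""
    counts = {region: 0 for region in regions}
    for region, _height in records:
        counts[region] += 1
    return counts

def unobserved(counts):
    """Regions with no records at all: absent data, not absent trees."""
    return [region for region, n in counts.items() if n == 0]

counts = coverage_report(REGIONS, RECORDS)
missing_regions = unobserved(counts)

if __name__ == "__main__":
    print(counts)
    print("No data collected for:", ", ".join(missing_regions))
```

Any average computed from these records describes only the north and south, so reporting the list of unobserved regions alongside it makes that limitation visible to whoever uses the result.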

But if you want more information the Royal Statistical Society have got this website called Real World Data Science. On there there's various case studies and there are also career profiles of people that work across data science as well as within the different areas. So go and check out that site. I think the slides are available afterwards as well so if you want to click on the links afterwards you can do that.

So hopefully you've got the idea that it's not easy and it's not easy because of what I've alluded to, the fact that this data doesn't come in nice, neat packages. So we have what I have called big dark data, which is basically putting together two different things.

So we'll talk about big data first of all. So big data is a phrase that's been around for some time there. People tend to think of big data as being just that, ie, there's lots of data which is true, and there are issues to the fact that we've got lots of data. We need to think about computer storage and so on and how we link those together. But big data is more than that. There are three Vs that are important in big data. Volume, the sheer amount of data, is the first, so the other two are the velocity and the variety of that data. So by velocity what I mean is the speed that this data is actually changing. So if you think about a store that is wanting to stock its shelves with the right sort of goods, they need to base that on some sort of knowledge. So that can be on what they always stock at particular points during the year, on the history that's gone on in the last few weeks in terms of buying because obviously with the amount of money that's around in the country or not that's going to change. But also there are other things. So one of the big book moments for a big supermarket is predicting when that first barbecue of the year is going to be. So if they get that right, they make an awful lot of money. If they get that wrong, they've got an awful lot of meat that they need to shift within the next few days. So that is also based isn't it on the weather, also whether there's some sporting event coming up. There's all sorts of things that need to go into that and that's data that's changing all the time so we need to be able to cope with that sort of situation as well. The last V is variety and I've talked about this in terms of the fact that we get image data and click data, text data, and we need to put all of that together in some sort of format.

So I'm going to do two book plugs. There are lots of good data books out there. So you don't have to just come up with these two books, but I feel I can plug these two books because both of these authors have got a link with the OU. So Timandra Harkness is a brilliant comedian, writer and also BBC radio broadcaster as well. So if you haven't come across her do check her out on BBC Radio 4. She also has a Maths and Statistics degree from the OU as well, and she's also the person who first introduced me to John Graunt. So I've got quite a lot to be thankful to Timandra for. So if you want a nice easy read on big data, which is quite a fun read as well, check out Timandra’s book.

The other one is this book by David Hand called Dark Data. So this is the other part of data that I want us to think about. So David was also a Professor of Statistics at the OU from the late ’80s and during the ’90s, before he moved off to Imperial. So he's written this fantastic book that really talks about those areas that I've just discussed in terms of thinking about the data that we don't have and the fact that the data that we don't have is just as important as the data that we do have. What he has done is basically classified the ways that we can have this missing data into 15 different ways that I don't have time to go through today so you have to go and buy his book in order to be able to do some of that. But if we think about, as I say, I had that example of not collecting all of the data from particular pockets of our society. It wasn't that long ago that we had image recognition software that was incredibly good if you're a white male. It was much less good if you happened to be a black woman. That is exactly because it hadn't had enough data of the right types to be able to learn what it was doing in terms of those algorithms.

Another one that you'll all be familiar with, the fact that definitions change over time. So you will remember during COVID that suddenly we got a big difference in terms of COVID death rates. That wasn't because more people were dying or not dying, that was just because we redefined what it meant to die of COVID. So it's all those sorts of things that if we're not careful and really think about them properly, are going to affect the model that we build. Now that may not matter so much but it does matter if then we're putting in policy based on that model and are really wanting to live in this evidence-based society where we're making policy decisions that are based on all of this data.
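
The effect of a changed definition can be shown with a small Python sketch. The records and both definitions here are illustrative assumptions rather than figures from the lecture: suppose one definition counts every death that follows a positive test, and a revised one counts only deaths within a fixed window (28 days is used below as an example value).

```python
# Hypothetical records: days between a positive test and death for seven people.
DAYS_TO_DEATH = [3, 10, 27, 28, 45, 90, 200]

def count_any_time(days):
    """Definition A (assumed): every death after a positive test is counted."""
    return len(days)

def count_within_window(days, window=28):
    """Definition B (assumed): only deaths within `window` days of the test count."""
    return sum(1 for d in days if d <= window)

count_a = count_any_time(DAYS_TO_DEATH)
count_b = count_within_window(DAYS_TO_DEATH)

if __name__ == "__main__":
    print(f"Definition A: {count_a} deaths")
    print(f"Definition B: {count_b} deaths")
```

The same seven records give two different totals, so a time series that switches definition part-way through will show a jump that has nothing to do with how many people actually died.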

So this is why I really want to think about the professionalisation of data science and this is work now that I've been doing for about 3½ years. So if we can really think about our data scientists working in a very professional way and thinking about all of these ethical challenges that we've got, then the goal at the end of this is that the public trust all of these people that are involved in all of the processes that go through that data science spectrum, so collecting, storing, using and making decisions based on our data.

So that means that we've got public trust in the whole plethora of titles that now exist across this spectrum of data science, and there's more of these exciting titles coming up every day. So the work that I've been doing then is really trying to look at whether we can create standards and create ethical standards for people working across that whole breadth of data science. So I've been involved in something called the Alliance for Data Science Professionals, as I've said for about the last 3½ years and brand change, this is now the Alliance brand. So what the Alliance for Data Science Professionals is, is basically four learned societies who all have members who would call themselves data scientists. So we've got the Royal Statistical Society there, the Chartered Institute for IT, that's BCS, The Institute of Mathematics and its Applications and The OR Society. Alongside those four learned societies we also have two institutes which are The Alan Turing Institute, and NPL which is the National Physical Laboratory. We've been working together, as I say, to try and create these standards that we can use across our four learned societies and more broadly, so we've got a common set of standards that we can work to in order to give professional accreditation to people that work in data science, be that at the data analytic end or the data engineering end or somebody that works across that whole umbrella.

The reason that we started to do this work is basically as a call to action from this report that came out of the Royal Society. So this was back in May 2019 when they wrote a report called Dynamics of Data Science Skills. Their call was really to develop data science as a profession and really try and establish what they hoped would be industry-wide standards for individuals working as data scientists, as well as looking at accrediting university data science courses. So this was fairly quickly picked up by the government and it's been written into their National Data Strategy that was back in December quite a number of years ago, but the government have picked it up at various points since. The latest one being in the Parliamentary Office of Science and Technology, which has also referenced the work of the Alliance there and the need to really think about professionalisation of people across all of these different roles.

So we started off this work basically with a phone call between the British Computer Society and the Royal Statistical Society. That was way back in December 2019 and we decided the best way to tackle this was to really go out to the community and have a massive face to face workshop where we brought all of the stakeholders together who worked as data scientists across these different learned societies that might be interested, as well as people also from academia. We planned for that to happen in March 2020. So unfortunately we didn't get our big face to face workshop but this was potentially quite a good thing because it meant we could move all of this online and it actually meant that we got much more ongoing engagement with all of these people throughout this process than we would ever have done if they had had to travel down to London the whole time to do this. So I'm really grateful to many people in this room and online for having done all of this work because it wouldn't have happened without a huge number of people here doing this work.

So we spent about 18 months creating these standards, and also mapping these standards across various other places that were out there including AI, as well as the more statistical end, various apprenticeships and we came up with these 5 different skill areas. So initially what we wanted to do was to look at people who were already working as data scientists. We thought that would be the best way to start because they could be our pioneers, and then evangelise about this to everybody else in the UK and beyond. So the standards for individuals tick off these five different skill areas and people have to show their competency across these five skill areas to one extent or another.

As I say, we're very conscious of the ethical aspects of doing data science so they have to show their competency with it being based in an ethical way. So that is they have to think quite carefully about the ethics to do with things like data collection, whether that data is valid for the intended purpose and so on, about those models, again thinking about whether we've got the right sort of data there. Are they fair models, is there bias in there? Also, something I alluded to earlier in terms of those communications, whether in fact those results that come out of that could actually be misinterpreted and misused. So that's on top of what we normally think about ethics as being, which has been those relevant laws and permissions of usage of data. I'm not saying that's not important because it is definitely important but quite often all these other things are thought of much less by not just the people working out there as data scientists, but also when we as educators put together data science degree courses. So I think that's something that we need to think about a little bit more.

Okay, so the story of the Alliance then. As I say we spent about 18 months creating these standards with a whole army of people. So we got to July 2021 and we were in a position then where we’d pretty much got these standards there so we were able to go out and formally announce that the Alliance existed with a press release. We also had to do things like put in place a Memorandum of Understanding as to how all the learned societies would work together in beautiful harmony to make sure that they accredited people to exactly the same level. So that was not easy to do that but we all worked together with this common goal to be able to make this happen. So it was great, in terms of actually doing this. We then spent the next year or so refining the process so we could put people through this accreditation type of process down there. So we spent a whole year doing this with, I was going to say 13 guinea pigs, we're going to go for 13 pioneers, not least because one of them is in the room. So I'm going to talk about them as being a pioneer rather than a guinea pig. So this was across all of the learned societies that were members of the Alliance, because we needed to make sure it worked within all our different processes. We had an event which I was clearly still wearing the same dress for, I have got changed in the last year you'll be pleased to know. I do own more than one dress as well. So we had a big event down at the Royal Society where we presented these certificates to our 13 people. Those of you that can count quickly, there aren't 13 people there, there's another 3 that didn't quite make the event and we launched this to the great and the good at a big posh event down at the Royal Society.

So this is our Advanced Certificate in Data Science. So that's kind of where we are at the moment. So I'm going to give you where we are and the next chapter. In order to make this a fully-fledged chartership, what you have to go through is an entire process with the Privy Council. So we are some way through that process at the moment. So this will eventually become a chartered data science certification that you can put a post nominal after your name. Hopefully it won't be that long before that is actually in place. As I say the other arm to this was looking at accreditation for universities. Again, the standards are in place, we're in the final tweaks of the processes in order to be able to do that and we've got a lower level of certificate called our data science professional. What we're imagining those people to be is to have the competencies equivalent to somebody who would have come out of an accredited data science degree. Very excitedly, and quite nervously for me, we've got some new members joining the Alliance. So we'll have a press release about that later in the year in terms of new societies and institutions that are joining us.

So really because somebody thought this was quite amusing in the abstract, is it just statistics with a fancy title? Well hopefully I've convinced you that data science isn't just that and if you wanted more proof you can go on to the Royal Statistical Society professional accreditation, because we have these two different pathways, one taking you to be a chartered statistician and one taking you to be an advanced data science professional.

So I want to end with some opportunities. I've put ‘challenges’ in small letters, ‘opportunities’ in big letters, because that's what you're supposed to do isn't it with these things. So I'm going to advocate for more interdisciplinary working. So when I'm talking about interdisciplinary working I'm talking about proper interdisciplinary working where we understand each other's language, we understand what each other does in terms of the bits of the process that they are working in. I’m going to ask for a real focus on ethics, as I say, not just in terms of those laws, but in terms of the whole process of doing data science. A plea for increased data literacy. As I say, COVID was great because everybody thought you were a bit of a rock star if you were a statistician. It was the only time in my life that people went ‘Oh, you're a statistician. Brilliant. Can I ask you about the R number?’ ‘Yes, you can ask me about the R number.’ and so on down there. So I want us to hopefully go away evangelising from this room about data, mainly so that when I go to a party and people ask me what I do and I say that I'm a statistician, they won't go ‘I was really bad at maths at school’ and walk away, but they'll in fact want to have a really interesting conversation about data with me. So it's all about me really. The last one down there is a real plea in terms of outreach into schools. So we have a plethora of jobs that we cannot fill in data science at the moment. We've had all sorts of initiatives in terms of trying to retrain people and that's great and those are gaps that we need to plug. But unless we get that pipeline coming through, we're always going to have this problem at the top. Let me tell you, if you go out into a school and say, ‘Do your kids want to be a data scientist?’ the teachers will look at you and go, ‘A what?’ They have no idea. 
My husband does because he gets it all the time, but they have no idea generally that data science actually exists, which is a bit of a shame because it's quite a high salary job. It's got really good job prospects out there. With my women in STEM hat on, it's also quite flexible as well and if we go about this in the right sort of way thinking about our language, we can actually talk about this as being a really creative job. We don't need to talk about it as being ‘this is something where you just do lots of coding.’ We can also talk about it in terms of whatever it is that these kids are interested in, there is data in whatever it is they're interested in. So be that the environment, health science, social justice, as well as big tech, film industry, gaming, whatever it is, there is lots of data out there. So there is absolutely no reason why we can't excite children in a career in data science. So that's where I’d like to finish, go out talk about data science, particularly to school children, because I'd like to live in that world where we can make those links and make our world a better place by harnessing that data. But ideally, we want to do it in a very ethical and responsible sort of way. So thank you very much for listening to me for 40 minutes, George.

Kevin: Rachel, thank you. Thank you so much for a wonderful inaugural. We are going to have a chat. So get ready with your questions if you wouldn't mind. Fantastic. So I think we've got roaming mics.

Tarek: Thank you very much for such an interesting and inspiring talk. My name is Tarek. I'm a lecturer here at the OU and proud to be. I have a question. I don't mean to drag you into controversy, it's an open question so answer it to whatever degree you think is worthwhile. But I'm interested in from your experience and the way you work, especially with the NHS and your roles as an academic, the relationship between data scientists and policymakers when it comes to evidence-based decision making, and what I've heard and I like the term, decision-based evidence gathering. So I wonder if you can kind of wade through that for a while.

Rachel: I've been to quite a few of the government data skills conferences that they have at the moment and there's a real drive at the moment in terms of the government really looking and harnessing data in a much more cohesive way and trying to have a real feeling that they can work across different departments to be able to make this happen. So they've got a roadmap for doing this. So this is really at the heart of what they want to do in terms of turning things around. I mean whatever happens down the road at the next election is another matter. But I do think at the moment, not least because of having the new department down there as well, there is a real focus on science and we've got a lot of people around that top table who are taking these issues quite seriously. So just at the moment I do think there is a willingness and wanting to work in that way that you're talking about. That's a short answer and maybe we can pick up separately.

Kevin: Very good. Thank you. We have some people online so we might have an online question.

Speaker: This has come in via the mailbox. It's from Stanley Robinson who is a student. Is it ethical to convince people that their digital data will not be mishandled?

Rachel: Well, I think it's ethical to be transparent about how data is handled. So I think that is at the heart of all of this, that you need to be transparent about how that data is collected, and what the use of that data is going to be down the road. So my answer to that is really thinking about the end usage of that data and making sure that we're very transparent about that.

Sophie: Thank you, I'm Sophie Carr. I was the guinea pig, but I forgive you because you called me sexy so that's okay. I was really interested because what gives me joy is working with you at the Royal Statistical Society where we both get to be VPs and you mentioned data literacy. One of the things I get to talk about is education at the Royal Statistical Society. So if you were going to start improving everybody's data literacy, which is a bit like trying to boil the ocean, where would you start and if you can only increase one bit of literacy in society which bit is it?

Rachel: So I think we kind of do a reasonable job at the primary school end in terms of really engaging kids. So even at a university level when I'm teaching statistics I try and get people to be involved in the data collection, because people are always far more interested in doing anything with their data if they feel like they've collected it in some sort of way. I'm not talking about going out there and counting how many red, blue and green cars there are. I'm talking about much more interesting types of statistics that we can do. So I think at a primary school level we do that reasonably well. Then we seem to lose that as people transition from primary school to secondary school. I think it's partly to do with the fact that then we focus very much on whatever subject area we happen to be in, rather than really thinking that data crosses across all of those subjects, particularly in the social sciences, as well as the sciences as well. We also then start to get boys like mine that go, ‘Well, what's the point in learning maths, because it's just like there's a trigonometry function in there’. So you kind of lose the point of the data being at the heart of all of this. So if there was one win I would say it’s that transition to secondary school, and really thinking about that curriculum that we might have in secondary school, and where data science not only fits within the maths curriculum, but also more broadly across the board. But to do that you need people to teach it as well.

Speaker: Thank you. While you were talking I vaguely remembered the Cambridge Analytica scam, which I don't remember too well about except that it was a scandal. What I do seem to recall is Cambridge Analytica was doing a job so basically it was a client employer job, and a very successful job in a sense. So how does your framework of ethics play into a situation like that?

Rachel: So the Cambridge Analytica scandal, I am going to say something that probably isn't right here. But basically it was to do with Facebook really using that data in ways that people hadn't given permission to use that data for. So some of that data was being sold off, mainly for political gain in terms of elections. So it goes back to the question that was asked online really earlier, in terms of transparency about the use of what that data was. So to give you one example. In my work as a medical statistician you have to be very careful in terms of setting forwards the protocol of exactly not just the data that's going to be collected and how it's going to be analysed, but what the end purpose of that is. If you happen to find something quite interesting along the way, you can't necessarily do that analysis as part of it and report it as well. So the ethical framework in there is partly to do with making sure that we are, as I say, custodians of that data, and really making sure that we're using it for the purpose that everybody that signed up thought it was intended to be.

Speaker: Yes, nobody would disagree with that but what should Cambridge Analytica have done in an ethical framework? Should they have said, ‘Oh, we won't touch this job.’ because they were actually hired to do a job and so their ethical point of view was to do that.

Rachel: But their ethical downfall in that was the fact that they were using data which in fact nobody had agreed to the data being used for that particular purpose.

Speaker: So if Cambridge Analytica were to do this now and they had a moral sense, they would say, ‘oh, sorry, we won't touch this job, no matter how many millions you offer us.’

Rachel: I would sort of hope that they would because they got a fairly bad press from doing it last time around. So there's a kind of weigh up isn't there in whatever company you're in between the amount that you might be being paid for the job, and the amount of reputational damage that comes along with doing something that really is quite unethical.

Speaker: In a way it's a bit of a follow up to that, isn't there something naive about asking for people to know what the purposes are in advance? Because surely some of the power that's come out of data science is reusing data for things that nobody actually knew that it was important for. So even back to John Graunt, that data wasn't collected for the purpose he actually used it for. He did something novel with it.

Rachel: Yes, absolutely right. But there's a difference between looking for patterns within that data set and there's a difference between exactly what you're talking about and selling it for a different use. This is partly going back to the question that was asked earlier on in terms of where are we in terms of evidence-based and linking those data sets together. We need to be much clearer about how not just the purpose of that particular data set but knowing the advantages. So it's again about being clear in terms of transparency of communication. So at the moment we've got a very difficult place that we're in, that people are terribly worried about handing their data over precisely because of things like the Cambridge Analytica scandal. So there's no way at the moment that you could do something along what I alluded to earlier in terms of linking that data together with your NHS records. There's no way at the moment that government departments can link across different areas to really identify early on whether we've got children that are vulnerable and could essentially do something about that. So it's about doing this in an ethical way but really thinking about what people have signed up for for that data set. But if we are going to move to this situation where we can link that data with another data set in another government department, being clear about the benefits that this is going to bring to society. So it goes back to having public trust. So we're only going to be able to do this if we don't have those sort of scandals and people start to really trust that their data is going to be used for the public good and not for purposes that are maybe not so ethical, shall we say.

Neil: Thanks very much and Rachel thank you for a really interesting and inspiring talk as your inaugural lecture. It was just really exciting and on the nail for me. I'm Neil Stansfield. I'm from the National Physical Laboratory, which is a national laboratory supporting the Department for Science, Innovation and Technology. So we're at the heart of the debate in government about the importance of professionalising data and that evidence-based decision making that was raised by a colleague. My question for you. I've just come from meetings with our defence and security colleagues talking about AI and the impact of AI on the role of a data scientist. Counterintuitively it seems to me that the more the public trust in AI the more we need professional data scientists. I wonder if you had a comment on where AI takes the data science profession.

Rachel: Thanks for an easy question there Neil. So I think again it goes back to if we think about AI as being just those algorithmic how we're using that data to make those decisions. So I think one of the things is very much what I was talking about before, that we need to think very carefully if we're going to use AI in a much more serious way about all of that data that we haven't got, because however we're training these models, we need to make sure that we've got data across the whole of our society so we're not making that evidence-based policy which isn't actually based on the right evidence. So I think there's a real role in terms of professionalisation there so we're not just going down this rabbit hole of yes we can do these exciting AI models. But let's think about quite definitely what is going into those models to be able to produce those decisions out at the end of it. So I know there's work there at the moment in terms of creating standards, really thinking about the roles that AI has. So it's about making sure all of these things are brought together and I think the right people around the table. I know Matt's been involved in some of this work from The Turing Institute as well in terms of really making sure that we're not just talking about this in a different way. I think also AI has become quite a, how can I put this, quite a kind of sexy thing to be talking about as well. So I'm not sure when we're talking about AI we’re necessarily really talking about the whole of AI, but are really talking about that machine learning aspect. So I think we need to be quite careful when we're talking about data science what bit of AI it is that we're really talking about as well, because I suspect AI is being used when it really means machine learning.

Kevin: Super, thank you very much. Well we're up on time perfectly there so we're going to have some refreshments back out into the foyer. Can I thank you all for coming along. Great questions, and really nice to see a great audience here on campus. Finally, can you just join me in thanking Rachel for a fantastic inaugural. Thank you.
