Episodes
Monday Apr 15, 2024
Professor Sabina Leonelli (Professor of Philosophy and History of Science) talks to Dr Chris Tibbs, Research Data Officer at the University of Exeter, about open research, the use of Artificial Intelligence in research, and the importance of understanding the diversity of research environments when implementing open research practices.
Podcast transcript
Chris Tibbs: Hello and welcome. I'm Dr Chris Tibbs and I'm the University Research Data Officer, part of the open research team based in the library here at the University of Exeter. My role involves providing support for researchers across the university as they work with and manage their research data, and today I have the pleasure to be joined by Professor Sabina Leonelli, a Professor of Philosophy and History of Science at the University of Exeter. So welcome, Sabina. Just to start, would you like to tell us a little bit about the research area that you work in?
Sabina Leonelli: Thank you and hello everyone. So, I'm interested in the dynamics of research and research processes. Why is it that people who work in science use the methods that they use, handle data in particular ways, decide to publish in particular ways? Why do they choose certain research goals, and how does that occur historically but also conceptually? And what are the social implications of those choices?
Chris Tibbs: That's very interesting. So you're really looking at these different approaches and methodologies that different researchers are taking, and that's very interesting because obviously different research areas will have different approaches and methodologies that they use. Now one thing that I noticed that you're very interested in, based on your web profile, is obviously open science and openness in research, and the European Commission and the United Nations, among others, all use this term of open science. Just so that everyone listening is clear, open science is the approach to research based on openness and co-operative working, and it really emphasises the sharing of knowledge, results and tools as widely as possible. But I just wanted to point out also that these approaches can apply to all research disciplines, not just science. And so, for example, we are the open research team, and I tend to regard open research and open science as synonymous. So, I just wanted to get your take on this, Sabina. Do you see these as separate terms, or do you use them interchangeably?
Sabina Leonelli: I also tend to use them interchangeably, but I think it is very unfortunate that it's the term open science that has gotten so much mileage in English, because the term does tend to be taken to refer to the natural sciences, more rarely to the social sciences, and never to the humanities and the arts. And this is different for lots of other languages. I guess most famously, the term Wissenschaft in German tends to encompass all of the research domains, including the humanities and the arts. I'm very partial to that, partly because I think that we're in a moment where research is so interdisciplinary and the boundaries between domains are so blurred, that actually making strict distinctions between what counts as a humanist approach, and what counts as a natural science approach, or a mathematical approach is becoming more and more difficult. As of course it has been throughout history. So yeah, I'm very partial to the use of the idea of open research in English, but of course we tend to use the term open science a lot too, because this is, as you were saying, very well recognised by policymakers and by funding bodies and a lot of people working in academia more generally.
Chris Tibbs: OK. Well, thank you for explaining that. And again, the reason I just wanted to confirm this is because I want to ensure that everyone listening can be clear about what we mean by the term open science, and that they don't feel that this doesn't apply to them, maybe because they don't see themselves as a scientist. So that's what I just wanted to clear up: these practices do apply to all disciplines. Now moving on. Sabina, you hold many different roles, and one in particular that I would like to mention is that you are the theme lead for the data governance, openness and ethics strand of the Exeter Institute for Data Science and Artificial Intelligence. So, given this particular role, I'd really be interested to hear your thoughts on how you feel artificial intelligence can play a role in the research process, particularly around openness and open research.
Sabina Leonelli: Yes, thank you. So, I guess openness lies at the heart of what it means to do research, no matter how you look at it, right. I mean, doing research basically means trying to answer a certain question, trying to solve a problem that you may have encountered in your everyday life, and within the more scientific landscape, it means doing it in a way that's a bit more systematic, that is susceptible to scrutiny and can be evaluated by others. So, research and science are public enterprises pretty much by definition. If it was only something that, you know, one individual does in their own room, then it wouldn't really be something that we count as being research or science. And in that sense, openness, the availability of the outputs of research, being able to discuss the methods one uses, the procedures one uses, and make them available for scrutiny, really is what defines the very idea of research. So, given this, of course, the fact that we now have an emergence of more and more artificial intelligence tools that can be pretty much directly applied to the research process affects the ways in which we think about openness, because this means accelerating the research process to some extent. It means the potential to automate some parts of it, or at the very least work together with machines, so that some of the, if you want, hopefully at least more tedious tasks associated with research, the more repetitive ones, the ones that are more easily standardised, can actually be delegated to machines and then can be iteratively dealt with in collaboration with humans.
So in that respect there is a strong temptation to think that bringing AI into research processes will almost automatically improve the openness of research, because it will allow, and in fact almost incentivise, people to make their methods ever more transparent, to make their data more available, and to be more careful in noting down the procedures that they're using and making them available to others, because all of those strategies make it easier to adapt research work to AI and to make it machine readable, if you want, so that machines can actually take over some parts of that work. The problem in assuming that AI automatically enhances openness comes at several levels, however. First of all, there is this problem, which I'm sure many people have heard about, of opacity in AI: the fact that a lot of the reasoning that machines go through to produce certain outputs, so the type of algorithms that are used, particularly in machine learning, tends to become less and less transparent and more and more opaque as time goes by, precisely because the machine is doing operations that humans wouldn't quite do or wouldn't be able to follow in the same way, and we can't quite track every single step that the machine is making. Then actually the system that is AI powered becomes by definition less open, because it's less obvious how we read the system, how we make it more transparent, more scrutinizable and more open for review, given that there are all these parts of the system which are not necessarily intelligible to humans. So that's one issue that is happening in the area. Another very big issue is the fact that many of the providers of artificial intelligence technologies, and particularly of tools that are then applied in research, are private companies. One need only think of large language models and tools like ChatGPT, which is produced by a company which, contrary to its name, is not publicly funded but is a private company, OpenAI.
Many of those tools are privately funded. The ways in which they operate are even less transparent, because a lot of the algorithms that are used are actually trademarked and not available for public scrutiny. A lot of the training data that is used to refine those algorithms is also not necessarily transparent; in some cases it is not even entirely clear that the data are, you know, in fact rightfully theirs to use, since the data tend to have been scraped off the internet in a variety of ways that may or may not be ethically acceptable. And so, we are in a situation where whenever we pick a tool from the internet thinking, oh, great, this is going to help me to do my bibliography, this is going to help me to write my essays, this is going to help me to search for, you know, literary sources on a particular topic, we are immediately giving away data and relying on tools that are not openly accessible, that have been developed in a way which is not immediately scrutinizable, and that may in fact be using some of our own information in a way that is not open. So, let's just say that there are quite a few open questions around whether the use of AI in research is in fact favouring openness, and whether that's going to happen more and more in the future or is in fact going to have the opposite effect.
Chris Tibbs: Wow, that's really interesting. I really love that: you might think that it's going to help make research more open, but actually, in fact, it's not so clear. The one thing I just wanted to add is that any data being used by AI systems need to be well maintained and well documented: you need good data going into training the models to help ensure that the outputs are accurate. And so, as the saying goes, without data there's no AI. So, it's really reliant on having good data at that point. I just wanted to move on because, like I said, you're working on many different projects, but one of them is a European Research Council-funded project called A Philosophy of Open Science for Diverse Research Environments, and given that title, I just wanted to hear a little bit more about this specific project and its aims, particularly because we've been talking about open science on this episode.
Sabina Leonelli: Yes, thank you. So the project starts from an observation which comes from many years of me collaborating with scientists and studying lots of different research situations, especially in the domains of the life sciences and the biomedical sciences: depending on what topics people are working on and what they're interested in, but also the specific locations they're working in and the materials that they're working with, they tend to work in very, very different ways, and some forms of scientific research are highly dependent on cutting-edge technologies, for instance, while others are not. If you're doing work in molecular biology, you may really want to have access to the latest model of genomic sequencer. If you are doing observational field studies, looking at how particular varieties of crops are growing in response to certain environmental conditions, you may in fact not be that reliant on that kind of technology and need other things. And similarly, you may have a situation where people are served by very good infrastructures, for instance a very reliable broadband connection and reliable access to research facilities, and situations where people don't have such reliable resources but maybe have other things going for them: for instance, they have access to very particular kinds of flora that people in other parts of the world don't have access to. So there tends to be a huge variation in the conditions under which researchers can do very good research and produce really important knowledge. And this is not always something that is recognised by people who are writing about what it means to do research and what we mean by research methods, like indeed people in my discipline, the philosophy of science, but also policymakers, people who are thinking about the research landscape, how to fund it, how to support it.
So, you know, funding bodies, for instance, and the publishing industry tend to assume that people have a certain way of working and a more standardised approach to, say, their experimental techniques or the kind of materials they may have access to, and this creates an issue when it comes to thinking about open science, because one of the interventions we're trying to make when we're trying to make science more open is to come up with guidelines and principles and policies that are going to push researchers, and especially research institutions, to incentivise the openness of their work. So: actually put some effort into curating your data and making sure that other people can see what data you produced to back up your research, or put in some effort in looking at research outlets which many people can access and that are not just restricted to, you know, the very few people who have a subscription to a certain kind of journal, or make sure that when you're producing research, you're actually talking to people who may not be academics, who might be people who are interested in the kind of research and have some expertise in it, but are not working in academia. So, all of those are aspects of open science. But the ways in which they manifest in research may be very, very different depending on the conditions on the ground. So, there is very often a tension between trying to produce some generalised guidelines that can provide general incentives for people to, for instance, share their research materials, and the fact that when you look at different situations in different research domains and different locations, those very same generalised guidelines may in fact prove to be problematic or sometimes downright damaging.
For instance, very simply, when one thinks about sharing data, it looks on the surface like this great thing: everybody should do it, it's going to help research as a whole. But in fact it depends very much on who is then liable to pick up that data and do something with it. If, for instance, you are a researcher working in ethology, and so you are in the business of spotting rare animal species, studying them and understanding their behaviour, particularly species which are at risk of extinction, then making all your data immediately public, including location data, so that you basically tell poachers where to find rare animals, how to locate them and potentially kill them all, may not be the most useful thing for you to do. Or, in other situations, if you are the kind of researcher who's working with sensitive data, such as, most obviously, human data, personal data, you have to be very, very careful about which of those datasets may be beneficial to share with other people and which may actually lead to unwanted implications. So, let's just say that the point of the project is to look at what happens at different locations of research. So, we're working with researchers in India, in Ghana, in Italy, in Greece, of course in the UK, in Germany, in Brazil and various other locations, and looking at what they're doing on the ground, what challenges they are encountering, and how they interpret the notion of openness in a way that would actually be helpful to them for the work that they're trying to do.
Chris Tibbs: So it's obviously clear that this one-size-fits-all approach just doesn't work across, as you mentioned, the different environments in which research is taking place. So, what can we do? How can we avoid this? What needs to be done to take those differences into account?
Sabina Leonelli: Well, I think the answer comes from lots of different levels, because there are so many people involved in the landscape, and that's actually a good thing. It's not just the responsibility of one agency or one person. So, at the level of research institutions and especially scholarly societies, I think it's very important that each domain of research tries to think very hard about the specific needs and the particular situations likely to emerge in relation to that research. Working on phenotyping in plants, for instance, is going to involve very different requirements than working on animal research in labs, or on clinical trials in biomedical research, or on drug development in a synthetic environment. It's very important that there is an effort by researchers and the institutions within which they're working to think through what the actual requirements for their field are and how one best thinks about openness in those situations. Then, parallel to that, there should also be a strong effort to think about research as geographically dispersed. So rather than always thinking that the stereotype and the prototype for best practice, say in molecular biology, is that particular lab in Oxford or at MIT or in Cambridge, so in very powerful, very well-resourced institutions which are at the very upper end of having access to a lot of resources, it would be better to think a little bit more cogently about how we adapt openness requirements to the vast majority of situations, where smaller institutions, institutions which are based in rural areas, whether that's in the UK or elsewhere, may not have access to all of those resources, and yet these are exactly the kinds of institutions and researchers which should benefit the most from open science, at least in theory.
What is happening at the moment is that, because there's still so little thought about how we get open science to be more just and more inclusive and to benefit really everybody who would like to participate, we have a situation where the biggest beneficiaries of open science activities are people working at institutions which are powerful and have the resources to take advantage of things like big open data repositories and code sharing initiatives and things like that. While what we want is to actually lower the bar of entry into the open science ecosystem, so that people who are working under very different conditions can also participate and benefit. And in that sense, this is an effort from institutions, from policymakers, and also from researchers on the ground. Every one of us can think about the best ways to make our research accessible to others, no matter the conditions under which they're working, and whether, and that's of course a very controversial question, pursuing cutting-edge technology as the end point or the ideal goal for anything we're doing in research is actually always the right thing to do. Sometimes it is in fact better to use kind of a low-bandwidth tool to share our data, or to go through a friendly, low-entry interface for, for instance, doing coding or programming, which may not be as sophisticated but is much more usable, so that we can increase the usability and the inclusiveness of our tools, rather than always aiming for something which may be technically extremely sophisticated but ultimately may prove to be completely useless, because very, very few people aside from us can understand how it works and actually have the conditions to use it.
Chris Tibbs: Yeah, that sounds really amazing, if we can get to be that inclusive and really just take that extra time to think. Given your experience of working and collaborating with colleagues around the world, do you think that's something that's on researchers' minds? I mean, I guess not every researcher is even thinking about openness. So, I guess, are there even fewer researchers thinking about this and trying to be more inclusive? What's your feeling about how likely it is that something like this can actually happen?
Sabina Leonelli: Well, I am cautiously optimistic about the research environment and research cultures, because what I do encounter is a lot of people who may not necessarily be that aware of the open science movement, or even that involved, but as soon as you start to discuss with them the challenges they see in their own work and the things that they would like to overcome, these kinds of barriers immediately come up. So, I think the vast majority of researchers, in my perception, are well aware of the fact that there are very high entry points to participating in research in a variety of areas, and that this negatively affects the quality of the research and the extent to which we can actually try to use research to address global challenges. So, in that sense, I think we are making progress. Where it's very complex is the fact that we are looking at a research system which is, on the one hand, very much subservient to a broader political economy controlled by big tech companies, and this is important because all of us end up having to rely on things like Amazon Web Services and Google tools to carry out our research. And we know very well that this very much affects the choices that we're making in research, in a way that we very often don't control. And so, the use of some of these proprietary tools really gets in the way of trying to implement openness, and to do it in a way which is more just, as we discussed. The other issue is that the incentives in academia overall, and certainly within Anglo-American academia, continue to be not quite conducive to spending a lot of time thinking about more sustainable, more responsible ways of sharing our research and making sure that we collaborate with others. Within UK institutions there is a lot of lip service paid to trying to co-design research and working, for instance, with local communities to try and devise solutions that work for people.
But when it comes to how researchers are evaluated, these are not the criteria that are used. We're still evaluated on the extent to which we publish in very prestigious journals, impact factors, very often our citation numbers, the type of funding that we bring in, how much money we bring in, and these kinds of metrics are very problematic when it comes to trying to encourage more intelligently open and justly open behaviour. So that's where I'm more pessimistic. I think we need to work very hard on the system of incentives, and also on the platforming of research, because otherwise, no matter how good the intentions of people working particularly in academic environments are, they're still going to fall prey to this broader set of constraints.
Chris Tibbs: Yeah, I think that's important. Researchers alone can't change it, right? The structures around them that are confining them also need to be changed. Like you said, researchers need to be incentivised to do this work. So, a little bit of optimism along with a little bit of pessimism. I just wanted to ask you one final question before we finish up here. So, we talked about your research, and you're doing a lot of really interesting research. We talked about the idea of openness, open science. We talked about artificial intelligence. So, to finish up, I would just like to open the floor to you. Do you have any final take-away messages for our listeners today?
Sabina Leonelli: So I think I want to speak specifically to early career researchers: PhD students, postdoctoral researchers, you know, beginning lecturers. I think it's such a fantastic time to be in the early stages of a research career. And I'm very aware of the fact that the job market is really not particularly good. But at the same time, there are lots of really interesting jobs in research, also beyond academia. And there are so many opportunities to really think in a new and more creative way about what it means to do research at the moment: how do we use new technologies, and how do we make sure that we make our research more open? I think some of the most interesting initiatives in terms of making research more open, more just and more intelligent come from the younger generations, with things like the ReproducibiliTea journal clubs, you know, younger people meeting and thinking: how do we improve the quality of research? What do we need to be able to do this? How do we avoid falling prey to commercial publishers? How do we manage to do research in a way that actually falls outside of these very well-defined traditional paths for having a career? I think all of these questions are very important, and there is a real chance at the moment to try and change the system from the inside. I think there's lots of goodwill also from many senior academics who are themselves, like myself and many other people, looking hard for answers to these questions, and very much looking to the younger generation to inspire us as much as possible and to really let us know what they think should be done, given that these are the people who will carry out research in the future.
So I would say it's very, very important, even if you're in a system where you're under pressure in a variety of ways to do things in a conservative manner, to really use the impetus of being, you know, relative newcomers on the block, seeing things in a different way, and to trust your instincts: if you see something that you don't think is quite right, that should be changed, just pursue that idea. Very often you will find other people who are like-minded and may in fact help you to try and change things from the inside. So, I really do think that a lot of the impetus for change and transformation in research needs to come from researchers, and particularly the younger generation. And there are tools to do this. If you're interested particularly in the question of data, data dissemination, data sharing, I would recommend that people look at the Research Data Alliance, which is a great organisation that brings together people from around the world to have these kinds of discussions. You will find people who are very like-minded, and there are online working groups and very often in-person conferences, so this is something you can really use directly. Certainly, the events organised by the Institute for Data Science and Artificial Intelligence will very often be conducive to having these kinds of discussions. We have a reproducibility network in the UK, which has a very strong base also in Exeter, and the people who are in charge of that would always be very available to talk to young researchers about what can be done. And also of course, Chris, you are one of the main people here: we have a wonderful library at the University of Exeter that can really help with thinking these things through. So, I would just say don't get discouraged despite some of the obvious obstacles; seek out help and seek out alliances. And there's lots of work to be done in this area.
Chris Tibbs: What a brilliant message to finish on. Thank you very much. It's been really great to hear from you today. Thank you very much for taking the time and sharing your thoughts and insights. Thank you everyone for listening. Thank you, Sabina. Take care everyone.
Sabina Leonelli: Thank you.
Thursday Nov 16, 2023
Dr Gavin Buckingham (Associate Professor in Public Health and Sport Sciences) talks to Dr Chris Tibbs, Research Data Officer at the University of Exeter, about the different types of research data he works with and best practices for managing research data during your project.
Podcast transcript
Chris Tibbs: Hello and welcome. I'm Dr Chris Tibbs and I'm the University Research Data Officer, part of the open research team based in the library here at the University of Exeter. My role involves supporting researchers across the university as they work with and manage their research data, and so this episode is going to be all about research data and how best to look after it and manage it during your project. And to discuss all of this, today I have the pleasure to be joined by Dr Gavin Buckingham, an Associate Professor in Public Health and Sport Sciences here at the University of Exeter. So just to start with, Gavin, would you like to tell us a little bit about your research and the different types of data that you work with?
Gavin Buckingham: Hi there, Chris. Yeah, I'm a cognitive psychologist by training, and I'm interested in human perception and human motor control. And I've been looking at this in the context of measuring the movements and forces people apply to pick objects up, and more recently I've been looking at this in the context of immersive virtual reality as well. Now, most of this data takes the form of pretty simple time series of data, so numbers representing forces or positions of things in multiple dimensions, and their expression over time. So many thousands of lines of data, potentially, from which we then take maybe the largest value or the value at some other critical time points, and that reflects some aspect of human behaviour. So that, pretty simply, is really what it is that we deal with here.
Chris Tibbs: So thinking about all those types of data that you're working with, I mean you mentioned numerical time series data. I just want to point out that, you know, data can also mean a wide variety of other types of data, and many people might not think that they work with data. But generally, when I refer to data, I'm thinking about any sort of information, evidence or materials that are being collected and used for that research. So I'd just like to hear your thoughts on why it's important that you look after and manage your data, in terms of helping your research and also then potentially making that data available.
Gavin Buckingham: Yeah, it's a really interesting question, because the pipeline that goes from the stuff that comes out of the apparatus that I use to capture people's data to the things that are subsequently reported in the paper, that's a pretty lengthy pipeline that has many different steps. And those steps can be fairly clearly articulated, but being able to show the consequences of each of those steps, I think, is a really key part in terms of people being able to eventually understand your data and make sense of it and use it in other sorts of ways, and that's the narrative I feel most passionately about in many ways. Perhaps slightly selfishly, I'm not so interested in other people finding mistakes that are present in my data, God forbid, but I'm more interested in this resource that was collected potentially being a useful thing for other people in ways that I cannot even really imagine. That for me is the really big value I see in my datasets. I work with clinical populations. I work with children, with older adults, with typically developing university-aged people, all of whom have interesting ways that they interact with the world around them that, you know, could feed into hitherto unforeseen mechanisms or rehabilitation or technological advances. And, you know, I really see the value of data just sitting there waiting for someone to be able to harvest it in that way.
Chris Tibbs: Yeah, all of this sort of potential that's in that data, you know, enabling analyses that are completely separate from your research. So when did you first start thinking about managing your data to make it available so that others could have it and be able to analyse it? Was this something that you discussed with your supervisor as a PhD student? Or was this something that you just picked up on later during your career?
Gavin Buckingham:Yeah. When I was a PhD student and postdoc, this wasn't really part of the narrative at all. There was no real sense that this is what you would do, but it was actually more to do with the experimental and analytical code: the MATLAB files, in my case. I fairly vividly remember asking someone if I could use their MATLAB files to run an experiment of my own, and they're like, well, these were developed in collaboration with my colleagues and it cost money to get these developed, so probably not. And I was sort of thinking to myself, that's a bit of a disappointing perspective, given that this doesn't directly earn anyone any money and gatekeeping it from me isn't stopping you getting the benefit from it. So when I got my first lectureship, I was given, as part of my start-up contract, a research system to help develop the code that would underpin the data collection in my lab, and I was very clear in my head that data will be available to everyone, and I started creating a wiki for my lab webpage, and you know, a lot of this is lucky me, to have the resources and the skilled person available to do this and set this up from the beginning. But really that was kind of the key step as far as I was concerned. You know, once all of this MATLAB code to control the data acquisition unit and the force transducers that underpinned all of my research at the time was up online, that was, number one, a really nice way for me to stay on top of something that someone else had written for me, which was a new experience for me anyway, but also a way to share it with the world. And, you know, going forward since then, I've had seven or eight people set up their labs with that code, and it's a pretty niche research field, but it feels really nice to know that that code has been used in this way for this particular purpose.
And from then, the sharing of data felt like a pretty natural step once that became part of the narrative on social media in particular; seeing people talk about this on Twitter has been a really formative part of my education in this area.
Chris Tibbs:That's really interesting, and just picking up on something. So you mentioned this wiki for your lab. So this is obviously something that you discussed with your team and with the PhD students that you supervise. So, as they're learning, when you're helping them to develop as researchers in their own right, you made a concerted effort that this would be part of that process?
Gavin Buckingham:Yes, although maybe perhaps not as aggressively as one might imagine. I certainly don't mandate things like data sharing or sharing of code, because at the end of the day, particularly if your future life is likely to be outside of academia and you have potential intellectual property issues, or you want to display your own evidence of your expertise, that's done in very different ways in very different fields. So I encourage and I support my trainees to provide basically everything as openly as it possibly could be, but I'm not that interested in mandating it to them. As it stands, they've been even more enthusiastic in their uptake of this than I have, and, you know, certainly some of my PhD students have improved my own nascent processes quite substantially and taught me things and do stuff a lot better than I'm able to do as it stands.
Chris Tibbs:So do you have any tools or techniques that you could share, some examples of where you and students from your lab are building in these best practices? You mentioned the wiki, you mentioned data sharing. So is there, you know, an example of a tool, or something you could share, one thing that you have done in your lab?
Gavin Buckingham:Every project that gets up and running in my lab, there's an Open Science Framework (OSF) page created for it. That Open Science Framework page might exist as nothing other than a place to put a preprint of the paper at the point of publication, so I know that everyone has access to at least a version of the scientific outputs, which I feel very, very strongly about. That seems like a complete no-brainer, zero-effort thing to happen. Oftentimes that's accompanied by a pre-registration document, be it a version of the introduction that we'd sort of hashed out together, me and the trainee, or a template from AsPredicted or something like that. Eventually, this is also often populated with individual participant data and then the summary statistics that would have been used to calculate the F ratios and P values and things like that, and the statistical analysis and the supplementary materials that would go alongside the paper as well. So it becomes just this wonderful, convenient storage place to segregate everything to do with that particular research project, which, as I've progressed through my career and am working concurrently on what feels like 1000 different things at the same time, is, I would say, essential. An essential part of my practice, because otherwise I'd be relying on my, uh, incoherent filing system to keep track of everything, whereas now I can look in my OSF page and all the things that are shared with me and find a huge amount of stuff that's actually really useful for me.
Chris Tibbs:Yeah, that's really interesting, that's a really good way to manage it. So I just wanted to highlight a few of the points that you raised there. So, having all of the documentation alongside the data, right, because it's obviously important: the data by themselves are essentially meaningless. So having all that documentation alongside the data is obviously important, and having the data available alongside the publication, so when someone reads the publication they can obviously access and see the data. I just want to mention, so you obviously talked about depositing the data and the documentation all in the Open Science Framework, which again is totally fine. I just want to point out that the University also has a repository that can be used, though not quite in the same way that you use the OSF. The repository, Open Research Exeter, is more for the published dataset to go alongside the publication. So just talking about publications, you talked about preprints, you talked about, you know, pre-registrations, registered reports. I just wondered if you could say a little bit more about particularly the pre-registration and registered reports, as these are sort of new methods of publishing, and what it is they're trying to achieve that's maybe different from a standard publishing process.
Gavin Buckingham:Well, it's interesting that you sort of call it a new narrative, and it's definitely a new narrative, but one of the things that drove me in this direction was when I moved to Exeter, actually, I moved to a department that has this incredibly onerous ethics process. An ethics form that's some 20 pages long. And this for many disciplines seems like a completely bizarre idea, but it actually forces you to directly confront the background, the things you're hoping to measure, the things you're hoping to manipulate, and why you're hoping to do those things, and articulate who you're going to recruit and why that sample size, complete with a power calculation. So all of this stuff needs to happen before I can start collecting data, whereas back when I was a postdoc, I would apply for ethics with a fairly simple "this is what I'm going to do - it's pretty safe, so that will be fine". Here it's a much more onerous process, but this actually means that I already know all of this stuff at the beginning. So creating a pre-registration, where I have articulated what it is I plan to do, what it is that I hope to get out of this, what I'm going to do in terms of statistical analysis, and even deeper details like how I will deal with outliers and what sort of things I will have in place, you know, foreshadowing all those difficult research decisions that I might have to make later on, and that I've sort of forgotten I will have to make, in many cases, is a really useful thing. And it was sort of happenstance, really, that this ethics process landed at just the right time, when pre-registration opportunities were billowing into, certainly, the psychology ecosystem, through things like AsPredicted, through the Open Science Framework growing up, and, as you say, through registered reports, which are a version of pre-registration where your study protocols are peer reviewed before you collect the data.
And this in many ways is almost like, you know, how it is with a student and the supervisor. They come to you, they pitch an idea, and then you refine it together, and then you're finally ready to start off. Here, it's not just you and the student in your, you know, little bubble; it's you and some reviewers who have really crafted the perfect experiment to answer this question. And then you go out and collect the data safe in the knowledge that no matter how it pans out, whether it's a significant difference, no significant difference, or a slightly awkward P value that sits in the middle of being able to be interpreted as one thing or the other, the publication will still happen, and it will be accepted assuming that you stick to what it is you said you were going to stick to. And then, you know, you still have the opportunity to explore your data in the way that you would have, uh, before the days of registered reports and pre-registration anyway. So it's a really interesting publishing pathway, although one that I probably haven't embraced quite as fully as I would like to. I've done one registered report myself to date, and hopefully there will be more, but I think that the challenges of identifying exactly what you're going to do to a dataset before you collect it are not ones that are easily overlooked. I think that you need to be very certain about the protocols and exactly how this data looks. It needs to be really firmly in your wheelhouse of expertise. It can't be the sort of study that's branching off into a slightly new area using a new technique, using a new data collection method. I think it's really got to be something that you know a lot about, and for that kind of study to pop up just at the time when you're maybe a bit later in your career like I am, you know, it's a reasonably rare occurrence.
That said, I think if not for the pandemic, and if not for the things it had done to various data collection timelines and the uncertainties thrown in there, I would like to think that many of my PhD students would have submitted stage one registered reports by now and be collecting data for them. It just really didn't seem like a pragmatic thing to do back in 2020.
Chris Tibbs:Yeah, I mean, that's really interesting that sometimes these things just align and things just work out. Something else I just wanted to mention as well, because you talked about this, right? Your move into this open research area wasn't something that you developed as a PhD student or a postdoc; it's something that came a little bit later, and I think that's important because I feel like this is an example of it's not too late to learn, right? So just because maybe you're already a lecturer, you're already well established, doesn't mean there's not things that you could still learn or things that you could still implement. So I think it's important to say that this is not just something that, you know, you want to pick up as a PhD student. You're never too late to learn, as the saying goes. You've obviously been, as I mentioned, on this journey regarding looking after, managing, and sharing data. I just wondered if you could maybe, you know, highlight some of the obstacles that are potentially in the way, that you think we as a community might still need to address.
Gavin Buckingham:Yeah, I mean, the obstacles are plentiful. Two that spring to my mind initially are, first, what do we do with the old data, in the sense that currently my workflow will be to have our participants consent to having their data shared in an open repository. But again, that's something that has come in over recent years. What about people who did not consent to that, either because I never asked them or just because no one ever thought to ask them back ten years ago? Should that data go up online? Is that fair game to go up online? What has GDPR done to that? And what are the interpretations and legal consequences, and how do they vary from institution to institution, or data protection officer to data protection officer? And I really feel that these challenges are often so overwhelmingly insurmountable that many academics will just go, probably best for me just not to bother, and I'm pretty sympathetic to that idea. I certainly at one point had all of my data up online, and then I decided, probably based on watching something on Twitter, maybe I should just put up the data that people have explicitly consented to share, and, you know, that's a sort of awkward position to find yourself in anyway as an academic. I think the other big issue that we have to confront here is: what is raw data? The least raw data, or the lowest barrier to entry, I think, is let's put up the CSV file or the SPSS file or the R file that contains your summary statistics. The average of each person in each condition. Maybe a 20 by 30 matrix in my sort of typical case, and then someone can do the same statistics I did, taking my word for it that those numbers are real. Fine in some sense, useful for maybe doing a meta-analysis and being able to calculate things I didn't report in my paper, such as confidence intervals or things like that. But less useful in other contexts, and certainly not raw data that you could learn anything about human behaviour from.
Really, it's data presented in the way to answer the question that I wanted to answer. I could present my rawest of raw data: what comes out of my motion capture cameras. That's just about achievable for me, but each of those files is several megabytes, and, you know, in a large study with many trials that quickly turns into gigabytes. If you're in bigger data worlds than I am, that becomes unfathomably large, in which case you need to start to rely on the university structures. Which isn't a bad thing. It's nice that the university structures have kind of caught up with this potential demand, although I suspect they're not utilized nearly as effectively as many people behind the scenes would appreciate them being utilized. Or the rawest data are completely identifiable to participants, and sharing them breaks this anonymization. Think of an MRI scan of someone's brain, for example: it's quite easy to determine whose brain that is, you know, once you have the key to unlock that piece of information. In the world of movement control, it's still a little bit up in the air. You would probably assume that the way someone moves their arm to reach out and pick something up is not at all identifiable to that person, but perhaps not with good enough mathematics, and particularly in the world of virtual reality and data sharing, which is a very hot topic around Meta, the main company involved in immersive virtual reality these days, there are a lot of unanswered questions that leave a lot of uncertainty. It's definitely easier to err on the side of caution, I feel, whether from a legal perspective or from an ethical perspective, and finding that right balance is definitely a study-by-study, situation-by-situation challenge that makes it very hard to standardize processes and protocols without me ultimately having to make a judgment call.
Chris Tibbs:Yeah, I think that's very important. It's not a nice, straightforward "here's the data" every time. It's complex, right? And more so when you're dealing with human participants and you have the ethical side of that. So yeah, it's not straightforward. And all of this takes time, obviously. All of this takes, you know, resource, and someone, the researcher usually, has to do that work, right? So it's complicated, and at the minute I don't think there's a straightforward answer or, you know, one-size-fits-all solution to that. So, we discussed various different things today. If someone was listening and thinking, you know, this sounds like it should be the approach that I'm taking to my research, I should be looking after my data, I should be sharing it where possible, do you maybe have a simple take-home message for that listener? Maybe they're feeling a little, you know, overwhelmed, not sure where to start. Would you maybe have one simple message for how they could get started?
Gavin Buckingham:Yeah, and that simple message is that doing anything a little bit better than it was beforehand is worthwhile. It can seem like the barrier to entry of the open science world and the reproducibility world is so high, and you need to pass so many purity tests to be able to feel like you're one of the gang. That's definitely a narrative that I have no interest in. I'm not some superstar of the open science and reproducibility world; I'm just a normal academic who has, little bit by little bit, been able to make fairly incremental changes that have, I think, substantially improved my practices from when I was a junior academic. And, you know, this could be just as simple as uploading all your papers on to your website so that they're not just available but, you know, easily searchable; that's open in some senses. Or putting up the code you use to collect your data or analyze your data with the idea that, well, maybe someone else will be able to use this, and it will save them a bunch of time, and you're contributing to science that way. And I would say these things ultimately are hugely important aspects that will seem like everyday working practices for some people, and then you can think to yourself, well, actually, yeah, OK, I do these things already. Maybe my next study, this one seems like it will be appropriate for data sharing, and maybe I'll do a bit of a pre-registration document for this next study, because, well, you know, it's probably going to be useful for me. It will certainly save me having to try to re-remember why we did this and what we said we were going to do to remove the outliers and stuff like that back when we had this discussion a year ago. And then from there it becomes almost a fairly natural thing to think, well, let's just do a registered report. The timelines sit perfectly for this, and I'd like to try something new.
It's kind of interesting to shake up what feels like a slightly jaded publishing process by the time you've reached my stage in my career, and it was actually a really refreshing feeling to do something new and different. And it wasn't at all any more or less effort than a traditional publication process, but it was quite a lot more fun, I felt, than the typical workflow I go through, and, you know, I think the opportunity to shake these things up is one to be grasped. So, not really a short answer, but there we go.
Chris Tibbs:Well that's really good advice. I really like that: take small steps, right, small steps can lead to big improvements. I think that's really good advice to end on. So Gavin, it has been really interesting to hear from you today. Thank you very, very much for sharing your knowledge and your experience, and hopefully we can inspire some listeners to start taking that first step. So thank you very much Gavin. Thank you very much everyone for listening. Thank you.
Tuesday Aug 15, 2023
Dr Chris Tibbs, Research Data Officer at University of Exeter, discusses research data and how best to manage that data during your project with Dr Eilis Hannon, Senior Research Fellow in the Complex Disease Epigenetics Group at the University of Exeter Medical School.
Podcast transcript
Chris Tibbs: Hello and welcome. I'm Dr Chris Tibbs and I'm the University Research Data Officer, part of the Open Research team based in the library here at the University of Exeter. So my role involves supporting researchers across the university as they work with and manage their research data, and so this episode is going to be all about research data and how best to manage that data during your project. And to discuss all of this today, I have the pleasure to be joined by Dr Eilis Hannon, a senior research fellow in Clinical and Biomedical Sciences here at the University of Exeter. So Eilis, would you like to tell us a little bit about your research, what it involves and the different types of data that you work with?
Eilis Hannon:Yes. Well, thank you very much for inviting me along today. So I'm based in the complex disease epigenetics group, and we are a group of mixed modalities: we've got wet lab scientists and dry lab scientists, like myself. So we generate and analyze quite a lot of genomic data. We're primarily interested in the brain and modelling gene regulation in the brain, and we're in a really exciting time where there are so many different technologies and experiments that we can take advantage of that the quantity of data we've started to generate has just kind of exploded. So from one single sample, we can have, you know, 4, 5, 6 different experiments and layers of data. And so what I'm quite interested in doing is trying to integrate those different layers together. So a lot of what I'm working with is experimental data, but because a lot of these technologies are quite new, we're often developing new methods to analyze them in parallel. And so what we also do sometimes is simulate data where we kind of know what it looks like, we know what the outcome should be, to test and develop methods. So it's quite a broad spectrum of different data types.
Chris Tibbs:Yeah. So you mentioned it there, right? So you maybe have simulated data, you've experimental data, and so I just wanted to pick up on the point here when we're talking about data, and this obviously might mean different things to different people. And so if you're listening to this discussion and thinking, oh well, I don't work with data or this doesn't apply to me, then I just want to really make clear that when I refer to data or research data, it really means all of the information or the evidence or the materials that are generated or collected or being used for the research. And so, now that we're clear about data and what it refers to: why is it so important to manage this data effectively? I mean, you talked about producing a large quantity of data, so I'm guessing that's one of the reasons why it's important to look after it.
Eilis Hannon:Yes. So from my point of view, efficiency in terms of processing that data. I mean, you know, if it wasn't organised in a kind of sensible or pre-planned format, then it would be incredibly challenging to work with. You know, we take advantage of the high performance computing available at the University, and to do that efficiently, we need to have some pre-described format for the data. But there's also ethical implications. So, you know, we're working on data generated ultimately from a piece of human tissue. So we have requirements in terms of how we look after that data, what we do with it, who uses it and how. So we need to make sure that, you know, our data is organized such that those requirements can be met. But also, you know, one of the really nice things about what we do is that from one experiment you can answer lots and lots of different research questions. So different people within the research group will be taking advantage of the same dataset. And to really maximize that utility, we need to, you know, organize it in a way that we can find it, we know what's what, and we can really reap the benefit of that initial financial investment.
Chris Tibbs:Yeah. So it's obviously clear, especially if multiple people are working, doing different analyses on the same data, it's obviously important to know what the data are and make sure that they're described, and who's doing what on the data, and version control, I imagine, is something that's very important for you. Like, it's clear that the data are fundamental for the research, right, and it doesn't matter if you have, you know, the most sophisticated methodology to analyse the data; if the data are not described or the data are inaccurate, then your results are not going to be good. They're going to be inaccurate. They're not going to be clear. So this is something obviously that you're doing at the minute, you're managing your data. When did you really first start thinking about the idea of, you know, managing your data, particularly with the aim of potentially making it available to others to validate or to build upon your research? Was this something that your supervisor discussed with you as a PhD student, or was this something that you sort of picked up later on in your career?
Eilis Hannon:So during my PhD, which I did in Cardiff, I was using publicly available data, and so I had quite a naïve, I guess, view of kind of experimental work, and when I came to Exeter and joined a team where we generated the data we analyzed, you suddenly start to realize that, you know, of course experiments aren't perfect, of course they don't work as expected all the time. And, you know, I gained a real insight at that point, because obviously questions about how we use the data, how we process the data and how we ultimately share the data became a lot more relevant to my work. But I also gained a huge insight, you know, being much more aware of the whole research process, from study design, generating data, analysing the data and publishing it, you know, what the requirements were and also the challenges with data generation. And so, you know, I strongly recommend it to anybody who sees themself more as an analyst: the insights you gain from working closely with the people that generate the data are just unfathomable really. It really opens your eyes and gives you, I think, a much more holistic view of research.
Chris Tibbs:It's really interesting how your perspective changed from someone who's just analyzing the data to someone who actually is experimenting and generating the data, right? That's a really interesting view. When you start to be someone who's producing data and potentially sharing it, then it's a lot more important to think about all of these processes. So talking about these processes of looking after the data, I mean what sort of tools or techniques would you recommend to someone who's interested in, you know, making sure that they manage their data effectively or looking after their data?
Eilis Hannon:So I think, forward planning where you can. Thinking about where you're ultimately trying to get to in terms of, you know, what format do you need the data in to do the analysis that you want to do? But also thinking about, particularly working with large datasets like we tend to, we can't store multiple iterations. We need to be quite practical about what are the core stages that we need to save, and actually, if you sit down and think about it, for us the most important parts of the data are the raw data and then our analysis scripts, because from there we can recreate anything that we've done after that point, if we were to lose it in some kind of, you know, freak event or something. It's very tempting to hoard these kind of intermediate datasets, but often they actually make your life much harder because you can't actually remember what stage each file relates to. And so the more streamlined you can be, in terms of what you save and what you keep, does actually make managing these data much easier, and, you know, clear records in terms of having scripts can also help you navigate that process. And as you become more ingrained in your project, you do start to realize what the critical points are that you want to save and keep a record of.
Chris Tibbs:So you mentioned two points there that I'd like to pick up on. First of all forward planning, which I completely agree is very important. And so I just wanted to at this point highlight the importance of a data management plan to do exactly that. And so this is the plan that you develop at the beginning of the project, thinking about all of those things that you talked about in terms of what it is you want to do with the data, and trying to think about them from the beginning so that you can identify potential obstacles or issues and then try and plan around them to mitigate them. So that's definitely very important. And then the other thing you picked up on was about code and reproducibility. So would you say that for someone who's, you know, working in a similar area or with similar types of data, it's really a requirement to learn a programming language, so Python or R, to really ensure that their analysis is reproducible, not only for someone else, but also for them? So like you mentioned, then you just need essentially the raw data and the code, and you can reproduce the analysis. So would you say that's sort of a requirement?
Eilis Hannon:I would strongly recommend it, purely because of what you learn alongside a programming language, and that is things like how to record what you've done. So, in my opinion, it's the most transparent methods section you could ever write. It tells you a lot more about what you actually did than someone could ever gain from reading a paragraph where you describe what you did. And it's also really, you know, if you do it in a way that automates your analysis such that you can know confidently that every line in this script was run from top to bottom, it's really easy then to backtrack if you find an inconsistency further down the line, and really easy to make a change and re-run it. You know, it's frustrating, of course, and you can spend a lot of time focusing on the "oh, all that time I wasted because of that error". But the beauty is that you can fix it really easily, and if you can just let go of the regret, that problem is fixed in a way that you could never fix it in the same way in the actual experiment. So we do have the ability to repeat our analysis, we do have the ability to make tweaks, and we should embrace that rather than see it as the "aww, but all that time I wasted, you know, on the incorrect version", and see it for the positive that it can be.
Chris Tibbs:Yeah, the only thing I can add is document your code. It is so nice when you go back to old code and you see that you've documented it, so you know exactly what each of those steps were and why they're there. So yeah, definitely. Related to the... Oh, sorry, did you want to say something?
Eilis Hannon:I was just gonna say that I'm gonna echo something, I think it comes from the Software Carpentries, and it's, you know, the person that you're gonna benefit the most is your future self.
Chris Tibbs:Indeed, and related to code, so you were awarded a Software Sustainability Institute (SSI) Fellowship, which is about, you know, providing funding to fellows who are trying to improve research software in their research areas. So would you like to say a little bit about that experience and also just about the work that the SSI are doing in general?
Eilis Hannon:Yes. So the Software Sustainability Institute kind of comes from recognizing that research is highly dependent on software, and often highly dependent on bespoke software or computer programming solutions. But often these aren't easy to reproduce, or even to find, and the sharing of this software isn't brilliant really. And that's partly because the emphasis in research is on writing the paper and extracting the results from the software, rather than thinking about the software as a research output of its own. And so the Software Sustainability Institute's kind of mission is to improve the quality of research software. So it does that by funding fellows at all career stages in all disciplines to kind of support that development, maybe by running workshops, or petitioning their local institutions or funding bodies to do their best to recognize these research outputs. So one of the kind of parallel initiatives is the research software engineering movement, which is at least 10 years old now, and in the last two years Exeter has had a central group of about 10 members. And you know, that's again recognising that within a university ecosystem we need people whose expertise is programming and who can help us process our data, maybe do new things that we never thought possible because of the skills that we now have available. Yeah.
Chris Tibbs:Indeed, yeah. Indeed. It's very important to have those skills and have those skills recognized. So a big part of what we're talking about here in terms of managing data is also about the idea of sharing that data, where possible, so that like we talked about that others can come along and they can validate and replicate that. So I just wanted to ask you about your own experience of sharing data, you know, putting data in a repository or you know data associated with the publication and how you make sure that the data and the publication are both seen together and how users or readers of the publication can access the data. So have you deposited data in a repository? Is this something that you've done?
Eilis Hannon:Yes. So we routinely deposit as much of our data as we can, partly because, you know, we're funded by the Medical Research Council and it is a condition of the funding that we share the data that we generated. Occasionally, you know, we're working in a study design where the ethics don't permit it, but we try to find, you know, an alternative way of giving people access to the data if they need it, so maybe by some kind of application process. The route we typically use is to deposit it in a public repository, so we use the Gene Expression Omnibus, which is hosted on the NCBI website. And the first time we did it, I'll admit, it was a little bit of a faff because they have a fairly rigid format, unsurprisingly, and the only reason it was a faff was because it was something we thought about at the end of the project as opposed to the beginning of the project. And so we hadn't necessarily organized our files in the way they wanted them, so we kind of had to go back and re-extract some of the raw data. But once you've done it once, the format typically stays the same, and so next time round we were able to better plan ahead for that, and it's a more efficient process for us. And actually, because these are public repositories, we often think that there's a huge advantage for us to deposit our data at the beginning of the project, because it's another way of backing it up. You know, we don't have to spend so much money ourselves on data storage because we've put it on someone else's server, and, you know, not only do other people benefit from being able to download it, but we can redownload it if we need to in the future. I should also caveat that even though we might deposit it at the beginning of the project, you can put it under an embargo so it's not publicly released until you kind of say yes, go ahead and do that. So it can be quite useful, you know; we put an embargo on for maybe a year.
So you know we can still do our analysis, we can still, you know, write our paper up and then typically at the point of publication is when we would say open it up to share it with the wider world.
Chris Tibbs:Yeah, it's really interesting thinking about the benefits of sharing it at the beginning, that then you don't have to look after it and you know exactly where it is. So I just wanted to pick up on a couple of points. You mentioned depositing your data with the Gene Expression Omnibus, which obviously specializes in genomics data. And I just want to point out that the University has an institutional repository, Open Research Exeter or ORE, and although we have a repository, we do recommend that data should be deposited into specialist repositories where they exist. Obviously, if a specialist repository doesn't exist, then that's where ORE comes in: you can deposit your data into ORE. And something else that is also very important: depositing data in a repository and making it available to others is important, but it's not enough to simply make the data available, right. You need to make sure that others can understand the data, and so this comes back to making sure that the documentation explaining the data is also available alongside the data. And then in terms of publications, often you've got data underlying a publication. When you deposit your data into a repository, the data are often assigned a persistent identifier, and this identifier, which uniquely identifies that dataset, should be included in the publication, so that someone who reads your publication and thinks, oh, that's really interesting, I wonder if I could use that in my own research, can click on a link, go and find your data, and then use your data and build on it. And so it's all about making the links and making everything clear, so that others know exactly what the data are and therefore how they could potentially use them. So you talked about the SSI trying to see software as an important research output, and I also feel the same way about data, right?
It should also be seen as a standalone output that's very important. So I assume you have similar thoughts about data as you do about software, in terms of it being a standalone sort of output?
Eilis Hannon:Oh, definitely. I mean, when we receive a grant from, say, the Medical Research Council, they're not only funding us to answer the question we outline, they are funding us to generate a dataset, and they see the value of the dataset as a research output. And, you know, that's why they kind of require that we share that, and certainly that's one of the selling points that we write into our grants: whilst we're generating it to answer this question, we recognize there is value more broadly and we want to help facilitate that. And I think that's one of the, injustice is probably too strong a word, but actually there is a lot of pre-existing genomic data, if I talk specifically about my field, that could be reused a lot more to answer a lot more questions, yet arguably the emphasis is still on generating new datasets. And there's a little bit of a gap where you could fund analysts or computer scientists to come in and take advantage of publicly available data. And of course, that's actually quite a cheap grant, because you're just paying for someone's time as opposed to paying for expensive experiments. But I don't know that the research landscape has necessarily picked up that, actually, there are probably a lot of answers that we could get if we just funded a few more analysts in the field as opposed to constantly funding new data generation projects.
Chris Tibbs:Yeah, it seems like we're missing a trick there with all the information and results that could still be generated from those archival data. I just wanted to ask you, as someone who's been on this journey from analysing publicly available data to then generating your own data, about some of the obstacles and barriers that you have found, and maybe potentially found a way around.
Eilis Hannon:So I think one of the biggest obstacles, and I'm gonna talk specifically about sharing code here.
Chris Tibbs:OK.
Eilis Hannon:And because that's, I guess, a big kind of focus of mine, and I recognize one of the barriers is, I guess, a feeling that my code isn't good enough to be shared. You know, it's not pretty enough, or, you know, I'm not an expert programmer, I've taught myself or I've only been on a beginner's course, and I'm kind of only here because I've had to get here, and therefore I'm either exempt from sharing my code or, you know, generally there's a lot of anxiety that it doesn't look nice or, you know, someone might find a mistake. And a PhD can be a time of, you know, fear that there's so much pressure on my results, and if someone were to find a mistake, that could have a big impact. But I'd actually flip it around and say that if you approach your project with the intention of making it open, you tend to program in a different way, and you tend to program in a more reproducible, robust way. And rather than thinking about people finding the mistakes, think about the fact that I have been completely transparent in what I did. OK, maybe somebody would disagree with the way I did it, but because I made my code available, they can see exactly how I did it and under what limitations, and my results are what they are based on these decisions that I made. And I have to say that the main benefit I've seen is my own personal satisfaction in my work. Because, you know, yeah, OK, there are lots of different ways we could have done it, and somebody might criticize the way I did it, but they know the way I did it. And so they know the extent of my results, and maybe they would've done it differently, but, you know, I've been transparent in how I did it. So they know how to interpret what I did, and it really helps remove that anxiety about someone finding a mistake.
If you think about it, it's just you being open in how you did it, such that somebody can assess its value or not.
Chris Tibbs:Yes, very important, this is all about having the mindset of being open from the start. That's really interesting. So I just want to wrap up by asking one final question. I mean, you sort of touched on this in the last answer, about having this mindset of, sort of, open from the start, but given your experience, is there one simple takeaway message for listeners who want to do the right thing, they want to look after their data, they want to be able to make it available, but they might be feeling a little daunted and not really sure where to start?
Eilis Hannon:So I think there are lots of small steps that can be taken. And so if you're new to this, then, you know, the first step would just be to have a script that you know works and is nicely commented and documented. That's the first natural step. And I think that this isn't something that you have to go into expecting to be perfect from the start. There are lots of little things that you can do that just get you towards a more open environment. And the one thing we have to be conscious of is that the requirements for different types of research vary hugely, so there is no one-size-fits-all solution here. It's all about what's relevant to your specific project and, you know, the outcomes from that. And I'd almost encourage people to be a bit reflective about: what is it about what I do that someone else would want to benefit from? But also, how do I benefit from working more openly?
Chris Tibbs:That's really, really, really good advice. Thank you very much for sharing your knowledge and experience, and hopefully we can inspire some of our listeners to start thinking about, you know, managing and sharing their data. Thank you everyone for listening, and thank you Eilis. Take care. Bye bye.
About us
This podcast from the Researcher Development and Research Culture Team at University of Exeter covers the skills and knowledge needed for postgraduate researchers' professional development.