Tuesday Aug 15, 2023

Episode 1: Open Research, with Dr Eilis Hannon (Senior Research Fellow in the Complex Disease Epigenetics Group at the University of Exeter Medical School)

Dr Chris Tibbs, Research Data Officer at University of Exeter, discusses research data and how best to manage that data during your project with Dr Eilis Hannon, Senior Research Fellow in the Complex Disease Epigenetics Group at the University of Exeter Medical School.

Podcast transcript

Chris Tibbs: 
Hello and welcome. I'm Dr Chris Tibbs and I'm the University's Research Data Officer, part of the 
Open Research team based in the library here at the University of Exeter. So my role involves 
supporting researchers across the university as they work with and manage their research data, and 
so this episode is going to be all about research data and how best to manage that data during your 
project. And to discuss all of this today, I have the pleasure to be joined by Dr Eilis Hannon, a senior 
research fellow in Clinical and Biomedical Sciences here at the University of Exeter. So Eilis, would 
you like to tell us a little bit about your research, what it involves and the different types of data that 
you work with?


Eilis Hannon:
Yes. Well, thank you very much for inviting me along today. So I'm based in the complex disease 
epigenetics group and we have a group of mixed modalities. We've got wet lab scientists and dry lab 
scientists, like myself. So we generate and analyze quite a lot of genomic data. So we're primarily 
interested in the brain and modelling gene regulation in the brain and we're in a really exciting time 
where there are so many different technologies and experiments that we can take advantage of,
that the quantity of data we've started to generate has just kind of exploded. So from one single 
sample, we can have kind of, you know, be 4, 5, 6 different experiments and kind of layers of data. 
And so what I'm quite interested in doing is trying to integrate those different layers together. So a 
lot of what I'm working with is experimental data, but because a lot of these technologies are quite 
new, we're often developing new methods to analyze them in parallel. And so what we also do 
sometimes is simulate data where we kind of know what it looks like. We know what the outcome 
should be to kind of test and develop methods. So it's quite a broad spectrum of different data types.

Chris Tibbs:
Yeah. So you mentioned it there, right? So you maybe have simulated data, you've experimental 
data, and so I just wanted to pick up on the point here when we're talking about data and this 
obviously might mean different things to different people. And so if you're listening to this 
discussion and thinking, oh well, I don't work with data or this doesn't apply to me, then I just want 
to really make clear that when I refer to data or research data, it really means all of the information 
or the evidence or the materials that are generated or collected or being used for the research, and
so that we're clear about data and what it refers to. Why is it so important to manage this data 
effectively? I mean, you talked about you're producing a large quantity of data, so I'm guessing that's 
one of the reasons why it's important to look after it.

Eilis Hannon:
Yes. So from my point of view, efficiency in terms of processing that data. I mean, you know, if it 
wasn't organised in a kind of sensible or a kind of pre-planned format, then it would be incredibly 
challenging to work with. So, you know, we take advantage of the high performance computing 
available at the University and so to do that efficiently, we need to kind of have some pre-described 
format for the data. But there's also ethical implications. So, you know, we're working on data 
generated ultimately from a piece of human tissue. So we have requirements in terms of how we 
look after that data, what we do with it, who uses it and how. So we need to make sure that, you 
know, our data is organized such that those requirements can be met. But also, you know, one 
of the really nice things about what we do is from one experiment you can answer lots and lots of 
different research questions. So different people within the research group will be taking advantage 
of the same dataset. And to, you know, to really maximize that utility, we need to, you know, 
organize it in a way that we can find it. We know what's what. And we can really reap the benefit of 
that initial kind of financial investment.

Chris Tibbs:
Yeah. So it's obviously clear, especially if multiple people are working, doing different analysis on the 
same data. It's obviously important to know what the data are and make sure that they're obviously 
described and who's doing what on the data, and version control, I imagine is something that's very 
important for you. Like, it's clear that the data are fundamental for the research, right, and it doesn't 
matter if you have, you know, the most sophisticated methodology to analyse the data, if the data 
are not described or the data are inaccurate then your results are not going to be good. They’re 
going to be inaccurate. They’re not going to be clear. So this is something obviously that you're 
doing at the minute, you're managing your data. When did you really first start thinking about the idea 
of, you know, managing your data, particularly with the aim of potentially making it available to 
others to validate or to build upon your research? Was this something that your supervisor discussed 
with you as a PhD student or was this something that you sort of picked up later on in your career?

Eilis Hannon:
So during my PhD, which I did in Cardiff, I was using publicly available data and so I had quite a naïve,
I guess, view of kind of experimental work and when I came to Exeter and joined a team where we 
generated the data we analyzed, you suddenly start to realize that, you know, of course, 
experiments aren't perfect. Of course they don't work as expected all the time. And you know, I 
gained a real insight at that point because obviously questions about how we use the data, how we 
process the data and how we ultimately share the data became a lot more relevant to my work. But I 
also gained a huge insight, you know, being much more aware of the whole kind of research process 
from kind of study design, generating data, analysing the data and publishing it, you know, kind of 
what the requirements were and also the kind of challenges with data generation, and so that was, 
you know, I strongly recommend it to anybody who sees themself more as an analyst, that actually 
the insights you gain from working closely with the people that generate the data are just 
unfathomable really. It really opens your eyes and gives you a much, I guess I think much more 
holistic view of research.

Chris Tibbs:
It's really interesting how your perspective changed from someone who's just analyzing the data to 
someone who actually is experimenting and generating the data, right? That's a really interesting 
view. When you start to be someone who's producing data and potentially sharing it, then it's a lot 
more important to think about all of these processes. So talking about these processes of looking 
after the data, I mean what sort of tools or techniques would you recommend to someone who's 
interested in, you know, making sure that they manage their data effectively or looking after their 
data?

Eilis Hannon:
So I think, forward planning where you can. Thinking about where you're ultimately trying to get to 
in terms of you know what format do you need the data in to do the analysis that you want to do?
But also thinking about kind of, particularly working with large datasets like we tend to, we can't 
store kind of multiple iterations. We need to be quite practical about what are the core stages that 
we need to save, and actually if you sit down and think about it, for us the most important parts of 
the data are the raw data and then our analysis scripts, because from there we can recreate 
anything that we've kind of done after that point, if we were to lose it in some kind of, you know 
freak event or something. It's very tempting to hoard these kind of intermediate datasets, but often 
they actually make your life much harder because you can't actually remember which stage each 
file relates to. And so the kind of more streamlined you can be, in terms of what you save and what 
you keep, does actually make managing the data much easier and, you know, clear 
records in terms of having scripts can also help you navigate that process. And as you become kind 
of more ingrained in your project, you do start to realize what the kind of critical points are that you 
want to kind of save and keep a record of.
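The principle Eilis describes here, that the raw data plus the analysis scripts are the only things that really need protecting, can be sketched in a few lines. Everything in this example (the file names, the `quality_score` column, the 0.8 threshold) is a hypothetical illustration, not a description of her group's actual pipeline:

```python
# Sketch of the "raw data + script" principle: intermediate files are
# disposable because the script can always regenerate them from the raw data.
# All names and the filtering step below are hypothetical examples.
import csv

RAW_DATA = "raw_measurements.csv"       # the one file worth backing up
DERIVED = "filtered_measurements.csv"   # safe to delete: this script recreates it

def rebuild_derived_data():
    """Regenerate the derived dataset from the raw data."""
    with open(RAW_DATA, newline="") as src, open(DERIVED, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical quality filter: keep samples above a threshold.
            if float(row["quality_score"]) >= 0.8:
                writer.writerow(row)
```

Because the derived file is produced by re-running the function, losing it costs nothing; only the raw data and the script itself need to be kept safe.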

Chris Tibbs:
So you mentioned two points there that I'd like to pick up on. First of all forward planning, which I 
completely agree is very important. And so I just wanted to at this point highlight sort of the 
importance of a data management plan to do exactly that. And so this is the plan that you develop 
sort of at the beginning of the project and thinking about all of those things that you talked about in 
terms of what it is you want to do with the data and trying to think about them from the beginning
so that you can identify potential obstacles or issues and then try and plan around them to mitigate 
them. So that's definitely very important. And then the other thing you picked up on was about code 
and reproducibility. So would you, would you say that for someone who's, you know, working in a 
similar area or with similar types of data that it's really a requirement to learn a programming 
language, so Python or R, to really ensure that not only is their analysis reproducible for someone 
else, but also for them. So like you mentioned, then you just need essentially the raw data and the 
code, and you can reproduce the analysis. So would you say that’s sort of a requirement?

Eilis Hannon:
I would strongly recommend it, purely because of what you learn alongside a programming 
language, and that is, things like how to record what you've done. So it is in my opinion, it's the most 
transparent method section you could ever write. It tells you a lot more about what you actually did 
than someone could ever gain from reading a paragraph where you describe what you did. And it's 
also really, you know, if you do it in a way that it automates your analysis such that you can know 
confidently that every line in this script was run from top to bottom. It's really easy then to backtrack 
if you find an inconsistency further down the line and really easy to make a change and re-run it. You 
know it's frustrating, of course, and something you can spend a lot of time focusing on the oh all that 
time I wasted because of that error. But the beauty is that you can fix it really easily and if you can 
just let go of the kind of regret, that problem is fixed in a way that you could never fix it in the same 
way from the actual experiment. So we do have the ability to repeat our analysis. We do have the 
ability to make tweaks and we should kind of embrace that rather than see it as the kind of aww but 
all that time I wasted, you know on the incorrect version and see it for the positive that it can be.

Chris Tibbs:
Yeah, the only thing I can add is document your code. It is so nice when you go back to an old code 
and you see that you've documented it so you know exactly what each of those steps were and why 
they're there. So yeah, definitely. Related to the... Oh, sorry, did you want to say something?

Eilis Hannon:
I was just gonna say that I'm gonna echo, I think it comes from the Software Carpentries and it's, 
you know, the person that you're gonna benefit the most is your future self.
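A short, hypothetical example of what documenting for your future self can look like in practice: the comments record why each step exists, not just what it does. The function and its normalisation choice are illustrative only, not taken from the conversation:

```python
def normalise_expression(values):
    """Scale a list of expression values to the range 0..1.

    Why: measurements from different batches sit on different scales,
    so normalising first makes samples comparable downstream.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Why: a flat profile carries no signal, and dividing by
        # (hi - lo) would be a division by zero; return zeros instead.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Six months later, the "why" comments are what tell you whether this step still belongs in the pipeline.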

Chris Tibbs:
Indeed, and related to code, so you were awarded a Software Sustainability Institute (SSI)
Fellowship, which is about, you know, providing funding to fellows who are trying to improve 
research software in their research areas. So would you like to say a little bit about that experience 
and also just about the work that the SSI are doing in general?

Eilis Hannon:
Yes. So the Software Sustainability Institute kind of comes from recognizing that research is highly 
dependent on software and often highly dependent on bespoke software or computer programming 
solutions. But often, these aren't easy to reproduce, or even to find and the sharing of this software 
isn't brilliant really. And that's partly because the kind of emphasis on research is writing the paper 
and extracting the results from the software, rather than thinking about the software as a research 
output of its own. And so the Software Sustainability Institute's kind of mission is to just improve the quality 
of research software. So it does that by funding fellows at all career stages in all disciplines to kind of 
support the development, maybe by running workshops, or petitioning their local institution or 
funding bodies to try their best to recognize these research outputs. So one of the kind of parallel 
initiatives that we have is the research software engineering campaign, which is at least 10 years 
old now, and in the last two years Exeter has had a central group of about 10 members, and you 
know that's again recognising that within a university ecosystem we need people whose expertise is 
programming and who can help us process our data, maybe do new things that we never thought 
possible because of the skills that we now have available. Yeah.

Chris Tibbs:
Indeed, yeah. Indeed. It's very important to have those skills and have those skills recognized. So a 
big part of what we're talking about here in terms of managing data is also about the idea of sharing 
that data, where possible, so that like we talked about that others can come along and they can 
validate and replicate that. So I just wanted to ask you about your own experience of sharing data, 
you know, putting data in a repository or you know data associated with the publication and how 
you make sure that the data and the publication are both seen together and how users or readers of 
the publication can access the data. So have you deposited data in a repository? Is this something 
that you've done?

Eilis Hannon:
Yes. So we routinely deposit as much of our data as we can, often because, you know, we're funded 
by the Medical Research Council and it is a condition of the funding that we share the data that we 
generated. Occasionally you know we're working in a study design where the kind of ethics don't 
permit it, but we try to find, you know, an alternative way of giving people access to the data if they 
need it. So maybe by some kind of application process. The route we typically use is to deposit it in a 
public repository, so we use the Gene Expression Omnibus, which is hosted on the NCBI website and 
the first time we did it, I'll admit it was a little bit of a faff because they have a fairly rigid format, 
unsurprisingly, and the only reason it was a faff was because it was something we thought about at 
the end of the project as opposed to the beginning of the project. And so we hadn't necessarily 
organized our files in the way they wanted them, so we kind of had to go back and re-extract some 
of the raw data. But once you’ve done it once, the format typically stays the same and so next time 
round we were able to better kind of plan ahead for that. And so it's a more efficient process for us 
and actually, because these are public repositories, we often think that there's a 
huge advantage for us to deposit our data actually at the beginning of the project because it's 
another way of backing it up. You know, we don't have to spend so much money ourselves on data 
storage because we've put it on someone else's server and you know, not only do other people 
benefit from being able to download it, but we can redownload it if we need to in the future. And 
also add the caveat that even though we might deposit it at the beginning of the project, you can put it 
under an embargo so it's not publicly released until you kind of say yes, go ahead and do that. So it 
can be quite a useful you know, we put an embargo on for maybe a year. So you know we can still 
do our analysis, we can still, you know, write our paper up and then typically at the point of 
publication is when we would say open it up to share it with the wider world.

Chris Tibbs:
Yeah, it's really interesting thinking about the benefits of sharing it at the beginning, so that then 
you don't have to look after it, you know exactly where it is. So I just wanted to pick up on a couple of 
points. You mentioned, obviously you depositing your data with the Gene Expression Omnibus, 
which obviously specializes in genomics data. And I just want to point out obviously that the 
University has an institutional repository, Open Research Exeter or ORE, and although we have a 
repository, we do recommend that data should be deposited into specialist repositories where they 
exist. Obviously if the specialist repositories don't exist then that's where ORE comes in. You can 
deposit your data into ORE. And something else that is also very important is, you know depositing 
data in a repository is important and making it available to others. But it's not enough to simply 
make the data available, right. You need to make sure that others can understand the data and so 
this comes back to then, making sure that the documentation explaining the data are also available 
alongside the data. And then in terms of publications. So often you've got data underlying a 
publication. When you deposit your data into a repository then the data are often assigned a
persistent identifier and this identifier which obviously uniquely identifies that dataset should be 
included in the publication as a way of making sure that someone reads your publication and thinks, 
oh that's really interesting, I wonder if I could use that in my own research and they can click on a 
link and go and find your data and then use your data and build on it. And so it's all about making 
the links and making everything clear so we know that others know exactly what the data are and 
therefore how they could potentially use them. So you talked about with the SSI trying to see 
software as this important research output, and I also feel the same way about data, right? It should 
also sort of be seen as this standalone output that's very important. So I just want to, I assume you 
have similar thoughts to what you do about the software in terms of having data as a standalone 
sort of output.

Eilis Hannon:
Oh, definitely. I mean, so when we kind of, you know, when we receive a grant from say, the Medical 
Research Council, they're not only funding us to answer the question we outline, they are funding us 
to generate a dataset and they want us to, you know, they see the value of the dataset as a research 
output. And you know, that's why they kind of require that we share that and you know, certainly 
that's one of the selling points that we write into our grants is that you know whilst you know we're 
generating it to answer this question we recognize there is value more broadly and we want to help 
facilitate that. You know, and I think that's one of, I'd say it's almost, 'injustice' is 
probably too strong a word, but actually there is a lot of pre-existing genomic data. So if I talk 
specifically about my field, that could be reused a lot more to answer a lot more questions, yet 
arguably the emphasis is still on generating new datasets and there's a little bit of a gap where you 
could fund analysts or computer scientists to come in and take advantage of publicly available data.
And of course, that's actually quite a cheap grant, because you're just paying for someone's time as 
opposed to paying expensive experiments. But I don't know that the research landscape’s 
necessarily picked up that actually we could. You know, there's probably a lot of answers that we 
could make if we just funded a few more analysts in the field as opposed to constantly funding new 
data generation projects.

Chris Tibbs:
Yeah, it seems like we're missing a trick there with all the information and results that could still be 
generated from those archival data. I just wanted to perhaps ask you someone obviously who's sort 
of been on this journey from analysing publicly available data to then going and generating your own 
data about some of the obstacles and barriers that you have found and maybe potentially found a 
way around.

Eilis Hannon:
So I think one of the biggest obstacles, I'm gonna talk specifically about code sharing here.

Chris Tibbs:
OK.

Eilis Hannon:
And because that's, I guess, a big kind of focus of mine, and I recognize one of the barriers is I guess 
a feeling that my code isn't good enough to be shared. You know, it's not pretty enough or it's a bit, 
you know, I'm not an expert programmer, you know, I've taught myself or I've only been on a 
beginner’s course. And I'm kind of only here because I've had to get here and therefore I'm either 
exempt from sharing my code or, you know, generally there's a lot of anxiety that it doesn't look nice 
or, you know, someone might find a mistake. And you know that, a PhD can be a time of, you know, 
fear that, you know, there's so much pressure on my results and you know if someone were to find a 
mistake and that could be quite, have a big impact, but I'd actually flip it around and say that if you 
approach your project with the intention of making it open, you tend to program in a different way, 
and you tend to program in a more reproducible, robust way and actually think of that almost as,
rather than thinking about people finding the mistakes, think about the fact that I have been 
completely transparent in what I did. OK, maybe I haven't done it perfectly. Maybe somebody would disagree 
with the way I did it, but because I made my code available, they can see exactly how I did it and 
under what limitations that is, and my results are what they are based on this, you know, based on 
these decisions that I made and I have to say that the main benefit I've seen is my own personal 
satisfaction in my work. Because, you know, yeah, OK, there's lots of different ways we could have 
done it and somebody might criticize the way I did it, but they know the way I did it. And so they 
know the extent of my results and maybe they would’ve done it differently. But, you know, I've been 
transparent in how I did it. So they know how to interpret what I did and it really helps remove 
that anxiety about someone finding a mistake. If you think about it, it's just you being open in how 
you did it such that somebody can assess its value or not.

Chris Tibbs:
Yes, it's very important, this is all about having the mindset of approaching it from being open from 
the start. That's really interesting. So I just want to wrap up by asking one final question. You know, I 
mean you sort of touched on this in the last answer, about having this mindset of being, sort of, 
open from the start, but given your experience, is there sort of one simple takeaway message 
for listeners who, they want to do the right thing, they want to look after their data, they want to be 
able to make it available, but they might be feeling a little daunted and not really sure where to 
start.

Eilis Hannon:
So I think there are lots of small steps that can be taken, and so, you know, if you're 
new to this, then the first step would just be to have a script that 
you know works and is nicely commented and documented. Yeah, that's the first natural step. You 
know, and I think that actually this isn't something that you have to kind of go into expecting to be 
perfect from the start. There are lots of little things that you can do that just get you towards a more 
open environment and the one thing we have to be conscious of is that the requirements for 
different types of research vary hugely, so there is no one-size-fits-all solution here. It's all about what's 
relevant to your specific project and, you know, the outcomes from that, and I'd 
almost encourage people to be a bit reflective about what it is about what I do that someone else 
would want to benefit from. But also, what do I benefit? How do I benefit from working more 
openly?

Chris Tibbs:
That's really, really, really good advice. Thank you very much for sharing your knowledge and 
experience and hopefully we can inspire some of our listeners to start thinking about, you know, 
managing and sharing their data. Thank you everyone for listening and thank you Eilis. Take care.
Bye bye.
