WEBVTT

1
00:00:01.770 --> 00:00:10.920
Sara Gonzales: Welcome to Best Practices in Research Data Management and Data Sharing. My name is Sara Gonzales, and I'm the Data Librarian at the Galter Health Sciences Library and Learning Center.

2
00:00:11.490 --> 00:00:21.480
Sara Gonzales: And also, as we start this class, I'd just like to mention that this class is a part of the Galter DataLab, which you can learn more about by checking out the DataLab tab

3
00:00:21.900 --> 00:00:33.750
Sara Gonzales: on the Galter Health Sciences Library website. And here at DataLab, we are really concerned with connecting faculty, staff, and students of the Feinberg School of Medicine with data-related resources.

4
00:00:34.350 --> 00:00:40.980
Sara Gonzales: And we do that in a couple of different ways. Firstly, we're working on an update to the institutional repository DigitalHub.

5
00:00:41.400 --> 00:00:47.670
Sara Gonzales: If you're a current user of that, it will change over the next few months to a new, more robust and scalable system.

6
00:00:48.360 --> 00:00:59.610
Sara Gonzales: And in addition, we offer the Data Clinic and the Data Clinic is basically a kind of primary care model or a first contact model, through which you can get in touch with us at the Library.

7
00:01:00.120 --> 00:01:08.790
Sara Gonzales: If you have any questions about working with data in any of the ways you see here, so from collection and management to cleaning, analysis, and even some initial visualization.

8
00:01:09.390 --> 00:01:19.560
Sara Gonzales: If you would like a quick consultation or to schedule something with our librarians, you can fill out the contact form or email dataclinic@northwestern.edu. Our final

9
00:01:20.520 --> 00:01:27.420
Sara Gonzales: initiative within DataLab is Education and Community Engagement, and that is where classes like this one

10
00:01:27.990 --> 00:01:36.390
Sara Gonzales: come to the fore. So under the Data Management tab on the Galter classes page, you can see many classes all related to data management topics.

11
00:01:36.930 --> 00:01:51.630
Sara Gonzales: So if you'd like to explore those educational opportunities, that's another way that we want to connect with you. So here, let's launch into one of those very classes on data management and data sharing and we'll start today's presentation.

12
00:01:54.450 --> 00:02:02.520
Sara Gonzales: So what we're going to cover today are the best practices in research data management. We'll also talk a bit about the benefits of data sharing,

13
00:02:03.090 --> 00:02:11.430
Sara Gonzales: As well as where you might like to share your data, and then finally we'll cover the resources available through Northwestern to help with data management and data sharing.

14
00:02:12.600 --> 00:02:19.320
Sara Gonzales: So let's start with a definition. "Research data" actually has this kind of federally accepted definition,

15
00:02:19.680 --> 00:02:26.730
Sara Gonzales: which is "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings."

16
00:02:27.210 --> 00:02:34.860
Sara Gonzales: And the important thing to remember is that research data can take on many formats. It could be analog or digital and can exist in many styles.

17
00:02:35.490 --> 00:02:51.780
Sara Gonzales: So lots of different things, from electronic lab notebooks to spreadsheets to health indicators, can count as different types of data. Much of it is electronic and needs to be managed in certain ways, and we'll talk about those ways throughout the presentation.

18
00:02:53.280 --> 00:02:59.190
Sara Gonzales: So the next question is, what is research data management? This really concerns the organization of data

19
00:02:59.910 --> 00:03:06.150
Sara Gonzales: all throughout its life cycle, from its first entry, when a project is just being conceptualized and data begins to be collected,

20
00:03:06.570 --> 00:03:12.180
Sara Gonzales: all the way through to analysis and dissemination, and finally archiving of your valuable research results.

21
00:03:13.170 --> 00:03:29.220
Sara Gonzales: The whole data management process, then, really aims to make your entire research process as efficient as it possibly can, through organization techniques. And this will help you meet the expectations and requirements of your University, of research funders, and even of legislation.

22
00:03:31.020 --> 00:03:36.990
Sara Gonzales: And if you're concerned about data management, that's why you've looked into this class today, but just to emphasize that point,

23
00:03:37.560 --> 00:03:48.120
Sara Gonzales: research data is an extremely valuable asset. We invest a lot of time and a lot of funds into collecting this data, so we want to make sure we get the best use out of it that we possibly can.

24
00:03:48.780 --> 00:03:53.430
Sara Gonzales: And when we are managing data well, we're maximizing its effective use and value.

25
00:03:54.120 --> 00:04:07.380
Sara Gonzales: We're ensuring its provenance and authenticity, in that we know who's created data records and how they've been maintained throughout their lifecycle, and we can answer those pertinent questions of provenance, use, and changes over time.

26
00:04:08.460 --> 00:04:14.340
Sara Gonzales: We're also ensuring the appropriate use of data information, so we ensure that only authorized people are able to

27
00:04:15.060 --> 00:04:24.210
Sara Gonzales: see or manipulate data files at certain points in their lives, which helps us to honor our obligations to the IRB, to our institutions, to funders, etc.

28
00:04:25.110 --> 00:04:37.380
Sara Gonzales: Also, we can facilitate data sharing with better managed data. And finally, we can ensure sustainability and accessibility in the long term of our data assets when we manage them well. So that means they can be reused or referenced by someone in the future.

29
00:04:39.390 --> 00:04:45.120
Sara Gonzales: So, one important topic to cover right at the beginning is the requirements around data management,

30
00:04:45.600 --> 00:04:57.210
Sara Gonzales: or that can affect the way you approach data management. So one of these is to actually be aware of the retention requirements for research data at Northwestern. This kind of falls in the realm of

31
00:04:58.650 --> 00:05:04.230
Sara Gonzales: retention schedules and records management as that pertains to research projects,

32
00:05:04.980 --> 00:05:10.260
Sara Gonzales: in the sense that research data is considered a record of note and something that must be kept for future access.

33
00:05:10.830 --> 00:05:19.920
Sara Gonzales: So the actual retention policy of Northwestern is that research data has to be maintained for a minimum of three years after final reports for the project have been submitted.

34
00:05:20.430 --> 00:05:27.540
Sara Gonzales: But really, that is only a minimum retention period. Listed below are some reasons why research data might need to be retained even longer.

35
00:05:28.080 --> 00:05:34.320
Sara Gonzales: And that would be if you might need to protect intellectual property that results from the work, which will have its own retention requirements.

36
00:05:34.890 --> 00:05:41.250
Sara Gonzales: If students are involved, data must be retained until the student's degree is awarded and any resulting papers are published.

37
00:05:42.180 --> 00:05:47.910
Sara Gonzales: Sometimes funding awards or particular contracts with Northwestern specifically require a longer retention period.

38
00:05:48.660 --> 00:05:56.460
Sara Gonzales: And finally, sometimes federal oversight or regulations, or sponsor policies, or even journal publication guidelines might require a longer retention period.

39
00:05:57.210 --> 00:06:06.270
Sara Gonzales: Therefore, if you choose the longest period that any of these different stakeholders might apply to data retention, then you're also meeting the needs of

40
00:06:06.930 --> 00:06:18.660
Sara Gonzales: the other stakeholders who might have shorter retention periods. So with that in mind, we know we're potentially keeping data for a long time and, therefore, the better managed it is, the more accessible and useful it is into those future years.

41
00:06:20.280 --> 00:06:26.160
Sara Gonzales: So knowing that, and knowing that we might have a lot of work to do to get data into shape, let's consider one possible scenario,

42
00:06:26.490 --> 00:06:35.760
Sara Gonzales: which is that we might have been collecting data in a project and perhaps it could have been managed a bit better from the beginning, and we'd like to kind of take stock in the middle

43
00:06:36.240 --> 00:06:43.080
Sara Gonzales: and see how we could do a better job with managing data. Well, the way to take that stock, one approach that can be taken,

44
00:06:43.470 --> 00:06:51.030
Sara Gonzales: is to actually take a data inventory of where you are now, so you'll know what kind of state your data is in, and then how you can manage it better going forward.

45
00:06:51.570 --> 00:06:59.700
Sara Gonzales: A data inventory is really relatively simple, in that it just seeks to answer these questions you see here, kind of the five W's and the "How" question about your data.

46
00:07:00.270 --> 00:07:10.200
Sara Gonzales: Answer: Who's interacting with it, What it is, Where it is, and a few additional questions. And then the way you approach recording this data can be completely

47
00:07:11.370 --> 00:07:21.210
Sara Gonzales: up to your team in terms of what's easiest for you. You can record this in a small homegrown database, you can record it in a spreadsheet, or even any kind of Word doc or Google Doc.

48
00:07:21.540 --> 00:07:25.050
Sara Gonzales: Anything that works best for your team to just record the answers to these questions.

49
00:07:25.920 --> 00:07:30.840
Sara Gonzales: So we'll go through a couple of the details here that you might want to include in your data inventory, as you're getting up to speed.

50
00:07:31.500 --> 00:07:45.990
Sara Gonzales: The What questions about data typically include things like: what types of file names are you giving the data files? Do you have descriptions for individual files or groupings of files? Do you know how many you have, and the rate at which they're going to grow?

51
00:07:47.190 --> 00:07:52.440
Sara Gonzales: The Who questions around data are largely questions of data creation, access, and ownership.

52
00:07:52.830 --> 00:08:01.020
Sara Gonzales: So this is where all our stakeholders are involved, and you might want to think about who's accessing data at what levels, and what permissions are necessary.

53
00:08:01.470 --> 00:08:10.980
Sara Gonzales: So think about all the people who might need to interact in some way with your data. This could include, in addition to your PIs, you might have officials in the university,

54
00:08:11.520 --> 00:08:19.560
Sara Gonzales: Colleagues within your department or in your cores or consortia, the funders, at some level, might need to access the data as well as publishers.

55
00:08:20.130 --> 00:08:27.660
Sara Gonzales: And so as we're thinking about these stakeholders, as I mentioned, it's really important to think about the different authentication levels and access levels that they might all need.

56
00:08:28.200 --> 00:08:38.100
Sara Gonzales: Can these be taken care of through database passwords or other kinds of passwords or restrictions that you might implement? These are important things to think about in the Who questions.

57
00:08:39.270 --> 00:08:42.690
Sara Gonzales: The Where questions about data largely have to do with storage.

58
00:08:43.140 --> 00:08:49.770
Sara Gonzales: And as you know already, there's many places you could potentially store data. And there's really no one that's better than another.

59
00:08:50.100 --> 00:08:57.810
Sara Gonzales: Unless of course you have privacy restrictions, but we probably all use variations of these different systems at different points. So we've got

60
00:08:58.290 --> 00:09:06.750
Sara Gonzales: institutional servers, or also services like Box. There are other file-sharing services. And then there are personal or shared drives that we might have, like Google.

61
00:09:07.710 --> 00:09:21.900
Sara Gonzales: People store data on their own computers, which is fine, and they might also be doing backups. So it's completely acceptable to utilize all of these different storage venues. But what's important is knowing what is on which storage areas, and also

62
00:09:23.640 --> 00:09:26.190
Sara Gonzales: How those are being backed up and maintained for the future.

63
00:09:28.980 --> 00:09:39.630
Sara Gonzales: The How questions around data are really kind of getting to the heart of the project. And here you might consider questions like your collection techniques for data, and the instruments you're using

64
00:09:40.140 --> 00:09:45.930
Sara Gonzales: To create your data or data files, how you're naming files, the kind of workflows you have set up for these procedures,

65
00:09:46.710 --> 00:09:50.160
Sara Gonzales: Any metadata you're recording, any analysis tools and software that you're using.

66
00:09:50.580 --> 00:10:02.910
Sara Gonzales: All of these things really have a very direct impact on the data files you're creating. So the more information you can collect and record somewhere about these techniques, and instruments, and workflows, really the better.

67
00:10:03.600 --> 00:10:11.610
Sara Gonzales: And finally, the When questions about data really have to do with the life that you foresee for your data files in the future.

68
00:10:12.000 --> 00:10:21.180
Sara Gonzales: So can we record information that helps future users know what happened to data at what points? So are we recording who created files, who modified them,

69
00:10:21.990 --> 00:10:29.310
Sara Gonzales: who did what at certain points in the files' lifecycle? Retention periods, which we already mentioned, are also really important to consider.

70
00:10:30.000 --> 00:10:35.310
Sara Gonzales: And also file format migration for files that really may last long term, and also the long term storage solutions.

71
00:10:36.120 --> 00:10:42.120
Sara Gonzales: So that's kind of a quick overview of some of these 5 W's and How questions about your data, that you can record in that inventory.

72
00:10:42.540 --> 00:10:50.220
Sara Gonzales: And for each of these questions, as our kind of diagram illustrates, it's important to think about the entire life cycle of those data files.

73
00:10:50.730 --> 00:10:58.560
Sara Gonzales: So once you've got this information recorded in a central place and your entire team understands what's going on with data, then moving forward you can start to implement

74
00:10:59.340 --> 00:11:10.500
Sara Gonzales: some strict procedures about how to manage data in the future, so that hopefully future inventories won't show that procedures are getting left behind, or that there are corrections to be made.

75
00:11:12.450 --> 00:11:22.500
Sara Gonzales: So one good procedure to implement right off the bat, really as early on in the project as possible, is data backups. So decide how you're going to do them, and in what formats.

76
00:11:23.220 --> 00:11:34.050
Sara Gonzales: The best practice recommendation for making backups of your data is to have at least three copies, and these would be your original, your local external, and your off site external.

77
00:11:34.860 --> 00:11:43.320
Sara Gonzales: Your original copy could literally be on your own machine, that's perfectly acceptable. You know, we're all working with our files usually initially on our own computers.

78
00:11:44.010 --> 00:11:55.380
Sara Gonzales: Local external might be something like your institutional drives. In Feinberg School of Medicine they're called the FSMRes files; so this is a drive that is outside your personal machine.

79
00:11:55.920 --> 00:12:04.620
Sara Gonzales: But it is local to the extent that it's not located too far away; if you had to drive to those servers, for whatever reason, you could. They're mostly local.

80
00:12:05.550 --> 00:12:13.410
Sara Gonzales: Off site external exists in a couple of formats. In older times that might be something like you see here, a kind of actual removable hard drive

81
00:12:13.860 --> 00:12:20.490
Sara Gonzales: that might be driven away to a different location 30 miles away or something, to be preserved in case of a local catastrophe.

82
00:12:21.180 --> 00:12:31.800
Sara Gonzales: Increasingly, off site external is getting to be cloud storage. This can be okay in some contexts; of course, there are always potential safety concerns there with the privacy of the data.

83
00:12:32.190 --> 00:12:41.730
Sara Gonzales: But increasingly cloud storage solutions are becoming more and more useful and efficacious. And I'll get into the details about that as we go along.

84
00:12:42.630 --> 00:12:54.120
Sara Gonzales: And also it's really important to back up consistently. So whether you're actually kind of reminding yourself to do it, or if you're setting an auto-backup, just ensure that happens on a regular basis, be it monthly, weekly, etc.

85
00:12:55.230 --> 00:13:03.840
Sara Gonzales: Also, it's important to be able to roll back your changes for at least one month. So if you've got a month's worth of saved data, some of that might be very duplicative, but at least

86
00:13:04.440 --> 00:13:08.610
Sara Gonzales: if a catastrophe happens you know you've got at least a month's worth of saved files.

87
00:13:09.390 --> 00:13:15.270
Sara Gonzales: If for your particular project or your needs, you need to save data for more than one month, then you can go ahead and do that.

88
00:13:16.230 --> 00:13:24.150
Sara Gonzales: There's also optical storage options. So especially for these backups, you can use something, as I said, like a removable drive or even CDs or DVDs.

89
00:13:24.660 --> 00:13:35.490
Sara Gonzales: The only caveat there is that these types of media are not recommended for much longer term storage than just maybe a few months or a couple of years, because these media will break down over time.

90
00:13:36.540 --> 00:13:43.830
Sara Gonzales: And so, as I say, the cloud is good for longer term storage as long as you're using the appropriate cloud storage for sensitive data.

91
00:13:44.160 --> 00:13:50.520
Sara Gonzales: So there's definitely some cloud storage solutions that aren't appropriate for that, but some are approved in certain contexts.

92
00:13:51.060 --> 00:13:57.570
Sara Gonzales: So there's a little bit more detail about that here. First and foremost, if you have access to the FSM file storage,

93
00:13:57.930 --> 00:14:04.200
Sara Gonzales: they're really your kind of first contacts and first grouping of people to work with to find storage solutions for you.

94
00:14:04.800 --> 00:14:14.820
Sara Gonzales: So if you've got data files that are exceeding your current capacity on the FSMRes files, you can ask FSMIT to increase that storage and they will work with you.

95
00:14:15.690 --> 00:14:27.240
Sara Gonzales: So this is secure storage, as I mentioned, they're kind of local servers. So they're secure for storing data with PHI; and if you need to request additional storage, if you download today's slides, we've actually got

96
00:14:28.260 --> 00:14:33.360
Sara Gonzales: links to make the request for storage, or you can email fsmhelp@northwestern.edu.

97
00:14:34.560 --> 00:14:41.550
Sara Gonzales: In addition, speaking of those backup options again. So as I mentioned, the FSMRes files will always be your kind of first step.

98
00:14:41.910 --> 00:14:45.300
Sara Gonzales: But for the cloud storage, here's where we've got those couple of different recommendations.

99
00:14:45.780 --> 00:14:56.040
Sara Gonzales: You can use Box for files you're sharing just for normal workflows, for the things that aren't too sensitive, but Box is not currently recommended for storing sensitive data.

100
00:14:56.670 --> 00:15:06.690
Sara Gonzales: And there are things like Box Labs that they're working on right now, for a potential sensitive part in the future, but at the moment Box is not recommended for any kind of sensitive data.

101
00:15:07.440 --> 00:15:18.900
Sara Gonzales: Whereas Northwestern SharePoint, on the other hand, has been approved for sensitive data, and you can check out the link there to learn more about this option. So that is a kind of sensitive-data-approved cloud option.

102
00:15:19.680 --> 00:15:27.960
Sara Gonzales: In addition, a nice resource that we have access to here at Northwestern is called CrashPlan Pro, and this is a backup and restoration service.

103
00:15:28.470 --> 00:15:33.030
Sara Gonzales: It's free. It's cloud-based, but it is secure and it's available to all FSM departments.

104
00:15:33.720 --> 00:15:39.960
Sara Gonzales: So what's nice about it is that you can individually select files or folders from your machine to back up, or your whole hard drive.

105
00:15:40.410 --> 00:15:47.490
Sara Gonzales: And also you can restore yourself at any time. So this is perfect for those situations where there might be kind of a catastrophic failure of your machine.

106
00:15:47.910 --> 00:15:58.020
Sara Gonzales: If you're using this service you can have everything on your hard drive backed up, and then restore it to your new machine when you like. So it's a great, great service that normally has a cost, but we have free access to it here.

107
00:16:00.480 --> 00:16:11.400
Sara Gonzales: Okay, so let's move on to some other procedures and best practices in research data management. Again, thinking about organizing those files so that things don't get to the stage of the messy files and the data inventory.

108
00:16:12.150 --> 00:16:20.010
Sara Gonzales: So firstly, it can help to employ something like standard operating procedures, and this is something that you might have seen in clinical contexts before.

109
00:16:20.670 --> 00:16:32.400
Sara Gonzales: I've spoken with clinical research coordinators who say these can be a great help. And this is basically a document that serves as a step-by-step listing of what you need to do to complete any given procedure.

110
00:16:33.360 --> 00:16:37.050
Sara Gonzales: So it can be a daily operational practice, it can be something that's only done once in a while.

111
00:16:37.350 --> 00:16:46.260
Sara Gonzales: But the point of it is that it lists procedures in a very systematic and easy-to-understand way so that anyone who has to complete that procedure knows what to do,

112
00:16:47.010 --> 00:16:51.120
Sara Gonzales: who is responsible for completing it, and really those kind of best steps to follow.

113
00:16:51.840 --> 00:16:57.930
Sara Gonzales: So this can be helpful in a lot of different contexts, for instance, if you need to make up a procedure for naming your files,

114
00:16:58.230 --> 00:17:04.500
Sara Gonzales: for storing your files, for doing your backups. You can write up any of those as this kind of Standard Operating Procedure or SOP.

115
00:17:05.040 --> 00:17:14.100
Sara Gonzales: So there's a couple of elements in this type of document that will help to make it successful for you and your team, and also for new people joining the team so they can get up to speed right away.

116
00:17:15.030 --> 00:17:23.040
Sara Gonzales: First of all, near the top you'll want to name and define the procedure. The definition will help newcomers to the team to understand why it's important.

117
00:17:23.850 --> 00:17:35.820
Sara Gonzales: It should have a date at the top and that should be reviewed regularly so that we know it's not kind of referring to any outdated practices; and also regularly updated procedures are more likely to be followed.

118
00:17:36.450 --> 00:17:44.040
Sara Gonzales: And then also it should list who maintains this procedure. So you want the name or the title of the person responsible for really keeping up and enforcing the procedure.

119
00:17:44.520 --> 00:17:51.240
Sara Gonzales: And also you should have a little definition section. So here you can define any terminology that might be used

120
00:17:51.660 --> 00:17:58.320
Sara Gonzales: In enforcing this procedure. So again, very helpful for newcomers to the team so they can understand some of the terms that you're using.

121
00:17:59.250 --> 00:18:10.290
Sara Gonzales: So an example of a standard operating procedure can be seen here, and this I just wrote up as a complete example for a standard operating procedure for systematic review file naming convention.

122
00:18:11.550 --> 00:18:23.910
Sara Gonzales: So really, any kind of procedure that you like can be written up in this way, and if this template serves for you, you can modify this one definitely; and also in the slides I have a link to additional

123
00:18:24.660 --> 00:18:31.950
Sara Gonzales: SOPs, many of them in a clinical research context, that you can take and adapt for your purposes. But as you see, this one here

124
00:18:32.520 --> 00:18:38.880
Sara Gonzales: Contains all the elements for an effective standard operating procedure. It's got its title at the top and also a date.

125
00:18:39.390 --> 00:18:49.980
Sara Gonzales: We've got the purpose, so, why this procedure needs to be followed. We've got the responsibility, so, either the person's name or the title of the person who is responsible for keeping up this procedure.

126
00:18:50.550 --> 00:18:56.220
Sara Gonzales: We've got the definition section. And then finally, the five listed steps for how to carry out the procedure.

127
00:18:56.910 --> 00:19:04.230
Sara Gonzales: These documents really don't need to be any more complicated than that, unless the procedure you're outlining does have more steps, or more complicated elements.

128
00:19:04.710 --> 00:19:10.620
Sara Gonzales: So again, feel free to take this template or find some of the others to download from the slides, if they might be helpful for you.

129
00:19:12.690 --> 00:19:24.390
Sara Gonzales: Okay, let's talk then, next, about another best practice in data management, which is file naming. And this can be really more than half the journey when it comes to managing files effectively. An effective name of a file

130
00:19:25.260 --> 00:19:38.310
Sara Gonzales: can help you with findability, it can help you with filing things away at the end of the day, with sharing files in the future; just all of these things are made much easier when a file is well-named, so let's explore how to do that.

131
00:19:39.360 --> 00:19:42.390
Sara Gonzales: So within the kind of

132
00:19:44.520 --> 00:19:53.580
Sara Gonzales: breakdown of a file name here, we've basically got the initial part, the period, and then the file extension. So everything I'm talking about today actually refers to everything before the file extension.

133
00:19:54.090 --> 00:20:01.380
Sara Gonzales: And what I'm advocating for as a best practice to formulate this part of your file name is to make it a multi-part file name.

134
00:20:02.100 --> 00:20:08.940
Sara Gonzales: So we might have heard accepted wisdom in the past that says just make a filename short, or you don't have to

135
00:20:09.900 --> 00:20:15.360
Sara Gonzales: repeat information that might come from your folder where the file is stored or that kind of thing. Just keep everything short.

136
00:20:15.870 --> 00:20:27.420
Sara Gonzales: But that is actually kind of outdated information, because if we imagine that a file leaves its original location, we can no longer use the information from the folder to identify it. So we really need kind of robust

137
00:20:28.770 --> 00:20:40.680
Sara Gonzales: file names consisting of multiple parts that tell you something about the project without you even having to click all the way into the file. So what I mean by that, then, we can talk a bit about what these file names will contain.

138
00:20:41.820 --> 00:20:46.680
Sara Gonzales: Really, they should say something important about the project or about the data collection instance that that file represents.

139
00:20:47.280 --> 00:20:56.310
Sara Gonzales: And then once you kind of break that down into small component parts, you can either break those sections up with an underscore or you can run them together in various ways. And we'll see an example of that.

140
00:20:57.480 --> 00:21:09.660
Sara Gonzales: Also, as I say, it really should be descriptive. So those multiple parts of your file name should be understandable by others, and especially even people who are not part of your project or who will see these files much further on in the future.

141
00:21:10.530 --> 00:21:19.290
Sara Gonzales: So that's kind of a really good best practice recommendation to follow. Just think about who might see this who's never heard of the project before, and can they understand something from the file name.

142
00:21:20.370 --> 00:21:24.900
Sara Gonzales: An additional thing you can do is put a date in the file name, especially if

143
00:21:25.890 --> 00:21:30.660
Sara Gonzales: The file represents data that was collected on a particular date, and you really need to record that information.

144
00:21:31.140 --> 00:21:42.390
Sara Gonzales: So if you use this format of the four-digit year, two-digit month, and two-digit day and make that your first element, then all of the files that you name in a similar way will line up in perfect date order in your folder.
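
NOTE
Editor's sketch, not part of the spoken lecture. A minimal Python illustration of
why a YYYYMMDD date as the first file name element lines files up in date order.
The helper name date_prefixed_name and the example file names are hypothetical.
```python
from datetime import date
def date_prefixed_name(collected_on: date, rest: str) -> str:
    # Four-digit year, two-digit month, two-digit day, then the other elements.
    return f"{collected_on:%Y%m%d}_{rest}"
names = [
    date_prefixed_name(date(2020, 11, 3), "StressHealth_Survey1.csv"),
    date_prefixed_name(date(2019, 2, 14), "StressHealth_Survey1.csv"),
    date_prefixed_name(date(2020, 1, 27), "StressHealth_Survey1.csv"),
]
# Plain alphabetical sorting of these names is also chronological sorting.
print(sorted(names))
```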

145
00:21:43.740 --> 00:21:50.550
Sara Gonzales: Also, you can add version information. So sometimes systems will do this for us. You can actually do that from Box or from Google Docs.

146
00:21:51.120 --> 00:22:01.200
Sara Gonzales: But you can append version information yourself to any file that you might have had to clean up or do a manipulation to, by just adding "v01" or "v02" as that last element in the file name.

147
00:22:01.920 --> 00:22:07.650
Sara Gonzales: And then that leading zero is important because otherwise we get the phenomenon in a folder where

148
00:22:08.550 --> 00:22:16.260
Sara Gonzales: something has 1 at the end, then there's another file that has 10 at the end, and 10 lines up under 1 and kind of skips over the 2 through 9 in between,

149
00:22:17.010 --> 00:22:30.990
Sara Gonzales: simply because the operating system is not thinking about the files numerically that way. So when you use the 01 or 02, then that takes care of that problem. And if you will have more than 99 versions of your file, you can use two leading zeros to take care of that.
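
NOTE
Editor's sketch, not part of the spoken lecture. It assumes plain character-by-character
sorting, which is what Python's sorted() does; some file managers sort numbers differently.
The helper name version_suffix and the report file names are hypothetical.
```python
def version_suffix(version: int, width: int = 2) -> str:
    # Pad with leading zeros so that text sorting matches numeric order.
    return f"v{version:0{width}d}"
unpadded = sorted(f"report_v{n}.csv" for n in (1, 2, 10))
padded = sorted(f"report_{version_suffix(n)}.csv" for n in (1, 2, 10))
print(unpadded)  # v10 lands between v1 and v2
print(padded)    # v01, v02, v10 stay in numeric order
# For more than 99 versions, widen the padding to three digits.
print(version_suffix(7, width=3))
```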

150
00:22:32.160 --> 00:22:37.710
Sara Gonzales: Also, once you've got your multi-part filename structure put into place, it's important to do periodic quality control.

151
00:22:38.400 --> 00:22:45.180
Sara Gonzales: Just make sure that new staff are trained in the file naming convention and then check on their work every once in a while, just with a spot-check

152
00:22:45.600 --> 00:22:50.430
Sara Gonzales: Or a quick QC to make sure everyone's naming things consistently. And then finally,

153
00:22:51.210 --> 00:22:57.930
Sara Gonzales: And really importantly, once you agree on how files should be named within your team, you can write up that standard operating procedure or SOP

154
00:22:58.380 --> 00:23:07.440
Sara Gonzales: And then save that in a place that everyone knows about, and everyone can reference, so everyone can remind themselves how to name files or new team members can learn how to name files.

155
00:23:11.940 --> 00:23:16.110
Sara Gonzales: So let's take a look at how this actually works in practice with a couple of examples.

156
00:23:16.860 --> 00:23:28.200
Sara Gonzales: So here I've got a fictional study that's looking at the effects of stress on health and it is resulting from a kind of interview situation where the interviewer and interviewee are in the room together.

157
00:23:28.680 --> 00:23:32.490
Sara Gonzales: And the interviewer is asking their questions verbally of the interviewee.

158
00:23:33.150 --> 00:23:40.410
Sara Gonzales: So as you can see kind of looking at this multi-part filename example, we can pretty much tell almost all that information, simply from looking at these elements.

159
00:23:40.920 --> 00:23:48.540
Sara Gonzales: Here we've got the date that the interview took place as our first element, then we have a kind of abbreviated name of my entire study about stress and health.

160
00:23:49.260 --> 00:23:54.000
Sara Gonzales: I've got Survey1 to represent the survey instrument that was used, my questionnaire instrument.

161
00:23:54.660 --> 00:24:02.520
Sara Gonzales: I've got a quick little sequence of numbers here. This is my code for my de-identified interviewee. I've got the initials of the interviewer.

162
00:24:03.150 --> 00:24:11.580
Sara Gonzales: And then "verbal," that's a condition of data collection, and here we're saying that, again, that interviewer was in the room with the interviewee verbally asking the questions.

163
00:24:12.030 --> 00:24:21.810
Sara Gonzales: So here, with just a few elements in this file name broken up by underscores, we can tell a lot about this data collection instance without having to have clicked into the file.
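
NOTE
A minimal sketch of assembling a multi-part file name like the one described above. The function name build_filename, the YYYYMMDD date format, the participant code, and the interviewer initials are all assumptions for illustration; the slide's actual values are not reproduced here.
```python
from datetime import date
def build_filename(interview_date, study, instrument, participant_code, interviewer, condition, ext="csv"):
    """Join the naming elements with underscores, date first as YYYYMMDD."""
    parts = [interview_date.strftime("%Y%m%d"), study, instrument, participant_code, interviewer, condition]
    for part in parts:
        # Allow only letters, digits, and dashes inside an element so the underscore stays an unambiguous separator.
        if not part or not all(ch.isalnum() or ch == "-" for ch in part):
            raise ValueError(f"unsupported characters in element: {part!r}")
    return "_".join(parts) + "." + ext
# Hypothetical values for one interview.
name = build_filename(date(2021, 1, 15), "StressHealth", "Survey1", "00123", "SG", "verbal")
print(name)
```
Because every element is validated before joining, splitting the name on underscores recovers the same six elements without opening the file.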

164
00:24:22.500 --> 00:24:35.970
Sara Gonzales: Other things we can do, too: if we don't like the underscores, dashes are also acceptable within file names, but really no other special characters would work. Or we could run the elements together and count on the capitalized

165
00:24:38.400 --> 00:24:38.820
Sara Gonzales: Um

166
00:24:39.930 --> 00:24:51.480
Sara Gonzales: letters here to separate out our elements. Or we could run everything together and not even worry about the capitalization or hyphens. So that's really up to you; it comes down to

167
00:24:52.830 --> 00:25:03.180
Sara Gonzales: The preference of the team. And also, this isn't a hard and fast order of exactly how elements should be placed; that also is just a completely individual team decision.

168
00:25:03.570 --> 00:25:12.420
Sara Gonzales: So here's another example of how you could arrange them. Here I've got the interviewee's code number first, so it can really be anything. And here I've got one of these version numbers appended at the end.

169
00:25:13.380 --> 00:25:25.260
Sara Gonzales: And then, same thing, I've just kind of changed it up again. Same elements, but in a different order, kind of maintaining the point that I can tell a lot about this data collection instance, just from looking at this file name without clicking into the file.

170
00:25:27.180 --> 00:25:34.350
Sara Gonzales: Okay, and in the same realm, now that we're naming files well, we might have established a convention that follows some of the recommendations I provided

171
00:25:34.860 --> 00:25:46.950
Sara Gonzales: Now the next thing to consider would be folder naming and organization. So, making sure that all these well-named files are now living in a structure that makes sense to the team, and that really enables the maximum findability.

172
00:25:47.640 --> 00:25:55.620
Sara Gonzales: So the first recommendation I have here, if you're making a folder hierarchy, is to start at that top level and work down.

173
00:25:56.100 --> 00:26:05.700
Sara Gonzales: This can be very challenging, especially when you're weighing the breadth of your folder structure against its depth, and trying not to let the nested folders get too deep.

174
00:26:06.180 --> 00:26:12.300
Sara Gonzales: But that's just really going to be a discussion that your team will need to have, there's really no rule about how best to do that.

175
00:26:13.050 --> 00:26:22.290
Sara Gonzales: Usually the things that you might think about putting at the top level should be something like the highest level topics or categories, or maybe it's the highest level procedures.

176
00:26:23.100 --> 00:26:35.700
Sara Gonzales: Kind of functional categories. Really, whatever works best for your team in terms of how everyone most conceptualizes the biggest ideas in your project, those would usually be your top levels, and then you can work down from there.

177
00:26:37.050 --> 00:26:43.470
Sara Gonzales: Also, once you've got that structure in place, much like with files, you can decide on a naming convention for folders.

178
00:26:44.640 --> 00:26:52.620
Sara Gonzales: It doesn't have to follow exactly the same one as files, but something similar is probably not a bad idea. You can incorporate the dates in the same format, just as I mentioned.

179
00:26:53.220 --> 00:27:01.020
Sara Gonzales: But whatever you choose, make sure to document that, most likely in a standard operating procedure, and store that in a place where the team has access to it.

180
00:27:02.040 --> 00:27:11.160
Sara Gonzales: I would recommend also filing everything. So we've probably all seen the phenomenon where you have an existing folder structure on your drive, but for whatever reason there's

181
00:27:11.790 --> 00:27:16.920
Sara Gonzales: Some files that don't fit anywhere, they haven't been filed away. Sometimes this happens just because we're in a hurry.

182
00:27:17.250 --> 00:27:25.020
Sara Gonzales: But other times it might be because you wanted to file something and try to, but actually found it didn't fit anywhere. So that's

183
00:27:25.950 --> 00:27:35.400
Sara Gonzales: That kind of brings up the question of the actual conceptual categories in your hierarchy. And unfortunately, if you've got a lot of stray files building up, that might be a sign

184
00:27:35.700 --> 00:27:42.480
Sara Gonzales: That you might need to examine that entire hierarchical structure again, and maybe insert a new category somewhere so the stray files can fit.

185
00:27:43.500 --> 00:27:50.940
Sara Gonzales: So that can be a pain to reorganize these hierarchies. But if you've got lots of things that don't fit, then really, it needs to be done.

186
00:27:51.720 --> 00:28:01.470
Sara Gonzales: But finally, even as I advocate filing everything, there is one file that can sit on its own outside the highest level of the hierarchy. And this is a README file.

187
00:28:02.280 --> 00:28:05.880
Sara Gonzales: You might have seen README files in a context like GitHub, or other sites like that,

188
00:28:06.360 --> 00:28:13.200
Sara Gonzales: Where someone who's just getting used to software or to a project can read this file to learn more about it, and then they'll know how to move forward.

189
00:28:13.560 --> 00:28:23.100
Sara Gonzales: So you can do the same thing to explain your folder organization. The README file can just explain how things are organized, maybe when things were organized or when changes were made.

190
00:28:23.640 --> 00:28:31.890
Sara Gonzales: It can even follow the example of an SOP or Standard Operating Procedure, if you like. So it's got a date so you know when it was last updated,

191
00:28:32.640 --> 00:28:42.840
Sara Gonzales: The person to report to or ask questions of, all these things can be helpful in a README. But really, the point is just to explain the organization; especially helpful for newcomers to the team.
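
NOTE
A minimal sketch of the "file everything, except the README" idea, assuming a hypothetical project layout. The function name find_stray_files, the folder name, and the file names are illustrations only.
```python
from pathlib import Path
import tempfile
def find_stray_files(project_root, allowed=("README.txt", "README.md")):
    """Return top-level files that are not filed in any folder, ignoring the README."""
    root = Path(project_root)
    return sorted(p.name for p in root.iterdir() if p.is_file() and p.name not in allowed)
# Demo on a throwaway directory with a hypothetical layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "01_RawData").mkdir()
    (root / "01_RawData" / "20210115_StressHealth_Survey1_00123_SG_verbal.csv").touch()
    (root / "README.txt").touch()
    (root / "notes_from_meeting.docx").touch()
    strays = find_stray_files(root)
print(strays)
```
A growing list of strays from a periodic check like this could be the signal to revisit the folder categories.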

192
00:28:44.460 --> 00:28:51.510
Sara Gonzales: Okay. And another recommendation then in the best practices for data management area is to think about your file formats.

193
00:28:52.020 --> 00:28:57.390
Sara Gonzales: So the best practice is to choose a file format that will ensure the longest term access to your files possible.

194
00:28:57.900 --> 00:29:04.470
Sara Gonzales: So what this means in a practical context is to try and choose open, non-proprietary formats over proprietary ones wherever possible.

195
00:29:05.160 --> 00:29:16.020
Sara Gonzales: This is hard to do; of course, a lot of us work with proprietary formats. But as you begin to think about the longer life of your data files, or if you might have something in a repository, or for long term storage at your institution,

196
00:29:16.860 --> 00:29:25.080
Sara Gonzales: Maybe before that step, think about whether you can convert your file format to something that's non-proprietary, and that will just have greater longevity.

197
00:29:25.950 --> 00:29:30.150
Sara Gonzales: These non-proprietary formats tend to last longer because they have a wider buy-in.

198
00:29:30.630 --> 00:29:43.680
Sara Gonzales: They have a distributed developer community, usually across the globe. They're usually openly documented. So that just makes them a little bit more likely to be around for a long time than proprietary formats. And the kind of most obvious example of this I can point out,

199
00:29:45.480 --> 00:29:53.130
Sara Gonzales: I'm sorry, to move over to another topic, is to choose lossless over lossy files. So in this kind of same realm,

200
00:29:53.580 --> 00:30:00.210
Sara Gonzales: The example of this I can point out that I think we've all seen is the difference between a TIFF and a JPEG in digital images.

201
00:30:00.810 --> 00:30:10.410
Sara Gonzales: So the JPEG is lossy because its compression discards some of the image information, and it loses a bit more every time the file is edited and re-saved.

202
00:30:10.830 --> 00:30:18.510
Sara Gonzales: Whereas the TIFF is lossless; it doesn't lose those bits because it's stored either uncompressed or with lossless compression. The trade-off is that it also makes it a much larger file.

203
00:30:19.200 --> 00:30:24.420
Sara Gonzales: So with the difficulties inherent in this, that kind of hopefully arms you with a bit of information to think about

204
00:30:24.960 --> 00:30:39.180
Sara Gonzales: Can you use a non-proprietary format, can you choose a lossless one? Do you have the storage? Is it necessary for long term access? So these are all important things to keep in mind as you're choosing your file formats and especially for longer term preservation.
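
NOTE
A minimal sketch of the lossless versus lossy distinction, using only the Python standard library. The sample values are hypothetical, and the rounding step is a simplified stand-in for lossy compression, not the actual JPEG algorithm.
```python
import zlib
# Hypothetical 8-bit sample values standing in for image pixels.
original = bytes([12, 13, 13, 14, 200, 201, 203, 90, 91, 91])
# Lossless: compress, then decompress, and every byte comes back unchanged.
restored = zlib.decompress(zlib.compress(original))
# Lossy (simplified stand-in): round each value down to a multiple of 16, discarding fine detail.
quantized = bytes((value // 16) * 16 for value in original)
print(restored == original)
print(quantized == original)
print(len(set(original)), len(set(quantized)))
```
The rounded version has fewer distinct values, which is why it can be stored more compactly, and why the discarded detail cannot be recovered afterwards.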

205
00:30:41.280 --> 00:30:48.750
Sara Gonzales: Okay, so let's move now into another topic in the kind of data management umbrella, which is metadata.

206
00:30:50.580 --> 00:30:58.050
Sara Gonzales: Metadata: we've all probably heard the common definition. It's documentation that describes data, or it's data about data.

207
00:30:58.890 --> 00:31:03.330
Sara Gonzales: And what it really means in practice is that it's just describing and documenting your data,

208
00:31:03.810 --> 00:31:18.540
Sara Gonzales: So that you understand really important details of the work. And this could be yourself, if you need to look into it a little bit later, a couple years after finishing the project say, or this could even be descriptions and documentation that helps a future user understand the files.

209
00:31:19.650 --> 00:31:24.960
Sara Gonzales: So maybe someone who's downloading them from a repository, or in that kind of context, or a collaborator with whom you're sharing data.

210
00:31:25.590 --> 00:31:29.670
Sara Gonzales: So really this metadata is just kind of a formalized outline and definition

211
00:31:29.970 --> 00:31:42.450
Sara Gonzales: Of the data descriptors that you might be using to talk about your data on a day-to-day basis. Your kind of keywords, your descriptions, things that might go into an abstract if you publish on this data. So it's just formalizing those kind of descriptions.

212
00:31:44.370 --> 00:31:54.270
Sara Gonzales: And there's a couple of different senses, too, in which we can think about metadata. We're certainly going to talk about metadata in the sense of the Data Deposit, so if you might put your data into a repository.

213
00:31:54.720 --> 00:32:04.740
Sara Gonzales: But actually metadata is something that we work with from the very beginning of a project, even in our very earliest data collection files. And metadata in that sense

214
00:32:05.430 --> 00:32:13.650
Sara Gonzales: Kind of takes on this idea of tidy data. So you might have heard about this, and if you're good at data wrangling in R and in situations like that,

215
00:32:14.160 --> 00:32:20.280
Sara Gonzales: There's a whole slew of kind of training opportunities you could take advantage of that will help you to tidy up your data.

216
00:32:20.760 --> 00:32:28.560
Sara Gonzales: But there's a lot of things we can do, even in just setting up project spreadsheets, that can make our data tidier right from the beginning, and hopefully maybe

217
00:32:28.920 --> 00:32:39.630
Sara Gonzales: kind of help us avoid those problems where we have to tidy up data later on. So how can we make data tidy from the beginning? So what are the things to do, versus things not to do?

218
00:32:40.230 --> 00:32:45.600
Sara Gonzales: What I'm showing here is an example of very untidy data. So these are all the things not to do.

219
00:32:46.320 --> 00:32:55.110
Sara Gonzales: What we can see in this is a good example of maybe almost half a dozen ways of managing data pretty badly right from the collection stage.

220
00:32:55.560 --> 00:33:02.280
Sara Gonzales: So firstly, we're looking at a single worksheet in an Excel spreadsheet, and this one worksheet contains eight data tables.

221
00:33:02.760 --> 00:33:13.710
Sara Gonzales: So to the extent that you can, it's best to avoid doing that. Every single data table, every kind of discrete spreadsheet that you're collecting, should exist in its own worksheet, or even separate files.

222
00:33:14.580 --> 00:33:22.140
Sara Gonzales: Another thing we see here is that we've got these kind of calculations, a little bit of math and things going on, right next to all the tables.

223
00:33:22.620 --> 00:33:29.160
Sara Gonzales: And that also is something to be avoided. So calculations, formulas, even these kind of beginning data analyses,

224
00:33:29.550 --> 00:33:40.080
Sara Gonzales: They should not live in the same file as your raw data files. You might have analysis files in which you'll do something like this. But to the extent that you can, it's best to leave that raw data raw.

225
00:33:40.410 --> 00:33:49.110
Sara Gonzales: So don't include those formulas or calculations right with that data file. In addition, here, there's a couple of irregularities that you might see,

226
00:33:49.590 --> 00:33:54.870
Sara Gonzales: It's a little hard to see because these columns are somewhat thin, but you might notice that the actual

227
00:33:55.800 --> 00:34:06.780
Sara Gonzales: column headers, here, the names of the columns in data collection are not always consistently named. So here B3, this cell, is missing the word "plot" that seems to appear in all the others.

228
00:34:07.260 --> 00:34:15.900
Sara Gonzales: We've got "Bug1" and "Bug2," but sometimes "Bug1" is just called "Bug." So there's a couple of ways here that you can see these headers are very inconsistent.

229
00:34:16.440 --> 00:34:26.040
Sara Gonzales: That's also something to avoid as you're setting up your data files. Try to always name those headers consistently, and name them according to a convention, even, very much like a file name. So

230
00:34:26.520 --> 00:34:31.650
Sara Gonzales: don't use two different ways to refer to the same concept like "Bug" and "Bug1"; always name them consistently.
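
NOTE
A minimal sketch of checking column headers against a convention. The pattern (a capitalized word followed by a number, such as "Bug1") and the header list are assumptions for illustration; a team would substitute its own documented convention.
```python
import re
# Hypothetical convention: a capitalized name followed by a number, e.g. "Bug1" or "Plot2".
HEADER_PATTERN = re.compile(r"[A-Z][a-z]+[0-9]+")
def nonconforming_headers(headers):
    """Return the headers that do not match the agreed pattern."""
    return [h for h in headers if not HEADER_PATTERN.fullmatch(h)]
headers = ["Plot1", "Bug1", "Bug2", "Plot2", "Bug", "Bug2"]
problems = nonconforming_headers(headers)
print(problems)
```
Running a check like this when a spreadsheet is first set up would flag a header such as "Bug" before any data is collected under it.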

231
00:34:32.820 --> 00:34:42.030
Sara Gonzales: Additionally, we see here a lot of instances, as people were kind of observing for these bugs, a lot of instances of zero come up. And also we've got this "control."

232
00:34:42.840 --> 00:35:00.000
Sara Gonzales: So there are things about that which send up a red flag, that make us wonder whether the people who were collecting this data were using something consistent to represent the null value, that is, when an instance was not observed at all, when there was no reading taken.

233
00:35:01.140 --> 00:35:08.790
Sara Gonzales: So null is different from zero. Zero would mean I went out looking for bugs on this day and didn't see any, versus null, which means that I didn't take any readings at all.

234
00:35:09.300 --> 00:35:14.910
Sara Gonzales: So it's very important to differentiate those two things, and to always express null in a consistent way.

235
00:35:15.570 --> 00:35:27.150
Sara Gonzales: Whether you choose the word "null" or N/A, or to choose a value that would never actually occur in a real life scenario like -999. These are all valid, but just make sure you choose one and consistently apply it.
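
NOTE
A minimal sketch of why null and zero need to stay distinct, assuming "NA" as the one agreed-upon null marker and a hypothetical four-day bug count. The column names and values are illustrations only.
```python
import csv
import io
NULL_TOKEN = "NA"  # the one agreed-upon marker for "no reading taken"
# Hypothetical observations: day 2 had a survey that found no bugs, day 3 had no survey.
raw = "day,bug_count\n1,4\n2,0\n3,NA\n4,2\n"
def parse_count(text):
    """Map the null token to None and everything else to an integer."""
    return None if text == NULL_TOKEN else int(text)
counts = [parse_count(row["bug_count"]) for row in csv.DictReader(io.StringIO(raw))]
# Average over the days a reading was actually taken.
observed = [c for c in counts if c is not None]
mean_observed = sum(observed) / len(observed)
# What the average would become if the missing reading were wrongly recorded as zero.
mean_if_null_were_zero = sum(c or 0 for c in counts) / len(counts)
print(counts, mean_observed, mean_if_null_were_zero)
```
The two averages differ, which is the practical cost of treating "no reading" as "no bugs".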

236
00:35:28.170 --> 00:35:44.940
Sara Gonzales: Also, one small thing that might be hard to spot in this, is that on line 25 here we've got this small highlight. And because we have no additional documentation with this spreadsheet, we actually don't know if that highlight is supposed to convey information to us as readers of this file.

237
00:35:46.260 --> 00:35:51.840
Sara Gonzales: Does it mean something important happened to this data? Was it just a stray keystroke that highlighted something for no reason?

238
00:35:52.380 --> 00:36:03.660
Sara Gonzales: We really don't know. And what that really helps us point to is the fact that any kind of highlighting, conditional formatting, you know, making the different cells change color based on some kind of context,

239
00:36:04.350 --> 00:36:12.060
Sara Gonzales: That should never be used in a raw data file. Simply because, kind of, much like the formulas and calculations, that can just be confusing to new users,

240
00:36:12.390 --> 00:36:22.380
Sara Gonzales: It could also be corrupted as the file is opened and closed many times, or as different users check out this file and maybe might accidentally take that formatting or highlighting out and not realize it was important.

241
00:36:22.800 --> 00:36:28.440
Sara Gonzales: So it's very, very important to not use highlighting or conditional formatting in your raw data files.

242
00:36:29.640 --> 00:36:29.850
Sara Gonzales: Okay.

243
00:36:31.110 --> 00:36:37.290
Sara Gonzales: So let's, then, take a look at, if those are all the things to avoid in setting up a tidy data spreadsheet,

244
00:36:37.800 --> 00:36:45.600
Sara Gonzales: how can we do it in a more tidy way? So here I've got an example of a tidily set up spreadsheet, just below. So as you can see, there's,

245
00:36:46.110 --> 00:36:59.490
Sara Gonzales: It's avoiding the problems we saw in the last one. We've got "NA" or "null" used very consistently. We've got column headers, which are really very specifically named, and in fact kind of split up in multiple elements, much like a well-named file.

246
00:37:00.660 --> 00:37:06.690
Sara Gonzales: So here, really we've got a lot of good practice recommendations going on. So here are the best practices, just kind of outlined above.

247
00:37:07.320 --> 00:37:21.630
Sara Gonzales: Whenever we're getting ready to set up a spreadsheet, we should always remember how spreadsheet programs are kind of natively setting these things up, which is that the columns represent the variables and the rows represent the instances of observation.

248
00:37:22.770 --> 00:37:30.240
Sara Gonzales: Also, those column headers, make sure they're named consistently and also document them well. So even though this is a very consistent and clear naming they're using,

249
00:37:30.690 --> 00:37:38.010
Sara Gonzales: because I'm not familiar with their terminology, I have no idea what that actually refers to. And this is the type of thing that can be documented in a data dictionary.

250
00:37:38.580 --> 00:37:48.000
Sara Gonzales: So that data dictionary should contain a few key elements: it should have the kind of code name of your variable, just like this. It should also have the human readable name and a definition.

251
00:37:48.600 --> 00:38:00.060
Sara Gonzales: So this can get to be a very long document, but it's really important to have, because it's the best way to keep track of, you know, studies that can have thousands of variables, just to make sure they're well defined and accessible to all.
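
NOTE
A minimal sketch of a data dictionary with the three elements mentioned above (code name, human-readable name, definition), plus a check for undocumented columns. The class name VariableEntry, the helper undocumented_columns, and all entries are hypothetical.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class VariableEntry:
    code_name: str       # the column header as it appears in the data file
    readable_name: str   # the human-readable name
    definition: str      # what the variable means and how it was measured
# Hypothetical entries; a real study would list every column.
DATA_DICTIONARY = [
    VariableEntry("bug_count", "Bug count", "Number of insects observed in the plot during one visit."),
    VariableEntry("plot_id", "Plot identifier", "Code for the field plot where the visit took place."),
]
def undocumented_columns(headers, dictionary):
    """Return column headers that have no data dictionary entry."""
    known = {entry.code_name for entry in dictionary}
    return [h for h in headers if h not in known]
missing = undocumented_columns(["plot_id", "bug_count", "visit_date"], DATA_DICTIONARY)
print(missing)
```
Keeping the dictionary in a structured form like this makes it possible to check a spreadsheet's headers against it automatically.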

252
00:38:01.230 --> 00:38:10.590
Sara Gonzales: Also, as you can see in this good example here, we're not storing multiple tables in one worksheet. We're not storing any graphs, visualizations, calculations,

253
00:38:11.430 --> 00:38:18.870
Sara Gonzales: functions, none of those are directly in the spreadsheet as well. We're not using a cell or column earmarked for one purpose for a different purpose.

254
00:38:19.380 --> 00:38:26.730
Sara Gonzales: So we're not kind of falling into the temptation to combine up variables into one column; everything that can be discretely separated out, at all, is.

255
00:38:27.960 --> 00:38:41.790
Sara Gonzales: And we're not using that formatting, highlights, the "null" value is well defined, and as I recommended, hopefully the creators of the spreadsheet have a good data dictionary. So those are the best practices for tidier data in spreadsheets.

256
00:38:43.530 --> 00:38:48.630
Sara Gonzales: But now let's go to the other side of the coin of metadata, which is thinking about metadata for preservation and access.

257
00:38:49.020 --> 00:38:57.990
Sara Gonzales: So again, thinking about those descriptors we're using for data, and how we can apply them in very standardized ways to make data more findable and accessible in the future.

258
00:38:59.130 --> 00:39:05.760
Sara Gonzales: Maybe through a context like what you see here, like through DigitalHub, which is the institutional repository of Feinberg School of Medicine.

259
00:39:07.080 --> 00:39:18.300
Sara Gonzales: So to kind of get into the mind frame of thinking about this metadata for preservation and access, we can kind of think about filing and what it means to really tag a digital object with information.

260
00:39:18.780 --> 00:39:25.830
Sara Gonzales: So the way information organization used to work is that you would take any one information object, be it a paper,

261
00:39:27.000 --> 00:39:41.430
Sara Gonzales: a printed out spreadsheet, etc., you'd have to decide the one thing that it's most "about" and then file it in a place that's designated for that subject. Whereas when we're working with our digital files, this kind of tagging metadata functions completely differently.

262
00:39:42.600 --> 00:39:48.300
Sara Gonzales: As we know, we can tag a digital object with really as many tags, hashtags, anything that you might like.

263
00:39:49.080 --> 00:39:55.050
Sara Gonzales: It's just important to remember that the more descriptors we use, sometimes the more that dilutes the findability of that item.

264
00:39:55.800 --> 00:40:04.320
Sara Gonzales: Because if we tag it with things that our resource is only tangentially about, then it gets brought up in searches where it might not be so relevant.

265
00:40:04.770 --> 00:40:09.780
Sara Gonzales: So it is important with subject tags, especially, to tag your item with the thing that it's most "about."

266
00:40:10.470 --> 00:40:16.380
Sara Gonzales: But even more important than that, when we think about metadata, is that we want the things we tag our digital objects with

267
00:40:16.800 --> 00:40:25.320
Sara Gonzales: to be very standardized, and this helps us to find them in the future. And that might be subject terms, so we might turn to an online vocabulary to help us choose those terms.

268
00:40:25.770 --> 00:40:36.900
Sara Gonzales: And it could also be the way we express an organization's name, or even our own name. These are all things that we can make more standardized so it helps to bring together everything tagged in a similar way in future searches.

269
00:40:37.620 --> 00:40:43.290
Sara Gonzales: So what I mean by standardizing our own names is actually using something like this, which is the ORCID iD.

270
00:40:43.920 --> 00:40:50.280
Sara Gonzales: So you might have heard about this. We actually all have access to make a free ORCID account through Northwestern. And this is the link.

271
00:40:50.760 --> 00:41:00.060
Sara Gonzales: And what this is is a kind of researcher ID that will help differentiate you from any other researcher in the world; especially helpful if you might have a similar name to other researchers.

272
00:41:00.570 --> 00:41:06.990
Sara Gonzales: So here you get a free account, you'll be issued your individual ORCID iD that you can append to all your research outputs,

273
00:41:07.290 --> 00:41:22.590
Sara Gonzales: and in the future, in these kinds of repository contexts, this helps everything to which you've contributed be brought together that much more easily in a search. Also what's nice about this is that even if you leave the institution, your ORCID iD remains yours for life.
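
NOTE
A minimal sketch of what makes an ORCID iD a standardized identifier: it has a fixed 16-character layout whose last character is an ISO 7064 MOD 11-2 check character. This goes slightly beyond what the talk covers; the function names are hypothetical.
```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)
def is_valid_orcid(orcid):
    """Check the 0000-0000-0000-000X layout and the final check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact.isascii() or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()
# 0000-0002-1825-0097 is the fictional example iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```
A check like this catches typos when iDs are entered by hand into a repository record; it does not confirm that the iD belongs to a particular person.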

274
00:41:24.120 --> 00:41:28.950
Sara Gonzales: And then here, in the kind of standardization area for subject terms,

275
00:41:29.940 --> 00:41:36.840
Sara Gonzales: this is the kind of thing I'm referring to, like how we can make the subject terms that we tag our outputs with more standardized.

276
00:41:37.260 --> 00:41:42.930
Sara Gonzales: So you're probably familiar with MeSH or Medical Subject Headings, we see these in PubMed, so

277
00:41:43.710 --> 00:41:53.190
Sara Gonzales: catalogers who work with PubMed might automatically assign these to your papers, but you can feel free to check these out as well. And if you decide to put something into an institutional repository,

278
00:41:53.670 --> 00:42:03.630
Sara Gonzales: perhaps conference objects, your posters, your presentations, any grey literature, you can tag these yourself with Medical Subject Heading terms, if you

279
00:42:04.080 --> 00:42:11.100
Sara Gonzales: would like to search them from this website. So that can be really helpful just in standardizing the way you describe your resources.

280
00:42:11.700 --> 00:42:19.350
Sara Gonzales: If the Medical Subject Headings are not necessarily what you're looking for, there's actually a host of controlled vocabularies and thesauri that you can choose from.

281
00:42:19.920 --> 00:42:31.230
Sara Gonzales: Library of Congress Subject Headings is very useful. It's the generalist kind of list of subjects, and there's also some that are very specific to certain fields, so you see a couple of examples of those here.

282
00:42:31.890 --> 00:42:48.150
Sara Gonzales: You can also check out the website FAIRsharing.org. It actually has a slew of metadata schemas that you can check out, some of them, you know, extremely attuned to very specific subjects. So if you're looking for a controlled list of terminology, that can be a really good place to start.

283
00:42:50.310 --> 00:42:59.670
Sara Gonzales: Okay, and kind of our final thing I'll cover here in the realm of research data management, before we move on to sharing, is a data management plan. And this actually is kind of our good gateway

284
00:43:00.660 --> 00:43:06.930
Sara Gonzales: because even though it talks a bit about data management, kind of the reason we would do one is to kind of think ahead to data sharing.

285
00:43:07.560 --> 00:43:14.100
Sara Gonzales: So you might have seen that, in certain proposals, funders increasingly have

286
00:43:14.880 --> 00:43:20.130
Sara Gonzales: enforced requirements for having a data management plan submitted along with your grant application.

287
00:43:20.700 --> 00:43:27.330
Sara Gonzales: So all this is, is usually a one or two page document that asks the researcher to outline what types of data they'll create,

288
00:43:28.050 --> 00:43:34.170
Sara Gonzales: the standards and metadata with which they'll describe that data, and then how this data is going to be accessed and shared in the future.

289
00:43:34.530 --> 00:43:39.900
Sara Gonzales: Can it be reused? Can it be given a kind of Creative Commons License or something that allows

290
00:43:40.470 --> 00:43:48.540
Sara Gonzales: future users to know what they can do with the data? Has it been de-identified? Is it safe to be reused, and is it being archived for the long term anywhere?

291
00:43:49.110 --> 00:43:59.610
Sara Gonzales: So those are kind of the basic bones of any data management plan, but also if you do need to write one and it's for a specific funder, you can check out something called the DMPTool.

292
00:44:00.150 --> 00:44:06.030
Sara Gonzales: This again is a great resource that we have access to through Northwestern. You can actually log in with your NetID.

293
00:44:06.480 --> 00:44:15.210
Sara Gonzales: And once you're in this system you can indicate the funder to which you're applying for funds, and sometimes specifically even the grant itself.

294
00:44:15.630 --> 00:44:22.380
Sara Gonzales: And the DMPTool will basically spin up a kind of template that will include the sections that that funder requires,

295
00:44:23.010 --> 00:44:29.640
Sara Gonzales: that have to be completed for their data management plan. So that's really handy to kind of use that tool and the guidance that gives

296
00:44:30.240 --> 00:44:33.390
Sara Gonzales: if you're kind of starting from scratch and writing a data management plan.

297
00:44:33.930 --> 00:44:44.610
Sara Gonzales: You can also check out the LibGuide that's referenced here for a little bit more information, and of course you can reach out to us at the Galter DataLab as well. I'm always happy to help you in constructing a data management plan for your application.

298
00:44:46.740 --> 00:44:56.280
Sara Gonzales: Okay, so now let's switch gears completely over to the data sharing portion of today's talk. So here we'll just kind of show a bit of the history and why we're concerned with

299
00:44:57.180 --> 00:45:06.270
Sara Gonzales: Data sharing. So as we have probably seen and become aware of in recent years, there are federal funder publication sharing requirements. So, a lot of us

300
00:45:06.720 --> 00:45:14.760
Sara Gonzales: Who are working on NIH grants do have to make sure that the papers that we're publishing, that are funded by those grants, that they get shared publicly,

301
00:45:15.030 --> 00:45:23.640
Sara Gonzales: Usually through PubMed Central. And we've got great librarians at Galter who usually help us with that process of making sure we're compliant with that requirement.

302
00:45:24.210 --> 00:45:31.440
Sara Gonzales: And actually lots of different federal funders have similar requirements. So you can actually check out something called the SPARC tool here,

303
00:45:31.830 --> 00:45:39.090
Sara Gonzales: Which is sparcopen.org, and through that you can search a couple of things, both article sharing requirements and data sharing requirements.

304
00:45:39.420 --> 00:45:49.200
Sara Gonzales: So do keep an eye on that, it's datasharing.sparcopen.org. So that lets you know, agency by agency, what the article sharing and what the data sharing requirements are.

305
00:45:50.280 --> 00:45:58.230
Sara Gonzales: But kind of a little bit more pertinent to what we're talking about today is really those data sharing requirements. And this has really grown out of a memo that came through

306
00:45:58.830 --> 00:46:08.460
Sara Gonzales: The White House Office of Science and Technology Policy back in 2013 saying that federal agencies had to create plans for increased public access to research data.

307
00:46:09.090 --> 00:46:16.800
Sara Gonzales: So that mandate went out to the agencies, and the agencies in turn, for the past seven or eight years, have had increasing requirements

308
00:46:17.190 --> 00:46:22.500
Sara Gonzales: for their grant funded projects, about how data should be accessible and how it should be shared.

309
00:46:23.250 --> 00:46:30.990
Sara Gonzales: And for the NIH, in particular, there's been a lot of recent work on this, and late last year a new data management and sharing policy was released.

310
00:46:31.380 --> 00:46:40.200
Sara Gonzales: So I want to make sure everyone's aware of that. It's not actually applicable to awards until just about two years from now, after January 25, 2023,

311
00:46:40.710 --> 00:46:46.110
Sara Gonzales: but once that happens, the requirements, kind of boiled down to just the minimum, are listed here.

312
00:46:46.620 --> 00:46:51.780
Sara Gonzales: So there is going to be that two-page data management and sharing plan that has to be submitted with your grant application.

313
00:46:52.380 --> 00:47:01.140
Sara Gonzales: And it really does encourage data sharing, to the extent that it's possible. And of course the NIH does recognize that there's going to be privacy restrictions in many cases.

314
00:47:01.470 --> 00:47:08.670
Sara Gonzales: But to the extent that data could be de-identified, or non-identifiable aspects of it can be shared, that's going to be increasingly required.

315
00:47:09.420 --> 00:47:19.830
Sara Gonzales: And also the data should be shared as soon as possible, ideally no later than the first date of publication of any publications resulting from that grant, or the end of the performance period.

316
00:47:20.730 --> 00:47:29.400
Sara Gonzales: So that's kind of the most important requirement coming up. And as I say, if you need any help with formulating that data management plan, we at the Galter library can help you with that.

317
00:47:30.270 --> 00:47:41.940
Sara Gonzales: So as I said, if you want to check out the requirements agency-by-agency in case your grant is not from the NIH, do check out that SPARC list; that's datasharing.sparcopen.org/data.

318
00:47:44.010 --> 00:47:54.360
Sara Gonzales: Okay, there's also data sharing requirements that come from our publishers. So these requirements, much like the federal agencies, they can actually be very different from publisher to publisher.

319
00:47:55.080 --> 00:48:04.020
Sara Gonzales: Publishers are publishing online their guidelines and requirements for data submissions. So I've just included a few here from some of the publishers that are known to

320
00:48:04.290 --> 00:48:17.280
Sara Gonzales: really request data along with publications. So you can read about each of those there as well. And also, MIT keeps up a good list of journal requirements for data sharing, if you'd like to check out that LibGuide as well.

321
00:48:18.300 --> 00:48:25.800
Sara Gonzales: And then to kind of give an example of what can be required in these various data sharing policies by publishers

322
00:48:26.640 --> 00:48:34.080
Sara Gonzales: I've got listed here from Springer Nature, their kind of level 1-4 listing of the different types of data sharing their various journals might require.

323
00:48:34.470 --> 00:48:46.020
Sara Gonzales: So it really runs the gamut, from dataset sharing to a public repository being encouraged, all the way to level four, where data sharing is absolutely required in all but the rarest cases.

324
00:48:46.620 --> 00:48:57.030
Sara Gonzales: And this might mean that a dataset citation is required, meaning that the dataset is probably required to be deposited to a repository that will provide that citation,

325
00:48:57.360 --> 00:49:01.800
Sara Gonzales: and also that will provide a DOI or Digital Object Identifier for that deposited data.

326
00:49:02.670 --> 00:49:08.640
Sara Gonzales: So all that information is kind of minted and produced from the repository, and that would need to be included in the paper.

327
00:49:09.270 --> 00:49:16.530
Sara Gonzales: Also, a Data Availability Statement is usually required, and a journal-approved or public repository might have to be used.

328
00:49:17.130 --> 00:49:29.820
Sara Gonzales: So as we're thinking about this, about seeing a future where you might need to actually deposit some version of your data into a repository, in our next few slides we'll talk a bit more about what that means and how we can effectively do that.

329
00:49:30.990 --> 00:49:38.880
Sara Gonzales: So here, these are actually some criteria from PLOS, and this publisher and others might kind of

330
00:49:39.810 --> 00:49:45.360
Sara Gonzales: let submitters know that data sharing is required, but they might not necessarily direct you

331
00:49:45.810 --> 00:49:52.980
Sara Gonzales: to exactly where they'd like you to share the data. But they might give you criteria like these to follow for how you should choose your repository.

332
00:49:53.490 --> 00:50:05.190
Sara Gonzales: So the reason I share PLOS's criteria is because they're actually extremely good criteria that would be great for anyone to follow if they're looking for that ideal repository in which to place data.

333
00:50:06.000 --> 00:50:11.790
Sara Gonzales: And specifically, if they're motivated to have this data be open access, kind of openly accessible.

334
00:50:12.420 --> 00:50:17.550
Sara Gonzales: So it can meet these criteria, such as the repository offers open access to all,

335
00:50:17.910 --> 00:50:22.350
Sara Gonzales: meaning you wouldn't necessarily have to register or sign up for an account, or certainly not pay,

336
00:50:22.680 --> 00:50:31.200
Sara Gonzales: but just would be able to go to this repository, find the Open Access records and download the files immediately. That's generally the definition of Open Access.

337
00:50:31.890 --> 00:50:37.140
Sara Gonzales: Also a good repository should assign that DOI, as I mentioned, the Digital Object Identifier.

338
00:50:37.560 --> 00:50:47.790
Sara Gonzales: And this is a permanent identifier that lives with the digital object for as long as it's on the web. So it's always findable. So this digital object can basically never be lost, and it can always be found and cited.

339
00:50:48.960 --> 00:50:56.460
Sara Gonzales: Also a good repository should allow for the data to be made available under licenses, and ideally Creative Commons Licenses.

340
00:50:57.270 --> 00:51:03.150
Sara Gonzales: And the concept of licenses can kind of be a lot to absorb, especially if you're kind of new to depositing data.

341
00:51:03.600 --> 00:51:12.390
Sara Gonzales: But the purpose of them is to let the secondary user, the next person who comes along, know what they can and can't do with that deposited dataset.

342
00:51:13.200 --> 00:51:18.180
Sara Gonzales: So this means that you could say that a deposited dataset could not be used for a commercial purpose,

343
00:51:19.140 --> 00:51:25.200
Sara Gonzales: or it can be reused with certain restrictions, or it can be reused with no restrictions as long as the original creator is cited.

344
00:51:25.590 --> 00:51:40.170
Sara Gonzales: All these kind of things, these are the options that the CC licenses offer. So if you take a look at the Creative Commons online, there's some helpful resources there about how you can choose the license you want, and that would be best to serve your deposited material.

345
00:51:41.250 --> 00:51:55.680
Sara Gonzales: Also a good repository should have a long term management plan for its own data. So if you do a bit of research into who's maintaining the repository, the kind of likelihood of its longevity, then you'll know that the data deposited there will be safe for the future.

346
00:51:56.700 --> 00:52:06.780
Sara Gonzales: The repository that you use should have some kind of acceptance within your research community. So hopefully through word of mouth or from asking around you might find what are the repositories

347
00:52:07.110 --> 00:52:13.440
Sara Gonzales: That are being most used within your discipline. And then finally, I mentioned before, the website FAIRsharing.org;

348
00:52:14.010 --> 00:52:19.470
Sara Gonzales: this is a place where you can find those metadata standards listed that I mentioned, but also

349
00:52:20.010 --> 00:52:34.590
Sara Gonzales: FAIRsharing.org has listings of repositories in various disciplines as well, so if local solutions aren't working or generalist ones are not appropriate, either, you can see if there's an existing repository for your discipline listed at FAIRsharing.org.

350
00:52:37.140 --> 00:52:47.790
Sara Gonzales: Okay, so in the kind of data sharing area here, as we kind of wrap up today's talk, we can consider a bit or you can kind of check out at your leisure, these

351
00:52:48.570 --> 00:52:56.760
Sara Gonzales: citations we have here for the benefits of data sharing. So there's again reasons why you might have to do it, it might be required by a funder.

352
00:52:57.090 --> 00:53:05.310
Sara Gonzales: But there's altruistic reasons to share data as well. So these kind of larger studies, bringing together lots of different data sets from different cohorts,

353
00:53:06.240 --> 00:53:18.030
Sara Gonzales: are really helping us to make breakthroughs in scientific research. So, if you like, you can check out some of this information on your own about the breakthroughs that are being enabled by big data and data sharing.

354
00:53:19.740 --> 00:53:26.460
Sara Gonzales: But kind of on a more local level here, there's other contexts in which we can think about the benefits of sharing data.

355
00:53:26.970 --> 00:53:31.770
Sara Gonzales: So we should remember, first of all, that not only can we be data sharers, but we can be data harvesters.

356
00:53:32.370 --> 00:53:35.970
Sara Gonzales: So we as researchers can benefit from data sets that have already been shared.

357
00:53:36.480 --> 00:53:42.720
Sara Gonzales: So listed in the bottom half of this slide, you can see some examples of places where you can see nationally sponsored data sets

358
00:53:43.170 --> 00:53:52.650
Sara Gonzales: that are already available online. So you may have to create an account and register for some of these services, but once you do, you have access to some of these kind of shared

359
00:53:53.220 --> 00:54:02.700
Sara Gonzales: studies and data that can be put together into synthetic cohorts, or just kind of other resources to help you do big data studies of your own.

360
00:54:03.150 --> 00:54:09.720
Sara Gonzales: So these studies already exist. They're already published and it's data that we all have access to. So it's really great to take advantage of that.

361
00:54:10.650 --> 00:54:27.630
Sara Gonzales: The benefits: again, there's a kind of affordability aspect in that many of these data sets are federally sponsored and free, and also maybe a secondary data analysis or synthetic cohort project, or other related things, could be a good first research project for a graduate student.

362
00:54:28.800 --> 00:54:37.950
Sara Gonzales: Also, a lot of these datasets are actually longitudinal. So what you can do here is study changes over time or study changes occurring to large populations.

363
00:54:38.490 --> 00:54:52.410
Sara Gonzales: So, lots of possibilities there with these big government-funded datasets. And again, really, we can just leverage those federally-funded clinical trial studies and data and get the benefit out of the data that we've supported.

364
00:54:54.570 --> 00:55:04.320
Sara Gonzales: So the other side of the coin, then, aside from being a harvester, is the sharing "win." So this is actually the benefits that accrue back to you from sharing your data.

365
00:55:05.130 --> 00:55:14.220
Sara Gonzales: So there was a study actually released, and it's getting to be quite a few years old now, showing that there was a statistically significant association with an increase in citations

366
00:55:14.580 --> 00:55:25.830
Sara Gonzales: for people who have shared data or made their data publicly available. So it's a kind of an immediate increase in visibility and it helps also to reinforce

367
00:55:26.400 --> 00:55:32.220
Sara Gonzales: the reproducibility of various studies if you actually make your data publicly available.

368
00:55:32.670 --> 00:55:38.910
Sara Gonzales: So knowing that that benefit might accrue to you, that's just kind of another motivator, or a good reason to share data, if you're able to.

369
00:55:39.780 --> 00:55:46.260
Sara Gonzales: And then knowing that that can happen, that kind of citation increase or the kind of bigger footprint you might have on the web from sharing data,

370
00:55:46.830 --> 00:55:54.690
Sara Gonzales: you might think about the capabilities that repositories can offer you to see that impact in action. So what I've got listed here

371
00:55:55.140 --> 00:56:06.210
Sara Gonzales: are various repositories, some of them actually Northwestern-based, that will show you the impact of both your publication and data deposits in real time. And will show you

372
00:56:06.690 --> 00:56:13.230
Sara Gonzales: The metrics on that. So, all these repositories will show you some kind of mixture of the following:

373
00:56:13.740 --> 00:56:23.310
Sara Gonzales: the views of the resources, the number of downloads, the number of citations, and also, increasingly, things like altmetrics. So we've got the social media hits,

374
00:56:23.730 --> 00:56:35.700
Sara Gonzales: that you might see from your resources, the news impact, that type of thing. So all these various repositories offer that in some shape or form. So this is a good thing to know about, and a good way to keep track of your impact.

375
00:56:37.680 --> 00:56:44.520
Sara Gonzales: Okay, here, when this class is offered in person, I might take a break at this point to offer a data repository exercise.

376
00:56:44.880 --> 00:56:54.540
Sara Gonzales: And you can certainly feel free to experiment with this on your own if you've got time. So here I've got examples of even more repositories that you can check out

377
00:56:54.930 --> 00:57:08.970
Sara Gonzales: that are all a little bit different from each other. So some, as you might recognize, are those repositories that hold nationally-funded datasets. Some are generalist, such as Zenodo and Figshare, they'll really accept any kind of research from all over the globe.

378
00:57:10.140 --> 00:57:16.620
Sara Gonzales: And then we've got additional ones, especially local ones here at Northwestern. So these are all different resources you can check out

379
00:57:16.920 --> 00:57:26.700
Sara Gonzales: that will show you deposited datasets and other types of research outputs. So what I would just emphasize in the exercise is that you can do searches in these various

380
00:57:27.270 --> 00:57:37.380
Sara Gonzales: repositories, you can try making a sample record. But the point of that will be to see what you like and don't like about the interfaces, what's helpful to you as a user.

381
00:57:38.130 --> 00:57:48.300
Sara Gonzales: So if you find a dataset, can you tell right away: Can you cite it? Can you reuse it with any restrictions? Can you download it immediately or not? Do you have to apply to or email the original owner?

382
00:57:48.870 --> 00:57:56.130
Sara Gonzales: That type of thing. So as you work with the interfaces of these various repositories, you'll see that they do all differ in subtle ways.

383
00:57:57.060 --> 00:58:07.860
Sara Gonzales: I don't really recommend any one over another, but it's just important to be aware of those differences, especially if you might be a user of repositories in the future, either as a depositor or a searcher.

384
00:58:09.780 --> 00:58:22.170
Sara Gonzales: Also I'll mention as well, as I said right at the top, that in the realm of repositories here at FSM we're working to do an upgrade to DigitalHub, and that is our local institutional repository.

385
00:58:22.860 --> 00:58:30.990
Sara Gonzales: And it'll be similar in some respects. It will do a lot of the same functions, such as minting that Digital Object Identifier for each deposit,

386
00:58:31.740 --> 00:58:40.860
Sara Gonzales: encouraging the users to provide robust metadata to describe records, and also offering licenses so secondary users know what they can and can't do with the data.

387
00:58:42.390 --> 00:58:47.580
Sara Gonzales: And so it's going to be built on a newer system, though, that's going to be a bit more streamlined and a bit easier to use.

388
00:58:47.970 --> 00:58:55.770
Sara Gonzales: It's actually Python-based, as opposed to a kind of polyglot software base we had for the older one, and this new product is called InvenioRDM.

389
00:58:56.160 --> 00:59:06.870
Sara Gonzales: And you can check out the website there, if you're interested in it. It's actually very similar to the Zenodo.org software stack. So if you're familiar with that, it'll actually be somewhat similar.

390
00:59:08.730 --> 00:59:16.020
Sara Gonzales: And that is the end of today's presentation. It's a lot of information to take in, and I would definitely encourage you to check out some of these links

391
00:59:16.350 --> 00:59:25.920
Sara Gonzales: for some more specific information about some of the many topics that we looked into today. So everything from data management plans, to the various repositories,

392
00:59:26.910 --> 00:59:41.490
Sara Gonzales: to our local library websites, to what you see listed here: the policies and procedures for researchers, so everything from IRB, to retention policies, to data and backup procedures and policies.

393
00:59:42.510 --> 00:59:52.410
Sara Gonzales: And then finally, here's a little bit more on confidentiality and data security. So I did not get into this in this presentation because it's actually a huge topic all on its own.

394
00:59:53.250 --> 01:00:03.000
Sara Gonzales: When we talk about de-identifying data, if you would like to share clinical data at some point, data security actually falls within the realm of FSMIT.

395
01:00:03.930 --> 01:00:16.620
Sara Gonzales: Individual researchers need to make their own plans to de-identify data, but once they've got a plan, they can run that past FSMIT to see if that will be sufficient to de-identify datasets prior to deposit.

396
01:00:17.220 --> 01:00:25.590
Sara Gonzales: So just to kind of get the ball rolling, and if you've got preliminary questions about that that you would like to ask of FSMIT, all the links are here.

397
01:00:28.170 --> 01:00:33.330
Sara Gonzales: Okay, so here, just the final couple of slides. This is my contact information.

398
01:00:33.780 --> 01:00:42.690
Sara Gonzales: And as I say, I am available through the Galter DataLab as well. And there's many other classes through DataLab, some of which break down in more detail some of the things that we talked about, kind of,

399
01:00:43.320 --> 01:00:49.980
Sara Gonzales: at the survey level today. So please do look into any of those additional classes that might help out with more detailed information.

400
01:00:51.000 --> 01:00:57.150
Sara Gonzales: And thank you very much. This class has been partially funded by the NLM and we acknowledge and thank them for their support.