Another week, another Twitter discussion. This one starts with a positive story of how open data can benefit the research process. Dr Werner goes from asking about relevant papers to getting an open data set that allowed them to answer a question they were working on.

For me, this is how science should work. We all have our own projects and ideas of how we should collect data, but before you start that new project, what if someone else has a data set that allows you to answer that question? It might contain some variables they did not explore in the original publication, or you might want to approach the analysis in a different way.

This provides the opportunity to efficiently answer your research question without having to apply for funding, get ethical clearance, and collect the data. You can start your project a step or two further than if you had to perform an intermediary study.

However, as open data becomes more prevalent and there are more opportunities to reanalyse data from another researcher, this presents a problem it seems we do not yet have conventions for in the psychology community. How do we provide appropriate credit to the original authors and should they be involved?

In this post, I want to explore the role of open data and wrestle with how we can appropriately provide credit. My perspective is we are incentivising providing open data in the first place, but lagging behind when it comes to conventions on how we can adequately credit those who provide open data.

Should all research data be openly available?

Before I start, just to make it clear, I’m not interested in the straw man argument that open data should be mandatory for all research. Some responses started by arguing about whether you can share the data at all, or what happens when you work with vulnerable populations. There are plenty of areas of psychology where open data would not be appropriate and you are absolutely within your rights to explain you cannot share the data. For example, if you work on rare populations that could make it easy to identify participants, or the data are sensitive in areas like clinical psychology or many qualitative projects. Although you could think of something like providing a synthetic data set in its place (see Quintana, 2020 for a primer), we’ll keep the discussion here to primary research data.

In many areas of psychology though, I cannot see a reason why open data should not be an opt-out system rather than opt-in. For example, working with response time data or providing participants with a bunch of personality questionnaires. In these scenarios, there is little reason not to share the data beyond hoarding the data like a research Smaug. For full disclosure, I’ve worked on projects in the psychology of religion, prosocial behaviour, and addictive behaviour. For preprints or publications where I could control the availability of the data, I have always provided open data and listed where it is available (see my research page here).

When discussing the role of credit, let’s assume you can share the data and you have permission from the participants to do so. Cummings et al. (2015) showed how participants are more than happy to consent to you sharing their data providing you are responsible. This recent paper by Bottesini et al. (2022) showed 76% of participants said researchers should share their data after completing a study earlier in the project. There is also a great tutorial article by Meyer (2018) which explores the dos and don’ts for sharing data and the ethics around it. In this discussion, we’re focusing on you as the researcher making the decision to share your data, or you as the researcher who is reanalysing someone else’s data and wants to provide appropriate credit.

Why should you share your data?

Assuming you can share the data, why would you go out of your way to give other researchers access to your data?

For me, I still have a naïvely romantic view of science consistent with Mertonian norms. The first norm is communality/communism, where the ownership of scientific methods and results should be shared freely. These norms have been critiqued though and researchers recognise a career in science is often at odds with the idea of science (Anderson et al., 2010). Ultimately though, I share as much as I can for the warm fuzzy feeling that it’s how I think science should work.

Like Dr Werner’s tweet at the start, sharing data also provides the opportunity for science to progress faster. In theory, as scientists, we are interested in progress and learning more about our area of research regardless of whether we’re the ones publishing it or whether we’re reading the publication from someone else. Open data allows us to ask and answer research questions more efficiently than if you had to collect the data yourself.

In addition to reusing the data to make scientific progress, the rise of open data in psychology has followed the reproducibility crisis / credibility revolution. We know some studies could not be replicated, and that could be down to several explanations such as false positives, differences in methods, or mistakes. Open data provides another level of accountability where you are more likely to double check your findings if someone else can poke around in the data, and other researchers have the opportunity to spot and highlight any mistakes.

There’s also whether the data is really yours to keep secret. Many researchers speak of data like their own property. As much as we spend time and effort designing studies and collecting data, ultimately it’s the participants’ data. Often, our research is supported by a third party like a funding agency, charity, or university. In the vast majority of cases, they will mandate and check how you are going to share and curate the data as an efficient use of their funds. In this scenario, although many researchers still talk about the data as their property, you pretty much have no say in whether you want to share the data. Over the last few days, I have seen different accounts on whether/how you can licence research data to state its reuse terms, but I do not know enough about it to add here.

If the warm fuzzy feeling and quality control aspects do not convince you, then we can turn to the old-fashioned incentive structures in science. Sharing data might lead to your research being cited more. For example, McKiernan et al. (2016) found open practices in general can lead to increases in citations and media attention, and Colavizza et al. (2020) specifically showed open data was associated with an approximately 25% increase in citations compared to articles that did not share their data.

Until this point, there is nothing new here. I just wanted to outline where we currently stand on open data to segue to my main concern in how we should incentivise and provide credit for open data.

How should you credit open data?

In the Twitter discussions, my perception is most people support open data, but they are worried about being disadvantaged if another researcher uses that data to publish an article. At the moment, academic currency is still built around publications in high impact journals and the number of citations they generate. It’s hard to get and retain a secure academic position, so every little helps when it comes to building your profile, particularly as an early career researcher.

Working through the poll from Dr Jacobson, I will explore each option and where I think there are open problems we still need to solve as incentive structures catch up to initiatives born out of the credibility revolution.

As one final point to clarify, I do not think all uses of open data have the same level of contention. For example, taking open data to calculate values that were not reported in the original article to be included in a meta-analysis or a power analysis. I recently benefited from this when I was working on a replication for a student dissertation and the authors did not report means/SDs per condition, but they did provide open data which meant we could calculate them ourselves (shout out to Seli et al., 2016). Unless you needed to ask a question, I do not think anyone would think twice about using open data in this way and just citing the article. From what I can tell, it’s writing an empirical journal article using secondary data which people have a problem with.
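To make that low-contention use concrete, here is a minimal sketch of calculating means/SDs per condition from an open data set. The condition names and scores are made up purely for illustration (they are not from Seli et al., 2016), and `summarise_by_condition` is a hypothetical helper rather than anything from a specific package.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (condition, score) rows; made-up values for illustration only.
rows = [
    ("control", 4.0),
    ("control", 6.0),
    ("control", 8.0),
    ("treatment", 7.0),
    ("treatment", 9.0),
    ("treatment", 11.0),
]


def summarise_by_condition(rows):
    """Return {condition: (n, mean, sample SD)} for (condition, score) pairs.

    Each condition needs at least two scores, otherwise the sample SD
    is undefined and statistics.stdev raises an error.
    """
    scores = defaultdict(list)
    for condition, score in rows:
        scores[condition].append(score)
    return {c: (len(v), mean(v), stdev(v)) for c, v in scores.items()}


summary = summarise_by_condition(rows)
for condition, (n, m, sd) in summary.items():
    print(f"{condition}: n = {n}, M = {m:.2f}, SD = {sd:.2f}")
```

With real open data you would read the rows from the shared file and check the codebook first, so you know which column is the condition and which is the outcome.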

Author approval required

In the final results, 7.1% of respondents voted that when reusing open data, you must get the prior approval of the original authors. Concerns seemed to range from it would simply be rude to omit the original authors to more pragmatic views where you might misinterpret the variables without inside information.

Although building a collaboration could benefit both parties, requiring author approval creates so many problems for me and I am not sure where you draw the line. You would not think about asking for permission to cite an article or trying to replicate a study based on their methods, so I am not sure why you would need their approval if the data are openly available. The worst case scenario would be you find something contradictory to their claims or pet theory, and they veto your attempts to publish the findings.

Author contact required

9.8% of respondents voted that you do not need the approval of the original authors, but you must contact the authors to give them a heads up. This is closer to my own views but I think mandating author contact would still be a little much. I think it’s the polite thing to do and can open avenues for a collaboration, but it should not be mandatory since the data has been openly shared. Like citations or replications, you would not email the authors every time.

Freely use with a citation

This was the modal response, with 63.7% (myself included) voting that if researchers openly share their data, all you need to do is cite the original article. I commented on the original poll that this would be my approach, but I would normally reach out and give the original authors a heads up. I just would not see it as a requirement.

Citing the original article is the minimum you can do. Without citing the original source, at best it’s rude and at worst it’s academic misconduct. Citations are the traditional credit system in science and in theory, highlighting the source of the data should be enough. However, I do think we are missing something here. Although a citation is acknowledging the source of the data and people value citations, it does not feel quite enough for me. So, after all this, I think there are some open questions that still need resolving.

How can we appropriately credit open data?

First, people will often collect large batches of data with many variables and work through publishing smaller components to maximise its value. This means one data set might turn into several articles. Which article should you cite as the source of the data? Choosing one article might dilute the impact if the citations are spread out.

One alternative is publishing the data set as a standalone data paper. For example, there is the Journal of Open Psychology Data (full disclosure: I’m on the advisory board) which provides a short article describing the data set and potential reuse opportunities, along with the data and full codebook. Sharing data is hard and researchers often dump their files and scripts without a codebook or any curation. Hardwicke et al. (2018) showed how even in journals that require data sharing, the computational reproducibility of the article was poor. At JOPD, the data papers are peer-reviewed to check everything is fully described with a codebook. This means you gain credit for the data as a standalone publication, and there is a level of quality control for those reusing the data.

Second, even if your data set is cited, we know that not all citations are the same. Seeing the number of citations to a given article provides very little information; we just know it has been mentioned in some way. For regular citations, Scite codes whether a citing article supports your paper, just mentions your paper, or refutes your paper in some way.

Perhaps we need something similar to acknowledge when an article has cited your open data? A citation is the minimum requirement, but a carefully curated data set another researcher values enough to reuse is worth more than any individual article in my eyes. This means you should be able to highlight on your CV or promotion application how many times another researcher has reused an open data set you have curated.

It takes time for incentive structures to catch up with new norms, so it will be important for hiring and promotion committees to recognise the value of open data and your contribution to the wider scientific community. As part of the PsyTeachR team (2022), we recently made a similar plea for more recognition for open-source tutorials. Like open data, writing tutorials does not quite fit into traditional incentive structures, so it will be important for early career researchers to trust their time and effort curating open data will be valued down the line when it is not another publication on the CV. Many of the Twitter responses revolved around this issue of not gaining credit for a data set that researchers had spent time and effort on, so I think open data should be on a pedestal when it comes to listing your achievements.

Finally, for the researchers reusing open data, I do not think it is clear how their contributions should be valued. Author contributions normally follow the heuristic of first author did the most work, the second author the next most, and so on. This can mask individual contributions, so I’m a big fan of the CRediT system which requires all authors to tick which roles they fulfilled towards the publication.

If open data provides a collaboration opportunity, there is a recognised role within the CRediT system for data curation. So, if you are invited to join a publication, you can outline this role as your contribution to the article in addition to any other responsibilities. Like point two above, I think curating open data should be a highly valued role on publications which you can highlight on your CV since it takes a lot of time and effort to do well.

However, if you only cite the original article when reusing open data, there are roles in the CRediT system which might be ambiguous. For example, I would argue reusing open data would not count towards conceptualization (defined as “formulation or evolution of overarching research goals and aims”) as you did not guide the overarching research goals/aims by designing the study and controlling which variables were collected. However, someone could argue the line of reasoning for their reanalysis equates to conceptualisation of the new article.

This might present some power imbalances if a more senior researcher reanalyses the data and uses their reputation to publish in a higher impact journal than an early career researcher. This means it will be important to clearly state when you are performing a secondary data analysis and provide full credit to the original authors, potentially as its own sub-section in the method.

As open data becomes more prevalent, we are going to experience these discussions more and more. Current incentive structures and credit systems are built around the idea of researchers collecting data and writing their primary research articles. Therefore, we need to establish community norms on how we can rightly credit the role of open data in psychology research. We can publish more standalone data papers off the back of open data, highlight on our CVs when our open data has been reused, and transparently outline author contributions for data curation and when we are only reanalysing data from another researcher. But is this enough to recognise the value of contributing open data?

I’m a huge proponent of how open data can benefit the wider psychology community and I think I have covered the main controversies and current systems for crediting open scholarship practices. If you think I have missed something or have other ideas, I would love to hear them. So, either add a comment or reply to my post on Twitter.