I’m pleased to announce the Community Data Science Collective Dataverse. Our dataverse is an archival repository for datasets created by the Community Data Science Collective. The dataverse won’t replace work that collective members have been doing for years to document and distribute data from our research. What we hope it will do is get our data — like our published manuscripts — into the hands of folks in the “forever” business.
Over the past few years, the Community Data Science Collective has published several papers where an important part of the contribution is a dataset. These include:
- Consider The Redirect: A Missing Dimension of Wikipedia Research (blog post) — A paper about why it’s important for Wikipedia research to take redirect pages into account. Alongside the paper, we published code to build a dataset of redirects plus the dataset of redirects itself.
- Page Protection: Another Missing Dimension of Wikipedia Research — A follow-up paper that discusses page protection. Alongside the paper, we published code and a dataset of page protection spells.
- A Longitudinal Dataset of Five Years of Public Activity in the Scratch Online Community (blog post) — A large dataset of social interaction data from the website that runs the Scratch online community.
Recently, we’ve also begun producing replication datasets to go alongside our empirical papers. So far, this includes:
- Starting Online Communities: Motivations and Goals of Wiki Founders (blog post) — A paper about why people set out to create new online communities.
- The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users (blog post) — A description and evaluation of a system to help onboard newcomers to Wikipedia.
For each paper in the first group, where the dataset was part of the contribution, we uploaded code and data to a website we created. Of course, even if we do a wonderful job of keeping these websites maintained over time, our research group will eventually cease to exist. When that happens, the data will eventually disappear as well.
The text of our papers will be maintained long after we’re gone in the journal or conference proceedings’ publisher’s archival storage and in our universities’ institutional archives. But what about the data? Since the data is a core part — perhaps the core part — of the contribution of these papers, the data should be archived permanently as well.
Toward that end, our group has created a dataverse. Our dataverse is a repository within the Harvard Dataverse where we have been uploading archival copies of datasets over the last six months. All five of the papers described above are uploaded already. The Scratch dataset, due to access control restrictions, isn’t listed on the main page but it’s online on the site. Moving forward, we’ll be populating it with new datasets we create as well as replication datasets for our future empirical papers. We’re currently preparing several more.
The primary point of the CDSC Dataverse is not to provide you with a way to get our data, although you’re certainly welcome to use it that way and it might help make some of it more discoverable. The websites we’ve created (like the ones for redirects and for page protection) will continue to exist and be maintained. The Dataverse is insurance: if, and when, those websites go down, our data will still be accessible.
This post was also published on Benjamin Mako Hill’s blog Copyrighteous.