A round-up of our recent research

Data (Alice Design, cc-by, via the Noun Project)

We try to keep this blog updated with new research and presentations from members of the group, but we often fall behind. With that in mind, this post is more of a listicle: 22 things you might not have seen from the CDSC in the past year! We’ve included links to (hopefully un-paywalled) copies of just about everything.

Papers and book chapters

Presentations and panels

  • Champion, Kaylea. (2020) How to build a zombie detector: Identifying software quality problems. Seattle GNU/Linux Users Conference, November 2020.
  • Hwang, Sohyeon and Aaron Shaw. (2020) Heterogeneous practices in collective governance. Presented at Collective Intelligence 2020 (CI 2020). Boston & Copenhagen (Virtually held).
  • Shaw, Aaron. The importance of thinking big: Convergence, divergence, and independence among wikis and peer production communities. WikiResearch Showcase. January 20, 2021.
  • TeBlunthuis, Nathan E., Benjamin Mako Hill, and Aaron Halfaker. “Algorithmic Flags and Identity-Based Signals in Online Community Moderation.” Session on Social Media 2, International Conference on Computational Social Science (IC2S2 2020), Cambridge, MA, July 19, 2020.
  • TeBlunthuis, Nathan E., Aaron Shaw, and Benjamin Mako Hill. “The Population Ecology of Online Collective Action.” Session on Culture and Fairness, International Conference on Computational Social Science (IC2S2 2020), Cambridge, MA, July 19, 2020.
  • TeBlunthuis, Nathan E., Aaron Shaw, and Benjamin Mako Hill. “The Population Ecology of Online Collective Action.” Session on Collective Action, ACM Conference on Collective Intelligence (CI 2020), Boston, MA, June 18, 2020.

Apply to Join the Community Data Science Collective!

It’s Ph.D. application season and the Community Data Science Collective is recruiting! As always, we are looking for talented people to join our research group. Applying to one of the Ph.D. programs that the CDSC faculty members are affiliated with is a great way to do that.

This post provides a very brief run-down on the CDSC, the different universities and Ph.D. programs we’re affiliated with, and what we’re looking for when we review Ph.D. applications. It’s close to the deadline for some of our programs, but we hope this post will still be useful to prospective applicants now and in the future.

CDSC members at the CDSC group retreat in August 2020 (pandemic virtual edition). Left to right by row, starting at top: Charlie, Mako, Aaron, Carl, Floor, Gabrielle, Stef, Kaylea, Tiwalade, Nate, Sayamindu, Regina, Jeremy, Salt, and Sejal.

What is the Community Data Science Collective?

The Community Data Science Collective (or CDSC) is a joint research group of (mostly) quantitative social scientists and designers pursuing research about the organization of online communities, peer production, and learning and collaboration in social computing systems. We are based at Northwestern University, the University of Washington, Carleton College, the University of North Carolina, Chapel Hill, Purdue University, and a few other places. You can read more about us and our work on our research group blog and on the collective’s website/wiki.

What are these different Ph.D. programs? Why would I choose one over the other?

Although we have people at other places, this year the group includes four faculty principal investigators (PIs) who are actively recruiting PhD students: Aaron Shaw (Northwestern University), Benjamin Mako Hill (University of Washington in Seattle), Sayamindu Dasgupta (University of North Carolina at Chapel Hill), and Jeremy Foote (Purdue University). Each of these PIs advises Ph.D. students in Ph.D. programs at their respective universities. Our programs are each described below.

Although we often work together on research and serve as co-advisors to students in each others’ projects, each faculty person has specific areas of expertise and interests. The reasons you might choose to apply to one Ph.D. program or to work with a specific faculty member could include factors like your previous training, career goals, and the alignment of your specific research interests with our respective skills.

At the same time, a great thing about the CDSC is that we all collaborate and regularly co-advise students across our respective campuses, so the choice to apply to or attend one program does not prevent you from accessing the expertise of our whole group. But please keep in mind that our different Ph.D. programs have different application deadlines, requirements, and procedures!

Who is actively recruiting this year?

Given the disruptions and uncertainties associated with the COVID-19 pandemic, the faculty PIs are more constrained in terms of whether and how they can accept new students this year. If you are interested in applying to any of the programs, we strongly encourage you to reach out to the specific faculty in that program before submitting an application.

Ph.D. Advisors

Sayamindu Dasgupta

Sayamindu Dasgupta is an Assistant Professor in the School of Information and Library Science at UNC Chapel Hill. Sayamindu’s research focus includes data science education for children and informal learning online—this work involves both system building and empirical studies.

Benjamin Mako Hill

Benjamin Mako Hill is an Assistant Professor of Communication at the University of Washington. He is also an Adjunct Assistant Professor at UW’s Department of Human-Centered Design and Engineering (HCDE) and Computer Science and Engineering (CSE). Although many of Mako’s students are in the Department of Communication, he also advises students in the Department of Computer Science and Engineering, HCDE, and the Information School—although he typically has limited ability to admit students into those programs. Mako’s research focuses on population-level studies of peer production projects, computational social science, efforts to democratize data science, and informal learning.

Aaron Shaw. (Photo credit: Nikki Ritcher Photography, cc-by-sa)

Aaron Shaw is an Associate Professor in the Department of Communication Studies at Northwestern. In terms of Ph.D. programs, Aaron’s primary affiliations are with the Media, Technology and Society (MTS) and the Technology and Social Behavior (TSB) Ph.D. programs. Aaron also has a courtesy appointment in the Sociology Department at Northwestern, but he has not directly supervised any Ph.D. advisees in that department (yet). Aaron’s current research projects focus on comparative analysis of the organization of peer production communities and social computing projects, participation inequalities in online communities, and empirical research methods.

Jeremy Foote

Jeremy Foote is an Assistant Professor at the Brian Lamb School of Communication at Purdue University. He is affiliated with the Organizational Communication and Media, Technology, and Society programs. Jeremy’s current research focuses on how individuals decide when and in what ways to contribute to online communities, and how understanding those decision-making processes can help us to understand which things become popular and influential. Most of his research is done using data science methods and agent-based simulations.

What do you look for in Ph.D. applicants?

There’s no easy or singular answer to this. In general, we look for curious, intelligent people driven to develop original research projects that advance scientific and practical understanding of topics that intersect with any of our collective research interests.

To get an idea of the interests and experiences present in the group, read our respective bios and CVs (follow the links above to our personal websites). Specific skills that we and our students tend to use on a regular basis include experience consuming and producing social science and/or social computing (human-computer interaction) research; applied statistics and statistical computing, various empirical research methods, social theory and cultural studies, and more.

Formal qualifications that speak to similar skills and show up in your resume, transcripts, or work history are great, but we are much more interested in your capacity to learn, think, write, analyze, and/or code effectively than in your credentials, test scores, grades, or previous affiliations. It’s graduate school and we do not expect you to show up knowing how to do all the things already.

Intellectual creativity, persistence, and a willingness to acquire new skills and problem-solve matter a lot. We think doctoral education is less about executing a task that someone else hands you and more about learning how to identify a new, important problem; develop an appropriate approach to solving it; and explain all of the above and why it matters so that other people can learn from you in the future. Evidence that you can or at least want to do these things is critical. Indications that you can also play well with others and would make a generous, friendly colleague are really important too.

All of this is to say, we do not have any one trait or skill set we look for in prospective students. We strive to be inclusive along every possible dimension. Each person who has joined our group has contributed unique skills and experiences as well as their own personal interests. We want our future students and colleagues to do the same.

Now what?

Still not sure whether or how your interests might fit with the group? Still have questions? Still reading and just don’t want to stop? Follow the links above for more information. Feel free to send at least one of us an email. We are happy to try to answer your questions and always eager to chat.

Update on the COVID-19 Digital Observatory

A few months ago we announced the launch of a COVID-19 Digital Observatory in collaboration with Pushshift and with funding from Protocol Labs. As part of this effort over the last several months, we have aggregated and published public data from multiple online communities and platforms. We’ve also been hard at work adding a series of new data sources that we plan to release in the near future.

Transmission electron microscope image of SARS-CoV-2—also known as 2019-nCoV, the not-so-novel-anymore virus that causes COVID-19 (Source: NIH NIAID via Wikimedia Commons, cc-sa 2.0)

More specifically, we have been gathering Search Engine Results Page (SERP) data on a range of COVID-19 related terms on a daily basis. This SERP data is drawn from both Bing and Google and has grown to encompass nearly 300GB of compressed data from four months of daily search engine results, with both PC and mobile results from nearly 500 different queries each day.

We have also continued to gather and publish revision and pageview data for COVID-related pages on English Wikipedia, which now includes approximately 22GB of highly compressed data (several dozen gigabytes of compressed revision data each day) from nearly 1,800 different articles—a list that has been growing over time.
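For readers who want a feel for the underlying sources, daily per-article pageview counts are available from the public Wikimedia REST API. The sketch below just builds the request URL; it illustrates the public API in general rather than our actual collection pipeline, and the article title and date range are arbitrary examples:

```python
def pageview_url(article, start, end, project="en.wikipedia"):
    """Build a Wikimedia REST API URL for daily per-article pageview counts.

    `start` and `end` are dates in YYYYMMDD form.
    """
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return f"{base}/{project}/all-access/all-agents/{article}/daily/{start}/{end}"

# Example: daily views of one article for March 2020. Fetching this URL
# (e.g., with requests.get and a descriptive User-Agent header) returns
# JSON with one record per day.
url = pageview_url("COVID-19_pandemic", "20200301", "20200331")
```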

In addition, we are preparing releases of COVID-related data from Reddit and Twitter. We are almost done with two datasets from Reddit: a first one that includes all posts and comments from COVID-related subreddits, and a second that includes all posts or comments which include any of a set of COVID-related terms.

For the Twitter data, we are working out details of what exactly we will be able to release, but we anticipate including Tweet IDs and metadata for tweets that include COVID-related terms as well as those associated with hashtags and terms we’ve identified in some of the other data collection. We’re also designing a set of random samples of COVID-related Twitter content that will be useful for a range of projects.

In conjunction with these dataset releases, we have published all of the code to create the datasets as well as a few example scripts to help people learn how to load and access the data we’ve collected. We aim to extend these example analysis scripts in the future as more of the data comes online.
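As a generic illustration of working with data released in this style, here is one way to stream records from a gzip-compressed, newline-delimited JSON file without loading the whole file into memory. This is a sketch assuming that common layout; consult the repository’s example scripts and documentation for the actual file formats:

```python
import gzip
import json

def read_ndjson_gz(path):
    """Yield one parsed record per line from a gzip-compressed,
    newline-delimited JSON file, streaming rather than loading it all."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, even multi-gigabyte compressed files can be filtered or aggregated record by record.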

We hope you will take a look at the material we have been releasing and find ways to use it, extend it, or suggest improvements! We are always looking for feedback, input, and help. If you have a COVID-related dataset that you’d like us to publish, or if you would like to write code or documentation, please get in touch!

All of the data, code, and other resources are linked from the project homepage. To receive further updates on the digital observatory, you can also subscribe to our low traffic announcement mailing list.

Sohyeon Hwang awarded NSF Graduate Research Fellowship

Congratulations to Sohyeon Hwang, who will be awarded a prestigious Graduate Research Fellowship (a.k.a., GRFP) from the U.S. National Science Foundation!

Sohyeon Hwang standing somewhere.

The award will support Sohyeon’s proposed doctoral research on the complexity of governance practices in online communities. This work will focus on the ways communities heterogeneously fill the gap between rules-as-written (de jure) and rules-as-practiced (de facto) to impact the credibility and effectiveness of online governance work. The main components of this project will center around understanding the significance and role of shared (or conversely, localized) rules across communities; the automated tools utilized by these communities; and how users perceive, experience, and practice heterogeneity in online governance practices.

Sohyeon is a first year Ph.D. student in the Media, Technology & Society Program at Northwestern, advised by Aaron Shaw, and began working with the Community Data Science Collective last summer. She completed her undergraduate degree at Cornell University, where she double-majored in government and information science, focusing on Cold War era politics in the former and data science in the latter.

Sohyeon is currently pursuing graduate coursework, and her ongoing research includes a project comparing governance across several of the largest language editions of Wikipedia as well as work with Dr. Ágnes Horvát developing a project on multi-platform information spread. Recently, she has also taken a lead role in the efforts by CDSC and Pushshift to create a Digital Observatory for COVID-19 information resources.

Launching the COVID-19 Digital Observatory

The Community Data Science Collective, in collaboration with Pushshift and others, is launching a new collaborative project to create a digital observatory for socially produced COVID-19 information. The observatory has already begun the process of collecting and aggregating public data from multiple online communities and platforms. We are publishing reworked versions of these data in forms that are well-documented and more easily analyzable by researchers with a range of skills and computational resources. We hope that these data will facilitate analysis and interventions to improve the quality of socially produced information and public health.

Transmission electron microscope image of SARS-CoV-2—also known as 2019-nCoV, the virus that causes COVID-19 (Source: NIH NIAID via Wikimedia Commons, cc-sa 2.0).

During crises such as the current COVID-19 pandemic, many people turn to the Internet for information, guidance, and help. Much of what they find is socially produced through online forums, social media, and knowledge bases like Wikipedia. The quality of information in these data sources varies enormously and users of these systems may receive information that is incomplete, misleading, or even dangerous. Efforts to improve this are complicated by difficulties in discovering where people are getting information and in coordinating efforts to focus on refining the more important information sources. There are a number of researchers with the skills and knowledge to address these issues who may nonetheless struggle to gather or process social data. The digital observatory facilitates data collection, access, and analysis.

Our initial release includes several datasets, code used to collect the data, and some simple analysis examples. Details are provided on the project page as well as our public Github repository. We will continue adding data, code, analysis, documentation, and more. We also welcome collaborators, pull-requests, and other contributions to the project.

What’s the goal for this project?

Our hope is that the public datasets and freely licensed tools, techniques, and knowledge created through the digital observatory will allow researchers, practitioners, and public health officials to more efficiently gather, analyze, understand, and act to improve these crucial sources of information during crises. Ultimately this will support ongoing responses to COVID-19 and contribute to future preparedness to respond to crisis events through analyses conducted after the fact.

How do I get access to the digital observatory?

The digital observatory data, code, and other resources will exist in a few locations, all linked from the project homepage. The data we collect, parse, and publish lives at covid19.communitydata.org/datasets. The code to collect, parse, and output those datasets lives in our Github repository, which also includes some scripts for getting started with analysis. We will integrate additional data and data collection resources from Pushshift and adjacent projects as we go. For more information, please check out the project page.

Stay up to date!

To receive updates on the digital observatory, please subscribe to our low traffic announcement mailing list. You will be the first to know about new datasets and other resources (and we won’t use or distribute addresses for any other reason).

Modeling the ecological dynamics of online organizations

Do online communities compete with each other over resources or niches? Do they co-evolve in symbiotic or even parasitic relationships? What insights can we gain by applying ecological models of collective behavior to the study of collaborative online groups?

A colorful Pisaster ochraceus, a sea star species whose presence or absence can radically alter the ecology of an intertidal community. Our research will adapt theories created to explain the population dynamics of organisms like the pisaster to the context of online communities and human organizations (photo: Multi-Agency Rocky Intertidal Network).


We are delighted to announce that a Community Data Science Collective (CDSC) team led by Nate TeBlunthuis and Jeremy Foote has just started work on a three-year grant from the U.S. National Science Foundation to study the ecological dynamics of online communities! Aaron Shaw and Benjamin Mako Hill are principal investigators for the grant.

The projects supported by the award will extend the study of peer production and online communities by analyzing how aspects of communities’ environments impact their growth, patterns of participation, and survival. The work draws on recent research on various biological systems, organizational ecology, and human-computer interaction (HCI). In general, we adapt these approaches to inform quantitative and computational analysis of populations of peer production communities and other online organizations.

As a major goal, we want to explain the conditions under which certain ecological dynamics emerge versus when they do not. For example, prior work has suggested that communities interact in ways that are both competitive and mutualistic. But what leads some pairs of communities to compete and others to benefit each other? We aim to understand when these patterns arise. We are also interested in how community leaders might pursue effective strategies for survival given circumstances in the surrounding environment.
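To make the competition-versus-mutualism idea concrete, here is a minimal simulation of the classic Lotka-Volterra competition model, with community size standing in for population. This is a textbook illustration, not one of the models developed under the grant, and every parameter value below is made up:

```python
def lv_competition(n1, n2, r=(0.1, 0.1), k=(1000.0, 800.0),
                   alpha=0.5, beta=0.5, steps=500):
    """Discrete-time Lotka-Volterra competition between two populations.

    alpha is the per-capita effect of population 2 on population 1
    (beta is the reverse). Positive coefficients model competition;
    negative coefficients would model mutualism instead.
    """
    for _ in range(steps):
        dn1 = r[0] * n1 * (1 - (n1 + alpha * n2) / k[0])
        dn2 = r[1] * n2 * (1 - (n2 + beta * n1) / k[1])
        n1, n2 = n1 + dn1, n2 + dn2
    return n1, n2

# Two small communities grow toward a stable coexistence equilibrium
# (here n1 ≈ 800 and n2 ≈ 400), each held below its carrying capacity
# by the presence of the other.
n1, n2 = lv_competition(10.0, 10.0)
```

Because alpha * beta < 1 in this parameterization, the two communities coexist; stronger competition coefficients would drive one of them to extinction.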

The grant promises to support a number of projects within the CDSC. Nate and Jeremy led the proposal writing as well as two key pilot studies that informed the development of the proposal. Other group members are now involved in planning and developing multiple studies under the grant.

The grant was awarded by the NSF Cyber-Human Systems (CHS) program within the Directorate for Information and Intelligent Systems (IIS), and the award is shared by Northwestern and the University of Washington (award numbers IIS-1910202 and IIS-1908850).

We’ve published the description of the proposal that we submitted to the NSF, although some details will shift as we carry out the project. The best place to stay up-to-date about the work is to follow the CDSC Twitter account (@ComDataSci) or the CDSC blog.

Apply to Join the Community Data Science Collective!

It’s Ph.D. application season and the Community Data Science Collective is recruiting! As always, we are looking for talented people to join our research group. Applying to one of the Ph.D. programs that Aaron, Mako, Sayamindu, and Jeremy are affiliated with is a great way to do that.

This post provides a very brief run-down on the CDSC, the different universities and Ph.D. programs we’re affiliated with, and what we’re looking for when we review Ph.D. applications. It’s close to the deadline for some of our programs, but we hope this post will still be useful to prospective applicants now and in the future.

Members of the CDSC at a group retreat in Evanston, Illinois in September 2019. Clockwise from far right: Salt, Jackie, Floor, Sejal, Nick, Kaylea, Sohyeon, Aaron, Nate, Jeremy, Mako, Jim, Charlie, and Regina. Sayamindu and Sneha are with us in spirit!

What is the Community Data Science Collective?

The Community Data Science Collective (or CDSC) is a joint research group of (mostly) quantitative social scientists and designers pursuing research about the organization of online communities, peer production, and learning and collaboration in social computing systems. We are based at Northwestern University, the University of Washington, the University of North Carolina, Chapel Hill, and Purdue University. You can read more about us and our work on our research group blog and on the collective’s website/wiki.

What are these different Ph.D. programs? Why would I choose one over the other?

The group currently includes four faculty principal investigators (PIs): Aaron Shaw (Northwestern University), Benjamin Mako Hill (University of Washington in Seattle), Sayamindu Dasgupta (University of North Carolina at Chapel Hill), and Jeremy Foote (Purdue University). The PIs advise Ph.D. students in Ph.D. programs at their respective universities. Our programs are each described below.

Although we often work together on research and serve as co-advisors to students in each others’ projects, each faculty person has specific areas of expertise and unique interests. The reasons you might choose to apply to one Ph.D. program or to work with a specific faculty member include factors like your previous training, career goals, and the alignment of your specific research interests with our respective skills.

At the same time, a great thing about the CDSC is that we all collaborate and regularly co-advise students across our respective campuses, so the choice to apply to or attend one program does not prevent you from accessing the expertise of our whole group. But please keep in mind that our different Ph.D. programs have different application deadlines, requirements, and procedures!

Ph.D. Advisors

Sayamindu Dasgupta

Sayamindu Dasgupta is an Assistant Professor in the School of Information and Library Science at UNC Chapel Hill. Sayamindu’s research focus includes data science education for children and informal learning online—this work involves both system building and empirical studies.

Benjamin Mako Hill

Benjamin Mako Hill is an Assistant Professor of Communication at the University of Washington. He is also an Adjunct Assistant Professor at UW’s Department of Human-Centered Design and Engineering (HCDE) and Computer Science and Engineering (CSE). Although most of Mako’s students are in the Department of Communication, he also advises students in the Department of Computer Science and Engineering and HCDE—although he typically has limited ability to admit students into those programs. Mako’s research focuses on population-level studies of peer production projects, computational social science, efforts to democratize data science, and informal learning.

Aaron Shaw. (Photo credit: Nikki Ritcher Photography, cc-by-sa)

Aaron Shaw is an Associate Professor in the Department of Communication Studies at Northwestern. In terms of Ph.D. programs, Aaron’s primary affiliations are with the Media, Technology and Society (MTS) and the Technology and Social Behavior (TSB) Ph.D. programs. Aaron also has a courtesy appointment in the Sociology Department at Northwestern, but he has not directly supervised any Ph.D. advisees in that department (yet). Aaron’s current research projects focus on comparative analysis of the organization of peer production communities and social computing projects, participation inequalities in online communities, and empirical research methods.

Jeremy Foote

Jeremy Foote is an Assistant Professor at the Brian Lamb School of Communication at Purdue University. He is affiliated with the Organizational Communication and Media, Technology, and Society programs. Jeremy’s current research focuses on how individuals decide when and in what ways to contribute to online communities, and how understanding those decision-making processes can help us to understand which things become popular and influential. Most of his research is done using data science methods and agent-based simulations.

What do you look for in Ph.D. applicants?

There’s no easy or singular answer to this. In general, we look for curious, intelligent people driven to develop original research projects that advance scientific and practical understanding of topics that intersect with any of our collective research interests.

To get an idea of the interests and experiences present in the group, read our respective bios and CVs (follow the links above to our personal websites). Specific skills that we and our students tend to use on a regular basis include experience consuming and producing social science and/or social computing (human-computer interaction) research; applied statistics and statistical computing, various empirical research methods, social theory and cultural studies, and more.

Formal qualifications that speak to similar skills and show up in your resume, transcripts, or work history are great, but we are much more interested in your capacity to learn, think, write, analyze, and/or code effectively than in your credentials, test scores, grades, or previous affiliations. It’s graduate school and we do not expect you to show up pre-certified in all the ways or knowing how to do all the things already.

Intellectual creativity, persistence, and a willingness to acquire new skills and problem-solve matter a lot. We think doctoral education is less about executing a task that someone else hands you and more about learning how to identify a new, important problem; develop an appropriate approach to solving it; and explain all of the above and why it matters so that other people can learn from you in the future. Evidence that you can or at least want to do these things is critical. Indications that you can also play well with others and would make a generous, friendly colleague are really important too.

All of this is to say, we do not have any one trait or skill set we look for in prospective students. We strive to be inclusive along every possible dimension. Each person who has joined our group has contributed unique skills and experiences as well as their own personal interests. We want our future students and colleagues to do the same.

Now what?

Still not sure whether or how your interests might fit with the group? Still have questions? Still reading and just don’t want to stop? Follow the links above for more information. Feel free to send at least one of us an email. We are happy to try to answer your questions and always eager to chat.

New Grant for Studying “Underproduction” in Software Infrastructure

Earlier this year, a team led by Kaylea Champion was announced as the recipient of a generous grant from the Ford and Sloan Foundations to support research into peer produced software infrastructure. Now that the project is moving forward in earnest, we’re thrilled to tell you about it.

In the foreground, the photo depicts a rusted sign with "To rapid transit" and an arrow. The sign is marked with tagging-style graffiti. In the background are rusted iron girders, part of the infrastructure of the L train.
Rapid Transit. Photo by Anthony Doudt, via flickr. CC BY-NC-ND 2.0

The project is motivated by the fact that peer production communities have produced awesome free (both as in freedom and beer) resources—sites like Wikipedia that gather the world’s knowledge, and software like Linux that enables innovation, connection, commerce, and discovery. Over the last two decades, these resources have become key elements of public digital infrastructure that many of us rely on every day. However, some pieces of digital infrastructure we rely on most remain relatively under-resourced—as security vulnerabilities like Heartbleed in OpenSSL reveal. The grant from Ford and Sloan will support a research effort to understand how and why some software packages that are heavily used receive relatively little community support and maintenance.

We’re tackling this challenge by seeking to measure and model patterns of usage, contribution, and quality in a population of free software projects. We’ll then try to identify causes and potential solutions to the challenges of relative underproduction. Throughout, we’ll draw on both insight from the research community and on-the-ground observations from developers and community managers. We aim to create practical guidance that communities and software developers can actually use as well as novel research contributions. Underproduction is, appropriately enough, a challenge that has not gotten much attention from researchers previously, so we’re excited to work on it.
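One simple way to operationalize the mismatch at the heart of underproduction is to compare each package’s rank by usage with its rank by quality and flag packages whose usage far outstrips their quality. The sketch below is only an illustrative toy, not the project’s actual measure, and the package names and scores are invented:

```python
def underproduction_gap(usage, quality):
    """Rank packages separately by usage and by quality (both descending).

    The gap (quality rank minus usage rank) is positive when a package
    is more heavily used than its quality would suggest, i.e., a
    candidate case of underproduction.
    """
    by_usage = sorted(usage, key=usage.get, reverse=True)
    by_quality = sorted(quality, key=quality.get, reverse=True)
    u_rank = {p: i for i, p in enumerate(by_usage)}
    q_rank = {p: i for i, p in enumerate(by_quality)}
    return {p: q_rank[p] - u_rank[p] for p in usage}

# Toy numbers: package "c" is the most used but the lowest quality,
# so it gets the largest (most underproduced) gap.
usage = {"a": 10, "b": 50, "c": 900}
quality = {"a": 0.9, "b": 0.8, "c": 0.2}
gaps = underproduction_gap(usage, quality)
```

Real measures would need to handle ties, very different scales of usage and quality, and uncertainty in both, but the rank-mismatch intuition is the same.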

Although Kaylea Champion is leading the project, the team also includes Benjamin Mako Hill, Aaron Shaw, and collective affiliate Morten Warncke-Wang, who did pioneering work on underproduction in Wikipedia.

Exceedingly Reproducible Research: A Proposal

The reproducibility movement in science has sought to increase our confidence in scientific knowledge by having research teams disseminate their data, instruments, and code so that other researchers can reproduce their work. Unfortunately, all approaches to reproducible research to date suffer from the same fundamental flaw: they seek to reproduce the results of previous research while making no effort to reproduce the research process that led to those results. We propose a new method of Exceedingly Reproducible Research (ERR) to close this gap. This blog post will introduce scientists to the error of their ways, and to the ERR of ours.

Even if a replication succeeds in producing tables and figures that appear identical to those in the original, they differ in that they provide answers to different questions. An example from our own work illustrates the point.

Figure 1: Active editors on Wikia wikis over time (taken from TeBlunthuis, Shaw, and Hill 2018)

Figure 1 above shows the average number of contributors (in standardized units) to a series of large wikis drawn from Wikia. It was created to show the life-cycles of large online communities and published in a paper last year.


Figure 2: Replication of Figure 1 from TeBlunthuis, Shaw, and Hill (2018)

Results from a replication are shown in Figure 2. As you can see, the plots have much in common. However, deeper inspection reveals that the similarity is entirely superficial. Although the dots and lines fall in the same places on the graphs, they fall there for entirely different reasons.

Tilting at windmills in Don Quixote.

Figure 1 reflects a lengthy exploration and refinement of a (mostly) original idea and told us something we did not know. Figure 2 merely tells us that the replication was “successful.” The two look similar and may confuse a reader into thinking that they reflect the same thing, but they are as different as night and day. We are like Pierre Menard, who reproduced two chapters of Don Quixote word-for-word through his own experiences: the image appears similar, but the meaning is completely changed. Because we made no attempt to reproduce the research process, our attempt at replication was doomed before it began.

How Can We Do Better?

Scientific research is not made of code and data; it is made by people. In order to replicate a piece of work, one should reproduce all parts of the research. One must retrace another’s steps, as it were, through the garden of forking paths.

In ERR, researchers must conceive of the idea, design the research project, collect the data, write the code, and interpret the results. ERR involves carrying out every relevant aspect of the research process again, from start to finish. What counts as relevant? Because nobody has attempted ERR before, we cannot know for sure. However, we are reasonably confident that successful ERR will involve taking the same courses as the original scientists, reading the same books and articles, having the same conversations at conferences, conducting the same lab meetings, recruiting the same research subjects, and making the same mistakes.

There are many things that might affect a study indirectly and that, as a result, must also be carried out again. For example, it seems likely that a researcher attempting to ERR must read the same novels, eat the same food, fall prey to the same illnesses, live in the same homes, date and marry the same people, and so on. To ERR, one must have enough information to become the researchers as they engage in the research process from start to finish.

It seems likely that anyone attempting to ERR will be at a major disadvantage if they know that the previous research exists. Indeed, it seems possible that ERR can only be conducted by researchers who never realize that they are engaged in the process of replication at all. By reading this proposal and learning about ERR, you may have made it difficult to ever carry it out successfully.

Despite these many challenges, ERR has important advantages over traditional approaches to reproducibility. Because everything will be reproduced along the way, ERR requires no replication datasets or code. Of course, verifying that one is “in ERR” will require access to extensive intermediary products, so researchers wanting to support ERR in their own work should provide such products from every stage of the process. Toward that end, the Community Data Science Collective has started creating videos of our lab meetings in the form of PDF flipbooks well suited to deposition in our university’s institutional archives. A single frame is shown in Figure 3. We have released our video_to_pdf tool under a free license, which you can use to convert your own MP4 videos to PDF.

Frame from Video
Figure 3: PDF representation of one frame of a lab meeting between three members of the lab, produced using video_to_pdf. The full lab meeting is 25,470 pages (an excerpt is available).

With ERR, reproduction results in work that is as original as the original work. Only by reproducing the original so fully, so totally, and in such rigorous detail will true scientific validation become possible. We do not so much seek to stand on the shoulders of giants as to inhabit the body of the giant. If to reproduce is human, to ERR is divine.

Benjamin Mako Hill is a Research Symbiont!

In exciting news, Benjamin Mako Hill was just announced as a winner of a 2019 Research Symbiont Award. Mako received the second annual General Symbiosis Award, which “is given to a scientist working in any field who has shared data beyond the expectations of their field.” The award was announced at a ceremony in Hawaii at the Pacific Symposium on Biocomputing.

The award presentation called out Mako’s work on the preparation of the Scratch research dataset, which includes the first five years of longitudinal data from the Scratch online community. Andrés Monroy-Hernández worked with Mako on that project. Mako’s nomination also mentioned his research group’s commitment to the production of replication datasets as well as his work with Aaron Shaw on datasets of redirects and page protection from Wikipedia. Mako talks about this work in a short video he recorded that was shown at the award ceremony.

Plush salmon with lamprey parasite.
A photo of the award itself: a plush fish complete with a parasitic lamprey.

The Research Symbionts Awards are given annually to recognize “symbiosis” in the form of data sharing. They are a companion award to the Research Parasite Awards, which recognize superb examples of secondary data reuse. The award includes money to travel to the Pacific Symposium on Biocomputing (unfortunately, Mako wasn’t able to take advantage of this!) as well as the plush fish with parasitic lamprey shown here.

In addition to the award given to Mako, Dr. Leonardo Collado-Torres was announced as the recipient of the health-specific Early Career Symbiont award for his work on Recount2.