A round-up of our recent research

Data (Alice Design, CC BY, via the Noun Project)

We try to keep this blog updated with new research and presentations from members of the group, but we often fall behind. With that in mind, this post is more of a listicle: 22 things you might not have seen from the CDSC in the past year! We’ve included links to (hopefully un-paywalled) copies of just about everything.

Papers and book chapters

Presentations and panels

  • Champion, Kaylea. (2020) How to build a zombie detector: Identifying software quality problems. Seattle GNU/Linux Users Conference, November 2020.
  • Hwang, Sohyeon and Aaron Shaw. (2020) Heterogeneous practices in collective governance. Presented at Collective Intelligence 2020 (CI 2020). Boston & Copenhagen (Virtually held).
  • Shaw, Aaron. (2021) The importance of thinking big: Convergence, divergence, and independence among wikis and peer production communities. Wikimedia Research Showcase. January 20, 2021.
  • TeBlunthuis, Nathan E., Benjamin Mako Hill, and Aaron Halfaker. (2020) “Algorithmic Flags and Identity-Based Signals in Online Community Moderation.” Session on Social Media 2, International Conference on Computational Social Science (IC2S2 2020), Cambridge, MA, July 19, 2020.
  • TeBlunthuis, Nathan E., Aaron Shaw, and Benjamin Mako Hill. (2020) “The Population Ecology of Online Collective Action.” Session on Culture and Fairness, International Conference on Computational Social Science (IC2S2 2020), Cambridge, MA, July 19, 2020.
  • TeBlunthuis, Nathan E., Aaron Shaw, and Benjamin Mako Hill. (2020) “The Population Ecology of Online Collective Action.” Session on Collective Action, ACM Conference on Collective Intelligence (CI 2020), Boston, MA, June 18, 2020.

COVID-19 Digital Observatory awarded Open Innovation Grant from Protocol Labs Research

Last week, Protocol Labs Research announced their COVID-19 Open Innovation Grant recipients and we are thrilled to announce that the Community Data Science Collective’s COVID-19 Digital Observatory is among the awarded projects!

Protocol Labs works to improve internet technologies through open source protocols, systems, and tools. The organization initially grew out of efforts to apply blockchain tools to support distributed file sharing infrastructure. Their research group, Protocol Labs Research, created the COVID-19 Open Innovation Grants program “to surface and support open-source projects working on tools to help humanity through present and future pandemics.”

Among the ten projects supported under the program, others aim to develop open source medical devices (such as an origami respirator!), contact tracing infrastructure, device development and testing, and engineering collaboration. We feel grateful and humbled to be in the company of these diverse efforts to apply open collaboration to the response to COVID-19!

In the case of the COVID-19 Digital Observatory, we plan to use the funds provided by the award to build out the resources we have already started to aggregate and release. In particular, we will build additional infrastructure to process and archive data from Reddit and other social media sources as well as search engine results pages (SERPs) for COVID-related queries.

In addition to folks in the collective, the proposal was successful through the efforts of Jason Baumgartner from Pushshift, who is co-leading the observatory work, as well as Marysia Galent, Research Administrator at Northwestern University, whose expert guidance helped make the grant application possible.

Sohyeon Hwang awarded NSF Graduate Research Fellowship

Congratulations to Sohyeon Hwang, who will be awarded a prestigious Graduate Research Fellowship (GRFP) from the U.S. National Science Foundation!

Sohyeon Hwang standing somewhere.

The award will support Sohyeon’s proposed doctoral research on the complexity of governance practices in online communities. This work will focus on the ways communities heterogeneously fill the gap between rules-as-written (de jure) and rules-as-practiced (de facto), and how that gap shapes the credibility and effectiveness of online governance work. The main components of the project center on understanding the significance and role of shared (or, conversely, localized) rules across communities; the automated tools these communities use; and how users perceive, experience, and practice heterogeneity in online governance.

Sohyeon is a first-year Ph.D. student in the Media, Technology & Society Program at Northwestern, advised by Aaron Shaw, and began working with the Community Data Science Collective last summer. She completed her undergraduate degree at Cornell University, where she double-majored in government and information science, focusing on Cold War-era politics in the former and data science in the latter.

Sohyeon is currently pursuing graduate coursework, and her ongoing research includes a project comparing governance across several of the largest language editions of Wikipedia as well as work with Dr. Ágnes Horvát developing a project on multi-platform information spread. Recently, she has also taken a lead role in the efforts by CDSC and Pushshift to create a Digital Observatory for COVID-19 information resources.

New Grant for Studying “Underproduction” in Software Infrastructure

Earlier this year, a team led by Kaylea Champion was announced as the recipient of a generous grant from the Ford and Sloan Foundations to support research into peer-produced software infrastructure. Now that the project is moving forward in earnest, we’re thrilled to tell you about it.

Rapid Transit. Photo by Anthony Doudt, via flickr. CC BY-NC-ND 2.0

The project is motivated by the fact that peer production communities have produced awesome free (both as in freedom and beer) resources—sites like Wikipedia that gather the world’s knowledge, and software like Linux that enables innovation, connection, commerce, and discovery. Over the last two decades, these resources have become key elements of public digital infrastructure that many of us rely on every day. However, some pieces of digital infrastructure we rely on most remain relatively under-resourced—as security vulnerabilities like Heartbleed in OpenSSL reveal. The grant from Ford and Sloan will support a research effort to understand how and why some software packages that are heavily used receive relatively little community support and maintenance.

We’re tackling this challenge by seeking to measure and model patterns of usage, contribution, and quality in a population of free software projects. We’ll then try to identify causes and potential solutions to the challenges of relative underproduction. Throughout, we’ll draw on both insight from the research community and on-the-ground observations from developers and community managers. We aim to create practical guidance that communities and software developers can actually use as well as novel research contributions. Underproduction is, appropriately enough, a challenge that has not gotten much attention from researchers previously, so we’re excited to work on it.

Although Kaylea Champion is leading the project, the team also includes Benjamin Mako Hill, Aaron Shaw, and collective affiliate Morten Warncke-Wang, who did pioneering work on underproduction in Wikipedia.

Benjamin Mako Hill is a Research Symbiont!

In exciting news, Benjamin Mako Hill was just announced as a winner of a 2019 Research Symbiont Award. Mako received the second annual General Symbiosis Award, which “is given to a scientist working in any field who has shared data beyond the expectations of their field.” The award was announced at a ceremony in Hawaii at the Pacific Symposium on Biocomputing.

The award presentation called out Mako’s work on the preparation of the Scratch research dataset, which includes the first five years of longitudinal data from the Scratch online community. Andrés Monroy-Hernández worked with Mako on that project. Mako’s nomination also mentioned his research group’s commitment to the production of replication datasets as well as his work with Aaron Shaw on datasets of redirects and page protection from Wikipedia. Mako was asked to talk about this work in a short video he recorded that was shown at the award ceremony.

A photo of the award itself: a plush fish complete with a parasitic lamprey.

The Research Symbiont Awards are given annually to recognize “symbiosis” in the form of data sharing. They are a companion award to the Research Parasite Awards, which recognize superb examples of secondary data reuse. The award includes money to travel to the Pacific Symposium on Biocomputing (unfortunately, Mako wasn’t able to take advantage of this!) as well as the plush fish with parasitic lamprey shown here.

In addition to the award given to Mako, Dr. Leonardo Collado-Torres was announced as the recipient of the health-specific Early Career Symbiont Award for his work on recount2.

Awards and citations at computing conferences

I’ve heard a surprising “fact” repeated in the CHI and CSCW communities: that receiving a best paper award at a conference is uncorrelated with future citations. Although it’s surprising and counterintuitive, it’s a nice thing to think about when you don’t get an award and a nice thing to say to others when you do. I’ve thought it and said it myself.

It also seems to be untrue. When I tried to check the “fact” recently, I found a body of evidence that suggests that computing papers that receive best paper awards are, in fact, cited more often than papers that do not.

The source of the original “fact” seems to be a CHI 2009 study by Christoph Bartneck and Jun Hu titled “Scientometric Analysis of the CHI Proceedings.” Among many other things, the paper presents a null result for a test of a difference in the distribution of citations across best paper awardees, nominees, and a random sample of non-nominees.

Although the award analysis is only a small part of Bartneck and Hu’s paper, at least two papers have subsequently brought more attention, more data, and more sophisticated analyses to the question. In 2015, the question was asked by Jacques Wainer, Michael Eckmann, and Anderson Rocha in their paper “Peer-Selected ‘Best Papers’—Are They Really That ‘Good’?”

Wainer et al. build two datasets: one of papers from 12 computer science conferences with citation data from Scopus, and another of papers from 17 different conferences with citation data from Google Scholar. Because of parametric concerns, Wainer et al. used a non-parametric rank-based technique to compare awardees to non-awardees. Wainer et al. summarize their results as follows:

The probability that a best paper will receive more citations than a non best paper is 0.72 (95% CI = 0.66, 0.77) for the Scopus data, and 0.78 (95% CI = 0.74, 0.81) for the Scholar data. There are no significant changes in the probabilities for different years. Also, 51% of the best papers are among the top 10% most cited papers in each conference/year, and 64% of them are among the top 20% most cited.
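
The statistic Wainer et al. report (the probability that a randomly chosen best paper out-cites a randomly chosen non-best paper) is the common-language effect size, which can be read directly off a Mann-Whitney U test. Below is a minimal sketch of that kind of comparison on made-up citation counts; the distributions and sample sizes are invented, and only the method mirrors the rank-based approach described above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical citation counts: award winners drawn from a higher-mean
# distribution than non-winners (all values are made up).
awardees = rng.negative_binomial(5, 0.10, size=50)
non_awardees = rng.negative_binomial(5, 0.20, size=500)

u_stat, p_value = mannwhitneyu(awardees, non_awardees, alternative="greater")

# U / (n1 * n2) is the probability that a randomly chosen awardee out-cites
# a randomly chosen non-awardee (ties count as 0.5), i.e. the quantity that
# Wainer et al. report as 0.72 and 0.78 for their two datasets.
prob_superiority = u_stat / (len(awardees) * len(non_awardees))
print(f"P(awardee > non-awardee) ~= {prob_superiority:.2f} (p = {p_value:.3g})")
```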

The question was also recently explored in a different way by Danielle H. Lee in her paper on “Predictive power of conference‐related factors on citation rates of conference papers” published in June 2018.

Lee looked at 43,000 papers from 81 conferences and built a regression model to predict citations. Taking into account a number of controls not considered in previous analyses, Lee finds that the marginal effect of receiving a best paper award on citations is positive, well-estimated, and large.
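
To make that setup concrete, here is a hypothetical sketch of the kind of regression Lee describes: citation counts predicted from an award indicator plus controls. The column names, the Poisson model family, and all of the simulated numbers are my own assumptions for illustration, not details taken from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Simulated paper-level data; column names and effect sizes are invented.
df = pd.DataFrame({
    "best_paper": rng.binomial(1, 0.05, n),        # ~5% of papers win an award
    "conference_rank": rng.integers(1, 4, n),      # 1 = most selective venue
    "num_authors": rng.integers(1, 8, n),
})
# Citation counts depend on the controls plus a genuine award effect.
rate = np.exp(1.5 + 0.6 * df["best_paper"]
              - 0.3 * df["conference_rank"]
              + 0.1 * df["num_authors"])
df["citations"] = rng.poisson(rate)

# Count model with controls; the coefficient on best_paper is the
# (log-scale) marginal effect of winning an award.
model = smf.poisson("citations ~ best_paper + conference_rank + num_authors",
                    data=df).fit()
print(model.params["best_paper"])
```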

Why did Bartneck and Hu come to such a different conclusion than later work?

Distribution of citations (received by 2009) of CHI papers published between 2004-2007 that were nominated for a best paper award (n=64), received one (n=12), or were part of a random sample of papers that did not (n=76).

My first thought was that perhaps CHI is different from the rest of computing. However, when I looked at the data from Bartneck and Hu’s 2009 study—conveniently included as a figure in their original study—I could see that they did find a higher mean among the award recipients compared to both nominees and non-nominees. The entire distribution of citations among award winners appears to be pushed upwards. Although Bartneck and Hu found an effect, they did not find a statistically significant effect.

Given the more recent work by Wainer et al. and Lee, I’d be willing to venture that the original null finding was a function of the fact that citation counts are a very noisy measure—especially over a 2-5 year post-publication period—and that the Bartneck and Hu dataset was small, with only 12 awardees out of 152 papers total. This might have caused problems because the statistical test the authors used was an omnibus test for differences in a three-group sample that was heavily imbalanced toward the two groups (nominees and non-nominees) in which there appears to be little difference. My bet is that the paper’s conclusion on awards is simply an example of how a null result is not evidence of a non-effect—especially in an underpowered dataset.
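
A quick simulation can illustrate the power problem. The rough sketch below generates citation-like counts for three groups with the same 12/64/76 imbalance, builds in a real advantage for awardees, and checks how often an omnibus test detects it. All of the distributional parameters are invented; the point is only that a real effect can easily fail to reach significance with so few awardees.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)
n_sims, hits = 1000, 0

for _ in range(n_sims):
    # Same group sizes as Bartneck and Hu; awardees get a genuinely
    # higher-mean citation distribution (all parameters are made up).
    awardees = rng.negative_binomial(2, 0.08, size=12)
    nominees = rng.negative_binomial(2, 0.12, size=64)
    others = rng.negative_binomial(2, 0.12, size=76)
    _, p = kruskal(awardees, nominees, others)
    hits += p < 0.05

print(f"Share of simulations where the omnibus test reaches p < 0.05: {hits / n_sims:.2f}")
```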

Of course, none of this means that award-winning papers are better. Despite Wainer et al.’s claim that they are showing that award-winning papers are “good,” none of the analyses presented can disentangle the signalling value of an award from differences in underlying paper quality. The packed rooms one routinely finds at best paper sessions suggest that at least some of the additional citations received by award winners might result from the extra exposure the awards themselves provide. In the future, perhaps people can say something along these lines instead of repeating the “fact” of the non-relationship.


This post was originally posted on Benjamin Mako Hill’s blog Copyrighteous.

Sayamindu Dasgupta Joining the University of North Carolina Faculty

The School of Information and Library Science (SILS) at the University of North Carolina at Chapel Hill announced this week that the Community Data Science Collective’s very own Sayamindu Dasgupta will be joining their faculty as a tenure-track assistant professor. The announcement from SILS has much more detail and makes very clear that UNC is thrilled to have him join their faculty.

UNC has every reason to be excited. Sayamindu has been making our research collective look good for several years. Much of this is obvious in the pile of papers and awards he’s built. In less visible roles, Sayamindu has helped us build infrastructure, mentored graduate and undergraduate students in the group, and has basically just been a joy to have around.

Those of us who work in the Community Data Lab at UW are going to miss having Sayamindu around. Chapel Hill is very, very lucky to have him.