Community Data Science Collective

September 18, 2023September 14, 2023

FOSSY Fun, Finished

The CDSC hosted the Science of Community Track on July 15th at FOSSY this year — it was an awesome day of learning and conversation with a fantastic group of senior scholars, industry partners, students, practitioners, community members, and more! We are so grateful and eager to build on the discussions we began.

If you missed the sessions, watch this space! Most sessions were recorded, and we’ll post links and materials as they’re released.

Special thanks to Molly de Blanc for all the long distance organizing work; Shauna Gordon McKeon for stepping in to help share some closing thoughts on the Science of Community track at the very last minute, and to the FOSSY organizing team for convening such a warm, welcoming inaugural event (indeed, the warmth was palpable as it nearly hit 100° F on Friday and Saturday in Portland).

One tangible result of a free software conference: new laptop stickers!

July 11, 2023July 10, 2023

Meet us at FOSSY!

The Free and Open Source Software Yearly conference (FOSSY) is in less than a week and we will be there!

We will be running the Science of Community track on Saturday July 15.

Two photos. In one is Kaylea Chamption, who has purple hair and a blue shirt. In the other is Sejal Khatri, Benjamin Mako Hill, and Aaron Shaw. — Kaylea Champion, and Benjamin Mako Hill and Aaron Shaw with Sejal Khatri (who won’t be at FOSSY)

The Science of Community track is inspired by the CDSC Science of Community Dialogues, which aim to bring together practitioners and researchers to discuss scholarly work that is relevant to the efforts of practitioners. As researchers, we get so much from the communities we work with and study and we want them to also learn from the research they so generously take part in. While the Dialogues cover a broad range of topics and communities, FOSSY presentations focus on how that work related to free and open source software communities, projects, and practitioners.

At FOSSY, we will have a number of really amazing researchers presenting their work. We wanted to share some highlights from the schedule.

Sophia Vargas, from Google’s Open Source Programs Office, will be presenting on how metrics can help us understand contributor burnout. Professor Shoji Kajita, from Kyoto University, will discuss research data management for FOSS communities. Mariam Guizani, from Oregon State University, will cover research on the why and how of corporate participation in FOSS. We will additionally have lightning talks by Adam Hyde, Anita Sarma, Shauna Gordon-McKeon, and incoming Northwestern Ph.D. student Matthew Gaughan.

We are really excited about our workshop “Let’s Get Real: Putting Research Findings Into Practice.” This workshop, designed for FOSS contributors and practitioners, will help guide you on how to get the most out of the incredible research on and relevant to FOSS. If you want to learn how to navigate the sheer volume of interesting research work happening or how to understand what it means, this is the session for you! Our workshop will be led by Kaylea Chamption and Professors Aaron Shaw and Benjamin Mako Hill. You can read more on our wiki.

Due to scheduling issues, Eriol Fox will be presenting their talk, “Community lead user research and usability in Science and Research OSS: What we learned,” in the Wildcard Track. We recommend going!

We hope to see you at FOSSY. Even if you can’t make it to our sessions, we’ll be at the conference so stop by and say hello!

May 24, 2023May 24, 2023

Community Data Science Collective at ICA 2023

The International Communication Association (ICA)’s 73nd annual conference is coming up soon. This year, the conference takes place in Toronto, Canada, and a subset of our collective is showing up to present work in person. We are looking forward to meeting up, talking about research, and hanging out together!

ICA takes place from Wednesday, May 24, to Monday, May 29, and CDSC members will take various roles in a number of different conference programs, including chairing, presentations, and co-organizing of preconference. Here is the list of our participation by the time order, so feel free to join us!

Thursday, May 25

We start off with a presentation by Yibin Fan on Thursday at 10:45 am in the International Living Learning Centre of Toronto Metropolitan University on Political Communication Graduate Student Preconference. In a panel on The Causes and Outcomes of Political Polarization and Violence, Yibin will present a paper entitled “Does Incidental Political Discussion Make Political Expression Less Polarized? Evidence From Online Communities”.

Later on Thursday, another preconference on New Frontiers in Global Digital Inequalities Research will take place in M – Room Linden (Sheraton) from 1:30pm to 5pm, Floor Fiers will work as a co-chair with a number of scholars from international, various academic institutions, and they will give a presentation on “The Gig Economy: A Site of Opportunity Vs. a Site of Risk?”. This preconference is affiliated with Communication and Technology Division and Communication Law & Policy Division of ICA.

Friday, May 26

On Friday, Carl Colglazier will present in a panel on Disinformation, Politics and Social Media at 3:00pm in M – Room York (Sheraton), and the presentation is entitled as “The Effects of Sanctions on Decentralized Social Networking Sites: Quasi-Experimental Evidence From the Fediverse”.

Sunday, May 28

On Sunday we will be actively taking different roles in various sessions. In the morning, Yibin Fan will serve as the moderator for a research paper panel on Political Deliberation and Expression affiliated by Political Communication Division at 9:00 am in 2 – Room Simcoe (Sheraton).

Then there comes our highlighting paper that won the Top Paper Award by Computational Methods Division: Nathan TeBlunthuis will present a methodological research paper entitled as “Automated Content Misclassification Causes Bias in Regression: Can We Fix It? Yes We Can!” in a panel on Debate, Deliberation and Discussion in the Public Sphere at noon in M – Room Maple East (Sheraton). This is a project on which Nate collaborates with Valerie Hase at LMU Munich and Chung-hong Chan at University of Mannheim. Congratulations to them for getting the Top Paper Award!

Last but not least, we will finish off our ICA 2023 by seeing our faculty members serving as the chair and discussants in the Computational Methods Research Escalator Session at 1:30pm in M – Room Maple West (the same room as Nate’s presentation!). The session is junior scholars who are inexperienced in publishing to have connections with more senior researchers in the field. As the Call for Papers by Computational Methods Division says, “Research escalator papers provide an opportunity for less experienced researchers to obtain feedback from more senior scholars about a paper-in-progress, with the goal of making the paper ready for submission to a conference or journal.” Aaron Shaw, together with Matthew Weber at Rutgers University, will serve as the chairs for the session. Benjamin Mako Hill and Jeremy Foote, together with a bunch of scholars from other institutions, will work as discussants for improving the research presented here.

We look forward to sharing our research and connecting with you at ICA!

May 11, 2023

Community Dialogue on Digital Inequalities

Join the Community Data Science Collective (CDSC) for our 5th Science of Community Dialogue! This Community Dialogue will take place on May 19 at 10:00 am PDT (18:00 UTC). This Dialogue focuses on digital inequalities and online community participation. Professor Hernan Galperin (University of Southern California) will join Floor Fiers (Northwestern University) to present recent research on topics including:

Inequalities in online access and participation
Differentiated participation in online communities
Causes and consequences of online inequalities
Digital skills as a barrier to online participation
Combating digital discrimination

A full session descriptions is on our website. Register online

What is a Dialogue?

The Science of Community Dialogue Series is a series of conversations between researchers, experts, community organizers, and other people who are interested in how communities work, collaborate, and succeed. You can watch this short introduction video with Aaron Shaw.

What is the CDSC?

The Community Data Science Collective (CDSC) is an interdisciplinary research group made of up of faculty and students at the University of Washington Department of Communication, the Northwestern University Department of Communication Studies, the Carleton College Computer Science Department, and the Purdue University School of Communication.

Learn more

If you’d like to learn more or get future updates about the Science of Community Dialogues, please join the low volume announcement list.

April 28, 2023May 8, 2023

CDSC @ FOSSY: Call for Proposals

Help us build a dynamic and exciting program to facilitate conversations between free and open source software (FOSS) researchers and practitioners! Submit a session proposal for FOSSY! The deadline for submissions is ~~May 14~~ May 18 (Edit: The deadline has been moved.).

A photo of a classroom full of people on laptops. The seating is tiered, with people in the back higher up. — A Spring 2015 CDSC workshop. Photo by Sage Ross, licensed Creative Commons Attribution Share Alike

Although scholars publish hundreds of papers about free and open source software, online governance, licensing, and other topics very relevant to FOSS communities each year, much of this work never makes it out of academic journals and conferences and back to FOSS communities. At the same time, FOSS communities have a range of insights, questions, and data that researchers studying FOSS could benefit from enormously.

This gap between research and communities inspired us to propose a track at FOSSY, the Free and Open Source Software Yearly conference. Our track is called FOSS Research for All: The Science of Community.

The goal of this track is to build bridges between FOSS communities and scientific research conducted with and about FOSS communities. We hope to provide opportunities for community members to hear about exciting results from researchers; opportunities for researchers to learn from the FOSS community members; and spaces for the FOSS community to think together about how to improve FOSS projects by leveraging research insights and research.

This track will include opportunities for:

researchers to talk with practitioners (about their research)
practitioners to talk with researchers (about their needs)
researchers to talk with other researchers (for learning and collaboration)

FOSSY runs July 13 – 16, and we are hoping to have 2-3 days of content. Towards that end, we are seeking proposals! If you are a researcher with work of relevance to FOSS community members; a FOSS community member with experiences or opportunities relevant to research; or just want to be involved in this conversation, please consider proposing in one of the following formats:

Short Talks. Do you have a recent project to share in some depth? A topic that needs time to unpack? Take 20 minutes to present your thoughts. Following each presentation there will be time for group discussion to help participants apply your work to their practice.
Lightning Talks. Want to make a focused point, pitch, or problem report to a great audience? Bring your 5-minute talk to our lightning round.
Panels. We will be facilitating dialogue between researchers and community members. Would you be willing to share your thoughts as a panelist? Let us know your expertise and a few notes on your perspective so that we can develop a diverse and engaging panel. Contact us directly (details below) if you are interested in being on a panel.

If you have an idea that doesn’t fit into these formats, let’s chat! You can reach out to Kaylea (kaylea@uw.edu), Molly (molly.deblanc@northwestern.edu) or submit your idea as a proposal via the FOSSY form.

Submissions are non-archival, so we welcome ongoing, completed, and already published research work. Non-archival means that presentation of work at FOSSY does not constitute a publication. It’s just a way to get your work out there! Work that synthesizes or draws across a body of published papers is particularly welcome.

What kind of research are you looking to have presented?

We are interested in any topic related to FOSS communities! This might include research from computing (including software engineering, computer security, social computing, HCI), the social sciences and humanities (including management, philosophy, law, economics, sociology, communication, and more), information sciences, and beyond.

For example: how to identify undermaintained FOSS packages and what to do about it; community growth and how to find success in small communities; effective rule making and enforcement in online communities.

If it involves FOSS or is of interest to FOSS practitioners, we welcome it! We are eager to help you put your results into the hands of practitioners who can use your findings to inform their own community’s practices and policies on social, governance, and technical topics.

We hope to welcome scholars and researchers from across academia, government, industry, or wherever else you are from!

Who will I be speaking with?

We expect a multi-disciplinary audience. FOSSY will be bringing together free and open source software practitioners including community managers, designers, legal experts, non-profit and project leaders, technical developers, technical writers, and researchers.

The track is being organized by:

Kaylea Champion, Community Data Science Collective and the University of Washington

Molly de Blanc, Community Data Science Collective and Northwestern University

Benjamin Mako Hill, Community Data Science Collective and the University of Washington

Aaron Shaw, Community Data Science Collective and Northwestern University

April 19, 2023April 18, 2023

Kaylea to present at ‘Women in Data Science’ Conference

Women in Data Science Puget Sound is part of a 50+-country conference series founded and organized in cooperation with Stanford University’s Data Science coalition. Anyone may attend, regardless of gender: events feature a speaker lineup composed of women in data science. The Puget Sound event is Tuesday, April 25 at the Expedia HQ in Seattle, and numerous affiliated regional and online events are scheduled in the coming weeks.

If you’re in the Seattle area, you might like to catch CDSC member Kaylea presenting a workshop! Here’s the pitch for attending her beginner-friendly session:

Let’s Re-think Political Bias & Build Our Own Classifier

How can we think about political bias without falling into assumptions about who's on what side and what that means?

Data science and ML offer us an alternative: we can parse political speech about a topic and use NLP/ML techniques to classify articles we scrape from the web.

In this hands-on workshop, we'll parse the Congressional Record, build a classifier, scrape search results, and analyze texts. You'll walk away with your own example of how to use data science to analyze political framing.

The full lineup of speakers for the Puget Sound conference is posted here. Tickets for the single-day event are $80 (see this link to request a discount code for half off).

Topics on the schedule for this event look juicy if quant work is your jam: AI, BERT, hypergraphs, visualization, forecasting, quantum computing, causal inference, survival analysis, writing better code and career management, with examples ranging from search, sales, and supply chain to economic disparity, DNA sequencing and saving wildlife!

April 18, 2023April 17, 2023

Excavating online futures past

Cover of Kevin Driscoll's book, The Modem World.

The International Journal of Communication (IJOC) has just published my review of Kevin Driscoll’s The Modem World: A Prehistory of Social Media (Yale UP, 2022).

In The Modem World, Driscoll provides an engaging social history of Bulletin Board Systems (BBSes), an early, dial-up precursor to social media that predated the World Wide Web. You might have heard of the most famous BBSes—likely Stuart Brand’s Whole Earth ‘Lectronic Link, or the WELL—but, as Driscoll elaborates, there were many others. Indeed, thousands of decentralized, autonomous virtual communities thrived around the world in the decades before the Internet became accessible to the general public. Through Driscoll’s eyes, these communities offer a glimpse of a bygone sociotechnical era and that prefigured and shaped our own in numerous ways. The “modem world” also suggests some paths beyond our current moment of disenchantment with the venture-funded, surveillance capitalist, billionaire-backed platforms that dominate social media today.

The book, like everything of Driscoll’s that I’ve ever read, is both enjoyable and informative and I recommend it for a number of reasons. I also (more selfishly) recommend the book review, which was fun to write and is just a few pages long. I got helpful feedback along the way from Yibin Fan, Kaylea Champion, and Hannah Cutts.

Because IJOC is an open access journal that publishes under a CC-BY-NC-ND license, you can read the review without paywalls, proxies, piracy, etc. Please feel free to send along any comments or feedback! For example, at least one person (who I won’t name here) thinks I should have emphasized the importance of porn in Driscoll’s account more heavily! While porn was definitely an important part of the BBS universe, I didn’t think it was such a central component of The Modem World. Ymmv?

Shaw, A. (2023). Kevin Driscoll, The Modem World: A Prehistory of Social Media. International Journal Of Communication, 17, 4. Retrieved from https://ijoc.org/index.php/ijoc/article/view/21215/4162

March 31, 2023April 6, 2023

The social structure of new wiki communities

A new paper that our that our group has published seeks to test whether the kind of communication patterns associated with successful offline teams also predict success in online collaborative settings. Surprisingly, we find that it does not. In the rest of this blog post, we summarize that research and unpack that result.

Many of us have been part of a work team where everyone clicked. Everyone liked and respected each other, maybe you even hung out together outside of work. In a team like that, when someone asks you to cover a shift, or asks you to stay late to help them finish a project, you do it.

This anecdotal experience that many of us have is borne out by research. When members of work groups in corporate settings feel integrated into a group, and particularly when their identity is connected to their group membership, they are more willing to contribute to the group’s goals. Integrative groups (where there isn’t a strong hierarchy and where very few people are on the periphery) are also able to communicate and coordinate their work better.

One way to measure whether a group is “integrative” is to look at the group’s conversation networks, as shown in the figure below. Groups where few people are on the periphery (like on the left) usually perform better along a number of dimensions, such as creativity and productivity.

Examples of two possible configurations of a work group. The work group on the left is much more “integrative,” and we would expect it to be more creative and productive.

In our new paper, we set out to look for evidence that early online wiki communities at Fandom.com work the same way as work groups. When communities are getting started, there are lots of reasons to think that they would also benefit from integrative networks. Their members typically don’t know each other and communicate mostly via text—conditions that should make building a shared identity tough. In addition, they are volunteers who can easily leave at any time. The research on work groups made us think that integrative social structures would be especially important in making new wikis successful.

Communication network of the Spongebob wiki after 700 edits

In order to measure the social structure of these communities, we created communication networks for almost 1,000 wikis for the talk that happened during their firs 700 main page edits. Connections between people were based on who talked to whom on Talk pages. These are wiki pages connected to each page and each registered user on a wiki. We connected users who talked to each other at least a few times on the same talk pages, and looked at whether how integrative a communication network was predicted 1) how much people contributed and 2) how long a wiki remained active.

Surprisingly, we found that no matter how we measured communication networks, and no matter how we measured success, integrative network measures were not good at predicting that a wiki would survive or be productive. While a few of our control variables helped to predict productivity and survival, none of the network measures (nor all of them taken together) helped much to predict either of our success measures, as shown in Figures 5 and 6 from the paper.

Figure 5. Estimated coefficients predicting the productivity of a wiki.

Figure 6. Estimated coefficients predicting how quickly a wiki will become inactive.

So, what is going on here?

We have a few possible explanations for why communication network structures don’t seem to matter. One is that group identity for wiki members may not be influenced much by network structure. In a work group, it can be painfully obvious if you are on the periphery and not included in conversations or activities. Even though wiki conversations are technically all public and visible, in practice it’s very easy for group members to be unaware of conversations happening in other parts of the site. This idea is supported by research led by Sohyeon Hwang, which showed that people can build identity in an online community even without personal relationships.

Another complementary explanation for how groups coordinate work without integrative communication networks is that wiki software helps to organize what needs to be done without explicit communication. Much of this happens just because the central artifact of the community—the wiki—is continuously updated, so it is (relatively) clear what has been done and what needs to be done. In addition, there are opportunities for stigmergy. Stigmergy occurs when actors modifying the environment as a way of communicating. Then, others make decisions based on observing the environment. The canonical example is ants who leave pheremone trails for other ants to find and follow.

In wikis, this can be accomplished in a few ways. For example, contributors can create a link to a page that doesn’t yet exist. By default, these show up as red links, suggesting to others that a page needs to be created.

A final possible explanation for our results is based on how easy it is to join and leave online communities. It may be that integrative structures are so important because they help groups to overcome and navigate conflicts; in online communities contributors may be more likely to simply disengage instead of trying to resolve a conflict.

As we conclude in the paper:

Why do communication networks—important predictors of group performance outcomes across diverse domains—not predict productivity or survival in peer production? Our findings suggest that the relationship of communication structure to effective collaboration and organization is not universal but contingent. While all groups require coordination and undergo social influence, groups composed of different types of people or working in different technological contexts may have different communicative needs. Wikis provide a context where coordination via stigmergy may suffice and where the role of cheap exit as well as the difficulty of group-level conversation may lead to consensus-by-attrition.

We hope that others will help us to study some of these mechanisms more directly, and look forward to talking more with researchers and others interested in how and why online groups succeed.

The full citation for the paper is: Foote, Jeremy, Aaron Shaw, and Benjamin Mako Hill. 2023. “Communication Networks Do Not Predict Success in Attempts at Peer Production.” Journal of Computer-Mediated Communication 28 (3): zmad002. https://doi.org/10.1093/jcmc/zmad002.

We have also released replication materials for the paper, including all the data and code used to conduct the analyses.

March 22, 2023April 5, 2023

Effects of Algorithmic Flagging on Fairness: Quasi-experimental Evidence from Wikipedia

Many online platforms are adopting machine learning as a tool to maintain order and high quality information in the face of massive influxes of of user generated content. Of course, machine learning algorithms can be inaccurate, biased or unfair. How do signals from machine learning predictions shape the fairness of online content moderation? How can we measure an algorithmic flagging system’s effects?

In our paper published at CSCW 2021, I (Nate TeBlunthuis) together with Benjamin Mako Hill and Aaron Halfaker analyzed the RCFilters system: an add-on to Wikipedia that highlights and filters edits that a machine learning algorithm called ORES identifies as likely to be damaging to Wikipedia. This system has been deployed on large Wikipedia language editions and is similar to other algorithmic flagging systems that are becoming increasingly widespread. Our work measures the causal effect of being flagged in the RCFilters user interface.

Screenshot of Wikipedia edit metadata on Special:RecentChanges with RCFilters enabled. Highlighted edits with a colored circle to the left side of other metadata are flagged by ORES. Different circle and highlight colors (white, yellow, orange, and red in the figure) correspond to different levels of confidence that the edit is damaging. RCFilters does not specifically flag edits by new accounts or unregistered editors, but does support filtering changes by editor types.

Our work takes advantage of the fact that RCFilters, like many algorithmic flagging systems, create discontinuities in the relationship between the probability that a moderator should take action and whether a moderator actually does. This happens because the output of machine learning systems like ORES is typically a continuous score (in RCFilters, an estimated probability that a Wikipedia edit is damaging), while the flags (in RCFilters, the yellow, orange, or red highlights) are either on or off and are triggered when the score crosses some arbitrary threshold. As a result, edits slightly above the threshold are both more visible to moderators and appear more likely to be damaging than edits slightly below. Even though edits on either side of the threshold have virtually the same likelihood of truly being damaging, the flagged edits are substantially more likely to be reverted. This fact lets us use a method called regression discontinuity to make causal estimates of the effect of being flagged in RCFilters.

Charts showing the probability that an edit will be reverted as function of ORES scores in the neighborhood of the discontinuous threshold that triggers the RCfilters flag. The jump in the increase in reversion chances is larger for registered editors compared to unregistered editors at both thresholds.

To understand how this system may effect the fairness of Wikipedia moderation, we estimate the effects of flagging on edits on different groups of editors. Comparing the magnitude of these estimates lets us measure how flagging is associated with several different definitions of fairness. Surprisingly, we found evidence that these flags improved fairness for categories of editors that have been widely perceived as troublesome—particularly unregistered (anonymous) editors. This occurred because flagging has a much stronger effect on edits by the registered than on edits by the unregistered.

We believe that our results are driven by the fact algorithmic flags are especially helpful for finding damage that can’t be easily detected otherwise. Wikipedia moderators can see the editor’s registration status in the recent changes, watchlists, and edit history. Because unregistered editors are often troublesome, Wikipedia moderators’ attention is often focused on their contributions, with or without algorithmic flags. Algorithmic flags make damage by registered editors (in addition to unregistered editors) much more detectable to moderators and so help moderators focus on damage overall, not just damage by suspicious editors. As a result, the algorithmic flagging system decreases the bias that moderators have against unregistered editors.

This finding is particularly surprising because the ORES algorithm we analyzed was itself demonstrably biased against unregistered editors (i.e., the algorithm tended to greatly overestimate the probability that edits by these editors were damaging). Despite the fact that the algorithms were biased, their introduction could still lead to less biased outcomes overall.

Our work shows that although it is important to design predictive algorithms to not have such biases, it is equally important to study fairness at the level of the broader sociotechnical system. Since we first published a preprint of our paper, a followup piece by Leijie Wang and Haiyi Zhu replicated much of our work and showed that differences between different Wikipedia communities may be another important factor driving the effect of the system. Overall, this work suggests that social signals and social context can interact with algorithmic signals and together these can influence behavior in important and unexpected ways.

The full citation for the paper is: TeBlunthuis, Nathan, Benjamin Mako Hill, and Aaron Halfaker. 2021. “Effects of Algorithmic Flagging on Fairness: Quasi-Experimental Evidence from Wikipedia.” Proceedings of the ACM on Human-Computer Interaction 5 (CSCW): 56:1-56:27. https://doi.org/10.1145/3449130.

We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.

March 9, 2023March 9, 2023

Community Dialogue on Accountable Governance and Data

Our fourth Community Dialogue covered topics on accountable governance and data leverage as a tool for accountable governance. It featured Amy X. Zhang (University of Washington) and recent CDSC graduate Nick Vincent (Northwestern, UC Davis).

Designing and Building Governance in Online Communities (Amy X. Zhang)

This session discussed different methods of engagement between communities and their governance structures, different models of governance, and empirical work to understand tensions within communities and governance structures. Amy presented PolicyKit, a tool her team built in response to what they learned from their research, which will also help to continue to better understand governance.

Can We Solve Emerging Problems in Technology and AI By Giving Communities Data Leverage? (Nick Vincent)

Nick Vincent looked at the question of how to hold governance structures accountable through collective action. He asked how groups can leverage control of data and the potential implications of data leverage on social structures and technical development.

If you are interested in attending a future Dialogue, sign up for our very low-volume mailing list.