The Community Data Science Collective (CDSC) is an interdisciplinary research group made up of faculty and students at the University of Washington Department of Communication, the Northwestern University Department of Communication Studies, the Carleton College Computer Science Department, and the Purdue University School of Communication.
In The Modem World, Driscoll provides an engaging social history of Bulletin Board Systems (BBSes), an early, dial-up precursor to social media that predated the World Wide Web. You might have heard of the most famous BBSes—likely Stewart Brand’s Whole Earth ‘Lectronic Link, or the WELL—but, as Driscoll elaborates, there were many others. Indeed, thousands of decentralized, autonomous virtual communities thrived around the world in the decades before the Internet became accessible to the general public. Through Driscoll’s eyes, these communities offer a glimpse of a bygone sociotechnical era that prefigured and shaped our own in numerous ways. The “modem world” also suggests some paths beyond our current moment of disenchantment with the venture-funded, surveillance capitalist, billionaire-backed platforms that dominate social media today.
The book, like everything of Driscoll’s that I’ve ever read, is both enjoyable and informative, and I recommend it for a number of reasons. I also (more selfishly) recommend the book review, which was fun to write and is just a few pages long. I got helpful feedback along the way from Yibin Fan, Kaylea Champion, and Hannah Cutts.
Because IJOC is an open access journal that publishes under a CC-BY-NC-ND license, you can read the review without paywalls, proxies, piracy, etc. Please feel free to send along any comments or feedback! For example, at least one person (who I won’t name here) thinks I should have emphasized the importance of porn in Driscoll’s account more heavily! While porn was definitely an important part of the BBS universe, I didn’t think it was such a central component of The Modem World. YMMV?
Many online platforms are adopting machine learning as a tool to maintain order and high quality information in the face of massive influxes of user-generated content. Of course, machine learning algorithms can be inaccurate, biased, or unfair. How do signals from machine learning predictions shape the fairness of online content moderation? How can we measure an algorithmic flagging system’s effects?
In our paper published at CSCW 2021, I (Nate TeBlunthuis) together with Benjamin Mako Hill and Aaron Halfaker analyzed the RCFilters system: an add-on to Wikipedia that highlights and filters edits that a machine learning algorithm called ORES identifies as likely to be damaging to Wikipedia. This system has been deployed on large Wikipedia language editions and is similar to other algorithmic flagging systems that are becoming increasingly widespread. Our work measures the causal effect of being flagged in the RCFilters user interface.
Our work takes advantage of the fact that RCFilters, like many algorithmic flagging systems, creates discontinuities in the relationship between the probability that a moderator should take action and whether a moderator actually does. This happens because the output of machine learning systems like ORES is typically a continuous score (in RCFilters, an estimated probability that a Wikipedia edit is damaging), while the flags (in RCFilters, the yellow, orange, or red highlights) are either on or off and are triggered when the score crosses some arbitrary threshold. As a result, edits slightly above the threshold are both more visible to moderators and appear more likely to be damaging than edits slightly below. Even though edits on either side of the threshold have virtually the same likelihood of truly being damaging, the flagged edits are substantially more likely to be reverted. This fact lets us use a method called regression discontinuity to make causal estimates of the effect of being flagged in RCFilters.
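To make the intuition concrete, here is a minimal simulation sketch of the regression discontinuity logic described above. This is not our paper’s actual analysis code or data: the scores, threshold, effect size, and bandwidth below are all hypothetical, chosen only to illustrate how fitting local linear regressions on either side of a flagging cutoff recovers the causal jump in reversion probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each edit has a continuous damage score (like ORES),
# and a flag switches on when the score crosses an arbitrary threshold.
n = 100_000
score = rng.uniform(0, 1, n)
threshold = 0.5
flagged = score >= threshold

# Reversion probability rises smoothly with the score, plus a discrete
# jump caused purely by the flag itself (the causal effect we want).
true_effect = 0.15
p_revert = 0.10 + 0.40 * score + true_effect * flagged
reverted = rng.uniform(0, 1, n) < p_revert

# Regression discontinuity: fit a local linear regression within a
# bandwidth on each side of the threshold, then compare the two
# fitted values exactly at the cutoff.
bw = 0.1
left = (score >= threshold - bw) & (score < threshold)
right = (score >= threshold) & (score < threshold + bw)

def intercept_at_cutoff(x, y):
    # Fit y = a + b * (x - threshold); 'a' is the fitted value at the cutoff.
    slope, intercept = np.polyfit(x - threshold, y.astype(float), 1)
    return intercept

estimate = (intercept_at_cutoff(score[right], reverted[right])
            - intercept_at_cutoff(score[left], reverted[left]))
print(round(estimate, 2))  # should be close to true_effect (0.15)
```

Edits just below and just above 0.5 are nearly identical in underlying quality, so the gap between the two fitted values at the cutoff isolates the effect of the flag itself rather than the effect of being a worse edit.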
To understand how this system may affect the fairness of Wikipedia moderation, we estimate the effects of flagging on edits by different groups of editors. Comparing the magnitude of these estimates lets us measure how flagging is associated with several different definitions of fairness. Surprisingly, we found evidence that these flags improved fairness for categories of editors that have been widely perceived as troublesome—particularly unregistered (anonymous) editors. This occurred because flagging has a much stronger effect on edits by registered editors than on edits by unregistered editors.
We believe that our results are driven by the fact that algorithmic flags are especially helpful for finding damage that can’t be easily detected otherwise. Wikipedia moderators can see an editor’s registration status in recent changes, watchlists, and edit histories. Because unregistered editors are often troublesome, Wikipedia moderators’ attention is often focused on their contributions, with or without algorithmic flags. Algorithmic flags make damage by registered editors (in addition to unregistered editors) much more detectable to moderators and so help moderators focus on damage overall, not just damage by suspicious editors. As a result, the algorithmic flagging system decreases the bias that moderators have against unregistered editors.
This finding is particularly surprising because the ORES algorithm we analyzed was itself demonstrably biased against unregistered editors (i.e., the algorithm tended to greatly overestimate the probability that edits by these editors were damaging). Despite the fact that the algorithms were biased, their introduction could still lead to less biased outcomes overall.
Our work shows that although it is important to design predictive algorithms to not have such biases, it is equally important to study fairness at the level of the broader sociotechnical system. Since we first published a preprint of our paper, a followup piece by Leijie Wang and Haiyi Zhu replicated much of our work and showed that differences between different Wikipedia communities may be another important factor driving the effect of the system. Overall, this work suggests that social signals and social context can interact with algorithmic signals and together these can influence behavior in important and unexpected ways.
The full citation for the paper is: TeBlunthuis, Nathan, Benjamin Mako Hill, and Aaron Halfaker. 2021. “Effects of Algorithmic Flagging on Fairness: Quasi-Experimental Evidence from Wikipedia.” Proceedings of the ACM on Human-Computer Interaction 5 (CSCW): 56:1-56:27. https://doi.org/10.1145/3449130.
Our fourth Community Dialogue covered topics on accountable governance and data leverage as a tool for accountable governance. It featured Amy X. Zhang (University of Washington) and recent CDSC graduate Nick Vincent (Northwestern, UC Davis).
Designing and Building Governance in Online Communities (Amy X. Zhang)
This session discussed different methods of engagement between communities and their governance structures, different models of governance, and empirical work to understand tensions within communities and governance structures. Amy presented PolicyKit, a tool her team built in response to what they learned from their research and one that will also help researchers continue to better understand governance.
Can We Solve Emerging Problems in Technology and AI By Giving Communities Data Leverage? (Nick Vincent)
Nick Vincent looked at the question of how to hold governance structures accountable through collective action. He asked how groups can leverage control of data and the potential implications of data leverage on social structures and technical development.
Inequality and discrimination in the labor market is a persistent and sometimes devastating problem for job seekers. Increasingly, labor is moving to online platforms, but labor inequality and discrimination research often overlooks work that happens on such platforms. Do research findings from traditional labor contexts generalize to the online realm? We have reason to think perhaps not, since entering the online labor market requires specific technical infrastructure and skills (as we showed in this paper). Besides, hiring processes for online platforms look significantly different: these systems use computational structures to organize labor at a scale that exceeds any hiring operation in the traditional labor market.
To understand what research on patterns of inequality and discrimination in the gig economy is out there and to identify remaining puzzles, I (Floor) systematically gathered, analyzed, and synthesized studies on this topic. The result is a paper recently published in New Media & Society.
I took a systematic approach in order to capture all the different strands of inquiry across various academic fields. These different strands might use different methods and even different language but, crucially, still describe similar phenomena. For this review, Siying Luo (research assistant on this project) and I gathered literature from five academic databases covering multiple disciplines. By sifting through many journal articles and conference proceedings, we identified 39 studies of participation and success in the online labor market.
I found three approaches to the study of inequality and discrimination in the gig economy. All address distinct research questions drawing on different methods and framing (see the table below for an overview).
Approach 1 asks who does and who does not engage in online labor. This strand of research takes into account the voices of both those who have pursued such labor and those who have not. Five studies take this approach, of which three draw on national survey data and two others examine participation among a specific population (such as older adults).
Approach 2 asks who online contractors are. Some of this research describes the sociodemographic composition of contractors by surveying them or by analyzing digital trace data. Other studies focus on labor outcomes, identifying who among those that pursue online labor actually land jobs and generate an income. You might imagine a study asking whether male contractors make more money on an online platform than female contractors do.
Approach 3 asks what social biases exist in the hiring process, both on the side of individual users making hiring decisions and the algorithms powering the online labor platforms. Studies taking this approach tend to rely on experiments that test the impact of some manipulation in the contractor’s sociodemographic background on an outcome, such as whether they get featured by the platform or whether they get hired.
Extended pipeline of online participation inequalities
In addition to identifying these three approaches, I map the outcome variables of all studies across an extended version of the so-called pipeline of participation inequalities (as coined and tested in this paper). This model breaks down the steps one needs to take before being able to contribute online, presenting them in the form of a pipeline. Studying online participation as stages of the pipeline allows for the identification of barriers since it reveals the spots where people face obstacles and drop out before fully participating. Mapping the literature on inequality and discrimination in the gig economy across stages of a pipeline proved helpful in understanding and visualizing what parts of the process of becoming an online contractor have been studied and what parts require more attention.
I extended the pipeline of participation inequalities to fit the process of participating in the gig economy. This form of online participation does not only require having the appropriate access and skills to participate, but also requires garnering attention and getting hired. The extended pipeline model has eleven stages: from having heard of a platform to receiving payment as well as reviews and ratings for having performed a job. The figure below shows a visualization of the pipeline with the number of studies that study an outcome variable associated with each stage.
When mapping the studies across the pipeline, we find that two stages have been studied much more than others. Prior literature primarily examines whether individuals who pursue work online are getting hired and receiving a payment. In contrast, the literature in this scoping review hardly examined earlier stages of the pipeline.
So, what should we take away?
After systematically gathering and analyzing the literature on inequality and discrimination in the online labor market, I want to highlight three takeaways.
One: Most of the research focuses on individual-level resources and biases as a source of unequal participation. This scoping review points to a need for future research to examine the specific role of the platform in facilitating inequality and discrimination.
Two: The literature thus far has primarily focused on behaviors at the end of the pipeline of participation inequalities (i.e., having been hired and received payment). Studying earlier stages is important as it might explain patterns of success in later stages. In addition, such studies are also worthwhile inquiries in their own right. Insights into who meets these conditions of participation and desired labor outcomes are valuable, for example, in designing policy interventions.
Three: Hardly any research looks at participation across multiple stages of the pipeline. Considering multiple stages in one study is important to identify the moments that individuals face obstacles and how sociodemographic factors relate to making it from one stage to the next.
Floor Fiers is a PhD candidate at Northwestern University in the Media, Technology, and Society program. They received support and advice from other members of the collective. Most notably, Siying Luo contributed greatly to this project as a research assistant.
Wikipedia provides the best and most accessible single source of information on the largest number of topics in the largest number of languages. If you’re anything like me, you use it all the time. If you (also like me) use Wikipedia to inform your research, teaching, or other sorts of projects that result in shared, public, or even published work, you may also want to cite Wikipedia. I wrote a short tutorial to help people do that more accurately and effectively.
The days when teachers and professors banned students from citing Wikipedia are perhaps not entirely behind us, but do you know what to do if you find yourself in a situation where it is socially/professionally acceptable to cite Wikipedia (such as one of my classes!) and you want to do so in a responsible, durable way?
More specifically, what can you do about the fact that any Wikipedia page you cite can and probably will change? How do you provide a useful citation to a dynamic web resource that is continuously in flux?
This question has come up frequently enough in my classes over the years that I drafted a short tutorial on doing better Wikipedia citations for my students back in 2020. It’s been through a few revisions since then and I don’t find it completely embarrassing, so I am blogging about it now in the hopes that others might find it useful and share more widely. Also, since it’s on my research group’s wiki, you (and anyone you know) can even make further revisions or chat about it with me on my user:talk page.
You might be thinking, "So wait, does this mean I can cite Wikipedia for anything?!" To which I would respond, "Just hold on there, cowboy."
Wikipedia is, like any other information source, only as good as the evidence behind it. In that regard, nothing about my recommendations here makes any of the information on Wikipedia any more reliable than it was before. You have to use other skills and resources to assess the quality of the information you’re citing on Wikipedia (e.g., the content/quality of the references used to support the claims made in any given article).
Like I said above, the problem this really tries to solve is more about how to best cite something on Wikipedia, given that you have some good reason to cite it in the first place.
One of the fun things about being in a large lab is getting to celebrate everyone’s accomplishments, wins, and the good stuff that happens. Here is a brief-ish overview of some real successes from 2022.
Graduations and New Positions
Our lab gained SIX new grad student members: Kevin Ackermann, Yibin Fan, Ellie Ross, Dyuti Jha, Hazel Chu, and Ryan Funkhouser. Kevin is a first year graduate student at Northwestern, and Yibin and Ellie are first year students at the University of Washington. Dyuti, Hazel, and Ryan joined us via Purdue and became Jeremy Foote’s first-ever advisees. We had quite a number of undergraduate RAs. We also gained Divya Sikka from Interlake High School.
Nick Vincent became Dr. Nick Vincent, Ph.D. (Northwestern). He will do a postdoc at the University of California Davis and the University of Washington. Molly de Blanc earned their master’s degree (New York University). Dr. Nate TeBlunthuis joined the University of Michigan as a post-doc, working with Professor Ceren Budak.
Kaylea Champion and Regina Cheng had their dissertation proposals approved, and Floor Fiers finished their qualifying exams and is now a Ph.D. candidate. Carl Colglazier finished his coursework.
Aaron Shaw started an appointment as the Scholar-in-Residence for King County, Washington, as well as Visiting Professor in the Department of Communication at the University of Washington.
As faculty, Jeremy Foote, Mako Hill, Sneha Narayan, and Aaron Shaw taught classes, as expected. As a class teaching assistant, Kaylea won an Outstanding Teaching Award! Floor also taught a public speaking class. CDSC members were also teaching assistants, led workshops, and gave guest lectures in classes.
This list is far from complete; it includes just some highlights!
Carl presented at ICA alongside Nicholas Diakopoulos, “Predictive Models in News Coverage of the COVID-19 Pandemic in the United States.”
Floor presented at the Eastern Sociological Society (ESS), AoIR (Association of Internet Researchers), and ICA. They won a top paper award at the National Communication Association (NCA): Walter, N., Suresh, S., Brooks, J. J., Saucier, C., Fiers, F., & Holbert, R. L. (2022, November). The Chaffee Principle: The Most Likely Effect of Communication…Is Further Communication. National Communication Association (NCA) National Convention, New Orleans, LA.
Kaylea had a whopping two papers at ICA, a keynote at the IEEE Symposium on Digital Privacy and Social Media, and presentations at CSCW Doctoral Consortium, a CSCW workshop, and the DUB Doctoral Consortium. She also participated in Aaron Swartz Day, SeaGL, CHAOSSCon, MozFest, and an event at UMASS Boston.
Molly also participated in Aaron Swartz Day, and a workshop at CSCW on volunteer labor and data.
Sohyeon was at GLF as a knowledge steward and presented two posters at the HCI+D Lambert Conference (one with Emily Zou and one with Charlie Kiene, Serene Ong, and Aaron). She also presented at ICWSM, had posters at ICSSI and IC2S2, and organized a workshop at CSCW. In addition to more traditional academic presentations, Sohyeon was on a fireside chat panel hosted by d/arc server, guest lectured at the University of Washington and Northwestern, and met with Discord moderators to talk about heterogeneity in online governance. Sohyeon also won the Half-Bake Off at the CDSC fall retreat.
We did a lot of public scholarship this year! Among presentations, leading workshops, and organizing public facing events, CDSC also ran the Science of Community Dialogue Series. Presenters from within CDSC included Jeremy Foote, Sohyeon Hwang, Nate TeBlunthuis, Charlie Kiene, Kaylea Champion, Regina Cheng, and Nick Vincent. Guest speakers included Dr. Shruti Sannon, Dr. Denae Ford, and Dr. Amy X. Zhang. To attend future Dialogues, sign up for our low-volume email list!
These events are organized by Molly, with assistance from Aaron and Mako.
How can communities develop and understand accountable governance? So many online environments rely on community members in profound ways without being accountable to them in direct ways. In this session, we will explore this topic and its implications for online communities and platforms.
First, Nick Vincent (Northwestern, UC Davis) will discuss the opportunities for so-called “data leverage” and will highlight the potential to push back on the “data status quo” to build compelling alternatives, including the potential for “data dividends” that allow a broader set of users to economically benefit from their contributions.
The idea of “data leverage” comes out of a basic, but little discussed fact: many technologies are highly reliant on content and behavioral traces created by everyday Internet users, and particularly online community members who contribute text, images, code, editorial judgment, rankings, ratings, and more. The technologies that rely on these resources include ubiquitous and familiar tools like search engines as well as new bleeding edge “Generative AI” systems that produce novel art, prose, code, and more. Because these systems rely on contributions from Internet users, collective action by these users (for instance, withholding content) has the potential to impact system performance and operators.
Next, Amy Zhang (University of Washington) will discuss how communities can think about their governance and the ways in which the distribution of power and decision-making are encoded into the online community software that communities use. She will then describe a tool called PolicyKit that has been developed with the aim of breaking out of common top-down models for governance in online communities to enable governance models that are more open, transparent, and democratic. PolicyKit works by integrating with a community’s platform(s) of choice for online participation (e.g., Slack, Github, Discord, Reddit, OpenCollective), and then provides tools for community members to create a wide range of governance policies and automatically carry out those policies on and across their home platforms. She will then conclude with a discussion of specific governance models and how they incorporate legitimacy and accountability in their design.
Wiki Education (a.k.a., WikiEdu) is an independent non-profit organization that promotes the integration of Wikipedia into education and classrooms. In pursuit of this mission, WikiEdu has created incredible resources for students and instructors, including tools that facilitate classroom assignments where students create and improve Wikipedia articles.
In courses at both Northwestern and the University of Washington, CDSC faculty and students have offered courses with Wikipedia assignments for over a decade. In the past two weeks, WikiEdu has featured the most recent instances of these courses on their blog.
The first WikiEdu post celebrated the work of a team of Northwestern students that included Carl Colglazier (TSB and CDSC Ph.D. student) and Hannah Yang (undergraduate Communication Studies major and former CDSC research assistant). The team, all members of the Online Communities & Crowds course I taught with CDSC Ph.D. student Sohyeon Hwang in Winter 2022, overhauled an article on Inclusive design in English Wikipedia. Since the article’s initial publication back in March, other Wikipedia editors have improved it further and it has attracted over 10,000 pageviews. Amazing work, team!
The second post celebrates UW Communication doctoral student Kaylea Champion, recipient of an Outstanding Teaching Award from the Communication Department on the strength of her work in another Winter 2022 undergraduate course on Online Communities (also taught by Benjamin Mako Hill) that features a Wikipedia assignment. Several of Kaylea’s students thought so highly of her work in the course that they collaborated in nominating her for the award. Kaylea enjoyed the experience enough that she’s about to offer the course again as the lead instructor at UW this upcoming Winter term. I should also note that Kaylea has been nominated for a university-wide award, but we won’t know the outcome of that process for a while yet. Congratulations, Kaylea!
The public recognition of CDSC students and teaching is gratifying and provides a great reminder of why assignments that ask students to edit Wikipedia are so valuable in the first place. Most fundamentally, editing Wikipedia engages students in the production of public, open access knowledge resources that serve a much greater and broader purpose than your typical term paper, pop quiz, or exam. When students develop encyclopedic materials on topics of their interest, motivated undergraduates like Hannah Yang can directly connect coursework with practical, real-world concerns in ways that build on the expertise of graduate students like Carl Colglazier. This kind of school work creates unusually high impact products. Kaylea Champion puts the idea eloquently in that WikiEdu post: “Instead of locking away my synthesis efforts in a paper no one but my instructors would read, the Wikipedia assignment pushed me to address the public.”
Just think, how many people ever read a word of most college (or high school or graduate school) term papers? By contrast, the Wikipedia articles created by our students have routinely been viewed over 100,000 times in aggregate by the end of the term in which we offer the course. Extrapolate this out over a decade and our students’ work has likely been read millions of times by now. As with other content on Wikipedia, this work will shape public discourse, including judicial decisions, scientific research, search engine results, and more. There’s absolutely nothing academic about that!
We had another Science of Community Dialogue! This most recent one was themed around informal learning, talking about communities as informal learning spaces and the sorts of tools and habits communities can adopt to help learners, mentors, and newcomers. We had presentations from Ruijia (Regina) Cheng (University of Washington, CDSC) and Dr. Denae Ford Robinson (Microsoft, University of Washington).
Regina Cheng covered three related research projects and relevant findings:
Ruijia Cheng and Benjamin Mako Hill. 2022. “Many Destinations, Many Pathways: A Quantitative Analysis of Legitimate Peripheral Participation in Scratch.” https://doi.org/10.1145/3555106
Ruijia Cheng, Sayamindu Dasgupta, and Benjamin Mako Hill. 2022. “How Interest-Driven Content Creation Shapes Opportunities for Informal Learning in Scratch: A Case Study on Novices’ Use of Data Structures.” https://doi.org/10.1145/3491102.3502124
Participants collaboratively put together three takeaways from Regina Cheng’s presentation.
We often talk about wanting to support “learning” in some general sense, but a critically important question to ask is “learning about what.” Let’s say we want people to learn three things: A, B, and C. The kinds of actions or behaviors that support learning goal A often have no effect on B and C. And sometimes they actively hurt it. We need to be more specific about what we want people to learn because there are tradeoffs.
Social support is wonderful in that users create examples and resources and answer questions. But it also has this narrowing effect. There’s a piling-on effect that makes it easier and easier (and more likely!) to learn the things that folks have learned before and less likely that people learn anything else.
Feedback is not about information transfer; it’s about relationships. To best promote learning, we should create rich, legitimate, inclusive social environments. These are perhaps good things to do anyway.
Dr. Denae Ford Robinson focused on free and open source software (FOSS) communities as a case study of learning communities. She covered theory, needs, and demonstrated tools designed to help with the mentorship and the learning process.
Community-driven settings like FOSS (and social-good oriented projects in particular) rely enormously on volunteers and/or people opting into participation in ways that create huge challenges related to promoting project sustainability: the most active participants are overloaded in a way that is a recipe for burnout.
The path to sustainability involves attracting, retaining, and then sustaining contributions and understanding these processes as both (a) part of the lifecycle of a user and (b) part of a set of dynamics and lifecycle within the community (e.g., dynamics of community growth).
Approach 1 involves providing new information to help maintainers understand how things are going in their communities. A lack of insight and easy access to data is a cause of inefficiency and burnout.
Approach 2 involves making specific, structured recommendations to maintainers based on the experience of others in the past to do things like add tags and to shape behavior.
Approach 3 involves automating aspects of identifying and recognizing work (and perhaps other tasks) as a way of promoting newcomer experiences and reducing the load on maintainers for doing that.
This event and some of the research presented in it were supported by multiple awards from the National Science Foundation (DGE-1842165; IIS-2045055; IIS-1908850; IIS-1910202), Northwestern University, the University of Washington, and Purdue University.