A systems approach to studying online communities

Systems theory is a broad and multidisciplinary scientific approach that studies how things (molecules or cells or organs or people or companies) interact with each other. It argues that understanding how something works requires understanding its relationships and interdependencies.

For example, if we want to predict whether a new online community will grow, an individual perspective might focus on who the founder is, what software it is running on, how well it is designed, etc. A systems approach would argue that it is at least as important to understand things like how many similar communities there are, how active they are, and whether the platform is growing or shrinking.

In a paper just published in Media and Communication, I (Jeremy) argue 1) that it is particularly important to use a systems lens to study online communities, 2) that online communities provide ideal data for taking these approaches, and 3) that there is already really neat research in this area and there should be more of it.

The role of platforms

So, why is it so important to study online communities as interdependent “systems”? The first reason is that many online communities have a really important interdependence with the platforms that they run on. Platforms like Reddit or Facebook provide the servers and software for millions of communities, which are run mostly independently by the community managers and moderators.

However, this is an ambivalent relationship and often the goals and desires of at least some moderators are at odds with those of the platform, and things like community bans from the platform side or protests from the community side are not uncommon. The ways that platform decisions influence communities and how communities can work together to influence platforms are inherently systems questions.

Low barriers to entry and exit

A second feature of online communities is the relative ease with which people can join or leave them. Unlike offline groups, which at least require participants to get dressed, do their hair, and show up somewhere, people can participate in an online community literally within seconds of learning that it exists.

Similarly, people can leave incredibly easily, and most people do. This figure shows the number of comments made per person across 100 randomly selected subreddits (each line represents a subreddit; axes are both log-scaled). In every case, the vast majority of people only commented once while a few people made many comments.
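For readers who want to explore this pattern themselves, here is a minimal sketch (not the original analysis code) of how such a plot could be produced, assuming a hypothetical `comments` DataFrame with `subreddit` and `author` columns:

```python
# A minimal sketch (not the original analysis code): plot comments per person
# for each subreddit on log-log axes, assuming a hypothetical `comments`
# DataFrame with `subreddit` and `author` columns.
import pandas as pd
import matplotlib.pyplot as plt

def plot_comment_distributions(comments: pd.DataFrame) -> None:
    fig, ax = plt.subplots()
    for subreddit, group in comments.groupby("subreddit"):
        # Comments per commenter, ordered from most to least active.
        counts = group["author"].value_counts().sort_values(ascending=False)
        ax.plot(range(1, len(counts) + 1), counts.values, alpha=0.3)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Commenter rank within subreddit")
    ax.set_ylabel("Number of comments")
    plt.show()
```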

Fuzzy boundaries

Finally, it’s often really difficult to draw clear boundaries around where one online community ends and another begins. For example, is all of Wikipedia one “community”? It might make sense to think of a language edition, a WikiProject, or even a single page as a community, and researchers have done all of the above. Even on platforms like Reddit, where there is a clear delineation between communities, there are dependencies, with people and conversations moving across and between communities on similar topics.

In other words, online communities are semi-autonomous, interdependent, contingent organizations, deeply influenced by their environments. Online community scholars have often ignored this larger context, but systems theory gives us a rich set of tools for studying these interdependencies. One reason these tools are such a good fit is that online communities provide ideal data.

Data from Online Communities

Systems theory is not new – many of the main concepts were developed in the 1950s and 1960s or earlier. Organizational communication researchers saw how applicable these ideas were, and many researchers proposed treating organizations as systems.

However, it was really tough to get the data needed to do systems-based research. To study a group or organization as a system, you need to know about not only the internal workings of the group, but how it relates to other groups, how it is influenced by and influences its environment, etc. Gathering data about even one group was difficult and expensive; getting the data to study many groups and how they interact with each other over time was impossible.

The internet has entered the chat

Online communities provide the kind of data that these earlier researchers could have only dreamed of. Instead of data about one organization, platforms store data about thousands of organizations. And this is not just high-level data about activity levels or participation; rather, we often have longitudinal, full-text conversations of millions of people as they interact within and move between communities.

Systems Approaches

In part, this article is a call for researchers to think more explicitly about online communities as systems, and to apply systems theory as a way of understanding how online communities work and how we can design research projects to understand them better. It is also an attempt to highlight strands of research that are already doing this. In the paper, I talk about four: Community Comparisons and Interactions, Individual Trajectories, Cross-level Mechanisms, and Simulating Emergent Behavior. Here, I’ll focus on just two.

Individual Trajectories

Figure from Panciera, K., Halfaker, A., & Terveen, L. (2009). Wikipedians are born, not made: A study of power editors on Wikipedia. Proceedings of the ACM 2009 International Conference on Supporting Group Work, 51–60. https://doi.org/10.1145/1531674.1531682

The first is what I call “Individual Trajectories”. In this approach, researchers can look at how individual people behave across a platform. One of the neat things about having longitudinal, unobtrusively collected data is that we can identify something interesting about users and go “back in time” to look for differences in earlier behavior. For example, in the plot above, Panciera et al. identified people who became active Wikipedia editors and then went back to look at how those editors’ earliest behavior on the site differed from that of typical editors.
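As a rough illustration of this kind of “go back in time” analysis (a sketch under assumed data, not Panciera et al.’s actual code), one could compare the first-week activity of users who later became power editors with that of everyone else:

```python
# A sketch under assumed data (not Panciera et al.'s code): compare first-week
# activity of users who later became power editors with everyone else, given a
# hypothetical `edits` DataFrame with `user` and `timestamp` columns and a set
# of `power_editors` user names.
import pandas as pd

def first_week_activity(edits: pd.DataFrame, power_editors: set) -> pd.DataFrame:
    first_seen = edits.groupby("user")["timestamp"].transform("min")
    early = edits[edits["timestamp"] <= first_seen + pd.Timedelta(days=7)]
    counts = early.groupby("user").size().rename("first_week_edits").reset_index()
    counts["power_editor"] = counts["user"].isin(power_editors)
    # Summarize early activity separately for future power editors and others.
    return counts.groupby("power_editor")["first_week_edits"].describe()
```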

Researchers could and should do more work that looks at how people move between communities, and how communities influence the behavior of their members.

Simulating Emergent Behavior

The second approach is to use simulations to study emergent behaviors. Agent-based modeling software like NetLogo or Mesa allows researchers to create virtual worlds, where computational “agents” act according to theories of how the world works. Many communication theories make predictions about how individual‐level behavior produces higher‐level patterns, often through feedback loops (e.g., the Spiral of Silence theory). If agent-based models don’t produce those patterns, then we know that something about the theory—or its computational representation—is wrong.

Model of misinformation spread, from Hu et al. (under review)
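To make the idea concrete, here is a toy, self-contained sketch I wrote for illustration (not a model from the paper or the special issue) of a Spiral of Silence dynamic: agents speak up only when a random sample of current speakers gives them enough perceived support, and sampling noise can silence the minority in a self-reinforcing loop.

```python
# A toy sketch written for illustration (not a model from the paper): a Spiral
# of Silence dynamic in which agents speak only when a random sample of current
# speakers gives them enough perceived support.
import random

def spiral_of_silence(n_agents=1000, minority_share=0.45, sample_size=20,
                      threshold=0.4, steps=30, seed=0):
    rng = random.Random(seed)
    opinions = [1 if rng.random() < minority_share else 0 for _ in range(n_agents)]
    willing = [True] * n_agents  # everyone starts out willing to speak
    for step in range(steps):
        speakers = [opinions[i] for i in range(n_agents) if willing[i]]
        if not speakers:
            break
        print(f"step {step}: minority share among speakers = "
              f"{sum(speakers) / len(speakers):.2f}")
        for i, opinion in enumerate(opinions):
            # Each agent gauges the opinion climate from a random sample of speakers.
            sample = [rng.choice(speakers) for _ in range(sample_size)]
            support = sum(1 for s in sample if s == opinion) / sample_size
            willing[i] = support >= threshold  # fall silent without enough support

if __name__ == "__main__":
    spiral_of_silence()
```

With these (arbitrary) parameters, the minority’s share among speakers typically shrinks step by step, which is exactly the kind of individual-to-aggregate feedback that agent-based models make explicit.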

Agent-based modeling has received some attention from communication researchers lately, including a wonderful special issue recently published in Communication Methods and Measures; the editorial article makes some great arguments for the promise and benefits of simulations for communication research.

New Opportunities

It is a really exciting time to be a computational social scientist, especially one that is interested in online organizations and organizing. We have only scratched the surface of what we can learn from the data that is pouring down around us, especially when it comes to systems theory questions. Tools, methods, and computational advances are constantly evolving and opening up new avenues of research.

Of course, taking advantage of these data sources and computational advances requires a different set of skills than Communication departments have traditionally focused on, and complicated, large-scale analyses require the use of supercomputers and extensive computational expertise.

However, there are many approaches like agent-based modeling or simple web scraping that can be taught to graduate students in one or two semesters, and open up lots of possibilities for doing this kind of research.

I’d love to talk more about these ideas—please reach out, or if you are coming to ICA, come talk to me!

Community Data Science Collective at ICA 2022

The International Communication Association (ICA)’s 72nd annual conference is coming up in just a couple of weeks. This year, the conference takes place in Paris and a subset of our collective is flying out to present work in person. We are looking forward to meeting up, talking research, and eating croissants. À bientôt!

ICA takes place from Thursday, May 26th to Monday, May 30th, and we are presenting a total of ten (!!) times. All presentations given by members of the collective are scheduled between Friday and Sunday.

Friday

We start off with a presentation by Nathan TeBlunthuis on Friday at 11.00 AM, in Room 351 M (Palais des Congres). In a high-density paper session on Computational Approaches to Online Communities, Nate will present a paper entitled “Dynamics of Ecological Adaptation in Online Communities.”

Later that same day, at 3.30 PM in the Amphitheatre Havana (level 3; Palais des Congres), Carl Colglazier will discuss a paper that he collaborated on with Nick Diakopoulos: “Predictive Models in News Coverage of the COVID-19 Pandemic in the U.S.” This paper session is part of the ICA division Journalism Studies.

Saturday

On Saturday, Floor Fiers will present in the paper session “Impression Management Online: FabriCATing An Image.” Their project, which they wrote with Nathan Walter, discusses “Comments on Airbnb and the Potential for Racial Bias” at 2.00 PM in Regency 1 (Hyatt).

Shortly after, that same afternoon, you’ll find two of our poster presentations at 5.00 PM in the Exhibit Hall (Havana; Palais des Congres, level 3). In one of them, Jeremy Foote will discuss his take on “a systems approach to studying online communities.”

The other poster, presented at the same time and place, is by Kaylea Champion and Benjamin Mako Hill on “Resisting Taboo in the Collaborative Production of Knowledge: Evidence from Wikipedia.”

Sunday

Most of our presentations are on the fourth day of the conference. At 9.30 AM, we’ll be presenting in three locations at the same time! First, Floor will discuss their paper “Inequality and Discrimination in the Online Labor Market: a Scoping Review” in Room 311+312 (Palais des Congres). This presentation is part of the paper session “All Things Are Not Equal: CompliCATions From Digital Inequalities.”

Second, Carl will present work on behalf of himself, Aaron Shaw, and Benjamin Mako Hill during a high-density paper session in Room 242A (Palais des Congres). The title of their project is “Extended Abstract: Exhaustive Longitudinal Trace Data From Over 70,000 Wikis.”

Lastly, at the same time in Room 352B (Palais des Congres), Jeremy will present an interview study entitled “What Communication Supports Multifunctional Public Goods in Organizations? Using Agent-Based Modeling to Explore Differential Uses of Enterprise Social Media.” Jeremy’s co-authors on this paper are Jeffrey Treem and Bart van den Hooff.

On Sunday afternoon, at 3.30 PM in Room 311+312 (Palais des Congres), Tiwaladeoluwa Adekunle will talk about a qualitative project she collaborated on with Jeremy, Nate, and Laura Nelson: “Co-Creating Risk Online: Exploring Conceptualizations of COVID-19 Risk in Ideologically Distinct Online Communities.”

We will finish off our ICA 2022 presentations at 5.00 PM in Room 313+314 (Palais des Congres), where Kaylea will present on behalf of Isabella Brown, Lucy Bao, Jacinta Harshe, and Mako. The title of their paper is “Making Sense of Covid-19: Search Results and Information Providers”.

We look forward to sharing our research and connecting with you at ICA!

Come meet us at CHI 2022

We’re going to be at CHI! The Community Data Science Collective will be presenting three papers. You can find us there in person in New Orleans, Louisiana, April 30 – May 5. If you’ve ever wanted a super cool CDSC sticker, this is your chance!

Two red street cars going down a tree lined street.
“Streetcars in New Orleans: 2000 series – Perley A. Thomas Car Works 900 Series Replicas” by Flavio~ is marked with CC BY 2.0.

Stefania (Stef) Druga (University of Washington) wrote “Family as a Third Space for AI Literacies: How do children and parents learn about AI together?” with Amy J. Ko and Fee Lia Christoph (University of Michigan). Stef will be presenting at “Interactive Learning Support Systems,” Monday May 2 at 14:15.

Sejal Khatri (University of Washington) received an honorable mention for her work “The Social Embeddedness of Peer Production: A Comparative Qualitative Analysis of Three Indian Language Wikipedia Editions,” co-authored by Sayamindu Dasgupta, Benjamin Mako Hill, and Aaron Shaw. Sejal will be presenting Tuesday May 3 at 14:15 in “Crowdwork and Collaboration.” Sejal, Aaron, Mako, and Sayamindu also have a blog post available.

Ruijia (Regina) Cheng (University of Washington) also received an honorable mention for her paper “How Interest-Driven Content Creation Shapes Opportunities for Informal Learning in Scratch: A Case Study on Novices’ Use of Data Structures,” co-authored by Benjamin Mako Hill and Sayamindu Dasgupta. Regina will be talking about it during the session “Programming and Coding Support” on Wednesday May 4 at 09:00. You can also read about Ruijia, Mako, and Sayamindu’s work on our blog.

The CDSC logo, which looks a bit like a cloud with four legs, and the text "Community Data Science Collective."
You can have this on a sticker!

How social context explains why some online communities engage contributors better than others

More than a billion people visit Wikipedia each month and millions have contributed as volunteers. Although Wikipedia exists in 300+ language editions, more than 90% of Wikipedia language editions have fewer than one hundred thousand articles. Many small editions are in languages spoken by small numbers of people, but the relationship between the size of a Wikipedia language edition and that language’s number of speakers—or even the number of viewers of the Wikipedia language edition—varies enormously. Why do some Wikipedias engage more potential contributors than others? We attempted to answer this question in a study of three Indian language Wikipedias that will be published and presented at the ACM Conference on Human Factors in Computing Systems (CHI 2022).

To conduct our study, we selected three Wikipedia language communities that correspond to the official languages of three neighboring states of India: Marathi (MR) from the state of Maharashtra, Kannada (KN) from the state of Karnataka, and Malayalam (ML) from the state of Kerala (see the map in the right panel of the figure above). While the three projects share goals, technological infrastructure, and a similar set of challenges, Malayalam Wikipedia’s community engaged its language speakers in contributing to Wikipedia at a much higher rate than the others. The graph above (left panel) shows that although MR Wikipedia has twice as many viewers as ML Wikipedia, ML has more than double the number of articles as MR.

Our study focused on identifying differentiating factors between the three Wikipedias that could explain these differences. Through a grounded theory analysis of interviews with 18 community participants from the three projects, we identified two broad explanations: a “positive participation cycle” in Malayalam Wikipedia and a “negative participation cycle” in Marathi and Kannada Wikipedias.

As the first step of our study, we conducted semistructured interviews with active participants of all three projects to understand their personal experiences and motivations; their perceptions of dynamics, challenges, and goals within their primary language community; and their perceptions of the other language Wikipedias.

We found that MR and KN contributors experience more day-to-day barriers to participation than ML, and that these barriers hinder contributors’ day-to-day activity and impede engagement. For example, both MR and KN members reported a large number of content disputes that they felt reduced their desire to contribute.

But why do some Wikipedias like MR or KN have more day-to-day barriers to contribution like content disputes and low social support than others? Our interviews pointed to a series of higher-level explanations. For example, our interviewees reported important differences in the norms and rules used within each community as well as higher levels of territoriality and concentrated power structures in MR and KN.

Once again, though: why do the MR and KN Wikipedias have these issues with territoriality and centralized authority structures? Here we identify a third, even higher-level set of differences in the social and cultural contexts of the three language-speaking communities. For example, MR and KN community members attributed low engagement to broad cultural attitudes toward volunteerism and differences in their language community’s engagement with free software and free culture.

The two flow charts above visualize the explanatory mapping of divergent feedback loops we describe.  The top part of the figure illustrates how the relatively supportive macro-level social environment in Kerala led to a larger group of potential contributors to ML as well as a chain reaction of processes that led to a Wikipedia better able to engage potential contributors. The process is an example of a positive feedback cycle. The second, bottom part of the figure shows the parallel, negative feedback cycle that emerged in MR and KN Wikipedias. In these settings, features of the macro-level social environment led to a reliance on a relatively small group of people for community leadership and governance. This led, in turn, to barriers to entry that reduced contributions. 

One final difference between the three Wikipedias was the role that paid labor from NGOs played. Because the MR and KN Wikipedias struggled to recruit and engage volunteers, NGOs and foundations deployed financial resources to support the development of content in Marathi and Kannada, but not in ML to the same degree. Our work suggested this tended to further concentrate power among a small group of paid editors in ways that aggravated the meso-level community struggles. This is shown in the red box in the second (bottom) row of the figure.

The results from our study provide a conceptual framework for understanding how the embeddedness of social computing systems within particular social and cultural contexts shape various aspects of the systems. We found that experience with participatory governance and free/open-source software in the Malayalam community supported high engagement of contributors. Counterintuitively, we found that financial resources intended to increase participation in the Marathi and Kannada communities hindered the growth of these communities. Our findings underscore the importance of social and cultural context in the trajectories of peer production communities. These contextual factors help explain patterns of knowledge inequity and engagement on the internet. 


Please refer to the preprint of the paper for more details on the study and our design suggestions for localized peer production projects. We’re excited that this paper has been accepted to CHI 2022 and received the Best Paper Honorable Mention Award! It will be published in the Proceedings of the ACM on Human-Computer Interaction and presented at the conference in May. The full citation for this paper is:

Sejal Khatri, Aaron Shaw, Sayamindu Dasgupta, and Benjamin Mako Hill. 2022. The social embeddedness of peer production: A comparative qualitative analysis of three Indian language Wikipedia editions. In CHI Conference on Human Factors in Computing Systems (CHI ’22), April 29-May 5, 2022, New Orleans, LA, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3491102.3501832

If you have any questions about this research, please feel free to reach out to one of the authors: Sejal Khatri, Benjamin Mako Hill, Sayamindu Dasgupta, and Aaron Shaw

Attending the conference in New Orleans? Come attend our live presentation on May 3 at 3 pm at the CHI program venue, where you can discuss the paper with all the authors. 

How does Interest-driven Participation Shape Computational Learning in Online Communities?

Online communities are frequently described as promising sites for computing education. Advocates of online communities as contexts for learning argue that they can help novices learn concrete programming skills through self-directed and interest-driven work. Of course, it is not always clear how well this plays out in practice—especially when it comes to learning challenging programming concepts. We sought to understand this process through a mixed-method case study of the Scratch online community that will be published and presented at the ACM Conference on Human Factors in Computing Systems (CHI 2022) in several weeks.

Scratch is the largest online interest-driven programming community for novices. In Scratch, users can create programming projects using the visual, block-based Scratch programming language. Scratch users can choose to share their projects—and many do—so that they can be seen, interacted with, and remixed by other Scratch community members. Our study focused on understanding how Scratch users learn to program with data structures (i.e., variables and lists)—a challenging programming concept for novices—by using community-produced learning resources such as discussion threads and curated project examples. Through a qualitative analysis of Scratch forum discussion threads, we identified a social feedback loop in which participation in the community raises the visibility of particular ways of using variables and lists, which in turn shapes the nature and diversity of community-produced learning resources. In a follow-up quantitative analysis of a large collection of Scratch projects, we found statistical support for this social process.

A program made from a stack of Scratch programming blocks. From top to bottom: “when clicked,” “forever,” “if touching Bat? then,” “change score by -1.”
Scratch code for a score counter in a game.

As the first step of our study, we collected and qualitatively analyzed 400 discussion threads about variables and lists in the Scratch Q&A forums. Our key finding was that Scratch users use specific, concrete examples to teach each other about variables and lists. These examples are commonly framed in terms of elements in the projects that they are making, often specific to games.

For instance, we observed users teaching each other how to make a score counter in a game using variables. In another example, we saw users sharing tips on creating an item inventory in a game using lists. As a result of this focus on specific game elements, user-generated examples and tutorials are often framed in the specifics of these game-making scenarios. For example, much of the sample Scratch code on variables and lists came from games with popular elements like scores and inventories. While these community-produced learning resources offer valuable concrete examples, not everybody is interested in making games. We saw some evidence that users who are not interested in making games involving scores and inventories were less likely to get effective support when they sought to learn about variables. We argue that, repeated over time, this dynamic can lead to a social feedback loop in which reliance on community-generated resources places innovative forms of creative coding at a disadvantage compared to historically common forms.

This diagram illustrates the hypothetical social feedback loop that we constructed based on our findings in Study 1. The diagram starts with the box labeled “Stage 1” on the left; the text explaining Stage 1 says: “learners create artifacts with Use Case A.” A rightward arrow points from Stage 1 to the box labeled “Stage 2” on the right, and the text on the arrow says: “learners turn to the community for help or inspiration.” The text explaining Stage 2 says: “Community cumulates learning resources framed around Use Case A.” Above these is a leftward arrow that points from Stage 2 back to Stage 1, forming the loop. The text on the arrow says: “Subsequent learners get exposed to resources about Use Case A.” Underneath the entire loop, a downward arrow points to a box labeled “Outcome.” The text explaining the Outcome says: “Use Case A becomes archetypal. Other innovative use cases become less common.”
Our proposed hypothetical social feedback loop of how community-generated resources may constrain innovative computational participation. 

The graph here is a visualization of the social feedback loop theory that we proposed. Stage 1 suggests that, in an online interest-driven learning community, some specific applications of a concept (“Use Case A”) will be more popular than others. This might be due to random chance or any number of reasons. When seeking community support, learners will tend to ask questions framed specifically around Use Case A and use community resources framed in terms of the same use case. Stage 2 shows the results of this process. As learners receive support, they produce new artifacts with Use Case A that can serve as learning resources for others. Then, learners in the future can use these learning resources, becoming even more likely to create the same specific application. The outcome of the feedback loop is that, as certain applications of a concept become more popular over time, the community’s learning resources are increasingly focused on the same applications.
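As an illustration of how this loop could be explored computationally (a toy sketch written for this post, not an analysis from the paper), consider a simple rich-get-richer simulation in which each new learner either copies a use case from an existing community resource or, occasionally, innovates:

```python
# A toy simulation written for this post (not an analysis from the paper): each
# new learner either copies a use case from an existing community resource or,
# occasionally, innovates, and then contributes a resource of their own.
import random

def resource_feedback_loop(n_learners=5000, n_use_cases=10, innovate_prob=0.1, seed=0):
    rng = random.Random(seed)
    resources = list(range(n_use_cases))          # one seed resource per use case
    for _ in range(n_learners):
        if rng.random() < innovate_prob:
            chosen = rng.randrange(n_use_cases)   # learner innovates independently
        else:
            chosen = rng.choice(resources)        # learner copies an existing resource
        resources.append(chosen)                  # their artifact becomes a new resource
    shares = sorted((resources.count(c) / len(resources) for c in range(n_use_cases)),
                    reverse=True)
    print("share of resources per use case, most to least common:",
          [round(s, 2) for s in shares])

if __name__ == "__main__":
    resource_feedback_loop()
```

Even though every use case starts on equal footing, the copying step means early popularity compounds, so a few use cases typically end up dominating the pool of resources.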

We tested our social feedback loop theory using 5 years of Scratch data including 241,634 projects created by 75,911 users. We tested both the mechanism and the outcome of the loop from multiple angles in terms of three hypotheses that we believe will be true if the feedback loop we describe is shaping behavior:

  1. More projects involving variables and lists will be games over time.
  2. The types of project elements that users make with variables and lists (which we operationalized as the names they gave to variables and lists) will become more homogeneous (one way to measure this kind of homogeneity is sketched below).
  3. Users who have been exposed to popular variable and list names will be more likely to use those names in their own projects.

We found at least some support for all of our hypotheses.
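For instance, the homogeneity in hypothesis 2 could be tracked as the normalized entropy of variable and list names per year (a rough sketch under assumed column names, not the measure used in the paper):

```python
# A rough sketch under assumed column names (not the measure used in the paper):
# track naming homogeneity as the normalized entropy of variable/list names per
# year, given a hypothetical `names` DataFrame with `year` and `name` columns.
# Lower values mean more homogeneous naming.
import math
import pandas as pd

def yearly_name_entropy(names: pd.DataFrame) -> pd.Series:
    def normalized_entropy(series: pd.Series) -> float:
        freqs = series.value_counts(normalize=True)
        if len(freqs) <= 1:
            return 0.0
        entropy = -sum(p * math.log(p) for p in freqs)
        return entropy / math.log(len(freqs))  # scale to the [0, 1] range
    return names.groupby("year")["name"].apply(normalized_entropy)
```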

Our results provide broad (if imperfect) support for our social feedback loop theory. For example, the graph below illustrates one of our findings: users who have been exposed to popular list names (solid line) will be more likely to use (in other words, less likely to never use) popular names in their projects, compared to users who have never downloaded projects with popular list names (dashed line). 

This figure is a line plot that illustrates the curves from the survival analysis for lists. The x-axis is “Number of shared de novo projects w/ list.” The labels are “0”, “10”, “20”, and “30” from left to right. The y-axis is “Proportion of users who have never used popular variable names.” The labels are “0.00”, “0.25”, “0.50”, “0.75”, and “1.00” from bottom to top. There are two lines. The dashed line represents users who never downloaded projects with popular variable names. The solid line represents users who have downloaded projects with popular variable names. The solid line starts at 1 on the x-axis and approximately 0.75 on the y-axis. The solid line descends in a convex shape and, when it reaches 10 on the x-axis, it is at around 0.25 on the y-axis. The line keeps descending, reaches around 0.05 on the y-axis when it is at 25 on the x-axis, and stays at 0.05 for the rest of the x-axis. The dashed line is significantly higher than the solid line and stays above it across the entire graph. The dashed line starts at 1 on the x-axis and approximately 0.88 on the y-axis. The dashed line descends in a convex shape that is less steep than the solid line, and when it reaches 10 on the x-axis, it is at around 0.50 on the y-axis. The line keeps descending, reaches around 0.24 on the y-axis when it is at 25 on the x-axis, and stays at 0.24 for the rest of the x-axis.
Plots from our Cox proportional hazards survival analysis of the difference between users who have previously downloaded projects with popular list names and those who have never done so.
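For readers curious about the mechanics, here is a minimal sketch of this kind of analysis using the lifelines library (the column names are assumptions for illustration, not the paper’s code; see the paper for the actual models):

```python
# A minimal sketch using the lifelines library (column names are assumptions
# for illustration, not the paper's code): Kaplan-Meier curves by exposure
# group plus a Cox proportional hazards model. `df` holds one row per user with
# `n_projects` (de novo projects shared before first use of a popular list name,
# or before censoring), `used_popular_name` (1 if that event occurred), and
# `downloaded_popular` (1 if the user ever downloaded a project with a popular
# list name).
import matplotlib.pyplot as plt
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

def survival_comparison(df: pd.DataFrame) -> None:
    ax = plt.subplot()
    for exposed, group in df.groupby("downloaded_popular"):
        kmf = KaplanMeierFitter()
        kmf.fit(group["n_projects"], event_observed=group["used_popular_name"],
                label=f"downloaded popular names = {bool(exposed)}")
        # Survival curve = proportion who have never used a popular name.
        kmf.plot_survival_function(ax=ax)
    plt.show()

    cph = CoxPHFitter()
    cph.fit(df[["n_projects", "used_popular_name", "downloaded_popular"]],
            duration_col="n_projects", event_col="used_popular_name")
    cph.print_summary()  # hazard ratio for the exposure indicator
```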

The results from our study describe an important trade-off that designers of online communities for computational learning need to be aware of. On the one hand, learners can learn advanced computational concepts by building their own explanations and understanding on specific use cases that are popular in the community. On the other hand, such learning can be superficial rather than conceptual or generalizable: learners’ preference for peer-generated learning resources around specific interests can restrict the exploration of broader and more innovative uses, which can limit sources of inspiration, pose barriers to broadening participation, and confine learners’ understanding of general concepts. We conclude the paper by suggesting several design strategies that might be effective in countering this effect.


Please refer to the preprint of the paper for more details on the study and our design suggestions for future online interest-driven learning communities. We’re excited that this paper has been accepted to CHI 2022 and received the Best Paper Honorable Mention Award! It will be published in the Proceedings of the ACM on Human-Computer Interaction and presented at the conference in May. The full citation for this paper is:

Ruijia Cheng, Sayamindu Dasgupta, and Benjamin Mako Hill. 2022. How Interest-Driven Content Creation Shapes Opportunities for Informal Learning in Scratch: A Case Study on Novices’ Use of Data Structures. In CHI Conference on Human Factors in Computing Systems (CHI ’22), April 29-May 5, 2022, New Orleans, LA, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3491102.3502124

If you have any questions about this research, please feel free to reach out to one of the authors: Ruijia “Regina” Cheng, Sayamindu Dasgupta, and Benjamin Mako Hill.

Notes from the CDSC Community Dialogue Series

This winter, the Community Data Science Collective launched a Community Dialogues series. These are meetings in which we invite community experts, organizers, and researchers to get together to share their knowledge of community practices and challenges, recent research, and how that research can be applied to support communities. We had our first meeting in February, with presentations from Jeremy Foote and Sohyeon Hwang on small communities and Nate TeBlunthuis and Charlie Kiene on overlapping communities.

Watch the introduction video starring Aaron Shaw!

Six grey and white birds sitting on a fence. A seventh bird is landing to join them.

“Joining the Community” by Infomastern is marked with CC BY-SA 2.0.

What we covered

Here are some quick summaries of the presentations. After the presentations, we formed small groups to discuss how what we learned related to our own experiences and knowledge of communities.

Finding Success in Small Communities

Small communities often stay small, medium ones stay medium, and big ones stay big. Meteoric growth is uncommon. User control and content curation improve user experience. Small communities help people define their expectations. Participation in small communities is often very salient and helps participants build group identity, though not personal relationships. Growth doesn’t mean success, and we need to move beyond growth and other purely quantitative metrics when judging success. Being small can be a feature, not a bug!

We built a list of discussion questions collaboratively. It included:

  • Are you actively trying to attract new members to your community? Why or why not?
  • How do you approach scale/size in your community/communities?
  • Do you experience pressure to grow? From where? Towards what end?
  • What kinds of connections do people seek in the community/communities you are a part of?
  • Can you imagine designs/interventions to draw benefits from small communities or sub-communities within larger projects/communities?
  • How can we understand and set community members’ expectations regarding community size?
  • “Small communities promote group identity but not interpersonal relationships.” This seems counterintuitive.
  • How do you manage challenges around growth incentives/pressures?

Why People Join Multiple Communities

People join topical clusters of communities, which have more mutualistic relationships than competitive ones. There is a trilemma (like a dilemma, but three-way) between a large audience, specific content, and homophily (likemindedness). No community can do everything, and it may be better for participants and communities to have multiple, overlapping spaces. This can be more engaging, generative, fulfilling, and productive. People develop portfolios of communities, which can involve many small communities.

Questions we had for each other:

  • Do members of your community also participate in similar communities?
  • What other communities are your members most often involved in?
  • Are they “competing” with you? Or “mutualistic” in some way?
  • In what other ways do they relate to your community?
  • There is a “trilemma” between the largest possible audience, specific content, and homophilous (likeminded/similar folks) community. Where does your community sit inside this trilemma?

Slides and videos

How you can get involved

You can subscribe to our mailing list! We’ll be making announcements about future events there. It will be a low volume mailing list.

Acknowledgements

Thanks to speakers Charlie Kiene, Jeremy Foote, Nate TeBlunthuis, and Sohyeon Hwang! Kaylea Champion was heavily involved in planning and decision making. The vision for the event borrows from the User and Open Innovation workshops organized by Eric von Hippel and colleagues, as well as others. This event and the research presented in it were supported by multiple awards from the National Science Foundation (DGE-1842165; IIS-2045055; IIS-1908850; IIS-1910202), Northwestern University, the University of Washington, and Purdue University.

Session summaries and questions above were created collaboratively by event attendees.

Conferences, Publications, and Congratulations

This year was packed with things we’re excited about and want to celebrate and share. Great things happened to Community Data Science Collective members within our schools and the wider research community.

A smol brown and golden dog in front of a red door. The dog is wearing a pink collar with ladybugs. She also has very judgemental (or excited) eyebrows.
Meet Tubby! Sohyeon adopted Tubby this year.

Academic Successes

Sohyeon Hwang (Northwestern) and Wm Salt Hale (University of Washington) earned their master’s degrees. You can read Salt’s paper, “Resilience in FLOSS,” online.

Charlie Kiene and Regina Cheng completed their comprehensive exams and are now PhD candidates!

Nate TeBlunthuis defended his dissertation and started a post-doctoral fellowship at Northwestern. Jim Maddock defended his dissertation on December 16th.

Congratulations to everyone!

Teaching and Workshop Participation

Floor Fiers and Sohyeon ran a workshop at Computing Everywhere, a Northwestern initiative to help students build computational literacy. Sohyeon and Charlie participated in the Yale SMGI Community Driven Governance Workshop. We also had standout attendance at the Social Computing Systems Summer Camp, with Sneha Narayan, Stefania Druga, Charlie, Regina, Salt, and Sohyeon participating.

Regina was a teaching assistant for senior undergraduate students on their capstone projects. Regina’s mentees won Best Design and Best Engineering awards.

Conference Presentations

Sohyeon and Jeremy Foote presented together at CSCW (Computer-Supported Cooperative Work), where they earned a Best Paper Honorable Mention award. Nick Vincent had two presentations at CSCW, one on Wikipedia links in search engine results and one on conscious data contribution. Benjamin Mako Hill and Nate presented on algorithmic flagging on Wikipedia.

Salt was interviewed on the FOSS and Crafts podcast. His conference presentations included Linux App Summit, SeaGL, and DebConf. Kaylea Champion spoke at SeaGL and DebConf. Kaylea’s DebConf presentation was on her research on detecting at-risk projects in Debian.

Kaylea and Mako also presented at Software Analysis, Evolution and Reengineering, an IEEE conference.

Emilia Gan, Mako, Regina, and Stef organized the “Imagining Future Design of Tools for Youth Data Literacies” workshop at the 2021 Connected Learning Summit.

Our 2021 publications include:

  • Champion, Kaylea. 2021. “Underproduction: An approach for measuring risk in open source software.” 28th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). pp. 388-399, doi: 10.1109/SANER50967.2021.00043.
  • Fiers, Floor, Aaron Shaw, and Eszter Hargittai. 2021. “Generous Attitudes and Online Participation.” Journal of Quantitative Description: Digital Media, 1. https://doi.org/10.51685/jqd.2021.008
  • Hill, Benjamin Mako, and Aaron Shaw. 2021. “The hidden costs of requiring accounts: Quasi-experimental evidence from peer production.” Communication Research 48(6): 771-795. https://doi.org/10.1177%2F0093650220910345.
  • Hwang, Sohyeon, and Jeremy Foote. 2021. “Why do people participate in small online communities?” Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 462:1-462:25. https://doi.org/10.1145/3479606
  • Shaw, Aaron and Eszter Hargittai. 2021. “Do the Online Activities of Amazon Mechanical Turk Workers Mirror Those of the General Population? A Comparison of Two Survey Samples.” International Journal of Communication 15: 4383–4398. https://ijoc.org/index.php/ijoc/article/view/16942
  • TeBlunthuis, Nathan, Benjamin Mako Hill, and Aaron Halfaker. 2021. “Effects of Algorithmic Flagging on Fairness: Quasi-experimental Evidence from Wikipedia.” Proc. ACM Hum.-Comput. Interact. 5, CSCW1, Article 56 (April 2021), 27 pages. https://doi.org/10.1145/3449130
  • TeBlunthuis, Nathan. 2021. “Measuring Wikipedia Article Quality in One Dimension.” In Proceedings of the 17th International Symposium on Open Collaboration (OpenSym ’21). Online: ACM Press. https://doi.org/10.1145/3479986.3479991.

Fool’s gold? The perils of using online survey samples to study online behavior

The OG Mechanical Turk (public domain via Wikimedia Commons). It probably was not useful for unbiased survey research sampling either.

When it comes to research about participation in social media, sampling and bias are topics that often get ignored or politely buried in the "limitations" sections of papers. This is even true in survey research using samples recruited through idiosyncratic sites like Amazon’s Mechanical Turk. Together with Eszter Hargittai, I (Aaron) have a new paper (pdf) out in the International Journal of Communication (IJOC) that illustrates why ignoring sampling and bias in online survey research about online participation can be a particularly bad idea.

Surveys remain a workhorse method of social science, policy, and market research. But high-quality survey research that produces generalizable insights into big (e.g., national) populations is expensive, time-consuming, and difficult. Online surveys conducted through sites like Amazon Mechanical Turk (AMT), Qualtrics, and others offer a popular alternative for researchers looking to reduce the costs and increase the speed of their work. Some people even go so far as to claim that AMT has "ushered in a golden age in survey research" (and focus their critical energies on other important issues with AMT, like research ethics!).

Despite the hype, the quality of the online samples recruited through AMT and other sites often remains poorly or incompletely documented. Sampling bias online is especially important for research that studies online behaviors, such as social media use. Even with complex survey weighting schemes and sophisticated techniques like multilevel regression with post-stratification (MRP), surveys gathered online may incorporate subtle sources of bias because the people who complete the surveys online are also more likely to engage in other kinds of activities online.
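To make the weighting idea concrete, here is a minimal sketch of simple post-stratification (illustrative only, with made-up column names): reweighting a sample so its demographic cells match known population shares.

```python
# An illustrative sketch only (not from the paper): simple post-stratification,
# reweighting an online sample so a demographic cell variable matches known
# population shares. The column names `age_group` and `uses_twitter` are
# hypothetical.
import pandas as pd

def poststratified_mean(sample: pd.DataFrame, population_shares: dict,
                        cell_col: str = "age_group",
                        outcome_col: str = "uses_twitter") -> float:
    sample_shares = sample[cell_col].value_counts(normalize=True)
    # Weight each respondent by how under- or over-represented their cell is.
    weights = sample[cell_col].map(lambda c: population_shares[c] / sample_shares[c])
    return (sample[outcome_col] * weights).sum() / weights.sum()

# Example usage with made-up population shares:
# poststratified_mean(sample, {"18-29": 0.20, "30-49": 0.33, "50-64": 0.25, "65+": 0.22})
```

Even with this kind of adjustment (or far more sophisticated ones), respondents within each cell are still people who chose to take surveys online, so bias tied to online activity itself can remain.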

Surprisingly little research has investigated these concerns directly. Eszter and I do so by using a survey instrument administered concurrently on AMT and a national sample of U.S. adults recruited through NORC at the University of Chicago (note that we published another paper in Socius using parts of the same dataset last year). The results suggest that AMT survey respondents are significantly more likely to use numerous social media, from Twitter to Pinterest and Reddit, as well as to have significantly more experience contributing their own online content, from posting videos to participating in various online forums and signing online petitions.

Such findings may not be shocking, but prevalent research practices often overlook the implications: you cannot rely on a sample recruited from an online platform like AMT to map directly to a general population when it comes to online behaviors. Whether AMT has created a survey research "golden age" or not, analysis conducted on a biased sample produces results that are less valuable than they seem.

Catching up on the Collective’s 2021 PhD Q&A

On November 5, members of the Community Data Science Collective held an open session to answer questions about our PhD programs. We’ve posted a 30-minute video of the first half of the event on our group YouTube channel; it includes an introduction to the group and answers to some basic questions that attendees submitted in advance.

Video of the first 30 minutes of the PhD Q&A.

If you have additional questions, feel free to reach out to the CDSC faculty or to anybody else in the group. Contact information for everybody should be online.

Keep in mind that the first due date (University of Washington Department of Communication MA/PhD program) is November 15, 2021.

The rest of the deadlines at Purdue (Brian Lamb School of Communication), Northwestern (Media, Technology & Society; Technology & Social Behavior), and UW (Human-Centered Design & Engineering; Computer Science & Engineering; Information School) are in December—mostly on December 1st.

The Hidden Costs of Requiring Accounts

Should online communities require people to create accounts before participating?

This question has been a source of disagreement among people who start or manage online communities for decades. Requiring accounts makes some sense since users contributing without accounts are a common source of vandalism, harassment, and low quality content. In theory, creating an account can deter these kinds of attacks while still making it pretty quick and easy for newcomers to join. Also, an account requirement seems unlikely to affect contributors who already have accounts and are typically the source of most valuable contributions. Creating accounts might even help community members build deeper relationships and commitments to the group in ways that lead them to stick around longer and contribute more.

In a new paper published in Communication Research, Benjamin Mako Hill and Aaron Shaw provide an answer. We analyze data from “natural experiments” that occurred when 136 wikis on Fandom.com started requiring user accounts. Although we find strong evidence that the account requirements deterred low quality contributions, this came at a substantial (and usually hidden) cost: a much larger decrease in high quality contributions. Surprisingly, the cost includes “lost” contributions from community members who had accounts already, but whose activity appears to have been catalyzed by the (often low quality) contributions from those without accounts.
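As a very rough illustration of how one might start probing such a natural experiment (a simplified sketch with assumed column names, not the specification used in the paper; see the replication materials for that), one could fit an interrupted-time-series-style regression to weekly contribution counts around the policy change:

```python
# A highly simplified sketch with assumed column names (not the specification
# used in the paper): an interrupted-time-series-style check on one wiki, asking
# whether weekly counts of high-quality contributions shifted after the account
# requirement. `weekly` has `week` (integer, centered so 0 is the policy change)
# and `good_edits`.
import pandas as pd
import statsmodels.formula.api as smf

def simple_its(weekly: pd.DataFrame):
    weekly = weekly.copy()
    weekly["post"] = (weekly["week"] >= 0).astype(int)  # 1 after the requirement
    # Estimate a level shift (post) and a slope change (week:post) at the cutoff.
    model = smf.ols("good_edits ~ week + post + week:post", data=weekly).fit()
    return model.summary()
```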


The full citation for the paper is: Hill, Benjamin Mako, and Aaron Shaw. 2020. “The Hidden Costs of Requiring Accounts: Quasi-Experimental Evidence from Peer Production.” Communication Research, 48 (6): 771–95. https://doi.org/10.1177/0093650220910345.

If you do not have access to the paywalled journal, please check out this pre-print or get in touch with us. We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.