A systems approach to studying online communities

Systems theory is a broad and multidisciplinary scientific approach that studies how things (molecules or cells or organs or people or companies) interact with each other. It argues that understanding how something works requires understanding its relationships and interdependencies.

For example, if we want to predict whether a new online community will grow, an individual perspective might focus on who the founder is, what software it is running on, how well it is designed, etc. A systems approach would argue that it is at least as important to understand things like how many similar communities there are, how active they are, and whether the platform is growing or shrinking.

In a paper just published in Media and Communication, I (Jeremy) argue that 1) it is particularly important to use a systems lens to study online communities, 2) online communities provide ideal data for taking these approaches, and 3) there is already really neat research in this area, and there should be more of it.

The role of platforms

So, why is it so important to study online communities as interdependent “systems”? The first reason is that many online communities have a really important interdependence with the platforms that they run on. Platforms like Reddit or Facebook provide the servers and software for millions of communities, which are run mostly independently by the community managers and moderators.

However, this relationship is ambivalent. The goals and desires of at least some moderators are often at odds with those of the platform, and things like community bans from the platform side or protests from the community side are not uncommon. The ways that platform decisions influence communities, and how communities can work together to influence platforms, are inherently systems questions.

Low barriers to entry and exit

A second feature of online communities is the relative ease with which people can join or leave them. Unlike offline groups, which at least require participants to get dressed, do their hair, and show up somewhere, online participants can join a community literally within seconds of learning that it exists.

Similarly, people can leave incredibly easily, and most people do. This figure shows the number of comments made per person across 100 randomly selected subreddits (each line represents a subreddit; axes are both log-scaled). In every case, the vast majority of people only commented once while a few people made many comments.
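
To make the computation concrete, here is a minimal Python sketch of how a comments-per-person distribution could be tallied for a single community. It is not the code behind the figure; the function name and the toy list of comment authors are made up for illustration.

```python
from collections import Counter

def comments_per_person_distribution(comment_authors):
    """Map 'number of comments' to 'number of people who made that many'."""
    per_person = Counter(comment_authors)   # author -> comment count
    return Counter(per_person.values())     # comment count -> number of authors

# Toy data (invented for illustration): one author name per comment.
toy_comments = ["ana", "ben", "ana", "cy", "ana", "dee", "eli", "ben"]
distribution = comments_per_person_distribution(toy_comments)

# Share of commenters who commented exactly once.
one_timers = distribution[1] / sum(distribution.values())

for n_comments, n_people in sorted(distribution.items()):
    print(f"{n_comments} comment(s): {n_people} people")
print(f"Share who commented only once: {one_timers:.0%}")
```

Repeating this for each subreddit and plotting the resulting pairs with both axes log-scaled would give one line per community, as in the figure described above.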

Fuzzy boundaries

Finally, it’s often really difficult to draw clear boundaries around where one online community ends and another begins. For example, is all of Wikipedia one “community”? It might make sense to think of a language edition, a WikiProject, or even a single page as a community, and researchers have done all of the above. Even on platforms like Reddit, where there is a clear delineation between communities, there are dependencies, with people and conversations moving across and between communities on similar topics.

In other words, online communities are semi-autonomous, interdependent, contingent organizations, deeply influenced by their environments. Online community scholars have often ignored this larger context, but systems theory gives us a rich set of tools for studying these interdependencies. One reason these tools fit so well is that online communities provide ideal data.

Data from Online Communities

Systems theory is not new – many of the main concepts were developed in the 1950s and 1960s or earlier. Organizational communication researchers saw how applicable these ideas were, and many researchers proposed treating organizations as systems.

However, it was really tough to get the data needed to do systems-based research. To study a group or organization as a system, you need to know about not only the internal workings of the group, but how it relates to other groups, how it is influenced by and influences its environment, etc. Gathering data about even one group was difficult and expensive; getting the data to study many groups and how they interact with each other over time was impossible.

The internet has entered the chat

Online communities provide the kind of data that these earlier researchers could have only dreamed of. Instead of data about one organization, platforms store data about thousands of organizations. And this is not just high-level data about activity levels or participation; on the contrary, we often have longitudinal, full-text conversations of millions of people as they interact within and move between communities.

Systems Approaches

In part, this article is a call for researchers to think more explicitly about online communities as systems, and to apply systems theory as a way of understanding how online communities work and how we can design research projects to understand them better. It is also an attempt to highlight strands of research that are already doing this. In the paper, I talk about four: Community Comparisons and Interactions, Individual Trajectories, Cross-level Mechanisms, and Simulating Emergent Behavior. Here, I’ll focus on just two.

Individual Trajectories

Figure from Panciera, K., Halfaker, A., & Terveen, L. (2009). Wikipedians are born, not made: A study of power editors on Wikipedia. Proceedings of the ACM 2009 International Conference on Supporting Group Work, 51–60. https://doi.org/10.1145/1531674.1531682

The first is what I call “Individual Trajectories”. In this approach, researchers can look at how individual people behave across a platform. One of the neat things about having longitudinal, unobtrusively collected data is that we can identify something interesting about users and go “back in time” to look for differences in earlier behavior. For example, in the plot above, Panciera et al. identified people who became active Wikipedia editors; they then went back and looked at how the behavior of those editors differed from that of typical editors, starting from their early days on the site.
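
The “back in time” logic can be sketched in a few lines of Python. This is a toy illustration, not the method used by Panciera et al.: the function name, the edit-log format, the threshold for counting someone as a heavy contributor, and the length of the “early” window are all assumptions made for the example.

```python
from collections import defaultdict

def early_activity_by_outcome(edits, power_threshold, early_days):
    """Compare early activity of eventual heavy contributors and everyone else.

    edits: iterable of (user, day) pairs, where day counts the days since
    that user's first edit (0 = first day on the site).
    Returns the mean number of edits made in the first `early_days` days
    for each group.
    """
    total = defaultdict(int)
    early = defaultdict(int)
    for user, day in edits:
        total[user] += 1
        if day < early_days:
            early[user] += 1

    # Label users by their eventual outcome, then look back at early behavior.
    groups = {"power": [], "typical": []}
    for user, n_edits in total.items():
        label = "power" if n_edits >= power_threshold else "typical"
        groups[label].append(early[user])

    return {label: sum(v) / len(v) for label, v in groups.items() if v}

# Toy edit log (invented): u1 ends up with many edits, u2 and u3 do not.
toy_edits = [("u1", 0), ("u1", 0), ("u1", 1), ("u1", 5), ("u1", 9),
             ("u2", 0), ("u3", 0), ("u3", 3)]
result = early_activity_by_outcome(toy_edits, power_threshold=4, early_days=2)
print(result)
```

The key design choice is that group membership is defined by what happens later, while the comparison uses only early activity, which is exactly what longitudinal trace data makes possible.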

Researchers could and should do more work that looks at how people move between communities, and how communities influence the behavior of their members.

Simulating Emergent Behavior

The second approach is to use simulations to study emergent behaviors. Agent-based modeling software like NetLogo or Mesa allows researchers to create virtual worlds, where computational “agents” act according to theories of how the world works. Many communication theories make predictions about how individual‐level behavior produces higher‐level patterns, often through feedback loops (e.g., the Spiral of Silence theory). If agent-based models don’t produce those patterns, then we know that something about the theory—or its computational representation—is wrong.
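
Frameworks like NetLogo or Mesa handle scheduling and visualization, but the core idea fits in a few lines of plain Python. The sketch below is a deliberately simplified, hypothetical take on a Spiral of Silence-style feedback loop, not a validated implementation of the theory. Every parameter (the 60% majority, the range of thresholds, the update rule) is an assumption chosen for illustration.

```python
import random

def run_spiral(n_agents=200, share_a=0.6, rounds=20, seed=0):
    """Simulate who speaks up when people fear being in the minority.

    Returns the share of voiced opinions that are "A" in each round.
    """
    rng = random.Random(seed)
    n_a = int(n_agents * share_a)
    opinions = ["A"] * n_a + ["B"] * (n_agents - n_a)
    # Assumed rule: each agent speaks only if the share of speakers who
    # agree with them is at least their personal threshold.
    thresholds = [rng.uniform(0.2, 0.5) for _ in range(n_agents)]
    speaking = [True] * n_agents  # everyone starts out willing to speak

    history = []
    for _ in range(rounds):
        speakers = [op for op, s in zip(opinions, speaking) if s]
        if not speakers:
            break
        share_a_voiced = speakers.count("A") / len(speakers)
        history.append(share_a_voiced)
        # Feedback loop: what agents hear this round decides who speaks next.
        speaking = [
            (share_a_voiced if op == "A" else 1 - share_a_voiced) >= th
            for op, th in zip(opinions, thresholds)
        ]
    return history

history = run_spiral()
print([round(share, 2) for share in history])
```

Under these assumptions the voiced majority can only grow from round to round, even though nobody changes their opinion. If a model like this failed to produce the silencing pattern that a theory predicts, that mismatch would be a signal to revisit either the theory or the way it was encoded.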

Model of misinformation spread, from Hu et al. (under review)

Agent-based modeling has received some attention from communication researchers lately, including a wonderful special issue recently published in Communication Methods and Measures; the editorial article makes some great arguments for the promise and benefits of simulations for communication research.

New Opportunities

It is a really exciting time to be a computational social scientist, especially one who is interested in online organizations and organizing. We have only scratched the surface of what we can learn from the data that is pouring down around us, especially when it comes to systems theory questions. Tools, methods, and computational advances are constantly evolving and opening up new avenues of research.

Of course, taking advantage of these data sources and computational advances requires a different set of skills than Communication departments have traditionally focused on, and complicated, large-scale analyses require the use of supercomputers and extensive computational expertise.

However, there are many approaches like agent-based modeling or simple web scraping that can be taught to graduate students in one or two semesters, and open up lots of possibilities for doing this kind of research.

I’d love to talk more about these ideas—please reach out, or if you are coming to ICA, come talk to me!

Forming, storming, norming, performing, and …chloroforming?

In 1965, Bruce Tuckman proposed a “developmental sequence in small groups.” According to his influential theory, most successful groups go through four stages with rhyming names:

  1. Forming: Group members get to know each other and define their task.
  2. Storming: Through argument and disagreement, power dynamics emerge and are negotiated.
  3. Norming: After conflict, groups seek to avoid conflict and focus on cooperation and setting norms for acceptable behavior.
  4. Performing: There is both cooperation and productive dissent as the team performs the task at a high level.

Fortunately for organizational science, 1965 was hardly the last stage of development for Tuckman’s theory!

Twelve years later, Tuckman suggested that adjourning or mourning reflected potential fifth stages (Tuckman and Jensen 1977). Since then, other organizational researchers have suggested other stages including transforming and reforming (White 2009), re-norming (Biggs), and outperforming (Rickards and Moger 2002).

What does the future hold for this line of research?

To help answer this question, we wrote a regular expression to identify candidate words and placed the full list on this page in the Community Data Science Collective wiki.
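
For the curious, a search like this might look something like the Python sketch below. This is not the regular expression we actually used. The pattern, the function name, and the toy word list are assumptions for illustration, and spelling is only a rough proxy for rhyme (this pattern would miss “mourning” and “adjourning”, for example).

```python
import re

# Matches whole lowercase words that end in "-orming" or "-orning".
STAGE_PATTERN = re.compile(r"^[a-z]+or[mn]ing$")

def candidate_stages(words):
    """Return the sorted, de-duplicated words that look like group stages."""
    lowered = {word.lower() for word in words}
    return sorted(word for word in lowered if STAGE_PATTERN.match(word))

# Toy word list (invented); a real run would read a full dictionary file.
toy_words = ["forming", "Storming", "scorning", "chloroforming", "warming",
             "morning", "normal", "performing", "shoehorning", "misinforming"]
stages = candidate_stages(toy_words)
print(stages)
```

Swapping the toy list for a full dictionary file would produce a longer candidate list to sift through by hand.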

The good news is that despite the active stream of research producing new stages that end or rhyme with -orming, there are tons of great words left!

For example, stages in a group’s development might include:

  • Scorning: In this stage, group members begin mocking each other!
  • Misinforming: Groups that reach this stage start producing fake news.
  • Shoehorning: These groups try to make their products fit into ridiculous constraints.
  • Chloroforming: Groups become languid and fatigued?

One benefit of keeping our list in the wiki is that the organizational research community can use it to coordinate! If you are planning to use one of these terms—or if you know of a paper that has—feel free to edit the page in our wiki to “claim” it!


Although credit for this post goes primarily to Jeremy Foote and Benjamin Mako Hill, the other Community Data Science Collective members can’t really be called blameless in the matter either.

Summer Institute in Computational Social Science

For the second year, Matt Salganik and Chris Bail are running a two-week Summer Institute in Computational Social Science at Duke University. The goal of the institute is to bring social scientists and data scientists together to learn about computational social science, which can be described as a merger of their two fields.

This year, there are seven partner locations where local students livestream the activities from Duke and learn from local computational social scientists. Both of our universities are among the partner locations.

At the University of Washington, Kaylea and Charlie have both been accepted as participants in the UW summer institute. At Northwestern University, Jeremy is helping to organize SICSS Chicago.

Much of the work that we do in the Community Data Science Collective could be considered computational social science, and we are excited about the potential for computational methods in social science. This is a great program for helping to disseminate computational social science approaches and train the next generation of computational social scientists. The Community Data Science Collective is happy to be a sponsor of the Chicago partner location.

Photo of the SICSS participants in Chicago, sponsored by CDSC!

Introducing Computational Methods to Social Media Scientists

The ubiquity of large-scale data and improvements in computational hardware and algorithms have enabled researchers to apply computational approaches to the study of human behavior. One of the richest contexts for this kind of work is data from social media platforms like Facebook, Twitter, and Reddit.

We were invited by Jean Burgess, Alice Marwick, and Thomas Poell to write a chapter about computational methods for the Sage Handbook of Social Media. Rather than simply listing what sorts of computational research have been done with social media data, we decided to use the chapter both to introduce a few computational methods and to use those methods to analyze the field of social media research.

A “hairball” diagram from the chapter illustrating how research on social media clusters into distinct citation network neighborhoods.

Explanations and Examples

In the chapter, we start by describing the process of obtaining data from web APIs and, as a case study, walk through how we obtained bibliographic data about social media publications from Elsevier’s Scopus API. We follow this same strategy in discussing social network analysis, topic modeling, and prediction. For each, we discuss some of the benefits and drawbacks of the approach and then provide an example analysis using the bibliographic data.

We think that our analyses provide some interesting insight into the emerging field of social media research. For example, we found that social network analysis and computer science drove much of the early research, while recently consumer analysis and health research have become more prominent.

More importantly though, we hope that the chapter provides an accessible introduction to computational social science and encourages more social scientists to incorporate computational methods in their work, either by gaining computational skills themselves or by partnering with more technical colleagues. While there are dangers and downsides (some of which we discuss in the chapter), we see the use of computational tools as one of the most important and exciting developments in the social sciences.

Steal this paper!

One of the great benefits of computational methods is their transparency and their reproducibility. The entire process—from data collection to data processing to data analysis—can often be made accessible to others. This has both scientific benefits and pedagogical benefits.

To aid in the training of new computational social scientists, and as an example of the benefits of transparency, we worked to make our chapter pedagogically reproducible. We have created a permanent website for the chapter at https://communitydata.science/social-media-chapter/ and uploaded all the code, data, and material we used to produce the paper itself to an archive in the Harvard Dataverse.

Through our website, you can download all of the raw data that we used to create the paper, together with code and instructions for how to obtain, clean, process, and analyze the data. Our website walks through what we have found to be an efficient and useful workflow for doing computational research on large datasets. This workflow even includes the paper itself, which is written using LaTeX + knitr. These tools let changes to data or code propagate through the entire workflow and be reflected automatically in the paper itself.

If you use our chapter for teaching about computational methods, or if you find bugs or errors in our work, please let us know! We want this chapter to be a useful resource, will happily consider any changes, and have even created a git repository to help with managing these changes!

Introduction to R workshop

I recently taught a two-session workshop introducing R to Kellogg MBA students. I had a few goals for the workshops:

  1. Convince students of the benefits of using text-based programming for data exploration and analysis
  2. Introduce basic programming concepts (e.g., variables, functions)
  3. Give students a basic understanding of how to do some fundamental data analysis tasks in R: importing, cleaning, visualizing, and modeling

Those are really big goals for only four hours. I decided to use the tidyverse as much as possible and not even teach base R syntax like ‘[,]’, apply, etc. I used the first session to show and explain code using the nycflights13 dataset. For the second session, we did a few more examples but mostly worked on exercises using a dataset from Wikia that I created (with help from Mako and Aaron Halfaker’s code and data).

Learning R does have its downsides

Retrospection

Overall, I think that the workshops went pretty well. I think that students definitely have a better understanding and a better set of tools than I did after I had used R for four hours!

That being said, there was plenty of room for improvement. I am scheduled to teach another set of workshops early next year and I’m planning to make a few changes:

  1. Make both of the workshops more hands-on and interactive. I think I’ll divide the topics covered: the first workshop will be on importing, cleaning, and grouping data and the second will be on visualizing and creating inferential models.
  2. Get more help – teaching non-programmers R requires some hand-holding and individual attention. To be successful, I think a workshop like this requires 1 “TA” for every 8-10 students.
  3. Find a more relevant dataset. Although I actually learned a few things about my dataset that will help with my papers that use it, I think it would be better to have a dataset that is as similar as possible to what students will be working with in their careers.
  4. Connect the visualization and regression more directly to a specific analysis problem rather than as syntax-learning exercises.

Reuse this workshop!

I found some pretty good resources already in existence for introducing students to R, but none of them quite fit the scope of what I was looking for. All of the code that I used (as well as some slides for the beginning of class) is on github and GPL licensed. Please reuse my work and submit pull requests!

Why do people start new online communities and projects?

Online communities have become ubiquitous, providing not only entertainment but wielding increasing cultural and political influence. While news organizations and researchers have focused a lot of attention on online communities after they become influential, very little is known about how or why they get started. Our survey of hundreds of Wikia.com founders shows that typical online communities are actually very different from the communities that are “in the news”. Online community founders have diverse motivations, but typically have modest goals which are focused on filling their own needs, and they don’t necessarily care if their projects ever get very big. Our research suggests that rather than being failures, small online communities are both intentional and common.

Most online communities are small

Our research is inspired by the skewed distribution of attention online. For example, these three graphs show the number of contributors to each subreddit, github project, and Wikipedia page. (Note the log scale – the reality is even more skewed than these plots make it appear.)

Reddit graph


Github graph

Wikipedia graph

In every case, there is a “long tail” of projects with very few contributions and very little attention, while the most popular projects get the lion’s share. It is perhaps unsurprising, then, that the most popular projects also garner the majority of scholarly attention. However, what these graphs also show is that most online communities are very small.

Even when scholars include smaller communities in their analysis, they typically treat longevity and size as measures of success. Using this metric, the vast majority of new projects fail. So why do people start new online communities? Are they simply naive, not realizing that large-scale success is so rare? Are community founders trying to win the attention lottery?

Our Survey

We worked with some great folks at Wikia to send a survey to community founders right after they started their community. We received partial or full responses from hundreds of founders.

Wikia homepage as it appeared during our data collection (via archive.org) with the invitation to found a new wiki highlighted. Twilight was really big in 2010.

 

In addition to demographic information, we asked a set of thirteen questions about the motivations of founders, based on the contributor motivation literature, and seven questions about their goals for their community. We also asked founders about their plans for their community, and whether they were planning to follow some of the best practices for building and running online communities.

Founders have diverse motivations and modest goals

We found that Wikia founders have diverse motivations. We used principal component analysis (PCA) to identify four main motivations for creating new wikis: spreading information and building a community, addressing problems with existing wikis, having fun or learning, and creating and publicizing personal content. Spreading information and building a community was the most common motivation, but each of these was marked as a primary motivation by multiple respondents.
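
For readers who have not used PCA, the sketch below shows the basic idea on a tiny synthetic example. It is not the analysis code from the paper, and the numbers are not survey data. The two latent “motivations”, the four questionnaire items, and the function name are all invented for illustration, and in practice one would use a statistics library rather than hand-rolled power iteration.

```python
import random

def first_principal_component(rows, iterations=100):
    """Return the loadings of the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the questionnaire items.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Synthetic respondents (NOT survey data): two made-up latent motivations,
# each measured by two noisy questionnaire items.
rng = random.Random(42)
responses = []
for _ in range(300):
    community = rng.gauss(0, 2.0)  # varies a lot across respondents
    fun = rng.gauss(0, 0.5)        # varies less
    responses.append([
        community + rng.gauss(0, 0.3),  # item 0
        community + rng.gauss(0, 0.3),  # item 1
        fun + rng.gauss(0, 0.3),        # item 2
        fun + rng.gauss(0, 0.3),        # item 3
    ])

loadings = first_principal_component(responses)
print([round(x, 2) for x in loadings])
```

Items that load heavily on the same component tend to be answered similarly, which is how a long list of motivation questions can be summarized as a handful of underlying motivations.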

We also found that the barriers to starting a new community – both technological and cognitive – are very low. Only 32% of founders reported having planned their wiki for a few weeks or longer, while fully 46% of founders had only planned it for a few hours or a few minutes.

As with motivations, founders had diverse goals. The most common top goal was the creation of high-quality information, with nearly half of respondents selecting it. Community longevity/activity and growth were also common goals.

Finally, we looked at whether there was a relationship between motivations and goals, and between goals and plans for community building. We found that those whose top goal was information quality were less likely to be motivated by fun and learning, and that they were less likely to plan on recruiting contributors or encouraging contributions. In future research, we are looking at how a founder’s goals and plans relate to membership and contribution growth.

Motivations by goals
Plans by goals
Distribution of founder motivations and plans, based on whether their top goal is community or information quality.

So what?

We believe that platform designers and researchers should focus more of their resources on understanding small and short-lived communities. Our research suggests that the attention paid to the more popular and long-lived online communities has perpetuated a false assumption that all communities seek to become large and powerful. Indeed, our respondents are typically not seeking or even hoping for large-scale “success”.

In addition, we believe that in many contexts, understanding online communities can be augmented by focusing on founders. Platform designers can study founders to understand how users would like to use a system and researchers can do more to understand the differences between founders and other contributors.

There is also a need to generalize this research – founders on other online platforms (Reddit, github, etc.) may have a different set of motivations and goals (although we suspect that they will be similarly modest in their ambitions). Overall, there is lots of room for additional research on how and why things get started online.

The paper and data

If you liked this blog post, then you’ll love the full paper: Starting online communities: Motivations and goals of wiki founders. Even better, if you are planning to be at CHI 2017, come watch the talk!

This post (and the paper) were written by Jeremy Foote, Aaron Shaw and Darren Gergle. The charts at the beginning of the post were created using data from the great public datasets at Big Query. Anonymized results of the survey are publicly available, and code is coming.