A new paper on the risk of nationalist governance capture in self-governed Wikipedia projects

Wikipedia is one of the most visited websites in the world and the largest online repository of human knowledge. It is also both a target of and a defense against misinformation, disinformation, and other forms of online information manipulation. Importantly, its 300 language editions are self-governed—i.e., they set most of their rules and policies. Our new paper asks: What types of governance arrangements make some self-governed online groups more vulnerable to disinformation campaigns? We answer this question by comparing two Wikipedia language editions—Croatian and Serbian Wikipedia. Despite relying on common software and being situated in a common sociolinguistic environment, these communities differed in how successfully they responded to disinformation-related threats.

For nearly a decade, the Croatian language version of Wikipedia was run by a cabal of far-right nationalists who edited articles in ways that promoted fringe political ideas and involved cases of historical revisionism related to the Ustaše regime, a fascist movement that ruled the Nazi puppet state called the Independent State of Croatia during World War II. This cabal seized complete control of the governance of the encyclopedia, banned and blocked those who disagreed with them, and operated a network of fake accounts to give the appearance of grassroots support for their policies.

Thankfully, Croatian Wikipedia appears to be an outlier. Though both the Croatian and Serbian language editions have been documented to contain nationalist bias and historical revisionism, Croatian Wikipedia alone seems to have succumbed to governance capture: a takeover of the project’s mechanisms and institutions of governance by a small group of users.

The situation in Croatian Wikipedia was well documented and is now largely fixed, but we still know very little about why Croatian Wikipedia was taken over while other language editions seem to have rebuffed similar capture attempts. In a new paper accepted for publication in the Proceedings of the ACM on Human-Computer Interaction (CSCW), we present an interview-based study that tries to explain why Croatian Wikipedia was captured while several other editions facing similar contexts and threats fared better.

Short video presentation of the work given at Wikimania in August 2023.

We interviewed 15 participants from both the Croatian and Serbian Wikipedia projects, as well as the broader Wikimedia movement. Based on insights from these interviews, we arrived at three propositions that, together, help explain why Croatian Wikipedia succumbed to capture while Serbian Wikipedia did not: 

  1. Perceived Value as a Target. Is the project worth expending the effort to capture?
  2. Bureaucratic Openness. How easy is it for contributors outside the core founding team to ascend to local governance positions?
  3. Institutional Formalization. To what degree does the project prefer personalistic, informal forms of organization over formal ones?

The conceptual model from our paper, visualizing possible institutional configurations among Wikipedia projects that affect the risk of governance capture.

We found that both Croatian Wikipedia and Serbian Wikipedia were attractive targets for far-right nationalist capture due to their sizable readerships and resonance with a national identity. However, we also found that the two projects diverged early in their trajectories in how open they remained to new contributors ascending to local governance positions and in the degree to which they privileged informal relationships over formal rules and processes as organizing principles of the project. Ultimately, Croatian Wikipedia's relative lack of bureaucratic openness and of rules constraining administrator behavior created a window of opportunity for a motivated contingent of editors to seize control of the project's governance mechanisms.

Though our empirical setting was Wikipedia, our theoretical model may offer insight into the challenges faced by self-governed online communities more broadly. As interest in decentralized alternatives to Facebook and X (formerly Twitter) grows, communities on these sites will likely face similar threats from motivated actors. Understanding the vulnerabilities inherent in these self-governing systems is crucial to building resilient defenses against threats like disinformation. 

For more details on our findings, take a look at the preprint of our paper.


Preprint on arxiv.org: https://arxiv.org/abs/2311.03616. The paper has been accepted for publication in Proceedings of the ACM on Human-Computer Interaction (CSCW) and will be presented at CSCW in 2024. This blog post and the paper it describes are collaborative work by Zarine Kharazian, Benjamin Mako Hill, and Kate Starbird.

The social structure of new wiki communities

A new paper from our group seeks to test whether the kinds of communication patterns associated with successful offline teams also predict success in online collaborative settings. Surprisingly, we find that they do not. In the rest of this blog post, we summarize that research and unpack that result.

Many of us have been part of a work team where everyone clicked. Everyone liked and respected each other, maybe you even hung out together outside of work. In a team like that, when someone asks you to cover a shift, or asks you to stay late to help them finish a project, you do it.

This anecdotal experience that many of us have is borne out by research. When members of work groups in corporate settings feel integrated into a group, and particularly when their identity is connected to their group membership, they are more willing to contribute to the group’s goals. Integrative groups (where there isn’t a strong hierarchy and where very few people are on the periphery) are also able to communicate and coordinate their work better.

One way to measure whether a group is “integrative” is to look at the group’s conversation networks, as shown in the figure below. Groups where few people are on the periphery (like on the left) usually perform better along a number of dimensions, such as creativity and productivity.

Examples of two possible configurations of a work group. The work group on the left is much more “integrative,” and we would expect it to be more creative and productive.

In our new paper, we set out to look for evidence that early online wiki communities at Fandom.com work the same way as work groups. When communities are getting started, there are lots of reasons to think that they would also benefit from integrative networks. Their members typically don’t know each other and communicate mostly via text—conditions that should make building a shared identity tough. In addition, they are volunteers who can easily leave at any time. The research on work groups made us think that integrative social structures would be especially important in making new wikis successful.

Communication network of the Spongebob wiki after 700 edits

In order to measure the social structure of these communities, we created communication networks for almost 1,000 wikis, capturing the talk that happened during each wiki's first 700 main page edits. Connections between people were based on who talked to whom on Talk pages. These are wiki pages connected to each page and each registered user on a wiki. We connected users who talked to each other at least a few times on the same talk pages, and we looked at whether the integrativeness of a wiki's communication network predicted 1) how much people contributed and 2) how long the wiki remained active. A minimal sketch of this kind of network construction appears below.
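
As a rough illustration (and not our actual replication code, which is linked at the end of this post), here is how one might build such a network and compute one common integrativeness proxy, degree centralization. The editor names and the tie threshold are made up for the example:

```python
# Illustrative sketch: build a communication network from talk page
# interactions and compute one "integrativeness" proxy.
from collections import Counter

import networkx as nx

# Assume pairs of (speaker, person replied to) extracted from talk
# pages during a wiki's first 700 main page edits.
interactions = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("carol", "alice"), ("alice", "carol"), ("dave", "alice"),
]

# Keep an undirected tie only for pairs who talked at least twice.
pair_counts = Counter(frozenset(p) for p in interactions if p[0] != p[1])
G = nx.Graph()
G.add_edges_from(tuple(pair) for pair, n in pair_counts.items() if n >= 2)

def degree_centralization(g):
    """Freeman degree centralization: 1.0 for a star (hub-and-spoke)
    network, 0.0 when every node has the same degree. Highly
    centralized networks are the opposite of integrative."""
    n = g.number_of_nodes()
    if n < 3:
        return 0.0
    degrees = [d for _, d in g.degree()]
    return sum(max(degrees) - d for d in degrees) / ((n - 1) * (n - 2))

print(degree_centralization(G))  # 1.0: alice is the hub of a tiny star
```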

Surprisingly, we found that no matter how we measured communication networks, and no matter how we measured success, integrative network measures were not good at predicting that a wiki would survive or be productive. While a few of our control variables helped to predict productivity and survival, none of the network measures (nor all of them taken together) helped much to predict either of our success measures, as shown in Figures 5 and 6 from the paper.

Figure 5. Estimated coefficients predicting the productivity of a wiki.
Figure 6. Estimated coefficients predicting how quickly a wiki will become inactive.

So, what is going on here?

We have a few possible explanations for why communication network structures don’t seem to matter. One is that group identity for wiki members may not be influenced much by network structure. In a work group, it can be painfully obvious if you are on the periphery and not included in conversations or activities. Even though wiki conversations are technically all public and visible, in practice it’s very easy for group members to be unaware of conversations happening in other parts of the site. This idea is supported by research led by Sohyeon Hwang, which showed that people can build identity in an online community even without personal relationships.

Another, complementary explanation for how groups coordinate work without integrative communication networks is that wiki software helps to organize what needs to be done without explicit communication. Much of this happens simply because the central artifact of the community—the wiki—is continuously updated, so it is (relatively) clear what has been done and what needs to be done. In addition, there are opportunities for stigmergy. Stigmergy occurs when actors modify the environment as a way of communicating; others then make decisions based on observing the environment. The canonical example is ants, who leave pheromone trails for other ants to find and follow.

In wikis, this can be accomplished in a few ways. For example, contributors can create a link to a page that doesn’t yet exist. By default, these show up as red links, suggesting to others that a page needs to be created.

A final possible explanation for our results is based on how easy it is to join and leave online communities. It may be that integrative structures are so important because they help groups to overcome and navigate conflicts; in online communities contributors may be more likely to simply disengage instead of trying to resolve a conflict.

As we conclude in the paper:

Why do communication networks—important predictors of group performance outcomes across diverse domains—not predict productivity or survival in peer production? Our findings suggest that the relationship of communication structure to effective collaboration and organization is not universal but contingent. While all groups require coordination and undergo social influence, groups composed of different types of people or working in different technological contexts may have different communicative needs. Wikis provide a context where coordination via stigmergy may suffice and where the role of cheap exit as well as the difficulty of group-level conversation may lead to consensus-by-attrition.

We hope that others will help us to study some of these mechanisms more directly, and look forward to talking more with researchers and others interested in how and why online groups succeed.


The full citation for the paper is: Foote, Jeremy, Aaron Shaw, and Benjamin Mako Hill. 2023. “Communication Networks Do Not Predict Success in Attempts at Peer Production.” Journal of Computer-Mediated Communication 28 (3): zmad002. https://doi.org/10.1093/jcmc/zmad002.

We have also released replication materials for the paper, including all the data and code used to conduct the analyses.

Effects of Algorithmic Flagging on Fairness: Quasi-experimental Evidence from Wikipedia

Many online platforms are adopting machine learning as a tool to maintain order and high quality information in the face of massive influxes of user generated content. Of course, machine learning algorithms can be inaccurate, biased, or unfair. How do signals from machine learning predictions shape the fairness of online content moderation? How can we measure an algorithmic flagging system's effects?

In our paper published at CSCW 2021, I (Nate TeBlunthuis) together with Benjamin Mako Hill and Aaron Halfaker analyzed the RCFilters system: an add-on to Wikipedia that highlights and filters edits that a machine learning algorithm called ORES identifies as likely to be damaging to Wikipedia. This system has been deployed on large Wikipedia language editions and is similar to other algorithmic flagging systems that are becoming increasingly widespread. Our work measures the causal effect of being flagged in the RCFilters user interface.

Screenshot of Wikipedia edit metadata on Special:RecentChanges with RCFilters enabled. Highlighted edits with a colored circle to the left side of other metadata are flagged by ORES. Different circle and highlight colors (white, yellow, orange, and red in the figure) correspond to different levels of confidence that the edit is damaging. RCFilters does not specifically flag edits by new accounts or unregistered editors, but does support filtering changes by editor types.

Our work takes advantage of the fact that RCFilters, like many algorithmic flagging systems, creates discontinuities in the relationship between the probability that a moderator should take action and whether a moderator actually does. This happens because the output of machine learning systems like ORES is typically a continuous score (in RCFilters, an estimated probability that a Wikipedia edit is damaging), while the flags (in RCFilters, the yellow, orange, or red highlights) are either on or off and are triggered when the score crosses some arbitrary threshold. As a result, edits slightly above the threshold are both more visible to moderators and appear more likely to be damaging than edits slightly below. Even though edits on either side of the threshold have virtually the same likelihood of truly being damaging, the flagged edits are substantially more likely to be reverted. This fact lets us use a method called regression discontinuity to make causal estimates of the effect of being flagged in RCFilters. The toy sketch below illustrates the basic estimation strategy.
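
As an illustration of the logic (not the paper's actual specification; the threshold, effect size, and functional form here are simulated assumptions), a local linear regression discontinuity on toy data looks like this:

```python
# Toy regression discontinuity sketch on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
threshold = 0.5  # hypothetical score at which the flag turns on

# Simulate edits near the cutoff with a true +0.15 jump in revert
# probability for flagged edits.
score = rng.uniform(0.3, 0.7, size=5000)
flagged = (score >= threshold).astype(float)
p_revert = 0.1 + 0.4 * (score - threshold) + 0.15 * flagged
reverted = rng.binomial(1, p_revert)

# Local linear RDD: regress the revert outcome on the centered running
# variable, the flag indicator, and their interaction. The coefficient
# on the flag indicator estimates the jump at the threshold.
centered = score - threshold
X = sm.add_constant(np.column_stack([centered, flagged, centered * flagged]))
fit = sm.OLS(reverted, X).fit()
print(f"estimated effect of being flagged: {fit.params[2]:.3f}")  # ~0.15
```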

Charts showing the probability that an edit will be reverted as a function of its ORES score in the neighborhood of the discontinuous thresholds that trigger the RCFilters flags. The jump in reversion chances is larger for registered editors than for unregistered editors at both thresholds.

To understand how this system may affect the fairness of Wikipedia moderation, we estimated the effects of flagging on edits by different groups of editors. Comparing the magnitudes of these estimates lets us assess how flagging relates to several different definitions of fairness. Surprisingly, we found evidence that these flags improved fairness for categories of editors that have been widely perceived as troublesome—particularly unregistered (anonymous) editors. This occurred because flagging has a much stronger effect on edits by registered editors than on edits by unregistered editors.

We believe that our results are driven by the fact that algorithmic flags are especially helpful for finding damage that can't be easily detected otherwise. Wikipedia moderators can see an editor's registration status in recent changes, watchlists, and edit histories. Because unregistered editors are often troublesome, Wikipedia moderators' attention is often focused on their contributions, with or without algorithmic flags. Algorithmic flags make damage by registered editors (in addition to unregistered editors) much more detectable to moderators and so help moderators focus on damage overall, not just damage by suspicious editors. As a result, the algorithmic flagging system decreases the bias that moderators have against unregistered editors.

This finding is particularly surprising because the ORES algorithm we analyzed was itself demonstrably biased against unregistered editors (i.e., the algorithm tended to greatly overestimate the probability that edits by these editors were damaging). Despite the fact that the algorithms were biased, their introduction could still lead to less biased outcomes overall.

Our work shows that although it is important to design predictive algorithms not to have such biases, it is equally important to study fairness at the level of the broader sociotechnical system. Since we first published a preprint of our paper, a follow-up piece by Leijie Wang and Haiyi Zhu replicated much of our work and showed that differences between Wikipedia communities may be another important factor driving the effect of the system. Overall, this work suggests that social signals and social context can interact with algorithmic signals, and together these can influence behavior in important and unexpected ways.


The full citation for the paper is: TeBlunthuis, Nathan, Benjamin Mako Hill, and Aaron Halfaker. 2021. “Effects of Algorithmic Flagging on Fairness: Quasi-Experimental Evidence from Wikipedia.” Proceedings of the ACM on Human-Computer Interaction 5 (CSCW): 56:1-56:27. https://doi.org/10.1145/3449130.

We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.

Community Data Science Collective at CSCW 2021

The Community Data Science Collective is attending the 2021 ACM Conference on Computer-supported Cooperative Work and Social Computing (CSCW).

The conference will feature two new papers by collective students and faculty that were published in the journal Proceedings of the ACM on Human-Computer Interaction (CSCW).

Information on the talks, as well as links to the papers, is available here (collective members are listed in italics):

In addition, Benjamin Mako Hill is a panel co-chair.

Mako, Sohyeon, Jeremy, and Nathan will all be at the conference and so will tons of our social computing friends. Please come and say “Hello” to any of us and introduce yourself if you don’t already know us :)

Measuring Wikipedia Article Quality in One Continuous Dimension

Accurately estimating the quality of Wikipedia articles is an important task for both researchers and Wikipedia community members. In a forthcoming paper in the Proceedings of OpenSym 2021, I describe a new method for estimating article quality in Wikipedia in one dimension that builds on the widely used ORES quality model and that improves on the techniques researchers have used to incorporate measures of quality into their studies of Wikipedia in the past. I will be presenting the work virtually this week at the OpenSym 2021 conference. OpenSym is free and open to the public this year but requires registration.

Numerous efforts have gone into measuring the quality of Wikipedia articles, and the Wikimedia Foundation maintains a machine learning model for measuring article quality on English Wikipedia called the ORES quality model. This model is trained on quality assessments conducted by members of WikiProjects, who label articles into a hierarchy of quality levels (i.e., Stub, Start-class, C-class, B-class, Good, and Featured), and it uses boosted decision trees to predict the quality of versions of articles. This model is useful because it can predict the quality of versions of articles that haven't been assessed. My OpenSym paper (teaser video, full presentation) builds on the ORES quality models to improve the measurement of Wikipedia article quality in one continuous dimension using ordinal regression models. A one-dimensional, real-valued measure of quality is more granular and easier to use in downstream research. My method also increases the accuracy of quality prediction for the units of analysis most important for research, like articles or revisions, and estimates the spacing between different levels of article quality.

Quality scores and predictions of the ordinal regression models. Columns in the grid of charts correspond to the ordinal quality model calibrated to the indicated unit of analysis, and rows correspond to sampled articles having the indicated level of quality as assessed by Wikipedians. Each chart shows the histogram of scores, thresholds inferred by the ordinal model with 95% credible intervals colored in gray, and colors indicating when the model makes correct or incorrect predictions. The thresholds are not evenly spaced, especially in the revision and article models, which place more weight on lower quality classes. These two models infer that the gaps between Stub and Start and between Start and C-class articles are considerably wider than the gap between C-class and B-class articles.

An important limitation of the ORES quality models is that they do not output a single real-valued quality score. Instead, they output a multinomial distribution over an article's quality classes. This means that instead of saying "article X has quality 0.3," the model tells you "the probability of article X's quality class is 0.1 for Stub, 0.2 for Start-class, 0.5 for C-class, 0.15 for B-class, 0.05 for Good, and 0 for Featured, and the most probable quality class (MPQC) is C." Using this kind of output in a downstream statistical analysis is messy. For example, it might seem reasonable to use the MPQC as an integer-valued measure of quality, but this throws away information. Suppose ORES says that "the probability of article Y's quality class is 0.00 for Stub, 0.15 for Start-class, 0.5 for C-class, 0.2 for B-class, 0.1 for Good, and 0.05 for Featured." According to the ORES quality model, Y probably has greater quality than X. Even though both have an MPQC of C-class, there's a much greater chance of article Y being B-class or better than of article X. Is there a way to use the ORES probabilities to build a more granular measure of quality that accounts for this difference?

Aaron Halfaker, one of the creators of the ORES system, combined the probabilities into a one-dimensional score for one of his research projects by taking a weighted sum of the probabilities under the assumption that Wikipedia article quality levels are "evenly spaced." This creates a score out of the quality class probabilities by multiplying each probability by a weight, so that probabilities for higher quality levels get more weight. He chose the weights [0,1,2,3,4,5], so an article gets 0 quality points for its probability of being a Stub, 1 point for its probability of being Start-class, 2 for C-class, and so on, with 5 points for Featured. This results in a nice continuous measure of quality that simplifies downstream statistical analysis, and a number of others have followed his lead. The snippet below works through this calculation for the hypothetical articles X and Y described above.
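
As a minimal sketch, here is that calculation applied to the hypothetical ORES probabilities for articles X and Y given above:

```python
# The "evenly spaced" score: a weighted sum of the ORES quality class
# probabilities with weights [0, 1, 2, 3, 4, 5] for Stub through
# Featured. Probabilities are the hypothetical articles X and Y above.
classes = ["Stub", "Start", "C", "B", "Good", "Featured"]
weights = [0, 1, 2, 3, 4, 5]

probabilities = {
    "X": [0.10, 0.20, 0.50, 0.15, 0.05, 0.00],
    "Y": [0.00, 0.15, 0.50, 0.20, 0.10, 0.05],
}

for name, probs in probabilities.items():
    mpqc = classes[max(range(len(probs)), key=probs.__getitem__)]
    score = sum(w * p for w, p in zip(weights, probs))
    print(f"article {name}: MPQC = {mpqc}, evenly spaced score = {score:.2f}")

# Both articles share an MPQC of C-class, but the continuous score
# separates them: X scores 1.85 while Y scores 2.40.
```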

But how reasonable is the "evenly spaced" assumption that supports using the weights [0,1,2,3,4,5]? Could there be a way to pick weights to combine the ORES probabilities without using this assumption? My paper explains why ordinal regression is the right tool for the job and proposes a procedure for fitting an ordinal regression model to a sample, representative of a unit of analysis like articles, revisions, or quality classes, that has been labeled by a WikiProject and scored by the ORES quality model. The ordinal regression predicts the quality labels as a function of the ORES scores and in this way finds a good way to combine the ORES scores into a single value. It also infers threshold parameters that define the regions of the quality score corresponding to each quality class, which allows the "evenly spaced" assumption to be tested. The figure above shows that the article quality levels are not evenly spaced! Interestingly, the way that they are unevenly spaced depends on how quality is measured. If the quality scale is designed to be accurate across all quality classes, then the quality levels are relatively even. However, if it is designed to be accurate for revisions or articles, then more of the scale goes to lower levels of quality. Overall, the models agree that the difference between C-class and Start articles is smaller than the differences between the other levels. A sketch of the general approach appears below.
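
As a rough sketch of the general approach, here is an ordinal regression fit on simulated data with statsmodels. This is not the paper's exact model, and the labels and scores are made up, but it shows how the fitted coefficient yields a continuous quality score and how the inferred thresholds make the spacing between classes testable:

```python
# Rough illustration on simulated data: fit an ordinal regression of
# quality labels on a continuous score, then read off a latent 1-D
# quality measure and the inferred thresholds between classes.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
levels = ["Stub", "Start", "C", "B", "Good", "Featured"]

# Hypothetical labeled sample: a latent quality value per article, the
# WikiProject label it implies, and a noisy score standing in for ORES.
n = 2000
latent = rng.normal(size=n)
codes = np.digitize(latent, [-1.5, -0.5, 0.3, 1.0, 1.8])
quality = pd.Series(
    pd.Categorical.from_codes(codes, categories=levels, ordered=True)
)
ores_score = latent + rng.normal(scale=0.5, size=n)

model = OrderedModel(quality, ores_score[:, None], distr="probit")
result = model.fit(method="bfgs", disp=False)

# The latent linear predictor serves as a continuous quality score, and
# the fitted thresholds show how (un)evenly spaced the classes are.
beta = np.asarray(result.params)[0]
continuous_quality = ores_score * beta
thresholds = result.model.transform_threshold_params(result.params)
print(thresholds)  # unequal gaps = evidence against "evenly spaced"
```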

Using the quality scores based on ordinal regression also improves accuracy. This mostly comes from calibrating the ordinal model to the appropriate unit of analysis. The ORES quality model is fit on a "balanced" dataset where each quality class is equally represented. This means that the ORES quality model has learned that each quality class is equally likely to occur in the data. Of course, in reality lower quality articles are far more common than high quality articles. There are over 3,000,000 Stubs on English Wikipedia but fewer than 8,000 Featured articles. As the table below shows, fitting ordinal models that know the true proportion of each quality class in a sample can improve accuracy compared to the ORES quality model. A small sketch of one way to adjust for class imbalance appears after the table.

Accuracy of quality prediction models depends on the unit of analysis. The greatest accuracy scores are highlighted. Models are more accurate when calibrated on the same unit of analysis on which they are evaluated. Compared to the MPQC, the ordinal quality models have better accuracy when revisions or articles are the unit of analysis. When the quality class is the unit of analysis, the ordinal quality model has slightly worse accuracy.
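
To make the balanced-training issue concrete, here is a hedged sketch of one standard prior-shift correction: reweighting a balanced model's probabilities by the true class proportions. The prevalences below are made up, and this is not the calibration procedure used in the paper (which fits ordinal models directly to each unit of analysis); it just illustrates why knowing the true class proportions changes the scores:

```python
# Prior-shift correction: probabilities from a model trained on a
# class-balanced sample are reweighted by the (made-up) true class
# proportions and renormalized.
import numpy as np

balanced_prior = np.full(6, 1 / 6)  # what the model saw in training
true_prior = np.array([0.55, 0.25, 0.12, 0.05, 0.02, 0.01])  # illustrative

ores_probs = np.array([0.10, 0.20, 0.50, 0.15, 0.05, 0.00])  # article X above
adjusted = ores_probs * (true_prior / balanced_prior)
adjusted /= adjusted.sum()
print(adjusted.round(3))  # mass shifts toward the more common low classes
```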

Even though my study found evidence against the "evenly spaced" assumption, I also found that the quality scores based on it are strongly correlated with the scores from the ordinal models, as shown in the figure below. The '𝜏' symbols in the figure stand for the Kendall rank correlation coefficient, while the 'r' symbols stand for the Pearson correlation coefficient. I used the Kendall correlation because it can capture nonlinear divergences between the measures; the Pearson coefficient is the familiar linear correlation. The "evenly spaced" scores aren't totally in agreement with the scores from the ordinal models, but they are close enough that I doubt prior research that used the "evenly spaced" scores to measure quality was misled by this choice.

Correlations between quality measures show that the different approaches to measuring quality are quite similar. “Evenly spaced” uses the weighted sum of the ORES scores with handpicked coefficients [0,1,2,3,4,5]. Lower values of Kendall’s 𝜏, a nonparametric rank correlation statistic, compared to Pearson’s 𝑟 suggest nonlinear differences between the weighted sum and the other measures.

Conclusions

Measuring article quality in one continuous dimension is a valuable tool for studying the peer production of information goods because it provides granularity and is amenable to statistical analysis. Prior approaches extended ORES article quality prediction into a continuous measure under the "evenly spaced" assumption. I used ordinal regression models to transform the ORES predictions into a continuous measure of quality that is interpretable as a probability distribution over article quality levels, provides an account of its own uncertainty, and does not assume that quality levels are "evenly spaced." Calibrating the models to the chosen unit of analysis improves accuracy for research applications. I recommend that future work adopt this approach when article quality is an independent variable in a statistical analysis.

My paper also includes a number of other tidbits about the uncertainty of different quality measures and the importance of feedback between measurement and knowledge in the scientific process, and it demonstrates model calibration.


A preprint of the paper is available here. The paper has been accepted to OpenSym 2021 and will be presented at the virtual conference on September 17th. A video of the presentation is available here. A dataverse repository containing code and data from the project is available here.

Dogfooding

This quarter, I am teaching a graduate seminar called “The Practice of Scholarship” that is required for second-year students in the Northwestern MTS and TSB programs. Following Mako’s lead, I am using the Community Data Science Collective wiki to host the (editable) syllabus. In other words, I am eating to my heart’s content.

dog food.
© 2006 chrismetcalfTV (cc-by-2.0)

We had our first class session yesterday and it went really well. The goal for the quarter is for every student to prepare a manuscript for submission to a peer reviewed venue. I told the students that the course will serve as a hybrid writing boot camp and extended group therapy session. There will be much workshopping and iteration and sharing of feelings. There will also be polite, friendly, and unyielding pressure to produce scholarly work of exceptional quality.

In keeping with the wikified ethos, much of the course schedule remains TBD at this point, so please drop me a line with comments, suggestions, or pointers to great readings that brilliant, interdisciplinary, empirical social scientists and HCI researchers like my students would appreciate.