Let’s talk about taboo! A new paper on how taboo shapes activity on Wikipedia

Taboo subjects, such as sexuality and mental health, are as important to discuss as they are difficult to raise in conversation. Although many people turn to online resources for information on taboo subjects, censorship and low-quality information are common in search results. In work published at CSCW this week, we present a series of analyses that describe how taboo shapes the process of collaborative knowledge building on English Wikipedia. Our work shows that articles on taboo subjects are much more popular and are subject to more vandalism than articles on non-taboo topics. Surprisingly, we also found that they were edited more often and were of higher quality! We also found that contributors to taboo articles did less to hide their identity than we expected.

Short video of our presentation of the work, given at Wikimania in August 2023.

The first challenge we faced in conducting our study was building a list of Wikipedia articles on taboo topics. This was challenging because while taboo is deeply cultural and can seem natural, our individual perspectives on what is and isn’t taboo are privileged and limited. In building our list, we wanted to avoid relying on our own intuition about what qualifies as taboo. Our approach was to make use of an insight from linguistics: people develop euphemisms as ways to talk about taboos. Think about all the euphemisms we’ve devised for death, or sex, or menstruation, or mental health. Using figurative language lets us distance ourselves from the pollution of a taboo.

We used this insight to build a new machine learning classifier based on dictionary definitions in English Wiktionary. If a ‘sense’ of a word was tagged as a euphemism, we treated the words in the definition as indicators of taboo. The end result of this analysis is a series of words and phrases that most powerfully differentiate taboo from non-taboo. We then did a simple match between those words and phrases and Wikipedia article titles. We built a comparison sample of articles whose titles are words that, like our taboo articles, appear in Wiktionary definitions.
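To make the idea concrete, here is a toy sketch of the general pipeline described above: score words by how strongly they are associated with euphemism-tagged definitions, then match the highest-scoring words against article titles. This is not the classifier from the paper. The sample senses, the stopword list, the smoothed log-odds score, the threshold, and the function names are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for Wiktionary senses:
# (definition text, whether the sense is tagged as a euphemism).
SENSES = [
    ("to die", True),
    ("to have sex", True),
    ("death", True),
    ("a toilet", True),
    ("to move quickly on foot", False),
    ("a piece of furniture with a flat top", False),
    ("a large body of water", False),
    ("to have in one's possession", False),
]

# A tiny, made-up stopword list so function words do not dominate.
STOPWORDS = {"a", "to", "of", "on", "in", "with", "the", "one's", "have"}


def tokenize(text):
    """Lowercase the text and return its non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def taboo_scores(senses, smoothing=1.0):
    """Smoothed log-odds of each word appearing in euphemism-tagged
    definitions versus all other definitions (an assumed scoring rule)."""
    euph, other = Counter(), Counter()
    for definition, is_euphemism in senses:
        (euph if is_euphemism else other).update(tokenize(definition))
    vocab = set(euph) | set(other)
    n_euph = sum(euph.values()) + smoothing * len(vocab)
    n_other = sum(other.values()) + smoothing * len(vocab)
    return {
        w: math.log((euph[w] + smoothing) / n_euph)
        - math.log((other[w] + smoothing) / n_other)
        for w in vocab
    }


def match_titles(titles, scores, threshold=0.5):
    """Return the titles that exactly match a high-scoring word."""
    taboo_terms = {w for w, s in scores.items() if s > threshold}
    return [t for t in titles if t.lower() in taboo_terms]


scores = taboo_scores(SENSES)
matched = match_titles(["Death", "Furniture", "Sex", "Water"], scores)
print(matched)
```

With real data, the senses would come from a Wiktionary dump and the titles from Wikipedia. A comparison sample could then be drawn from titles that match low-scoring words.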

We used this new dataset to test a series of hypotheses about how taboo shapes collaborative production in Wikipedia. Our initial hypotheses were based on the idea that taboo information is often in high demand but that Wikipedians might be reluctant to associate their names (or usernames) with taboo topics. The result, we argued, would be articles that were in high demand but of low quality. What we found was that taboo articles are thriving on Wikipedia! In summary, compared with non-taboo articles, we found that:

  • Taboo articles are more popular (as expected).
  • Taboo articles receive more contributions (contrary to expectations).
  • Taboo articles receive more low-quality contributions (as expected).
  • Taboo articles are higher quality (contrary to expectations).
  • Taboo article contributors are more likely to contribute without an account (as expected) and have less experience (as expected), but account holders are more likely to make themselves identifiable by having a user page, disclosing their gender, and making themselves emailable (all three of these are contrary to expectations!).

For details, visualizations, statistics, and more, we hope you’ll take a look at our paper. If you are attending CSCW in October 2023, we also hope you’ll come to our CSCW presentation in Minneapolis!


The full citation for the paper is: Champion, Kaylea, and Benjamin Mako Hill. 2023. “Taboo and Collaborative Knowledge Production: Evidence from Wikipedia.” Proceedings of the ACM on Human-Computer Interaction 7 (CSCW2): 299:1-299:25. https://doi.org/10.1145/3610090.

We have also released replication materials for the paper, including all the data and code used to conduct the analyses.

This blog post and the paper it describes are collaborative work by Kaylea Champion and Benjamin Mako Hill.

FOSSY Wrap-up Bonus – Eriol Fox on User Research

Welcome to a bonus round of our series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Eriol Fox presented their talk, “Community lead user research and usability in Science and Research OSS: What we learned.” (Due to scheduling issues, this landed in the Wildcard track, but it was definitely on-topic for Science of Community!) Eriol introduced us to their work exploring how scientists and researchers think about open source software, including differences in norms and motivations as well as challenges around the structure of labor. They also brought along copies of their 4 super cool zines from this project!

You can watch the talk HERE and learn more about Eriol’s work HERE.

FOSSY Wrap-Up: CDSC presents Interactive Session — Let’s Get Real: Putting Research Findings into Practice

Welcome to part 7 of a 7-part series spotlighting presentations from the Science of Community track at FOSSY 23!

In this interactive session, Dr. Benjamin Mako Hill, Dr. Aaron Shaw, and Kaylea Champion hosted a series of conversations with FOSS community members about finding research, putting it to use, and building partnerships between researchers and communities!

This talk was (intentionally!) not recorded, but we’ve synthesized the resources we shared into this wiki page.

FOSSY Wrap-Up: Mariam Guizani on Rules of Engagement: Why and How Companies Participate in OSS

Welcome to part 6 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

In this talk, Dr. Guizani shared her work to understand the motivation for companies to participate in open source software development, encompassing the perspective of both small and large firms.

You can watch the talk HERE and learn more about Dr. Guizani HERE.

FOSSY Wrap-Up: Shoji Kajita on Research Data Management Skills Development Leveraged by an Open Source Portfolio

Welcome to part 5 of our 7-part series reviewing all the great talks we were fortunate enough to host during the Science of Community track at this year’s FOSSY.

In this talk, Dr. Kajita introduced us to the work being done as part of the Apereo Foundation (formerly JA-SIG/Sakai) to create FOSS platforms that serve as academic and administrative infrastructure in higher education. Research data management is a skill that emerging scholars must learn in order to do modern quantitative research, and this skill can be scaffolded and tracked via the Karuta portfolio tool.

Watch the talk HERE, learn more about Karuta HERE, and learn more about Dr. Kajita HERE.

FOSSY Wrap-Up: Kaylea Champion’s Lightning Talk on Undermaintained Packages

Welcome to part 4 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Kaylea presented on her new research project to identify how packages come to be undermaintained, in particular investigating assumptions that it’s all about “the old stuff” — old packages, old languages. It turns out that’s only part of the story — older packages and software written in older languages do tend to be undermaintained, but old packages in old languages — the tried and true, as it were — do relatively well!

Watch the talk HERE and learn more about Kaylea’s work HERE.

FOSSY Wrap-Up: Anita Sarma’s Lightning Talk on Inclusion Bugs

Welcome to part 3 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Dr. Anita Sarma gave us an excellent introduction to her and her team’s work on understanding how to make FOSS more inclusive by identifying errors in user interaction design.

Matt Gaughan delivered a rapid introduction to his dataset highlighting the numerous places where the Linux Kernel is using unsafe memory practices.

You can watch the talk HERE and learn more about Dr. Sarma HERE.

FOSSY Wrap-up – Sophia Vargas on Proactive Metrics to Combat Maintainer Burnout

Welcome to part 1 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Sophia Vargas presented ‘Can we combat maintainer burnout with proactive metrics?’ In this talk, Sophia takes us through her extensive investigations across multiple projects to weigh the value of different metrics to anticipate when people might be burning out, including some surprising instances where metrics we might think are helpful really don’t tell us what we think they do.

You can watch the talk HERE and learn more about Sophia’s work HERE.

FOSSY Fun, Finished

The CDSC hosted the Science of Community Track on July 15th at FOSSY this year — it was an awesome day of learning and conversation with a fantastic group of senior scholars, industry partners, students, practitioners, community members, and more! We are so grateful and eager to build on the discussions we began.

If you missed the sessions, watch this space! Most sessions were recorded, and we’ll post links and materials as they’re released.

Special thanks to Molly de Blanc for all the long-distance organizing work; to Shauna Gordon McKeon for stepping in at the very last minute to help share some closing thoughts on the Science of Community track; and to the FOSSY organizing team for convening such a warm, welcoming inaugural event (indeed, the warmth was palpable as it nearly hit 100° F on Friday and Saturday in Portland).

One tangible result of a free software conference: new laptop stickers!