FOSSY Wrap-Up: Shoji Kajita on Research Data Management Skills Development Leveraged by an Open Source Portfolio

Welcome to part 5 of our 7-part series reviewing all the great talks we were fortunate enough to host during the Science of Community track at this year’s FOSSY.

In this talk, Dr. Kajita introduced us to the work being done as part of the Apereo Foundation (formerly JA-SIG/Sakai) to create FOSS platforms that serve as academic and administrative infrastructure in higher education. Research data management is a skill that emerging scholars must learn to do modern quantitative research, and this skill can be scaffolded and tracked via the Karuta portfolio tool.

Watch the talk HERE, learn more about Karuta HERE, and learn more about Dr. Kajita HERE.

FOSSY Wrap-Up: Kaylea Champion’s Lightning Talk on Undermaintained Packages

Welcome to part 4 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Kaylea presented on her new research project to identify how packages come to be undermaintained, in particular investigating assumptions that it’s all about “the old stuff” — old packages, old languages. It turns out that’s only part of the story — older packages and software written in older languages do tend to be undermaintained, but old packages in old languages — the tried and true, as it were — do relatively well!

Watch the talk HERE and learn more about Kaylea’s work HERE.

The State of Wikimedia Research, 2022–2023

Wikimania, the annual global conference of the Wikimedia movement, took place in Singapore last month. For the first time since 2019, the conference was held in person again. It was attended by over 670 people in person and more than 1,500 remotely.

At the conference, Benjamin Mako Hill, Tilman Bayer, and Miriam Redi presented “The State of Wikimedia Research: 2022–2023”, an overview of scholarship and academic research on Wikipedia and other Wikimedia projects from the last year. This resumed an annual Wikimania tradition started by Mako back in 2008 as a graduate student, aiming to provide “a quick tour … of the last year’s academic landscape around Wikimedia and its projects geared at non-academic editors and readers.” With hundreds of research publications every year featuring Wikipedia in their title (and more recently, Wikidata too), it is of course impossible to cover all important research results within one hour. Hence our presentation aimed to identify a set of important themes that attracted researchers’ attention during the past year, and illustrate each theme with a brief “research postcard” summary of one particular publication. Unfortunately, Miriam was not able to be in Singapore to present.

This year’s presentation focused on seven such research themes:

Theme 1. Generative AI and large language models
The boom in generative AI and LLMs triggered by the release of ChatGPT has affected Wikimedia research deeply. As an example, we highlighted a preprint that used Wikipedia to enhance the factual accuracy of a conversational LLM-based chatbot.

Theme 2. Wikidata as a community
While Wikidata is the subject of over 100 published studies each year, the vast majority of these have been primarily concerned with the project’s content as a database which scientists use to advance research about e.g. the semantic web, knowledge graphs and ontology management. This year also saw several papers studying Wikidata as a community, including a study of how Wikidata contributors use talk pages to coordinate (preprint).

Theme 3. Cross-project collaboration
Beyond Wikipedia and Wikidata, Wikimedia sister projects have attracted comparatively little researcher attention over the years. We highlighted one of the very first research publications in the social sciences that studied Wikimedia Commons, the free media repository, examining how it interconnects with English Wikipedia.

Theme 4. Rules and governance
Research on rules and governance continues to attract researchers’ attention. Here, we featured a new paper by a political scientist that documented important changes in how English Wikipedia’s NPoV (Neutral Point of View) policy has been applied over time, and used this to advance an explanation for political change in general.

Theme 5. Wikipedia as a tool to measure bias
While Wikimedia research has often focused on Wikipedia’s own biases, researchers have also turned to Wikipedia to construct baselines against which to measure and mitigate biases elsewhere. We highlighted an example of Meta’s AI researchers doing this for their Llama 2 large language model.

Theme 6. Measuring Wikipedia’s own content bias
Despite the huge interest in content gaps along dimensions such as race and gender, systematic approaches to measuring them have not been as frequent as one might hope. We featured a paper that advances our understanding in this regard, presents a useful method, and is also one of the first to study differences in intersectional identities.

Theme 7. Critical and humanistic approaches
Although most of the published research work related to Wikipedia is based in the sciences or engineering disciplines, a growing body of humanities scholarship can offer important insights as well. We highlighted a recent humanities paper about the measuring of race and ethnicity gaps on Wikipedia, which focused in particular on gaps in such measurements themselves, placing them into a broader social context.

We invite you to watch the video recording on YouTube or our self-hosted media server, or peruse the annotated slides from the talk.

Again, this work represents just a tiny fraction of what has been published about Wikipedia in the last year. In particular, we avoided research that was presented elsewhere in Wikimania’s research track.

To keep up to date with the Wikimedia research field throughout the year, consider subscribing to the monthly Wikimedia Research Newsletter and its associated Twitter and Mastodon feeds which are maintained by Miriam and Tilman.


This post was written by Benjamin Mako Hill and Tilman Bayer.

FOSSY Wrap-Up: Anita Sarma’s Lightning Talk on Inclusion Bugs

Welcome to part 3 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Dr. Anita Sarma gave us an excellent introduction to her and her team’s work on understanding how to make FOSS more inclusive by identifying errors in user interaction design.

Matt Gaughan delivered a rapid introduction to his dataset highlighting the numerous places where the Linux Kernel is using unsafe memory practices.

You can watch the talk HERE and learn more about Dr. Sarma HERE.

FOSSY Wrap-Up: Sophia Vargas on Proactive Metrics to Combat Maintainer Burnout

Welcome to part 1 of a 7-part series spotlighting the excellent talks we were fortunate enough to host during the Science of Community track at FOSSY 23!

Sophia Vargas presented ‘Can we combat maintainer burnout with proactive metrics?’ In this talk, Sophia takes us through her extensive investigations across multiple projects to weigh the value of different metrics to anticipate when people might be burning out, including some surprising instances where metrics we might think are helpful really don’t tell us what we think they do.

You can watch the talk HERE and learn more about Sophia’s work HERE.

FOSSY Fun, Finished

The CDSC hosted the Science of Community Track on July 15th at FOSSY this year — it was an awesome day of learning and conversation with a fantastic group of senior scholars, industry partners, students, practitioners, community members, and more! We are so grateful and eager to build on the discussions we began.

If you missed the sessions, watch this space! Most sessions were recorded, and we’ll post links and materials as they’re released.

Special thanks to Molly de Blanc for all the long-distance organizing work; to Shauna Gordon-McKeon for stepping in at the very last minute to help share some closing thoughts on the Science of Community track; and to the FOSSY organizing team for convening such a warm, welcoming inaugural event (indeed, the warmth was palpable as it nearly hit 100 °F on Friday and Saturday in Portland).

One tangible result of a free software conference: new laptop stickers!

Meet us at FOSSY!

The Free and Open Source Software Yearly conference (FOSSY) is in less than a week and we will be there!

We will be running the Science of Community track on Saturday July 15.

Two photos. In one is Kaylea Champion, who has purple hair and a blue shirt. In the other are Sejal Khatri, Benjamin Mako Hill, and Aaron Shaw.
Kaylea Champion, and Benjamin Mako Hill and Aaron Shaw with Sejal Khatri (who won’t be at FOSSY)

The Science of Community track is inspired by the CDSC Science of Community Dialogues, which aim to bring together practitioners and researchers to discuss scholarly work that is relevant to the efforts of practitioners. As researchers, we get so much from the communities we work with and study, and we want them to also learn from the research they so generously take part in. While the Dialogues cover a broad range of topics and communities, FOSSY presentations focus on how that work relates to free and open source software communities, projects, and practitioners.

At FOSSY, we will have a number of really amazing researchers presenting their work. We wanted to share some highlights from the schedule.

Sophia Vargas, from Google’s Open Source Programs Office, will be presenting on how metrics can help us understand contributor burnout. Professor Shoji Kajita, from Kyoto University, will discuss research data management for FOSS communities. Mariam Guizani, from Oregon State University, will cover research on the why and how of corporate participation in FOSS. We will additionally have lightning talks by Adam Hyde, Anita Sarma, Shauna Gordon-McKeon, and incoming Northwestern Ph.D. student Matthew Gaughan.

We are really excited about our workshop “Let’s Get Real: Putting Research Findings Into Practice.” This workshop, designed for FOSS contributors and practitioners, will help guide you on how to get the most out of the incredible research on and relevant to FOSS. If you want to learn how to navigate the sheer volume of interesting research work happening or how to understand what it means, this is the session for you! Our workshop will be led by Kaylea Champion and Professors Aaron Shaw and Benjamin Mako Hill. You can read more on our wiki.

Due to scheduling issues, Eriol Fox will be presenting their talk, “Community lead user research and usability in Science and Research OSS: What we learned,” in the Wildcard Track. We recommend going!

We hope to see you at FOSSY. Even if you can’t make it to our sessions, we’ll be at the conference so stop by and say hello!

Community Data Science Collective at ICA 2023

The International Communication Association (ICA)’s 73rd annual conference is coming up soon. This year, the conference takes place in Toronto, Canada, and a subset of our collective is showing up to present work in person. We are looking forward to meeting up, talking about research, and hanging out together!

ICA takes place from Wednesday, May 24, to Monday, May 29, and CDSC members will take on various roles in a number of different conference programs, including chairing sessions, giving presentations, and co-organizing a preconference. Our participation is listed below in chronological order, so feel free to join us!

Thursday, May 25

We start off with a presentation by Yibin Fan on Thursday at 10:45 am in the International Living Learning Centre of Toronto Metropolitan University at the Political Communication Graduate Student Preconference. In a panel on The Causes and Outcomes of Political Polarization and Violence, Yibin will present a paper entitled “Does Incidental Political Discussion Make Political Expression Less Polarized? Evidence From Online Communities”.

Later on Thursday, another preconference, New Frontiers in Global Digital Inequalities Research, will take place in M – Room Linden (Sheraton) from 1:30pm to 5pm. Floor Fiers will serve as a co-chair alongside a number of scholars from various international academic institutions, and they will give a presentation on “The Gig Economy: A Site of Opportunity Vs. a Site of Risk?”. This preconference is affiliated with the Communication and Technology Division and the Communication Law & Policy Division of ICA.

Friday, May 26

On Friday, Carl Colglazier will present in a panel on Disinformation, Politics and Social Media at 3:00pm in M – Room York (Sheraton). The presentation is entitled “The Effects of Sanctions on Decentralized Social Networking Sites: Quasi-Experimental Evidence From the Fediverse”.

Sunday, May 28

On Sunday we will be taking on different roles in various sessions. In the morning, Yibin Fan will serve as the moderator for a research paper panel on Political Deliberation and Expression, affiliated with the Political Communication Division, at 9:00 am in 2 – Room Simcoe (Sheraton).

Next comes our highlight, a paper that won the Top Paper Award from the Computational Methods Division: Nathan TeBlunthuis will present a methodological research paper entitled “Automated Content Misclassification Causes Bias in Regression: Can We Fix It? Yes We Can!” in a panel on Debate, Deliberation and Discussion in the Public Sphere at noon in M – Room Maple East (Sheraton). This is a project on which Nate collaborates with Valerie Hase at LMU Munich and Chung-hong Chan at University of Mannheim. Congratulations to them for getting the Top Paper Award!

Last but not least, we will finish off our ICA 2023 by seeing our faculty members serve as chairs and discussants in the Computational Methods Research Escalator Session at 1:30pm in M – Room Maple West (the same room as Nate’s presentation!). The session is for junior scholars who are inexperienced in publishing to connect with more senior researchers in the field. As the Call for Papers by the Computational Methods Division says, “Research escalator papers provide an opportunity for less experienced researchers to obtain feedback from more senior scholars about a paper-in-progress, with the goal of making the paper ready for submission to a conference or journal.” Aaron Shaw, together with Matthew Weber at Rutgers University, will serve as the chairs for the session. Benjamin Mako Hill and Jeremy Foote, together with a bunch of scholars from other institutions, will serve as discussants to help improve the research presented in the session.

We look forward to sharing our research and connecting with you at ICA!

Community Dialogue on Digital Inequalities

Join the Community Data Science Collective (CDSC) for our 5th Science of Community Dialogue! This Community Dialogue will take place on May 19 at 10:00 am PDT (18:00 UTC). This Dialogue focuses on digital inequalities and online community participation. Professor Hernan Galperin (University of Southern California) will join Floor Fiers (Northwestern University) to present recent research on topics including:

  • Inequalities in online access and participation
  • Differentiated participation in online communities
  • Causes and consequences of online inequalities
  • Digital skills as a barrier to online participation
  • Combating digital discrimination

A full session description is on our website. Register online.

What is a Dialogue?

The Science of Community Dialogue Series is a series of conversations between researchers, experts, community organizers, and other people who are interested in how communities work, collaborate, and succeed. You can watch this short introduction video with Aaron Shaw.

What is the CDSC?

The Community Data Science Collective (CDSC) is an interdisciplinary research group made up of faculty and students at the University of Washington Department of Communication, the Northwestern University Department of Communication Studies, the Carleton College Computer Science Department, and the Purdue University School of Communication.

Learn more

If you’d like to learn more or get future updates about the Science of Community Dialogues, please join the low-volume announcement list.