FOSSY 2024: Call for Proposals!

Does your work touch open source, communities, technology, or cooperation? Do you want to help bridge the gaps between research and practice? Join us at FOSSY! The Free and Open Source Software Yearly conference (FOSSY) is back this summer and the call for proposals is open!

We’ll be running the Science of Community track, and are looking for presenters to speak to an audience of FOSS practitioners, developers, community organizers, contributors, and people just generally into and curious about FOSS. 

The Science of Community track is inspired by the CDSC Science of Community Dialogues, which bring together practitioners and researchers to discuss scholarly work that is relevant to the efforts of practitioners. As researchers, we benefit so much from the communities we work with and study and we want them to also learn from the research they so generously take part in. While the Dialogues cover a broad range of topics and communities, FOSSY presentations will focus on how that work relates to free and open source software communities, projects, and practitioners.

FOSSY is a low-stress opportunity to talk to people your work can benefit. For topics, consider presenting implications from past papers, synthesizing work from your field overall, or floating ideas and problems (lightning talks! long talks! short talks!). A full track description and answers to common questions are available on our wiki.

The CFP deadline is June 14th and uses this form.

Decentralizing Social Media: The challenges and opportunities of federated systems

A Virtual Thought Leader Dialogue on May 23, from 4 – 5:15 p.m. CST. Register here to join.

Based on File:Decentralization.jpg, by Adam Aladdin, CC-BY-SA 3.0

How can we create more trustworthy and accountable social media that support diverse communities? Decentralized social media (systems that allow users to connect and communicate across independent services like Mastodon or Bluesky) offer promising alternatives to centralized commercial platforms like Instagram, TikTok, or X. However, decentralized social media also face urgent design challenges, especially when it comes to content integrity, protecting community trust and safety, and forging collective governance. What happens when there is no central authority to review posts or ban abusive users? How can networks of autonomous communities build and adopt systems to govern effectively? What critical infrastructure can prevent the pervasive harms of existing social media and support the integrity of public discourse?

Join Northwestern’s Center for Human-Computer Interaction + Design (HCI+D) and the Community Data Science Collective (CDSC) for an engaging conversation about the challenges and opportunities of decentralized social media on May 23rd from 4 to 5:15 p.m. CST. This panel features designers, leaders, and researchers involved in federated social media and will address opportunities for effective design and governance in this space.

Panelists include Jaz-Michael King, Bryan Newbold, and Christine Lemmer-Webber. Short presentations will be followed by discussion and Q&A moderated by Aaron Shaw (Northwestern HCI+D, CDSC). 

Moderator: Aaron Shaw, photograph by Nikki Ritcher Photography

Aaron Shaw is Associate Professor of Communication Studies and Sociology (by courtesy) at Northwestern University and a Faculty Associate of the Berkman Klein Center for Internet and Society at Harvard University. He is a co-founder of the Community Data Science Collective. At Northwestern, he is also affiliated with the Center for Human-Computer Interaction + Design (HCI+D), the Institute for Policy Research, the Buffett Institute for Global Affairs, and the Public Affairs Residential College.

Speaker: Christine Lemmer-Webber, Executive Director of Spritely Networked Communities Institute

Christine has devoted her life to advancing user freedom. Realizing that the federated social web was fractured by a variety of incompatible protocols, she co-authored and shepherded ActivityPub’s standardization. She has also contributed to many other free and open source projects, including co-founding MediaGoblin.

Christine established the open source Spritely Project to solve known problems in existing centralized and decentralized social media platforms and to re-imagine the way we build networked applications – work that now continues here at the institute under her guidance as Executive Director.

Speaker: Jaz-Michael King, Executive Director of IFTAS (Federated Trust & Safety)

An accomplished professional with an extraordinary record of enabling data-driven decisions, developing innovative products, creating new business opportunities, driving strong operational performance, and building high-performing, agile teams.
Highly versatile, with extensive experience in data and technology from a privacy, improvement, and reporting perspective, Jaz has a proven record in building solutions for non-profit programs. 
As Executive Director of IFTAS, Jaz is now focused on independent, open Social Web activities, with the aim of creating #BetterSocialMedia by supporting trust and safety at scale in federated social media networks.

Speaker: Bryan Newbold, Protocol Engineer at Bluesky

Bryan works at Bluesky, a startup company building a federated social media protocol called “atproto”. Until a few months ago he worked at the Internet Archive collecting scientific research datasets and publications, and created scholar.archive.org. And before that he worked on infrastructure at Stripe, attended the Recurse Center in New York City, and built Atomic Magnetometers for a small New Jersey company called Twinleaf.

Over that same time period, Bryan climbed up and down the ladder of abstraction, obtaining an undergraduate degree in physics (at MIT), operating under-ice robots in Antarctica, developing open hardware lab instrumentation for large-scale brain probing (at LeafLabs), cataloging hundreds of millions of electronics components (at Octopart), and improving production service reliability at Stripe (a financial infrastructure start-up).

Bryan is a transplant from the East Coast and enjoys the road biking, large trees, generous salads, used bookstores, and world-class tech non-profits. This will be his third year serving on the Code of Conduct team at DWeb Camp.

Interested in attending? Register here to join!

CDSC welcomes Madison Deyo!

Madison Deyo has recently joined the CDSC as a Program Coordinator and we couldn’t be more thrilled to welcome her to the team!

Madison Deyo headshot.

Madison is based at Northwestern. With the CDSC, Madison’s role includes a mix of event planning and coordination; outreach and communications; and supporting the operations of the group. She also works with the Northwestern Center for Human-Computer Interaction + Design. Madison brings experience working with community-based non-profits in several different capacities.

Madison currently lives in Chicago, and grew up in Wisconsin, where she attended the University of Wisconsin-Madison. There, she received her B.S. in Art (with a focus on illustration) and Communications: Radio-TV-Film. In addition to her position at Northwestern, Madison also works as a freelance artist designing mead labels, tattoos, and occasionally album/EP covers. You can check out her portfolio.

The State of Wikimedia Research, 2022–2023

Wikimania, the annual global conference of the Wikimedia movement, took place in Singapore last month. For the first time since 2019, the conference was held in person again. It was attended by over 670 people in person and more than 1,500 remotely.

At the conference, Benjamin Mako Hill, Tilman Bayer, and Miriam Redi presented “The State of Wikimedia Research: 2022–2023”, an overview of scholarship and academic research on Wikipedia and other Wikimedia projects from the last year. This resumed an annual Wikimania tradition started by Mako back in 2008 as a graduate student, aiming to provide “a quick tour … of the last year’s academic landscape around Wikimedia and its projects geared at non-academic editors and readers.” With hundreds of research publications every year featuring Wikipedia in their title (and more recently, Wikidata too), it is of course impossible to cover all important research results within one hour. Hence our presentation aimed to identify a set of important themes that attracted researchers’ attention during the past year, and illustrate each theme with a brief “research postcard” summary of one particular publication. Unfortunately, Miriam was not able to be in Singapore to present.

This year’s presentation focused on seven such research themes:

Theme 1. Generative AI and large language models
The boom in generative AI and LLMs triggered by the release of ChatGPT has affected Wikimedia research deeply. As an example, we highlighted a preprint that used Wikipedia to enhance the factual accuracy of a conversational LLM-based chatbot.

Theme 2. Wikidata as a community
While Wikidata is the subject of over 100 published studies each year, the vast majority of these have been primarily concerned with the project’s content as a database which scientists use to advance research about e.g. the semantic web, knowledge graphs and ontology management. This year also saw several papers studying Wikidata as a community, including a study of how Wikidata contributors use talk pages to coordinate (preprint).

Theme 3. Cross-project collaboration
Beyond Wikipedia and Wikidata, Wikimedia sister projects have attracted comparatively little researcher attention over the years. We highlighted one of the very first research publications in the social sciences that studied Wikimedia Commons, the free media repository, examining how it interconnects with English Wikipedia.

Theme 4. Rules and governance
Research on rules and governance continues to attract researchers’ attention. Here, we featured a new paper by a political scientist that documented important changes in how English Wikipedia’s NPoV (Neutral Point of View) policy has been applied over time, and used this to advance an explanation for political change in general.

Theme 5. Wikipedia as a tool to measure bias
While Wikimedia research has often focused on Wikipedia’s own biases, researchers have also turned to Wikipedia to construct baselines against which to measure and mitigate biases elsewhere. We highlighted an example of Meta’s AI researchers doing this for their Llama 2 large language model.

Theme 6. Measuring Wikipedia’s own content bias
Despite the huge interest in content gaps along dimensions such as race and gender, systematic approaches to measuring them have not been as frequent as one might hope. We featured a paper that advanced our understanding in this regard, presented a useful method, and is also one of the first to study differences in intersectional identities.

Theme 7. Critical and humanistic approaches
Although most of the published research work related to Wikipedia is based in the sciences or engineering disciplines, a growing body of humanities scholarship can offer important insights as well. We highlighted a recent humanities paper about the measuring of race and ethnicity gaps on Wikipedia, which focused in particular on gaps in such measurements themselves, placing them into a broader social context.

We invite you to watch the video recording on YouTube or our self-hosted media server or peruse the annotated slides from the talk.

Again, this work represents just a tiny fraction of what has been published about Wikipedia in the last year. In particular, we avoided research that was presented elsewhere in Wikimania’s research track.

To keep up to date with the Wikimedia research field throughout the year, consider subscribing to the monthly Wikimedia Research Newsletter and its associated Twitter and Mastodon feeds which are maintained by Miriam and Tilman.


This post was written by Benjamin Mako Hill and Tilman Bayer.

Join us! Call for Ph.D. Applications and Public Q&A Event

It’s Ph.D. application season and the Community Data Science Collective is recruiting! As always, we are looking for talented people to join our research group. Applying to one of the Ph.D. programs that the CDSC faculty members are affiliated with is a great way to get involved in research on communities, collaboration, and peer production.

Because we know that you may have questions for us that are not answered in this webpage, we will be hosting a panel discussion and Q&A about the CDSC and Ph.D. opportunities on October 20 at 7:30pm UTC (3:30pm US Eastern, 2:30pm US Central, 12:30pm US Pacific). You can register online.

This post provides a very brief run-down on the CDSC, the different universities and Ph.D. programs our faculty members are affiliated with, and some general ideas about what we’re looking for when we review Ph.D. applications.

a four by four grid of CDSC members making "dynamic" poses
Group photo of the collective at a recent virtual retreat.

What is the Community Data Science Collective?

The Community Data Science Collective (or CDSC) is a joint research group of (mostly quantitative) empirical social scientists and designers pursuing research about the organization of online communities, peer production, and learning and collaboration in social computing systems. We are based at Northwestern University, the University of Washington, Carleton College, Purdue University, and a few other places. You can read more about us and our work on our research group blog and on the collective’s website/wiki.

What are these different Ph.D. programs? Why would I choose one over the other?

This year the group includes three faculty principal investigators (PIs) who are actively recruiting PhD students: Aaron Shaw (Northwestern University), Benjamin Mako Hill (University of Washington in Seattle), and Jeremy Foote (Purdue University). Each of these PIs advises Ph.D. students in Ph.D. programs at their respective universities. Our programs are each described below.

Although we often work together on research and serve as co-advisors to students in each other’s projects, each faculty person has specific areas of expertise and interests. The reasons you might choose to apply to one Ph.D. program or to work with a specific faculty member could include factors like your previous training, career goals, and the alignment of your specific research interests with our respective skills.

At the same time, a great thing about the CDSC is that we all collaborate and regularly co-advise students across our respective campuses, so the choice to apply to or attend one program does not prevent you from accessing the expertise of our whole group. But please keep in mind that our different Ph.D. programs have different application deadlines, requirements, and procedures!

Who is actively recruiting this year?

If you are interested in applying to any of the programs, we strongly encourage you to reach out to the specific faculty in that program before submitting an application.

Ph.D. Advisors

A photo of Benjamin Mako Hill. He is wearing a pink shirt.
Benjamin Mako Hill

Benjamin Mako Hill is an Associate Professor of Communication at the University of Washington. He is also an Adjunct Assistant Professor at UW’s Department of Human-Centered Design and Engineering (HCDE), Computer Science and Engineering (CSE) and Information School. Although many of Mako’s students are in the Department of Communication, he has also advised students in all three other departments, though he typically has more limited ability to admit students into those programs on his own and usually does so with a co-advisor in those departments. Mako’s research focuses on population-level studies of peer production projects, computational social science, efforts to democratize data science, and informal learning. Mako has also put together a webpage for prospective graduate students with some useful links and information.

A photo of Aaron Shaw. He is wearing a black shirt.
Aaron Shaw. (Photo credit: Nikki Ritcher Photography, cc-by-sa)

Aaron Shaw is an Associate Professor in the Department of Communication Studies at Northwestern. This year, he’s also the “Scholar in Residence” for King County, Washington. In terms of Ph.D. programs, Aaron’s primary affiliations are with the Media, Technology and Society (MTS) and the Technology and Social Behavior (TSB) Ph.D. programs (please note: the TSB program is a joint degree between Communication and Computer Science). Aaron also has a courtesy appointment in the Sociology Department at Northwestern, but he has not directly supervised any Ph.D. advisees in that department (yet). Aaron’s current projects focus on comparative analysis of the organization of peer production communities and social computing projects, participation inequalities in online communities, and collaborative organizing in pursuit of public goods.

A photo of Jeremy Foote. He is wearing a grey shirt.
Jeremy Foote

Jeremy Foote is an Assistant Professor at the Brian Lamb School of Communication at Purdue University. He is affiliated with the Organizational Communication and Media, Technology, and Society programs. Jeremy’s current research focuses on how individuals decide when and in what ways to contribute to online communities, how communities change the people who participate in them, and how both of those processes can help us to understand which things become popular and influential. Most of his research is done using data science methods and agent-based simulations.

What do you look for in Ph.D. applicants?

There’s no easy or singular answer to this. In general, we look for curious, intelligent people driven to develop original research projects that advance scientific and practical understanding of topics that intersect with any of our collective research interests.

To get an idea of the interests and experiences present in the group, read our respective bios and CVs (follow the links above to our personal websites). Specific skills that we and our students tend to use on a regular basis include consuming and producing social science and/or social computing (human-computer interaction) research; applied statistics and statistical computing, various empirical research methods, social theory and cultural studies, and more.

Formal qualifications that speak to similar skills and show up in your resume, transcripts, or work history are great, but we are much more interested in your capacity to learn, think, write, analyze, and/or code effectively than in your credentials, test scores, grades, or previous affiliations. It’s graduate school and we do not expect you to show up knowing how to do all the things already.

Intellectual creativity, persistence, and a willingness to acquire new skills and problem-solve matter a lot. We think doctoral education is less about executing tasks that someone else hands you and more about learning how to identify a new, important problem; develop an appropriate approach to solving it; and explain all of the above and why it matters so that other people can learn from you in the future. Evidence that you can or at least want to do these things is critical. Indications that you can also play well with others and would make a generous, friendly colleague are really important too.

All of this is to say, we do not have any one trait or skill set we look for in prospective students. We strive to be inclusive along every possible dimension. Each person who has joined our group has contributed unique skills and experiences as well as their own personal interests. We want our future students and colleagues to do the same.

Now what?

Still not sure whether or how your interests might fit with the group? Still have questions? Still reading and just don’t want to stop? Follow the links above for more information. Feel free to send at least one of us an email. We are happy to try to answer your questions and always eager to chat. You can also join our panel discussion on October 20 at 3:30pm ET (UTC-4).

Community Data Science Collective at ICA 2022

The International Communication Association (ICA)’s 72nd annual conference is coming up in just a couple of weeks. This year, the conference takes place in Paris and a subset of our collective is flying out to present work in person. We are looking forward to meeting up, talking research, and eating croissants. À bientôt!

ICA takes place from Thursday, May 26th to Monday, May 30th, and we are presenting a total of ten (!!) times. All presentations given by members of the collective are scheduled between Friday and Sunday.

Friday

We start off with a presentation by Nathan TeBlunthuis on Friday at 11.00 AM, in Room 351 M (Palais des Congres). In a high-density paper session on Computational Approaches to Online Communities, Nate will present a paper entitled “Dynamics of Ecological Adaptation in Online Communities.”

Later that same day, at 3.30 PM in the Amphitheatre Havana (level 3; Palais des Congres), Carl Colglazier will discuss a paper that he collaborated on with Nick Diakopoulos: “Predictive Models in News Coverage of the COVID-19 Pandemic in the U.S.” This paper session is part of the ICA division Journalism Studies.

Saturday

On Saturday, Floor Fiers will present in the paper session “Impression Management Online: FabriCATing An Image.” Their project, which they wrote with Nathan Walter, discusses “Comments on Airbnb and the Potential for Racial Bias” at 2.00 PM in Regency 1 (Hyatt).

Shortly after, that same afternoon, you’ll find two of our poster presentations at 5.00 PM in the Exhibit Hall (Havana; Palais des Congres, level 3). In one of them, Jeremy Foote will discuss his take on “a systems approach to studying online communities.”

The other poster, presented at the same time and place, is by Kaylea Champion and Benjamin Mako Hill on “Resisting Taboo in the Collaborative Production of Knowledge: Evidence from Wikipedia.”

Sunday

Most of our presentations are on the fourth day of the conference. At 9.30 AM, we’ll be presenting in three locations at the same time! First, Floor will discuss their paper “Inequality and Discrimination in the Online Labor Market: a Scoping Review” in Room 311+312 (Palais des Congres). This presentation is part of the paper session “All Things Are Not Equal: CompliCATions From Digital Inequalities.”

Second, Carl will present work on behalf of himself, Aaron Shaw, and Benjamin Mako Hill during a high-density paper session in Room 242A (Palais des Congres). The title of their project is “Extended Abstract: Exhaustive Longitudinal Trace Data From Over 70,000 Wiki.”

Lastly, at the same time in Room 352B (Palais des Congres), Jeremy will present an interview study entitled “What Communication Supports Multifunctional Public Goods in Organizations? Using Agent-Based Modeling to Explore Differential Uses of Enterprise Social Media.” Jeremy’s co-authors on this paper are Jeffrey Treem and Bart van den Hooff.

On Sunday afternoon, at 3.30 PM in Room 311+312 (Palais des Congres), Tiwaladeoluwa Adekunle will talk about a qualitative project she collaborated on with Jeremy, Nate, and Laura Nelson: “Co-Creating Risk Online: Exploring Conceptualizations of COVID-19 Risk in Ideologically Distinct Online Communities.”

We will finish off our ICA 2022 presentations at 5.00 PM in Room 313+314 (Palais des Congres), where Kaylea will present on behalf of Isabella Brown, Lucy Bao, Jacinta Harshe, and Mako. The title of their paper is “Making Sense of Covid-19: Search Results and Information Providers”.

We look forward to sharing our research and connecting with you at ICA!

Why do people participate in small online communities?

The number of unique commenters who commented on subreddits in March 2020, for subreddits that had at least 1 comment in each of the previous 23 months. The “SR” communities are those we drew our interview sample from.

When it comes to online communities, we often assume that bigger is better. Large communities can create robust interactions, have access to a broad and extensive body of experiences, and provide many opportunities for connections. As a result, small communities are often thought of as failed attempts to build big ones. In reality, most online communities are very small and most small communities remain small throughout their lives. If growth and a large number of members are so advantageous, why do small communities not only exist but persist in their smallness?

In a recent research study, we investigated why individuals participate in these persistently small online communities by interviewing twenty participants in small subreddits on Reddit. We asked people about their motivations and explicitly tried to get them to compare their experiences in small subreddits with their experiences in larger subreddits. Below we present three of the main things that we discovered through analyzing our conversations.

Size of consistently active subreddits over time (i.e., those with at least one comment per month from April 2018 to March 2020). Subreddits are grouped by their size in April 2018. Lines represent the median size each month, and ribbons show the first and third quartiles.
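For readers who want to try a similar summary on their own data, the computation behind a figure like this can be sketched in a few lines of Python. This is only an illustrative sketch and not the code used in the study: the input layout (tuples of subreddit, month, and author), the function names, and the size bins are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median, quantiles


def monthly_sizes(comments):
    """Map (subreddit, month) to the number of unique commenters."""
    authors = defaultdict(set)
    for subreddit, month, author in comments:
        authors[(subreddit, month)].add(author)
    return {key: len(names) for key, names in authors.items()}


def consistently_active(sizes, months):
    """Return subreddits with at least one comment in every listed month."""
    subreddits = {subreddit for subreddit, _ in sizes}
    return {s for s in subreddits if all((s, m) in sizes for m in months)}


def size_group(n):
    """Assign a size bin; these thresholds are hypothetical."""
    if n < 10:
        return "1-9"
    if n < 100:
        return "10-99"
    return "100+"


def summarize(comments, months):
    """Return {(group, month): (first quartile, median, third quartile)}.

    Subreddits are grouped by their size in the first month of `months`.
    """
    sizes = monthly_sizes(comments)
    active = consistently_active(sizes, months)
    groups = defaultdict(list)
    for subreddit in active:
        groups[size_group(sizes[(subreddit, months[0])])].append(subreddit)

    summary = {}
    for group, members in groups.items():
        for month in months:
            values = [sizes[(s, month)] for s in members]
            if len(values) >= 2:
                q1, _, q3 = quantiles(values, n=4, method="inclusive")
            else:
                # A single subreddit has no spread to report.
                q1 = q3 = values[0]
            summary[(group, month)] = (q1, median(values), q3)
    return summary
```

Calling summarize with a list of comment records and an ordered list of month labels returns one (first quartile, median, third quartile) triple per size group and month, which is the information the lines and ribbons in the figure convey.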

Informational niches

First, we found that participants saw their small communities as unique spaces for information and interaction. Frequently, small communities are narrower versions or direct offshoots of larger communities. For example, the r/python community is about the programming language Python while the r/learnpython community is a smaller community explicitly for newcomers to the language. 

By being in a smaller, more specific community, our participants described being able to better anticipate the content, audience, and norms: a specific type of content, people who cared about the narrow topic just like them, and expectations of how to behave online. For example, one participant said:

[…] I can probably make a safe assumption that people there more often than not know what they’re talking about. I’ll definitely be much more specific and not try to water questions down with like, my broader scheme of things—I can get as technical as possible, right? If I were to ask like the same question over at [the larger parent community], I might want to give a little bit background on what I’m trying to do, why I’m trying to do it, you know, other things that I’m using, but [in small community], I can just be like, hey, look, I’m trying to use this algorithm for this one thing. Why should I? Or should I not do it for this?

Curating online experiences

More broadly, participants explained their participation in these small communities as part of an ongoing strategy of curating their online experience. Participants described a complex ecosystem of interrelated communities that the small communities sat within, and how the small communities gave them the ability to select very specific topics, decide who to interact with, and manage content consumption.

In this sense, small communities give individuals a semblance of control on the internet. Given the scale of the internet—and a widespread sense of malaise with online hate, toxicity, and harassment—it is possible that controlling the online experience is more important to users than ever. Because of their small size, these small communities were largely free of the vandals and trolls that plague large online communities, and  several participants described their online communities as special spaces to get away from the negativity on the rest of the internet. 

Relationships

Finally, one surprise from our research was what we didn’t find. Previous research led us to predict that people would participate in small communities because they would make it easier to develop friendships with other people. Our participants described being interested in the personal experiences of other group members, but not in building individual relationships with them.

Conclusions

Our research shows that small online communities play an important and underappreciated role. At the individual level, online communities help people to have control over their experiences, curating a set of content and users that is predictable and navigable. At the platform level, small communities seem to have a symbiotic relationship with large communities. By breaking up broader topical niches, small communities likely help to keep a larger set of users engaged.

We hope that this paper will encourage others to take seriously the role of small online communities. They are qualitatively different from large communities, and more empirical and theoretical research is needed in order to understand how communities of different sizes operate and interact in community ecosystems.


A preprint of the paper is available here. We’re excited that this paper has been accepted to CSCW2021 and will be published in the Proceedings of the ACM on Human-Computer Interaction and presented at the conference in November. If you have any questions about this research, please feel free to reach out to one of the authors: Sohyeon Hwang or Jeremy Foote.

Apply to Join the Community Data Science Collective as a PhD student!

It’s Ph.D. application season and the Community Data Science Collective is recruiting! As always, we are looking for talented people to join our research group. Applying to one of the Ph.D. programs that the CDSC faculty members are affiliated with is a great way to do that.

This post provides a very brief run-down on the CDSC, the different universities and Ph.D. programs we’re affiliated with, and what we’re looking for when we review Ph.D. applications. It’s close to the deadline for some of our programs, but we hope this post will still be useful to prospective applicants now and in the future.

Group photo of the collective at a recent virtual retreat.

What is the Community Data Science Collective?

The Community Data Science Collective (or CDSC) is a joint research group of (mostly) quantitative social scientists and designers pursuing research about the organization of online communities, peer production, and learning and collaboration in social computing systems. We are based at Northwestern University, the University of Washington, Carleton College, the University of North Carolina, Chapel Hill, Purdue University, and a few other places. You can read more about us and our work on our research group blog and on the collective’s website/wiki.

What are these different Ph.D. programs? Why would I choose one over the other?

This year the group includes four faculty principal investigators (PIs) who are actively recruiting PhD students: Aaron Shaw (Northwestern University), Benjamin Mako Hill and Sayamindu Dasgupta (University of Washington in Seattle), and Jeremy Foote (Purdue University). Each of these PIs advises Ph.D. students in Ph.D. programs at their respective universities. Our programs are each described below.

Although we often work together on research and serve as co-advisors to students in each other’s projects, each faculty person has specific areas of expertise and interests. The reasons you might choose to apply to one Ph.D. program or to work with a specific faculty member could include factors like your previous training, career goals, and the alignment of your specific research interests with our respective skills.

At the same time, a great thing about the CDSC is that we all collaborate and regularly co-advise students across our respective campuses, so the choice to apply to or attend one program does not prevent you from accessing the expertise of our whole group. But please keep in mind that our different Ph.D. programs have different application deadlines, requirements, and procedures!

Who is actively recruiting this year?

Given the disruptions and uncertainties associated with the COVID-19 pandemic, the faculty PIs are more constrained in terms of whether and how they can accept new students this year. If you are interested in applying to any of the programs, we strongly encourage you to reach out to the specific faculty in that program before submitting an application.

Ph.D. Advisors

Sayamindu Dasgupta

Although he is currently at the University of North Carolina, Sayamindu Dasgupta will be starting this year as an Assistant Professor in the Department of Human-Centered Design and Engineering at the University of Washington. Sayamindu’s research focus includes data science education for children and informal learning online. This work involves both system building and empirical studies.

Benjamin Mako Hill

Benjamin Mako Hill is an Assistant Professor of Communication at the University of Washington. He is also an Adjunct Assistant Professor at UW’s Department of Human-Centered Design and Engineering (HCDE), Computer Science and Engineering (CSE), and Information School. Although many of Mako’s students are in the Department of Communication, he has also advised students in all three other departments, though he typically has more limited ability to admit students into those programs. Mako’s research focuses on population-level studies of peer production projects, computational social science, efforts to democratize data science, and informal learning. Mako has also put together a webpage for prospective graduate students with some useful links and information.

Aaron Shaw. (Photo credit: Nikki Ritcher Photography, cc-by-sa)

Aaron Shaw is an Associate Professor in the Department of Communication Studies at Northwestern. In terms of Ph.D. programs, Aaron’s primary affiliations are with the Media, Technology and Society (MTS) and the Technology and Social Behavior (TSB) Ph.D. programs. Aaron also has a courtesy appointment in the Sociology Department at Northwestern, but he has not directly supervised any Ph.D. advisees in that department (yet). Aaron’s current research projects focus on comparative analysis of the organization of peer production communities and social computing projects, participation inequalities in online communities, and empirical research methods.

Jeremy Foote

Jeremy Foote is an Assistant Professor at the Brian Lamb School of Communication at Purdue University. He is affiliated with the Organizational Communication and Media, Technology, and Society programs. Jeremy’s current research focuses on how individuals decide when and in what ways to contribute to online communities, how communities change the people who participate in them, and how both of those processes can help us to understand which things become popular and influential. Most of his research is done using data science methods and agent-based simulations.

What do you look for in Ph.D. applicants?

There’s no easy or singular answer to this. In general, we look for curious, intelligent people driven to develop original research projects that advance scientific and practical understanding of topics that intersect with any of our collective research interests.

To get an idea of the interests and experiences present in the group, read our respective bios and CVs (follow the links above to our personal websites). Specific skills that we and our students tend to use on a regular basis include consuming and producing social science and/or social computing (human-computer interaction) research, applied statistics and statistical computing, various empirical research methods, social theory and cultural studies, and more.

Formal qualifications that speak to similar skills and show up in your resume, transcripts, or work history are great, but we are much more interested in your capacity to learn, think, write, analyze, and/or code effectively than in your credentials, test scores, grades, or previous affiliations. It’s graduate school and we do not expect you to show up knowing how to do all the things already.

Intellectual creativity, persistence, and a willingness to acquire new skills and problem-solve matter a lot. We think doctoral education is less about executing a task that someone else hands you and more about learning how to identify a new, important problem; develop an appropriate approach to solving it; and explain all of the above and why it matters so that other people can learn from you in the future. Evidence that you can or at least want to do these things is critical. Indications that you can also play well with others and would make a generous, friendly colleague are really important too.

All of this is to say, we do not have any one trait or skill set we look for in prospective students. We strive to be inclusive along every possible dimension. Each person who has joined our group has contributed unique skills and experiences as well as their own personal interests. We want our future students and colleagues to do the same.

Now what?

Still not sure whether or how your interests might fit with the group? Still have questions? Still reading and just don’t want to stop? Follow the links above for more information. Feel free to send at least one of us an email. We are happy to try to answer your questions and always eager to chat.

Community Data Science Collective Research at DebConf 2021

Debian is one of the oldest, largest, and most influential peer production communities and has produced an operating system used by millions over the last three decades. DebConf is that community’s annual meeting. This year, the Community Data Science Collective was out in force at Debian’s virtual conference to present several Debian-focused research projects that we’ve been working on.

First, Wm Salt Hale presented work from his master’s thesis project on “Resilience in FLOSS: Do founder decisions impact development activity after crisis events?” His work tried to understand the social dynamics behind organizational resilience among free software projects based on what Salt calls “founder decisions.” He did so by estimating the relationship between security bugs and subsequent changes in developer activity, and by testing several theories about how this relationship might vary between permissively licensed and copyleft-licensed software packages.

Wm Salt Hale’s presentation plus Q&A. (WebM available)

Next, Kaylea and Salt facilitated a “birds-of-a-feather” get-together session for FLOSS project founders (video is also available).

Finally, Kaylea Champion presented her work with Benjamin Mako Hill on “Detecting At Risk Software in Debian.” Her work described a new technique that involves identifying software packages that are of lower (or higher) quality than we might expect given their popularity. You can read more about that work in our blog post from earlier this year.

Kaylea Champion’s presentation plus Q&A. (WebM available)

If you saw either presentation and are interested in continuing the conversation, you are welcome to reach out to us individually ({kaylea OR halew}@uw.edu). You can also follow us on this blog, or follow or engage with us in the Fediverse (@communitydata@social.coop), or on Twitter (@comdatasci).

Future Tools for Youth Data Literacies

Workshop Report From Connected Learning Summit 2021

What are data literacies? What should they be? How can we best support youth in developing them via future tools? On July 13th and July 15th 2021, we held a two-day workshop at the Connected Learning Summit to explore these questions. Over the course of two very full one-hour sessions, 40 participants from a range of backgrounds got to know each other, shared their knowledge and expertise, and engaged in brainstorming to identify important and pressing questions around youth data literacies as well as promising ways to design future tools to support youth in developing them. In this blog post, we provide a full report from our workshop, links to the notes and boards we created during the workshop, and a description of how anyone can get involved in the community around youth data literacies that we have begun to build.

Caption: We opened our sessions by encouraging participants to share and synthesize what youth data literacies meant to them. This affinity diagram is the result. 

How this workshop came to be

As members of the Community Data Science Collective who study learning, we have long been fascinated with how youth and adults learn to ask and answer questions with data. While we have engaged with these questions ourselves by looking to Scratch and Cognimates, we are always curious about how we might design tools to promote youth data literacies in other contexts in the future.

The Connected Learning Summit is a unique gathering of practitioners, researchers, teachers, educators, industry professionals, and others, all interested in formal and informal learning and the impact of new media on current and future communities of learners. When the Connected Learning Summit put up a call for workshops, we thought this was a great opportunity to engage the broader community on the topic of youth data literacies. 

Several months ago, the four of us (Stefania, Regina, Emilia and Mako) started to brainstorm ideas for potential proposals. We started by listing potential aspects and elements of data literacies such as finding and curating data, visualizing and analyzing it, programming with data, and engaging in critical reflection. We then started to identify tools that can be used to accomplish each goal and tried to identify opportunities and gaps. See some examples of these tools on our workshop website.

Caption: Workshop core team and co-organizers community. Find out more here http://www.dataliteracies.com/

As part of this process, we identified a number of leaders in the space. This included people who have built tools, like Rahul Bhargava and Catherine D’Ignazio who designed DataBasic.io, Andee Rubin who contributed to CODAP, and Victor Lee who focused on tools that link personal informatics and data. Other leaders included scholars who researched how existing tools are being used to support data literacies, including Tammy Clegg who has researched how college athletes develop data literacy skills, Yasmin Kafai who has looked at e-textile projects, and Camillia Matuk who has done research on data literacy curricula. Happily, all of these leaders agreed to join us as co-organizers for the workshop.

The workshop and what we learned from it

Our workshop took place on July 13th and July 15th as part of the 2021 Connected Learning Summit. Participants came from diverse backgrounds and the group included academic researchers, industry practitioners, K-12 teachers, and librarians. On the first day we focused on exploring existing learning scenarios designed to promote youth data literacies. On the second day we built on big questions raised in the initial session and brainstormed features for future systems. Both workshop sessions were composed of several breakout sessions. We took notes in a shared editor and encouraged participants to add their ideas and comments on sticky notes on collaborative digital white boards and share their definitions and questions around data literacies. 

Caption: organizers and participants sharing past projects and ideas in a breakout session. 

Day 1 Highlights

On Day 1, we explored a variety of existing tools designed to promote youth data literacies. A total of 28 participants attended the session. We began with a group exercise in which participants shared their own definitions of youth data literacies before dividing into 3 groups: a group focusing on tools for data visualization and storytelling, a group focusing on block-based tools, and a group focusing on data literacy curricula. In each breakout session, our co-organizers first demonstrated one or two existing tools. Each group then discussed how the demo tool might support a single learning scenario based on the following prompt: “Imagine a sixth-grader who just learned basic concepts about central tendency, how might she use these tools to apply this concept on real world data?” Each group generated many reflective questions and ideas that would prompt and help inform the design of future data literacies tools. Results of our process are captured in the boards linked below.

Caption: Activities on Miro boards during the workshop.

Data visualization and storytelling

Click here to see the activities on Miro board for this breakout session. 

 

In the breakout session focusing on data visualization and storytelling, Victor Lee first demonstrated Tinkerplots, a desktop application that allows students to explore a variety of visualizations with simple point-and-click interaction using data in .csv format. Andee Rubin then demonstrated CODAP, a web-based tool similar to Tinkerplots that supports drag-and-drop with data, additional visual representation options including maps, and connections between representations.

Caption: CODAP and Tinkerplots—two tools demonstrated during the workshop.

We discussed how various features of these tools could support youth data literacies in specific learning scenarios. We saw flexibility as one of the most important factors in tool use, both for learners and teachers. Both tools are topic-agnostic and compatible with any data in .csv format. This allows students to explore data on any topic that interests them. Simplicity in interaction is another important advantage. Students can easily see the links between tabular data and visualizations and try out different representations using simple interactions like drag-and-drop, check boxes, and button clicks. Features of these tools can also support students in aggregating data and telling stories about trends and outliers.

We further discussed potential learning needs beyond what the current features could support. Before creating visualizations, students may need scaffolds during the process of data collection, as well as in the stage of programming with and preprocessing data. Storytelling about the process of working with data was another theme that came up often in our discussion. Open questions include how features can be designed to support reproducibility, how we can design scaffolds for students to explain what they are doing with data in diary-style stories, and how we can help students narrate what they think about a dataset and why they generate particular visualizations.

Block-based tools

Click here to see the activities on Miro board for this breakout session. 

The breakout session about block-based tools started with PhD candidate Stefania Druga demonstrating a program in Scratch and showing how users could interact with data using Scratch Cloud Data. We brainstormed about the kinds of data students could collect and explore and the kinds of visualizations, games, or other creative interactions youth could create with the help of block-based tools. As a group, we came up with many creative ideas. For example, students can collect and visualize “the newest COVID tweet at the time you touched” a sensor and make a “sound effect every time you count a face-touch.”

Caption: A Scratch project demonstrated during the workshop made with Cloud Data.

We discussed how interaction with data was part of an enterprise that is larger than any particular digital scaffold. After all, data exploration is embedded in social context and might reflect hot topics and recent trends. For instance, many of our ideas about data explorations were around COVID-19 related data and topics. 

Our group also felt that interaction with data should not be limited to a single piece of software. Many scenarios we came up with were centered on personal data collection in physical spaces (e.g., counting the number of times a student touches their own face). This points to a future design direction: connecting multiple tools that support interaction in both digital and physical spaces and encouraging students to explore questions using different tools.

A final theme from our discussion was how we can use block-based tools to bring engagement with data to a wider audience. For example, accessible and interesting activities and experiences with block-based tools could be designed so that librarians can get involved in meaningful ways to introduce people to data.

Data literacy curriculum

Click here to see the activities on Miro board for this breakout session. 

In the breakout session on curriculum design, we started with an introduction by Catherine D’Ignazio and Rahul Bhargava to DataBasic.io’s Word Counter, a tool that allows users to paste in text and see word counts presented in various ways. We also walked through some curricula that the team created to guide students through the process of telling stories with data.
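To make the idea of a word counter concrete, here is a minimal sketch in Python of the kind of analysis such a tool performs. It is not DataBasic.io's implementation; the function name `count_words`, the tokenizing rule, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_words(text, top_n=3):
    """Return the top_n most common words in text, ignoring case.

    Hypothetical helper for illustration only; a real word counter
    would likely also handle stop words and phrases.
    """
    # Treat any run of letters or apostrophes as a word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# A student might paste in a favorite song lyric (sample text is made up).
lyrics = "Row, row, row your boat, gently down the stream"
print(count_words(lyrics, top_n=1))  # [('row', 3)]
```

Even a sketch this small surfaces decisions a learner could question, such as whether "Row" and "row" should count as the same word.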

We talked about how this design was powerful in that it allows students to bring their own data and context, and to share knowledge about what they expect to find. Some of the scenarios we imagined included students analyzing their own writings, favorite songs, and favorite texts, and how they might use data to tell personalized stories from there. The specificity of the task supported by the tool enables students to deepen their understanding of concepts about data by asking specific questions and looking at different datasets to explore the same question.

Caption: DataBasic.io helps users explore data.

We also reflected on the fact that tools provided in DataBasic.io are easy to use precisely because they are quite narrowly focused on a specific analytic task. This is a major strength of the tools, as they are intended as transitional bridges to help users develop foundational skills for data analysis. Using these tools should help answer questions, but should also encourage users to ask even more.

This led to a new set of issues discussed during the breakout session: How do we chain together collections of small tools, each of which might serve as one part of a data literacies pipeline? This is where we felt curricular design could really come into play. Rather than having tools that try to “be everything,” using well-designed tools that address one aspect of an analysis can provide more flexibility and freedom to explore. Our group felt that curriculum can help learners reach the most important step in their learning: going from data to story to the bigger world, and to understanding why the data might matter.

Day 2 Highlights

The goal for Day 2 of our workshop was to speculate and brainstorm future designs of tools that support youth data literacies. After our tool exploration and discussions on Day 1, three interesting brainstorming questions emerged across the breakout sessions described above:

  • How can we close the gap between general purpose tools and specific learning goals?
  • How can we support storytelling using data?
  • How can we support insights into the messiness of data and hidden decisions?

We focused on discussing these questions on Day 2. A total of 29 participants attended and we once again divided into breakout groups based on the three questions above. For each brainstorming question, we considered three sub-questions: What are some helpful tools or features that can help answer the question? What are some pitfalls? And what new ideas can we come up with?

Caption: Workshop activities generated an abundance of ideas.

How can we close the gap between general purpose tools and specific learning goals?

Click here to see the activities on Miro board for this breakout session. 

Tools are often designed to solve a range of potential problems. That said, learners attempting to engage in data analysis are frequently faced with extremely specific questions about their analysis and datasets. Where does their data come from? How is it structured? How can it be collected? How do we balance the desire to serve many specific learners’ goals with general tools against the desire to handle specific challenges well?

As one approach, we drew lines between different parts of doing data analysis and frequently required features in different tools. Of course, data analysis is rarely a simple linear process. We concluded that perhaps not everything needs to happen in one place or with one tool, and that this should be acknowledged and considered during the design process. We also discussed the importance of providing context within more general data analytic tools, and how learners need to think about the purpose of their analysis before they consider what tool to use. Ideally, youth would learn to see patterns in data and to understand the significance of the patterns they find. Finally, we agreed that tools that help students understand the limitations of data and the uncertainty inherent in the data are also important.

Challenges and opportunities for telling stories with data

Click here to see the activities on Miro board for this breakout session. 

In this section, we discussed challenges and opportunities around supporting students to tell stories with data. We talked about enabling students to recognize and represent the backstory of data. Open questions included: How do we make sure learners are aware of bias? And how can we help people recognize and document the decision of what to include and exclude?

As for telling stories about students’ own experience of working with data, collaboration was also a topic that came up frequently. We agreed that narrative with data is never an individual process. We discussed how future tools should be designed to support critique, iteration, and collaboration among storytellers, among audiences, and perhaps also between tellers and audiences.

Finally, we talked about future directions. This included taking a crowdsourced, community-driven approach to telling stories with data. We also noted that we had seen a lot of research effort to support storytelling about data in visualization systems or computational notebooks. We agreed that storytelling should not be limited to digital formats and speculated that future designs could extend the storytelling process to unplugged, physical activities. For example, future designs could encourage students to create artefacts and monuments as part of the data storytelling process. We also talked about designing to engage people from diverse backgrounds and communities to contribute to and explore data together.

Challenges and opportunities for helping students to understand the messiness of data

Click here to see the activities on Miro board for this breakout session. 

In this section, we talked about the tension between the need to make data clean and easy to use for students and the need to let youth understand the messiness of real world data. We shared our own experiences helping students engage with real or realistic data. A common approach is to engage students in collaborative data production and have them compare the outcomes of a similar analysis with each other. For instance, students can document their weekly groceries and find that different people record the same items under different names. They can then come up with a plan to name things consistently and clean their data.
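The grocery example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entries in `raw_entries`, the agreed naming plan in `canonical_names`, and the helper `clean_name` are all hypothetical and were not part of the workshop materials.

```python
from collections import Counter

# Hypothetical grocery entries recorded by several students.
raw_entries = ["Milk", "milk ", "2% milk", "Bananas", "banana", "bananas"]

# A naming plan the class might agree on (made up for illustration).
canonical_names = {
    "milk": "milk",
    "2% milk": "milk",
    "banana": "bananas",
    "bananas": "bananas",
}


def clean_name(entry):
    """Normalize spacing and case, then apply the agreed naming plan."""
    key = entry.strip().lower()
    # Entries not covered by the plan keep their normalized form.
    return canonical_names.get(key, key)


before = Counter(raw_entries)
after = Counter(clean_name(entry) for entry in raw_entries)

print(len(before), "distinct names before cleaning")  # 6 distinct names before cleaning
print(dict(after))  # {'milk': 3, 'bananas': 3}
```

Note that the mapping itself encodes a human decision: treating "2% milk" as plain "milk" discards a detail that some analyses might need.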

One very interesting point that came up in our discussion was what we really mean by “messy data.” “Messy,” incomplete, or inconsistent data may be unusable for computers while still being comprehensible to humans. Being able to work with messy data therefore means not only having the skills to preprocess it but also recognizing the hidden human decisions and assumptions behind it.

We came up with many ideas regarding future system design. We suggested designing to support crowdsourced data storytelling. For example, students can each contribute a small piece of documentation about the background of a dataset. Features might also be designed to help students collect and represent the backstory of data in innovative ways. For example, functions that support the generation of rich media, such as videos, drawings, and journal entries, can be embedded into data representation systems. We might also innovate on the way we design the interface of data storage so that students can interact with rich background information and metadata while still keeping the data “clean” for computation.

Next steps & community

We intend for this workshop to be only the beginning of our learning and exploration in the space of youth data literacies. We also hope to keep growing the community we have begun to build. In particular, we have started a mailing list where we can continue our discussion. Please feel free to add yourself to the mailing list if you would like to be kept informed about our ongoing activities.

Although the workshop has ended, we have included links to many resources on the workshop website, and we invite you to explore the site. We also encourage you to contribute to a crowdsourced list of papers on data literacies by filling out this form.  


This blog post was collaboratively written by Regina Cheng, Stefania Druga, Emilia Gan, and Benjamin Mako Hill.

Stefania Druga is a PhD candidate in the Information School at University of Washington. Her research centers on AI literacy for families and designing tools for interest-based creative coding. In her most recent project, she focuses on building a platform that leverages youth creative confidence via coding with AI agents. 

Regina Cheng is a PhD candidate in the Human Centered Design and Engineering department at University of Washington. Her research centers on broadening and facilitating participation in online informal learning communities. In her most recent work, she focuses on designing for novices’ engagement with data in online communities.

Emilia Gan is a graduate student in the Paul G. Allen School of Computer Science and Engineering (UW-Seattle). Her research explores factors that lead to continued participation of novices in computing.

Benjamin Mako Hill is an Assistant Professor at UW. His research involves democratizing data science—and doing it from time to time as well.