Prospective PhD Student Q&A – September 26th

Thinking about applying to graduate school? Wonder what it’s like to pursue a PhD or research-based master’s degree? Interested in understanding relationships between technology and society? Curious about how to do research on online communities like Reddit, Wikipedia, or GNU/Linux? The Community Data Science Collective is hosting a virtual Q&A session on Friday, September 26th at 10am PT / 12pm CT / 1pm ET for prospective students. This session is scheduled for an hour, to be divided between a larger group session with faculty and then smaller groups with current graduate students. If you would like to attend, register at this link!

This post provides a very brief run-down on the CDSC, the different universities and Ph.D. and research master’s programs our faculty members are affiliated with, and some general ideas about what we’re looking for when we review applications.

What is the Community Data Science Collective?

The Community Data Science Collective (or CDSC) is a joint research group of (mostly quantitative) empirical social scientists and designers pursuing research about the organization of online communities, peer production, and learning and collaboration in social computing systems. We are based at Northwestern University, the University of Washington-Seattle, University of Washington-Bothell, The University of Texas at Austin, Purdue University, and a few other places. You can read more about us and our work on our research group blog and on the collective’s website/wiki.

What are these different Ph.D. programs? Why would I choose one over the other?

This year the group includes multiple faculty principal investigators (PIs) who are actively recruiting PhD students: Aaron Shaw (Northwestern University), Benjamin Mako Hill (University of Washington in Seattle), Kaylea Champion (University of Washington in Bothell) and Jeremy Foote (Purdue University). Each of these PIs advises Ph.D. and/or M.S. students in graduate programs at their respective universities. We also have faculty PIs who are not currently recruiting students, but are active members of the group: Nathan TeBlunthuis (University of Texas – Austin) and Ryan Funkhouser (University of Idaho). Our programs are each described below.

Although we often work together on research and serve as co-advisors to students in each other’s projects, each faculty member has specific areas of expertise and interests. The reasons you might choose to apply to one of these Ph.D. or M.S. programs or to work with a specific faculty member could include factors like your previous training, career goals, and the alignment of your specific research interests with our respective skills.

At the same time, a great thing about the CDSC is that we all collaborate and regularly co-advise students across our respective campuses, so the choice to apply to or attend one program does not prevent you from accessing the expertise of our whole group. But please keep in mind that our different Ph.D. and M.S. programs have different application deadlines, requirements, and procedures!

Faculty who are actively recruiting this year

If you are interested in applying to any of the programs, we strongly encourage you to reach out to the specific faculty in that program before submitting an application.

Jeremy Foote

Jeremy Foote is an Assistant Professor at the Brian Lamb School of Communication at Purdue University. He is affiliated with the Organizational Communication and Media, Technology, and Society programs. Jeremy’s research focuses on how individuals decide when and in what ways to participate in online communities, how communities change the people who participate in them, and how both of those processes can help us to understand what people believe and which things become popular and influential. He and his students use multiple methods, including data science, agent-based modeling, field experiments, and interviews.

Benjamin Mako Hill

Benjamin Mako Hill is an Associate Professor of Communication at the University of Washington. He is also adjunct faculty at UW’s Department of Human-Centered Design and Engineering (HCDE), Computer Science and Engineering (CSE) and Information School. Many of Mako’s students are in the Department of Communication, but he has also advised students in all three other departments. He typically has more limited ability to admit students into those programs on his own and usually does so with a co-advisor in those departments. Mako’s research focuses on population-level studies of peer production projects, computational social science, efforts to democratize data science, and informal learning. Mako has also put together a webpage for prospective graduate students with some useful links and information.

Aaron Shaw (photo: Nikki Ritcher Photography)

Aaron Shaw is an Associate Professor in the Department of Communication Studies at Northwestern. In terms of Ph.D. programs, Aaron’s primary affiliations are with the Media, Technology and Society (MTS) and the Technology and Social Behavior (TSB) Ph.D. programs (please note: the TSB program is a joint degree between Communication and Computer Science). Aaron also has a courtesy appointment in the Sociology Department at Northwestern, but he has not directly supervised any Ph.D. advisees in that department (yet). Aaron’s current projects focus on comparative analysis of the organization of peer production communities and social computing projects, participation inequalities in online communities, and collaborative organizing in pursuit of public goods.

Kaylea Champion

Kaylea Champion is an Assistant Professor in Computing & Software Systems at the University of Washington-Bothell. Kaylea’s research investigates how people collaborate to build digital infrastructure, including operating systems, programming languages, and information repositories. What gets made and maintained and secured—and what gets neglected? What risks do we face (including from AI and cybercriminals)? What practices lead to better outcomes? How can we work smarter and what can we stop doing? Kaylea’s work seeks to bridge the divide between research and practice, which for her means building relationships with practitioner communities, organizations, and industry to directly share research findings. If you are interested in the Computing & Software Systems graduate programs at the University of Washington – Bothell, you should register for one of the information sessions specific to these programs as well. Kaylea’s department only admits M.S. and B.S. students. In addition, she is happy to support any students who are working with others in the CDSC and can serve on thesis and dissertation committees.

Other Faculty members of the CDSC

Nathan TeBlunthuis (photo: Ventrait Pictures)

Nathan TeBlunthuis is an Assistant Professor in the School of Information at the University of Texas at Austin in the area of social informatics. Nathan’s research focuses on analyzing ecosystems of online communities, AI tools in peer production, and methods in computational social science. His current projects continue in these areas and also draw from them all to understand how information sources achieve legitimacy in online communities. He works primarily using computational tools and big data, but also grounds his work in qualitative evidence.

Ryan Funkhouser

Ryan Funkhouser is an Assistant Professor in the Department of Psychology and Communication at the University of Idaho. Ryan’s research focuses on communication processes for bridging ideological divides in online spaces. His work includes explorations of deliberation-focused online communities, the role of narrative in persuasion, and the mechanisms of belief change. Ryan utilizes both computational and qualitative methods to explore text data, primarily from online sources.

What do you look for in Ph.D. applicants?

There’s no easy or singular answer to this. In general, we look for curious, intelligent people driven to develop original research projects that advance scientific and practical understanding of topics that intersect with any of our collective research interests.

To get an idea of the interests and experiences present in the group, read our respective bios and CVs (follow the links above to our personal websites). Specific skills that we and our students tend to use on a regular basis include consuming and producing social science and/or social computing (human-computer interaction) research, applied statistics and statistical computing, various empirical research methods, social theory and cultural studies, and more.

Formal qualifications that speak to similar skills and show up in your resume, transcripts, or work history are great, but we are much more interested in your capacity to learn, think, write, analyze, and/or code effectively than in your credentials, test scores, grades, or previous affiliations. It’s graduate school and we do not expect you to show up knowing how to do all the things already.

Intellectual creativity, persistence, and a willingness to acquire new skills and solve new problems matter a lot. We think doctoral education is less about executing tasks that someone else hands you and more about learning how to identify a new, important problem; develop an appropriate approach to solving it; and explain all of the above and why it matters so that other people can learn from you in the future. Evidence that you can or at least want to do these things is critical. Indications that you can also play well with others and would make a generous, friendly colleague are really important too.

All of this is to say, we do not have any one trait or skill set we look for in prospective students. We strive to be inclusive along every possible dimension. Each person who has joined our group has contributed unique skills and experiences as well as their own personal interests. We want our future students and colleagues to do the same.

Now what?

Still not sure whether or how your interests might fit with the group? Still have questions? Still reading and just don’t want to stop? Follow the links above for more information. Feel free to send at least one of us an email. We are happy to try to answer your questions and always eager to chat.


FOSSY 2025 Wrap-Up: Ben Ford “It’s all about the ecosystem!”

Ben Ford wrapped up Friday as the sixth speaker in the Science of Community track at FOSSY 2025. He argued that the ecosystem is the product, and that the thing you build and sell exists only to support it, a lesson OSS companies might learn from.

You can check out Ben’s slide deck here.

This is the 6th of our 12 part series sharing highlights from the Science of Community track at FOSSY 2025. Visit the FOSSY site for bio details and a full abstract.

FOSSY 2025 Wrap Up: Laura Langdon “From Campus to Network: Creating the UC System-Wide OSPO Initiative”

In the fifth talk of the Science of Community track at FOSSY 2025, Laura Langdon shared lessons learned from the early stages of building a network of academic Open Source Program Offices (OSPOs) across the University of California system. She discussed both the benefits and challenges encountered while developing this first-of-its-kind system-wide network.

This is the 5th of our 12 part series sharing highlights from the Science of Community track at FOSSY 2025. Visit the FOSSY site for bio details and a full abstract.

FOSSY 2025 Wrap Up: Cathy Richards “Designing for Collaboration: A Toolkit for Open and Inclusive Environmental Research”

Our fourth speaker in the Science of Community track, Cathy Richards, shared lessons learned from the Open Environmental Data Project (OEDP), focusing on how the toolkit translates open infrastructure into inclusive, practical frameworks that empower communities to use data for local action and advocacy.

This is the 4th of our 12 part series sharing highlights from the Science of Community track at FOSSY 2025. Visit the FOSSY site for bio details and a full abstract.

FOSSY 2025 Wrap Up: Mike Jang “Open source your repository: a roadmap”

Mike Jang, the third speaker in the Science of Community track at FOSSY 2025, covered how open sourcing existing software is more than just “pushing a button”; it involves things like auditing security, sharing with your community, setting ground rules, and more. Jang shared insights on how to access a template repository, a checklist to follow, tips for hackathons, and how to understand the work required to move to open source.

This is the 3rd of our 12 part series sharing highlights from the Science of Community track at FOSSY 2025. Visit the FOSSY site for bio details and a full abstract.

FOSSY 2025 Wrap-Up: Justin Ribeiro “The Creative Trade-Off: Governance, Conflict, and Their Impact On Innovation In Open-Source Software”

In the second talk of the Science of Community track, Dr. Justin Ribeiro discussed how development approaches shape creativity at the project level, drawing from a study of 40 open source projects, over 10,000 releases, and interviews with developers across corporate and community-run efforts.

This is the second of our 12 part series sharing highlights from the Science of Community track at FOSSY 2025. Visit the FOSSY site for bio details and a full abstract.

FOSSY 2025 Wrap Up: Matt Gaughan “How do sponsored open source ecosystems manage feature deployments?”

Matt kicked off the Science of Community track on Friday, August 1st, discussing his current research on how the recent growth of sponsored open source libraries (projects stewarded by large, formally incorporated organizations) provides new organizational relationships and processes to better understand FOSS libraries organized as communities of volunteer contributors.

(Matt’s talk starts at 16:30 and ends at 45:51).

This is the first of our 12 part series sharing highlights from the Science of Community track at FOSSY 2025. Visit the FOSSY site for bio details and a full abstract [https://2025.fossy.us/schedule/presentation/350/].

Only One Week Until FOSSY 2025, Come See Us There!

Let the countdown begin! FOSSY 2025 runs next week, from July 31st through August 3rd. We’ll be there, running the Science of Community track on Friday, August 1st and Saturday, August 2nd.

The Science of Community track is inspired by the CDSC Science of Community Dialogues, which aim to bring together practitioners and researchers to discuss scholarly work that is relevant to the efforts of practitioners. As researchers, we get so much from the communities we work with and study and we want them to also learn from the research they so generously take part in. While the Dialogues cover a broad range of topics and communities, FOSSY presentations focus on how that work relates to free and open source software communities, projects, and practitioners.

We have a number of great presenters, including the CDSC’s very own Matt Gaughan and Dr. Kaylea Champion. You can check our full schedule below:

Tickets are still available at every price tier. Check them out here.

We’ll see you there!

Join us at FOSSY 2025!

Interested in free and open source software? Want to hear insights from researchers, community leaders, contributors, and advocates working on and with FOSS?

Join us July 31st – August 3rd at the Free and Open Source Software Yearly conference!

We will be running the Science of Community track on Friday August 1st and Saturday August 2nd. We’re excited to have a number of awesome presenters speaking about their work. You can find the schedule here.

The Science of Community track is inspired by the CDSC Science of Community Dialogues, which aim to bring together practitioners and researchers to discuss scholarly work that is relevant to the efforts of practitioners. As researchers, we get so much from the communities we work with and study and we want them to also learn from the research they so generously take part in. While the Dialogues cover a broad range of topics and communities, FOSSY presentations focus on how that work relates to free and open source software communities, projects, and practitioners.

Collaborations between practitioners and researchers can be transformative! Let’s get to know each other.

Tickets are still available at every price tier. Check them out here.

We hope to see you there!

Niche Dynamics in Complex Online Community Ecosystems (ICWSM 2025)

This post is about my (Nathan TeBlunthuis) paper (pdf) just published at ICWSM 2025.

Often, several different online communities exist where similar people talk about similar things. This is really easy to observe from browsing platforms like Reddit or Facebook groups.

Names of bicycle-related subreddits in cluster of subreddits with many overlapping users.

For example, as we can see from this visualization of clustered subreddits with overlapping users, there are many different subreddits related to cycling. We see some communities have different emphases in complementary ways like “fixedgearbicycle” and “bicycletouring”; these are different types of cycling. But why have a community for “cycling” and a different one for “bicycling”? A number of puzzles appear when we reflect on the existence of such related communities.

How do online communities relate to each other?

Why not have one large community that does everything?

How do people construct these systems of related online communities?

I investigated these questions in my dissertation using the theoretical lens of organizational ecology drawn from organizational sociology. This new paper explores some findings from earlier projects in more depth. The paper I published at ICWSM 2022 (pdf) took up the question of ecological relationships among online communities. I used time series models to infer networks of competition and mutualism between overlapping online communities. This work found evidence that these relationships tended to be mutualistic. For example, the diagram below shows a network of mental health subreddits that is dense with mutualism.

Ecological network of a cluster of mental health subreddits. Blue arrows indicate mutualism and yellow arrows indicate competition according to a vector autoregression model.


However, this method, based on vector autoregression (VAR) models of activity, assumes that these relationships are static and constant over time. But dynamics of attention online are often bursty, and online communities grow, decline, and change over time in other ways. So, in this new work, I adopted nonlinear models called (regularized) S-map that can model more complex dynamics.
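To make the contrast concrete, here is a minimal sketch of the S-map idea, not the code used in the paper. It fits a weighted linear model around one point in time, so weeks whose state resembles the target week count more. Setting theta to 0 weights every week equally, which collapses to a single global linear fit in the spirit of a VAR(1) model. The function name, the toy data, the true interaction matrix, and the use of a ridge penalty as the regularizer are all assumptions made for illustration; the penalty used in the paper may differ.

```python
import numpy as np

def smap_interactions(series, t_star, theta=2.0, ridge=0.0):
    """Estimate the local interaction matrix around the state at week t_star.

    series : array of shape (T, n), one column of activity per community.
    theta  : locality of the weighting; theta = 0 weights all weeks equally,
             which reduces to one global linear (VAR(1)-style) fit.
    ridge  : L2 penalty on the interaction coefficients (not the intercept).
    Returns J where J[i, j] is the estimated effect of community j in week t
    on community i in week t + 1.
    """
    series = np.asarray(series, dtype=float)
    X, Y = series[:-1], series[1:]                 # predictors and next-week targets
    dist = np.linalg.norm(X - series[t_star], axis=1)
    scale = dist.mean()
    weights = np.exp(-theta * dist / scale) if scale > 0 else np.ones(len(X))
    A = np.hstack([np.ones((len(X), 1)), X])       # intercept column plus states
    WA = A * weights[:, None]
    penalty = ridge * np.eye(A.shape[1])
    penalty[0, 0] = 0.0                            # leave the intercept unpenalized
    coef = np.linalg.solve(A.T @ WA + penalty, WA.T @ Y)
    return coef[1:].T

# Hypothetical two-community system: community 1 helps community 0 (+0.2),
# while community 0 hurts community 1 (-0.3).
rng = np.random.default_rng(0)
B_true = np.array([[0.5, 0.2], [-0.3, 0.4]])
series = np.zeros((2000, 2))
for t in range(1999):
    series[t + 1] = B_true @ series[t] + rng.normal(size=2)

J_global = smap_interactions(series, t_star=1000, theta=0.0)             # static fit
J_local = smap_interactions(series, t_star=1000, theta=2.0, ridge=1e-3)  # local fit
```

Positive off-diagonal entries can be read as mutualistic effects and negative ones as competitive effects. Repeating the local fit for every week gives a time-varying series of interaction coefficients rather than one constant estimate.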

Since I found in the previous work that mutualism tended to happen more often than competition, I wanted to find out if that result was robust using the S-map. Because the S-map breaks these relationships down into episodes of competition or mutualism, it allowed me to test a more nuanced hypothesis about this tendency towards mutualism.

H1: Mutualistic interactions will be more frequent and longer lasting than competitive interactions.

In another empirical paper previously published at CSCW 2022 (acm dl), we focused on the question of why people build overlapping online communities and found that they provide complementary sets of benefits to members, as illustrated below. Trade-offs between the benefits lead to specialized roles for different types of communities.

Figure from “No Community Can Do Everything: Why People Participate in Similar Online Communities” depicts three key benefits that people seek from online communities and how individual communities tend not to optimally provide all three. For example, large communities tend not to afford tight-knit homophilous community.

This reflects propositions from ecology that specialization can be a strategy to avoid competition. The new study seeks to provide more generalizable quantitative evidence about how online communities find their specialized niches. Ecology theory suggests that online communities, similar to organizations or organisms, might adapt to increase specialization and thereby promote more mutualistic relationships. To investigate whether people build specialized online communities through such an adaptive feedback process, I set out to test the following two hypotheses:

H2: Two communities having greater competition (mutualism) will subsequently have greater decreases (increases) in overlap.

H3: Two subreddits having decreasing (increasing) overlap will subsequently have greater mutualism (competition).

Methods and measures

To test these three hypotheses, I had to measure competition/mutualism and overlap within clusters of related subreddits over time. I made topic- and user-overlap measures based on a community embedding via the LSA algorithm. To create the clusters, I reused the approach from the earlier paper by using the HDBSCAN algorithm based on user overlap. As mentioned above, I used the Regularized S-MAP algorithm to create a dynamic measure of ecological influence. With these longitudinal measures in hand, I could test the hypotheses using two-way fixed-effects panel data estimators with dyad-robust standard errors. That’s a brief and dense summary of the methods. The chart below might help you make sense of it, but if you care to fully understand it, you’ll want to check out the full paper.
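As a rough illustration of an LSA-style overlap measure, the sketch below embeds communities with a truncated SVD of a community-by-user count matrix and compares them with cosine similarity. The count matrix, the subreddit names attached to its rows, and the choice of two dimensions are hypothetical, and any weighting or preprocessing used in the paper is not reproduced here.

```python
import numpy as np

def lsa_embed(counts, k):
    """Project a community-by-user count matrix onto its top k singular directions."""
    U, s, _ = np.linalg.svd(np.asarray(counts, dtype=float), full_matrices=False)
    return U[:, :k] * s[:k]

def cosine_similarity(embedding):
    """Pairwise cosine similarity between the rows of an embedding."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    unit = embedding / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Hypothetical comment counts: rows are communities, columns are six users.
names = ["cycling", "bicycling", "knitting", "crochet"]
counts = np.array([
    [5, 3, 4, 0, 0, 0],
    [4, 4, 2, 1, 0, 0],
    [0, 0, 0, 6, 2, 3],
    [0, 0, 1, 3, 4, 2],
])

overlap = cosine_similarity(lsa_embed(counts, k=2))
# overlap[i, j] is high when communities i and j draw on the same users.
```

A pairwise similarity matrix of this kind is the sort of input a density-based clustering algorithm such as HDBSCAN could use to group related communities, and recomputing it over successive time windows would give a longitudinal overlap measure.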

This flowchart illustrates the dataset and measures in the study.
On the left-hand side, nonlinear “Regularized S-Map models” are fit to time series of posts and comments in clusters of subreddits with high user-overlap to test hypothesis 1.
In the middle, competition and mutualism from the S-Map models are used with longitudinal measures of topic and user overlap based on community embeddings in panel regression models to test hypotheses 2 and 3.
Model selection is on the right-hand side.

Here are a few final notes on the data and methods. The data came from the Pushshift Reddit archive of submissions and comments from December 5th 2015 to April 13th 2020. I started with the 19,533 subreddits that were active during at least 20% of study period weeks, excluding NSFW subreddits. HDBSCAN clustering discovered 1,919 clusters of 8,806 related subreddits having 48,484 relationships measured 17,374,116 times over 758 weeks.
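The activity filter can be sketched in a few lines. This is an illustration only: it assumes a subreddit counts as active in a week if it has at least one submission or comment that week, which the post does not spell out, and the function name and toy data are hypothetical.

```python
def active_subreddits(weekly_activity, n_weeks, min_share=0.20):
    """Keep subreddits that were active in at least min_share of the study weeks.

    weekly_activity maps a subreddit name to the week indices in which it had
    at least one submission or comment (an assumed definition of "active").
    """
    return sorted(
        name
        for name, weeks in weekly_activity.items()
        if len(set(weeks)) / n_weeks >= min_share
    )

# Hypothetical activity over a 10-week window.
toy_activity = {
    "cycling": list(range(10)),   # active every week
    "bicycling": [0, 3],          # active in 20% of weeks, so it is kept
    "unicycles": [5],             # active in 10% of weeks, so it is dropped
}
kept = active_subreddits(toy_activity, n_weeks=10)
```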

Results

I found support for H1, which predicted that mutualistic interactions will be more frequent and longer lasting than competitive interactions. The plot below shows evidence in favor of the hypothesis. First, we can see clearly that the longest episodes tend to be mutualistic.
Notably, these ecological relationships are often bursty and short-lived. The average length of a mutualistic episode was 2.13 weeks and the average length of a competitive episode was just 1.83 weeks.

Frequency plot of the durations of competition and mutualism episodes. Mutualism tends to last longer than competition. The y-axis is log-transformed. The axes are truncated to omit outliers for visibility.
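One simple way to turn a weekly series of interaction coefficients into episodes is to group consecutive weeks that share a sign. The sketch below does this with run-length encoding. The coefficient values are made up for illustration, and the paper's exact definition of an episode may differ.

```python
from itertools import groupby
from statistics import mean

def episodes(coefficients):
    """Return (label, length) for each maximal run of same-signed coefficients."""
    def label(value):
        if value > 0:
            return "mutualism"
        if value < 0:
            return "competition"
        return "neutral"
    return [(kind, len(list(run))) for kind, run in groupby(coefficients, key=label)]

def mean_duration(coefficients, kind):
    """Average episode length in weeks for one kind of interaction."""
    lengths = [n for k, n in episodes(coefficients) if k == kind]
    return mean(lengths) if lengths else 0.0

# Hypothetical weekly effect of one community on another.
weekly_effect = [0.3, 0.1, 0.2, -0.4, -0.1, 0.5, 0.2, 0.1, 0.6, -0.2]
mutualism_weeks = mean_duration(weekly_effect, "mutualism")      # 3.5
competition_weeks = mean_duration(weekly_effect, "competition")  # 1.5
```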

I also found support for H2, which predicted that I’d find positive coefficients for previous ecological interaction indicating that competition predicts decreases in overlap. Indeed, the panel regression models found that online communities tend to increase their specialization a bit in relatively competitive conditions, by about 0.02 standard deviations in terms of user overlap for every 1-unit increase in competition.
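For readers unfamiliar with two-way fixed effects, the sketch below shows the core of the estimator on a simulated balanced panel: subtract each dyad's mean and each week's mean, add back the grand mean, and regress the transformed outcome on the transformed, lagged predictor. The panel size, the true slope, and the variable names are hypothetical, and the dyad-robust standard errors used in the paper are not computed here.

```python
import random

def two_way_within(panel):
    """Remove unit and period means from a balanced panel (rows are units)."""
    n_units, n_periods = len(panel), len(panel[0])
    grand = sum(map(sum, panel)) / (n_units * n_periods)
    unit_mean = [sum(row) / n_periods for row in panel]
    period_mean = [sum(row[t] for row in panel) / n_units for t in range(n_periods)]
    return [
        [panel[i][t] - unit_mean[i] - period_mean[t] + grand for t in range(n_periods)]
        for i in range(n_units)
    ]

def twfe_slope(y, x):
    """Slope of y on x after absorbing unit and period fixed effects."""
    y_w, x_w = two_way_within(y), two_way_within(x)
    num = sum(a * b for ry, rx in zip(y_w, x_w) for a, b in zip(ry, rx))
    den = sum(b * b for rx in x_w for b in rx)
    return num / den

# Simulated panel with a hypothetical true slope: the change in overlap in
# week t depends on competition in week t - 1, plus dyad and week effects.
rng = random.Random(0)
N_DYADS, N_WEEKS, TRUE_BETA = 60, 40, -0.5
dyad_effect = [rng.gauss(0, 1) for _ in range(N_DYADS)]
week_effect = [rng.gauss(0, 1) for _ in range(N_WEEKS)]
competition = [[rng.gauss(0, 1) for _ in range(N_WEEKS)] for _ in range(N_DYADS)]
overlap_change = [
    [
        dyad_effect[i] + week_effect[t] + TRUE_BETA * competition[i][t - 1] + rng.gauss(0, 0.5)
        for t in range(1, N_WEEKS)
    ]
    for i in range(N_DYADS)
]
lagged_competition = [row[:-1] for row in competition]
beta_hat = twfe_slope(overlap_change, lagged_competition)
```

The double-demeaning shortcut is only exact for balanced panels; with missing dyad-weeks, a dummy-variable regression or a dedicated panel package would be the safer route.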

Do increasingly specialized communities tend to decrease their competition as predicted by H3? My analysis didn’t find evidence for this. In fact, according to the panel regression models, after specialization increases, competition actually tends to increase as well.

Discussion

What to take away from all this? To me, the most important finding from this work is still the robustness of the tendency toward mutualism among online communities. Unlike firms or other organizations that demand relatively exclusive commitments from their members, it is easy to participate in many online communities. Where classical organizations (imagine firms, churches, sports teams, nonprofits, and state organizations) seem likely to compete over employees, customers, or members, online communities seem to benefit to some extent from sharing users with each other. I suspect this has to do with the ease with which nonrival content, ideas, and knowledge move between communities.

A second important takeaway from this work is that I think the evidence it finds for the adaptation explanation for the tendency toward mutualism isn’t all that convincing. Sure, communities in competition tend to become more specialized, but the effect size is pretty small and the fact that specialization doesn’t reduce competition suggests that it isn’t truly adaptive in the strongest sense. Put another way, specialized online communities might be made via an adaptive process, or they might be born out of the intentions and designs of their founders and early joiners. This work finds a bit of evidence for how specialization might be made, but the born process merits more investigation.

One clue about the significance of design for specialization comes from fellow CDSC-er Jeremy Foote’s nice CHI paper (acm dl) last year, which showed how the early stages of a subreddit’s development are important to its trajectory and found that most subreddit creators didn’t set out to create a large community. Another study (arxiv.org) by Chenhao Tan on “community genealogy” shows how the growth of new subreddits often seems to depend on having high overlap with a “parent” subreddit. These papers don’t focus on specialization, but it would be cool to see future work take up these ideas.

If you enjoyed reading this summary or want to learn more, please check out the full paper. I got the chance to speculate a bit about what sorts of future technology designs might assist community leaders in crafting online communities to fill ecological roles. I also got to engage with ecological theory in a new way writing this. I hope you read and enjoy.

Finally, I wasn’t able to attend ICWSM in person this year, so I want to thank Kristen Engel for presenting on my behalf. I also want to note that CDSC-er Kaylea Champion and I were both recognized as “best reviewers” at the conference.

This work started as a chapter of my dissertation. Thanks to the committee — Professors Benjamin Mako Hill, Kirsten Foot, Aaron Shaw, David McDonald and Emma Spiro.

I also gratefully acknowledge support by NSF grants IIS-1908850 and IIS-1910202 and GRFP #2016220885. This work was facilitated through the use of the advanced computational infrastructure provided by the Hyak supercomputer system at the University of Washington and TACC at the University of Texas.