Apply to Join the Community Data Science Collective as a PhD student!

It’s Ph.D. application season and the Community Data Science Collective is recruiting! As always, we are looking for talented people to join our research group. Applying to one of the Ph.D. programs that the CDSC faculty members are affiliated with is a great way to do that.

This post provides a very brief run-down on the CDSC, the different universities and Ph.D. programs we’re affiliated with, and what we’re looking for when we review Ph.D. applications. It’s close to the deadline for some of our programs, but we hope this post will still be useful to prospective applicants now and in the future.

Group photo of the collective at a recent virtual retreat.

What is the Community Data Science Collective?

The Community Data Science Collective (or CDSC) is a joint research group of (mostly) quantitative social scientists and designers pursuing research about the organization of online communities, peer production, and learning and collaboration in social computing systems. We are based at Northwestern University, the University of Washington, Carleton College, the University of North Carolina, Chapel Hill, Purdue University, and a few other places. You can read more about us and our work on our research group blog and on the collective’s website/wiki.

What are these different Ph.D. programs? Why would I choose one over the other?

This year the group includes four faculty principal investigators (PIs) who are actively recruiting Ph.D. students: Aaron Shaw (Northwestern University), Benjamin Mako Hill and Sayamindu Dasgupta (University of Washington in Seattle), and Jeremy Foote (Purdue University). Each of these PIs advises Ph.D. students in programs at their respective universities. Our programs are each described below.

Although we often work together on research and serve as co-advisors to students in each other’s projects, each faculty member has specific areas of expertise and interests. The reasons you might choose to apply to one Ph.D. program or to work with a specific faculty member could include factors like your previous training, career goals, and the alignment of your specific research interests with our respective skills.

At the same time, a great thing about the CDSC is that we all collaborate and regularly co-advise students across our respective campuses, so the choice to apply to or attend one program does not prevent you from accessing the expertise of our whole group. But please keep in mind that our different Ph.D. programs have different application deadlines, requirements, and procedures!

Who is actively recruiting this year?

Given the disruptions and uncertainties associated with the COVID-19 pandemic, the faculty PIs are more constrained in terms of whether and how they can accept new students this year. If you are interested in applying to any of the programs, we strongly encourage you to reach out to the specific faculty in that program before submitting an application.

Ph.D. Advisors

Sayamindu Dasgupta

Although he is currently at the University of North Carolina, Sayamindu Dasgupta will be starting this year as an Assistant Professor in the Department of Human-Centered Design and Engineering at the University of Washington. Sayamindu’s research focus includes data science education for children and informal learning online; this work involves both system building and empirical studies.

Benjamin Mako Hill

Benjamin Mako Hill is an Assistant Professor of Communication at the University of Washington. He is also an Adjunct Assistant Professor in UW’s Department of Human-Centered Design and Engineering (HCDE), Computer Science and Engineering (CSE), and the Information School. Many of Mako’s students are in the Department of Communication, but he has also advised students in all three other departments, where he typically has more limited ability to admit students. Mako’s research focuses on population-level studies of peer production projects, computational social science, efforts to democratize data science, and informal learning. Mako has also put together a webpage for prospective graduate students with some useful links and information.

Aaron Shaw. (Photo credit: Nikki Ritcher Photography, cc-by-sa)

Aaron Shaw is an Associate Professor in the Department of Communication Studies at Northwestern. In terms of Ph.D. programs, Aaron’s primary affiliations are with the Media, Technology and Society (MTS) and the Technology and Social Behavior (TSB) Ph.D. programs. Aaron also has a courtesy appointment in the Sociology Department at Northwestern, but he has not directly supervised any Ph.D. advisees in that department (yet). Aaron’s current research projects focus on comparative analysis of the organization of peer production communities and social computing projects, participation inequalities in online communities, and empirical research methods.

Jeremy Foote

Jeremy Foote is an Assistant Professor at the Brian Lamb School of Communication at Purdue University. He is affiliated with the Organizational Communication and Media, Technology, and Society programs. Jeremy’s current research focuses on how individuals decide when and in what ways to contribute to online communities, how communities change the people who participate in them, and how both of those processes can help us to understand which things become popular and influential. Most of his research is done using data science methods and agent-based simulations.

What do you look for in Ph.D. applicants?

There’s no easy or singular answer to this. In general, we look for curious, intelligent people driven to develop original research projects that advance scientific and practical understanding of topics that intersect with any of our collective research interests.

To get an idea of the interests and experiences present in the group, read our respective bios and CVs (follow the links above to our personal websites). Specific skills that we and our students tend to use on a regular basis include experience consuming and producing social science and/or social computing (human-computer interaction) research, applied statistics and statistical computing, various empirical research methods, social theory and cultural studies, and more.

Formal qualifications that speak to similar skills and show up in your resume, transcripts, or work history are great, but we are much more interested in your capacity to learn, think, write, analyze, and/or code effectively than in your credentials, test scores, grades, or previous affiliations. It’s graduate school and we do not expect you to show up knowing how to do all the things already.

Intellectual creativity, persistence, and a willingness to acquire new skills and problem-solve matter a lot. We think doctoral education is less about executing a task that someone else hands you and more about learning how to identify a new, important problem; develop an appropriate approach to solving it; and explain all of the above and why it matters so that other people can learn from you in the future. Evidence that you can or at least want to do these things is critical. Indications that you can also play well with others and would make a generous, friendly colleague are really important too.

All of this is to say, we do not have any one trait or skill set we look for in prospective students. We strive to be inclusive along every possible dimension. Each person who has joined our group has contributed unique skills and experiences as well as their own personal interests. We want our future students and colleagues to do the same.

Now what?

Still not sure whether or how your interests might fit with the group? Still have questions? Still reading and just don’t want to stop? Follow the links above for more information. Feel free to send at least one of us an email. We are happy to try to answer your questions and always eager to chat.

Community Data Science Collective Research at DebConf 2021

Debian is one of the oldest, largest, and most influential peer production communities and has produced an operating system used by millions over the last three decades. DebConf is that community’s annual meeting. This year, the Community Data Science Collective was out in force at Debian’s virtual conference to present several Debian-focused research projects that we’ve been working on.

First, Wm Salt Hale presented work from his master’s thesis project on “Resilience in FLOSS: Do founder decisions impact development activity after crisis events?” His work tried to understand the social dynamics behind organizational resilience among free software projects based on what Salt calls “founder decisions.” He did so by estimating how developer activity changes after security bugs and by testing several theories about how this relationship might vary between permissively licensed and copyleft-licensed software packages.

Wm Salt Hale’s presentation plus Q&A. (WebM available)

Next, Kaylea and Salt facilitated a “birds-of-a-feather” get-together session for FLOSS project founders (video is also available).

Finally, Kaylea Champion presented her work with Benjamin Mako Hill on “Detecting At Risk Software in Debian.” Her work described a new technique that involves identifying software packages that are of lower (or higher) quality than we might expect given their popularity. You can read more about that work in our blog post from earlier this year.

Kaylea Champion’s presentation plus Q&A. (WebM available)

If you saw either presentation and are interested in continuing the conversation, you are welcome to reach out to us individually ({kaylea OR halew}@uw.edu). You can also follow us on this blog, or follow or engage with us in the Fediverse (@communitydata@social.coop), or on Twitter (@comdatasci).

Do generous attitudes underlie contributions to user-generated content?

User-generated content on the Internet provides the basis for some of the most popular websites, such as Wikipedia, crowdsourced question-and-answer sites like Stack Overflow, video-sharing sites like YouTube, and social media platforms like Reddit. Much (or in some cases all) of the content on these sites is created by unpaid volunteers, who invest substantial time and effort to produce high quality information resources. So are these volunteers and content contributors more generous in general than people who don’t contribute their time, knowledge, or information online?

We (Floor Fiers, Aaron Shaw, and Eszter Hargittai) consider this question in a recent paper published in The Journal of Quantitative Description: Digital Media (JQD:DM). The publication of this paper is particularly exciting because it pursues a new angle on these questions, and also because it’s part of the inaugural issue of JQD:DM, a new open-access venue for research that seeks to advance descriptive (as opposed to analytic or causal) knowledge about digital media.

The study uses data from a national survey of U.S. adult internet users that includes questions about many kinds of online contribution activities, various demographic and background attributes, as well as a dictator game to measure generosity. In the dictator game, each participant has an opportunity to make an anonymous donation of some unanticipated funds to another participant in the study. Prior experimental research across the social sciences has used dictator games, but no studies we know of had compared dictator game donations with online content contributions.

Sharing content. GotCredit via flickr.

Overall, we find that people who contribute some kind of content online exhibit more generosity in the dictator game. More specifically, we find that people producing any type of user-generated content tend to donate more in the dictator game than those who do not produce any such content. We also disaggregate the analysis by type of content contribution and find that donating in the dictator game only correlates with content contribution for those who write reviews, upload public videos, pose or answer questions, and contribute to encyclopedic knowledge collections.

So, generous attitudes and behaviors may help explain contributions to some types of user-generated content, but not others. This implies that user-generated content is not a homogeneous activity, since variations exist between different types of content contribution.

The (open access!) paper has many more details, so we hope you’ll download, read, and cite it. Please feel free to leave a comment below too.

Paper Citation: Fiers, Floor, Aaron Shaw, and Eszter Hargittai. 2021. “Generous Attitudes and Online Participation”. Journal of Quantitative Description: Digital Media 1 (April). https://doi.org/10.51685/jqd.2021.008.

CDSC is hiring a staff person!

Group photo of many of the collective members at a virtual retreat in Spring 2021.

Do you (or someone you know) care about online communities and organizing, scientific research, education, and sharing ideas? We are looking for a person to join us and help grow our research and public impact. The (paid, part-time with benefits) position will focus on responsibilities such as research assistance, research administration, communications and outreach. 

This is a new position, and the person hired will be the first dedicated staff member with the group. The person who takes the job will shape the role together with us based on their interests and skills. While we have some ideas about the qualifications that might make somebody a compelling candidate (see below), we are eager to hear from anyone who is willing to get involved, learn on the job, and collaborate with us. You do not need to be an expert or have decades of experience to apply for this job. We aim to value and build on applicants’ experiences.

The position is about half time (25 hours per week) through Northwestern University and could be performed almost entirely remotely (the collective hosts in-person meetings and workshops when public health/safety allows). The salary will start at around $30,000 per year and includes excellent benefits through Northwestern. We’re looking for a minimum one-year commitment.

Expected responsibilities will likely fall into three areas:

  • Support research execution (example: develop materials to recruit study participants)
  • Research administration (example: manage project tracking, documentation)
  • Community management (example: plan meetings with partner organizations)

Candidates must hold at least a bachelor’s degree. Familiarity with scientific research, project management, higher education, and/or event planning is a plus, as is prior experience in the social or computer sciences, research organizations, online communities, and/or public interest technology and advocacy projects of any kind.

To learn more about the Community Data Science Collective, you should check out our wiki, read previous posts on this blog, and look at some of our recent publications. Please feel free to contact anyone in the group with questions. We are committed to creating a diverse, inclusive, equitable, and accessible work environment within our collective and we look forward to working with someone who shares these values.

Ready to apply? Please do so via this Northwestern University job posting.  We are reviewing applications on a rolling basis and hope to hire someone to begin later this summer.

Workshop Announcement: Imagining Future Tools for Youth Data Literacies @ CLS2021

As today’s youth come of age in an increasingly data-driven world, the development of new literacies is increasingly important. Young people need both skills to work with, analyze, and interpret data, as well as an understanding of the complex social issues surrounding the collection and use of data. But how can today’s youth develop the skills they need?

We will be exploring this question during an upcoming workshop on Imagining Future Designs of Tools for Youth Data Literacies, one of the offerings at this year’s Connected Learning Summit. As co-organizers for this workshop, we are motivated by our interest in how young people learn to work with and understand data. We are also curious about how other people working in this area define the term ‘data literacy’ and what they feel are the most critical skills for young people to learn. As there are a number of great tools available to help young people learn about and use data, we also hope to explore which features of these tools make them most effective. We are looking forward to discussions on all of these issues during the workshop.

This workshop promises to be an engaging discussion of existing tools available to help young people work with and understand data (Session 1) and an exploration of what future tools might offer (Session 2). We invite all researchers, educators, and other practitioners to join us for one or both of these sessions. We’re hoping for all attendees to come away with a deeper understanding of data literacies and how to support youth in developing data literacy skills.

Information on registering for the Connected Learning Summit is available at: https://connectedlearningsummit.org/

To register interest in attending the Youth Data Literacies Workshop, please complete the pre-registration form at: http://dataliteracies.com/

The workshop is organized by Community Data Science Collective members Regina Cheng, Stefania Druga, Emilia Gan, and Benjamin Mako Hill in collaboration with Rahul Bhargava, Tamara Clegg, Catherine D’Ignazio, Yasmin Kafai, Victor Lee, Camillia Matuk, and Andee Rubin.

Community Data Science Collective at ICA 2021

As we do every year, members of the Community Data Science Collective will be presenting work at the International Communication Association (ICA)’s 71st Annual Conference, which will take place virtually next week. Due to the asynchronous format of ICA this year, none of the talks will happen at specific times. Although the downside of the virtual conference is that we won’t be able to meet up with you all in person, the good news is that you’ll be able to watch our talks and engage with us on whatever timeline suits you best between May 27 and 31.

This year’s offerings from the collective include:

Nathan TeBlunthuis will be presenting work with Benjamin Mako Hill as part of the ICA Computational Methods section on “Time Series and Trends in Communication Research.” The name of their talk is “A Community Ecology Approach for Identifying Competitive and Mutualistic Relationships Between Online Communities.”

Aaron Shaw is presenting a paper on “Participation Inequality in the Gig Economy” on behalf of himself, Floor Fiers, and Eszter Hargittai. The talk will be part of a session organized by the ICA Communication and Technology section on “From Autism to Uber: The Digital Divide and Vulnerable Populations.”

Floor Fiers collaborated with Nathan Walter on a poster titled “Sharing Unfairly: Racial Bias on Airbnb and the Effect of Review Valence.” The poster is part of the interactive poster session of the ICA Ethnicity and Race section.

Nick Hager will be talking about his paper with Aaron Shaw titled “Randomly-Generated Inequality in Online News Communities,” which is part of a high density session on “Social Networks and Influence.”

Finally, Jeremy Foote will be chairing a session on “Cyber Communities: Conflicts and Collaborations” as part of the ICA Communication and Technology division.

We look forward to sharing our research and connecting with you at ICA!

UPDATE: The paper led by Nathan TeBlunthuis won the best paper award from the ICA Computational Methods section! Congratulations, Nate!

Newcomers, Help, Feedback, Critical Infrastructure….: Social Computing Scholarship at SANER 2021

This year I was fortunate to present at the 2021 IEEE International Conference on Software Analysis, Evolution and Re-engineering, or “SANER 2021.” You can see the write-up of my own presentation on “underproduction” elsewhere on this blog.

SANER is primarily focused on software engineering practices, and several of the projects presented this year were of interest for social computing scholars. Here’s a quick rundown of presentations I particularly enjoyed:

Newcomers: Does marking a bug as a ‘Good First Issue’ help retain newcomers? These results from Hyuga Horiguchi, Itsuki Omori and Masao Ohira suggest the answer is “yes.” However, marking documentation tasks as a ‘Good First Issue’ doesn’t seem to help with the onboarding process. Read more or watch the talk at: Onboarding to Open Source Projects with Good First Issues: A Preliminary Analysis [VIDEO]

Comparison of online help communities: This article by Mahshid Naghashzadeh, Amir Haghshenas, Ashkan Sami and David Lo compares two question/answer environments that we might imagine as competitors: the Matlab community of Stack Overflow versus the Matlab community hosted by Matlab. These sites have similar affordances and topics; however, the two sites seem to draw distinctly different types of questions. This article features an extensive dataset hand-coded by subject matter experts: How Do Users Answer MATLAB Questions on Q&A Sites? A Case Study on Stack Overflow and MathWorks [VIDEO]

Feedback: What goes wrong when software developers give one another feedback on their code? This study by a large team (Moataz Chouchen, Ali Ouni, Raula Gaikovina Kula, Dong Wang, Patanamon Thongtanunam, Mohamed Wiem Mkaouer and Kenichi Matsumoto) offers an ontology of the pitfalls and negative interactions that can occur during the popular code feedback practice known as code review, including confused reviewers, divergent reviewers, low review participation, shallow review, and toxic review: Anti-patterns in Modern Code Review: Symptoms and Prevalence [VIDEO]

Critical Infrastructure: This study by Mahmoud Alfadel, Diego Elias Costa and Emad Shihab focused on traits of security problems in Python and made some comparisons to npm. This got me thinking about different community-level factors (like bug release/security alert policies) that may influence underproduction. I also found myself wondering about inter-rater reliability for bug triage in communities like Python. The paper showed a very similar survival curve for bugs of varying severities, whereas my work in Debian showed distinct per-severity curves. One explanation for a uniform resolution rate across severities could be high variability in how severity ratings are applied. Another factor worth considering may be the role of library abandonment: Empirical Analysis of Security Vulnerabilities in Python Packages [VIDEO]

Mako Hill gets an NSF CAREER Award!

In exciting collective news, the US National Science Foundation announced that Benjamin Mako Hill has received one of this year’s CAREER awards. The CAREER is the most prestigious grant that the NSF gives to early career scientists in all fields.

You can read lots more about the award in a detailed announcement that the University of Washington Department of Communication put out, on Mako’s personal blog (or in this Twitter thread and this Fediverse thread), or on the NSF website itself. The grant itself—about $550,000 over five years—will support a ton of Community Data Science Collective research and outreach work over the next half-decade. Congratulations, Mako!

A round-up of our recent research

Data (Alice Design, cc-by, via the noun project)

We try to keep this blog updated with new research and presentations from members of the group, but we often fall behind. With that in mind, this post is more of a listicle: 22 things you might not have seen from the CDSC in the past year! We’ve included links to (hopefully un-paywalled) copies of just about everything.

Papers and book chapters

Presentations and panels

  • Champion, Kaylea. (2020) How to build a zombie detector: Identifying software quality problems. Seattle GNU/Linux Users Conference, November 2020.
  • Hwang, Sohyeon and Aaron Shaw. (2020) Heterogeneous practices in collective governance. Presented at Collective Intelligence 2020 (CI 2020). Boston & Copenhagen (Virtually held).
  • Shaw, Aaron. The importance of thinking big: Convergence, divergence, and independence among wikis and peer production communities. WikiResearch Showcase. January 20, 2021.
  • TeBlunthuis, Nathan E., Benjamin Mako Hill, and Aaron Halfaker. “Algorithmic Flags and Identity-Based Signals in Online Community Moderation.” Session on Social media 2, International Conference on Computational Social Science (IC2S2 2020), Cambridge, MA, July 19, 2020.
  • TeBlunthuis, Nathan E., Aaron Shaw, and Benjamin Mako Hill. “The Population Ecology of Online Collective Action.” Session on Culture and fairness, International Conference on Computational Social Science (IC2S2 2020), Cambridge, MA, July 19, 2020.
  • TeBlunthuis, Nathan E., Aaron Shaw, and Benjamin Mako Hill. “The Population Ecology of Online Collective Action.” Session on Collective Action, ACM Conference on Collective Intelligence (CI 2020), Boston, MA, June 18, 2020.

CDSC is hiring research assistants

The Northwestern University branch of the Community Data Science Collective (CDSC) is hiring research assistants. CDSC is an interdisciplinary research group made up of faculty and students at multiple institutions, including Northwestern University, Purdue University, and the University of Washington. We’re social and computer scientists studying online communities such as Wikipedia, Reddit, Scratch, and more.

A screenshot from a recent remote meeting of the CDSC…

Recent work by the group includes studies of participation inequalities in online communities and the gig economy, comparisons of different online community rules and norms, and evaluations of design changes deployed across thousands of sites. More examples and information can be found on our list of publications and our research blog (you’re probably reading our blog right now).

This posting is specifically to work on some projects through the Northwestern University part of the CDSC. Northwestern Research Assistants will contribute to data collection, analysis, documentation, and administration on one (or more) of the group’s ongoing projects. Some research projects you might help with include:

  • A study of rules across the five largest language editions of Wikipedia.
  • A systematic literature review on the gig economy.
  • Interviews with contributors to small, niche subreddit communities.
  • A large-scale analysis of the relationships between communities.

Successful applicants will have an interest in online communities, social science or social computing research, and the ability to balance collaborative and independent work. No specialized skills are required and we will adapt work assignments and training to the skills and interests of the person(s) hired. Relevant skills might include: coursework, research, and/or familiarity with digital media, online communities, human-computer interaction, social science research methods such as interviewing, applied statistics, and/or data science. Relevant software experience might include: R, Python, Git, Zotero, or LaTeX. Again, no prior experience or specialized skills are required.

Expected minimum time commitment is 10 hours per week through the remainder of the Winter quarter (late March) with the possibility of working additional hours and/or continuing into the Spring quarter (April-June). All work will be performed remotely.

Interested applicants should submit a resume (or CV) along with a short cover letter explaining their interest in the position and any relevant experience or skills. Applicants should indicate whether they would prefer to pursue this through Federal work-study, for course credit (most likely available only to current students at one of the institutions where CDSC affiliates work), or as a paid position (not Federal work-study). For paid positions, compensation will be $15 per hour. Some funding may be restricted to current undergraduate students (at any institution), which may impact hiring decisions.

Questions and/or applications should be sent to Professor Aaron Shaw. Work-study eligible Northwestern University students should indicate this in their cover letter. Applications will be reviewed by Professor Shaw and current CDSC-NU team members on a rolling basis and finalists will be contacted for an interview.

The CDSC strives to be an inclusive and accessible research community. We particularly welcome applications from members of groups historically underrepresented in computing and/or data sciences. Some of these positions are funded through a U.S. National Science Foundation Research Experience for Undergraduates (REU) supplement to award numbers IIS-1910202 and IIS-1617468.