Roundup: Community Data Science Collective at CHI 2017

The Community Data Science Collective had an excellent week showing off our stuff at CHI 2017 in Denver last week. The collective presented three papers. If you didn’t make it Denver, or if just missed our presentations, blog post summaries of the papers — plus the papers themselves — are all online:

Additionally, Sayamindu Dasgupta’s “Scratch Community Blocks” paper — adapted from his dissertation work at MIT — received a best paper honorable mention award.

All three papers were published as open access so enjoy downloading and sharing the papers!

Children’s Perspectives on Critical Data Literacies

Last week, we presented a new paper that describes how children are thinking through some of the implications of new forms of data collection and analysis. The presentation was given at the ACM CHI conference in Denver last week and the paper is open access and online.

Over the last couple years, we’ve worked on a large project to support children in doing — and not just learning about — data science. We built a system, Scratch Community Blocks, that allows the 18 million users of the Scratch online community to write their own computer programs — in Scratch of course — to analyze data about their own learning and social interactions. An example of one of those programs to find how many of one’s follower in Scratch are not from the United States is shown below.

Last year, we deployed Scratch Community Blocks to 2,500 active Scratch users who, over a period of several months, used the system to create more than 1,600 projects.

As children used the system, Samantha Hautea, a student in UW’s Communication Leadership program, led a group of us in an online ethnography. We visited the projects children were creating and sharing. We followed the forums where users discussed the blocks. We read comment threads left on projects. We combined Samantha’s detailed field notes with the text of comments and forum posts, with ethnographic interviews of several users, and with notes from two in-person workshops. We used a technique called grounded theory to analyze these data.

What we found surprised us. We expected children to reflect on being challenged by — and hopefully overcoming — the technical parts of doing data science. Although we certainly saw this happen, what emerged much more strongly from our analysis was detailed discussion among children about the social implications of data collection and analysis.

In our analysis, we grouped children’s comments into five major themes that represented what we called “critical data literacies.” These literacies reflect things that children felt were important implications of social media data collection and analysis.

First, children reflected on the way that programmatic access to data — even data that was technically public — introduced privacy concerns. One user described the ability to analyze data as, “creepy”, but at the same time, “very cool.” Children expressed concern that programmatic access to data could lead to “stalking“ and suggested that the system should ask for permission.

Second, children recognized that data analysis requires skepticism and interpretation. For example, Scratch Community Blocks introduced a bug where the block that returned data about followers included users with disabled accounts. One user, in an interview described to us how he managed to figure out the inconsistency:

At one point the follower blocks, it said I have slightly more followers than I do. And, that was kind of confusing when I was trying to make the project. […] I pulled up a second [browser] tab and compared the [data from Scratch Community Blocks and the data in my profile].

Third, children discussed the hidden assumptions and decisions that drive the construction of metrics. For example, the number of views received for each project in Scratch is counted using an algorithm that tries to minimize the impact of gaming the system (similar to, for example, Youtube). As children started to build programs with data, they started to uncover and speculate about the decisions behind metrics. For example, they guessed that the view count might only include “unique” views and that view counts may include users who do not have accounts on the website.

Fourth, children building projects with Scratch Community Blocks realized that an algorithm driven by social data may cause certain users to be excluded. For example, a 13-year-old expressed concern that the system could be used to exclude users with few social connections saying:

I love these new Scratch Blocks! However I did notice that they could be used to exclude new Scratchers or Scratchers with not a lot of followers by using a code: like this:
when flag clicked
if then user’s followers < 300
stop all.
I do not think this a big problem as it would be easy to remove this code but I did just want to bring this to your attention in case this not what you would want the blocks to be used for.

Fifth, children were concerned about the possibility that measurement might distort the Scratch community’s values. While giving feedback on the new system, a user expressed concern that by making it easier to measure and compare followers, the system could elevate popularity over creativity, collaboration, and respect as a marker of success in Scratch.

I think this was a great idea! I am just a bit worried that people will make these projects and take it the wrong way, saying that followers are the most important thing in on Scratch.

Kids’ conversations around Scratch Community Blocks are good news for educators who are starting to think about how to engage young learners in thinking critically about the implications of data. Although no kid using Scratch Community Blocks discussed each of the five literacies described above, the themes reflect starting points for educators designing ways to engage kids in thinking critically about data.

Our work shows that if children are given opportunities to actively engage and build with social and behavioral data, they might not only learn how to do data analysis, but also reflect on its implications.

This blog-post and the work that it describes is a collaborative project by Samantha Hautea, Sayamindu Dasgupta, and Benjamin Mako Hill. We have also received support and feedback from members of the Scratch team at MIT (especially Mitch Resnick and Natalie Rusk), as well as from Hal Abelson from MIT CSAIL. Financial support came from the US National Science Foundation.

Surviving an “Eternal September:” How an Online Community Managed a Surge of Newcomers

Attracting newcomers is among the most widely studied problems in online community research. However, with all the attention paid to challenge of getting new users, much less research has studied the flip side of that coin: large influxes of newcomers can pose major problems as well!

The most widely known example of problems caused by an influx of newcomers into an online community occurred in Usenet. Every September, new university students connecting to the Internet for the first time would wreak havoc in the Usenet discussion forums. When AOL connected its users to the Usenet in 1994, it disrupted the community for so long that it became widely known as “The September that never ended.

Our study considered a similar influx in NoSleep—an online community within Reddit where writers share original horror stories and readers comment and vote on them. With strict rules requiring that all members of the community suspend disbelief, NoSleep thrives off the fact that readers experience an immersive storytelling environment. Breaking the rules is as easy as questioning the truth of someone’s story. Socializing newcomers represents a major challenge for NoSleep.

Number of subscribers and moderators on /r/NoSleep over time.

On May 7th, 2014, NoSleep became a “default subreddit”—i.e., every new user to Reddit automatically joined NoSleep. After gradually accumulating roughly 240,000 members from 2010 to 2014, the NoSleep community grew to over 2 million subscribers in a year. That said, NoSleep appeared to largely hold things together. This reflects the major question that motivated our study: How did NoSleep withstand such a massive influx of newcomers without enduring their own Eternal September?

To answer this question, we interviewed a number of NoSleep participants, writers, moderators, and admins. After transcribing, coding, and analyzing the results, we proposed that NoSleep survived because of three inter-connected systems that helped protect the community’s norms and overall immersive environment.

First, there was a strong and organized team of moderators who enforced the rules no matter what. They recruited new moderators knowing the community’s population was going to surge. They utilized a private subreddit for NoSleep’s staff. They were able to socialize and educate new moderators effectively. Although issuing sanctions against community members was often difficult, our interviewees explained that NoSleep’s moderators were deeply committed and largely uncompromising.

That commitment resonates within the second system that protected NoSleep: regulation by normal community members. From our interviews, we found that the participants felt a shared sense of community that motivated them both to socialize newcomers themselves as well as to report inappropriate comments and downvote people who violate the community’s norms.

Finally, we found that the technological systems protected the community as well. For instance, post-throttling was instituted to limit the frequency at which a writer could post their stories. Additionally, Reddit’s “Automoderator”, a programmable AI bot, was used to issue sanctions against obvious norm violators while running in the background. Participants also pointed to the tools available to them—the report feature and voting system in particular—to explain how easy it was for them to report and regulate the community’s disruptors.

This blog post was written with Benjamin Mako Hill. The paper and work this post describes is collaborative work with Benjamin Mako Hill and Andrés Monroy-Hernández. The paper was published in the Proceedings of CHI 2016 and is released as open access so anyone can read the entire paper here. A version of this blogpost was posted on Benjamin Mako Hill’s blog Copyrighteous.

Introducing the Cannabis Data Science Collective

In 2012, Washington State became one of the first two US states to legalize cannabis for non-medical use. Since then, sales tax revenues from the “green economy” have flooded state coffers. Washington’s academic institutions have been elevated by that rising tide. The University of Washington (one of our research group’s two institutional homes) is now home to pot-focused grants from UW’s Center for Cannabis Research and the UW Law School’s Cannabis Law and Policy Project.

Today, our research group — formerly known as the “Community Data Science Collective” — announces that we too will be raiding that pantry to satisfy our own munchies.  Toward that end, we have changed our name to the Cannabis Data Science Collective. We’ll still be the CDSC, but we’re changing our logo to match our new focus.

The CDSC’s new logo!

Our research will leverage our existing expertise in studying the chronic challenges faced by online communities, peer production, and social computing. We plan to blaze ahead on this path to greener pastures.

Although we’re still in the early days of this new research focus, our group has started a work on series of projects related to cannabis, communication, and social computing. The preliminary titles below are a bit half-baked, but will give you a whiff of what’s to come:

  • Altered state: Mobile device usage on public university campuses before and after marijuana legalization
  • A tale of two edibles: Automated polysemy detection and the stevia/sativa controversy
  • Best buds: Online friendship formation and recreational drug use
  • Bing bong: The effect of legalization on Microsoft’s search results
  • Blunt truths: The effect of the joint probability distribution on community participation
  • Dank memes: The role of viral social media in marijuana legalization
  • Decision trees: The role of deliberation in governance of a marijuana sub-Reddit
  • The Effects of cannabis on word usage: An analysis of Wikipedia articles pre/post pot legalization
  • Fully baked: Evidence of the importance of completing institutionalized socialization practices from an online cannabis community
  • Ganja rep: A novel approach to managing identity on the World Weed Web
  • Hashtags: Bottom-up organization of the marijuana-focused Internet public sphere
  • Higher calling: Marijuana use and altruistic behavior online
  • Joint custody: Overcoming territoriality with shared ownership of work products in a collaborative cannabis community
  • Pass the piece: Hardware design and social exchange norms in synchronous marijuana-sharing communities
  • Pipe dreams: Fan fiction and the imagined futures of the marijuana legalization movement
  • Sticky icky: Keeping order with pinned messages on an online marijuana discussion board
  • Turn on, tune in, drop out: Wikipedia participation rates following marijuana legalization
  • Weed and Wikipedia: Marijuana legalization and public goods participation
  • World Weed Web: A look at the global conversation before and after half of the United States decriminalized

We planned to post this announcement about three weeks ago but our efforts were blunted by a series of events outside our control. We figured it was high time to make the announcement today!

 

Why do people start new online communities and projects?

Online communities have become ubiquitous, providing not only entertainment but wielding increasing cultural and political influence. While news organizations and researchers have focused a lot of attention on online communities after they become influential, very little is known about how or why they get started. Our survey of hundreds of Wikia.com founders shows that typical online communities are actually very different from the communities that are “in the news”. Online community founders have diverse motivations, but typically have modest goals which are focused on filling their own needs, and they don’t necessarily care if their projects ever get very big. Our research suggests that rather than being failures, small online communities are both intentional and common.

Most online communities are small —Our research is inspired by the skewed distribution of attention online. For example, these three graphs show the number of contributors to each subreddit, github project, and Wikipedia page. (Note the log scale – the reality is even more skewed than these plots make it appear).

Reddit graph


Github graph

Wikipedia graphIn every case, there is a “long tail” of projects with very few contributions or attention, while the most popular projects get the lion’s share. It is perhaps unsurprising, then, that they also garner the majority of scholarly attention. However, what these graphs also show is that most online communities are very small.

Even when scholars include smaller communities in their analysis, they typically treat longevity and size as measures of success. Using this metric, the vast majority of new projects fail. So why do people start new online communities? Are they simply naive, not realizing that large-scale success is so rare? Are community founders trying to win the attention lottery?

Our Survey —We worked with some great folks at Wikia to send a survey to community founders right after they started their community. We received partial or full responses from hundreds of founders.

Wikia homepage
Wikia homepage as it appeared during our data collection (via archive.org) with the invitation to found a new wiki highlighted. Twilight was really big in 2010.

 

In addition to demographic information, we asked a set of thirteen questions about the motivations of founders, based on the contributor motivation literature, and seven questions about their goals for their community. We also asked founders about their plans for their community, and whether they were planning to follow some of the best practices for building and running online communities.

Founders have diverse motivations and modest goals — We found that Wikia founders have diverse motivations. We used PCA to identify four main motivations for creating new wikis: spreading information and building a community, problems with existing wikis, for fun or learning, and creating and publicizing personal content. Spreading information and building a community was the most common motivation, but each of these was marked as a primary motivation by multiple respondents.

We also found that the barriers to starting a new community – both technological and cognitive – are very low. Only 32% of founders reported planning on starting their wiki for a few weeks or longer, while fully 46% of founders had only planned it for a few hours or a few minutes.

As with motivations, founders had diverse goals. The most common top goal was the creation of high-quality information, with nearly half of respondents selecting it. Community longevity/activity and growth were also common goals.

Finally, we looked at whether there was a relationship between motivations and goals, and between goals and plans for community building. We found that those whose top goal was information quality were less likely to be motivated by fun and learning, and that they were less likely to plan on recruiting contributors or encouraging contributions. In future research, we are looking at how a founder’s goals and plans relate to membership and contribution growth.

Motivations by goals
Plans by goals
Distribution of founder motivations and plans, based on whether their top goal is community or information quality.

So what? —We believe that platform designers and researchers should focus more of their resources on understanding small and short-lived communities. Our research suggests that the attention paid to the more popular and long-lived online communities has perpetuated a false assumption that all communities seek to become large and powerful. Indeed, our respondents are typically not seeking or even hoping for large-scale “success”.

In addition, we believe that in many contexts, understanding online communities can be augmented by focusing on founders. Platform designers can study founders to understand how users would like to use a system and researchers can do more to understand the differences between founders and other contributors.

There is also a need to generalize this research – founders on other online platforms (Reddit, github, etc.) may have a different set of motivations and goals (although we suspect that they will be similarly modest in their ambitions). Overall, there is lots of room for additional research on how and why things get started online.

The paper and data — If you liked this blog post, then you’ll love the full paper: Starting online communities: Motivations and goals of wiki founders. Even better, if you are planning to be at CHI 2017, come watch the talk!

This post (and the paper) were written by Jeremy Foote, Aaron Shaw and Darren Gergle. The charts at the beginning of the post were created using data from the great public datasets at Big Query. Anonymized results of the survey are publicly available, and code is coming.

 

Searching for competition on Change.org with LDA topic models

You may have heard of Change.org. It’s a popular online petitioning platform. You may have even noticed there can many online petitions about popular topics. For instance, it is easy to find dozens of petitions protesting the Lychee and Dog Meat Festival with varying levels of support.

Imagine you want to start an online petition. You might worry if your petition is very similar to other people’s petitions that already have signatures. These other petitions have a head start and will get all the attention. That said, if nobody has made any similar petitions, maybe that’s because the issue you are petitioning about doesn’t yet have a lot of popular support. You might also worry if your petition is unusual. Which of these two worries (making a duplicate petition and making a petition no one cares about) should concern you, dear petition creator? In my research, I set out to answer this question. The project is still in progress. I recently presented it as a poster at CSCW ’17.

Sociologists of organizational ecology considered similar questions about businesses and social movement organizations. They wanted to explain why organizations were more likely to die when an industry was young or old, but less likely to die in between. They argued that density, or the number of organizations in the population, was tied both to processes of legitimation and competition. There aren’t many firms in unproven industries because it’s not clear the industry will succeed, but when an industry is mature it becomes competitive. Everybody wants a piece of the pie, but you might not get enough pie to survive! This notion is called density dependence theory.

I think it is intuitive to apply this logic to online petitions and topics. If you make a petition about a low-density topic, chances for success should be lower because the petition is more likely to be unusual or illegitimate. However if you make a petition in a high-density topic, now you have to worry about competition with all the other petitions in the topic. You want your petition to be original, but not weird!

To collect data to test this theory, I downloaded a large set of petitions from Change.org, spam filtered them, and removed very short ones. Next I used LDA topic modeling to group petitions into topics. This makes it possible to assign petitions to points in a topic space. The more crowded this part of topic space, the denser the petition’s environment.

Finally, I used a regression model to predict petition signature counts. Since density dependence theory predicts that the relationship between density and signature count is shaped like an upside-down U, I included a quadratic term for density. The plot below shows that observed relationship between density in topic space and signature count is what the theory predicted. The darkness of the lines at the bottom of the plot show that most petitions are in less dense parts of topic space.  So you, dear petition creator, should worry about competition and legitimacy, but worry about legitimacy first!

I’m excited by this result because it shows interesting similarities between efforts to organize coordinated activism online and traditional organizations like firms. I’m planning to apply this method to other forms of online coordination like wikis and online communities.

This blog-post and the work it describes is a collaborative project between Nate TeBlunthuis, Benjamin Mako Hill and Aaron Shaw. We are still at work writing this project up as a research article. The work has been supported by the US National Science Foundation

New Dataset: Five Years of Longitudinal Data from Scratch

Scratch is a block-based programming language created by the Lifelong Kindergarten Group (LLK) at the MIT Media Lab. Scratch gives kids the power to use programming to create their own interactive animations and computer games. Since 2007, the online community that allows Scratch programmers to share, remix, and socialize around their projects has drawn more than 16 million users who have shared nearly 20 million projects and more than 100 million comments. It is one of the most popular ways for kids to learn programming and among the larger online communities for kids in general.

Front page of the Scratch online community (https://scratch.mit.edu) during the period covered by the dataset.

Since 2010, I have published a series of papers using quantitative data collected from the database behind the Scratch online community. As the source of data for many of my first quantitative and data scientific papers, it’s not a major exaggeration to say that I have built my academic career on the dataset.

I was able to do this work because I happened to be doing my masters in a research group that shared a physical space (“The Cube”) with LLK and because I was friends with Andrés Monroy-Hernández, who started in my masters cohort at the Media Lab. A year or so after we met, Andrés conceived of the Scratch online community and created the first version for his masters thesis project. Because I was at MIT and because I knew the right people, I was able to get added to the IRB protocols and jump through the hoops necessary to get access to the database.

Over the years, Andrés and I have heard over and over, in conversation and in reviews of our papers, that we were privileged to have access to such a rich dataset. More than three years ago, Andrés and I began trying to figure out how we might broaden this access. Andrés had the idea of taking advantage of the launch of Scratch 2.0 in 2013 to focus on trying to release the first five years of Scratch 1.x online community data (March 2007 through March 2012) — most of the period that the codebase he had written ran the site.

After more work than I have put into any single research paper or project, Andrés and I have published a data descriptor in Nature’s new journal Scientific Data. This means that the data is now accessible to other researchers. The data includes five years of detailed longitudinal data organized in 32 tables with information drawn from more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and much more. The dataset includes metadata on user behavior as well the full source code for every project. Alongside the data is the source code for all of the software that ran the website and that users used to create the projects as well as the code used to produce the dataset we’ve released.

Releasing the dataset was a complicated process. First, we had navigate important ethical concerns about the the impact that a release of any data might have on Scratch’s users. Toward that end, we worked closely with the Scratch team and the the ethics board at MIT to design a protocol for the release that balanced these risks with the benefit of a release. The most important features of our approach in this regard is that the dataset we’re releasing is limited to only public data. Although the data is public, we understand that computational access to data is different in important ways to access via a browser or API. As a result, we’re requiring anybody interested in the data to tell us who they are and agree to a detailed usage agreement. The Scratch team will vet these applicants. Although we’re worried that this creates a barrier to access, we think this approach strikes a reasonable balance.

Beyond the the social and ethical issues, creating the dataset was an enormous task. Andrés and I spent Sunday afternoons over much of the last three years going column-by-column through the MySQL database that ran Scratch. We looked through the source code and the version control system to figure out how the data was created. We spent an enormous amount of time trying to figure out which columns and rows were public. Most of our work went into creating detailed codebooks and documentation that we hope makes the process of using this data much easier for others (the data descriptor is just a brief overview of what’s available). Serializing some of the larger tables took days of computer time.

In this process, we had a huge amount of help from many others including an enormous amount of time and support from Mitch Resnick, Natalie Rusk, Sayamindu Dasgupta, and Benjamin Berg at MIT as well as from many other on the Scratch Team. We also had an enormous amount of feedback from a group of a couple dozen researchers who tested the release as well as others who helped us work through through the technical, social, and ethical challenges. The National Science Foundation funded both my work on the project and the creation of Scratch itself.

Because access to data has been limited, there has been less research on Scratch than the importance of the system warrants. We hope our work will change this. We can imagine studies using the dataset by scholars in communication, computer science, education, sociology, network science, and beyond. We’re hoping that by opening up this dataset to others, scholars with different interests, different questions, and in different fields can benefit in the way that Andrés and I have. I suspect that there are other careers waiting to be made with this dataset and I’m excited by the prospect of watching those careers develop.

You can find out more about the dataset, and how to apply for access, by reading the data descriptor on Nature’s website.

The paper and work this post describes is collaborative work with Andrés Monroy-Hernández. The paper is released as open access so anyone can read the entire paper here. This blog post was also posted on Benjamin Mako Hill’s blog.

Supporting children in doing data science

As children use digital media to learn and socialize, others are collecting and analyzing data about these activities. In school and at play, these children find that they are the subjects of data science. As believers in the power of data analysis, we believe that this approach falls short of data science’s potential to promote innovation, learning, and power.

Motivated by this fact, we have been working over the last three years as part of a team at the MIT Media Lab and the University of Washington to design and build a system that attempts to support an alternative vision: children as data scientists. The system we have built is described in a new paper—Scratch Community Blocks: Supporting Children as Data Scientists—that will be published in the proceedings of CHI 2017.

Our system is built on top of Scratch, a visual, block-based programming language designed for children and youth. Scratch is also an online community with over 15 million registered members who share their Scratch projects, remix each others’ work, have conversations, provide feedback, bookmark or “love” projects they like, follow other users, and more. Over the last decade, researchers—including us—have used the Scratch online community’s database to study the youth using Scratch. With Scratch Community Blocks, we attempt to put the power to programmatically analyze these data into the hands of the users themselves.

To do so, our new system adds a set of new programming primitives (blocks) to Scratch so that users can access public data from the Scratch website from inside Scratch. Blocks in the new system gives users access to project and user metadata, information about social interaction, and data about what types of code are used in projects. The full palette of blocks to access different categories of data is shown below.

Project metadata
User metadata
Site-wide statistics

The new blocks allow users to programmatically access, filter, and analyze data about their own participation in the community. For example, with the simple script below, we can find whether we have followers in Scratch who report themselves to be from Spain, and what their usernames are.

Simple demonstration of Scratch Community Blocks

In designing the system, we had two primary motivations. First, we wanted to support avenues through which children can engage in curiosity-driven, creative explorations of public Scratch data. Second, we wanted to foster self-reflection with data. As children looked back upon their own participation and coding activity in Scratch through the project they and their peers made, we wanted them to reflect on their own behavior and learning in ways that shaped their future behavior and promoted exploration.

After designing and building the system over 2014 and 2015, we invited a group of active Scratch users to beta test the system in early 2016. Over four months, 700 users created more than 1,600 projects. The diversity and depth of users creativity with the new blocks surprised us. Children created projects that gave the viewer of the project a personalized doughnut-chart visualization of their coding vocabulary on Scratch, rendered the viewer’s number of followers as scoops of ice-cream on a cone, attempted to find whether “love-its” for projects are more common on Scratch than “favorites”, and told users how “talkative” they were by counting the cumulative string-length of project titles and descriptions.

We found that children, rather than making canonical visualizations such as pie-charts or bar-graphs, frequently made information representations that spoke to their own identities and aesthetic sensibilities. A 13-year-old girl had made a virtual doll dress-up game where the player’s ability to buy virtual clothes and accessories for the doll was determined by the level of their activity in the Scratch community. When we asked about her motivation for making such a project, she said:

I was trying to think of something that somebody hadn’t done yet, and I didn’t see that. And also I really like to do art on Scratch and that was a good opportunity to use that and mix the two [art and data] together.

We also found at least some evidence that the system supported self-reflection with data. For example, after seeing a project that showed its viewers a visualization of their past coding vocabulary, a 15-year-old realized that he does not do much programming with the pen-related primitives in Scratch, and wrote in a comment, “epic! looks like we need to use more pen blocks. :D.”

Doughnut visualization
Ice-cream visualization
Data-driven doll dress up

Additionally, we noted that that as children made and interacted with projects made with Scratch Community Blocks, they started to critically think about the implications of data collection and analysis. These conversations are the subject of another paper (also being published in CHI 2017).

In a 1971 article called “Teaching Children to be Mathematicians vs. Teaching About Mathematics”, Seymour Papert argued for the need for children doing mathematics vs. learning about it. He showed how Logo, the programming language he was developing at that time with his colleagues, could offer children a space to use and engage with mathematical ideas in creative and personally motivated ways. This, he argued, enabled children to go beyond knowing about mathematics to “doing” mathematics, as a mathematician would.

Scratch Community Blocks has not yet been launched for all Scratch users and has several important limitations we discuss in the paper. That said, we feel that the projects created by children in our the beta test demonstrate the real potential for children to do data science, and not just know about it, provide data for it, and to have their behavior nudged and shaped by it.

This blog-post and the work that it describes is a collaborative project between Sayamindu Dasgupta and Benjamin Mako Hill. We have also received support and feedback from members of the Scratch team at MIT (especially Mitch Resnick and Natalie Rusk), as well as from Hal Abelson from MIT CSAIL. Financial support came from the US National Science Foundation. We will be presenting this paper at CHI in May, and will be thrilled to talk more about our work and about future directions.

Studying the relationship between remixing & learning

With more than 10 million users, the Scratch online community is the largest online community where kids learn to program. Since it was created, a central goal of the community has been to promote “remixing” — the reworking and recombination of existing creative artifacts. As the video above shows, remixing programming projects in the current web-based version of Scratch is as easy is as clicking on the “see inside” button in a project web-page, and then clicking on the “remix” button in the web-based code editor. Today, close to 30% of projects on Scratch are remixes.

Remixing plays such a central role in Scratch because its designers believed that remixing can play an important role in learning. After all, Scratch was designed first and foremost as a learning community with its roots in the Constructionist framework developed at MIT by Seymour Papert and his colleagues. The design of the Scratch online community was inspired by Papert’s vision of a learning community similar to Brazilian Samba schools (Henry Jenkins writes about his experience of Samba schools in the context of Papert’s vision here), and a comment Marvin Minsky made in 1984:

Adults worry a lot these days. Especially, they worry about how to make other people learn more about computers. They want to make us all “computer-literate.” Literacy means both reading and writing, but most books and courses about computers only tell you about writing programs. Worse, they only tell about commands and instructions and programming-language grammar rules. They hardly ever give examples. But real languages are more than words and grammar rules. There’s also literature – what people use the language for. No one ever learns a language from being told its grammar rules. We always start with stories about things that interest us.

In a new paper — titled “Remixing as a pathway to Computational Thinking” — that was recently published at the ACM Conference on Computer Supported Collaborative Work and Social Computing (CSCW) conference, we used a series of quantitative measures of online behavior to try to uncover evidence that might support the theory that remixing in Scratch is positively associated with learning.

scratchblocksOf course, because Scratch is an informal environment with no set path for users, no lesson plan, and no quizzes, measuring learning is an open problem. In our study, we built on two different approaches to measure learning in Scratch. The first approach considers the number of distinct types of programming blocks available in Scratch that a user has used over her lifetime in Scratch (there are 120 in total) — something that can be thought of as a block repertoire or vocabulary. This measure has been used to model informal learning in Scratch in an earlier study. Using this approach, we hypothesized that users who remix more will have a faster rate of growth for their code vocabulary.

Controlling for a number of factors (e.g. age of user, the general level of activity) we found evidence of a small, but positive relationship between the number of remixes a user has shared and her block vocabulary as measured by the unique blocks she used in her non-remix projects. Intriguingly, we also found a strong association between the number of downloads by a user and her vocabulary growth. One interpretation is that this learning might also be associated with less active forms of appropriation, like the process of reading source code described by Minksy.

The second approach we used considered specific concepts in programming, such as loops, or event-handling. To measure this, we utilized a mapping of Scratch blocks to key programming concepts found in this paper by Karen Brennan and Mitchel Resnick. For example, in the image below are all the Scratch blocks mapped to the concept of “loop”.

scratchblocksctWe looked at six concepts in total (conditionals, data, events, loops, operators, and parallelism). In each case, we hypothesized that if someone has had never used a given concept before, they would be more likely to use that concept after encountering it while remixing an existing project.

Using this second approach, we found that users who had never used a concept were more likely to do so if they had been exposed to the concept through remixing. Although some concepts were more widely used than others, we found a positive relationship between concept use and exposure through remixing for each of the six concepts. We found that this relationship was true even if we ignored obvious examples of cutting and pasting of blocks of code. In all of these models, we found what we believe is evidence of learning through remixing.

Of course, there are many limitations in this work. What we found are all positive correlations — we do not know if these relationships are causal. Moreover, our measures do not really tell us whether someone has “understood” the usage of a given block or programming concept.However, even with these limitations, we are excited by the results of our work, and we plan to build on what we have. Our next steps include developing and utilizing better measures of learning, as well as looking at other methods of appropriation like viewing the source code of a project.

This blog post and the paper it describes are collaborative work with Sayamindu Dasgupta, Andrés Monroy-Hernández, and William Hale. The paper is released as open access so anyone can read the entire paper here. This blog post was also posted on Benjamin Mako Hill’s blog, on Sayamindu Dasgupta’s blog and on Medium by the MIT Media Lab.

Jackie Robinson Day (Re-estimated)

When major league baseball held its opening day on April 15 in 1947, a 28 year-old infielder made his highly-anticipated debut at first base for the Brooklyn Dodgers. He would go on to record an extraordinary season and career worthy of induction into the Baseball Hall of Fame, winning Rookie of the Year honors in 1947, a batting title and Most Valuable Player award in 1949, and a World Series title in 1955. He also produced two seasons that rank among the top 100 ever (by the metric of Wins-Above-Replacement among position players).

Jackie Robinson (1954 public domain photo by Bob Sandberg for Look Magazine).

 

Looking at the box score, Jackie Robinson didn’t make an overwhelming impact on the outcome of his first game, but his presence on the field challenged the racist status quo of professional baseball and American society. What’s more, the intense public-ness of the challenge made Robinson’s presence a symbol and a spectacle: of the roughly 26,500 spectators in attendance at Ebbets field, an estimated 14,000 were black. I cannot imagine what it was like to be at that game — one of those rare places and moments where it becomes possible to see an historic social transformation as it unfolds. Just the thought gives me goosebumps.

Every major league player, coach, and umpire will don Robinson’s iconic number 42 in recognition today. Watching games and highlights from Jackie Robinson Days past, I’ve been troubled by how easily such observances drift into a hagiographic reverie that sometimes even take on a self-congratulatory tone. Stories of Robinson’s incredible athletic and personal accomplishments sometimes efface his struggle against horrible, violent, and aggressive responses. Worse yet, the stories usually play down the persistence of racism and its effects today. Baseball celebrates Jackie Robinson Day out of a strange combination of guilt and pride; knowledge and ignorance; resistance and complicity.

As I indicated earlier, Robinson’s performance and impact qualified him for the Hall of Fame along multiple dimensions. However, another way to think about his unique contribution to baseball is to consider how such virulent racism likely affected his play and how unbelievably, mind-blowingly great a player he might have been under less racist conditions.

There’s no obviously valid way to construct a counterfactual Jackie Robinson, but research on the phenomenon of stereotype threat suggests a very simple, naive statistical adjustment strategy. To paraphrase a bunch of scholarly studies and the (pretty extensive) Wikipedia article, stereotype threat reduces the performance of individuals who belong to negatively stereotyped groups, largely by inducing feelings of anxiety.

Stereotype threat affects various kinds of behaviors including athletic achievement. A 1999 study by Jeff Stone and colleagues (pdf) estimates the effects of some typical forms of stereotype threat on a sample of black men’s athletic performance, reporting that race-based priming resulted in a 23.5% worse outcome on a miniature golf (!) task than a control condition with no priming.

Consider that the priming in this Stone et al. study was done in a fairly polite, impersonal, non-hateful, non-threatening way in relation to a mini-golf task with absolutely nothing at stake. Consider just how personal, vitriolic, and violent the responses to Jackie Robinson were — many of them coming directly from opposing players and “fans” who went to great pains to heckle him in the middle of at-bats, physically target him with violent slides and more on the field, or issue death threats to him and his family. Consider how much Robinson had at stake and just how public his successes and his failures would have been.

 

Some people may like to imagine (and filmmakers may like to depict) that the hatred helped to motivate and focus Robinson, spurring him to even greater performance. Similarly, part of the mystique of the greatest athletes is that they seem to empty their heads of all the noise and distractions that would debilitate the rest of us at precisely those moments when the stakes and pressures are highest. It’s easy to say that Robinson didn’t respond to the pressure in the same way as most humans would, but the research on stereotype threat suggests that it probably affected him on the field anyway. Just being reminded — even in very subtle, socially-coded ways — that you belong to a socially excluded group reduced athletic performance by nearly a quarter. The sort of cognitive burden that comes along with being singled out and targeted by the kind of racial hatred that Robinson experienced must be orders of magnitude greater. What sort of impact would this burden have had on Robinson’s play?

Now, go look at the stat lines again from those two spectacular seasons (1949, WAR 9.6,  and 1951, WAR 9.7) that Robinson had and imagine them without the stress, the pain, the distraction of all that hate. Be a little bit generous and inflate the WAR statistics by the same 23.5% that Stone et al.’s subjects performance dropped in a laboratory study in ridiculously low-key conditions. Under these assumptions, Robinson’s two greatest seasons might have yielded WAR of 11.9 and 12.0 respectively — easily placing them both among the top 10 seasons by a position player ever.