Community Data Science Collective

May 23, 2018May 23, 2018

Testing the “wide walls” design principle in the wild

Seymour Papert is credited as saying that tools to support learning should have “high ceilings” and “low floors.” The phrase is meant to suggest that tools should allow learners to do complex and intellectually sophisticated things but should also be easy to begin using quickly. Mitchel Resnick extended the metaphor to argue that learning toolkits should also have “wide walls” in that they should appeal to diverse groups of learners and allow for a broad variety of creative outcomes. In a new paper, Benjamin Mako Hill and I attempted to provide the first empirical test of Resnick’s wide walls theory. Using a natural experiment in the Scratch online community, we found causal evidence that “widening walls” can, as Resnick suggested, increase both engagement and learning.

Over the last ten years, the “wide walls” design principle has been widely cited in the design of new systems. For example, Resnick and his collaborators relied heavily on the principle in the design of the Scratch programming language. Scratch allows young learners to produce not only games, but also interactive art, music videos, greetings card, stories, and much more. As part of that team, I was guided by “wide walls” principle when I designed and implemented the Scratch cloud variables system in 2011-2012.

While designing the system, I hoped to “widen walls” by supporting a broader range of ways to use variables and data structures in Scratch. Scratch cloud variables extend the affordances of the normal Scratch variable by adding persistence and shared-ness. A simple example of something possible with cloud variables, but not without them, is a global high-score leaderboard in a game (example code is below). After the system was launched, I saw many young Scratch users using the system to engage with data structures in new and incredibly creative ways.

cloud-variable-script — Example of Scratch code that uses a cloud variable to keep track of high-scores among all players of a game.

Although these examples reflected powerful anecdotal evidence, I was also interested in using quantitative data to reflect the causal effect of the system. Understanding the causal effect of a new design in real world settings is a major challenge. To do so, we took advantage of a “natural experiment” and some clever techniques from econometrics to measure how learners’ behavior changed when they were given access to a wider design space.

Understanding the design of our study requires understanding a little bit about how access to the Scratch cloud variable system is granted. Although the system has been accessible to Scratch users since 2013, new Scratch users do not get access immediately. They are granted access only after a certain amount of time and activity on the website (the specific criteria are not public). Our “experiment” involved a sudden change in policy that altered the criteria for who gets access to the cloud variable feature. Through no act of their own, more than 14,000 users were given access to feature, literally overnight. We looked at these Scratch users immediately before and after the policy change to estimate the effect of access to the broader design space that cloud variables afforded.

We found that use of data-related features was, as predicted, increased by both access to and use of cloud variables. We also found that this increase was not only an effect of projects that use cloud variables themselves. In other words, learners with access to cloud variables—and especially those who had used it—were more likely to use “plain-old” data-structures in their projects as well.

The graph below visualizes the results of one of the statistical models in our paper and suggests that we would expect that 33% of projects by a prototypical “average” Scratch user would use data structures if the user in question had never used used cloud variables but that we would expect that 60% of projects by a similar user would if they had used the system.

Model-predicted probability that a project made by a prototypical Scratch user will contain data structures (w/o counting projects with cloud variables)

It is important to note that the estimated effective above is a “local average effect” among people who used the system because they were granted access by the sudden change in policy (this is a subtle but important point that we explain this in some depth in the paper). Although we urge care and skepticism in interpreting our numbers, we believe our results are encouraging evidence in support of the “wide walls” design principle.

Of course, our work is not without important limitations. Critically, we also found that rate of adoption of cloud variables was very low. Although it is hard to pinpoint the exact reason for this from the data we observed, it has been suggested that widening walls may have a potential negative side-effect of making it harder for learners to imagine what the new creative possibilities might be in the absence of targeted support and scaffolding. Also important to remember is that our study measures “wide walls” in a specific way in a specific context and that it is hard to know how well our findings will generalize to other contexts and communities. We discuss these caveats, as well as our methods, models, and theoretical background in detail in our paper which now available for download as an open-access piece from the ACM digital library.

This blog post, and the open access paper that it describes, is a collaborative project with Benjamin Mako Hill. Financial support came from the eScience Institute and the Department of Communication at the University of Washington. Quantitative analyses for this project were completed using the Hyak high performance computing cluster at the University of Washington.

April 23, 2018April 23, 2018

Revisiting the ‘Rise and Decline’

This graph shows the number of people contributing to Wikipedia over time:

The Rise and Decline of Wikipedia — The number of active Wikipedia contributors exploded, suddenly stalled, and then began gradually declining. (Figure taken from Halfaker et al. 2013)

The figure comes from “The Rise and Decline of an Open Collaboration System,” a well-known 2013 paper that argued that Wikipedia’s transition from rapid growth to slow decline in 2007 was driven by an increase in quality control systems. Although many people have treated the paper’s finding as representative of broader patterns in online communities, Wikipedia is a very unusual community in many respects. Do other online communities follow Wikipedia’s pattern of rise and decline? Does increased use of quality control systems coincide with community decline elsewhere?

In a paper I am presenting Thursday morning at the Association for Computing Machinery (ACM) Conference on Human Factors in Computing Systems (CHI), a group of us have replicated and extended the 2013 paper’s analysis in 769 other large wikis. We find that the dynamics observed in Wikipedia are a strikingly good description of the average Wikia wiki. They appear to reoccur again and again in many communities.

The original “Rise and Decline” paper (I’ll abbreviate it “RAD”) was written by Aaron Halfaker, R. Stuart Geiger, Jonathan T. Morgan, and John Riedl. They analyzed data from English Wikipedia and found that Wikipedia’s transition from rise to decline was accompanied by increasing rates of newcomer rejection as well as the growth of bots and algorithmic quality control tools. They also showed that newcomers whose contributions were rejected were less likely to continue editing and that community policies and norms became more difficult to change over time, especially for newer editors.

Our paper, just published in the CHI 2018 proceedings, replicates most of RAD’s analysis on a dataset of 769 of the largest wikis from Wikia that were active between 2002 to 2010. We find that RAD’s findings generalize to this large and diverse sample of communities.

I can walk you through some of the key findings. First, the growth trajectory of the average wiki in our sample is similar to that of English Wikipedia. As shown in the figure below, an initial period of growth stabilizes and leads to decline several years later.

Rise and Decline on Wikia — The average Wikia wikia also experience a period of growth followed by stabilization and decline (from TeBlunthuis, Shaw, and Hill 2018).

We also found that newcomers on Wikia wikis were reverted more and continued editing less. As on Wikipedia, the two processes were related. Similar to RAD, we also found that newer editors were more likely to have their contributions to the “project namespace” (where policy pages are located) undone as wikis got older. Indeed, the specific estimates from our statistical models are very similar to RAD’s for most of these findings!

There were some parts of the RAD analysis that we couldn’t reproduce in our context. For example, there are not enough bots or algorithmic editing tools in Wikia to support statistical claims about their effects on newcomers.

At the same time, we were able to do some things that the RAD authors could not. Most importantly, our findings discount some Wikipedia-specific explanations for a rise and decline. For example, English Wikipedia’s decline coincided with the rise of Facebook, smartphones, and other social media platforms. In theory, any of these factors could have caused the decline. Because the wikis in our sample experienced rises and declines at similar points in their life-cycle but at different points in time, the rise and decline findings we report seem unlikely to be caused by underlying temporal trends.

The big communities we study seem to have consistent “life cycles” where stabilization and/or decay follows an initial period of growth. The fact that the same kinds of patterns happen on English Wikipedia and other online groups implies a more general set of social dynamics at work that we do not think existing research (including ours) explains in a satisfying way. What drives the rise and decline of communities more generally? Our findings make it clear that this is a big, important question that deserves more attention.

We hope you’ll read the paper and get in touch by commenting on this post or emailing me if you’d like to learn or talk more. The paper is available online and has been published under an open access license. If you really want to get into the weeds of the analysis, we will soon publish all the data and code necessary to reproduce our work in a repository on the Harvard Dataverse.

I will be presenting the project this week at CHI in Montréal on Thursday April 26 at 9am in room 517D. For those of you not familiar with CHI, it is the top venue for Human-Computer Interaction. All CHI submissions go through double-blind peer review and the papers that make it into the proceedings are considered published (same as journal articles in most other scientific fields). Please feel free to cite our paper and send it around to your friends!

This blog post, and the open access paper that it describes, is a collaborative project with Aaron Shaw, and Benjamin Mako Hill. Financial support came from the US National Science Foundation (grants IIS-1617129, IIS-1617468, and GRFP-2016220885 ), Northwestern University, the Center for Advanced Study in the Behavioral Sciences at Stanford University, and the University of Washington. This project was completed using the Hyak high performance computing cluster at the University of Washington.

April 2, 2018

Open Lab at the University of Washington

If you are at the University of Washington (or not at UW but in Seattle) and are interested in seeing what we’re up to, you can join us for a Community Data Science Collective “open lab” this Friday (April 6th) 3-5pm in our new lab space (CMU 306). Collective members from Northwestern University will be in town as well, so there’s even more reason to come!

The open lab is an opportunity to learn about our research, catch up over snacks and beverages, and pick up a sticker or two. We will have no presentations but several posters describing projects we are working on.

April 1, 2018April 2, 2018

Workshop on Casual Inference in Online Communities

Casual Inference Logo

The last decade has seen a massive increase in formality and rigor in quantitative and statistical research methodology in the social scientific study of online communities. These changes have led to higher reliability, increased reproducibility, and increased faith that our findings accurately reflect empirical reality. Unfortunately, these advancements have not come without important costs. When high methodological standards make it harder for scientists to know things, we lose the ability to speak about important phenomena and relationships.

There are many studies that simply cannot be done with the highest levels of statistical rigor. Significant social concepts such as race and gender can never truly be randomly assigned. There are relationships that are rare enough that they can never be described with a p-value of less than 0.05. To understand these phenomena, our methodology must be more relaxed. In our rush to celebrate the benefits of rigor and formality, social scientists are not exploring the ways in which more casual forms of statistical inference can be useful.

To discuss these issues and their impact in social computing research, the Community Data Science Collective will be holding the first ever workshop on Casual Inference in Online Communities this coming October in Evanston, Illinois. We hope to announce specific dates soon.

Although our program remains to be finalized, we’re currently planning to organize the workshop around five panels:

Panel 1: Relaxing Assumptions: A large body of work in statistics has critiqued the arbitrary and rigid “p < .05” significance standard and pointed to problems like “p-hacking” that it has caused. But what about the benefits that flow from a standard of evidence that one out of twenty non-effects can satisfy? In this panel, we will discuss some of the benefits of p-value standards that allow researchers to easily reject the null hypothesis that there is no effect.; For example, how does science benefit from researchers’ ability to keep trying models until they find a publishable result? What do we learn when researchers can easily add or drop correlated measures to achieve significance? We will also talk about promising new methods available to scientists for overcoming high p-values like choosing highly informative Bayesian priors that ensure credible intervals far away from 0. We will touch on unconstrained optimization, a new way of fitting models by “guesstimating” parameters.
Panel 2: Exputation of Missing Data: Missing data is a major problem in social research. The most common ways of addressing missing data are imputation methods. Of course, imputation techniques bring with them assumptions that are hard to understand and often violated. How might types of imputation less grounded in data and theory help? How might we relax assumptions to infer things more casually about data—and with data—that we can not, and will not, ever have? How can researchers use their beliefs and values to infer data?; Our conversation will focus on exputation, a new approach that allows researches to use their intuition, beliefs, and desires to imagine new data. We will touch on multiple exputation techniques where researchers engage in the process repeatedly to narrow in on desired results.
Panel 3: Quasi-Quasi Experiments: Not every study can be at the scientific gold standard of a randomized control experiment. The idea of quasi-experiments are designed to relax certain assumptions and requirements in order to draw similar types of inference from non-experimental settings. This panel will ask what might we gain if we were relax things even more.; What might we learn from quasi-quasi experiments, where shocks aren’t quite exogenous (and might not even be that shocking)? We also hope to discuss superficial intelligence, post hoc ergo propter hoc techniques, supernatural experiments, and symbolic matching based on superficial semantic similarities.
Panel 4: Irreproducible Results: Since every researcher and every empirical context is unique, why do we insist that the same study conducted by different researchers should not be? What might be gained from embracing, or even pursuing, irreproducible methods in our research? What might we see if we allow ourselves to be the giants upon whose shoulders we stand?

Panel 5: Research Ethics: [Canceled]

Although we are hardly the first people to talk about casual inference, we believe this will be the first academic meeting on the topic in any field. Please plan to join us if you can!

If you would like to apply to participate, please send a position paper or extended abstract (no more than 1000 words) to casualinference@communitydata.cc. We plan to post a list of the best submissions.

Workshop logo based on the “Hammock” icon by Gan Khoon Lay from the Noun Project.

March 21, 2018March 21, 2018

Adventures in applied data science

Many organizations have unprecedented access to data, experiments, and statistical inference. The diffusion of these resources has created pressure to develop the skills and practices necessary to use them. However, the distribution of these skills and practices has an organizational component, leading some teams and organizations to harness social scientific insights far more effectively than others.

Handy web-based tools like ABBA can make a-b testing more accessible

We hear plenty about examples of “bad” statistics in the news. For example, Brian Wansink and the Cornell Food Lab have gotten a whole lot of attention for problems in their statistical analysis and interpretation. More than sheer ignorance or malfeasance (although there may be some evidence of that too), I think the reproducibility crisis illustrates how pervasive pressure to produce statistical evidence has combined with uneven professional standards can lead to dodgy research.

Our capacity to gather data and apply inferential statistics may have gotten ahead of our collective ability to manage these resources skillfully. In academia, this might lead to publications with spurious findings. In other kinds of environments, it might lead to decisions based on evidence of questionable quality. In both cases organizational resource constraints and communication challenges shape whether, where, and how well data science and statistics get done.

A slightly long story illustrates how this can play out in a non-academic environment, specifically a fairly small technology company. I share the story as a cautionary tale that can hopefully provoke some useful reflection about how we (people who care about evidence-based decision making, data science, statistics, and applied social science) can improve our work. I have de-identified the organization and the individuals involved because this is really not about them per se. The challenges they face are common. I think the story can tell us something interesting about those challenges.

Within the organization, several teams conduct experiments, user tests, and other sorts of data-intensive, social scientific research. One of these teams had reached out because they had some questions about methods of analysis. Within the organization, this particular team had gotten positive feedback for their adoption of a data-driven pipeline of A/B testing, but there were concerns about whether the testing was being done well. I went to visit them planning to do a little bit of informal statistical consulting and to learn more about that part of the organization.

A few team members walked me through a typical field experiment with multiple (about 10) treatment conditions. Everything runs on a small stack of custom scripts that pulled summary data from the platform’s databases. The team uses spreadsheets to record the number of individuals assigned to each condition along with the number of “successful” trials (e.g., cases where an end-user has the desired response to a given design change).

The team then enters the raw summary information into an open source web-based tool called ABBA that runs some calculations and reports a “success rate” (a smoothed percentage) for each trial, a raw and percentage-based confidence interval for the success rate, and a p-value (based on a binomial cumulative distribution function or a normal approximation for large samples). ABBA also presents a handy little visualization plotting the interval estimated for each experimental condition along a bar colored either gray (not different from control), red (lower success rate than control), or green (higher success rate than control) depending on the results of the corresponding hypothesis test. I’ve included a screenshot of what this looks like at the top of the post and you can try it yourself.

Those of you with a statistical background following me into the weeds here might be nodding and thinking “okay, sounds maybe not ideal, but reasonable enough.” While the system puts too much faith in p-values, it follows a pretty standard approach. It’s also a great example of the kind of statistics-as-a-service approach to A/B testing that many organizations have adopted in response to various pressures to be more data driven.

That’s when things started to get weird. As we spoke more, it turned out that the ways members of the team conduct the tests, enter the data, and interpret the results raise major red flags.

For example, they regularly update the number of experimental conditions on-the-fly, dropping old conditions and adding new conditions when others already had thousands of observations (ABBA makes this super easy!).

When experimental conditions are dropped or added, the team routinely re-computes statistical tests and p-values with/without the new/old observations included. Mostly, conditions that do not seem to produce different outcomes from the control were silently removed from the analysis.

For some of the analysis itself, the team uses parametric tests that assume normal distributions on heavily skewed data.

Then, when it comes time to interpret the results, the analysts use the relative magnitude of p-values as an estimate of the magnitude of conditional effect sizes.

At this point, those of you with relevant training in applied statistics, experimental research methods, data science, etc. might be scratching your heads or experiencing full-on panic.

Separately, each of these steps are inferential howlers capable of invalidating results. Together, they render whatever results were coming out of this process untrustworthy in the extreme.

For the rest of the meeting, I did my best to identify a series of steps the team could take to avoid the problems above. But I still walked away disconcerted. This was a technically sophisticated organization with plenty of resources. The team was using a pretty well-designed tool for analyzing experimental data. They had gotten critical feedback on the work they were doing. How did a situation like this happen?

The individuals on the team were doing their best. Nobody is born with deep knowledge of applied statistics. Confronted with a challenging mandate from their supervisors, these people were all doing their absolute best to apply some tools they didn’t fully understand to solve a practical problem. They had generally been told that their work was good, knew they had some issues to fix, and reached out to someone with more knowledge (in this case me) for help.

What about the tools? Can we at least blame the tools? As I mentioned earlier, a bunch of companies are in the business of providing “statistics-as-a-service” or A/B testing platforms, but I’m not convinced that these are the root of the problem either. Sure, ABBA makes some mistakes a little too easy, but the tool was also built and shared by skilled data scientists who painstakingly documented everything before distributing it on GitHub. Their documentation is why I was able to sort out exactly what was happening in the first place and help the team members understand some of the issues involved. Indeed, nothing seems obviously or fundamentally wrong with the implementation of the underlying software or the statistical tests. Instead, the misuse of the system happened despite the software designers’ best efforts.

Here we get into one problem area: the incentives to produce specific kinds of outcomes. The team using the tool needed to run experiments and interpret them as decisive “wins” or “losses.” The reality was much less clear and, in this way, the p-values obscured some of that ambiguity. Imposing a dichotomous logic on experimental evidence is often impossible and will, even under the best conditions, lead to systematic abuses of statistical reasoning.

What about the organizational leadership then? Shouldn’t they be responsible for making sure that the company does high quality data science? On the one hand yes, and on the other hand, this is hard too and understandable problems arise. Executives and managers often lack the requisite statistical expertise to evaluate operations like this in a rigorous way. They have heard, through professional networks, industry publications, media, etc., that more data and more A/B tests are Good Things for their organization. At a certain point, they cannot do the auditing of experimental procedures and inference themselves.

Shouldn’t the managers just make sure someone else can audit the statistics then? This is probably where the most important breakdowns occurred. Turns out that other staff possess all the skills to diagnose and repair the issues I identified (and more). One of these people had even been assigned to work with the team in question for a while! However, that assignment had ended during a restructuring and statistical expertise had never returned to the team. In the meantime, managers continued to demand results without fully appreciating that the existing approach had deep problems.

So given this particular mix of data and organizational sciences gone awry, what lessons can we learn?

The future of data-intensive social science remains, as William Gibson might say, unevenly distributed. As the infrastructure for data collection and analysis has become more widely accessible, the choke-point in many organizations has become the dissemination of deeper knowledge of the techniques necessary to produce valid, reliable inference. These inequalities emerge both within and between organizations. Some companies and some teams have more expertise than others. Some have more effective systems for feedback and improvement than others.

In this sense, organizational (not just technical or statistical) obstacles stand in the way of more effective, accountable, and transparent uses of evidence to make decisions. Web-scale organizations can run 100,000 randomized trials and analyze the results very quickly. The results can look real and have p-values attached and the executives can believe that they have got the whole data science thing nailed down. However, the analysis might not mean much unless it is implemented skillfully.

The inundation of behavioral trace data does not guarantee that we will be similarly inundated by reliable findings, valid inference, or skilled implementation. High quality research design and interpretation may not scale so easily as the data or the analysis tools.

All of this has distributive implications. Organizations with access to the best social scientific knowledge as well as the organizational capacity to deploy and harness that knowledge will be the ones most likely to reap benefits from it. Others, such as many public administrations in the U.S. (especially those that deliver social services), smaller firms, non-profits, and community organizations will likely get inferior inference (to the extent they get any at all).

It takes time and effort to build organizational resources and cultures capable of supporting widespread, high quality, data-driven inference. Some recent work in HCI and related fields speaks to these issues. For example, some folks at CU Boulder have a 2017 CHI paper about how mission-driven organizations can struggle to do data-driven work. In a more interventionist vein, Catherine D’Ignazio and Rahul Barghava have launched the Data Culture Project in an effort to help smaller non-profits and community organizations use data more effectively.

Whatever the organizational context, high quality social scientific and statistical work requires more than just a clear understanding of p-values and massive A/B testing infrastructure. Statistical expertise also needs to be embedded and managed effectively within organizations and teams in order to produce reliable inference.

This is a cross-post from the CASBS Medium channel. Thanks to members of the CDSC, Margaret Levi, and some anonymous friends for feedback on earlier versions of the text.

January 22, 2018

Introducing Computational Methods to Social Media Scientists

The ubiquity of large-scale data and improvements in computational hardware and algorithms have provided enabled researchers to apply computational approaches to the study of human behavior. One of the richest contexts for this kind of work is social media datasets like Facebook, Twitter, and Reddit.

We were invited by Jean Burgess, Alice Marwick, and Thomas Poell to write a chapter about computational methods for the Sage Handbook of Social Media. Rather than simply listing what sorts of computational research has been done with social media data, we decided to use the chapter to both introduce a few computational methods and to use those methods in order to analyze the field of social media research.

A “hairball” diagram from the chapter illustrating how research on social media clusters into distinct citation network neighborhoods.

Explanations and Examples

In the chapter, we start by describing the process of obtaining data from web APIs and use as a case study our process for obtaining bibliographic data about social media publications from Elsevier’s Scopus API. We follow this same strategy in discussing social network analysis, topic modeling, and prediction. For each, we discuss some of the benefits and drawbacks of the approach and then provide an example analysis using the bibliographic data.

We think that our analyses provide some interesting insight into the emerging field of social media research. For example, we found that social network analysis and computer science drove much of the early research, while recently consumer analysis and health research have become more prominent.

More importantly though, we hope that the chapter provides an accessible introduction to computational social science and encourages more social scientists to incorporate computational methods in their work, either by gaining computational skills themselves or by partnering with more technical colleagues. While there are dangers and downsides (some of which we discuss in the chapter), we see the use of computational tools as one of the most important and exciting developments in the social sciences.

Steal this paper!

One of the great benefits of computational methods is their transparency and their reproducibility. The entire process—from data collection to data processing to data analysis—can often be made accessible to others. This has both scientific benefits and pedagogical benefits.

To aid in the training of new computational social scientists, and as an example of the benefits of transparency, we worked to make our chapter pedagogically reproducible. We have created a permanent website for the chapter at https://communitydata.science/social-media-chapter/ and uploaded all the code, data, and material we used to produce the paper itself to an archive in the Harvard Dataverse.

Through our website, you can download all of the raw data that we used to create the paper, together with code and instructions for how to obtain, clean, process, and analyze the data. Our website walks through what we have found to be an efficient and useful workflow for doing computational research on large datasets. This workflow even includes the paper itself, which is written using LaTeX + knitr. These tools let changes to data or code propagate through the entire workflow and be reflected automatically in the paper itself.

If you use our chapter for teaching about computational methods—or if you find bugs or errors in our work—please let us know! We want this chapter to be a useful resource, will happily consider any changes, and have even created a git repository to help with managing these changes!

January 15, 2018

OpenSym 2017 Program Postmortem

The International Symposium on Open Collaboration (OpenSym, formerly WikiSym) is the premier academic venue exclusively focused on scholarly research into open collaboration. OpenSym is an ACM conference which means that, like conferences in computer science, it’s really more like a journal that gets published once a year than it is like most social science conferences. The “journal”, in this case, is called the Proceedings of the International Symposium on Open Collaboration and it consists of final copies of papers which are typically also presented at the conference. Like journal articles, papers that are published in the proceedings are not typically published elsewhere.

Along with Claudia Müller-Birn from the Freie Universtät Berlin, I served as the Program Chair for OpenSym 2017. For the social scientists reading this, the role of program chair is similar to being an editor for a journal. My job was not to organize keynotes or logistics at the conference—that is the job of the General Chair. Indeed, in the end I didn’t even attend the conference! Along with Claudia, my role as Program Chair was to recruit submissions, recruit reviewers, coordinate and manage the review process, make final decisions on papers, and ensure that everything makes it into the published proceedings in good shape.

In OpenSym 2017, we made several changes to the way the conference has been run:

In previous years, OpenSym had tracks on topics like free/open source software, wikis, open innovation, open education, and so on. In 2017, we used a single track model.
Because we eliminated tracks, we also eliminated track-level chairs. Instead, we appointed Associate Chairs or ACs.
We eliminated page limits and the distinction between full papers and notes.
We allowed authors to write rebuttals before reviews were finalized. Reviewers and ACs were allowed to modify their reviews and decisions based on rebuttals.
To assist in assigning papers to ACs and reviewers, we made extensive use of bidding. This means we had to recruit the pool of reviewers before papers were submitted.

Although each of these things have been tried in other conferences, or even piloted within individual tracks in OpenSym, all were new to OpenSym in general.

Overview

Statistics
Papers submitted	44
Papers accepted	20
Acceptance rate	45%
Posters submitted	2
Posters presented	9
Associate Chairs	8
PC Members	59
Authors	108
Author countries	20

The program was similar in size to the ones in the last 2-3 years in terms of the number of submissions. OpenSym is a small but mature and stable venue for research on open collaboration. This year was also similar, although slightly more competitive, in terms of the conference acceptance rate (45%—it had been slightly above 50% in previous years).

As in recent years, there were more posters presented than submitted because the PC found that some rejected work, although not ready to be published in the proceedings, was promising and advanced enough to be presented as a poster at the conference. Authors of posters submitted 4-page extended abstracts for their projects which were published in a “Companion to the Proceedings.”

Topics

Over the years, OpenSym has established a clear set of niches. Although we eliminated tracks, we asked authors to choose from a set of categories when submitting their work. These categories are similar to the tracks at OpenSym 2016. Interestingly, a number of authors selected more than one category. This would have led to difficult decisions in the old track-based system.

distribution of papers across topics with breakdown by accept/poster/reject

The figure above shows a breakdown of papers in terms of these categories as well as indicators of how many papers in each group were accepted. Papers in multiple categories are counted multiple times. Research on FLOSS and Wikimedia/Wikipedia continue to make up a sizable chunk of OpenSym’s submissions and publications. That said, these now make up a minority of total submissions. Although Wikipedia and Wikimedia research made up a smaller proportion of the submission pool, it was accepted at a higher rate. Also notable is the fact that 2017 saw an uptick in the number of papers on open innovation. I suspect this was due, at least in part, to work by the General Chair Lorraine Morgan’s involvement (she specializes in that area). Somewhat surprisingly to me, we had a number of submission about Bitcoin and blockchains. These are natural areas of growth for OpenSym but have never been a big part of work in our community in the past.

Scores and Reviews

As in previous years, review was single blind in that reviewers’ identities are hidden but authors identities are not. Each paper received between 3 and 4 reviews plus a metareview by the Associate Chair assigned to the paper. All papers received 3 reviews but ACs were encouraged to call in a 4th reviewer at any point in the process. In addition to the text of the reviews, we used a -3 to +3 scoring system where papers that are seen as borderline will be scored as 0. Reviewers scored papers using full-point increments.

scores for each paper submitted to opensym 2017: average, distribution, etc

The figure above shows scores for each paper submitted. The vertical grey lines reflect the distribution of scores where the minimum and maximum scores for each paper are the ends of the lines. The colored dots show the arithmetic mean for each score (unweighted by reviewer confidence). Colors show whether the papers were accepted, rejected, or presented as a poster. It’s important to keep in mind that two papers were submitted as posters.

Although Associate Chairs made the final decisions on a case-by-case basis, every paper that had an average score of less than 0 (the horizontal orange line) was rejected or presented as a poster and most (but not all) papers with positive average scores were accepted. Although a positive average score seemed to be a requirement for publication, negative individual scores weren’t necessary showstoppers. We accepted 6 papers with at least one negative score. We ultimately accepted 20 papers—45% of those submitted.

Rebuttals

This was the first time that OpenSym used a rebuttal or author response and we are thrilled with how it went. Although they were entirely optional, almost every team of authors used it! Authors of 40 of our 46 submissions (87%!) submitted rebuttals.

Lower	Unchanged	Higher
6	24	10

The table above shows how average scores changed after authors submitted rebuttals. The table shows that rebuttals’ effect was typically neutral or positive. Most average scores stayed the same but nearly two times as many average scores increased as decreased in the post-rebuttal period. We hope that this made the process feel more fair for authors and I feel, having read them all, that it led to improvements in the quality of final papers.

Page Lengths

In previous years, OpenSym followed most other venues in computer science by allowing submission of two kinds of papers: full papers which could be up to 10 pages long and short papers which could be up to 4. Following some other conferences, we eliminated page limits altogether. This is the text we used in the OpenSym 2017 CFP:

There is no minimum or maximum length for submitted papers. Rather, reviewers will be instructed to weigh the contribution of a paper relative to its length. Papers should report research thoroughly but succinctly: brevity is a virtue. A typical length of a “long research paper” is 10 pages (formerly the maximum length limit and the limit on OpenSym tracks), but may be shorter if the contribution can be described and supported in fewer pages— shorter, more focused papers (called “short research papers” previously) are encouraged and will be reviewed like any other paper. While we will review papers longer than 10 pages, the contribution must warrant the extra length. Reviewers will be instructed to reject papers whose length is incommensurate with the size of their contribution.

The following graph shows the distribution of page lengths across papers in our final program.

histogram of paper lengths for final accepted papers In the end 3 of 20 published papers (15%) were over 10 pages. More surprisingly, 11 of the accepted papers (55%) were below the old 10-page limit. Fears that some have expressed that page limits are the only thing keeping OpenSym from publshing enormous rambling manuscripts seems to be unwarranted—at least so far.

Bidding

Although, I won’t post any analysis or graphs, bidding worked well. With only two exceptions, every single assigned review was to someone who had bid “yes” or “maybe” for the paper in question and the vast majority went to people that had bid “yes.” However, this comes with one major proviso: people that did not bid at all were marked as “maybe” for every single paper.

Given a reviewer pool whose diversity of expertise matches that in your pool of authors, bidding works fantastically. But everybody needs to bid. The only problems with reviewers we had were with people that had failed to bid. It might be reviewers who don’t bid are less committed to the conference, more overextended, more likely to drop things in general, etc. It might also be that reviewers who fail to bid get poor matches which cause them to become less interested, willing, or able to do their reviews well and on time.

Having used bidding twice as chair or track-chair, my sense is that bidding is a fantastic thing to incorporate into any conference review process. The major limitations are that you need to build a program committee (PC) before the conference (rather than finding the perfect reviewers for specific papers) and you have to find ways to incentivize or communicate the importance of getting your PC members to bid.

Conclusions

The final results were a fantastic collection of published papers. Of course, it couldn’t have been possible without the huge collection of conference chairs, associate chairs, program committee members, external reviewers, and staff supporters.

Although we tried quite a lot of new things, my sense is that nothing we changed made things worse and many changes made things smoother or better. Although I’m not directly involved in organizing OpenSym 2018, I am on the OpenSym steering committee. My sense is that most of the changes we made are going to be carried over this year.

Finally, it’s also been announced that OpenSym 2018 will be in Paris on August 22-24. The call for papers should be out soon and the OpenSym 2018 paper deadline has already been announced as March 15, 2018. You should consider submitting! I hope to see you in Paris!

This Analysis

OpenSym used the gratis version of EasyChair to manage the conference which doesn’t allow chairs to export data. As a result, data used in this this postmortem was scraped from EasyChair using two Python scripts. Numbers and graphs were created using a knitr file that combines R visualization and analysis code with markdown to create the HTML directly from the datasets. I’ve made all the code I used to produce this analysis available in this git repository. I hope someone else finds it useful. Because the data contains sensitive information on the review process, I’m not publishing the data.

November 15, 2017

Introduction to R workshop

I recently taught a two-session workshop introducing R to Kellogg MBA students. I had a few goals for the workshops:

Convince students of the benefits of using text-based programming for data exploration and analysis
Introduce basic programming concepts (e.g., variables, functions)
Give students a basic understanding of how to do some fundamental data analysis tasks in R: importing, cleaning, visualizing, and modeling

Those are really big goals for only four hours. I decided to use the tidyverse as much as possible and not even teach base R syntax like ‘[,]’, apply, etc. I used the first session to show and explain code using the nycflights13 dataset. For the the second session we did a few more examples but mostly worked on exercises using a dataset from Wikia that I created (with help from Mako and Aaron Halfaker‘s code and data).

Retrospection

Overall, I think that the workshops went pretty well. I think that students definitely have a better understanding and a better set of tools than I did after I had used R for four hours!

That being said, there was plenty of room for improvement. I am scheduled to teach another set of workshops early next year and I’m planning to make a few changes:

Make both of the workshops more hands-on and interactive. I think I’ll divide the topics covered: the first workshop will be on importing, cleaning, and grouping data and the second will be on visualizing and creating inferential models.
Get more help – teaching non-programmers R requires some hand-holding and individual attention. To be successful, I think a workshop like this requires 1 “TA” for every 8-10 students.
Find a more relevant dataset. Although I actually learned a few things about my dataset that will help with my papers that use it, I think it would be better to have a dataset that is as similar as possible to what students will be working with in their careers.
Connect the visualization and regression more directly to a specific analysis problem rather than as syntax-learning exercises.

Reuse this workshop!

I found some pretty good resources already in existence for introducing students to R, but none of them quite fit the scope of what I was looking for. All of the code that I used (as well as some slides for the beginning of class) are on github and GPL licensed. Please reuse my work and submit pull requests!

October 30, 2017January 15, 2018

Peer production between real utopia and naive Coaseanism?

Over at Crooked Timber, Henry Farrell and others recently held a book seminar to discuss Cory Doctorow’s Walkaway. The symposium led to an extended discussion between Henry, Cory, Henry again, and Yochai Benkler about Benkler’s early work on commons-based peer production, spaces of resistance in the contemporary information economy, and the state of peer production a little over fifteen years since Benkler introduced the term. This (far too long) post summarizes some of their key points as a way of starting to collect my own thoughts on these questions.

I haven’t read Walkaway yet (downloaded my DRM-free digital copy, but the fiction slot in my brain is currently occupied by Philip Pullman’s totally engrossing La Belle Sauvage), but I can’t wait to get to it. Cory says the book started as an exercise in projecting how the sociotechnical transformations Benkler laid out in Coase’s Penguin might facilitate the spread of utopian energies at the periphery of radically unequal societies not so different from our own:

It’s been 15 years since Benkler made the connection between “commons-based peer-production” and Coase…

Down and Out in the Magic Kingdom projected Slashdot karma and Napster superdistribution across a whole society as a way of illuminating the strengths and weaknesses of both. Walkaway tries to do the same with commons-based peer-production: what would a skyscraper look like if it was a Wikipedia-style project? How about a space program?

As a Coasean tale, Walkaway is one the battleground between the technological, Promethean left—which has promised to lift peasants up to the material comfort of lords—and the de-growth green left, which promises to bring lords down to the level of the peasants in the name of saving the planet.

and later:

This is (in my view) a Utopian vision. It supposes that the Bohemian projects that even the most buttoned-down societies allow at their margins can breed real discontent and nurture and sustain it into something that genuinely challenges its host… They provided real-world lessons on which tactics worked and where the weaknesses were. They were battles, not the war. The only thing more extraordinary than a social justice prevailing at all is for it to prevail on its first outing, or second, or third.

In his contribution to the seminar, Henry points to Cory’s assumption that “exit” (in Hircshman’s sense) remains viable in a society pervaded by vast power inequalities, surveillance capabilities, and an (increasingly weaponized) disregard for privacy:

Again, Doctorow’s book isn’t an exercise in predictive science – he’s not saying that things will be so. But he is saying, I think, that things could and should be so, or sort-of so. Walkaway is quite unashamedly a didactic book in the way that earlier books such as Homeland were didactic – he has a very clear message to get across. In conversations with Steve Berlin Johnson years ago, I came up with the term BoingBoing Socialism to refer to a specific set of ideas associated with Doctorow and the people around him – that free exchange of ideas unimpeded by intellectual property law and the like, together with transformative technologies of manufacture, could open up a path towards a radically egalitarian future. Unless I’m seriously mistaken (in which case I’m sure that Doctorow will tell me), Walkaway wants to do two things – to argue for why such a future might be attractive, and to suggest that something like this future could be feasible.

For Henry, the implications boil down to questions of power and the role powerful entities play in shaping the lives of even the most peripheral, socially excluded groups within a society. He also (later on) expresses skepticism at the political prospects of the revolutionary vision of “BoingBoing Socialism” that adopts a rhetoric of contingency and self-marginalization as its platform for change.

In a followup post, Henry elaborates a claim that Benkler engaged in a sort of naive Coasean disregard for power relations when he laid out the definitional statements on peer production. Henry says Benkler emphasized transaction cost and efficiency-centric explanations for the potential of peer production to substitute for firm or market-based modes of knowledge production and exchange:

Power relationships often explain who gets what, and which forms of organization are taken up, and which fall by the wayside. In general, forms of production that are (a) more efficient, but (b) inconvenient or unprofitable for powerful actors, are probably not going to be taken up, since those powerful actors will block them. Yet if one starts from an efficiency perspective, it is very hard to build power relations in, since one believes that change in practices and institutions is not driven by power relations but by efficiency.

and later:

What this means, if you take it seriously, is that Coaseian coordination is a special case of bargaining. Broadly speaking, Coaseian processes will lead to efficient outcomes only under very specific circumstances – when the actors have symmetrical breakdown values, as in the first game, so that neither of them is able to prevail over the other. More simply put, the Coase transaction cost account of how efficient institutions emerge will only work when all actors are more or less equally powerful. Under these conditions, it is perfectly alright to assume as Coase (and Benkler by extension) do, that efficiency considerations rather than power relations will drive change. In contrast, where there are significant differences of power, actors will converge on the institutions that reflect the preferences of powerful actors, even if those institutions are not the most efficient possible.

and finally:

In short – we need to distinguish between the rhetorical claims that technological change will bring openness along with it, and the (far more sustainable) claim that technology will probably only have openness enhancing benefits in a world where we are already dealing with the underlying power relations.

Benkler responds that Farrell is right to question his (Benkler’s) approach to power, but wrong in that the failure of his (Benkler’s) arguments in Coase’s Penguin and The Wealth of Networks is not driven by naive Coaseanism, but a different dimension of power entirely:

My primary mistake in my work fifteen years ago, and even ten, was not ignoring the role of power in shaping market patterns, but in understating the extent to which the new “market actors who will build the tools that make this population better able…” will themselves become the new incumbent market actors who will shape the environment to increase and lock-in their power. That is certainly a mistake in reading the landscape of power grabs, and I have tried to correct over the intervening years, most recently by offering a map of what has developed in the past decade…

In other words, today’s Benkler argues that yesterday’s Benkler underestimated the adaptive capacities of various incumbent powers as well as the way that a continuously shifting technical, regulatory, and political environment would alter the landscape along the way.

All of this speaks to an ongoing conversation Mako and I have been having about the past, present, and future of peer production. A pessimistic account might run like this: peer production thrived from ~1995-2008 in part because incumbent firms and private actors had not figured out how to capitalize on the possibilities for community-based provision of resources unlocked by the diffusion of digitally networked communications infrastructure. Now that increasing numbers of firms have done so, there is no going back. Large firms as well as their venture-funded spawn will continue to eat peer production communities’ lunch, undermining their viability as well as their autonomy. Peer production as we know it will eventually disappear, becoming a curious relic of a more naive era when the electronic frontier remained an unsettled, experimental space.

Another possibility, arguably more optimistic, can be seen in Benkler and Doctorow’s contributions to this exchange. Rather than consigning peer production to the dustbin of history, they both suggest that room for maneuver (or “degrees of freedom” in Benkler’s terms) will remain at the margins of the networked information economy and that communities of “walkaways” may persist in experimenting with “real utopian” autonomous alternatives to the more extractive, winner-take-all models of “supercapitalist” knowledge production and exchange. Doctorow’s fiction seems to explore the (hopeful) potential of these walkaway communities to generate radical, systematic transformation. Benkler, in his more recent writings, holds out some hope, but of a highly contingent, tenuous, and circumscribed sort.

The original posts are worth a read.

October 13, 2017October 11, 2017

Ants!

I recently read Deborah M. Gordon’s Ant Encounters and thought I’d summarize some thoughts about it. Gordon is a Professor of Biology at Stanford. The book pulls together several decades of research (hers and others’) on the behavior and ecology of ants. In it, Gordon makes nuanced claims about the importance of communication and interaction for distributed collective behavior in clear, non-technical language. Many of the findings should inspire people (like me) interested in understanding the organization of collective behavior in humans.

Gordon argues that ant behavior and colony dynamics encompass a complex system driven by patterns of interactions, information exchange, and environmental influences. She contrasts this with more deterministic accounts of ants prevalent in earlier scientific literature and popular culture. Gordon emphasizes how ants operate by behavioral heuristics and information processing rather than a fixed set of rules or genetically encoded traits.

Consider the division of labor within an ant colony. The prevailing (wrong) view depicts ants born into a pre-specified, genetically determined “caste” which has a clearly-defined task within a hierarchically structured colony. Following this story, the Queen of the colony births out larva who grow into task-specialized sterile adults. Individuals within each caste supposedly possess physical traits that support their specialization as foragers, trash removers, larva-tenders, patrollers, or whatever. Each individual supposedly pursues their specialized task tirelessly until death.

It turns out that this account reflects a mixture of reasonable misinterpretation and fantastical thinking. First off, Gordon notes, ants change tasks within their life course. Today’s larva-tender may be tomorrow’s forager. These changes do not entail biological changes within each ant (although there seems to be evidence that ants do tend to adopt specific tasks at specific stages of their lives within a colony), but instead reflect responses to interactions with other members of the colony and external forces shaping those interactions. In a younger, less populous colony, ants may change tasks in response to immediate needs and threats that arise suddenly. In larger, more mature colonies where things are less likely to change suddenly, many ants may have more stable activities. Some ants in large colonies even literally sit around doing nothing because the information they receive from their nest-mates indicates that the colonies needs are being met. None of this is fixed by genetic encoding or hierarchical commands.

Second, Gordon shows how ants respond probabilistically to local stimuli. Individual ants, it turns out, act a lot like heuristic distributed sensors or nodes in a communications network—each with some likelihood of changing its behavior depending on the feedback it receives from its environment. They are not automatons with deterministic programming to pursue a single-minded course of action.

Third, Gordon shows how colonies as a whole change in reaction to their environments and collective interactions. If one colony finds itself in proximity to another, the individuals within it may alter how much collective effort is dedicated to specific tasks depending on the species, size, and temperament of its neighbors. Individual ants respond to the number of nest-mates and neighbors they encounter. If their last ten encounters were with foragers from their home nest returning with food to feed the larval brood, they may continue to go about their business uninterrupted. As the portion of recent interactions includes outsiders or nest-mates responding frantically to an unwelcome intruder of some sort, the probability rises that the next ant will change its behavior in response (maybe to start running around in a panic or bite an intruder).

A picture of harvester ants — Harvester ants collecting seeds (cc-by-sa Donkey Shot, Wikimedia Commons)

Through many examples, Gordon conveys how patterns of collective ant behavior emerge and adapt to local circumstances without a centralized coordination mechanism or hierarchy of control. She describes this almost entirely without recourse to the jargon of complexity theory or complex systems research.

A concrete, measured, and example-driven account of how actually existing complex systems work is maybe the most impressive achievement of the book. Many texts discuss complexity in human and ecological systems, but none that I have read do so with the clarity of Ant Encounters. While I should read more books on these topics, more people in my little corner of the research world should read Gordon’s work too.

Ant Encounters ultimately left me excited to pursue some of the potential extensions and connections between Gordon’s work and research on human social systems and organizations. For example, I’d love to follow up on her comment that higher interaction frequency is associated with colony growth or survival (I currently forget which). Would such a finding hold up in the context of human organizations? If so, what would it look like and mean in the context of building effective peer production systems? Gordon has also written elsewhere about some of the potential connections between ant behavior, human organization, communication protocols. Recent findings from Gordon and her collaborators show how ants follow a set of behavior protocols very similar to those encoded in the TCP specification (apparently, she likes to refer to this idea as “the Anternet“). I’m eager to read more of the scientific publications from Gordon and her collaborators to understand these ideas more deeply and to see how well they travel when applied to a species I know a little bit more about.