Women in Data Science Puget Sound is part of a 50+-country conference series founded and organized in cooperation with Stanford University’s Data Science coalition. Anyone may attend, regardless of gender: events feature a speaker lineup composed of women in data science. The Puget Sound event is Tuesday, April 25 at the Expedia HQ in Seattle, and numerous affiliated regional and online events are scheduled in the coming weeks.
If you’re in the Seattle area, you might like to catch CDSC member Kaylea presenting a workshop! Here’s the pitch for attending her beginner-friendly session:
Let’s Re-think Political Bias & Build Our Own Classifier
How can we think about political bias without falling into assumptions about who's on what side and what that means?
Data science and ML offer us an alternative: we can parse political speech about a topic and use NLP/ML techniques to classify articles we scrape from the web.
In this hands-on workshop, we'll parse the Congressional Record, build a classifier, scrape search results, and analyze texts. You'll walk away with your own example of how to use data science to analyze political framing.
The full lineup of speakers for the Puget Sound conference is posted here. Tickets for the single-day event are $80 (see this link to request a discount code for half off).
Topics on the schedule for this event look juicy if quant work is your jam: AI, BERT, hypergraphs, visualization, forecasting, quantum computing, causal inference, survival analysis, writing better code and career management, with examples ranging from search, sales, and supply chain to economic disparity, DNA sequencing and saving wildlife!
In The Wealth Of Networks, Yochai Benkler describes the opportunities and decisions presented by networked forms of production. Writing in the mid-2000s, Benkler describes a wide range of future policy battlegrounds: copyrights and patents, common carrier infrastructure, the accessibility of the public sphere, and the verification of information.
Benkler predicts: “How these battles turn out over the next decade or so will likely have a significant effect on how we come to know what is going on in the world we occupy, and to what extent and in what forms we will be able…to affect how we and others see the world as it is and as it might be.”
Benkler uses two simple search examples, reporting the results of searching for “Viking ship” and “Barbie”. He finds that enthusiastic individuals and independent voices dominate the content we see on the web and that various search engines construct meaning in varying ways. I repeat his examples (searches conducted 7/3/2018 and 12/1/2022, from my home near Seattle, WA and using my personal laptop).
So how do ‘we come to know what is going on in the world we occupy’? Who creates what we see online? And what implications does that have for our own freedom to shape the world? The short version of the answer to this question seems to be: if there was a battle, it’s over now and the wreckage has disappeared; individuals and independent voices are marginalized and commercial content is dominant — and this picture does not vary among search engines.
I used the same search engine (Google) and the same term (Viking Ship): what I see is that the individual hobbyists Benkler saw in 2006 are eclipsed by institutions. The materials on the current sites sound similar to those Benkler saw – photos, replicas, and scholarly information, as well as links and learning materials – but the production is generally institutional and formal in contrast to the individual and informal sources Benkler reports.
One other shift: in 2022, simply listing links in order is not sufficient to report what searchers see. Search results are interspersed with many other features: a widget with “sources from across the web”, an images display with associated keywords, a “People also ask” widget, and a related searches widget; to reach the 9th “result” in the classic sense, I have to browse to the second page of results.
When I follow Benkler’s lead and search for ‘Barbie’ using three different search engines, the results are even more different from 2006. Benkler describes differences in search engine results as revealing different possibilities – via Google, Barbie was portrayed as “a culturally contested figure”, whereas on Overture (a now-defunct shopping-oriented search engine), the searcher encountered “a commodity toy.”
By contrast, my 2018 search via the then-current top 3 search engines, inclusive of widgets and other features, revealed:
The top search engines in 2022 are the same three firms, although I observe that some sources suggest DuckDuckGo, Baidu (Chinese language only) and Yandex (Russian) belong in a top 5; other sources treat YouTube and Amazon as “top search engines” although they are not actually search engines. My 2022 search, inclusive of widgets and other features, revealed:
The modern Barbie searcher encounters primarily a multiplatform brand, with some hints of cultural constructions. In 2018 this took the form of extreme plastic surgery and brand-friendly fan fiction, in 2022 weight loss and fan TikTok. To whatever degree search engine algorithms continue to give weight to alternate voices in this case, they are largely drowned out by the volume of the commercial voice: the meaning of a search query for the single term “Barbie” has been substantially narrowed since Benkler’s time, and perhaps has narrowed even further in the last four and a half years.
The web in 2006 was indeed a different place, and I have commented on additional dimensions of analysis not present in Wealth: embedding of visual and social media content, and the widgetizing of content. In 2018, these visual components were less dominant: a stripe of Viking Ship images and a stripe of Barbie videos. In 2022 search, the page can scarcely be described without them.
We can now answer Benkler’s challenge: how did “these battles” over the last decade and a half “turn out”?
How do we “come to know what is going on in the world we occupy”?
How are we able “to affect how we and others see the world as it is and as it might be”?
The answer seems to be, it’s unclear to what degree there was a battle at all: collectives have triumphed over individuals on the Web insofar as search engines represent it. These collectives are generally firms, although some formal institutions are also present: news media, Wikipedia, and (in the case of Viking Ship) museums.
The implications of our search environment are significant, and underscore the necessity of efforts to archive and capture the search landscape as it appeared. The role of platforms and institutions in constructing our understanding of the world should be of key concern in information and communication sciences.
For civil society groups, these results suggest alienation: the commercializing of the web has been accompanied by a narrowing of outlets for individual expression and critique, with Wikipedia and its community co-construction of knowledge a vital bright spot. For journalists, these results suggest the vital role of cultural reporting. For firms, the challenge is one of authenticity and connection: to the extent that the web has become a broadcast medium focused on official paid messaging, the opportunity to engage with consumers is lost, and along with it a spark for innovation. Search platforms benefit in the mean time, as jockeying for ad positioning between manufacturers and retailers drives revenue, at least until commercialism turns consumer attention elsewhere.
CDSC members Molly deBlanc and Kaylea Champion will be presenting at this year’s Aaron Swartz Day and International Hackathon. Molly will speak at 2:50 p.m. Pacific (talk title: My (Extended) Body, My Choice). Kaylea will speak at 3:15 p.m. Pacific (talk title: The Value of Anonymity: Evidence from Wikipedia). Registration and live stream details are available here: https://www.aaronswartzday.org/
If you’re attending ACM-CSCW this year, you are warmly invited to join CDSC members during our talks and other scheduled events. CSCW is not only virtual but spread across multiple weeks and offering sessions multiple times to accommodate timezones. We hope to see you there — we are eager to discuss our work with you!
CDSC members and affiliates are involved in CSCW beyond these public presentations. Nicholas Vincent, Sohyeon Hwang, and Sneha Narayan are part of the organizing team for the “Ethical Tensions, Norms, and Directions in the Extraction of Online Volunteer Work” workshop, where Molly de Blanc is scheduled to present and Kaylea Champion will be giving a lightning talk. Katherina Kloppenborg and Kaylea Champion are presenting in the Doctoral Consortium.
Although we might not notice it, much of the technology we rely on, from cell phones to cloud servers, is fueled by decades of effort by volunteers who create innovative software as well as the organizations necessary to sustain it. Despite this powerful legacy, we now are facing a crisis: not all of these critical components have been sufficiently maintained. Can we detect that an important software component is becoming neglected before major failures occur? Are these neglected packages just a matter of resources — old code and too few contributors — or can we see broader patterns that play a role, such as collaboration and organizational structures? Kaylea Champion has been working to answer these questions in her dissertation. As part of this work, she joined the software community metrics enthusiasts gathered at this year’s CHAOSSCon EU on September 12, 2022 as part of the Open Source Summit.
Kaylea’s presentation shares work in progress about the sources of underproduction, or when highly important packages see low quality development, in open software development. This presentation marks her second time at CHAOSSCon and builds on her work shared at last year’s conference in a lightning talk about detecting underproduction in Debian (see coverage of this work as presented to Debian folks here). Engaging with communities is a key part of this work: when we understand practitioner perspectives on underproduction and its causes, we can do science that supports taking immediate action. If you are interested in measuring the health of your collaborative community, let’s talk!
How should search engines be regulated? What are the implications of the EU Digital Services Act? If you are interested in technology policy, mark your calendar for an upcoming pre-conference virtual event: “Harms and Standards in Content Platform Governance” (October 13, 2022 at 5:00 a.m. PST, 8:00 a.m EST, 2:00 p.m. CEST). As part of the upcoming European Communication Research and Education conference, the Communication Law and Policy section has invited Kaylea Champion to present work she did with Benjamin Mako Hill and University of Washington students Jacinta Harshe, Isabella Brown, and Lucy Bao.
We examine the information landscape as manifested in search results during the Covid-19 pandemic using data we collected as part of the Covid-19 Digital Observatory Project. Our results provide evidence for the powerful ways that search engines shape our information environment–in terms of what information gets seen, the sources of that information, the market sectors that those sources operate within, and the partisan bias of those results.
This free event is oriented to connecting technology researchers and policymakers, and will include presentations of research from legal, communication, and critical perspectives.
SANER is primarily focused on software engineering practices, and several of the projects presented this year were of interest for social computing scholars. Here’s a quick rundown of presentations I particularly enjoyed:
Comparison of online help communities: This article by Mahshid Naghashzadeh, Amir Haghshenas, Ashkan Sami and David Lo compares two question/answer environments that we might imagine as competitors—the Matlab community of Stack Overflow versus the Matlab community hosted by Matlab. These sites have similar affordances and topics, however, the two sites seem to draw distinctly different types of questions. This article features an extensive hand-coded dataset by subject matter experts: How Do Users Answer MATLAB Questions on Q&A Sites? A Case Study on Stack Overflow and MathWorks[VIDEO]
Feedback: What goes wrong when software developers give one another feedback on their code? This study by a large team (Moataz Chouchen, Ali Ouni, Raula Gaikovina Kula, Dong Wang, Patanamon Thongtanunam, Mohamed Wiem Mkaouer and Kenichi Matsumoto) offers an ontology of the pitfalls and negative interactions that can occur during the popular code feedback practice known as code review: confused reviewers, divergent reviewers, low review participation, shallow review, and toxic review: Anti-patterns in Modern Code Review: Symptoms and Prevalence [VIDEO]
Critical Infrastructure: This study by Mahmoud Alfadel, Diego Elias Costa and Emad Shihab was focused on traits of security problems in Python and made some comparisons to npm. This got me thinking about different community-level factors (like bug release/security alert policies) that may influence underproduction. I also found myself wondering about inter-rater reliability for bug triage in communities like Python. The paper showed a very similar survival curve for bugs of varying severities, whereas my work in Debian showed distinct per-severity curves. One explanation for uniform resolution rate across severities could be high variability in how severity ratings are applied. Another factor worth considering may be the role of library abandonment: Empirical analysis of security vulnerabilities in python packages[VIDEO]
Critical software we all rely on can silently crumble away beneath us. Unfortunately, we often don’t find out software infrastructure is in poor condition until it is too late. Over the last year or so, I have been leading a project I announced earlier to measure software underproduction—a term I use to describe software that is low in quality but high in importance.
Underproduction reflects an important type of risk in widely used free/libre open source software (FLOSS) because participants often choose their own projects and tasks. Because FLOSS contributors work as volunteers and choose what they work on, important projects aren’t always the ones to which FLOSS developers devote the most attention. Even when developers want to work on important projects, relative neglect among important projects is often difficult for FLOSS contributors to see.
In the paper—coauthored with Benjamin Mako Hill—we describe a general approach for detecting “underproduced” software infrastructure that consists of five steps: (1) identifying a body of digital infrastructure (like a code repository); (2) identifying a measure of quality (like the time to takes to fix bugs); (3) identifying a measure of importance (like install base); (4) specifying a hypothesized relationship linking quality and importance if quality and importance are in perfect alignment; and (5) quantifying deviation from this theoretical baseline to find relative underproduction.
To show how our method works in practice, we applied the technique to an important collection of FLOSS infrastructure: 21,902 packages in the Debian GNU/Linux distribution. Although there are many ways to measure quality, we used a measure of how quickly Debian maintainers have historically dealt with 461,656 bugs that have been filed over the last three decades. To measure importance, we used data from Debian’s Popularity Contest opt-in survey. After some statistical machinations that are documented in our paper, the result was an estimate of relative underproduction for the 21,902 packages in Debian we looked at.
One of our key findings is that underproduction is very common in Debian. By our estimates, at least 4,327 packages in Debian are underproduced. As you can see in the list of the “most underproduced” packages—again, as estimated using just one more measure—many of the most at risk packages are associated with the desktop and windowing environments where there are many users but also many extremely tricky integration-related bugs.
We hope these results are useful to folks at Debian and the Debian QA team. We also hope that the basic method we’ve laid out is something that others will build off in other contexts and apply to other software repositories.
Paper Citation: Kaylea Champion and Benjamin Mako Hill. 2021. “Underproduction: An Approach for Measuring Risk in Open Source Software.” In Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2021). IEEE.
Contact Kaylea Champion (firstname.lastname@example.org) with any questions or if you are interested in following up.
Although Wikipedia is the encyclopedia that anybody can edit, not all edits are welcome. Wikipedia is subject to a constant deluge of vandalism. Random people on the Internet are constantly “blanking” Wikipedia articles by deleting their content, replacing the text of articles with random characters, inserting outlandish claims or insults, and so on. Although volunteer editors and bots do an excellent job of quickly reverting the damage, the cost in terms of volunteer time is real.
Why do people spend their time and energy vandalizing web pages? For readers of Wikipedia that encounter a page that has been marred or replaced with nonsense or a slur—and especially for all the Wikipedia contributors who spend their time fighting back the tide of vandalism by checking and reverting bad edits and maintaining the bots and systems that keep order—it’s easy to dismiss vandals as incomprehensible sociopaths.
In a paper I just published in the ACM International Conference on Social Media and Society, I systematically analyzed a dataset of Wikipedia vandalism in an effort to identify different types of Wikipedia vandalism and to explain how each can been seen as “rational” from the point of view of the vandal.
Leveraging a dataset we created in some of our other work, the study used a random sample of contributions drawn from four groups that vary in the degree to the editors in question can be identified by others in Wikipedia: established users with accounts, users with accounts making their first edits, users without accounts, and users of the Tor privacy tool. Tor users were of particular interest to me because the use of Tor offers concrete evidence that a contributor is deliberately seeking privacy. I compared the frequency of vandalism in each group, developed an ontology to categorize it, and tested the relationship between group membership and different types of vandalism.
I found that the group that had engaged in the least effort in order to edit—users without accounts—were the most likely to vandalize. Although privacy-seeking Tor contributors were not the most likely to vandalize, vandalism from Tor-based contributors was less likely to be sociable, was more likely to be large scale (i.e. large blocks of text, such as by pasting in the same lines over and over), and more likely to express frustration with the Wikipedia community.
Thinking systematically about why different groups of users might engage in vandalism can help counter vandalism. Potential interventions might change not just the amount, but also the type, of vandalism a community will receive. Tools to detect vandalism may find that the patterns in each category allow for more accurate targeting. Ultimately, viewing vandals as more than irrational sociopaths opens potential avenues for dialogue.
Paper Citation: Kaylea Champion. 2020. “Characterizing Online Vandalism: A Rational Choice Perspective.” In International Conference on Social Media and Society (SMSociety’20). Association for Computing Machinery, New York, NY, USA, 47–57. https://doi.org/10.1145/3400806.3400813
A paper recently published at CSCW describes the results of a forensic qualitative analysis of contributions made to Wikipedia through the anonymous browsing system Tor. The project was conducted collaboratively with researchers from Drexel, NYU, and the University of Washington and complements a quantitative analysis of the same data we also published to provide a rich qualitative picture of what anonymity-seekers are trying to do when they contribute to Wikipedia. The work also shows how the ability to stay anonymous can play a important role in facilitating certain types of contributions to online knowledge bases like Wikipedia.
Media reports often describe how online platforms are tracking us. That said, trying to live our lives online without leaving a trail of our personal information can be difficult because many services can’t be used without an account and systems that protect privacy are often blocked. One popular approach to protecting our privacy online involves using the Tor network. Tor protects users from being identified by their IP address which can be tied to a physical location. However, if you’d like to contribute to Wikipedia using Tor, you’ll run into a problem. Although most IP addresses can edit without an account, Tor users are blocked from editing.
Other research by my team has shown that Wikipedia’s attempt to block Tor is imperfect and that some people have been able to edit despite the ban. This work also built a dataset of more than 11,000 contributions made to Wikipedia via Tor and used quantitative analysis to show that the contributions of people using Tor were about the same quality as contributions from other new editors and other contributors without accounts. Of course, given the unusual circumstances Tor-based contributors faced, we wondered if a deeper look into the content of their edits might tell us more about their motives and the kinds of contributions they seek to make. I led a qualitative investigation that sought to explore these questions.
Given the challenges of studying anonymity seekers, we designed a novel “forensic” qualitative approach that was inspired by the techniques common in the practice of computer security as well as criminal investigation. We applied to this new technique to a sample of 500 different editing sessions and sorted each session into a category based on what the editor seemed to be intending to do.
Most of the contributions we found fell into one of the two following categories:
Many contributions were quotidian attempts to add to the encyclopedia. Tor-based editors added facts, they fixed typos, and they updated train schedules. There’s no way to know if these individuals knew that they were just getting lucky in their ability to edit or if they were patiently reloading to evade the ban.
Second, we found harassing comments and vandalism. Unwelcome conduct is common in online environments, and sometimes more common when the likelihood of being identified is decreased. Some of the harassing comments we observed were direct responses to being banned as a Tor user.
Although these were most of what we observed, we also found evidence of several types of contributor intent:
We observed activism, as when a contributor tried to bring attention to journalistic accounts of environmental and human rights abuses being committed by a mining company, only to have editors traceable to the mining company repeatedly remove their edits. Another example included an editor trying to diminish the influence of alternative medicine proponents.
We also observed quality maintenance activities when editors used Wikipedia’s rules about appropriate sourcing to remove personal websites being cited in conspiracy theories.
We saw edit wars with Tor editors participating in a back-and-forth removal and replacement of content as part of a dispute, in some cases countering the work of an experienced Wikipedia editor who even other experienced editors had gauged to be biased.
Finally, we saw Tor-based editors participating in non-article discussions such as investigations of administrator misconduct, and protesting the mistrust of Tor editors by the Wikipedia platform.
In all, these themes led us to reflect on how the risks that individuals face when contributing to online communities are sometimes out of alignment with the risks the communities face by accepting their work. Expressing minoritized perspectives, maintaining community standards even when you may be targeted by the rulebreaker, highlighting injustice or acting as a whistleblower can be very risky for an individual, and may not be possible without privacy protections. Of course, in platforms seeking to support the public good, such knowledge and accountability may be crucial.
This project was conducted by Kaylea Champion, Nora McDonald, Stephanie Bankes, Joseph Zhang, Rachel Greenstadt, Andrea Forte, and Benjamin Mako Hill. This work was supported by the National Science Foundation (awards CNS-1703736 and CNS-1703049) and included the work of two undergraduates supported through an NSF REU supplement.
Paper Citation: Kaylea Champion, Nora McDonald, Stephanie Bankes, Joseph Zhang, Rachel Greenstadt, Andrea Forte, and Benjamin Mako Hill. 2019. A Forensic Qualitative Analysis of Contributions to Wikipedia from Anonymity Seeking Users. Proceedings of the ACM on Human-Computer Interactaction. 3, CSCW, Article 53 (November 2019), 26 pages. https://doi.org/10.1145/3359155