Back in the blogging game

Hello world! It’s been a while since I’ve done any blogging, but I’ve been wanting to return for some time now, so here we are. My old blog was a hodgepodge that hovered at the edges of my research. Current events featured prominently, especially those having to do with governance in online communities, knowledge production and access, and research ideas. I have a few different goals for this blog.

A new day dawns for blogging on the shores of Lake Michigan…

First, since it’s part of the Community Data Science Collective site, I plan to talk about our research, affiliates, community events, and related topics. Second, I want to use the blog as a space to sketch out research ideas more regularly. When I blogged previously, I was a graduate student. I had more unstructured time in which to brainstorm and reflect. The transition to faculty and the subsequent accumulation of responsibilities, projects, students, and commitments has left me seeking time to think broadly and with less structure. I need a semi-structured space and time to do so. As a result, I return to blogging.

This relates to a third goal: a minimum of one post per week. In the old days, Mako coordinated the Cambridge instance of Iron Blogger, a group blogging accountability project in which all the participants agreed to write one post per week or pay $5 into a common pot (that we then used to throw a party whenever it got big enough). The incentives sound misaligned, but the semi-public commitment, a deadline, and the nominal material cost of failure got a weekly post out of me roughly 90% of the time.

There is no Iron Blogger group in Chicago (yet?), but I’m going to recreate the structure with some friends and a little public accountability infrastructure. So far, Rachel and I have committed to posting weekly and tracking our posts. If others want to join, we can add further infrastructure as needed. No fines for now, but if I fail to post frequently between now and the end of the academic year, I’ll revisit.

Finally, since I do a lot more mentoring and teaching now than I used to, I imagine that these activities will occupy a fair amount of my attention as well. I feel more comfortable publishing material about my teaching now than when I first started at Northwestern. I am also realizing that my approach to teaching would lend itself really well to blogging as I am continually tinkering with the structure of my assignments, readings, evaluations, and lessons. A space to reflect on my experiences more actively and to solicit feedback from students and others seems like a helpful thing.

That’s it for this opening post. Thanks for reading.

Community Data Science Workshops in Spring 2015

The Community Data Science Workshops are a series of project-based workshops being held at the University of Washington for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, Twitter, free and open source software, and civic media.

The workshops are for people with absolutely no previous programming experience and they bring together researchers and academics with participants and leaders in online communities.  The workshops are run entirely by volunteers and are entirely free of charge for participants, generously sponsored by the UW Department of Communication and the eScience Institute. Participants from outside UW are encouraged to apply.

There will be a mandatory evening setup session from 6:00-9:00pm on Friday April 10 and three workshops held from 9am-4pm on three Saturdays (April 11 and 25 and May 9). Each Saturday session will involve a period for lecture and technical demonstrations in the morning. This will be followed by a lunch graciously provided by the eScience Institute at UW. The rest of the day will be devoted to group work on programming and data science projects supported by more experienced mentors.

Setup and Programming Tutorial (April 10 evening) — Because we expect to hit the ground running on our first full day, we will meet to help participants get software installed and to work through a self-guided tutorial that will help ensure that everyone has the skills and vocabulary to start programming and learning when we meet the following morning.

Introduction to Programming (April 11) — Programming is an essential tool for data science and is useful for solving many other problems. The goal of this session will be to introduce programming in the Python programming language. Each participant will leave having solved a real problem and will have built their first real programming project.

Importing Data from web APIs (April 25)  — An important step in doing data science is collecting data. The goal of this session will be to teach participants how to get data from the public application programming interfaces (“APIs”) common to many social media and online communities. Although we will use the APIs provided by Wikipedia and Twitter in the session, the principles and techniques are common to many other online communities.

Data Analysis and Visualization (May 9) — The goal of data science is to use data to answer questions. In our final session, we will use the Python skills we learned in the first session and the datasets we’ve created in the second to ask and answer common questions about the activity and health of online communities. We will focus on learning how to generate visualizations, create summary statistics, and test hypotheses.

Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like:

  • Are new contributors in Wikipedia this year sticking around longer or contributing more than people who joined last year?
  • Who are the most active or influential users of a particular Twitter hashtag?
  • Are people who join through a Wikipedia outreach event staying involved? How do they compare to people who decide to join the project outside of the event?

Earlier versions of the workshops were run in Spring and Fall 2014, and the curricula we used for both are online.

Sign up and Participate!

Participants! If you are interested in learning data science, please fill out our registration form here. The deadline to register is Friday April 3.  We will let participants know if we have room for them by Monday April 6. Space is limited and will depend on how many mentors we can recruit for the sessions.

Interested in being a mentor? If you already have experience with Python, please consider helping out at the sessions as a mentor. Being a mentor will involve working with participants and talking them through the challenges they encounter in programming. No special preparation is required. And we’ll feed you!  Because we want to keep a very high mentor-to-student ratio, recruiting more mentors means we can accept more participants. If you’re interested you can fill out this form or email makohill@uw.edu. Also, thank you, thank you, thank you!

About the Organizers

The workshops are being coordinated and organized by Benjamin Mako Hill, Dharma Dailey, Jonathan Morgan, Ben Lewis, Tommy Guy, and a long list of other volunteer mentors. The workshops have been designed with lots of help and inspiration from Shauna Gordon-McKeon and Asheesh Laroia of OpenHatch and lots of inspiration from the Boston Python Workshop.

These workshops are an all-volunteer effort. Fundamentally, we’re doing this because we’re programmers and data scientists who work in online communities and we really believe that the skills you’ll learn in these sessions are important and empowering tools.

The workshops are being supported by the UW Department of Communication and the eScience Institute.

If you have any questions or concerns, please contact Benjamin Mako Hill at makohill@uw.edu.

Photo from the Boston Python Workshop – a similar workshop run in Boston that has inspired and provided a template for the CDSW.

New Paper: Consider the Redirect

This post was originally published on Benjamin Mako Hill’s blog Copyrighteous.

In wikis, redirects are special pages that silently take readers from the page they are visiting to another page. Although their presence is noted in tiny gray text (see the image below), most people use them all the time and never know they exist. Redirects make linking between pages easier, populate Wikipedia’s search autocomplete list, and are generally helpful in organizing information. In the English Wikipedia, redirects make up more than half of all article pages.

[Image: seattle_redirect]

Over the years, I’ve spent some time contributing to Redirects for Discussion (RfD). I think of RfD as an ultra-low-stakes version of Articles for Deletion, where Wikipedians decide whether to delete or keep articles. If a redirect is deleted, viewers are taken to a search results page and almost nobody notices. That said, because redirects are almost never viewed directly, almost nobody notices if a redirect is kept either!

I’ve told people that if they want to understand the soul of a Wikipedian, they should spend time participating in RfD. When you understand why arguing about and working hard to come to consensus solutions for how Wikipedia should handle individual redirects is an enjoyable way to spend your spare time — where any outcome is invisible — you understand what it means to be a Wikipedian.

That said, wiki researchers rarely take redirects into account. For years, I’ve suspected that accounting for redirects was important for Wikipedia research and that several classes of findings were noisy or misleading because most people haven’t done so. As a result, I worked with my colleague Aaron Shaw at Northwestern earlier this year to build a longitudinal dataset of redirects that can capture the dynamic nature of redirects. Our work was published as a short paper at OpenSym several months ago.

It turns out that taking redirects into account correctly (especially if you are looking at activity over time) is tricky because redirects are stored as normal pages by MediaWiki except that they happen to start with special redirect text. Like other pages, redirects can be updated and changed over time, and frequently are. As a result, taking redirects into account for any study that looks at activity over time requires looking at the text of every revision of every page.
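
To make the per-revision check concrete, here is a minimal Python sketch. It is an illustration only, not the code published with the paper. It assumes the common English `#REDIRECT [[Target]]` form and ignores localized redirect keywords, and the names `redirect_target` and `redirect_history` are hypothetical.

```python
import re

# Assumption: only the English "#REDIRECT [[Target]]" syntax is handled.
# The pattern is case-insensitive and stops at a section anchor (#) or pipe (|).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the target title if this revision text is a redirect, else None."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def redirect_history(revisions):
    """Given (timestamp, text) pairs sorted by time, return a list of
    (timestamp, target_or_None) entries, one for each time the page's
    redirect state changes."""
    changes = []
    previous = object()  # sentinel that never equals a real target or None
    for timestamp, text in revisions:
        target = redirect_target(text)
        if target != previous:
            changes.append((timestamp, target))
            previous = target
    return changes
```

Running `redirect_history` over every revision of a page yields the periods during which the page was a redirect and where it pointed, which is the kind of information a longitudinal analysis needs.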

Using our dataset, Aaron and I showed that the distribution of edits across pages in English Wikipedia (a relationship that is used in many research projects) looks pretty close to log normal when we remove redirects and very different when we don’t. After all, half of articles are really just redirects and, because they are just redirects, these “articles” are almost never edited.
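
As a rough sketch of that comparison: if edit counts are log normal, their logarithms should look roughly normal, so a natural first step is to take logs with and without redirects and compare the two sets. The helper `log_edit_counts` and the numbers in `toy_pages` are made up for illustration. They are not from our dataset or our published code.

```python
import math
from statistics import median


def log_edit_counts(pages, drop_redirects=True):
    """pages: iterable of (edit_count, is_redirect) pairs.
    Returns the natural log of each edit count, optionally skipping redirects."""
    return [
        math.log(edits)
        for edits, is_redirect in pages
        if edits > 0 and not (drop_redirects and is_redirect)
    ]


# Toy, invented numbers: redirects cluster at one or two edits.
toy_pages = [(1, True), (1, True), (2, True), (40, False), (150, False), (12, False)]

with_redirects = log_edit_counts(toy_pages, drop_redirects=False)
without_redirects = log_edit_counts(toy_pages)

print(len(with_redirects), len(without_redirects))
print(median(with_redirects), median(without_redirects))
```

With real data, a histogram of each list makes the difference visible: the barely edited redirects pile up at the low end and distort the shape.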

[Image: edits_over_pages]

Another puzzling finding that’s been reported in a few places, and that I repeated myself several times, is that edits and views are surprisingly uncorrelated. I’ll write more about this later but the short version is that we found that a big chunk of this can, in fact, be explained by considering redirects.

We’ve published our code and data and the article itself is online because we paid the ACM’s open access fee to ransom the article.


For more details see the paper: Hill, Benjamin Mako, and Aaron Shaw. (2014) “Consider the Redirect: A Missing Dimension of Wikipedia Research.” In Proceedings of the 10th International Symposium on Open Collaboration (OpenSym 2014). ACM Press, 2014.

Community Data Science Workshops in November 2014

The Community Data Science Workshops in November 2014 are a series of project-based workshops being held at the University of Washington for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, Twitter, free and open source software, and civic media.

The workshops are for people with absolutely no previous programming experience and they bring together researchers and academics with participants and leaders in online communities.  The workshops are run entirely by volunteers and are entirely free of charge for participants, generously sponsored by the UW Department of Communication and the eScience Institute. Participants from outside UW are encouraged to apply.

There will be a mandatory evening setup session from 6:30-9:30pm on Friday November 7 and three workshops held from 9am-4pm on three Saturdays in November (November 8, 15, and 22). Each Saturday session will involve a period for lecture and technical demonstrations in the morning. This will be followed by a lunch graciously provided by the eScience Institute at UW. The rest of the day will be devoted to group work on programming and data science projects supported by more experienced mentors.

Setup and Programming Tutorial (November 7 evening) — Because we expect to hit the ground running on our first full day, we will meet to help participants get software installed and to work through a self-guided tutorial that will help ensure that everyone has the skills and vocabulary to start programming and learning when we meet the following morning.

Introduction to Programming (November 8) — Programming is an essential tool for data science and is useful for solving many other problems. The goal of this session will be to introduce programming in the Python programming language. Each participant will leave having solved a real problem and will have built their first real programming project.

Importing Data from Wikipedia and Twitter APIs (November 15)  — An important step in doing data science is collecting data. The goal of this session will be to teach participants how to get data from the public application programming interfaces (“APIs”) common to many social media and online communities. Although we will use the APIs provided by Wikipedia and Twitter in the session, the principles and techniques are common to many other online communities.

Data Analysis and Visualization (November 22) — The goal of data science is to use data to answer questions. In our final session, we will use the Python skills we learned in the first session and the datasets we’ve created in the second to ask and answer common questions about the activity and health of online communities. We will focus on learning how to generate visualizations, create summary statistics, and test hypotheses.

Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like:

  • Are new contributors in Wikipedia this year sticking around longer or contributing more than people who joined last year?
  • Who are the most active or influential users of a particular Twitter hashtag?
  • Are people who join through a Wikipedia outreach event staying involved? How do they compare to people who decide to join the project outside of the event?

An earlier version of the workshops was run between April and May 2014 and the curriculum we used in the Spring is available online.

Sign up and Participate!

Participants! If you are interested in learning data science, please fill out our registration form here. The deadline to register is Thursday October 30.  We will let participants know if we have room for them by Saturday November 1. Space is limited and will depend on how many mentors we can recruit for the sessions.

Interested in being a mentor? If you already have experience with Python, please consider helping out at the sessions as a mentor. Being a mentor will involve working with participants and talking them through the challenges they encounter in programming. No special preparation is required. And we’ll feed you!  Because we want to keep a very high mentor-to-student ratio, recruiting more mentors means we can accept more participants. If you’re interested,  email makohill@uw.edu. Also, thank you, thank you, thank you!

About the Organizers

The workshops are being coordinated and organized by Benjamin Mako Hill, Frances Hocutt, Jonathan Morgan, Tommy Guy, and a long list of other volunteer mentors. The workshops have been designed with lots of help and inspiration from Shauna Gordon-McKeon and Asheesh Laroia of OpenHatch and lots of inspiration from the Boston Python Workshop.

These workshops are an all-volunteer effort. Fundamentally, we’re doing this because we’re programmers and data scientists who work in online communities and we really believe that the skills you’ll learn in these sessions are important and empowering tools.

The workshops are being supported by the UW Department of Communication and the eScience Institute.

If you have any questions or concerns, please contact Benjamin Mako Hill at makohill@uw.edu.

Photo from the Boston Python Workshop – a similar workshop run in Boston that has inspired and provided a template for the CDSW.

Community Data Science Workshop Post-Mortem

[Reposted from Benjamin Mako Hill’s blog Copyrighteous.]

Earlier this year, I helped plan and run the Community Data Science Workshops: a series of three (and a half) day-long workshops designed to help people learn basic programming and data science tools in order to ask and answer questions about online communities like Wikipedia and Twitter. You can read our initial announcement for more about the vision.

The workshops were organized by me, Jonathan Morgan from the Wikimedia Foundation, long-time Software Carpentry teacher Tommy Guy, and a group of 15 volunteer “mentors” who taught project-based afternoon sessions and worked one-on-one with more than 50 participants. With overwhelming interest, we were ultimately constrained by the number of mentors who volunteered. Unfortunately, this meant that we had to turn away most of the people who applied. Although it was not emphasized in recruiting or used as a selection criterion, a majority of the participants were women.

The workshops were all free of charge and sponsored by the UW Department of Communication, who provided space, and the eScience Institute, who provided food.

The curriculum for all four sessions is online.

The workshops were designed for people with no previous programming experience. Although most of our participants were from the University of Washington, we had non-UW participants from as far away as Vancouver, BC.

Feedback we collected suggests that the sessions were a huge success, that participants learned enormously, and that the workshops filled a real need in the Seattle community. Between workshops, participants organized meet-ups to practice their programming skills.

Most excitingly, just as we based our curriculum for the first session on the Boston Python Workshop’s, others have been building off our curriculum. Elana Hashman, who was a mentor at the CDSW, is coordinating a set of Python Workshops for Beginners with a group at the University of Waterloo and with sponsorship from the Python Software Foundation using curriculum based on ours. I also know of two university classes that are tentatively being planned around the curriculum.

Because a growing number of groups have been contacting us about running their own events based on the CDSW — and because we are currently making plans to run another round of workshops in Seattle late this fall — I coordinated with a number of other mentors to go over participant feedback and to put together a long write-up of our reflections in the form of a post-mortem. Although our emphasis is on things we might do differently, we provide a broad range of information that might be useful to people running a CDSW (e.g., our budget). Please let me know if you are planning to run an event so we can coordinate going forward.

Community Data Science Workshops at UW

Photo from the Boston Python Workshop – a similar workshop run in Boston that has inspired and provided a template for the CDSW.

The Community Data Science Workshops are a series of project-based workshops being held at the University of Washington for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, Twitter, free and open source software, and civic media.

The workshops are for people with no previous programming experience. The goal is to bring together both researchers and academics as well as participants and leaders in online communities.  The workshops will all be free of charge. Participants from outside UW are encouraged to apply.

There will be three workshops held from 9am-4pm on three Saturdays in April and May. Each session will involve a period for lecture and technical demonstrations in the morning. This will be followed by a lunch graciously provided by the eScience Institute at UW. The rest of the day will be devoted to group work on programming and data science projects supported by more experienced mentors.

Introduction to Programming (April 5) — Programming is an essential tool for data science and is useful for solving many other problems. The goal of this session will be to introduce programming in the Python programming language. Each participant will leave having solved a real problem and will have built their first real program in their group. We will be relying on the curriculum from the Boston Python Workshops. Because we expect to hit the ground running, we will also run a session in the evening of Friday April 4 to help participants get software installed.

Importing Data from Wikipedia and Twitter APIs (May 3) — An important step in doing data science is collecting data. The goal of this session will be to teach participants how to get data from the public application programming interfaces (“APIs”) common to many social media and online communities. Although we will use the APIs provided by Wikipedia and Twitter in the session, the principles and techniques are common to many online communities.

Data Analysis and Visualization (May 31) — The goal of data science is to use data to answer questions. In our final session, we will use the Python skills we learned in the first session and the datasets we’ve created in the second to ask and answer common questions about the activity and health of online communities. We will focus on learning how to generate visualizations, create summary statistics, and test hypotheses.

Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like:

  • Are new contributors to an article in Wikipedia sticking around longer or contributing more than people who joined last year?
  • Who are the most active or influential users of a particular Twitter hashtag?
  • Are people who participated in a Wikipedia outreach event staying involved? How do they compare to people who joined the project outside of the event?

Our first session will be modeled after the Boston Python Workshops, but the curriculum of the later sessions is still in development and will be influenced by the needs of the participants.

Sign up and Participate!

Participants! If you are interested in learning data science, fill out our registration form here. The deadline to register is Wednesday March 26th.  We will let participants know if we have room for them by Saturday March 29th. Space is limited and will depend on how many mentors we can recruit for the sessions.

Interested in being a mentor? If you already have experience with Python, please consider helping out at the sessions as a mentor. Being a mentor will involve working with participants and talking them through the challenges they encounter in programming. No special preparation is required. And we’ll feed you! Because we want to keep a very high mentor-to-student ratio, recruiting more mentors means we can accept more participants. If you’re interested, email makohill@uw.edu. Also, thank you, thank you, thank you!

About the Organizers

The workshops are being coordinated, organized, and led by Benjamin Mako Hill at the University of Washington Department of Communication and Jonathan Morgan at the Wikimedia Foundation. They have been designed with lots of help and inspiration from Shauna Gordon-McKeon and Asheesh Laroia of OpenHatch and lots of inspiration from the Boston Python Workshop.

These workshops are an all-volunteer effort. Fundamentally, we’re doing this because we’re programmers and data scientists who work in online communities and we really believe that the skills you’ll learn in these sessions are important and empowering tools.

The workshops are being supported by the UW Department of Communication and the eScience Institute.

If you have any questions or concerns, contact Benjamin Mako Hill at makohill@uw.edu.
