The Introduction of Documentation in FLOSS Projects

Community decay and abandonment are persistent risks to free/libre and open source software (FLOSS) projects. In response, large institutions such as GitHub and Mozilla offer advice to FLOSS projects on how to organize their work for sustainability and community-building. Their guides recommend producing README files and CONTRIBUTING guides as useful tools for recruiting new project contributors and driving activity. Yet although these documents are widely recommended, there is little empirical study of how projects use them and what happens when they are introduced to projects.

This is a plot of the moving average of weekly commit counts to the focal FLOSS project in the weeks surrounding the publication of README files or CONTRIBUTING guides. Both moving averages show a steep increase in commit activity in the weeks preceding the documents' publication, before sharp decreases in the weeks immediately following the publication. Across the 10 weeks included in the plot (5 weeks before/after document publication) projects publishing README files had fewer weekly contributions than those publishing CONTRIBUTING guides.
Plot of average (log-transformed) weekly contribution counts over time around the point of document introduction (weeks offset from document publication date) for README (red) and CONTRIBUTING (blue) files. The Y-axis has been scaled to real count values.
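For readers who want to build a similar view for their own repositories, the series behind a plot like this can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the function names, the use of log1p as the log transform, and the trailing three-week smoothing window are all assumptions made for the example.

```python
import math
from collections import defaultdict
from datetime import date


def weekly_counts(commit_dates, published, window=5):
    """Count commits per week, indexed by offset from the publication date.

    Offset 0 is the 7 days starting on `published`; offset -1 is the 7 days
    before it. With window=5 this covers 10 weeks (5 before, 5 after).
    """
    counts = {offset: 0 for offset in range(-window, window)}
    for d in commit_dates:
        offset = (d - published).days // 7  # floor division also works for negative days
        if offset in counts:
            counts[offset] += 1
    return counts


def mean_log_counts(projects, window=5):
    """Average log1p-transformed weekly counts across projects.

    `projects` is a list of (commit_dates, published) pairs. The result is
    mapped back to the count scale with expm1 (log1p is an assumed transform).
    """
    sums = defaultdict(float)
    for commit_dates, published in projects:
        for offset, n in weekly_counts(commit_dates, published, window).items():
            sums[offset] += math.log1p(n)
    return {o: math.expm1(s / len(projects)) for o, s in sorted(sums.items())}


def moving_average(values, k=3):
    """Trailing moving average; the first k-1 points use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - k + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Given a list of commit dates and a document publication date per project, `moving_average(list(mean_log_counts(projects).values()))` yields one smoothed value per week offset, which can then be plotted separately for README and CONTRIBUTING files.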

In one of the first empirical studies of the initial publication of documentation files, our findings suggest a disconnect between institutional recommendations and FLOSS projects’ actual use of the documents. Instead of being proactively developed and community-oriented, first-version files are published following an increase in activity and focus on the functional details of using or contributing to the library. Often, documents are published with hardly any content at all, with projects publishing empty or minimal files. We found no support for any causal claims about the relationship between a document’s depth or focus and subsequent project activity.

Our results suggest that projects may use these documents to perform a norm. The publication of empty documentation files implies that having a file in their home directory, even an empty one, was more important to projects than any benefit of the document’s contents. Our results also suggest that projects may use these documents to ‘get their house in order’ after an influx of activity.

The guides and recommendations that we examined did not specify which actions projects should take, or when, to grow sustainably. This lack of specificity limits the guides’ utility for projects trying to figure out how to sustain themselves in ever-changing environments. The work necessary to develop meticulous, community-oriented files may not be a good time investment for early-stage projects with only a handful of contributors. More research is necessary to develop useful, context-situated recommendations to support FLOSS projects’ adaptation.

This paper was presented a few weeks ago in Ottawa at the International Conference on Cooperative and Human Aspects of Software Engineering (CHASE) 2025. A pre-print of the paper can be found here; the data and code for the project can be found here.

This research wouldn’t be possible without the work of the volunteers producing FLOSS who have made their work available for inspection. We also gratefully acknowledge support from the Ford/Sloan Digital Infrastructure Initiative (Sloan Award 2018-113560) and the National Science Foundation (Grant IIS-2045055). This work was conducted using the Hyak supercomputer at the University of Washington as well as research computing resources at Northwestern University.

FLOSS project risk and community formality

What structure and rules are best for communities producing high-quality free/libre and open source software (FLOSS)? The stakes are high: cybersecurity researchers are raising the alarm about the risk posed by undermaintained components in the global software supply chain, much of which is FLOSS. In work that’s just been accepted to the IEEE International Conference on Software Analysis, Evolution and Reengineering (‘SANER’), we studied 182 Python-language packages in the Debian GNU/Linux distribution, examining the relationship between their levels of engineering formality and software risk. We found that more formal developer organization is associated with higher levels of software risk, and more widely spread developer responsibility is associated with lower levels of software risk.

We studied software risk through the underproduction metric initially developed by Champion and Hill (2021). Underproduction is a measurement of misalignment between the usage demands of a software project and the contributions of the project’s developer community. As such, underproduction measures the risk that software will be undermaintained and, as a result, may contain unaddressed problems such as security bugs.

Our work examines the relationship between risk due to underproduction and governance formality. We employed measures initially developed by Tamburri et al. (2013) and later re-implemented in Tamburri et al. (2019). These metrics use multiple measures of software project formality — such as the average contributor type, usage of GitHub milestones, and age — to evaluate how formally structured a given project is.

Plot of the relationship between mean underproduction factor and mean membership type (MMT), a metric encapsulating the diffusion of merge responsibility across a project’s developer community.



Using linear regression, we found that more formal project structures are associated with higher levels of underproduction and thus increased project risk. We also found that the share of community members who have merged code into the main development branch is related to underproduction: larger shares of community mergers are correlated with lower levels of underproduction.

Evaluated together, these two conclusions suggest that operating less formally and sharing power more equally is associated with lower underproduction risk. The development of FLOSS project engineering is a process laden with tradeoffs, and we hope that our conclusions can help better inform community decision making and organization.

For more details, visualizations, and statistics, we hope you’ll take a look at our paper. If you are attending SANER in March 2024, we hope you’ll talk to us in Rovaniemi, Finland!

—————

The full citation for the paper is:

Gaughan, Matthew, Champion, Kaylea, and Hwang, Sohyeon. (2024) “Engineering Formality and Software Risk in Debian Python Packages.” In 31st IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER2024) (Short Paper and Posters Track). Rovaniemi, Finland.

We have also released replication materials for the paper, including all the data and code used to conduct the analyses.

This blog post and the paper it describes are collaborative work by Matt Gaughan, Kaylea Champion, and Sohyeon Hwang.