Citing Software Sources

Note: I am aware that the FORCE11 research group has developed guidelines for citing software. However, the focus there seems to be on citing software that was used in the production of research. My concern is citing the contents of software repositories.

Citations have traditionally referred to printed texts. While there are many citation styles, they generally communicate the same pieces of information, such as author(s), page number(s), and publication date. However, modern information distribution systems format text differently. In particular, source code does not have page numbers, is frequently authored by a large number of people (the list of whom changes over time), and cannot always be uniquely identified by publication date (source repositories can have multiple commits on the same day). Additionally, source code repositories frequently include documentation, which has similar characteristics.

The University of Washington library publishes a set of "Research Guides" which includes citation guides. In "Basic of Citing", they list 4 useful outcomes of citations:

  1. Give credit
  2. Add strength and authority to your work
  3. Place your work in a specific context
  4. Leave a trail for other scholars

These outcomes are facilitated by the previously mentioned pieces of common information.

Documentation

Documentation often appears in source form as well as built form. For example, the GCC repository contains a set of .texi files which can be built into an Info manual. It is common for Rust projects to provide a set of markdown files which are built into a website.

It would seem natural to cite the documentation source directly. This is the format that a person would see with a simple clone of the relevant repository, which removes barriers to accessing the information (e.g., obtaining and running the toolchain used to build the documentation, as well as a program which can read the built files). However, these files are primarily meant for a machine audience. The structured markup instructs the toolchain about the appropriate format for human consumption. For example, text which is meant to be formatted as a code block will have markup specifying this. A human could read this markup and understand its meaning, but it is more useful to ensure that the text is monospace and possibly syntax-highlighted. Similarly, it is more useful for hyperlinks to appear as clickable text instead of making the reader process a raw URL. Therefore, both the source and built documentation meet outcome 1 equally well, but the built documentation meets outcomes 2-4 better.

In addition to the authorial and publication date challenges mentioned previously, neither Info manuals nor websites have specific page numbers which can be used. Websites are organized into pages, but while print pages are of uniform length, web pages can be arbitrarily long. Info manuals have distinct pages based on the structure of the text, but these pages are not numbered and, again, can be of arbitrary length.

The exact version of the text can be specified precisely by providing a public-facing URL from which the repository can be cloned, together with the exact revision (commit hash) that the author referenced. Additionally, the author should specify the format of documentation that they used in their research (Info manual, PDF, website, etc.). However, due to the diversity of documentation styles, it does not seem useful to prescribe a specific format for locating a block of text within a piece of documentation. Instead, when a specific block of text needs to be identified, the author should use their judgment to determine the best way to accomplish this objective.

For example, a citation to the GCC Info manual might look like this:

GNU Compiler Collection. Info manual.
  https://gcc.gnu.org/git/gcc.git (53fb2cf75965e4dbcf145a12d8ae41f4667a8498).
  Ch. 11 "GENERIC", Sec. 6 "Expressions", Subsec. 3 "Unary and Binary Expressions".

It would have been reasonable to write '11.6.3 "Unary and Binary Expressions"' or even just '11.6.3', because the commit version ensures that the reader can rely on these numbers to locate the correct section. This is similar to relying on the publication date to locate the correct page of a traditional text. However, due to the constantly evolving nature of software it is useful to note the title of each component in the path to the text section so that a reader who has immediate access to a future version of the manual, which might have added a new chapter preceding the current one, can easily identify the relevant section. This also helps the reader check if the information that the text relies on is still accurate for the current version of the software. Therefore, in my judgment this is the best way to identify the relevant section of text.
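
As a minimal sketch of how an author might collect the two identifiers used in such a citation, the shell function below prints the URL a local clone was made from and the full hash of the commit that is checked out. The function name citation_ids is hypothetical, and the sketch assumes the clone's remote is named "origin", which is the git default but is not guaranteed.

```shell
# Print the repository URL and commit hash needed for a citation.
# citation_ids is a hypothetical helper name.
# Usage: citation_ids <path-to-local-clone>
citation_ids() {
    repo=$1
    # URL the clone was made from; assumes the remote is named "origin".
    git -C "$repo" remote get-url origin
    # Full hash of the commit that is currently checked out.
    git -C "$repo" rev-parse HEAD
}
```

The first line of output is the URL and the second is the commit hash. The author should still confirm that the URL is reachable by the public before citing it.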

Program Code

In contrast to documentation, source code is the primary format meant for human consumption. It is expected that source code is viewed with a text editor that displays line numbers, so specifying a line (or range of lines) is sufficient and useful. It might also be useful to note what the range of lines is referring to, for the benefit of readers who have not yet downloaded the repository or are reading in a setting where they do not have access to a full computer (for example, on their phone while commuting). As with documentation, the repository location and commit should be specified.

For example, a citation to the GCC source code might look like this:

GNU Compiler Collection. Program source.
  https://gcc.gnu.org/git/gcc.git (53fb2cf75965e4dbcf145a12d8ae41f4667a8498).
  FOR_EACH_CALL_EXPR_ARG definition. gcc/tree.h lines 6027-6031.
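
A reader holding such a citation could retrieve the cited lines without checking out the old revision. The sketch below is one way to do this, assuming a local clone of the repository; show_cited_lines is a hypothetical helper name, and it relies only on git show and sed.

```shell
# Print lines START..END of PATH as it existed at COMMIT.
# show_cited_lines is a hypothetical helper name.
# Usage: show_cited_lines <clone> <commit> <path> <start> <end>
show_cited_lines() {
    repo=$1
    commit=$2
    path=$3
    start=$4
    end=$5
    # "git show COMMIT:PATH" writes the file as of that commit to stdout;
    # sed -n 'START,ENDp' then prints only the requested range of lines.
    git -C "$repo" show "${commit}:${path}" | sed -n "${start},${end}p"
}

# Example, assuming a clone of the GCC repository in ./gcc:
# show_cited_lines gcc 53fb2cf75965e4dbcf145a12d8ae41f4667a8498 gcc/tree.h 6027 6031
```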

In some cases, developers provide releases in the form of compressed sources without a version control system (and, sometimes, with some amount of pre-processing applied). In these cases, a hash of the release file should be provided. A hash gives more confidence in the identity of the release than a filename or even a URL, either of which could refer to different contents over time.
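
One way to obtain such a hash is sketched below. It assumes the sha256sum utility from GNU coreutils is available; the choice of SHA-256 and the helper name release_digest are assumptions made for illustration, not something prescribed here.

```shell
# Print the SHA-256 digest (hex digits only) of a release archive.
# release_digest is a hypothetical helper name.
# Usage: release_digest <path-to-release-file>
release_digest() {
    # sha256sum prints "<digest>  <filename>"; keep only the digest field.
    sha256sum "$1" | cut -d ' ' -f 1
}
```

The printed digest is the value to include in the citation, along with the name of the hash algorithm so that a reader can repeat the check.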

Program Behavior

It is sometimes necessary to discuss the behavior of a particular program separately from the human-readable source code. It does not make sense to provide citations for these notes, because they record observed behavior rather than refer to another's work or observations. However, in this case it is important for the work to be reproducible. Fortunately, software based on the Nix build system, such as GNU Guix, provides the ability to precisely identify all software inputs. The author should provide such a specification in addition to complete instructions for reproducing the behavior. This should include a publicly accessible URL containing the complete technical information required for reproduction. If it is useful, they might also provide a printed version of the technical information in appendices (this is expected to increase the length of the paper significantly).

For example, these commands can be used to record a reproducible environment with GNU Guix:

$ guix shell --export-manifest $PACKAGES > manifest.scm
$ guix describe -f channels > channels.scm

Then, a user can get the exact same environment with the time-machine command:

$ guix time-machine -C channels.scm -- shell --pure -m manifest.scm
