3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
- ISO Online browsing platform: available at https://www.iso.org/obp
- IEC Electropedia: available at http://www.electropedia.org/
3.1 branch
In the context of version control systems, a branch is a parallel line of development that stems from the main line (commonly known as the "main" or "master" branch). It allows developers to isolate their work for a particular feature or bug fix without affecting the main line of development. Once the work is complete and tested, it can be merged back into the main branch.
3.2 git
Git is a distributed version control system created by Linus Torvalds in 2005. It allows teams of programmers to work on the same code base without overwriting each other's changes. Git is known for its speed, data integrity, and support for distributed, non-linear workflows. Each Git directory on every computer is a full-fledged repository with complete history and version tracking abilities, independent of network access or a central server.
3.3 hierarchical file system
A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically. It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure.
3.4 intrinsic identifier
An identifier that can be computed directly from the object that it identifies, without needing access to a registry. Typical examples are cryptographically strong hashes.
3.5 repository
In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, and documentation. It often includes metadata about the stored items, such as version number, author, and date of the last modification. Repositories can be local or remote and are managed by version control systems like Git.
3.6 SHA1
SHA-1 (short for "Secure Hash Algorithm 1", also stylized as "SHA1") is a hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value. The returned value is called SHA1 checksum, or simply SHA1 when there is no risk of ambiguity between the function and the returned value. A detailed description of how to compute SHA1 is available in RFC-3174.
In the wake of the Shattered attack of 2017 (see paper: Stevens2017Shattered), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums. It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see paper: Stevens2013Counter).
As collision-prone files are problematic from the point of view of unequivocal identification and integrity verification, the SWHID standard takes measures to avoid that such files are referenced using only SHA1 checksums.
For the purpose of this specification document, the SHA1 function is therefore considered to be a partial function, that only returns a value when a Shattered-style collision is not detectable using the techniques described in Stevens2013Counter and the reference implementation of it available at https://github.com/cr-marcstevens/sha1collisiondetection (version stable-v1.0.3
, corresponding to Git commit ID 38096fc021ac5b8f8207c7e926f11feb6b5eb17c
).
When such a collision is detected during SHA1 computation, no SHA1 can be obtained for the object in question and hence, depending on the context, a valid SWHID might not exist for it.
Note that in most cases SHA1 in this specification are computed on objects after adding specific headers to them, making "trivial" collision-prone files still perfectly valid and hence referenceable using SWHIDs.
3.7 version control system
A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and the branching and merging of code. Examples include Git, Mercurial, and Subversion.
3.8 software artifact
A software artifact, also referred to as a digital artifact and a software object, represents a distinct entity identifiable by a SWHID. This entity can be as granular as a single line of code within a source file or as expansive as an entire codebase comprising multiple source files. In addition to source files, a software object can also be a binary file resulting from code compilation or multiple binary files linked together to produce an executable file.
3.9 metadata
Within the context of this specification, metadata refers to supplementary information associated with a software object. It serves to provide a deeper understanding of the object by detailing attributes such as the programming language used, its functionality, or its dependencies. Metadata can also enumerate the individuals involved in the software's development, elucidate its licensing terms, offer a record of version history, and more. Essentially, metadata encapsulates the broader context, provenance, and attributes of the software object, ensuring a comprehensive understanding of its nature and purpose.
3.10 UNIX epoch
The UNIX epoch is a time reference point that denotes the precise moment at 00:00:00 Coordinated Universal Time (UTC) on 1 January 1970. In UNIX-based systems, time is often represented as the total number of seconds that have transpired since this specific moment. This convention is widely used in computing for time-stamping and date-time representations.