The numbers offered in the previous section are no more than estimates. They give us at least orders of magnitude, and allow for comparisons, but they should not be taken as exact data: there are too many sources of error and too much room for interpretation. In this section, we discuss some of the more important assumptions made and the possible sources of error. We also compare our SLOC counts with those for other systems, with the aim of giving the reader some context in which to interpret the numbers.
Since we rely on David Wheeler's sloccount tool for counting SLOC, we also rely on his definition of "physical source lines of code". In practice, then, we identify a SLOC whenever sloccount identifies one. However, sloccount has been carefully programmed to honor the usual definition of physical SLOC: "A physical source line of code is a line ending in a newline or end-of-file marker, and which contains at least one non-whitespace non-comment character."
There is another, similar measure, the "logical" SLOC, which is sometimes preferred. For instance, a line written in ANSI C with two semicolons would be counted as two logical SLOC, while it would be counted as one physical SLOC. However, for the purposes of this paper (and for almost any purpose), the differences between the two definitions of SLOC are negligible, especially when compared to other sources of error and interpretation.
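The difference between the two measures can be illustrated with a minimal sketch. It is not how sloccount works internally: the helper names (`strip_comments`, `physical_sloc`, `logical_sloc_naive`) are hypothetical, only C-style `//` and `/* */` comments are handled, comment markers inside string literals are ignored, and counting semicolons is only a crude stand-in for logical SLOC (it would overcount a `for (;;)` header, for example).

```python
def strip_comments(source: str) -> list[str]:
    """Return the code part of each line, with C-style comments removed.

    Simplifying assumption: comment markers inside string literals
    are not treated specially.
    """
    lines = []
    in_block = False  # True while inside a /* ... */ comment
    for line in source.splitlines():
        code = []
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    break  # comment continues on the next line
                in_block = False
                i = end + 2
            elif line.startswith("//", i):
                break  # rest of the line is a comment
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                code.append(line[i])
                i += 1
        lines.append("".join(code))
    return lines


def physical_sloc(source: str) -> int:
    """Lines with at least one non-whitespace, non-comment character."""
    return sum(1 for code in strip_comments(source) if code.strip())


def logical_sloc_naive(source: str) -> int:
    """Crude approximation: one logical SLOC per semicolon in the code."""
    return sum(code.count(";") for code in strip_comments(source))


SAMPLE = """\
/* swap two values */
int t = a; a = b;   // two statements on one line
b = t;

/* a comment
   spanning lines */
"""

print(physical_sloc(SAMPLE), logical_sloc_naive(SAMPLE))
```

For this sample the two measures differ only because of the line that holds two statements, which is the situation described above.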
The counts of lines of code presented in this paper are no more than estimates. By no means do we imply that they are exact, especially when they refer to aggregates of packages. Several factors cause this inaccuracy, some due to the tools used to count, others due to the selection of packages:
Some files may not have been counted accurately.
Although sloccount includes carefully designed heuristics to detect source files, and to distinguish source lines from comments, those heuristics do not always work as expected. In addition, in many cases it is difficult to identify automatically generated files (which should not be counted), although sloccount also makes a good effort to recognize them.
Not all programming languages are recognized.
To gather the data we used release 1.9 of sloccount, which recognizes about 20 different languages. However, some languages present in Debian (such as Modula-3 or Erlang) are not currently supported. This obviously leads to some underestimation for packages with files written in those languages.
Different perceptions in the aggregation of package families and the selection of a representative.
As we comment in the subsection where we discuss the selection of the list of packages to count, the reasons for including a given package in the list, or leaving it out, are open to question. Should we count different releases of the same package? Should we count code present in several packages only once, or not? The usual criterion for measuring SLOC is "delivered source lines of code". From this point of view, all packages should be considered as they appear in the Debian release. However, this is difficult to apply when some packages are clearly evolutions of other packages. Instead of considering all of them as "delivered", it seems more productive to consider the older ones as "beta releases". However, in the libre software world it is rather common to deliver stable releases every 6 or 12 months. A lot of work goes into those stable releases simply to ensure stability, even if they are also the foundation for later releases.
In most cases, we have adopted an intermediate approach: to count only once those families of packages which form a line of evolution (as is the case of emacs19 and emacs20), but to count separately those families of packages which happen to share some code but are in themselves different developments (as is the case of gcc and gnat).
Current estimation models, and specifically COCOMO, only consider classical, proprietary development models. But libre software development models are rather different, and therefore those models are not directly applicable. Thus, we can only estimate the cost the system would have had if it had been developed using classical development models, but not the actual cost (in effort or in money) of developing the software included in Debian 2.2.
Some of the differences that make it impossible to use those estimation models are:
Continuous release process (frequent releases). The COCOMO model is based on the concept of "delivered SLOC", which implies a single point in the history of the project at which the product is released. From there on, the main development effort is devoted to maintenance. In contrast, most libre software projects deliver releases so frequently that they can be considered to follow a continuous release process. This process implies the almost continuous stabilization of the code while it evolves. Libre software projects routinely improve and modify their software at the same time as they prepare it for end users.
Bug reports and fixes. While every proprietary software system needs expensive debugging cycles, libre software can count on the help of people external to the project, in the form of valuable bug reports, and even fixes for them.
Reuse, evolution, and inter-fertilization of code. It is common for libre software projects to reuse code from other libre software projects as an integral part of the system being developed. It is also common for several projects to develop evolutions of the same base system, in many cases with all of them using each other's code all the time. Some of these cases can also occur in proprietary developments, but even in large companies with many open projects they are not common, while they are the norm in libre software projects.
Distributed development model. Although some proprietary systems are developed by geographically distributed teams, the degree of distributed development found in libre software projects is several orders of magnitude greater. There are exceptions, but libre software projects are usually carried out by people from different countries, who do not work for the same company, devote different amounts of effort to the project, and interact mainly through the Internet. In most cases (especially in large projects), the development team has never been physically together.
Some of these factors increase the effort needed to build the software, while some others decrease it. Without analyzing in detail the impact of these (and other) factors, the estimation models in general, and COCOMO in particular, are not directly applicable to libre software development.
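For reference, the kind of classical estimate discussed above can be sketched in a few lines. This is a minimal illustration assuming the organic-mode coefficients of Boehm's Basic COCOMO (2.4, 1.05, 2.5 and 0.38); these values and the function names are assumptions made here, not necessarily the parameters behind the figures in this paper.

```python
def cocomo_effort_person_months(sloc: int, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO effort in person-months: a * KSLOC ** b.

    The defaults are the organic-mode coefficients (an assumption).
    """
    ksloc = sloc / 1000.0
    return a * ksloc ** b


def cocomo_schedule_months(effort_pm: float, c: float = 2.5, d: float = 0.38) -> float:
    """Basic COCOMO schedule in calendar months: c * effort ** d."""
    return c * effort_pm ** d


if __name__ == "__main__":
    # Illustrative input only: the rounded total for Debian 2.2 quoted in this paper.
    effort = cocomo_effort_person_months(56_000_000)
    print(f"effort: {effort / 12:.0f} person-years")
    print(f"schedule: {cocomo_schedule_months(effort) / 12:.1f} years")
```

Because the exponent is greater than one, applying the formula to an aggregate yields more effort than summing the estimates of its parts, so the result depends on whether the model is applied per package or to the whole distribution.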
To put the numbers shown above into context, here we offer estimations for the size of some operating systems, and a more detailed comparison with the estimations for the Red Hat Linux distribution.
As reported in "From NT OS/2 to Windows 2000 and Beyond. A Software-Engineering Odyssey" (for Windows 2000), "More Than a Gigabuck: Estimating GNU/Linux's Size" (for Red Hat Linux), and "Software Complexity and Security" (for the rest of the systems), this is the estimated size for several operating systems, in lines of code (all numbers are just approximations):
Microsoft Windows 3.1: 3,000,000
Sun Solaris: 7,500,000
Microsoft Windows 95: 15,000,000
Red Hat Linux 6.2: 17,000,000
Microsoft Windows NT 5.0: 20,000,000
Microsoft Windows 2000: 29,000,000
Red Hat Linux 7.1: 30,000,000
Debian 2.2: 56,000,000
Most of these estimations (in fact, all of them except those for Red Hat Linux) are not detailed, and it is difficult to know what they consider to be a line of code. However, the estimations should be close enough to SLOC counts to be suitable for comparison.
Note also that, while both Red Hat and Debian include many applications, in a lot of cases even several applications in the same category, both the Microsoft and Sun operating systems are much more limited in this respect. If the most common applications used in those environments were counted together with them, their size would be much larger. However, it is also true that those applications are neither developed nor put together by the same team of developers, as is the case in Linux-based distributions.
From these numbers, it can be seen that Linux-based distributions in general, and Debian 2.2 in particular, are some of the largest pieces of software ever put together by a group of developers.
The only operating system for which we have found detailed counts of source lines is Red Hat Linux (see "Estimating Linux's Size" and "More Than a Gigabuck: Estimating GNU/Linux's Size"). Since it is also a Linux-based distribution, and the software packages included in the Debian and Red Hat distributions are rather similar, the comparison with it can be illustrative. In addition, since Red Hat Linux is very common, and probably the best-known Linux-based distribution, comparing with it can provide useful context for readers already familiar with it.
The first thing that surprised us when we counted Debian 2.2 was its size compared to Red Hat 6.2 (released in March 2000) and Red Hat 7.1 (released in April 2001). Debian 2.2 was released in August 2000, and is roughly twice the size of Red Hat 7.1 (released about eight months later) and more than three times the size of Red Hat 6.2 (released five months earlier). Some of these differences could be due to different criteria for deciding which packages to include when counting, but the figures give a good idea of the relative sizes even taking those criteria into account.
The main factor causing these differences is the number of packages included in each distribution: in the case of Debian we have considered 2630 source packages (with a mean of about 21,300 SLOC per package), while Red Hat 7.1 includes only 612 packages (about 49,000 SLOC per package).
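The ratios and per-package means quoted above can be reproduced from the rounded totals given earlier in this section. The short sketch below uses only those rounded figures (56, 30 and 17 million lines, 2630 and 612 packages), so its results are approximations in the same sense as the rest of the numbers here; the variable names are illustrative.

```python
# Rounded totals, in lines of code, as quoted in this section.
DEBIAN_22 = 56_000_000
REDHAT_71 = 30_000_000
REDHAT_62 = 17_000_000

# Number of source packages considered in each count.
DEBIAN_22_PACKAGES = 2630
REDHAT_71_PACKAGES = 612

ratio_vs_rh71 = DEBIAN_22 / REDHAT_71   # "roughly twice"
ratio_vs_rh62 = DEBIAN_22 / REDHAT_62   # "more than three times"

mean_debian = DEBIAN_22 / DEBIAN_22_PACKAGES
mean_redhat = REDHAT_71 / REDHAT_71_PACKAGES

print(f"Debian 2.2 / Red Hat 7.1: {ratio_vs_rh71:.2f}")
print(f"Debian 2.2 / Red Hat 6.2: {ratio_vs_rh62:.2f}")
print(f"mean SLOC per package, Debian 2.2:  {mean_debian:,.0f}")
print(f"mean SLOC per package, Red Hat 7.1: {mean_redhat:,.0f}")
```

The means show that Debian's larger total comes from having many more packages, not from larger ones: its average package is less than half the size of the average Red Hat 7.1 package.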
When comparing the largest packages in both distributions, we find in Debian all those included in Red Hat. The same is not true the other way around: several packages that contribute a considerable number of SLOC to Debian are not present in Red Hat. For instance, among the 12 largest packages in Debian 2.2, the following are missing from Red Hat 7.1: PM3 (about 1,115,000 SLOC), OSKit (about 859,000 SLOC), Stalin (805,000), GNAT (688,000), NCBI (591,000). Conversely, among the 12 largest packages in Red Hat 7.1, none is missing from Debian 2.2.
However, there is a large collection of software packages which is missing from Debian 2.2 but present in Red Hat 7.1: the KDE desktop environment and related utilities. Due to license problems, Debian decided not to include KDE software until after Debian 2.2, when the license for Qt changed to GPL. Therefore, we can say that Debian 2.2 is larger even though it lacks such a large piece of code as KDE. Just to give an idea, the largest KDE packages in Red Hat 7.1 are kdebase, kdelibs, koffice, and kdemultimedia, which amount to about 1,000,000 SLOC. All of them are missing from Debian 2.2. This suggests that, had the measurements been made on the current Debian archive (not yet officially released), the differences would have been greater.
The differences between the same package in each distribution are attributable to the different releases included in them. For instance, the Linux kernel amounts to 1,780,000 SLOC (release 2.2.19) in Debian 2.2, while it amounts to 2,437,000 SLOC (release 2.4.2) in Red Hat 7.1. Similarly, XFree includes 1,270,000 SLOC (release 3.3.6) in Debian 2.2, while the release included in Red Hat 7.1 amounts to 1,838,000 SLOC (XFree 4.0.3). These differences in releases make it difficult to compare the figures for Red Hat and Debian directly.
The reader should also note that there is a methodological difference between the study on Red Hat and ours on Debian. The former extracts all the source code, and uses MD5 checksums to avoid duplicates across the source code of the whole distribution. In the case of Debian, we have extracted the packages one by one, checking for duplicates only within each package. However, the total count should not be greatly affected by this difference.
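The two duplicate-detection strategies can be contrasted with a small sketch. It is only an illustration under simplifying assumptions: packages are represented as in-memory dictionaries mapping file names to contents, non-blank lines stand in for SLOC, and all names (`count_dedup_global`, `count_dedup_per_package`, the sample packages) are hypothetical and do not come from either study's tooling.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """MD5 digest used as a fingerprint to detect identical files."""
    return hashlib.md5(content).hexdigest()


def nonblank_lines(content: bytes) -> int:
    """Crude stand-in for a SLOC count: number of non-blank lines."""
    return sum(1 for line in content.splitlines() if line.strip())


def count_dedup_global(packages: dict[str, dict[str, bytes]]) -> int:
    """Count each distinct file once across the whole distribution."""
    seen = set()
    total = 0
    for files in packages.values():
        for content in files.values():
            digest = md5_of(content)
            if digest not in seen:
                seen.add(digest)
                total += nonblank_lines(content)
    return total


def count_dedup_per_package(packages: dict[str, dict[str, bytes]]) -> int:
    """Count each distinct file once per package, as done here for Debian."""
    total = 0
    for files in packages.values():
        seen = set()  # reset for every package
        for content in files.values():
            digest = md5_of(content)
            if digest not in seen:
                seen.add(digest)
                total += nonblank_lines(content)
    return total


# Hypothetical data: util.c is duplicated inside pkg-a and shared with pkg-b.
PACKAGES = {
    "pkg-a": {
        "util.c": b"int f(void);\n",
        "copy_of_util.c": b"int f(void);\n",
        "main.c": b"int main(void)\n{\n    return 0;\n}\n",
    },
    "pkg-b": {
        "util.c": b"int f(void);\n",
    },
}

print(count_dedup_global(PACKAGES), count_dedup_per_package(PACKAGES))
```

The per-package count can only be equal to or larger than the global one, and the gap is exactly the size of the files shared between packages.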