Examining Web Server Usage, Market Share, and Related Business Relevance
By Port80 Software
A November 2003 survey published by the UK-based Internet services company Netcraft made the claim that the Apache Web server "has a significant percentage gain" over its chief rival, Microsoft's Internet Information Services (IIS), and now controls over two-thirds of the global Web server market. Only days later, Port80 Software released a survey stating that "Microsoft IIS maintains dominance of the corporate Web server market" with 53.8 percent of the market. With two seemingly similar surveys drawing contradictory conclusions, clearly the question of whose software powers the majority of the Web server market demands a deeper examination.
Since its inception in 1995, Netcraft's Web server survey has received widespread attention in the technology industry press, to the point that its results are accepted by many as the standard picture of Web server market share. However, as part of its November 2003 Web server survey, Port80 Software has raised serious doubts as to the business relevance of the Netcraft survey. A close examination of Netcraft's methodology reveals systemic biases that may inflate Apache's market share while under-representing Microsoft IIS. More importantly, understanding the specific methodologies of both the Netcraft and Port80 surveys reveals that they are, in fact, asking very different questions.
Netcraft Casts a Wide Net (With a Few Significant Holes)
What's in a Name?
The concepts host name and domain name are often used interchangeably to mean the human readable name that is mapped to an IP address. Some authorities argue that this is a mistake, because there is a clear and hierarchical distinction between the two types of name, with one of them being contained within the other. Unfortunately, these same authorities tend to differ about which name contains which.
Some say that hostname is the more general term, being a combination of the machine name (e.g., "www") plus the domain name (e.g., "port80software.com"). Others insist that domain name is the more general term, being a combination of the host name ("www") and the domain ("port80software.com").
This ambiguity is probably rooted in the way the domain name service (DNS) evolved into the scheme described in RFC 1034. All the computers on ARPAnet (the forerunner of the Internet) were once identified by a lookup table of host names and their associated IP addresses stored in a single file, called HOSTS.TXT, that lived on a single machine in the Network Information Center (NIC) at Stanford Research Institute (SRI). As the ARPAnet grew into a network of networks, this way of keeping track of name/IP mappings became untenable, so a bunch of smart people hashed out the ideas that eventually became DNS: a hierarchical name space, separated by dots, in which names only have to be unique within their parent name, with the whole thing stored in a distributed database.
The old RFCs that guided the changeover from HOSTS.TXT to the new DNS scheme talk about transitioning from "old style host names" to "domain style host names." In essence, the old host names were folded into the new domain names, which thereby became the new host names. You can see why even experts differ about what to call these things!
Fortunately, none of this ambiguity about host names and domain names affects the fundamental issues involved in assessing the relevance of any Web server survey. The simple point to keep in mind is that a given host (or domain) name does not necessarily represent a single physical server on the Internet.
On the Web, in particular, it is far more accurate to think of hostnames or domain names as representing a single Web site. For example, a host name might point to a server farm where many physical servers are handling the requests for a single Web site. Alternatively, thanks to shared or virtual hosting, many hundreds or even thousands of active sites, each with its own host name, might be served by the same computer.
In short, whether a study uses "hostname" or "domain name" to indicate a Web site matters little, so long as it doesn't assume either one is necessarily the name of a physical server.
In compiling its most recent survey, Netcraft reports responses from 44,946,965 sites, "[collected and collated] from as many hostnames as can be found providing an HTTP service." Exactly 30,298,060 are running Apache, while 9,449,180 are running IIS. Precisely what does this mean? It goes almost without saying that no one has a definitive picture of what the Web looks like -- how many sites there are, where they are located, who is running them, and who is visiting them. Even compiling a reliable list of the most popular sites on the Web is the province of market research firms commanding high prices for their educated best guesses. No company has the resources or the means to hand-count every site on the Web, checking for mirrored content, dead links, private sites, or numerous other variables -- and any site administrator who readily reveals to any visitor exactly how his/her site works has serious security issues!
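As a quick check on how these raw counts relate to the "over two-thirds" headline, the shares can be worked out directly from the three figures quoted above. This is only an arithmetic sketch: the helper name share is our own, and the "other" bucket is simply whatever remains after Apache and IIS are subtracted.

```python
# Figures as quoted above from the November 2003 Netcraft survey.
TOTAL_SITES = 44_946_965
APACHE_SITES = 30_298_060
IIS_SITES = 9_449_180


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


apache_share = share(APACHE_SITES, TOTAL_SITES)
iis_share = share(IIS_SITES, TOTAL_SITES)
# Everything that is neither Apache nor IIS (an assumption: no overlap).
other_share = 100.0 - apache_share - iis_share

print(f"Apache: {apache_share:.1f}%")
print(f"IIS:    {iis_share:.1f}%")
print(f"Other:  {other_share:.1f}%")
```

On these uncorrected hostname counts, Apache comes out at roughly 67 percent and IIS at roughly 21 percent.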
Netcraft's solution to making sense of this muddy picture is to cast as wide a net as possible and then sift out what they believe to be live sites with unique content. To do this, their survey does not count Web servers per se, but sites, or more precisely, hostnames (see sidebar). This is a sensible step, since a single server may run numerous sites or a single site may be hosted on multiple servers. For example, a visitor to Yahoo! may be connected to any number of different servers depending on his/her location, traffic loads, and which servers are active at that moment. But, if that visitor then navigates to a friend's GeoCities homepage, he/she is likely directed to a single server hosting hundreds of other such sites.
Netcraft's focus on sampling as many sites as possible introduces a systematic overrepresentation of servers that are used to host high numbers of sites, including parked domains (hostnames pointed to a physical server but not actively used as Web sites) and sites of extremely variable quality and traffic. As an extreme example of this sampling bias, the single largest gain for Apache and loss for IIS in the November Netcraft survey came from the migration of Register.com, a domain parking service that accounts for over 1.4 million Web server responses alone in their survey.
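The effect of this bias can be illustrated with a toy calculation. The data below are entirely hypothetical (one bulk hoster serving 1,000 hostnames and three companies serving one hostname each) and are not drawn from either survey; the sketch only shows how the same population yields very different "market shares" depending on whether hostnames or operators are counted.

```python
from collections import Counter

# Hypothetical toy data: (operator, server software, hostnames served).
OPERATORS = [
    ("bulk-hoster-a", "Apache", 1000),
    ("corp-b", "Microsoft-IIS", 1),
    ("corp-c", "Microsoft-IIS", 1),
    ("corp-d", "Apache", 1),
]


def shares(weights):
    """Convert raw counts per software into percentages."""
    total = sum(weights.values())
    return {software: 100.0 * n / total for software, n in weights.items()}


by_hostname = Counter()  # every hostname counts once
by_operator = Counter()  # every operator counts once
for _name, software, hostnames in OPERATORS:
    by_hostname[software] += hostnames
    by_operator[software] += 1

hostname_shares = shares(by_hostname)
operator_shares = shares(by_operator)

print("Share by hostname:", hostname_shares)
print("Share by operator:", operator_shares)
```

Counted by hostname, the single bulk hoster's choice swamps everything else; counted by operator, the two products are tied.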
Answers and Qualifications in Netcraft's Methodology?
Netcraft does acknowledge the problem of templated, mirrored, and parked domains in their methodology; however, one must dig deeply into their site, back to the July 2000 survey, to discover their approach to this problem. There they address the radical changes in the Web in the first five years of their survey: "[W]hereas in the early days of the Web, hostnames were a good indication of actively managed content providing information and services to the internet community, the situation now is considerably more blurred." Conspicuous in its absence is any discussion of changes over the past three years in either the Web or Netcraft's survey methodology.
Netcraft utilizes a logarithmic formula to "correct" for parked domains in their survey. For example, Netcraft reported that futuresite.register.com's 1,414,626 sites were whittled down with their formula to only 515 "active" sites in the July 2000 report. Similarly, the 44.9 million sites found in the November 2003 survey are reduced to fewer than 20 million "active" sites. The obvious problem is that, even in Netcraft's "corrected" numbers, Register.com's choice of Apache is still counted 515 times, whereas Disney's choice of IIS is counted only once. Less obvious, but also worth considering, are the questions of which companies or organizations are choosing Apache, IIS, or other server platforms, and why. By their own admission, Netcraft's method of reducing the number of parked domains leaves in place a sampling bias in favor of "the cheap or free bulk hosters." The fact that a bulk hoster chose to revert to Apache to run 1.4 million domains may have more to do with its lower up-front cost than with its performance, security, or features.
Headlines Have Been Promoting the Wrong Numbers
Port80's analysis of the Netcraft survey reveals other problems. First of all, Netcraft's November 2003 survey headline uses the "uncorrected" figure of 44.9 million sites, a number that their own methodology acknowledges as inaccurate. They go on to discuss the recent migrations of Register.com, Network Solutions, and several other domain parking services and bulk hosting providers as signaling a "significant" change in Web server market share: a 2.8 percent increase in Web sites running Apache is set against a 2.44 percent drop for Microsoft IIS. Scrolling past the splashy lead to look at the "corrected" November results reveals a much more modest gain of 1.25 percent for Apache and a loss of only 1.06 percent for IIS. Of course, given the extremely rapid changes in Web technology and business, one must also be wary of calling one-month variations of a couple of percentage points "significant." Port80's monthly surveys have shown contrary fluctuations in market share -- according to the Port80 Web server survey, Apache's market share has actually decreased by 2.2 percent since January 2003, while IIS has surrendered a paltry 0.1 percent over the same period.
Returning briefly to Netcraft's choice of data, the company does possess a means to estimate numbers of physical servers at a given ISP as opposed to hostnames, but they choose not to apply this technique to the Web server survey released to the public.
Netcraft also conducts a survey of secure Web sites, which might be expected to provide market share data more relevant to e-business decision makers, but they do not generally highlight the results of that survey or provide them free of charge. The last free, publicly available results from the Netcraft SSL survey are from January 2001. These results showed IIS in the lead with 47.4 percent to Apache's 28.1 percent in a sample of 121,542 secure sites.
Unfortunately, these qualifications and other data supporting alternate conclusions are not a focus for Netcraft and are not the metrics that make headlines in the technology community. Netcraft's survey may be an accurate representation of "Web server software usage on Internet connected computers," as their methodology states, but such statistics are not synonymous with Web server market share.
Port80's Alternate Methodology Has More Relevance -- and Significance
In contrast, Port80 Software's Top 1000 Corporations Web Server Survey focuses exclusively on the corporate sites of Fortune 1000 companies. Technically, the two surveys are very similar -- each uses a header check to determine which software is being used to serve a particular hostname, and both are therefore surveys of domains rather than of actual servers. What distinguishes the two is their choice of data sample. While Netcraft's wide net captures far too many hostnames to be individually verified, and therefore includes a great many undesirable data points, Port80's small net captures exactly 1000 live sites with unique content. An apt metaphor is two fishing boats: one using an acres-wide drift net, the other fishing with individual lines. While the first boat goes home at the end of the day overflowing with tuna of all sizes, shrimp, dolphins, and garbage to be sorted out later, the second boat ends the day with nothing but prime adult tuna -- if anything else bites, the fisherman can easily determine "I don't want this" and throw it back.
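For readers unfamiliar with the technique, a header check of this general kind can be sketched in a few lines of Python using the standard http.client module. This is a minimal illustration under our own assumptions, not the tooling either survey actually uses: the function names fetch_server_header and server_family are hypothetical, and the classification relies on the conventional "Apache" and "Microsoft-IIS" prefixes in the HTTP Server response header. Because an administrator can alter or suppress that header, the result is best treated as a self-reported value.

```python
import http.client
from typing import Optional


def fetch_server_header(hostname: str, timeout: float = 10.0) -> Optional[str]:
    """Send an HTTP HEAD request for "/" and return the Server header, if any."""
    conn = http.client.HTTPConnection(hostname, 80, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        response = conn.getresponse()
        return response.getheader("Server")  # None when the header is absent
    finally:
        conn.close()


def server_family(server_header: Optional[str]) -> str:
    """Map a raw Server header value to a coarse product family."""
    if not server_header:
        return "Unknown"  # header missing or deliberately suppressed
    value = server_header.lower()
    if value.startswith("microsoft-iis"):
        return "Microsoft-IIS"
    if value.startswith("apache"):
        return "Apache"
    return "Other"
```

Calling server_family(fetch_server_header(name)) for each hostname in a sample and tallying the results gives a per-hostname breakdown; as noted above, it says nothing about how many physical machines stand behind each name.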
Beyond the question of scale, the two surveys also differ dramatically in the quality of their data sample. As opposed to asking which Web server software is most common across the whole of the Internet, the Port80 survey attempts to determine the Web server of choice among large corporations with high-volume sites and demanding business requirements. Each Fortune 1000 Web site administrator needs a server that will effectively manage high volumes of traffic and each business decision maker must carefully invest large amounts of money in Web technology. To return to the fishing metaphor, a researcher interested in maintaining healthy tuna stocks is probably more interested in what is sustaining the big fish than the scrawny ones.
With this more focused approach, Port80's monthly surveys present a clearer picture of which technology large businesses are choosing to run their Web sites. Among Port80's Fortune 1000 sample base, Microsoft IIS can accurately claim a much more impressive market share of 53.8 percent, including such heavily trafficked industry leaders as BankOne, Walt Disney and the Gap, compared to 15.4 percent for Apache. Port80's survey also reports that many Fortune 1000 companies running IIS are upgrading from Windows NT to Windows 2000 and Windows Server 2003, including Intel, Martin Marietta, and Goody's Online, reflecting a long-term commitment to the Microsoft platform. Port80 Software's Top 1000 Web Servers Survey demonstrates that Apache's market share is likely much smaller in dedicated hosting and corporate environments.
Clearly, there are numerous ways of slicing up the Web, and therefore of defining server market share. While Netcraft's survey of hostnames is touted and accepted in the industry press as representative of the actual Web server market, one should be cautious about using Netcraft data to make claims of this kind. Instead of truly examining market share, Netcraft's Web server survey is ultimately limited to the question "Which server software is most common?" and thus its conclusions are of questionable relevance to business decision makers concerned with which Web server technology to deploy for e-commerce or Web-based applications. In contrast, Port80's Fortune 1000 Web server survey seeks to answer a far more pressing question: "Which Web server software should I trust with my business?"