What the Web’s most popular sites are running on
With its Web 2.0 focus, TechCrunch is easily one of the most popular blogs out there. According to Technorati, more than 15,400 other blogs link to it, which makes it the 5th most popular blog on the Web. TechCrunch also has 151,000 feed subscribers according to FeedBurner.
FeedBurner provides RSS feed management for bloggers and other online news sources and is currently offering more than 566,000 feeds from more than 334,000 publishers. FeedBurner delivers 310 million feed views every day.
iStockPhoto is an online, royalty-free stock photo provider, and arguably the most visited stock photo provider in the world.
YouSendIt is a service that lets its users move and share large files with others over the Web instead of sending them by email. The service has more than 3 million registered users and transfers more than 1 million files daily, accounting for over 30 terabytes (30,000 gigabytes) of data.
Meebo is an in-browser messaging application supporting a variety of messaging applications such as Yahoo! Messenger, Windows Live Messenger, ICQ, AIM and Jabber. Meebo users exchange more than 70 million messages every day.
Though it doesn’t have the gargantuan user base of YouTube, Vimeo is one of the most popular video sharing sites out there and has won over many users with its elegant interface.
The tagline says it all: “Website traffic comparisons on steroids.” Alexaholic uses data from Alexa to display web traffic trends.
TechCrunch, FeedBurner, iStockPhoto, YouSendIt, Meebo, Vimeo and Alexaholic. These are some of the most popular websites on the Internet. You have heard about them, you have read about them and you have most likely used or visited at least one of them. But how often have you read about what these websites are actually running on? This article dives into the facts and figures about the underlying hardware and software that keep these sites running smoothly in spite of their massive popularity.
Pingdom performed a survey of these seven “super sites” that focused on web, database and file server numbers and setup, operating systems, bandwidth usage, network hardware and other technical questions relevant to maintaining a site. For those who are interested in the nitty-gritty details there is a PDF matrix with the survey results attached to this article.
The variety of websites in the survey gives a good cross section of different kinds of setups. They all represent the crème de la crème in their respective categories, including blogging and blogging tools, stock photo libraries, file sharing, instant messaging, video sharing, and web statistics.
Though statistically these seven sites are only a small drop in an ocean of websites (there are more than 100 million domain names on the Internet), in many respects they show a surprising consistency in their choices of underlying technology, usually with a strong bias towards open source. These are some of the common trends we found during the survey.
Penguin the most popular server animal
Linux rules the game with these sites. All except one use Linux exclusively; the exception is Alexaholic, which is hosted on Windows. Not a single site uses the otherwise popular FreeBSD operating system.
“Linux was selected for multiple reasons,” says Jan Mahler, network operations manager at YouSendIt. “It has a proven track record in scaling, open source code to allow for altering code as necessary, price, excellent support if necessary and ease of finding talent to support and maintain it.”
Similar sentiments are shared by all of the companies in the survey that use Linux.
“Initially, the fact that the software stack was free (as in beer) had a major influence on our decision,” says Brent Nelson, senior systems administrator at iStockPhoto. “But moving forward, standardness and supportability started becoming major factors. Using the big-name Linux distributions gives us support with big-name hardware vendors, and vice versa. Commercial solutions for backup and site acceleration are supported under Linux on x86.”
Apache serves the most pages
With Linux hosting comes the common use of the Apache web server. It’s by far the most deployed web server on the Internet with a 58.7% market share (Netcraft), so it’s only natural that it would also be used by a majority of the sites in the survey.
However, even though it’s the behemoth in the web server market, Apache is slowly losing ground to competing platforms such as Microsoft’s IIS and up-and-comers like Lighttpd, at least according to data from companies such as Port 80 and Netcraft.
MySQL dominates the databases
With open source ruling the game it shouldn’t come as a surprise that the database of choice for all but one of the sites is MySQL, the ultra-popular Swedish open-source database.
“The features that you get for free on MySQL, with replication, in-memory and fault-tolerant databases (if using MySQL cluster), transaction support, and the wicked performance, cost thousands of dollars with other database engines,” says Joseph Kottke, director of network operations at FeedBurner.
These sentiments are echoed by the other participants in the survey as well.
“We needed something proven, flexible and low-cost,” says Simon Yeo, director of operations at Meebo. (His alternative title is “ops guy.” Other laid-back titles at Meebo include “marketing dude,” “server chick,” and “Mr. Sparkle.”)
These sites are far from alone in favoring MySQL. According to the MySQL website it’s the fastest-growing database in the industry, with more than 10 million active installations and 50,000 daily downloads.
PHP rules server-side scripting
Just as Apache is the most common web server software, PHP rakes in another “win” for open source when it comes to server-side scripting languages. PHP has been the most popular server-side scripting language for years and will probably remain so for some time, despite the hype around Ruby on Rails and other frameworks and scripting languages that are growing in popularity.
As of November 2006, there were more than 19 million websites (domain names) using PHP.
Clustering for reliability and performance
Clustering servers can improve availability and performance, and it helps with load balancing. Five of the seven sites use clustering for their web servers, and four of them use it for their database servers.
It should be noted that since TechCrunch only uses one web server and one database server, it can’t, and doesn’t need to, do clustering. In other words you could say that five out of six sites with multiple web and database servers in this survey use clustering.
Going against the grain
Just as the sites have some things in common, there are those that stand out from this (admittedly small) crowd with slightly more unconventional software choices.
Meebo with Lighttpd
Even though it runs on Linux servers, Meebo has avoided the use of Apache in favor of Lighttpd (pronounced “lighty”), a smaller, more lightweight web server that is often credited with better performance than Apache.
“Lighttpd tends to work really well with AJAX-based sites like ours,” says Simon Yeo.
Incidentally, Lighttpd is also used by giants such as YouTube and Wikipedia, and is also very popular with the Ruby on Rails community.
Lighttpd may only have a small percentage of the web server market right now, but it is growing extremely fast and will be a serious challenger down the line. Netcraft data from February shows a jump from 170,000 websites to 700,000 websites in just a month. That’s roughly a fourfold jump (an increase of more than 300%) in a very short time frame.
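To make the growth arithmetic explicit, here is a small sketch that computes both the growth multiple and the percentage increase from the two Netcraft figures quoted above. The function name is purely illustrative; only the two website counts come from the article.

```python
from typing import Tuple


def growth(before: int, after: int) -> Tuple[float, float]:
    """Return (multiple, percent_increase) between two counts."""
    multiple = after / before
    percent_increase = (after - before) / before * 100
    return multiple, percent_increase


# Lighttpd site counts quoted from Netcraft in the article.
multiple, pct = growth(170_000, 700_000)
print(f"{multiple:.1f}x the starting count, a {pct:.0f}% increase")
# prints "4.1x the starting count, a 312% increase"
```

The distinction matters when comparing growth claims: ending at about four times the starting count is an increase of a little over 300%, not 400%.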
Alexaholic with Windows, IIS and MS SQL Server
Alexaholic is the only website in the survey that doesn’t run on Linux servers. It uses ASP.NET 2.0 on Windows with Internet Information Services (IIS) together with MS SQL Server. Ron Hornbaker, the man behind Alexaholic, cites his familiarity with the .NET platform as the main reason for his choice of platform.
“I’m most comfortable coding with C#.NET, and this was a personal project,” he says.
Ron Hornbaker built the first version of Alexaholic in just one (admittedly intense) weekend, which can definitely be seen as proof that the ASP.NET environment can be very productive.
As noted earlier in this article, IIS is actually gaining ground on Apache according to several sources, so even though Alexaholic is the only site in this survey to use Windows and IIS, it has plenty of company, with 31.1% of the Internet’s websites hosted on IIS (Netcraft).
Server-side Java with Apache Tomcat
Both TechCrunch and FeedBurner run Apache Tomcat for Java servlet support. Even so, the ever-popular PHP is still used as well. (TechCrunch uses the WordPress open-source blog software, which uses PHP and MySQL.)
In addition to this, FeedBurner also uses Perl, the previous champion of CGI scripts.
Different needs = different setups
Blogs deliver mostly static content, and even very popular ones can run very well with just a couple of servers, as is the case with TechCrunch. Alexaholic, in spite of delivering massive amounts of statistics, can get away with two web servers and two database servers since it can pull a lot of its data directly from Alexa. Add more dynamic content, or streaming content, and the game changes.
In addition to their web and database servers, YouSendIt has 170 file servers split between the U.S. east and west coast just to deliver files. Vimeo has 100 content delivery servers for the sole purpose of streaming video. Meebo has more than 40 web servers to handle their AJAX-based messaging application, and FeedBurner uses 70 web servers and 15 database servers to, pardon the pun, feed its feeds. It even has a replicated second site with just as many servers.
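To get a feel for why a file transfer service needs that much hardware, here is a back-of-envelope sketch based on the YouSendIt figures quoted in this article (more than 1 million files and over 30 terabytes per day, served by 170 file servers). It assumes decimal units, traffic spread perfectly evenly over 24 hours, and an even split across file servers. None of that is true of real traffic, so treat the results as rough averages, with peaks likely to be well above them.

```python
# Figures quoted in the article (both are stated as lower bounds).
FILES_PER_DAY = 1_000_000
BYTES_PER_DAY = 30 * 10**12   # 30 terabytes, decimal units as in the article
FILE_SERVERS = 170
SECONDS_PER_DAY = 24 * 60 * 60


def average_file_size_mb(bytes_per_day: int, files_per_day: int) -> float:
    """Average transferred file size in megabytes."""
    return bytes_per_day / files_per_day / 10**6


def sustained_rate_gbps(bytes_per_day: int) -> float:
    """Average throughput in gigabits per second, assuming even traffic."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 10**9


avg_mb = average_file_size_mb(BYTES_PER_DAY, FILES_PER_DAY)
rate_gbps = sustained_rate_gbps(BYTES_PER_DAY)
# Assumption: load is split evenly across all file servers.
per_server_mbps = rate_gbps * 1000 / FILE_SERVERS

print(f"Average file size: {avg_mb:.0f} MB")
print(f"Sustained rate: {rate_gbps:.2f} Gbit/s")
print(f"Per file server: {per_server_mbps:.1f} Mbit/s")
```

Under these assumptions the average file is about 30 MB and the service pushes close to 3 Gbit/s around the clock.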
The greatest technical challenge
“The greatest challenge was finding the most efficient ways to locate hotspots and bottlenecks in the application,” says Joseph Kottke (FeedBurner). “Once we came up with a loose methodology for locating problems, the analysis became very easy. Detailed monitoring was crucial in this, keeping track of disk, CPU and memory usage, slow database queries, handler details in MySQL, etc.”
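As a rough illustration of the kind of monitoring described in the quote, the sketch below samples disk usage and system load and records any operation that runs longer than a threshold, loosely in the spirit of a slow query log. It is not FeedBurner’s tooling: the names `snapshot`, `timed`, `slow_log` and the 0.5 second threshold are all assumptions made for this example, and it uses only the Python standard library.

```python
import os
import shutil
import time
from contextlib import contextmanager

SLOW_THRESHOLD_SECONDS = 0.5  # assumed cutoff for what counts as "slow"
slow_log = []                 # (label, elapsed_seconds) for slow operations


def snapshot(path="/"):
    """Sample basic resource usage for the host this script runs on."""
    usage = shutil.disk_usage(path)
    try:
        load_1min = os.getloadavg()[0]  # Unix only
    except (AttributeError, OSError):
        load_1min = None                # not available on this platform
    return {
        "disk_used_pct": usage.used / usage.total * 100,
        "load_1min": load_1min,
        "cpu_count": os.cpu_count(),
    }


@contextmanager
def timed(label, threshold=SLOW_THRESHOLD_SECONDS):
    """Time a block of work and record it if it reaches the threshold."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if elapsed >= threshold:
            slow_log.append((label, elapsed))


if __name__ == "__main__":
    print(snapshot())
    with timed("example query"):
        time.sleep(0.6)  # stand-in for a slow database call
    print(slow_log)
```

Collecting samples like these at regular intervals, and reviewing the slow entries, is one simple way to start locating the hotspots the quote refers to.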
There seems to be a general agreement that it’s a difficult challenge to scale a website gracefully as the number of visitors grows. It’s also extremely important if you don’t want to lose your momentum. The Web is fickle, and if your service doesn’t perform and deliver the goods, users will soon go somewhere else. For FeedBurner, and any site for that matter, it’s extremely important to know where your bottlenecks are, a sentiment echoed by Brent Nelson (iStockPhoto) and the other participants in the survey:
“Pretty much every aspect of the site has been a bottleneck at some time,” he says. “Database servers, PHP sessions, web server load due to PHP execution, network, storage systems have all caused performance issues in the past. No single approach would be appropriate for every decision – and previous decisions need to be re-evaluated as you scale to the next level.”
There is another aspect of scaling that also comes into play, and that is functionality. How much functionality can you add before it starts to confuse your users?
“It’s a challenge to balance simplicity with functionality,” says Ron Hornbaker (Alexaholic). “I’m always tempted to throw more features in, but sometimes less is more.”
The big money sinks
The general consensus is that bandwidth costs account for a large share of the operating expenses. Several of the sites stated that bandwidth is the single most expensive aspect of running their site, but the more hardware you have, the more server costs, power consumption and co-location hosting also become significant expenses.
Joseph Kottke from FeedBurner sums it up as he comments on FeedBurner’s main operating expenses: “Hosting costs: Cages and cabinets, and power. Sweet, precious power.”
The power costs are a real problem for anyone with a lot of hardware. Google, operating a server park with more than 200,000 servers, has long been lobbying for server hardware with more efficient power consumption.
As could have been guessed, LAMP (Linux, Apache, MySQL and PHP) is by far the most common setup of the surveyed websites. The dominance is far from total, though, and the elements of LAMP are all challenged by alternatives that are growing in popularity.
It’s worth noting that even among the alternatives, most are open source. Open source fits well with the Internet’s focus on standardization, more so now than ever before.
It’s also interesting that most of the companies justified their choice of technology partly with the word “familiarity.” They used the technology they were most comfortable with, all the way from the one-man project Alexaholic to the larger sites in the survey.
“iStockphoto grew out of a web development and hosting company,” says Brent Nelson. “We used PHP, MySQL and Linux to build client sites, so it was a logical choice to use it to build our own.”
Since it’s human nature to stick with what we know and what we feel comfortable with (which is often a smart choice, productivity-wise), any new technology needs to be significantly better, and significantly more comfortable to use, than the existing market leaders before the majority of people will switch.
Pingdom intends to do a similar survey every year to follow the technical trends and provide insights into the workings of massively popular websites such as these.
We would like to thank all of those who participated in the survey for being so helpful and generous in providing information and insights about the operation of their websites.
Learn and be inspired by these companies and individuals. If you have a blog with the hopes of becoming big, you may not need that much in terms of hardware. But if you’re building a web app that’s going to handle a million file transfers daily, take a look at YouSendIt and prepare for some serious investments.
How the survey was made
The seven participants all responded to a set of 28 survey questions (all responses available in the PDF matrix) plus a number of follow-up questions about their website infrastructure where they could further explain their choices.