Crawling results

Recently I made a web crawler that sends a YouTube link to a rickroll video as its referrer, to rickroll webmasters who watch their server logs like hawks. I also used the opportunity to collect some data about the websites it visited.

The crawler itself is very basic: it visits a site, crawls that site, and adds any external links to a queue of new sites to visit. When an external link is found, it is stripped before it is put on the queue, so a link to a page deep inside another site only puts that site itself on the queue, not the full link. The crawler also has a limit of 500 subpages per domain, to stop it from spending too much time on one domain.
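The loop described above could look roughly like this in Python. This is a minimal sketch, not the crawler's actual code: `get_links` is a hypothetical callable that returns the href values found on a page, and I am assuming that "stripping" a link means keeping only its scheme and host.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit

MAX_PAGES_PER_DOMAIN = 500  # the per-domain limit mentioned above


def site_root(url):
    """Strip a URL down to scheme and host (assumed meaning of 'stripped')."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def crawl_site(root, get_links, max_pages=MAX_PAGES_PER_DOMAIN):
    """Crawl one site; return (pages visited, set of external site roots).

    get_links(url) is a hypothetical callable returning the href values
    found on that page; fetching and HTML parsing are left out here.
    """
    host = urlsplit(root).netloc
    pending = deque([root])
    seen = {root}
    external = set()
    visited = 0
    while pending and visited < max_pages:
        page = pending.popleft()
        visited += 1
        for href in get_links(page):
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(page, href)).url
            parts = urlsplit(link)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript: and similar
            if parts.netloc == host:
                if link not in seen:
                    seen.add(link)
                    pending.append(link)
            else:
                external.add(site_root(link))
    return visited, external


def crawl(start, get_links, max_sites=None):
    """Visit sites in queue order, queueing each newly found external site."""
    first = site_root(start)
    queue = deque([first])
    known = {first}
    visited_sites = []
    while queue and (max_sites is None or len(visited_sites) < max_sites):
        root = queue.popleft()
        visited_sites.append(root)
        _, external = crawl_site(root, get_links)
        for other in sorted(external):
            if other not in known:
                known.add(other)
                queue.append(other)
    return visited_sites
```

Passing the link extractor in as a parameter keeps the queue logic separate from the network code, so it can be tried out with a small dictionary of fake pages before pointing it at real sites.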

Whenever the crawler visited a new site, I stored the Server header from the response and timed how long it took to download the page. It does not download any images, style sheets or scripts, just whatever is returned when that URL is requested.
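For illustration, a single measurement along those lines can be sketched with Python's standard library. The `REFERER` value is a placeholder rather than the real video link, and the names `build_request` and `timed_fetch` are made up for this example.

```python
import time
import urllib.request

# Placeholder only; the real crawler sends a YouTube link here.
REFERER = "https://video.example/watch"


def build_request(url, referer=REFERER):
    """Create a GET request that carries the chosen Referer header."""
    return urllib.request.Request(url, headers={"Referer": referer})


def timed_fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return (Server header or None, seconds taken to download the page)."""
    request = build_request(url)
    started = time.perf_counter()
    with opener(request, timeout=timeout) as response:
        response.read()  # only the document itself; no images, CSS or scripts
        server = response.headers.get("Server")
    return server, time.perf_counter() - started
```

The `opener` parameter exists so the function can be exercised without touching the network. With the default `urlopen`, error responses such as 404 raise `urllib.error.HTTPError`, which a real crawler would have to catch.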


In about a week's time the crawler visited over 11 thousand different domains. That would be over 5 million subpages if the crawler had visited 500 subpages on each domain. I doubt that many of the domains had that many subpages, but I did not record how many pages were visited, so this is only an upper estimate of how many subpages there could have been.

The server software was determined by looking at the Server field in the response header. This field is optional, and around half of the servers did not send it, so I only ended up with around 6100 entries. It is also possible to change what is sent back in that field, and a few sites had changed it to some funny phrase or used it as a shout-out to honour some developer or friend.

Reddit's Server header: '; drop table servertypes; --


Apache is the most common server, which is no surprise, but nginx had a much higher market share than I thought it would have. Microsoft IIS also has a decent share of the market. The reason GSE has such a big share is that the crawler ended up crawling a lot of Google-owned domains, such as blogspot. On blogspot each blog has its own subdomain, and the crawler sees them as individual servers and stores the data as such. I have labelled a bunch of different servers as others, as they include trash values such as the one from reddit, and software that was only seen a few times.
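Grouping the raw header values into a handful of labels can be sketched like this. The normalisation rule (keep the text before the first slash or space, lowercased) and the `min_count` cut-off for "others" are assumptions made for the example, not the rules actually used for the chart.

```python
from collections import Counter


def server_name(header_value):
    """Reduce a value like 'Apache/2.2.22 (Ubuntu)' to 'apache'."""
    tokens = header_value.split("/")[0].split()
    return tokens[0].lower() if tokens else ""


def tally(header_values, min_count=2):
    """Count servers by name, folding rare or empty names into 'others'."""
    counts = Counter(server_name(v) for v in header_values if v)
    grouped = Counter()
    for name, count in counts.items():
        label = name if name and count >= min_count else "others"
        grouped[label] += count
    return grouped
```

With a cut-off like this, joke values and one-off server names end up in the same bucket without needing a hand-written list.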


nginx is the fastest of the top 3 servers, with IIS not far behind. On average Apache is almost 600 ms slower than nginx. I noticed this myself when I was hosting this blog on my VPS: at first I ran Apache and later switched to nginx. Apache wasn't slow, since my blog is not big and does not receive a lot of hits, but even for this small blog I noticed that it loaded faster when running nginx. Other blog posts say the same: their authors noticed that pages loaded faster, and busy sites found that they could handle more users with less hardware.
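A comparison like this comes down to a per-server average of the measured download times. A small sketch, assuming the measurements are available as (server name, seconds) pairs; the function name is made up for the example:

```python
from collections import defaultdict
from statistics import mean


def average_load_time(samples):
    """Map each server name to the mean of its download times in seconds."""
    by_server = defaultdict(list)
    for name, seconds in samples:
        by_server[name].append(seconds)
    return {name: mean(times) for name, times in by_server.items()}
```

A plain mean is sensitive to a few very slow sites, so a median (`statistics.median`) might be worth comparing against it.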

There is a lot of hate towards Microsoft products from geeks in general, but it seems that Microsoft is at least doing something right with IIS, which according to these results scores better than Apache on page load speed.