Hardware Sizing

To size resources for a crawler correctly, you need the following inputs:

  • The total number of sites and pages expected.

  • The average page size. It is usually about 30kB on the web, and larger on intranet sites containing office documents.

  • The expected crawl speed and refresh period (taking into account the throttle time and crawl speed limit).

This section explains how to determine the hardware requirements and the crawl speed.

Determine the Hardware Requirements

The total number of pages multiplied by the average page size gives a rough estimate of the disk space requirement.

Disk Space

The crawler uses space to store crawled documents. Pages are compressed. For an internet crawl, the average compressed page size is about 5kB. For 1 million documents, this amounts to 5GB.

The crawler needs 2 to 3 disk accesses (I/Os) for each crawled document. At about one document per second per thread, this corresponds to roughly 2 IO/s per thread, or 100 IO/s for 50 threads, which is enough to saturate a standard SATA disk. Fast crawlers therefore require RAID arrays, SAS disks, or SSDs.

During compaction operations, the crawler storage may temporarily use 50% more disk space.
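As a rough sketch, these figures can be combined into a small Python estimate. The constants below are the averages quoted in this section (5kB per compressed page, about 2 IO/s per thread, 50% compaction headroom); substitute your own measurements.

    # Disk sizing sketch; constants are this section's averages, not measured values.
    AVG_COMPRESSED_PAGE_KB = 5        # average compressed page size on the internet
    COMPACTION_HEADROOM = 1.5         # compaction may temporarily use 50% more space
    IO_PER_SEC_PER_THREAD = 2         # 2-3 I/Os per document at ~1 doc/s per thread

    def disk_space_gb(num_docs):
        """Stored size plus temporary compaction headroom, in GB."""
        return num_docs * AVG_COMPRESSED_PAGE_KB * COMPACTION_HEADROOM / 1e6

    def iops_needed(threads):
        """Sustained disk I/O rate required by the crawl threads."""
        return threads * IO_PER_SEC_PER_THREAD

    print(disk_space_gb(1_000_000))   # 7.5 GB peak (5GB stored + 50% headroom)
    print(iops_needed(50))            # 100 IO/s, enough to saturate a SATA disk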

CPU

CPU is usually not a bottleneck for the crawler: a single AMD Opteron 2346 HE core (a 2007 CPU) supports up to 40 crawl threads.

Network Bandwidth

Network bandwidth usage is computed directly from the average document size and the crawl speed.

On the internet, an average of 30kB per document means about 200kbps of bandwidth per thread, or 20Mbps for 100 threads.
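A minimal sketch of this calculation, assuming the 30kB internet average and a crawl speed of about 0.8 document per second per thread (replace both with your own figures):

    # Bandwidth sketch: average document size x crawl speed (assumed averages).
    AVG_PAGE_KB = 30                  # average document size on the internet

    def bandwidth_mbps(threads, docs_per_sec_per_thread=0.8):
        """Required network bandwidth in Mbps."""
        return threads * docs_per_sec_per_thread * AVG_PAGE_KB * 8 / 1000

    print(bandwidth_mbps(1))          # ~0.19 Mbps, i.e. about 200kbps per thread
    print(bandwidth_mbps(100))        # ~19.2 Mbps, i.e. about 20Mbps for 100 threads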

Memory

A crawler has a base memory consumption of 500MB to 1GB, plus 50 to 100MB per thread, depending on the crawled documents.

Each stored document also uses about 40B of memory. Example: with 50 threads and 10 million crawled documents, a crawler uses 500MB + 50 * 50MB + 10,000,000 * 40B, or about 3.5GB.
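The same example can be checked with a short sketch; the base and per-thread figures below are the lower bounds quoted above and should be raised for document-heavy crawls:

    # Memory sketch: base + per-thread overhead + ~40B per stored document.
    def crawler_memory_mb(threads, stored_docs, base_mb=500, per_thread_mb=50):
        return base_mb + threads * per_thread_mb + stored_docs * 40 / 1e6

    print(crawler_memory_mb(50, 10_000_000))   # 3400 MB, about 3.5GB as in the example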

Temporary Memory Use

During compaction operations, the crawler may temporarily use twice the size of the 80 biggest crawled documents.

For 30MB PDF documents, this amounts to 4.8GB.
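A sketch of this headroom calculation; the 80-document figure comes from this section, and the 30MB document size is the example's assumption:

    # Temporary compaction memory: twice the size of the 80 biggest documents.
    def compaction_memory_gb(biggest_doc_mb, n_biggest=80):
        return 2 * n_biggest * biggest_doc_mb / 1000

    print(compaction_memory_gb(30))   # 4.8 GB for 30MB PDF documents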

Determine the Crawl Speed

The number of pages you need to crawl per day (that is, the refresh speed) determines the required crawl speed.

For a single host, the maximum crawl speed is one page per interval equal to the response time plus the throttle time. On the internet, the average speed is 0.5 to 1 page per second per thread, so one crawl thread is usually enough to crawl 2 to 5 hosts in parallel.

Example: Crawl 100 different hosts with an average of 10000 pages each on the internet. A single host cannot be crawled faster than one page every 2.5 seconds, so each host can be crawled in 10000 * 2.5s = 25000s, about 7 hours. If you need to refresh every page every day, that makes 100 * 10000 = 1 million pages per day, or about 12 docs/s. With a conservative thread crawl speed of 0.5 doc/s/thread, this corresponds to 24 threads.
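The example can be reproduced with a short sketch; the per-host delay (2.5s) and per-thread speed (0.5 doc/s) are the conservative assumptions used above:

    # Crawl-speed sketch for the example above (assumed per-host delay and thread speed).
    def hours_per_host(pages_per_host, seconds_per_page=2.5):
        """Time to refresh one host, limited by its per-host politeness delay."""
        return pages_per_host * seconds_per_page / 3600

    def threads_needed(hosts, pages_per_host, refresh_days=1, docs_per_sec_per_thread=0.5):
        """Crawl threads needed to refresh every page within the refresh period."""
        docs_per_sec = hosts * pages_per_host / (refresh_days * 86_400)
        return docs_per_sec / docs_per_sec_per_thread

    print(hours_per_host(10_000))        # ~6.9 hours to refresh one host
    print(threads_needed(100, 10_000))   # ~23.1, round up to 24 threads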