Determine the Hardware Requirements
The total number of pages to crawl and their average size give a rough estimate of the disk space requirement.
Disk Space
The crawler uses disk space to store crawled documents. Pages are stored compressed. For an internet crawl, the average compressed page size is about 5kB. For 1 million documents, this amounts to 5GB.
The crawler needs 2 to 3 disk accesses (I/Os) for each crawled document. At a crawl rate of roughly one document per second per thread, this corresponds to 2 IO/s per thread, so 100 IO/s for 50 threads. For reference, this is enough to saturate a standard SATA disk. Fast crawlers require RAID arrays, SAS disks, or SSDs.
During compaction operations, the crawler storage may temporarily use 50% more disk space.
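As a rough illustration, the Python sketch below combines the disk figures above: 5kB per compressed page, 2 disk accesses per document at an assumed rate of one document per second per thread, and a 50% compaction overhead. The document and thread counts are example inputs, not fixed values.

# Rough disk sizing from the figures above; all inputs are example values.
AVG_COMPRESSED_PAGE = 5_000      # ~5kB per compressed page (internet crawl)
IO_PER_DOCUMENT = 2              # lower bound of the 2-3 disk accesses per document
DOCS_PER_SECOND_PER_THREAD = 1   # assumed crawl rate per thread
COMPACTION_OVERHEAD = 1.5        # compaction may temporarily use 50% more space

def disk_requirements(num_documents, num_threads):
    storage = num_documents * AVG_COMPRESSED_PAGE   # steady-state storage
    peak = storage * COMPACTION_OVERHEAD            # peak during compaction
    iops = num_threads * IO_PER_DOCUMENT * DOCS_PER_SECOND_PER_THREAD
    return storage, peak, iops

storage, peak, iops = disk_requirements(num_documents=1_000_000, num_threads=50)
print(f"storage: {storage / 1e9:.1f} GB, peak: {peak / 1e9:.1f} GB, {iops:.0f} IO/s")
# -> storage: 5.0 GB, peak: 7.5 GB, 100 IO/s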
CPU
Nowadays, CPU is not a bottleneck for the crawler. A single AMD Opteron 2346 HE core (a 2007 CPU) supports up to 40 crawl threads.
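If a core count is needed, it can be derived from that figure; the sketch below assumes 40 threads per core, which is conservative on more recent CPUs.

import math

THREADS_PER_CORE = 40  # one 2007-era Opteron core handles up to 40 crawl threads

def cores_needed(num_threads):
    return math.ceil(num_threads / THREADS_PER_CORE)

print(cores_needed(100))  # -> 3 cores for a 100-thread crawler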
Network Bandwidth
Network bandwidth use is computed directly from the average document size and the crawl speed.
On the internet, an average document size of 30kB, at a crawl rate of roughly one document per second per thread, means about 200kbps of bandwidth per thread, or 20Mbps for 100 threads.
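The same estimate in code; the per-thread crawl rate is an assumption chosen to match the ~200kbps per-thread figure above.

AVG_DOCUMENT_SIZE = 30_000         # ~30kB per document (internet crawl)
DOCS_PER_SECOND_PER_THREAD = 0.83  # assumed crawl rate, giving ~200kbps per thread

def bandwidth_bps(num_threads):
    # bytes per second converted to bits per second
    return num_threads * DOCS_PER_SECOND_PER_THREAD * AVG_DOCUMENT_SIZE * 8

print(f"{bandwidth_bps(100) / 1e6:.0f} Mbps")  # -> 20 Mbps for 100 threads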
Memory
A crawler has a base memory consumption of 500MB to 1GB, plus 50-100MB per thread, depending on crawled documents.
Each stored document also uses about 40B of memory. Example: with 50 threads and 10 million crawled documents, a crawler uses 500MB + 50 * 50MB + 10,000,000 * 40B, or about 3.4GB.
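The example above written as a small sizing function; the per-thread figure uses the 50MB lower bound from this section.

BASE_MEMORY = 500 * 10**6        # 500MB base consumption (can be up to 1GB)
MEMORY_PER_THREAD = 50 * 10**6   # 50MB per thread (can reach 100MB)
MEMORY_PER_DOCUMENT = 40         # ~40B per stored document

def crawler_memory(num_threads, num_documents):
    return (BASE_MEMORY
            + num_threads * MEMORY_PER_THREAD
            + num_documents * MEMORY_PER_DOCUMENT)

print(f"{crawler_memory(50, 10_000_000) / 1e9:.1f} GB")  # -> 3.4 GB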
Temporary Memory Use
During compaction operations, the crawler may temporarily use twice the size of the 80 biggest crawled documents.
For 30MB PDF documents, this amounts to 2 * 80 * 30MB, or 4.8GB.
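The same worked example as code, assuming the 80 largest crawled documents are all 30MB PDFs.

COMPACTION_FACTOR = 2    # compaction may hold twice the size of the biggest documents
BIGGEST_DOCUMENTS = 80   # number of largest crawled documents considered

def compaction_memory(max_document_size):
    return COMPACTION_FACTOR * BIGGEST_DOCUMENTS * max_document_size

print(f"{compaction_memory(30 * 10**6) / 1e9:.1f} GB")  # -> 4.8 GB for 30MB documents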