Power Computing

I'm in the market for a new computer since my old machine just can't grok the large datasets that I am throwing at it.  I asked Paul Heaton, a very smart and productive econometrician with RAND who works with very big datasets, for his advice.  He sent me the following which I thought might interest others.  Your comments appreciated as well. 

1. It is very hard to find a desktop system that accepts more than
8 GB of RAM, and RAM is probably the biggest factor affecting Stata
performance.  A 64-bit workstation or server architecture allows for
more processors and more RAM, but these components usually cost 3-4
times as much as a comparably performing desktop. If you want the
absolute best performance (i.e., more than 4 processor cores, 16 or
32 GB of RAM), you'll probably need to go the workstation route. A
good configuration will run you $4K versus probably $1K for a top-end
desktop.
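To see why the RAM ceiling matters, note that Stata holds the entire dataset in memory, so a quick back-of-envelope calculation tells you whether a machine can handle your data at all. The dataset dimensions below are hypothetical, chosen only to illustrate how quickly a big dataset eats 8 GB:

```python
# Rough estimate of the RAM a rectangular dataset needs when every
# value is held in memory, as Stata does with a loaded dataset.

def dataset_gb(observations, variables, bytes_per_value=8):
    """Approximate in-memory size in GB, assuming each value is
    stored at `bytes_per_value` bytes (8 = double precision)."""
    return observations * variables * bytes_per_value / 1024**3

# Hypothetical example: 10 million observations x 100 double variables
size = dataset_gb(10_000_000, 100)
print(f"{size:.1f} GB")  # ~7.5 GB -- already brushing the 8 GB desktop ceiling
```

Storing variables at narrower types shrinks this, but the point stands: datasets in the tens of millions of rows push past what a stock desktop can hold.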

2. I've used a top-end desktop configuration with a quad-core processor
and 8 GB of RAM to run things like NIBRS or value-added models using all
the students in New York City and gotten adequate performance, but
expandability is key.

3. If you want to run Windows, you'll need a 64-bit version. I use
Vista Business, which seems to work well for me. You'll need Stata to
send you a 64-bit version and a new license; converting your Stata
license from 32-bit to 64-bit is cheap. You'll also want to pay to
upgrade Stata to support the appropriate number of processor cores in
your new machine (much more expensive), but this boosts performance
appreciably.

4. I suggest setting up your hard drives in a RAID configuration. You
buy four identical hard drives of size X GB instead of just one, plus a
controller card. The controller card spreads your data across two of
the drives and makes a mirror copy of those drives on the other two;
this is done transparently, so from the user's perspective it is as
though you have a single drive of size 2X GB (there are other ways of
doing RAID, but these are less relevant for your situation). There are
two major advantages to this: 1) The hard drive is often the bottleneck,
particularly when loading large datasets; by parallelizing the
operations across four drives instead of one, your datasets load and
write a lot faster. 2) Because a complete copy of your data is
maintained on-the-fly, when one of your hard drives fails,
instead of losing data or being forced into an onerous restoration of
backups, you simply see an alarm alerting you to the problem.  Decent
RAID cards run about $200, and disk storage is cheap, so I think
this is something everyone who does serious data analysis ought to be
doing.
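The four-drive layout described above (striping across two drives, mirrored onto the other two, commonly called RAID 10) can be sketched as a toy model in a few lines of Python. A real controller card does this in hardware, transparently; the data and block sizes here are made up purely for illustration:

```python
# Toy model of a four-drive RAID 10 array: stripe data across two
# drives, mirror each striped drive onto a second one.

def raid10_write(data, block_size=4):
    """Split `data` into blocks, alternate blocks between two striped
    drives, and keep an on-the-fly copy of each on a mirror drive."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    stripe_a = blocks[0::2]  # even-numbered blocks -> drive 1
    stripe_b = blocks[1::2]  # odd-numbered blocks  -> drive 2
    # drives 3 and 4 are exact mirrors of drives 1 and 2
    return [stripe_a, stripe_b, list(stripe_a), list(stripe_b)]

def raid10_read(drives):
    """Reassemble the data; if a drive has failed (None), read from
    its mirror instead, so nothing is lost."""
    a = drives[0] if drives[0] is not None else drives[2]
    b = drives[1] if drives[1] is not None else drives[3]
    out = []
    for i in range(len(a) + len(b)):
        out.append(a[i // 2] if i % 2 == 0 else b[i // 2])
    return "".join(out)

drives = raid10_write("abcdefghijklmnop")
drives[0] = None              # simulate one drive failing
print(raid10_read(drives))    # prints "abcdefghijklmnop" -- no data lost
```

Reads can be served from either half of each mirrored pair, which is where the speedup on large dataset loads comes from, and the failure case above shows why a dead drive produces an alarm rather than a restore-from-backup ordeal.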

