Power Computing

by Alex Tabarrok on January 20, 2009 at 1:10 pm in Web/Tech

I'm in the market for a new computer since my old machine just can't grok the large datasets that I am throwing at it. I asked Paul Heaton, a very smart and productive econometrician at RAND who works with very big datasets, for his advice. He sent me the following, which I thought might interest others. Your comments are appreciated as well.

1. It is very hard to find a desktop system that accepts more than 8 GB of RAM, and RAM is probably the biggest factor affecting Stata performance. A 64-bit workstation or server architecture allows for more processors and more RAM, but these components usually cost 3-4 times as much as a comparably performing desktop. If you want the absolute best performance (i.e., more than 4 processor cores, 16 or 32 GB of RAM), you'll probably need to go the workstation route. A good configuration will run you $4K versus probably $1K for a top-end desktop.

2. I've used a top-end desktop configuration with a quad-core processor and 8 GB of RAM to run things like NIBRS or value-added models using all the students in New York City and gotten adequate performance, but expandability is key.

3. If you want to run Windows, you'll need a 64-bit version. I use Vista Business, which seems to work well for me. You'll need Stata to send you a 64-bit version and a new license; converting your Stata license from 32-bit to 64-bit is cheap. You'll also want to pay to upgrade Stata to support the appropriate number of processor cores in your new machine (much more expensive), but this boosts performance appreciably.

4. I suggest setting up your hard drives in a RAID configuration. You buy four identical hard drives of size X GB instead of just one, plus a controller card. The controller card spreads your data across two of the drives and makes a mirror copy of those drives on the other two; this is done transparently, so from the user's perspective it is as though you have a single drive of size 2X GB (there are other ways of doing RAID, but these are less relevant for your situation). There are two major advantages to this: 1) The hard drive is often the bottleneck, particularly when loading large datasets; by parallelizing the operations across four drives instead of one, your datasets load and write a lot faster. 2) Because a complete copy of your data is maintained on-the-fly, when one of your hard drives fails, instead of losing data or being forced into an onerous restoration of backups, you simply see an alarm alerting you to the problem. Decent RAID cards run about $200, and disk storage is cheap, so I think this is something everyone who does serious data analysis ought to be doing.
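What Heaton describes is a RAID 10 (striped mirrors) layout. Here is a toy Python model of the block placement (illustrative only, not any real controller's logic):

```python
# Toy model of RAID 10 block placement: logical blocks are striped across
# two mirror pairs, and every block is written to both drives of its pair.

def raid10_targets(logical_block):
    """Return the (drive, physical_block) pairs a logical block lands on."""
    pair = logical_block % 2         # stripe: even blocks on pair 0, odd on pair 1
    physical = logical_block // 2    # position within that pair
    drives = (0, 1) if pair == 0 else (2, 3)
    return [(drive, physical) for drive in drives]

for lb in range(4):
    print(lb, raid10_targets(lb))
# Reads can be served by either drive of a pair (the speed advantage);
# losing one drive of a pair loses no data (the redundancy advantage).
```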

1 petrilli January 20, 2009 at 1:17 pm

I know it’s heresy, but the Apple Mac Pro actually is a great machine for this kind of application, whether you run Mac OS X or Windows on it. It’ll hold 8 cores and 32GB of RAM.

2 TripAZ January 20, 2009 at 1:21 pm

Most of the new Core i7 motherboards (when used with 64-bit OS) allow for 12GB of RAM.

3 secretivek January 20, 2009 at 1:24 pm

Another reason to go the “workstation” route (and another reason they are more expensive) is that the channel between the memory and the CPU matters when you are working with large amounts of data. You can have lots of memory and lots of CPU, but you really want a lot of memory bandwidth to keep the CPUs fed with data. This is one place where desktops (even high-end ones) don’t give you as much as a workstation or server machine.

(Expensive desktops used for video games don’t have the same kind of memory bandwidth either — they do a lot of computation, but on relatively small amounts of data. The exception is the graphics, where they have dedicated memory *and* a dedicated high-bandwidth channel between the graphics memory and the graphics processing unit.)
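One crude way to see memory bandwidth for yourself is to time a big array copy; here is a minimal sketch in Python/numpy (the array size is an arbitrary choice, and a serious benchmark like STREAM is far more careful about caches and timers):

```python
# Crude memory-bandwidth probe: time a large array copy and report GB/s.
import time
import numpy as np

a = np.zeros(50 * 1000 * 1000)   # ~400 MB of float64, far larger than any cache
start = time.time()
b = a.copy()                     # streams the whole array through memory
elapsed = time.time() - start
bytes_moved = 2 * a.nbytes       # each byte is read once and written once
print("%.1f GB/s" % (bytes_moved / elapsed / 1e9))
```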

4 secretivek January 20, 2009 at 1:27 pm

And yeah, the Mac Pro is a good piece of workstation-class hardware.

5 Noah Yetter January 20, 2009 at 1:28 pm

A “decent” RAID card that can do 10/1+0 as described will probably run you more than $200, even if you’re willing to brave eBay as I did. Most consumer RAID cards don’t actually do any processing on board and buy you little performance as a result, and most don’t support 10/1+0 anyway, though you could use 5.

This is the card I have:
http://www.adaptec.com/en-US/products/Controllers/Hardware/sas/value/SAS-31205/

You of course wouldn’t capital-N Need the 12-port model; there are 8- and 4-port versions of the same card that are cheaper, though not by much.

6 Clyde January 20, 2009 at 1:31 pm

All that sounds like good advice; one additional thing to look at is the hard drive speed. 10000 RPM drives really do make quite a bit of difference. The tradeoff is that “disk storage is cheap” sometimes isn’t quite as true for those drives. But if there’s room in your budget after the other stuff, it can be worth it. Make sure all drives in the RAID array are the same speed/size though.

The only other concern is that 64-bit Windows won’t run some piece of hardware that you find critical. If you only use this machine for the OS, web browser, and Stata, then you’ll be fine, but if you have something like a webcam or a weird network adapter, you might want to test those out somewhere.

7 eric January 20, 2009 at 1:33 pm

I just bought a new system that I built myself for about $2,300. A couple of items in my build would work well for you, in my opinion.

The processor is an i7 920, from Newegg for about $300:
http://www.newegg.com/Product/Product.aspx?Item=N82E16819115202

I bought a gaming motherboard, but this would work well for you because it has 6 slots for RAM. $380 (they apparently have a couple of open-box ones for $300):
http://www.newegg.com/Product/Product.aspx?Item=N82E16813131352R

Last, I bought 12 gigs of really fast RAM. This particular model gives an awesome price/performance ratio ($300 or so):
http://www.newegg.com/Product/Product.aspx?Item=N82E16820231200

It works great for games and would work really well for your application. You could probably get a grad student to set it all up for you as well. These particular components are very fast but have the potential to be “overclocked” for an additional 10-20% increase in performance.

8 BryanMD January 20, 2009 at 1:36 pm

I agree that an Apple Mac Pro is the way to go if cost is not THE primary factor. I use Windows XP, Vista, and OS X on a daily basis and find OS X to be the least stressful to operate in. I also find no discernible difference between Stata in OS X and Vista. However, I would advise AGAINST running Stata for Windows in Parallels if you do go the Apple route. Parallels eats a lot of memory on its own. Just get a native OS X version of Stata.

9 mickslam January 20, 2009 at 1:51 pm

Bail on Stata. Only economists use it and R/S/matlab are better.

10 Joey January 20, 2009 at 1:52 pm

If you go the Mac Pro route, for god sakes, don’t buy your memory from Apple. To configure 32GB of RAM (8 x 4GB), it’s a $9100 upgrade. To buy the RAM from Crucial, it’s $1600. Holy shit.

11 Sergey Kurdakov January 20, 2009 at 1:57 pm

The Intel X58 Express chipset (for the new Intel Core i7 processors) supports 24 GB of memory (it depends on the motherboard manufacturer, though).

12 Douglas Knight January 20, 2009 at 2:02 pm

Using “workstation” as shorthand for 64 bits, many cores, high bus speed, etc., is probably a bad idea.

13 Eli January 20, 2009 at 2:07 pm

Mac Pro with memory purchased somewhere else. Stata for Mac is good.

14 Gordon Mohr January 20, 2009 at 2:15 pm

@Martin Smith: The failures of drives from the same manufacturer, same batch (i.e., consecutive assembly in the same plant) are not independent, so the advertised MTBF does not tell the whole story. Everyone I know who buys drives in bulk talks about receiving occasional ‘bad batches’ which show rapid or otherwise suspiciously similar failure characteristics.

Heaton appears to be recommending a RAID 10 configuration — four X GB drives providing 2X GB of space. So, if you have four consecutively manufactured drives in such a configuration and one fails, don’t think, “it’s been years since my last HD failure, what are the odds I’ll lose another this week?” Think, “uh-oh, this batch may have hit its wall; I’d better replace it immediately.”

15 Scott McKuen January 20, 2009 at 2:32 pm

Some folks have recommended Mac Pro hardware and other folks have suggested moving to Matlab for your calculations. I like both a lot, but if you do them together, be aware that 64-bit Matlab for OS X is only in beta currently, so you are limited to 2 GB of RAM until that’s fixed – and that kinda hoses your goal of working with large datasets, unless you want to run Linux on your Mac (fresh install or dual-boot). Might be more trouble than it’s worth for you, unless you like to tinker with your system.

16 Scott McKuen January 20, 2009 at 2:35 pm

Ah – I see that Alex has pre-empted my post with a Mac veto. Yeah, the IT department not supporting the system would be an issue, too.

17 Matt January 20, 2009 at 2:41 pm

Looking through the comments and the OP, I’m having a hard time understanding why people keep recommending he has to have a “workstation” as opposed to a high-end desktop PC. The three primary reasons I’ve seen people put forward are as follows: desktop PCs will not allow more than 8 GB of memory, desktops do not have fast enough bus speeds, and desktops do not allow for enough processors. All three of the above arguments are patently false. A midrange P45 motherboard will support up to 16 GB of memory, and there are X58 motherboards that will support as much as 24 GB. The Mac Pro that everyone keeps recommending has a 1333 MHz FSB, the same speed that P45 motherboards support and slower than the 1600 MHz that X58 motherboards support.

The number of processors is the only argument that makes any sense at all to me. A Mac Pro will support up to 2 Core 2 Quads, for a total of 8 processing cores. A P45 or X58 will only support a single quad-core processor. However, in the case of the X58, it will support Intel’s new Core i7 chip, which is multi-threaded and can handle 8 processing threads at once, the same as the dual-chip Mac Pro. I realize that a multi-threaded quad-core processor is not 1:1 with 2 single-threaded quad-core processors. However, it seems like a petty distinction to make given the massive increase in price in order to step up to a dual-processor system.

If I’m wrong on any of these counts, then by all means someone please correct me. And of course, if he wants to spend $4,000 on a workstation, that’s his prerogative. I just don’t want him thinking he has to spend that much when he can likely find all the computing power he needs for significantly less.

18 Mitch January 20, 2009 at 3:02 pm

It looks like the number of comments on this page is pushing it from “babel” into “enough to get a sense of the consensus,” so I’ll go ahead and comment.

Re: R, the comment that it’s “terrible for large datasets” fails my personal smell test because it’s overly sweeping.

I suppose people have different definitions of “large”; I’m sure other people have larger data sets than mine (have to disclaim in case someone feels the need to engage in a size competition). My R usage was in statistical genetics, where I was looking for correlations between a data set with tens of millions of data points and a data set with a few thousand data points. I was pretty happy with how R handled it. This was a few years ago.

Using R the way it’s meant to be used (in a nutshell: use vectors/matrices wherever possible) is sometimes subtle, but I’ve found the R mailing lists to be very helpful for that (both googling through them and posting to them).

Compared to Stata, R has lexical scope! Welcome to the eighties (70s? 60s?).
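For what it’s worth, the vectorization point looks the same in any array language; here is a minimal sketch in Python/numpy of the loop-versus-vector tradeoff (illustrative only, not Mitch’s genetics code; the R equivalent uses vectorized operators the same way):

```python
# The interpreted loop pays per-element overhead; the vectorized call does
# the same arithmetic in one pass through optimized compiled code.
import numpy as np

x = np.random.randn(10 * 1000 * 1000)

# Slow, per-element version:
total = 0.0
for v in x:
    total += v * v

# Fast, vectorized version (same result, often orders of magnitude faster):
total = np.dot(x, x)
print(total)
```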

19 Sigivald January 20, 2009 at 3:12 pm

The obvious long-term solution (I don’t do modeling, so I don’t know what the current user-level products are like) will be, as Zamfir alludes to, a grid or cluster of computers.

Why worry about stuffing more than 8 gigs of RAM in a single box when you can have four boxes with 16 gigs each, all working on parts of your analysis in parallel (which makes the slow network between them less of a problem)? A sketch of that split-the-work pattern appears after this comment.

(And if you really do just need more than 8, there are plenty of MBs out there that’ll take 16, starting under $100 at, for example, Newegg. Not as good as a workstation-level board, but a lot cheaper; the tradeoff is yours to make.

It’s going to be hard, somewhat in contrast to his first comment’s implication, to find a system that isn’t 64-bit these days, at least at the hardware level.

As others have said, if money isn’t the primary issue and you don’t want to build your own, buy a Mac Pro – perhaps a reconditioned one. It’s an excellently engineered machine with great performance, and comparable in price to same-spec hardware from other brands.

But don’t buy ram from Apple.)
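A minimal sketch of that split-the-work pattern on a single box, using Python’s multiprocessing module (the chunking and the analysis function are placeholders; spreading across several boxes needs a job queue or MPI rather than a Pool, but the idea is the same):

```python
# Split a dataset into chunks and analyze them in parallel worker processes.
from multiprocessing import Pool

def analyze(chunk):
    # stand-in for the real per-chunk work (a regression, a summary, ...)
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    data = list(range(1000000))
    chunks = [data[i::4] for i in range(4)]   # four workers, like four boxes
    pool = Pool(processes=4)
    results = pool.map(analyze, chunks)       # chunks run in parallel
    print(results)
```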

20 Yancey Ward January 20, 2009 at 3:21 pm

I didn’t realize Solitaire was so tasking to today’s systems.

21 Zamfir January 20, 2009 at 3:29 pm

Matt raises a good point about GMU hardware support. Odds are there is some guy in your tech support department who wouldn’t mind picking a good system for you, with the added bonus that if they were involved in the choice, they will probably be a lot more supportive if (when) you need their help.

22 anon January 20, 2009 at 3:53 pm

FYI on the Mac Pro route: I have only 4 GB, but have had no problem with NIBRS, for instance. If you buy the memory third-party, you get ridiculous cost savings.

23 Anonymous January 20, 2009 at 4:13 pm

An SSD will easily be the most performance for the dollar. Memory is not the bottleneck. The system bus and the hard drive (think actual moving device) are the bottlenecks. Get a couple of those Intel SSDs for $500 and you’ll fly.

24 Steve Sailer January 20, 2009 at 4:34 pm

Definitely check out Dell.com to see what’s the burliest Core i7 system you can buy at mass market prices.

25 Mike January 20, 2009 at 4:49 pm

The recommended RAID configuration (“RAID 1+0”) is not necessarily the best depending on what your needs are for speed, size, and reliability. For further info see this article: http://www.maximumpc.com/article/raid_done_right?page=0%2C1

A RAID array should not be considered a “backup,” as it can only save you from physical problems with the drive. If the wrong file is permanently deleted, you cannot recover it.

26 Mr. Beefy January 20, 2009 at 5:09 pm

You can get a desktop Cray for $9K and blow all of these other ideas out of the water. Put a Hemi in it! :)

27 txxxxx January 20, 2009 at 5:36 pm

Some people have mentioned this: the new Intel X58 chipset boards for the new Core i7 CPUs support 12 GB of RAM (3 DIMM slots); they’ll support twice that once the size per DIMM doubles.

You can buy a whole system from, say, Dell, but I encourage you to build it yourself just for the fun of it.

For example,

Core i7 CPU:

http://www.newegg.com/Product/Product.aspx?Item=N82E16819115201

Motherboard with new chipset:

http://www.newegg.com/Product/Product.aspx?Item=N82E16813128362

28 RobbL January 20, 2009 at 6:48 pm

Until recently, a part of my job was speccing heavy-duty servers and workstations. What I find is that the bottleneck (limiting) part of the system is not always obvious. What I suggest is that you put your data on an external drive and ask around to find someone who has a machine configured in the way suggested. Then load the data and try a few operations to see if you are getting the performance you need.

Presently the i7/SSD setup is far ahead in performance at a reasonable price.

One final note: don’t waste your money on RAID mirroring. Use multiple drives in RAID 0 to get performance, and get a separate solution for backup. Almost nothing that goes wrong is protected against by a mirror; it only protects against a single drive failure. Much more frequent are controller failure, database corruption, accidental erasure, power issues that fry everything, etc.

29 Doug January 20, 2009 at 7:19 pm

By the way, most of the i7 motherboards have RAID controllers built in.

30 David Wright January 20, 2009 at 8:43 pm

I work in IT for a (still) profitable Fortune 500 company, and there is no way I could convince my employer to put that kind of workstation on my desk. If GMU is willing to buy you that, they need to institute better cost controls.

31 Jason January 20, 2009 at 9:47 pm

Unfortunately, there is no 64-bit version of Stata for the Mac.

32 HankP January 20, 2009 at 10:30 pm

I do this stuff for a living.

The best solution from a price standpoint is for you to build your own system, I’m guessing for somewhere around $1,500-2,000. You’d get a very fast machine with a quad core and 8-12 GB of RAM, although by playing around with the specs you could do slightly better or worse depending on your requirements and the specific parts you choose. Don’t take this step lightly; if you’ve never built your own machine before, there are quite a few tricks you’ll need to be aware of or be willing to learn. The more cutting-edge the hardware, the more you’ll be expected to know what you are doing.

If you’re not comfortable building your own, the best idea is probably to buy a server machine from Dell or HP with quad-core (or dual quad-core) processors, up to 32 GB of RAM, and RAID 1 hard drives (I prefer RAID 1 because each drive of the pair is bootable by itself if something goes seriously wrong). Don’t count on RAID as a backup, though; buy an external drive for that and keep it stored away from the machine itself in case there’s a fire or other disaster. You’ll pay a premium for buying from a vendor; I did a sample pricing, and a dual quad-core machine with 16 GB of RAM and RAID 1 15K 300 GB drives is about $3,000.

Email me if you have questions you’d like answered about this.

33 Brock Palen January 21, 2009 at 12:13 am

You could go Mac Pro and buy memory from a third party. Stata on Mac/Linux should be best, as Stata has a Unix heritage.

Does your university not have a local research computing/cluster computing setup? If so, check with them. For example, we provide Sun X4600s, which cost tens of thousands of dollars. They have up to 128 GB of memory and 8 sockets (thus, with modern chips, 8*4 = 32 cores) and run Stata/MP like a champ.

You will need Stata/MP like your friend said; it’s not cheap, but worth it if you can take advantage of the extra cores. Or you can always run multiple copies of regular Stata; each will use one core. But then you need memory to support N Statas running.
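A minimal sketch of the N-copies approach, assuming a Unix Stata install where "stata -b do file.do" runs a do-file in batch mode (the do-file names here are hypothetical):

```python
# Launch one batch-mode Stata per do-file; the OS spreads them across cores.
# Each running copy needs its own share of RAM for its dataset.
import subprocess

do_files = ["model1.do", "model2.do", "model3.do", "model4.do"]  # hypothetical
jobs = [subprocess.Popen(["stata", "-b", "do", f]) for f in do_files]
for job in jobs:
    job.wait()   # results land in model1.log, model2.log, ...
```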

Get a TeraGrid allocation:
http://www.teragrid.org/

When I talked to them at SuperComputing08, they had resources to spare. On an Altix 4700 you can get 24 TB (~24,000 GB) of shared memory from SGI (a few million dollars, but free on TeraGrid). Check first about Stata tokens; the TeraGrid support desk guys will help you out.

Brock, HPC Sysadmin

34 Kragen Javier Sitaker January 21, 2009 at 1:58 am

I just built my new system.

Motherboard: ASUS P5KPL-AM, US$65. Only holds up to 4GiB RAM, but there are motherboards in the same price range that support 8GiB. Integrated video and Ethernet, 8 USB ports. Supports DDR2-1066 and “Core 2 Quad” quad-core processors.

RAM: 2GiB Kingston DDR2-800 “PC2-6400” SDRAM, US$20.

CPU: Intel Celeron E1200 1.6GHz dual-core 64-bit, US$60. This is really a Core 2 Duo with a low clock rate and a small L2 cache.

Disk: a 500GB WD 7200RPM SATA-2 thingy, US$80.

I’m leaving out the keyboard, monitor, mouse, and power supply, since you’re really talking about doing an upgrade rather than buying a whole new system. The total cost of the items here is US$225, which is a lot less than you would think a dual-core 64-bit processor would cost. I estimate you could do a similar upgrade, but with 8GiB and a nice Core 2 Quad Q6600, for US$500.

As far as RAID cards: use the software RAID built into your OS! RAID cards are an extra thing that can fail and destroy your data or give you driver incompatibility problems. And of course don’t trust RAID to save you from bugs and accidents. rsync for offsite backups is your friend. (If you don’t have software RAID in your OS, try Ubuntu Linux.)

I’ve been running some biggish data analysis jobs on this machine; they’re running about 3× faster than they did on my old 2.4GHz Pentium 4 server.

35 Kragen Javier Sitaker January 21, 2009 at 2:15 am

A few other things.

Doing stuff on a cluster is a good idea.

I don’t have any experience with SSDs, and I’ll tell you why. Good SSDs are about 40× faster (for random access) than a good disk, or 4× faster than a good RAID, but they’re still 2000× slower than good RAM. And they cost almost as much as good RAM, unless you don’t have enough motherboard slots to put all those 4GiB DIMMs in. It’s true that your machine will boot faster from an SSD than from a disk, but boot time isn’t the best metric of workstation or cluster performance. You can keep your dataset in RAM (writing changes to disk, of course, but avoiding reloading it) by the simple expedient of not turning the machine off.

36 Sergey Kurdakov January 21, 2009 at 6:35 am

Just to note: if there is no brand-name computer with the required specs, there is the option of having a system custom built. For example, look here:

http://www.pugetsystems.com/certified_sys.php?sys_id=59
(A better solution is to use http://www.pugetsystems.com/certified_sys.php?sys_id=51 and order an X58 motherboard; also note that Intel just slashed prices on processors and rolled out new ones (Core 2 Quad Q9550s, Q9400s, and Q8200s).)

37 Jeff January 21, 2009 at 10:45 am

Bail on the large datasets. A statistical relationship that requires a really large dataset to find is not going to be empirically important.

38 Douglas Knight January 21, 2009 at 11:26 am

Ask your IT guys if they’d be willing to “build your own.” They’d probably find it fun.

I have $4K to spend, of which an update to Stata (2-processor) will be $500 ($1,000 for the 4-processor, but I can hold off on that for another time).

If you can’t afford Stata, switch to a real program, like R.

39 Sergey Kurdakov January 21, 2009 at 11:52 am
40 GMU Student's Dad January 21, 2009 at 3:28 pm

My son’s a GMU student, majoring in Art and Econ (yes, “that makes for pretty graphs”). When he first started a couple of years ago, I asked one of the Art profs what machine to get, and he recommended a Mac–no surprise there. I didn’t actually buy a Mac (for reasons not relevant) but I’d bet, say, a chunk of Josh’s tuition that you could find IT support if you asked the right guy in the Art dept.

41 Dan January 22, 2009 at 1:25 am

Why buy a fancy computer?

If your software is parallelizable, in particular, you can rent 50 computers for a few minutes when you need them and not pay a cent the rest of the time. Your compute tasks will take far less time.

http://aws.amazon.com/ec2/

I’ve been very happy with the service and the cost is tiny.
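For what it’s worth, a minimal sketch of the rent-when-needed pattern with the boto library (the AMI ID and instance type are placeholders, and this assumes your AWS credentials are set up; treat it as an outline, not a recipe):

```python
# Start a batch of EC2 instances, farm the work out, then terminate them
# so the meter stops running.
import boto

conn = boto.connect_ec2()    # picks up AWS credentials from the environment
reservation = conn.run_instances(
    "ami-12345678",          # placeholder image with your analysis stack
    min_count=50, max_count=50,
    instance_type="m1.small",
)
# ... dispatch chunks of the job to the instances and gather results ...
conn.terminate_instances([i.id for i in reservation.instances])
```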

42 Sergey Kurdakov January 22, 2009 at 5:25 am

http://www.revolution-computing.com/products/windows-64bit.php is a place to get a 64-bit Windows version of R (in beta now); I’m not aware of their pricing, though.

43 Zathrus January 22, 2009 at 12:55 pm

As several others note, the latest Intel MBs support 24 GB of memory (look for ones with 6 memory slots), so that’s not really an issue anymore. There’s little reason to think that they won’t support 48 GB once 4 GB memory chips are widely available, unless there’s a physical limitation in the chipset.

I would agree with using a RAID 1+0 setup if you have a disk bottleneck, and definitely use a real RAID card (or use Windows’ built-in software RAID in Server 2003 or Server 2008; you can also hack XP (and possibly Vista) to do it as well). Do not use motherboard RAID — with software RAID or a separate HW RAID controller you can recover the data when the controller fails. With motherboard RAID (aka FakeRAID) you’re pretty much toast.

As others note, do not consider RAID to be backup. It’s not, as countless people have discovered. It’s only good for improving read I/O and for keeping the system from dying when a single drive fails.

And Martin Smith is incorrect. There have been numerous studies (in particular, one by Google) that show that drives from the same batch tend to die at roughly the same time. Which is why Google and a number of other data centers have given up on RAID5 — when one disk fails the chances of a second disk failure are considerably higher, and that wipes out all of your data in the array.

44 Richard February 5, 2009 at 9:00 am

For a cheapo upgrade: my desktop has 4 GB of RAM running 32-bit WinXP. Obviously it is horribly constrained – WinXP will only let you get at about 1.4 GB of RAM.

So I installed 64-bit Ubuntu Linux using “WUBI” and bought a 64-bit Linux version of Stata; I am able to hit all the RAM. Total cost: $400 for the Stata version.

The WUBI install of Linux is very non-invasive (i.e., it won’t send the IT folks at your institution off the deep end).

45 Martin S March 15, 2009 at 10:22 am

Hey Alex, could you let us all know what you ended up buying, where, how much, and any regrets? Thanks! Martin


47 sex shop July 24, 2010 at 7:15 am

Hard drives purchased in the same batch and RAIDed together often fail at the same time (i.e., within the same week), so don’t think that this setup has you covered from a backup perspective.

48 lv bags September 24, 2010 at 5:00 am

He’s way overestimating the price. The large PC manufacturers couple high-end parts like quad-core processors with other high-end parts you may not need, like video cards for PC gaming, to jack up margins at the high end of their product line. Check out smaller PC assemblers like ibuypower.com who let you spec out your entire system. Also, don’t buy RAM from the manufacturer. It’s dirt cheap (you can find 4 GB for $40 on Newegg.com) and the simplest part of a PC to upgrade. If you’re spending more than $1,000 (sans monitor) for a system that meets your needs, you’re spending too much.
