How should economists write code?

From Matt Gentzkow and Jesse Shapiro (pdf), addressed to their RAs:

Every step of every research project we do is written in code, from raw data to final paper. Doing research is therefore writing software.

Over time, people who write software for a living have learned a lot about how to write it well. We follow their lead. We aim to write code that would pass muster if we worked at Google or Microsoft.

For the pointer I thank Bo Cowgill.


The contents of this pdf are applicable to anyone writing software... It's like a set of guidelines on 'how to write better code/produce better software' - good reading for anyone in the business of writing software.

As an old programmer (see below), I kind of find their good/bad examples equally bad. The things you weigh in "good code" change over time, and even across applications. When you are filling in a foreign API, once and only once, "parameter[27]=1" can be fine if it matches the interface doc. Similarly, "x = y * 3.28 // meters to feet" doesn't really benefit from a dedicated constant.
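The trade-off in that conversion example can be made concrete; a minimal Python sketch (the names are mine, purely illustrative):

```python
# Magic-number style: fine for a one-off, but the comment carries all the meaning.
x = 10 * 3.28  # meters to feet

# Named-constant style: the meaning survives even if the comment is lost.
FEET_PER_METER = 3.28  # approximate conversion factor

def meters_to_feet(meters):
    """Convert a length in meters to feet."""
    return meters * FEET_PER_METER
```

Whether the second form is worth the extra lines is exactly the judgment call being made here.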

Probably the best thing, if you want to write good code, is to both read and write a lot of code. You then get a feel for what is easy for you to read and maintain, as well as the styles of others.

True, but in this context they're almost never going to be using foreign APIs the way we professional programmers do.

(And even then, by God, you should comment that "1" with a note saying it's to match the interface documentation and is a Magic Number With No Meaning Here.)

While their examples aren't perfect for General Programming Education, I suspect they'll be great improvements in context, if followed.

Certainly, in the first half of the document (the part I read), all their axioms and commands were excellent form, at the highest level.

(Contra zbicyclist below, even if it's true that it isn't going to be run often and only your buddies will run it... if anything ever needs to be checked or modified you'll regret taking shortcuts. You're saving yourself and your fellows trouble, labor, and hassle by doing it right the first time.

I can't remember how many times I've gone back to look at code that I wrote six or twelve months previous and only understood how it worked because I wrote it right and commented it thoroughly the first time.

Doing so saves hours and hours of re-learning and mistakes, and makes problems much easier to find.

Cheaper than "training" your people to match your fragile, ill-written code, too.

All code is production code unless it's in beta or alpha to eventually be production code - with the sole exception of write-and-throw-away one-time conversion code and the like. If people will run it later, however infrequently, and anything depends on the output being correct? It is production code.)

Those are fair comments. I think I'm just making a case for experience-based judgement to supersede rule-based thinking at some point.

(On returning to code after six or twelve months... I just try to avoid Perl)

As an exhortation, it's fine and good.

But taken literally, it's BS. I write a LOT of code that's designed for research purposes (roughly the same demands as code written for academic purposes, I would think), and the key is that you want it accurate, and you want it to kick out lots of diagnostics for checking, but you aren't going to run it very often, and the people running it are likely to be close personal acquaintances.

In contrast, Microsoft, Google (etc.) are writing PRODUCTION code. I have people who write that, as well. This has different demands (efficiency, robustness, ease of use), and has to be able to be run by a large, not well trained audience (in this case, anybody).

I'm all in favor of clear code and adequate documentation, but I don't think we want to trivialize the contribution of the production programmers who make modern software products so easy to use.

That's what a lot of people said early on in the software industry. Then they were all shocked by how long code lives on, and how many people read it.

What are you doing if you are not building solid, fundamental building blocks of knowledge? Are your papers also throwaway work that future researchers won't read?

It would be a big improvement if journals required people to publish the final, commented, compiling, and running version of their code. It would be an improvement in the usefulness of journals and it would help spread good research practices. It would also provide an incentive to write clean and concise code.

On the other hand, owning good code is often a ticket to several publications and consulting gigs, so I can see why people wouldn't want to just give it away.

I think it's a good idea to provide the code that produced scientific results/data analysis when it's appropriate, though printing it in a journal is impractical (some code is looooong). Providing it in an online archive would make sense.

I suppose I didn't mean they necessarily had to print the code out next to the article, just that the journal would publish the code (make available quality source and binary). That could easily be on the internet as you suggest. I suggest the journal because they already claim or aspire to the role of quality arbiter.

Did you know that several econ journals do claim to require code? The basic problem is that they don't enforce their requirement. Of course, if they checked that it compiled, they would check that it exists.

But the real problem is that transparently good code produces the correct answer and there is a strong incentive for nontransparent code hiding bugs producing the desired answer.

How cute.

Having recently moved from a technical job (aerospace engineer) that involved coding, to a job in real software development, I'm going to guess these economists' software won't ever come close to Google et al.'s because it will lack:

- version control (ability to track who added what and seamlessly roll back to previous versions)
- automated testing (as opposed to "yeah, it still looks alright when I ran some input on it")
- one-step deployment ("run this script" rather than, "okay, copy this here, download this, check that this matches this ...")
- code reviews
- environment files (scripts that make sure the person running the code is operating off the same dependencies as the authors)
- informative error messages
- etc.
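To make the automated-testing point concrete, here is a minimal Python sketch of what replaces "it still looks alright" (the statistic and its checks are my illustration, not from the paper):

```python
def gini(incomes):
    """Gini coefficient via mean absolute difference (simple O(n^2) form)."""
    n = len(incomes)
    if n == 0:
        raise ValueError("gini() needs at least one observation")
    mean = sum(incomes) / n
    if mean == 0:
        return 0.0
    mad = sum(abs(a - b) for a in incomes for b in incomes) / (n * n)
    return mad / (2 * mean)

# Checks that run the same way every time, instead of eyeballing output:
assert gini([1, 1, 1]) == 0                   # perfect equality
assert abs(gini([0, 0, 0, 1]) - 0.75) < 1e-9  # one person holds everything
```

A failing assert is also an informative error message; re-running checks like these after every change is what "automated testing" buys you.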


(I didn't expect economists to have a preview button either.)

The sad thing is that, as a professional software developer, these are mostly solved problems - but the solutions remain sort of unavailable unless you live in the software-development culture.

I think it's a crying shame papers don't automatically come with github links.

Even then...

I have a multibillion-dollar client doing a huge implementation of a major software package; they've been running mostly in-house stuff up till now. Their programmers' jaws just drop when we explain the nineteen steps of version control we use.

"But... all that overhead!" Yeah, welcome to real software development, folks.

What I'm hearing is that it is a horrific waste of resources to expect RAs to achieve the level of quality aspired to by multibillion-dollar corporations and the Googles who spread the overhead across billions of users.

Andrew, small software startups of 1-2 people follow these guidelines and use these tools. There are free, open-source tools for all of this, especially for popular, higher-level languages like Ruby and Python.

It takes like 3-4 days of research (to find a good fit for you) and 1 day of setup to get an environment close to what Google web developers have.

The toolset is extremely basic. I personally consider version control a basic form of programming literacy - do not hire anyone who does not use it.

Something isn't quite right.

It's either easy to be a professional economist AND professional programmer, or it's a little hard to be both.

There aren't any advanced quality control techniques that aren't used by every person who does even a little bit of coding?

Those sound like 'quality systems' as distinct from quality itself.

I'm impressed that they have an RA manual, let alone that they let other people have access to it.

It is easy to program, but hard to be a good programmer. Programmers, however, have little self-awareness of their problems and are shocked to discover how much their methods really cost.

Programming well is hard work but it is not expensive. On the contrary, good quality is substantially faster and cheaper, and involves low-tech techniques such as checklists, personal reviews, disciplined design, and peer inspections. I explain the numbers in an article in "Software Quality Professional".

The market, however, seems to be driven by tools rather than by developing human capital.

There's no need to be condescending. For one thing, they imply that they use version control (by talking about "checking in" code). They also talk about the value of automated tests. Sure, they may not quite reach the Google standard they are aiming for, but if people actually follow what they're advocating, it's likely to lead to a significant improvement to code quality and maintainability.

I'm an old programmer, and my name actually links to a set of how-to-program pages. One thing that has crept up on me is that programming (or building anything, really) grounds the practitioner in reality. Your ideas are constantly tested. Bad logic, or unrecognized conditions, generate errors. It makes, especially after 20 or 30 years, for a strict view of factual reality. So sure, economists could use some of that ;-)

About 90% of programmers also write terrible, terrible code that is poorly documented, non-robust, and inconsistently formatted.

This is generally a plus for me, though, since it makes supporting, maintaining, and debugging code a much more valuable skill :))

Writing code that would pass muster at Microsoft isn't exactly a high standard.

What is the name of your operating system, and where can I buy it?


Case closed.

Although that crime against humanity is probably the fault of design rather than coding.

If you don't like it, don't buy it.

I get that response a lot. You are basically saying there is no objectivity. Zune is bad, objectively. It's not just my opinion.

What was so bad about the Zune? "Crime against humanity"? What, did a Zune fall on your dog's head and kill it?

You are correct, in my understanding, however - whatever problems it had (apart from marketing!) were design, not code. As far as I know the Zune products all ran fine and did their designed tasks correctly. The code was fine.

(Full disclosure: I always ran Apple music-players since the first iPod came out and they plainly knew the way to do it. But I thought the Zune ones were perfectly capable competitors at the feature and stability level.)

"If you don’t like it, don’t buy it."

Few did. That's why it tanked. The players are fine because, like the iPod, they simply rip off pre-existing technology. "It's the interface, silly." The software was horrible. I find it hard to believe that it is all design, because I can't believe they'd want it that way. I suspect that the way the modules were put together caused the illogical interface.

Every knowledge worker should be able to write software. It would be like stating in 1970 (Industrial Age) that only factory workers should be able to operate machines, or in 1770 (Agricultural Age) only farmers can use a shovel. We are in the Information Age, and almost everyone should be able to work with information.

Code is law.

They can already.

I bet you those RAs are working with information in Matlab and other similar analysis tools all the time. (Hell, one can do a staggering amount of data-churning with Excel...)

They're just not writing the code directly, any more than someone working a factory job in 1970 was making the machines directly with a file. Nothing wrong with using tools to make tools for you; abstraction is power.

Why doesn't Matlab count as writing code directly? Matlab is a Turing-complete language! So are Stata, SAS, Gauss, etc.

Knowledge workers need to have access to specialists who can write code.

Instead of learning to write software they can spend the time learning how to be better at their own specialities.

That seemed to work for Adam Smith's pin factory workers.

The advice, a mix of design, style, code advice, is well intentioned, but inadequate.

The typical programmer injects about 100 defects per 1000 statements (I swear I am not making that up; I can more or less prove it with tens of thousands of observations). Modern IDEs find about half of these. Typical professionally produced software lets 10 to 20% of these defects escape through all testing into production. Even Microsoft and Google allow 2-5% to escape into production. This level is about a factor of 5 to 10 times more defective than economically justified by the effective application of practices not listed in the above. See Capers Jones' Software Engineering Best Practices. I can guarantee you that most academic software is at the poorer end of the spectrum, and few have the training or awareness to produce at the better end. How many defects do you want escaping into your research? The advice in the attached PDF is only a very small step.
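Taking the commenter's figures at face value, the arithmetic works out as follows (a Python sketch; all the rates are the comment's, the codebase size is my assumption):

```python
statements = 10_000                      # a modest research codebase (assumed)
injected = statements * 100 // 1000      # ~100 defects per 1000 statements
caught_by_ide = injected // 2            # modern IDEs find about half
escaped_typical = injected * 15 // 100   # typical shops: 10-20% escape (take 15%)
escaped_best = injected * 3 // 100       # Microsoft/Google: 2-5% escape (take 3%)

print(injected, caught_by_ide, escaped_typical, escaped_best)
# 1000 injected, 500 caught by the IDE, and 150 vs. 30 escaping into production
```

Even at the Microsoft/Google rate, a 10,000-statement research project would ship dozens of defects, which is the commenter's point about defects escaping into research results.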

I find it a source of constant astonishment that not only software developers, but their managers, executives, and even professional economists do not grasp the reality of how inefficient software development is because of poor defect management. Nor do regulators really know how to regulate it. As Jerry Weinberg once quipped, "If builders built buildings the way programmers write programs, then the first woodpecker that came along would destroy civilization." It is not only unnecessary, but inefficient and shortsighted. As with baseball statistics in the late 20th century, opinion rather than facts dominate the discussion.

Fred Brooks learned a lot of lessons the hard way, but it was Watts Humphrey who saved that IBM 360 project with a focus on quality. Any discussion of software quality should start with Watts. His book on discipline and measurement is the key to applying Moneyball concepts to software development.

I am a recovered software writer myself. The bloated, buggy software and virus-ridden computers that are the norm today are the result of popular demand: companies like the now-defunct DEC and Digital Research, both of which took the time to make their products really reliable, simply couldn't compete with companies like Microsoft, which rushed every new version of their products out the door with new features as quickly as possible and told anyone who complained about bugs, "wait for the next version."

Thus I would not want to see us learn how to write software by imitating Microsoft, though they and their major competitors of today can serve well as bad examples. Instead, we should be looking at the market incentives that caused Microsoft to succeed and its superiors to fail. What we need is a way to make the next DEC succeed, so we can all benefit from its products. I'm not sure whether this is a pure public-good problem, or a result of too-strict intellectual property law (it would be nice if the public were allowed to take over, say, old versions of Windows and work on debugging them as an open-source project, which would probably produce something much more worth owning than the newest version.)

I personally had a DEC machine that ran for over 10 years without a reboot, which is proof that it can be done.

If that was a VMS machine, some of the code ended up in Windows NT. The cynic might say entropy rather than evolution.
I had one kernel panic in 4 years on my SGI workstation. The only other time it was off was for a power failure.

I'm guessing the incentive of researchers writing to RAs is to push the RAs to write code that can be easily understood by replacement RAs.

Absolutely. Incoherent code can cause major communication breakdowns in code-centered projects.

Oh my god, economists using LaTeX?!

Pretty much all economists under 50 do, some economic historians excepted.

And also economists over 50. Peter Cramton wrote a set of TeX macros that preceded LaTeX when he was a Ph.D. student at Stanford GSB. And he got the university to accept the results for the officially filed copy of his dissertation.

I'm a little surprised at the comments here. Clearly the net effect of having a document like this is that you'll get better code. Of course no PDF can make every RA as fluent a coder as a Google engineer. But a little exhortation to write code others can read is harmless. It will make for better science.

Anyway--these guys are clearly already way ahead of the game, because they're using version control.

Version control is the one key thing every scientist who writes code should learn. If you have it, working with code is tolerable. If you don't, working with code is endless pain and frustration. Learn more here:
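For a scientist starting from zero, the core workflow is only a handful of commands; a sketch using git (the project and file names are invented):

```shell
set -e
git init -q my-analysis && cd my-analysis
git config user.name "RA"
git config user.email "ra@example.com"

echo 'reg y x, robust' > regressions.do    # a toy Stata script
git add regressions.do
git commit -qm "Add baseline regression"

echo 'reg y x z, robust' > regressions.do  # a later change
git commit -qam "Add control variable z"

git log --oneline                          # who changed what, and when
git checkout -q HEAD~1 -- regressions.do   # roll the file back one version
```

That last command is the payoff: any previous version of any file is a one-liner away, which is exactly what makes working with code tolerable.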

LOL, I looked at the document: they are using Matlab and "C" as examples, and what looks like Pascal. Any serious programmer -- and that includes me; some of my stuff is used on the web today -- nowadays is using C#, C++, Java, in that order, and maybe F# if they want to be fancy. Hardware people use C. There are no exceptions to this rule. Scripting languages like Perl, Python, Matlab, and Visual Basic are not serious OOP languages, nor is Ruby, an interpreted scripting-type language. I have spoken, and no, I don't want to debate this nor start a holy flame war.

They're using domain-specific languages aimed at math/stats. It would be inefficient to use something like Java or C++ for this kind of thing.

They use Stata and Matlab code because that's what their RAs are going to write. They don't claim to be serious programmers. They're serious economists. They just want their code to come close to the levels of clarity strived for by serious programmers.

Instagram, which just sold for a billion dollars mind you, has a Python back-end. Pinterest, probably the hottest new thing on the web right now (though I can't fathom why) also has a Python back-end. The original Twitter was written in Ruby. Redis, a rising star in the NoSQL/cache server space, is written in C. The worst language I know of -- PHP -- is something like the 4th most popular language in the world, powering zillions of websites including Wikipedia, and early versions of Facebook.

"No exceptions to this rule?" Take a hike, man.

"My language is better than yours" arguments are the mark of a troll, not a "serious programmer".

"But a little exhortation to write code others can read is harmless"

That is what I dispute. They are blowing smoke up the RAs' skirts with exhortations to be Google quality so that the head researchers can give the code to other RAs. On net this could just be a transfer and could yield better research, but I doubt it. However, it is just an example of the problem with academia: "Be me -- nay, be better than I am -- even though there is little hope of you getting my job." Better research on net will come from reform of the research system, not from making RAs give even more in return for less.

Meh. I am surprised that any experienced programmer would find much to argue with in that document; it's all good advice. However, I am not surprised that even an experienced programmer would nitpick relentlessly on some total side-issue. As any experienced programmer knows, something like 90% of programmers are dicks...

The 'problem' with programming is that you can start doing it so easily - anyone can download the compiler and start straight away. So you don't necessarily get introduced to the good practices at the same time. Thanks to Tyler for hopefully saving a few more beginner programmers from the usual mistakes.

Comments for this post are closed