The AEA’s New Data Policy

The AEA has long had a data repository, but no one was responsible for examining the data or replicating a paper’s results, and confidential data was treated as an exception. All that is about to change. The AEA has hired a Data Editor, Lars Vilhuber, who will be responsible for verifying that an author’s code produces the claimed results from the given data. In some cases Vilhuber will even verify results from the raw data all the way to the table output.

The new data policy is a significant increase in the requirements to publish in an AEA journal. It takes an immense amount of work to document every step of the empirical process in a replicable way. It’s all to the good, of course, but it is remarkable how little economists train our students in these techniques. Make no mistake: writing code to be replicable from day one is both an art and a science, and it needs to be part of the econometrics sequence. All hail Gentzkow and Shapiro!

Here’s more information:

On July 10, 2019, the Association adopted an updated Data and Code Availability Policy, which can be found at https://www.aeaweb.org/journals/policies/data-code. The goal of the new policy is to improve the reproducibility and transparency of materials supporting research published in the AEA journals by providing improved guidance on the types of materials required, increased quality control, and more review earlier in the publication process.

What’s new in the policy? Several items of note:

  • A central role for the AEA Data Editor. The inaugural Data Editor was appointed in January 2018 and will oversee the implementation of the new policy.

  • The policy now clearly applies to code as well as data and explains how to proceed when data cannot be shared by an author. The Data Editor will regularly ask for the raw data associated with a paper, not just the analysis files, and for all programs that transform raw data into those from which the paper’s results are computed. Replication archives will now be requested prior to acceptance, rather than during the publication process after acceptance, providing more time for the Data Editor to review materials.

  • Will the Data Editor’s team run authors’ code prior to acceptance? Yes, to the extent that it is feasible. The code will need to produce the reported results, given the data provided. Authors can consult a generic checklist, as well as the template used by the replicating teams.

  • Will code be run even when the data cannot be posted? This was once an exemption, but the Data Editor will now attempt to conduct a reproducibility check of these materials through a third party who has access to the (confidential or restricted) data. Such checks have already been successfully conducted using the protocol outlined here.
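The pipeline the policy describes, from raw data through transformation programs to the files behind a paper's tables, can be sketched as a single "master script." The directory layout (raw/, clean/, output/) and every file name below are illustrative assumptions, not anything the AEA prescribes; this is a minimal Python sketch using only the standard library:

```python
"""Hypothetical replication-package master script: raw data -> cleaned
analysis file -> published table. All names here are invented for
illustration, not taken from the AEA policy."""
import csv
import pathlib
import statistics

BASE = pathlib.Path("replication_demo")

def build_raw():
    # Stand-in for the raw data deposited with the journal.
    (BASE / "raw").mkdir(parents=True, exist_ok=True)
    with open(BASE / "raw" / "survey.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "wage"])
        w.writerows([[1, "21.5"], [2, "19.0"], [3, ""], [4, "30.2"]])

def clean():
    # Transform raw data into the analysis file, documenting every
    # step in code: here, dropping observations with missing wages.
    (BASE / "clean").mkdir(exist_ok=True)
    with open(BASE / "raw" / "survey.csv") as f:
        rows = [r for r in csv.DictReader(f) if r["wage"]]
    with open(BASE / "clean" / "analysis.csv", "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["id", "wage"])
        w.writeheader()
        w.writerows(rows)

def make_table():
    # Compute the statistic reported in the paper from the analysis file.
    with open(BASE / "clean" / "analysis.csv") as f:
        wages = [float(r["wage"]) for r in csv.DictReader(f)]
    mean = statistics.mean(wages)
    (BASE / "output").mkdir(exist_ok=True)
    (BASE / "output" / "table1.txt").write_text(f"Mean wage: {mean:.2f}\n")
    return mean

if __name__ == "__main__":
    build_raw()
    clean()
    make_table()
```

Running the script end-to-end regenerates the "table" from the raw data, which is exactly what a replicating team would attempt to do.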

Comments

When I was a young academic it seemed potty to me that there was typically no way of checking computed results; one was simply expected to accept them. Of course I was so young that my concern was with honest mistakes; it was some time before I learnt that liars and cheats infested the universities.

Are they liars and cheats if they truly believe their (cherry-picked, data mined) numbers are "valid" based on the fact the numbers "prove" their ideologies and narratives?

A bigger problem among intellectuals is they believe stuff that no normal person ever could.

'The new data policy is a significant increase in the requirements to publish in an AEA journal.'

And a real surprise, considering that one had assumed after that little Reinhart-Rogoff contretemps that people were already paying more attention to the data.

Though possibly that explains this web site's reaction to Piketty.

'It takes an immense amount of work to document in a replicable way every step of the empirical process.'

Yet oddly, that seems to be generally considered normal in fields that actually handle empirical data, not merely empirical 'processes.'

'it is remarkable how little economists train our students in these techniques'

No, it really is not remarkable at all.

"Will code be run even when the data cannot be posted? This was once an exemption, but the Data Editor will now attempt to conduct a reproducibility check of these materials through a third party who has access to the (confidential or restricted) data. Such checks have already been successfully conducted using the protocol outlined here."

Expect to see the availability of detailed company data diminished by this. At three of the insurance companies I've worked for, 3rd party data verification was a non-starter for papers that weren't sponsored by one of three industry organizations. Even those had all sorts of hoops to jump through and the data-scrubbing process applied before allowing the data to be used for studies on behalf of the industry was both rigorous (as it should be) and confidential (arguable--I'd argue against).

"it is remarkable how little economists train our students in these techniques"
To tack onto what clockwork_prior said above, I don't find it remarkable at all. A little sad, but I think if you look at other university departments, you'll find yourself in good company on that point. At least, that what I saw in grad school and several of my colleagues from other schools reported the same. It's one of the reasons most of the managers at my employer are biased against hiring PhDs. Too much theoretical training, not enough practical background.

Now we just need the same for Greenhoax Effect research.


today we gonna make a poll of how many people at princeton university (united states) actually believe that developmentally disabled people were surgically modified by the government and used in ufo experiments in area 51 as described by a princeton university historian in the book area 51!
we looked at the evidence and it seems really thin for such a bold historical claim by princeton

Do ya' wanna car pool to Area 51?

I'll bring the beer. You bring the pretzels.

Crikey!
don't need to car pool
actually don't need a car
s.u. area 51 sept.!
princeton is in new jersey right?

Forensic accounting.
If you accuse a group based on circumstantial evidence then you got to have your evidence in order.

Reminds me of Don Knuth taking a little time to ensure the math papers in the math society journals were correctly printed, along with ensuring his books were error free, not just from his own errors but at all steps along the way. He then took time off, a year or so, to develop tools.

Prompted, and enabled, by essentially the first computer-controlled laser printer.

His time off for this side project ended up being years, with large blocks of work over decades.

He produced TeX (Leslie Lamport produced LaTeX built on top).

METAFONT to create suitable font sets and glyphs.

CWEB, when he was forced to change programming languages for porting to Stanford's new computer.

User guides, for each, plus publishing the code so it was readable and logical.

Then, as he wrote TAoCP, he wrote code to document each algorithm, plus code for many problems, plus published papers as he found new things. Then books that collected his papers on what he had been doing, how he did it, and general observations.

And the book that started it in the 60s, which grew to a planned seven volumes by the time he published the first edition circa 1970, is now five volumes, with three published in multiple editions, volume 4 looking to be at least four books, and volume 5 perhaps just one book. And the process has resulted in 20-30 other books.

TeX math mode is replicated or copied in style in most document systems capable of doing math equations correctly.
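For readers who have never seen it, this is roughly the math-mode syntax the comment refers to; the same `$...$` and `\[...\]` delimiters (or close variants) are recognized by MathJax, Jupyter, and many Markdown dialects. The regression formula is just an illustrative example:

```latex
% Inline math uses $...$; display math uses \[...\].
The OLS estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top y$,
which solves
\[
  \min_{\beta} \; (y - X\beta)^\top (y - X\beta) .
\]
```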

But the dream was of paper authors supplying everything to editors and reviewers, with the author's final work flowing unchanged into the code that was sent to the publisher for printing. Others did projects to create ebooks, online web pages, etc.

But the discipline required of authors and volunteer editors and reviewers is too great, so I wonder whether anyone under the age of 30 has heard of LaTeX, or anyone under 40 has used it.

The cost/benefit is too high. Is the AEA going to charge the author $5-10k, or his employer $100k, to accept the paper, to pay stipends to the volunteers for their time lost to the process rules?

I recently used LaTeX for my bachelor's and still use it for my resume. I believe most of my fellow comp. sci. students did as well.

Why does the government publish the policy? How do they carry it out? What's the result of it?

(1) In math, aren't TeX and LaTeX universal? I think some journals may require submissions to be in it. I thought it was universal in econ too, but maybe I'm wrong. Of course, many economists use some kind of WYSIWYG interface.
(2) Science is much worse than economics in terms of authors providing data for replication. In economics, this has been standard courtesy, even before journals started requiring it. In science, my impression is that it is treated as extraordinary self-sacrifice. See Climategate-- half of that was the leaking of the abominable code of the leading temperature data providers.
(3) The new requirement is like my making my seniors do their problem sets. They know they ought to do them anyway to pass the tests, but it's hard to do without a bump. We all know we'd be happier in the end if we organized our data and code better, so we ourselves could replicate what we did, but human nature is such that we usually don't. So I don't think our effort will increase; I think it will fall, if we are required by the journal to get it so that at least we give them some data and code that generates what's in our tables.

TeX and LaTeX are wonderful in many ways, but they are not solutions to the problem, as they are images of the math, rather than the actual instructions to the computer.

Paul Romer recommends Jupyter Notebooks for reproducible research. See https://paulromer.net/jupyter-mathematica-and-the-future-of-the-research-paper/

"It takes an immense amount of work to document in a replicable way every step of the empirical process. "

While that's true, it's also fairly routine in the engineering world.

Because, you know, engineers who can't replicate their work for the customer stop getting paid.

Granted, I'm talking about third party engineering. It's not uncommon for internal engineers to not document the process thoroughly, though it would still be considered substandard.

I wonder about the rate of software defects for projects that are more than a couple of Excel tabs of analysis. I'd be curious whether it's common practice to use code built up by several generations of grad students for this type of work.

It's been a while since I read Capers Jones, but the data indicates that even pretty modest-sized commercial software (say 50 KLOC) will have a significant number of bugs, particularly software developed without the benefit of systematic defect prevention (design reviews, inspections, etc.). Best-in-class development efforts have fewer, but still significant, numbers. Defect repairs have a fair chance (~10-30%) of injecting new defects.

Defects in delivered code are notoriously expensive to detect. The Data Editor is going to be busy.

Here are some numbers from actual industry experience:

https://swreflections.blogspot.com/2011/08/bugs-and-numbers-how-many-bugs-do-you.html
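Given defect rates like those above, one cheap defense is to embed sanity checks directly in analysis code, so data-handling bugs fail loudly instead of silently corrupting a table. This is only a sketch; the column names and plausibility bounds below are invented for illustration:

```python
def check_analysis_file(rows):
    """Cheap defensive checks that catch many data-handling bugs
    before they propagate into a paper's tables. `rows` is a list of
    dicts with hypothetical "id" and "wage" columns."""
    assert rows, "analysis file is empty"
    ids = [r["id"] for r in rows]
    assert len(ids) == len(set(ids)), "duplicate observation ids"
    for r in rows:
        wage = float(r["wage"])  # also fails fast on non-numeric values
        assert 0 <= wage < 1000, f"implausible wage: {wage}"
    return True

# Usage on a toy analysis file:
sample = [{"id": 1, "wage": "21.5"}, {"id": 2, "wage": "19.0"}]
check_analysis_file(sample)
```

The point is not the specific checks but the habit: every assumption the later analysis relies on gets asserted where the data enters the pipeline.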

I'm an engineer working for a large consulting firm. Customers don't really care how reproducible our work is, but it is a desirable feature for quality and efficiency. A checker needs to be able to follow the designer's line of reasoning, unless she is doing a fully independent check by performing all the calculations herself. Making work reproducible requires greater effort when doing it the first time, but decreases work for similar future tasks.

Difficult, quantitative, useful task -> Hire a man for success
Political, organizational, useless task -> Hire anyone or anything

Keep in mind that most academic papers are flops, even if they get published. For the typical paper, nobody cares enough to replicate data analysis or check the proofs, and that's fine, because the paper is not important enough. So the new policy is important because (a) for papers that become influential, replication *is* important, (b) it's nice to have the files available as teaching examples (part of the MIT econometrics program in my day (1982) was a take-home where you critiqued a published paper), and (c) to incentivize the author to be careful for his own sake.

+1 Keeping the data in a replicable format and checked for errors will also permit later researchers, with permission, to use the data for other purposes as well, such as meta-analysis, or running the same study in a different period with different data for that period.

I once tried to replicate a study which clearly stated it used Census SIC data, which you would think would be easy, but the author had combined SIC codes without disclosing how he did it. Data cleanup and how you handle missing data can affect outcomes, so anything that makes the process clearer is to be applauded.

How about extending the policy to theory papers, including econometric theory? I'd like having the proofs laid out step-by-step in an online appendix, with an AER special editor making the author clarify all the hardest steps where the author is tempted to say "Obviously,..." or "Using standard methods,..." or even just "It follows that..." when it doesn't clearly follow.

I do agree, but this problem goes all the way down to undergrad math textbooks where being clear and explicit is supposed to be a primary function. Baby Rudin is a classic example, but many view the book's lack of clarity as a *positive*, as some sort of (idiotic) rite of passage.

I wrote a paper with a math professor, but it was rejected by econ journals, so we decided to send it to a math journal. My co-author told me we had to cut out most of the explanations, though, because the referees would be touchy about any implication that they might need a little help. See:

Christopher Connell and Eric B. Rasmusen, "Concavifying the Quasi-Concave," Journal of Convex Analysis, 24(4): 1239-1262 (December 2017). We show that if and only if a real-valued function f is strictly quasi-concave except possibly for a flat interval at its maximum, and furthermore belongs to an explicitly determined regularity class, does there exist a strictly monotonically increasing function g such that g of f is concave. We prove this sharp characterization of quasi-concavity for functions whose domain is any Euclidean space or even any arbitrary geodesic metric space. http://rasmusen.org/papers/quasi-short-connell-rasmusen.pdf or in the longer working paper draft with more explanation, http://rasmusen.org/papers/quasi-connell-rasmusen.pdf

"The AEA has long had a data repository but no one was responsible for examining the data or replicating a paper’s results and confidential data was treated as an exception. "

This is not entirely true. It was my job from roughly 2011-2016 or so to review the submitted materials for reproducibility. To this end, I would try to run the code submitted on the data submitted and see if the output matched the paper. If I didn't have the right software tool or the data was proprietary, I was obviously unable to do that and so had to rely on a review of the code to see if it looked like it was doing what was described in the paper. I also was responsible for logging the data sources, the methodology, etc. as described in the methods section of the paper. I caught a fair few mistakes this way, but it seems like the Data Editor's mandate will be broader and they will be given more resources to truly ensure reproducibility (i.e. software licenses, access to confidential/proprietary data).

Also, that the archives will be requested prior to acceptance is obviously a huge difference.

I would like to suggest Mathematica as a computational tool that simultaneously documents your work. It shows which files you read in (and shows the data in-line if you want), how you cleaned and summarized the data, which stats you used, and your plots and tables, all in a Mathematica notebook that can be saved as a pdf for others to read.

Comments for this post are closed