The Battle over Junk DNA

Last year the ENCODE Consortium, a big-data project involving 440 scientists from 32 laboratories around the world, announced with great fanfare that 80% of the human genome was functional or, as the NYTimes put it less accurately but more memorably, “at least 80 percent of this DNA is active and needed.” What the NYTimes didn’t say was that this claim was highly controversial to the point of implausibility. A fascinating, sharply worded critique, “On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE,” has recently been published by Graur et al. Here is some of the flavor:

This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.

Graur et al. make a number of key points. ENCODE essentially defined “functional” as sometimes involved in a specified set of biochemical reactions (I am simplifying!). But every microbiological system is stochastic and sooner or later everything is involved in some kind of biochemical reaction even if that reaction goes nowhere and does nothing. In contrast, Graur et al. argue that “biological sense can only be derived from evolutionary context” which in this case means that functional is defined as actively protected by selection.

There are a variety of ways of identifying whether a sequence is actively protected by selection. One method, for example, looks for sequences that are highly similar (conserved) across species. Once evolution has hit on the recipe for hemoglobin, for example, it doesn’t want that recipe messed with; thus the chimp and human DNA that blueprints hemoglobin is very similar and not that different from that of dogs. In other areas of the genome, however, even closely related species have different sequences which suggests that that portion of the DNA isn’t being selected for, it’s randomly mutating because there is no value to its conservation. When functional is defined using selection, most human DNA does not look functional.
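To make the conservation idea concrete, here is a minimal sketch that scores how similar two already-aligned sequences are. The function name and the toy sequences are made up for illustration (they are not real genomic data), and real conservation analyses use proper alignments and evolutionary models rather than a raw match count.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy sequences, invented for this example.
# A region under selection: the two species differ at a single position.
conserved_a = "ACGTACGTTAGCCGATAGCT"
conserved_b = "ACGTACGTCAGCCGATAGCT"

# A region free to drift: the two species differ at many positions.
drifting_a = "GATTACAGATTACAGATTAC"
drifting_b = "GCTTGCAAATCACTGGTTCC"

conserved_score = percent_identity(conserved_a, conserved_b)
drifting_score = percent_identity(drifting_a, drifting_b)

print(f"conserved region: {conserved_score:.1f}% identical")
print(f"drifting region:  {drifting_score:.1f}% identical")
```

A high score across distant species is the kind of signal that suggests selection is protecting a sequence; a low score between close relatives suggests it is not.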

It’s also interesting to note that the size of the genome varies significantly across species but in ways that appear to have little to do with complexity. The human genome, for example, is about 3 Gb (gigabases), quite a bit more than the fruit fly at 170 Mb, but the onion is 17 Gb! (One of the reasons that onions are used in a lot of science labs, by the way, is that the onion genome is so big it makes onion nuclei large enough so that you can easily see them with low-powered microscopes.) Now one could argue that this lack of correlation between genome size and complexity is simply a result of an anthropocentric definition of complexity. Maybe an onion really is complex. Two points counter this view, however. First, similar species can have very different genome sizes. Second, we know why the genome of some species is really large. It’s filled with transposons, so-called jumping genes, sometimes analogized to viruses, that cut and paste themselves into the genome. Some of these transposons, like Alu in humans, are short sequences that repeat themselves more than a million times. It’s very difficult to believe that these boring repetitions are functional (n.b. this is not to say that they don’t have an effect). The fact that it’s this kind of repetitious, not-conserved DNA that accounts for a large fraction of the differences in genome size is highly suggestive of non-functionality.
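The size comparison is easy to check with a few lines of arithmetic. This is only a sketch: it uses the three figures quoted above and reads them as base counts (an assumption about the units), and the dictionary and function names are my own.

```python
# Genome sizes in megabases (Mb), using the figures quoted in the post
# and assuming they refer to numbers of bases.
GENOME_SIZE_MB = {
    "fruit fly": 170,
    "human": 3_000,
    "onion": 17_000,
}


def size_ratio(species_a: str, species_b: str) -> float:
    """How many times larger species_a's genome is than species_b's."""
    return GENOME_SIZE_MB[species_a] / GENOME_SIZE_MB[species_b]


# List the genomes from smallest to largest, relative to the human genome.
for name, size in sorted(GENOME_SIZE_MB.items(), key=lambda item: item[1]):
    print(f"{name:>9}: {size:>6} Mb ({size_ratio(name, 'human'):.2f}x human)")
```

On these numbers the onion genome is between five and six times the size of ours, which is hard to square with any ranking by complexity.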

Why discuss such an esoteric (for economists) paper on a (nominally!) economics blog? The Graur et al. paper is highly readable, even for non-experts. It’s even funny, although I laughed somewhat sheepishly since some of the comments are unnecessarily harsh. Many of the critiques, such as the confusion between statistical and substantive significance, arise in economics and many other fields. The Graur paper also makes some points which are going to be important in economics. For example they write:

“High-throughput genomics and the centralization of science funding have enabled Big Science to generate “high-impact false positives” by the truckload…”

Exactly right. Big data is coming to economics but data is not knowledge and big data is not wisdom.
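A back-of-the-envelope sketch shows how false positives can arrive "by the truckload" when many hypotheses are tested at once. It assumes independent tests in which no real effect exists, and the test counts are hypothetical; nothing here is taken from ENCODE or from Graur et al.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of false positives if every null hypothesis is true."""
    return num_tests * alpha


def chance_of_any_false_positive(num_tests: int, alpha: float) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / num_tests


ALPHA = 0.05  # the conventional 5% significance level

# Hypothetical numbers of tests, from a small study to a big-data project.
for num_tests in (20, 1_000, 1_000_000):
    print(
        f"{num_tests:>9} tests: "
        f"about {expected_false_positives(num_tests, ALPHA):,.0f} false positives expected, "
        f"Bonferroni threshold {bonferroni_threshold(num_tests, ALPHA):.1e}"
    )
```

The Bonferroni correction is only one of several ways to adjust for multiple testing, and a conservative one, but it makes the point: the more data you sift, the stricter the bar has to be.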

Finally, the Graur paper tells us something about disputes in economics. Economists are sometimes chided for disagreeing about the importance of such basic questions as the relative role of aggregate demand and aggregate supply but physicists can’t even find most of the universe and microbiologists don’t agree on whether the human genome is 80% functional or 80% junk. Is disagreement a result of knaves and fools? Sometimes, but more often disagreement is just the way the invisible hand of science works.

Hat tip: Monique van Hoek.


Great post. Your comments on economics and the scientific method are well-taken.

"In other areas of the genome, however, even closely related species have different sequences which suggests that that portion of the DNA isn’t being selected for, it’s randomly mutating because there is no value to its conservation. When functional is defined using selection, most human DNA does not look functional."

Another explanation would be the presence of selection for a different allele of the gene (i.e., the allele has a different, more advantageous function in other closely related species). Neutrality tests are usually pretty good about detecting this type of selection.

I am very skeptical of claims about the human genome at this point. In plants, where we regularly have genomes sequenced and can perform controlled breeding experiments to detect a "causal" gene, we often have difficulty getting closer than a few centimorgans for some genes. In humans, power for detection is much less since forced mating experiments are generally frowned upon.

I would not frown upon some forced mating right about now.

JR, Big Bertha. Big Bertha, JR.

Onions don't seem complex, but they never fail where our writers are only occasionally successful.

What is your definition of success? Very few writers end up being sliced into pieces and/or eaten alive. Though they do occasionally get skewered.

Dickens tries to make me cry, but an onion usually succeeds.

Yes, there are huge open questions in every scientific domain. But at least physicists and biologists can _occasionally_ offer useful predictions.

I was reading the Atlantic story on "we will never run out of oil" and I realized that on that important question (peak oil), the physicists have been repeatedly wrong and the economists repeatedly correct.

(Or you could say that the physicists have been right but they repeatedly answer the wrong question).

By 'physicists', do you mean 'geologists'?

By geologists, you mean geophysicists, who are far more likely to believe things such as peak oil than the kinds of geologists who interpret geophysical data. Geophysicists have the arrogance of all physicists, this is especially the case because most stratigraphers and structural geologists in the oil industry are trained in a different track in undergraduate and barely socialize.

OK, we're off-topic now, but now that you mention the Atlantic piece:

I was struck by the power of priors and narratives here:

"Some scholars today doubt how much the Netherlands was actually affected by Dutch disease. Still, the general point is widely accepted. A good modern economy is like a roof with many robust supporting pillars, each a different economic sector. In Dutch-disease scenarios, oil weakens all the pillars but one—the petroleum industry, which bloats steroidally."

The first sentence refers to the 'case study' from which 'Dutch disease theory' sprang. Despite the single case study now being questioned, the author assures us that the theory itself is sound.

Top it off with a wonderful just-so story about a tinkertoy economy as seen from the top-down, and Voila!

Physical systems are much easier to understand, at least outside the quantum/cosmic levels, than societies. The difficulty of prediction of most of the physical questions you're thinking of is comparable to the difficulty in predicting whether a free-market or central-command economy will generate more output, e.g. the outcomes of Chinese economic reform.

Physicists are often more enamored of command economies than biologists, since they tend to think that they will be in charge of the planning.

'by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,”'
So much for the notion that science should be written in English.

There is no way around the fact that science generates new concepts, and that these need names.

Of course; but there is always the option of using names that are precise and self-explanatory.

Actually the difference is perfectly understandable and easily explained. Garbage is what you want to actively get rid of. Junk is the stuff in your attic that isn't useful now but might someday be useful, so you hang on to it a bit.

Oi! AT, are you claiming that economists can find the rest of the universe?

As a biologist, I was happy to find that this was all correct.

'In contrast, Graur et al. argue that “biological sense can only be derived from evolutionary context” which in this case means that functional is defined as actively protected by selection.'

So, human DNA involved in the appendix - functional or non-functional? Semi-functional? Just imagine it to be functional or non-functional, so the question disappears?

'physicists can’t even find most of the universe'

Nope - the most consistently accepted model(s) is/are the one(s) with the problem. Which might just lead a good physicist to question the model, the questioning being done within a framework involving testing based on data (though some consider this all relative, of course). But that would be real science - something physicists tend to be unable to find in economics at all.

(Just for those interested in the appendix - 'It is widely present in the Euarchontoglires and has also evolved independently in the diprotodont marsupials, monotremes, and is highly diverse in size and shape.')

I think you have hit on an important underlying question: at what level do you define "functional"? If I do nothing but sit on the couch eating potato chips, is none of my DNA functional?

I see the point you're making about economics, but your comparison to physics is really off-base, since most of the groundwork is laid by the theoretical physicists who cannot conduct tests on their models. (Simple example: what happens to matter at the event horizon of a black hole - good luck testing that one!)

prior_approval doesn't care about your criticism. MR blog post = opportunity to insult MR blog publishers. The actual content is not so relevant.

Ah yes, the old slippery slope. What was absurd was the *original* idea that only a minute fraction of the genome is not "junk." Now, ENCODE didn't really teach us anything about the function of "junk" DNA (nor about many other things)--I found it mostly to be fluff designed to give some early confirmation that it wasn't a waste of money because the real work will take a very long time indeed. But that's a very different argument.

On the *other* hand, semantic debates over "function" are especially pointless. Do you know the completely unsatisfying way "function" is normally defined in *classical* gene studies, involving "deleting" and "overexpressing" as ways of claiming "necessity" and "sufficiency" without really telling us what the hell the gene does or how it does it? In any event, we will eventually come to understand the complex roles played by all the various non-classical mechanisms like microRNA editing and so forth, and yes, there will be silly ENCODE-like claims along the way but they needn't philosophically derail us.

Genomicists need to get back up with geneticists. Geneticists are always more hesitant to claim something as useless because they know genes are often environmentally sensitive. Many genes appear to have no function. Until they do.

Genomicists have so much data, they forget the basic rules of genetics and the genotype by environment interaction.

I'm not a biologist nor a doctor, but reading this brought to mind the revolution in understanding of the brain that opened up when they figured out that it didn't follow rules. Our brain is constantly changing and adjusting to the inputs it gets. It is far more complex than some mechanical model where this patch of cells handles vision, that one handles input from the left hand.

To look at the genome and decide that this is useful and this isn't borders on hubris if you don't understand in detail the process by which that genome is used, from the fertilization of the egg to its culmination in an adult male or female.

Yes. The whole 'junk DNA' thing has always struck me as hubristic. From a gene's-eye view, they're just playing a different game than their non-junky counterparts, and if the scoreboard is survival, they're doing ok.

Natural selection produces these hard-to-fathom things all the time: breathtaking economy here, massive redundancy there, friends and allies that are also rivals and competitors. It's a rich tapestry.

And Alex, the analogies to economics are, IMO, profoundly apt. Look no further than the word 'niche'.

I will say that you and Scott Sumner get the wrong end of the stick with your 'science envy'. Science cannot be measured by the vast amount of stuff it doesn't know, but by the comparatively small, tentative, but spectacularly successful things we have learned. This is why economics is still milling around near the starter's gun compared to real science. Sorry.

Tyler isn't interested in predicting bubbles - and even if he was, economists seem absolutely horrible at predictions - and calling together a group of 10 economists to explain the cause of the 2008 meltdown would yield 10 different answers.

Remind me again what practical use this field of study has?

I think we have gleaned a lot of insights from microeconomic study, including behavioral economics.

Macro, to me, is what used to be called, more honestly, "political economy". There is a huge demand for answers to questions of political economy, always has been.

It is not the fault of economists that the supply of high-quality answers has been lacking; it's the nature of the beast. But "we don't know" isn't really helpful, so the supply of one-armed economists burgeons nowadays.

It is interesting to see the stung reaction to this study by Graur et al, because the idea that the amount of useless DNA is large gives moral support to those who believe that life has developed through random processes. Any study that undermines that belief must be attacked. Take an argument by a theist that God created the living creatures that exist and have existed and substitute "evolution" everywhere the word "God" appears and you have the basic world view of the believers in random evolution.

still crazy after all these years

3GB for a human, 17GB for an onion. That's some intelligent design right there...

We could fit several complete human genomes in an onion. That might be a good backup, if something were to happen to us. I wonder if any species has already done that.

Except one is a fairy tale and the other you can see evidence of with your own lying eyes.

Other than that, exactly the same.

Haven't really thought very deeply about this, have you? The random evolution paradigm is a story about how things have come to be as they are over billions of years. The actual events that occurred cannot be observed. When you ask evolutionists how life arose, for example, they have a lot of speculation, but little evidence. When someone like Michael Behe describes the complexity of the flagellum, and questions how something can arise that cannot work until all parts are in place, you get just-so stories of how, maybe, it could happen. And ridicule, which is a key weapon in their armory.

There is a touching belief in the purity of science by the uninformed (even those who are intelligent enough to know better). The actual practice of science is a good deal messier and less precise than commonly believed. At bottom, the war over junk DNA is more than just competing theories, but a deeper philosophical battle grounded in ideas beyond science.

I recommend The Origin of Species, by Charles Darwin. Any educated person can read and understand it (unlike, say, Newton, let alone 20th century physics.)

You are correct that biology, as an historical science, is more like 'social science' than physics is, but we really do know quite a bit with a high degree of confidence in biology.

Biologists definitely don't know everything and don't have all the answers. However, the theory of evolution has been fairly well substantiated, regardless of whether the nitty-gritty details are fully understood.

Isn't that also true of magnetism? We can prove it exists as a force without knowing every last detail of how it works. Can't say the same for "heaven," or "God," for example.

There have been several other critiques of the ENCODE work recently - you get the impression that several people started on manuscripts right after all the big headlines hit, because they couldn't stand it all.

As you'd imagine, this fight is also playing out inside the pharmaceutical industry, where we're very keen on finding new drug targets and avenues to treat disease. Opinions are pretty sharply divided over here, too, although (if I had to put money down), I'd say that the "80% functional" idea is probably losing out. Here's a recent post on this, with some links to other critiques:

Since life is way, way more complicated than anyone ever imagined, it may turn out that ENCODE and its detractors are both somewhat right. It is easy to get primary sequence data (the order of the A, T, G and C bases that make up the genome). Sequences are something humans can get their heads, and computer programs, around. But other parameters, such as flexibility, or something even less salient, could be selected for. Lots of different primary sequences could do the job in that case. I suspect we will recognize a smooth gradient from strict sequence requirements (to code for enzyme active site residues) to easily identified sequence families (promoter sequences for example where the data can be summarized with a sequence logo) to sequences where some physical property is important but the pattern is hard to spot, to real junk due to things like recent retroviral insertions.

Why is the "how much junk" question interesting? It strikes me as a pop-science concern, or possibly something for religious arguments. I'd think though that for much of science you simply wouldn't care. You'd be looking at "delta" between species and populations. In that you could never truly exclude inactive seeming genes. (I guess if in onions junk DNA was a protein store used in spring growth that would be interesting.)

It's interesting on several levels: one of them is sheer bafflement and scientific curiosity. You'd think that the handling of DNA, as crucial as it is to cellular (and organismal) reproduction, would be under very strong selection pressure. And in fact, almost all of its "subroutines" are very highly conserved across huge stretches of living creatures (replication, transcription, error-checking, unwinding, coiling of coils around histone proteins, etc.)

So why, then, does the amount of apparent junk in the DNA vary so spectacularly? You have fish that have almost none of it (the fugu pufferfish) and fish that have vast piles of it (the marbled lungfish). The same goes for plants, for mammals, etc. (To add to the mystery, the prokaryotes - such as bacteria - don't seem to have anywhere near this level of variation. Why not?)

How come this stuff hasn't been worn away by natural selection - is there any kind of penalty at all for stuffing your genome full of shredded phone books? If not, why the heck not? Or is this stuff not as worthless as it appears - if so, what's it doing? And why do some organisms have so much more of it than others? We're missing out on some fundamental ideas somewhere.

If some (or most!) of that "junk" turns out to be functional, well, people like me in the drug discovery business suddenly have a lot more possible targets to mess with. You'd have to think that these sequences would have consequences for human disease, and could (in theory) be affected by various therapies. And the biologists suddenly would have extra layers of regulatory complexity to untangle - it would mean that there are some really key things about molecular biology that have (up until now) been dark.

Thank you, the pufferfish vs lungfish conundrum does make it sound a bit more interesting. I had thought that gene mechanisms would be heavily researched but that the junk percentage itself was background (more in plants, less in animals).

Is disagreement a result of knaves and fools?

Yes, if you wear partisan blinders and have an excessively large ego.

"Big data is coming to economics but data is not knowledge and big data is not wisdom." T.C. (above)

"Information is not knowledge.
Knowledge is not wisdom.
Wisdom is not understanding." RRS

"Where is the Life we have lost in living? Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" T. S. Eliot

"I opened the doors for you … showed you how the system works … the value of information … how to get it!"

Sorry, first quote is by Alex, not T.C.

"....more often disagreement is just the way the invisible hand of science works."

This seems to imply too much rationality to scientific paradigm victory. After reading Kuhn's "Structure of Scientific Revolutions" and any amount of the Sociology of Science, it's hard to accept that it is always rational.

Of course it is not always rational, it is just the best framework for rationality yet devised.

Science is rational. It is just practiced by often irrational creatures. You haven't lived until you sit in a conference of specialists arguing whether an organism is or is not a specific species.

With no dog in the fight, it is quite entertaining.

Not always, and blind alleys happen. In the end, though, I accept the Popperian view that the objectivity of science does not depend on the objectivity of the scientist. The truth, as it were, will out.

To the extent this isn't true, I think it says more about the difficulty of applying the methods of science usefully to the field of study than anything else (complex systems are a bitch.) In such fields, there is ample room for ideologues of all stripes posing as 'scientists'. Which I think is a useful summary of the current state of social 'science'. And climate 'science' for that matter.

Whether there is any junk DNA or not is a biological question, not one of economics or religion. What's the rush? Some DNA thought to be junk has later been found to serve an unanticipated and somewhat roundabout regulatory function. The human political function of religion, to rule and guide a family, tribe, or nation by ascribing rules, mitzvot, creation, to a 'god,' has been understood since ancient times. Let's keep science as safe from politics and political functions as we can.


This site knows about as much about genes, genomics and cellular metabolism as it does about economics.

Re junk DNA, perhaps you want to look at some of the network analysis of DNA gene interactions before confidently predicting anything, much less discarding what one group thinks is a defect in a junk gene on the periphery. Here is one link: What is on the periphery in fact is important, when combined with other defects.

Your claim of knowledge on this complex subject is a bit ironic: just finished a Coursera course on network analytical techniques, and my niece is the administrator of a very large genetics and biological research institution which is using network analysis for mapping gene interactions. If you read some of the network articles, like the one above, you would see that what is considered junk today may not be tomorrow. The researchers in this area would dispute your knowledgeable claim of what is junk and what isn't: it's time dependent, based on current knowledge. To give an example, some fetuses abort because of combinations of defective genes, including combinations of defective "junk" genes. Are they junk? (See article above at 8689) Similarly, disease, if you read the article, is not simply a state of yes/no, but rather combinometric with a whole bunch of interactions.

Burn the books, or look inside of them. Junk today may not be junk tomorrow. And knowing what is and is not important is also valuable knowledge.

If you want to watch a TEDMED video on the subject of complex gene interactions and network science, here is a great presentation at TED by Barabasi: there is so much that is unknown, and calling something junk when you don't know is just a folly:

Here is the video, click on the Barabasi presentation:

Prof. Larry Moran, at his blog, has done a terrific series of posts on the spurious and grandiose conclusions many have drawn from the ENCODE data. Moran, a strong and persuasive proponent of the high percentage Junk DNA view, almost has as much fun with the high-percentage-functionality proponents as he does with intelligent design advocates.

The larger point is right but just to be clear -- biologists (not "microbiologists" really) _can_ agree and generally _do_ agree -- it's 80% junk. This is one of many good attacks that have been made on the idiotic ENCODE methodology (although I think this paper, which I read a couple months back, is, um, a bit too unprofessional). This isn't "controversy."

Junk DNA is important because creating long chains of nucleotides is energy expensive. There is considerable evidence that certain genera and even classes have reduced the length of their DNA over time. I have personally reviewed currently unpublished research tying this to the development of flight, but it appears in other contexts. I personally don't believe that the energy cost of what we call junk DNA is all that critical in large complex organisms, but it is important. I am also very suspicious of the entire concept: we know way too little about how DNA functions, particularly in relation to the evolution of the cell, to be so arrogant as to assume that we know that this material is useless.

Over the centuries body organs such as the tonsils and appendix have often been regarded as vestigial organs, but recently we have discovered their roles in protecting the lungs from infection and providing a safe refugium for gut flora; in other words, we as scientists assumed we knew much more than we did.

Also as regards to random evolution, my own work in micropaleontology has shown me that competition is an incredibly efficient designer. Poor design is relentlessly punished over the long term and almost everything is both perfectly designed for its niche and the product of relentless competition.

Big Data came to high-energy physics decades ago. Their solution involved learning a lot more about statistical analysis techniques, and cranking up the acceptable bound of "statistical significance" from 95% to 99.99997%.
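The 99.99997% figure lines up with the "five sigma" convention used for discovery claims in particle physics, if you read it as a one-sided tail of a normal distribution (that reading is my assumption). A quick sketch using only the standard library:

```python
import math


def one_sided_tail(sigma: float) -> float:
    """P(Z > sigma) for a standard normal variable Z."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))


# 1.645 sigma is roughly the one-sided 95% level; 5 sigma is the physics bar.
for sigma in (1.645, 3.0, 5.0):
    p_value = one_sided_tail(sigma)
    confidence = 100 * (1 - p_value)
    print(f"{sigma:>5} sigma: p = {p_value:.3g}, confidence = {confidence:.5f}%")
```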

I suppose you could be concerned about Big Data if you relied upon frictionless mathematical theory, but why oppose using as much data as you can manage?

In the pricing space today, there are a few analytical software firms that basically manage pricing for retailers, model pricing based on local weather forecasts, predictive reactions of rivals, extremely intricate elasticity and cross elasticity measures....extremely complex software managing HUGE amounts of data, earning retailers significantly more for those who use it than those who do not.

Why diss Big Data with the statement: "Big data is coming to economics but data is not knowledge and big data is not wisdom." Data can give you knowledge, unless you solely rely on ideological theory. Why not be an empiricist?

This seems like an argument over words: "Junk", "Functional", etc. Why not just use accurate descriptions?

"Conserved", "Active", "Gene" seem much more useful because they're descriptive.
