What do Data Scientists Do?

on January 17, 2013 at 10:33 am in Data Source, Economics

Andrew Gelman edits The Statistics Forum, a blog of the American Statistical Association. He is looking for a series of guest posts about what statisticians/econometricians/data scientists do during their day: not the theory but the nitty-gritty of what a typical day, with its frustrations and triumphs, looks like. Kaiser Fung’s post, “Three hours in the life of a (glorified) ‘data scientist’,” could be a model. Submissions from stats people in any field (academia, business, non-profit, etc.) are all welcome. Write Andrew here.

prior_approval January 17, 2013 at 10:52 am

With enough collected information, it’s statistics all the way down – whee!

Sam January 17, 2013 at 11:38 am

I’m an Econ undergrad and my first computer science semester consisted of ways of editing data cells en masse when the format isn’t recognized, given a bunch of constraints on which techniques we were allowed to use. Those intro skills are evidently invaluable.

Scott Cunningham January 17, 2013 at 11:43 am

Great idea. My favorite is when I never fully learn something and so have to occasionally spend three hours doing something that I should know by now. This is almost always the case when working with date variables that have string characters like “/”. By the time I finally remember/relearn the syntax, I’m a pro. But then I don’t have to actually do that again for a while, as not every project has raw data with string characters in the dates. That drives me nuts, as it feels like Groundhog Day.

Randy B January 17, 2013 at 12:20 pm

Yeah, that sounds exactly like what people that work in data do all day.

Doug January 17, 2013 at 1:23 pm

Protip: Always store dates in YYYYMMDD format. That way even if it’s not recognized as a date format comparing dates is trivially done by comparing the int value of YYYYMMDD.
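
A minimal sketch of that tip in Python. The language choice, the sample strings, and the helper name `to_yyyymmdd` are illustrative assumptions, not something from the thread. It parses a slash-delimited date string with the standard library, turns it into a YYYYMMDD integer, and then compares the integers directly.

```python
from datetime import datetime


def to_yyyymmdd(text, fmt="%m/%d/%Y"):
    """Parse a date string and return it as a YYYYMMDD integer (hypothetical helper)."""
    parsed = datetime.strptime(text, fmt)  # raises ValueError on malformed input
    return parsed.year * 10000 + parsed.month * 100 + parsed.day


earlier = to_yyyymmdd("1/17/2013")
later = to_yyyymmdd("12/3/2013")

# Integer order matches chronological order because the year is the most
# significant part of the number, followed by month and then day.
assert earlier < later
```

The comparison works only for ordering. Arithmetic such as "days between two dates" still needs a real date type.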

AVX January 17, 2013 at 3:46 pm

That is a silly suggestion. Most/All databases store dates in an internal format. You are basically suggesting to store dates as strings which in the database world is the silliest thing one can do.
The person in that link just did not bother to read the manual, which would have taken him 10 minutes to consult and would have solved his issue.
Heck, he calls himself a regular expressions guy, and regular expressions, had he used them, could have solved his problem much faster.

dr January 17, 2013 at 4:13 pm

Not so silly when you consider that most/all databases use different conventions for storing dates in an internal format.

msl January 17, 2013 at 4:17 pm

@The pro-tip is fine. Don’t be nasty. We don’t need to hear about your programming penis.

AVX January 17, 2013 at 6:45 pm

Relax. I was not being nasty. Calling something silly is not exactly being nasty. I was just pointing out that it is not exactly a pro-tip, instead it is a mistake commonly made by amateurs. Reason being that most databases have powerful date functions which would be hard to use if you don’t store the dates/data properly. That and storing something in a proper format reduces the chances of corrupt data being present.
This has nothing to do with programming. The above applies to Excel also.

Rahul January 17, 2013 at 10:41 pm

I’m assuming you are familiar then with the Teradata manual? Can you present how you would do it?

Avx January 18, 2013 at 12:00 am

@Rahul. Google: teradata date …. Or your flavor of database and date

Abelard Lindsey January 17, 2013 at 2:07 pm

I don’t know about “data” scientists in particular. I would say that scientists, in general, are people who study and use science to invent new things, with the purpose of science being to invent new things. Government funding (e.g. taxpayer funding) of science that does not lead to worthwhile invention seems quite useless to me.

Chris January 17, 2013 at 2:50 pm

And how exactly do you determine whether science leads to “worthwhile invention”? Basic science research can take years, if not decades, to lead to truly innovative products.

Abelard Lindsey January 17, 2013 at 8:10 pm

True, but a lot of what is being funded is clearly rent-seeking parasitism. The tokamak fusion program has no hope of leading to commercial fusion power and NASA has not done any useful research at all intended for opening up the high frontier for human settlement. I personally know people who have worked in each of these milieus. They tell me that it is nothing more than “pork”, social welfare programs for people with PhDs (these are their words, not mine). Most medical research is similarly bogus. It is the experience of my friends working in these fields that has convinced me (and them) that ALL government funding for scientific research should be eliminated.

Chris January 18, 2013 at 2:10 pm

I would argue that the vast majority of government funded research (at least from NSF, NIH, EPA, USDA, NASA, etc.) is exactly the opposite of rent-seeking. Perhaps defense is another story, but compared to the standard appropriation process, government research monies are generally awarded through competitive award systems. It is incredibly simplistic to say that NASA’s research program has failed based on the observation that we are not very close to building extra-earth settlements. See here for some examples of significant innovations that have followed from NASA research: http://en.wikipedia.org/wiki/NASA_spin-off_technologies . Our success as a country is due in no small part to government-funded research. There is no way you can honestly argue that fact.

dead serious January 17, 2013 at 2:46 pm

ETL is your friend, friend.

Baphomet January 17, 2013 at 2:52 pm

It is very important that people in academia say as little as possible about what they actually do on a typical day. If the man in the street ever finds out, it will all be over.

AVX January 17, 2013 at 3:55 pm

A data scientist who knows the tools/functions for transforming data and who knows regular expressions will be far more effective at his/her job. I’m shocked at the number of people who work as data scientists who don’t know those tools and who then suffer through their day battling simple problems just to be able to do some work.

Rahul January 17, 2013 at 10:42 pm

Actually, I’m glad that is so. Competitive advantage. :)

Foobarista January 17, 2013 at 3:56 pm

In my experience, the workflow of a data scientist is something like

1. Figure out something to investigate.
2. Figure out where the data that may be useful for the investigation happens to reside.
3. Find a way to get the data out of these datastores.
4. Figure out how to implement the code that does the analysis.
5. Come up with the schema in the analysis datastore.
6. Load the analysis datastore.
7. Run the analysis.
8. Look at the analysis.

The tricky part of being a data scientist is that this is extremely interdisciplinary work, both in technology and business: you have to understand the schemas of the source data, which may be in many different databases or data stores (or often in log files), and you have to get access to it all, which may involve impressive levels of organizational hoop-jumping, especially in big organizations. Add privacy laws and suchlike, which may require filtering or preprocessing the data, and things get complex. You also have to be a fairly skilled software developer and database/schema designer to do the “middle tasks”, and have to understand enough about the business at a high level to know what questions to ask. Finally, the statistics part is obviously a big discipline.

Few companies have enough resources assigned to “data groups” to have “grad student + professor” organizations where you can have a high-level “pure scientist” directing a bunch of coders and db guys to do what looks like grunt work to outsiders, but requires a deep understanding of the problem of interest to actually do correctly, so typically data scientists end up doing it all, and end up spending most of their time “fighting with the tools”.
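
As a rough illustration of steps 3 through 8 in the list above, here is a toy sketch in Python using only the standard library. Everything specific in it is an assumption made for the example: the CSV text standing in for an export from a source system, the `orders` table and its columns, and the function name `run_pipeline`. Real pipelines involve far messier sources and access problems than this suggests.

```python
import csv
import io
import sqlite3

# Stand-in for data pulled out of a source system (step 3); the columns are made up.
RAW_EXPORT = """customer,order_date,amount
alice,2013-01-15,20.00
bob,2013-01-16,35.50
alice,2013-01-17,12.25
"""


def run_pipeline(raw_text):
    """Load a CSV export into an in-memory analysis store and total spend per customer."""
    conn = sqlite3.connect(":memory:")
    # Step 5: come up with the schema in the analysis datastore.
    conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
    # Step 6: load the analysis datastore.
    rows = [
        (r["customer"], r["order_date"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(raw_text))
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    # Step 7: run the analysis.
    result = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


# Step 8: look at the analysis.
for customer, total in run_pipeline(RAW_EXPORT):
    print(customer, total)
```

Even in this toy version, most of the lines deal with getting data into shape rather than with the analysis itself, which is a single query.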

msl January 17, 2013 at 4:18 pm


Marc Roston January 17, 2013 at 6:02 pm

“Data Scientist” is an odd term.

In the dark ages of the 1990s, I recall the label “data mining” being a cross between calling someone an outright idiot and shorthand for “you have no theory, so your hypothesis tests will be completely invalid once you estimate something”.

By the early 2000s, “Data Mining” seemed to transform into a highly valued skill, especially by credit card companies and insurers, looking for correlations that may or may not have structural reasons for existence.

So, today, that same person has been elevated to “Scientist”?

SC January 17, 2013 at 10:11 pm

That’s about how I remember it. I think what happened is that somewhere along the line people figured out that we “formulate hypotheses” about which “mathematical models” will allow us to “make predictions” about “unknown processes,” and we “analyze empirical data” to “test those hypotheses.” Not sure what you’re supposed to call a person who does that sort of work, but I guess somebody said “scientist” and it caught on. Coincidentally, that’s exactly what all the “real scientists” I know do.

John January 18, 2013 at 5:39 pm

Depends on what you’re trying to do. If your task is coming up with a theory to explain some behavior, then yes, data mining is not the way to go. If your task is to discover some behavior, then data mining makes a lot of sense and is just good empirical legwork.

John January 18, 2013 at 5:43 pm

I suspect what the article is really pointing out is that there will be a lot of mundane, ad hoc work one has to do just to get to the point of performing analysis, or doing whatever science is being done.

The ETL challenge is a real one and I’m sure there’s both been and continues to be significant scientific research into the problem.

Jacob AG January 19, 2013 at 8:18 pm

I couldn’t find an RSS feed, so I made one. Here: http://feeds.feedburner.com/wordpress/StatisticsForum

Comments on this entry are closed.
