What is a Data Scientist?

The world is now generating zettabytes of data annually, and that volume is only projected to increase. The spread of smartphones, the number of sophisticated yet cheap sensors, the near-zero cost of storage, and the investment dollars behind Big Data are pushing demand for data scientists to the forefront.

Software professionals are perfectly positioned to benefit from this trend. Plus, Big Data is a lot of fun to work with!

So, what is a Data Scientist?

The term data science has been around for more than 30 years. Data science has been called a combination of statistics, data munging, and visualization. It has also been drawn as a Venn diagram: the overlap of hacking skills, substantive expertise, and math and statistics knowledge. For the background, see the Forbes article "A Very Short History of Data Science."

In today’s fast-moving world of technology, data scientists draw on a combination of skills. A data scientist might be part DBA, part computer scientist, and part coder. A solid background in statistics, an understanding of research principles, and a critical mind are required. Data scientists may be involved with pattern recognition, data visualization, artificial intelligence, computer vision, and the tools needed to organize and make use of big data (NoSQL, Hadoop, etc.).

The well-paying jobs for data scientists are in industry, usually large industry (since large industry currently has the means to capture Big Data and pay for its optimization). The expectation is that these ‘data scientists’ will uncover findings that generate an economic benefit well beyond their salary. For example: mapping oil wells, optimizing shipping routes, or predicting diseases before they show symptoms. These ‘data scientists’ understand how the business works and what matters to the bottom line. I take issue with labeling the profit-maximizing role as science, but more on that later.

Software developers who have familiarity with ‘data science’ gain a valuable specialization that could morph into a second career. The neat thing is that the experience acquired by a data scientist is durable and grows in value over time. By contrast, in software development the shelf life of most skills is short, given rapidly changing trends.

Person using virtual reality

(Image from the Idaho National Laboratory flickr collection).


Data Scientists and the Quest to Maximize Advertising Revenue:

Science is a sacred word, and we should not dilute its meaning so carelessly. At what point do we take away the word "science" and put in "analyst"? For more, see "Why the term data science is flawed but useful."

It has been pointed out that too many ‘data scientists’ are focused on trivial problems like maximizing advertising revenue.

Jeff Hammerbacher, a former Facebook research scientist, said in 2011: "The best minds of my generation are thinking about how to make people click ads." His conclusion: "That sucks."

Can it really be called science if the goal is to maximize advertising revenue for a particular social media website? Can it be called science if standard deviation, linear regression, the correlation coefficient (r), t-tests, and the like are given zero credence? I think not. Correlation is not causation!
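To make the statistics named above a little more concrete, here is a minimal sketch in Python using only the standard library. The function names (`pearson_r`, `welch_t`) and the sample numbers are my own, made up for illustration. It computes a standard deviation, the correlation coefficient r, and Welch's t statistic for two samples. Keep in mind that a high r only says two series move together; it says nothing about which one, if either, causes the other.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length samples."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


# Hypothetical daily figures, invented for this example.
ad_impressions = [100, 120, 140, 160, 180, 200]
clicks = [10, 12, 15, 15, 19, 20]

print("std dev of clicks:", statistics.stdev(clicks))
print("r (impressions vs clicks):", pearson_r(ad_impressions, clicks))
print("Welch t (first half vs second half):", welch_t(clicks[:3], clicks[3:]))
```

The t statistic on its own is not a verdict. It still has to be compared against a t distribution with the appropriate degrees of freedom, which is where a statistics library would normally come in.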

Classification of Data Scientists:

To help explain the spectrum of data scientists, I’ve broken it down into the following four broad categories. To me the first two are clearly science, while the latter two are more of a grey area.

  • Theoretical Data Scientists work on the theory of data science and contribute to frameworks and tools other data scientists use.  This is essentially statistics, data storage, and computer science as applied to Big Data on a theoretical level (academics).
  • Applied Data Scientists are out to gain a better understanding of the world using big data. Since ‘science’ requires rigor, I see this role as grounded in academic rigor but used in an applied manner. At the outset, an applied data scientist’s job is to formulate hypotheses and test them using data. In a perfect world, everyone benefits from their research findings and tools.
  • Industry Data Scientists use applied data science for a specific market problem, industry, or business for the sole purpose of maximizing profit. Industry Data Scientists must be proficient at communicating their findings to the business so that they can be easily understood and acted on. The value is created by training or experience in business, economics, and accounting as they apply to the business domain. The roles of Business Analyst or Business Intelligence consultant are pretty similar.
  • Advertising Scientists may or may not be trained in data science, and they apply the craft towards maximizing clicks and optimizing A/B tests. They may use pop-sci methods. Maybe we just drop the term scientist and call these folks Advertising Maximizers?


Within the applied and industry categories, I envision two additional types:

The extroverted data scientist:

Data scientists who work with people will be required to write reports that are read by humans and that influence their decisions. Data scientists in this camp will need to excel at communicating with normal people. The only way to get value out of the data is to communicate the findings to the decision makers.

The introverted data scientist:

Picture someone with several screens of data and code open at once, working on some algorithm.  They might look up from their dimly lit desk, pull out their ear buds, and ask their boss – “What field is the user experience stored in?”  

The more introverted data scientists will be working on cleaning up data, building tools, and engineering data feeds used by other systems. The data feeds will pass relevant events up the chain to other systems or human decision makers. Consider biometric data, intrusion detection systems, and Twitter sentiment feeds, all working to make sense of very noisy data. The feeds must process the data in real time to be valuable, so speed is a factor.
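As a rough illustration of a feed that passes only the relevant events up the chain, here is a minimal Python sketch. It is one simple approach among many, not a description of any particular system: the function name `flag_anomalies`, the window size, the threshold, and the readings are all assumptions I made up. Each new reading is compared against the mean and spread of the last few readings, and only the outliers are forwarded.

```python
from collections import deque
import statistics


def flag_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean.

    Works on any iterable, so it can consume a live stream one reading
    at a time while keeping only the last `window` values in memory.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = statistics.fmean(recent)
            spread = statistics.pstdev(recent)
            # A spread of zero means the window is flat; skip it to avoid
            # flagging every tiny change as infinitely surprising.
            if spread > 0 and abs(value - mean) > threshold * spread:
                yield i, value
        recent.append(value)


# Hypothetical sensor readings with one obvious spike.
sensor_feed = [10, 11, 10, 9, 10, 50, 10, 11]
for index, value in flag_anomalies(sensor_feed):
    print(f"reading {index} looks unusual: {value}")
```

Because the window has a fixed size, the work per reading stays constant no matter how long the feed runs, which matters when speed is a factor.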

I can’t wait to be able to stand inside my data sets, manipulate them in 3D, and fly through them:

Idaho National Laboratory’s new 3-D computer-assisted virtual environment — or CAVE.

Person using virtual reality

(Image from the Idaho National Laboratory flickr collection).

