Originally published 22/06/2016 as part of the Bulletproof Technical Blog
Collecting and analysing data isn’t anything new, so why do we suddenly have a whole new field, and what is it all about anyway?
Data has been around in large volumes since at least WW2, when thousands of women processed vast numbers of data points to produce trajectory tables for artillery batteries. These women were the original computers – which is what they were actually called at the time. Obviously things have changed a bit since then.
Over the intervening decades, various fields containing a mix of mathematics and technology have sprung up with intriguing names such as Knowledge Engineering, Artificial Intelligence, Information Science, Data Mining, Business Intelligence, Predictive Analytics, Data Engineering, Machine Learning, Decision Science and most recently Data Science.
Arguments rage about what this field actually is, and opinions are as varied as the people who work in it. Some very smart people (including Nate Silver himself) think it’s simply a made-up buzzword to replace the name “Statistician”. For others, the terms “Data Science” and “Big Data” are interchangeable, since they’re often used synonymously in industry and the popular press. Many in IT feel that any software engineer who can use or write software to calculate some statistics and create some graphs is a data scientist.
Clearly there is no consensus on what this field actually is, what it is we do or why we exist in the first place. So let’s explore what a Data Scientist can actually do and where they fit into your wider technical team.
A Specialist With Fingers In Many Πs
It’s often said that a data scientist is someone who is better at statistics than engineers and better at engineering than statisticians. And this makes a lot of sense because a practising data scientist needs to be a lot of both (despite what Nate said).
As data has become more easily available, business and society have become more data driven. This has created a need in the market for a particular type of specialist – one who has a deep understanding of how technology and business interoperate, coupled with a knowledge of statistics, machine learning, software engineering and large scale data processing platforms. Someone whose sole focus is leveraging data to add business value and help drive strategy. Who can bring together heterogeneous data sets from disparate corporate data assets to provide a clear overall picture for key stakeholders. Someone who can communicate ideas and present findings to executives without wrapping them in obscure technical/statistical jargon.
Who never forgets that All Models Are Wrong But Some Are Useful.
And that’s what a Data Scientist does. The rest is just noise.
Slicing & Dicing the Many Faces of Data Scientists
Today data science lies at the intersection of statistics, computer science, machine learning and software engineering, with minors in business intelligence and data engineering often thrown in as well. Machine learning itself draws on neurophysiology, psychology, marketing, game theory, graph theory, fuzzy logic, information theory, natural language processing, evolutionary theory, knowledge engineering, robotics, computer vision and even philosophy.
It can be a lot to sift through but it’s fun, rewarding (and exhausting)!
So different data scientists have different focuses, backgrounds and skill sets – and if you place four of them in a room you’ll get five opinions on what data science is and isn’t.
Strategy and Tactics
One way of looking at it is to slice data scientists into two broad categories: the strategic and the tactical.
- The strategic data scientist usually comes from a business intelligence or data analyst background and is strong in overall business thinking, planning and performance. They’ll be comfortable with some technologies such as Python and SQL but probably haven’t come from a computer science or engineering background. They’re likely to throw together a one-off script for an analysis, and their focus is on working out what the hypotheses and questions need to be.
- The tactical data scientist is the computer scientist, software engineer, mathematician or statistician – they get their hands dirty when needed, are fluent in a variety of tech stacks and languages, and are the person who can extract the answers. They generally have more expertise integrating solutions into enterprise workflows.
Any serious data science team needs both types to be truly effective, and every data scientist should have some overlap across both anyway. As with all things outside of textbooks, the boundaries are fuzzy.
Another view is to group by discipline and background:
- Statistics: Expert in all modelling and prediction methods, hypothesis testing, experimental design, sampling and (often) QA engineering.
- Computer Science/Software Engineering: Fluent in a number of languages, experienced in enterprise systems architectures and design.
- Data Engineering: Building data pipelines and infrastructures, data warehouse design, ETL and CDC expertise, leveraging technologies for processing at scale.
- Machine Learning/Data Mining: Expert in learning algorithms, data preparation and validation, accuracy metrics and training methodologies.
- Business analytics/intelligence: More focused on ROI metrics, profitability modelling, business analytics and dashboard design.
- Data Visualisation: Data presentation using concepts from cognitive psychology and visual design.
We can also bring these two viewpoints together to get an overall sense of the Data Science spectrum. Note that this is only meant to be indicative of relative focus: there are plenty of highly technical BI people just as there are strategic statisticians and software engineers!
Again, any data scientist worth having will fall into one or more of these camps. These are useful categorisations to keep in mind when building a data science team.
Drilling Down Into Big Data
Big Data is probably the biggest buzzword in the industry at the moment, so it’s important to talk about it – if only to dispel some common fallacies.
It should be clear from the discussion so far that you don’t need to be a data scientist to work with large data sets and you don’t need to work with large data sets to be a data scientist.
Big Data is usually characterised by the “3Vs” – Velocity, Volume and Variety. Different domains have different mixes of these, for example:
- A real-time feed from an IoT humidity sensor may have high volume, medium velocity and low variety (well structured readings with some noise).
- A large repository of PDF legal documents naturally has an extremely high variety, a huge volume but negligible velocity.
- A Twitter feed has high volume, high velocity and high variety.
- A data set of scientific observations is likely to consist of well structured records (low variety), large volume (many observations) but zero velocity (experiment is completed).
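The 3V framing above can be sketched as a simple data structure. This is a toy illustration only: the source names, ratings and the idea of flagging “stressed” dimensions are my own assumptions, not part of any standard taxonomy.

```python
from dataclasses import dataclass

@dataclass
class VProfile:
    volume: str    # how much data accumulates
    velocity: str  # how fast it arrives
    variety: str   # how heterogeneous the records are

def stressed_dimensions(p: VProfile) -> list:
    """Return the Vs on which this source is demanding (rated 'high')."""
    return [name for name, level in vars(p).items() if level == "high"]

# Illustrative profiles for the four examples above.
sources = {
    "iot_humidity_sensor": VProfile("high", "medium", "low"),
    "legal_pdf_archive":   VProfile("high", "negligible", "high"),
    "twitter_feed":        VProfile("high", "high", "high"),
    "lab_observations":    VProfile("high", "none", "low"),
}

for name, profile in sources.items():
    print(f"{name}: stressed on {stressed_dimensions(profile)}")
```

The point of the exercise is that “Big Data” tooling choices differ depending on which dimension is stressed: a high-velocity feed and a high-variety archive call for very different platforms even at similar volumes.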
In another blog we talked about how Big Data really means that we now have tools available to collect and process data at scales we couldn’t handle 5-10 years ago. We also discussed how, apart from being an over-used marketing term, it’s also a term that’s relative to the technologies deployed to handle it.
For example, AWS Aurora can currently grow to 64 terabytes and AWS Redshift to 2 petabytes, which, in 2016, is still a whole lot of ones and zeros. But as we saw at the AWS Summit in April, the Internet of Things (IoT) is becoming a reality, and soon enough we’ll have real-time data streams at volumes that will dwarf what we’re currently dealing with. So here’s hoping we come up with a better phrase for this than Bigger Data, Big Data++ or Big Data 2.0.
Large sets of data also introduce problems in and of themselves – any data set large enough will necessarily show patterns simply as a consequence of its size. Random variations will start to look like interesting and exciting patterns leading the analyst into a gravity well of sleepless nights while trying to work out why model predictions don’t generalise in production.
One of the biggest fallacies going around today is that there’s always “gold in them there hills”. If you dig deep enough you can always find something: but it isn’t always gold.
This is known as ‘The Curse of Big Data’.
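The curse is easy to demonstrate: correlate enough pure-noise features against a pure-noise target and some will look impressively “predictive”. This is a minimal self-contained sketch; the sample sizes and seed are arbitrary choices of mine, not from the original post.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

random.seed(42)
n_rows = 50
target = [random.gauss(0, 1) for _ in range(n_rows)]  # random "outcome"

# Scan ever-wider "data sets" of pure noise features and record the
# strongest correlation found against the (also random) target. The
# widest scan finds the most convincing-looking pattern, despite there
# being no signal anywhere.
for n_features in (10, 100, 1000):
    best = max(
        abs(pearson([random.gauss(0, 1) for _ in range(n_rows)], target))
        for _ in range(n_features)
    )
    print(f"{n_features:>5} noise features -> strongest |r| = {best:.2f}")
```

The apparent correlations grow with the number of features scanned, which is exactly why models fitted to such “discoveries” fail to generalise in production.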
Data Science, Again
So data science overlaps with many different fields and we don’t claim to be experts in all of them. But we do need to know enough to apply a wide variety of tools and techniques to data and produce something useful and relevant to business strategy.
This means spending a lot of time researching, learning and training outside of our day jobs just to try and stay on top, while all the time running the risk of becoming a jack of all trades and master of none.
A data scientist needs to be reading every week and thinking every day. There is so much happening in the field (no matter what you decide Data Science is or isn’t) and so many technologies that can be used to tackle any given problem. We need to reason intelligently and find the signal in the noise both in our work and in how we do that work.
We need to filter what’s useful to business out of everything that’s available, because not every machine learning technique needs to be mastered.
Working out which ones to use is the real challenge…