Originally published 07/04/2016 as part of the Bulletproof Technical Blog
The importance of collecting data
As a society we need to measure, plan, predict and test, and to do that we need data. We’ve been collecting data in various forms and sizes for thousands of years. When the Romans ran censuses across their empire, they sent hundreds of people out to gather and aggregate local information on site, then brought the results back for final central aggregation (which, if you’ve worked with MapReduce/Hadoop, should sound familiar).
This model of moving computation to the data rather than the other way around lies at the heart of large scale data analysis even today – we just get a whole lot more data a whole lot faster!
What qualifies as “Big” data?
In almost every Big Data discussion, someone will bring up data velocity, volume and variety (colloquially referred to as the “3Vs”). People want to know: how big is big? If I have a procurement database with a few million orders in it, is that “Big”? What if it’s 10,000 million? 1 million? What about web traffic: does it matter how fast I’m receiving or sending it? Is the data structured or free text? If I have an archive of 10 years’ worth of scanned legal documents stored as PDFs on backup tapes, how can I make it searchable? (Yes, these are the sorts of questions data scientists have to be able to answer!)
Analysing data in real-time
The boundaries are fuzzy and really only reflect where we are today. Over the next decade there’ll be a large increase in the number of connected devices – such as smart appliances, meters and hybrid cars – as part of the much-hyped Internet of Things (IoT). When that happens, the data volumes we’ll be working with will make today’s “Big Data” look quaint in comparison.
But what we really mean when we talk about Big Data is that we now have tools and frameworks that allow us to collect, validate, store and analyse data at scales that were simply impossible until recently. Moreover, we can do it in real-time as the data arrives, allowing us to build entirely new classes of analytical and predictive applications that we could not even consider just a few years ago.
The point is that we’re often asking the wrong question. It isn’t “do I have a Big Data problem or not?” It’s “how can I use my data to drive decision-making?”, which leads us to the Snowplow Analytics platform.
Using Snowplow to capture and process data
Snowplow is an open source, highly available solution built entirely on AWS that solves the “how” of data capture and processing at scale. Central to Snowplow is the concept of an event, which can be pretty much anything that makes sense in your business domain. Some examples include:
- A purchase on an eCommerce site
- Web page views and referrer information for campaign tracking
- A power reading from a smart meter
- An enterprise application event such as an order dispatch
- Video play/pause events in a native iOS or Android app
- Social networking events on your site (such as Facebook shares or retweets)
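To make the idea concrete, an eCommerce purchase event can be sketched as a small structured record. The envelope and field names below are illustrative only, not Snowplow’s actual tracker payload format:

```python
import json
from datetime import datetime, timezone

def make_event(event_name, properties):
    """Build a minimal event envelope: a name, a collection
    timestamp, and a bag of domain-specific properties.
    (Illustrative shape; real Snowplow payloads differ.)"""
    return {
        "event": event_name,
        "collector_tstamp": datetime.now(timezone.utc).isoformat(),
        "properties": properties,
    }

purchase = make_event(
    "purchase",
    {"sku": "SKU-1234", "price": 49.95, "currency": "AUD"},
)
print(json.dumps(purchase, indent=2))
```

The key point is that the business meaning lives entirely in the properties you choose to send; the pipeline itself is agnostic about what an event represents.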
Each captured event is sent through the pipeline, undergoing a series of actions and transformations. Events can be piped directly into a wide variety of data stores, including S3, Redshift, Elasticsearch and DynamoDB, where standard BI tools such as Looker and Tableau can be used for exploration and analysis.
If you know how, you can even pair Snowplow with Apache Spark Streaming to build real-time machine learning applications, including recommender engines, fraud detection and customer segmentation classifiers.
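As a toy sketch of what streaming fraud detection means in practice: score each incoming event against a rolling window of recent purchase amounts and flag outliers. A real deployment would consume from Kinesis via Spark Streaming and use a trained model; everything below (the scoring rule, threshold and data) is an illustrative assumption:

```python
from collections import deque

def fraud_score(event, recent_amounts):
    """Toy anomaly score: ratio of this purchase amount to the
    rolling average of recent amounts. A real system would use
    a trained model rather than a hand-written rule."""
    if not recent_amounts:
        return 0.0
    avg = sum(recent_amounts) / len(recent_amounts)
    return event["amount"] / avg if avg else 0.0

def process_stream(events, threshold=3.0, window=50):
    """Flag events whose amount exceeds `threshold` times the
    rolling average over the last `window` events."""
    recent = deque(maxlen=window)
    flagged = []
    for event in events:
        if fraud_score(event, list(recent)) > threshold:
            flagged.append(event)
        recent.append(event["amount"])
    return flagged

stream = [{"user": "u1", "amount": a} for a in [20, 25, 22, 30, 500, 24]]
print(process_stream(stream))  # the 500-dollar purchase stands out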
The Snowplow pipeline consists of a set of stages shown in this diagram:
- Collectors – a set of AWS resources in an auto-scaling group that receive events from trackers, perform basic validation against your defined event ontologies and track sessions. Each event is placed onto an AWS Kinesis stream (the Raw Stream).
- Enricher – another set of AWS resources that read each event from the raw Kinesis stream and add extra information to it. Out of the box this includes location information from IP addresses, user-agent parsing for device information extraction and referrer data for tracking sources of site traffic. Custom enrichment steps are also supported. Each enriched event is then pushed onto a second AWS Kinesis stream (the Enriched Stream).
- Storage – Snowplow supplies AWS Kinesis “sink” clients that read directly from Kinesis and write into S3 and Redshift for storage.
- Data Modelling – each event can be further expanded upon by joining with other existing datasets such as marketing, financial or customer data.
- Analytics – since all events are captured at the lowest level it is easy to then use standard BI tools for exploration and analysis.
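The collect–enrich–store flow of those stages can be sketched end-to-end with in-memory queues standing in for the Kinesis streams. The validation check, the IP lookup table and the field names are all illustrative assumptions, not Snowplow internals:

```python
from queue import Queue

raw_stream = Queue()       # stands in for the Kinesis Raw Stream
enriched_stream = Queue()  # stands in for the Enriched Stream
warehouse = []             # stands in for S3/Redshift storage

def collect(event):
    """Collector: basic validation, then place onto the raw stream."""
    if "event" not in event:
        raise ValueError("invalid event: missing 'event' field")
    raw_stream.put(event)

def enrich():
    """Enricher: read raw events, add derived fields, push to the
    enriched stream. (A fake IP-to-country lookup stands in for
    real geo enrichment.)"""
    geo = {"203.0.113.7": "AU"}  # illustrative lookup table
    while not raw_stream.empty():
        event = raw_stream.get()
        event["geo_country"] = geo.get(event.get("ip"), "unknown")
        enriched_stream.put(event)

def store():
    """Storage sink: drain the enriched stream into the warehouse."""
    while not enriched_stream.empty():
        warehouse.append(enriched_stream.get())

collect({"event": "page_view", "ip": "203.0.113.7"})
enrich()
store()
print(warehouse)
```

Decoupling each stage behind a stream is what lets the real pipeline scale each stage independently and replay events after a failure.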
Owning your data with Snowplow
Unlike other platforms, when you use Snowplow you own your data in its raw form. There’s no pre-aggregated API serving pre-calculated metrics that someone else decided you wanted: you define which events you want to capture and when, as well as their structure and enrichment.
What counts as an event?
To give you some idea of the level of detail that has gone into the design of Snowplow, the default schema of an event contains roughly 120 fields. Some of the areas covered include:
- Geographic location (including suburb, state, lat/long, country)
- Device profile (phone type, make, model, operating system, browser)
- Timestamps and timezone (device time, event time, collection time, user time zone)
- IP address mapping (ISP, domain, network)
- Marketing source (utm field data from online campaigns)
- Page scrolling (used to estimate time on page statistics)
- Cross-domain and session tracking (for identity stitching across user devices)
- eCommerce transaction data (quantities, SKUs, currencies, conversion rates)
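As an example of how some of these fields are derived rather than captured directly, time-on-page can be estimated by counting periodic “page ping” events per page view. The ping interval, event names and field names below are assumptions for illustration:

```python
def estimate_time_on_page(events, ping_interval_secs=10):
    """Estimate seconds on page per view by counting page-ping
    events sharing a view id, then multiplying by the ping
    interval. (Interval and field names are illustrative.)"""
    counts = {}
    for e in events:
        if e["event"] == "page_ping":
            counts[e["view_id"]] = counts.get(e["view_id"], 0) + 1
    return {view: n * ping_interval_secs for view, n in counts.items()}

events = [
    {"event": "page_view", "view_id": "v1"},
    {"event": "page_ping", "view_id": "v1"},
    {"event": "page_ping", "view_id": "v1"},
    {"event": "page_ping", "view_id": "v2"},
]
print(estimate_time_on_page(events))  # {'v1': 20, 'v2': 10}
```

Because the raw pings are stored rather than just the derived statistic, you can re-derive the metric later with a different interval or definition.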
All of this comes free out of the box. Furthermore, as an open source solution, you can fork, patch, upgrade or change any part of the platform to best suit your particular needs.
Event enrichment using Snowplow
The Snowplow team is highly agile, with updates to various parts of the solution being released all the time. Of particular note is the recent announcement of support for event enrichment using local weather data. What you do with this is up to you. As an example, imagine correlating spikes in sales at particular store locations in Northern Sydney with iPhone users on Telstra SIMs on rainy days, then sending SMS alerts to store managers so they can make sure there’s enough stock. That might sound far-fetched, but these are exactly the kinds of dynamic processes needed to support data-driven businesses in today’s competitive market.
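At heart, that kind of correlation is just a join between enriched sales events and a weather dataset. A deliberately tiny sketch, where the store names, dates and figures are entirely made up:

```python
# Hypothetical daily sales totals derived from enriched events.
sales = [
    {"store": "Chatswood", "date": "2016-06-01", "units": 40},
    {"store": "Chatswood", "date": "2016-06-02", "units": 95},
]

# Hypothetical weather-by-date lookup from an enrichment source.
weather = {"2016-06-01": "sunny", "2016-06-02": "rain"}

# Join on date and keep only rainy-day sales for comparison.
rainy_day_sales = [s for s in sales if weather.get(s["date"]) == "rain"]
print(rainy_day_sales)
```

In production the same join would run in Redshift or a streaming job over enriched events, but the logic is no more exotic than this.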
Given that Snowplow has been designed from the ground up to run on AWS, it’s worth considering partnering with an organisation that has expertise in deploying and managing cloud infrastructure, so you end up with a highly scalable end-to-end data capture and analysis solution. Snowplow is an exciting development because it answers the question of how to capture and process data at significant volumes, meaning businesses can uncover the intelligence and potential hidden in their data and turn it into deep, powerful insights.