With every day that passes, more devices come online, more applications are deployed and more data sources are spun up. There has never been a time in our history when data has been so freely available, so widely used, so varied, so fast and so damn big.
Yet even as businesses, analysts, infrastructure and entire parts of the internet struggle with this growth, the volumes encountered to date pale into insignificance compared with the world of 2020.
This world is the Internet of Things – a world where a million devices come online every hour.
To handle growth and remain competitive, businesses need to change focus, to become adaptive and to become data-driven. At these scales the only way to do that is through machine learning leveraging scalable data processing technologies.
What are data-driven companies using machine learning for today?
A small sample of some common applications of machine learning in industry today includes:
- Customer segmentation discovery through purchasing behaviour clustering
- Sentiment analysis – text mining social presence to understand what customers are saying in real-time
- Predicting “next-best-offer” at point of sale
- Risk assessment in credit card and loan applications
- Predictive maintenance (infrastructure/plant machinery) – detecting probable equipment failure before it occurs
- Intelligent inventory management – optimising costs against demand
- And many, many more…
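To make the first item on that list concrete, here is a minimal sketch of customer segmentation through purchasing-behaviour clustering. It uses a plain k-means implementation in Python on hypothetical two-feature customer data (monthly spend and visits per month) – the feature names and values are invented for illustration, not taken from any real product.

```python
import random
from math import dist  # Python 3.8+

def kmeans(points, k, iters=20, seed=42):
    """Cluster points (lists of floats) into k groups with plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(x) / len(cluster) for x in zip(*cluster)]
    return centroids, clusters

# Hypothetical purchasing features: [monthly spend, visits per month]
customers = [[20, 2], [25, 3], [22, 2],        # low spend, infrequent
             [200, 12], [210, 14], [190, 11]]  # high spend, frequent
centroids, clusters = kmeans(customers, k=2)
```

On this toy data the two clusters separate the low-spend, infrequent shoppers from the high-spend regulars – exactly the kind of segment discovery the bullet point describes, though real deployments would use many more features and a library implementation.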
But there’s a few problems…
So while this all reads well, generates lots of hype, has a monopoly on buzzwords and puts us data scientists in high demand, the reality is that almost none of this is where we actually spend most of our time.
We spend a lot of time getting the data ready for action. Data discovery, acquisition, cleaning, aggregation, missing data imputation, feature engineering and exploration are where most of the dollars go. It’s commonly accepted that this can be as much as 80% of the work we do on a given project. And it’s only when this is complete that the experimentation, model iteration and tuning processes can even begin.
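Two of the chores mentioned above – missing data imputation and feature engineering – look something like this in practice. This is an illustrative Python sketch with made-up field names, not code from any of the tools discussed here:

```python
from statistics import mean

# Hypothetical raw customer records with gaps, as they typically arrive
raw = [
    {"age": 34,   "income": 72000, "last_purchase_days": 12},
    {"age": None, "income": 55000, "last_purchase_days": 40},
    {"age": 29,   "income": None,  "last_purchase_days": 3},
]

def impute_mean(rows, field):
    """Replace missing values in `field` with the mean of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    for r in rows:
        if r[field] is None:
            r[field] = fill

for field in ("age", "income"):
    impute_mean(raw, field)

# Feature engineering: derive a flag for recently active customers
for r in raw:
    r["recently_active"] = r["last_purchase_days"] <= 14
```

Mean imputation is the bluntest possible strategy – real projects weigh it against model-based imputation or dropping records – but even this trivial version shows why preparation dominates the project timeline: every field needs a decision like this before modelling can start.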
Messy tool chains….
In addition, we use a wide range of technology stacks, platforms and languages that are not well integrated into standard tool chains and don’t “play nicely” with each other (R and Hadoop – I’m looking at you – why can’t you two just get along?).
Lastly and most importantly, integration into production workflows is non-trivial on a good day. Many of the tools we use are one-shot – a quick Spark job, a Python script or an R library invoked to produce a graph or perhaps score a dataset.
This is great in exploratory mode but is of limited value if these assets cannot be operationalised. And to date, publication of predictive analytical models for enterprise consumers has generally been an afterthought at best.
What is R?
The mainstay workhorse of data science is R – an open source programming language for data preparation and statistical modelling. It is known for its rich graphical libraries and its unchallenged range of statistical tests and tools, as well as its vibrant community. However, for all its strengths, it has traditionally struggled with enterprise workloads and integration.
Microsoft has adopted a two-pronged approach to addressing these issues: enabling R to reach its enterprise potential with Microsoft R Services, and enabling data scientists to focus on delivering business value with Microsoft Azure Machine Learning.
Microsoft’s R Services
R Server brings the power of fast distributed computation for enterprise workloads to R, leveraging the Math Kernel Library (MKL) for linear algebra at scale and the ScaleR library for high-volume parallel processing. R Server is also integrated into SparkR and HDInsight (Microsoft’s Hadoop offering).
R Services is fully integrated into Visual Studio out of the box, but by installing R Client, data science teams can continue to use standard IDEs such as RStudio. Configuration is the work of a moment.
With the ConnectR library, data can be ingested from common sources via ODBC as well as SAS, Teradata, SPSS and HDFS.
With SQL Server 2016, Microsoft has gone one better allowing you to run R code directly inside stored procedures. This approach uses an R Server instance on your SQL Server cluster allowing you to leverage your existing hardware investment for statistical computation and data processing on the fly.
Finally, when it’s time to publish, the DeployR library can be used to turn R scripts into web services running in a secure service that can be consumed by clients without any knowledge of R.
In summary, the Microsoft R Services suite addresses two of the main problems we’ve struggled with over the years with R as our analytics tool of choice. Large scale computation and an easy path to enterprise integration have made R a serious player in the Big Data space for the first time.
Azure Machine Learning
This is an online machine learning workbench optimised for fast modelling and data exploration cycles. A rich visual interface allows users to build complex data preparation workflows and apply a wide set of standard machine learning algorithms through a drag and drop interface.
Predictive models can be easily integrated into the enterprise by publishing them as fully functional REST web services supporting either simple request/response or batch execution modes.
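To give a feel for what consuming such a published model looks like, here is a hedged Python sketch that builds a request/response scoring call. The endpoint URL, API key and column names are placeholders invented for this example; the JSON envelope follows the general request/response pattern these generated web services use, but check the sample code generated for your own service for the authoritative shape.

```python
import json
import urllib.request

def build_request(url, api_key, columns, rows):
    """Build an HTTP request that scores one or more rows against the service."""
    payload = {
        "Inputs": {
            "input1": {"ColumnNames": columns, "Values": rows}
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The API key is issued when the service is published
            "Authorization": "Bearer " + api_key,
        },
    )

req = build_request(
    "https://example.azureml.net/workspaces/123/services/abc/execute",  # placeholder
    "YOUR-API-KEY",
    ["age", "income"],
    [[34, 72000]],
)
# urllib.request.urlopen(req) would then return the scored result as JSON
```

The important point is on the consumer side: the client speaks plain HTTP and JSON, with no knowledge of R, Python or the model behind the endpoint.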
They even pre-generate consumer client code for you in C#, R and Python, along with API keys and sample code using your web service (and I’ll just bet it was a disgruntled engineer who thought of that feature…pure gold).
Further, you can leverage existing investments with full R/Python integration. This is great if you have already developed models, libraries or graphics that you want to continue to use in your daily work.
Azure Machine Learning Studio makes it easy to checkpoint intermediate results, replay workflows and run experiments. By enabling a fail-fast deploy-fast mode at minimal cost, this allows data scientists to focus on getting results that actually deliver business value.
“What’s in the Box?”
- Standard ML algorithms & tools (Text mining, Regression, Classification, Clustering, Anomaly detection)
- Data sources (Azure Blob, RDBMS, Hive, Web)
- Data preparation (preprocessing, aggregation, transformation, feature engineering)
- 500+ R packages
- Python Anaconda toolkit
In addition, it is fully integrated into both the Azure ecosystem and Visual Studio and can also be automated using PowerShell.
With so much out of the box, it’s really easy to get started with Azure Machine Learning Studio. It ships with built-in tutorials and a large gallery of sample experiments, so even if you’re not a data scientist yet, it’s a great place to start your journey.
Although we’ve only discussed R Services and Azure Machine Learning in this blog, that’s not the whole Microsoft data story. Some of the other offerings Microsoft has in this space include Azure Data Factory (a managed data pipeline service for managing and monitoring ETL across the enterprise) and Azure Data Lake (a service supporting secure ad hoc querying of distributed heterogeneous data sets across low cost petabyte scale storage). Once we’ve got our hands dirty with these, we hope to cover them in a future post – so stay tuned.
Note – this article originally posted on the Bulletproof technical blog – reproduced here with permission.