PPT: Accelerate Academic Research with Cloud Computing

In this deck we will discuss how Microsoft Azure can support academic research and satisfy the broad requirements and needs of researchers. We will cover Azure Machine Learning, HDInsight, HPC and other Azure services.

2016-12-08 – Academic Research – MTC Studio

Webcast: Enabling student success with cloud computing

In this webcast you will learn how to:

  • Access data analytics tools to enable real-time and predictive analytics
  • Improve student success through measurable results
  • Shift the focus from student grades toward measuring and customizing education to the needs of the individual student

We will also cover the following examples and case studies:

  • Cleveland Metropolitan Case Study
  • Predicting student dropout risks, increasing graduation rates with cloud analytics in Tacoma Public Schools
  • Predicting Student Success using Azure Machine Learning in Northeast Wisconsin Technical College (Proof of Concept)
  • Restart Academy of Missouri (Envisioning Demo by Neal Analytics)
  • Education Data Management showcase (Power BI model by Dell)

To access the webcast, you will need to fill in a short registration form.

Technologies: Azure Machine Learning and Power BI.

Azure Machine Learning Hands-on Labs

Last update: Oct 17, 2017

In this post I will provide information on the Azure Machine Learning (ML) Hands-on Labs training for developers, which we will be delivering in New York and other technology centers. After this training you will know how to create an Azure Machine Learning experiment, select the best ML model, convert the training experiment to a predictive experiment, and create an application that uses the model.

The training consists of the following labs.

  1. Predict Individual’s Income >50K (Estimated: 1 hour).
  2. Convert a training experiment into a predictive experiment in Azure ML by Mostafa Elzoghbi (Estimated: 30 minutes).
  3. Consume an Azure ML web service using Visual Studio 2015 by Mostafa Elzoghbi (Estimated: 30 minutes).
  4. Flight delay prediction by Todd Kitta. (Estimated: 3 hours) Start from Task 2. This model can be reused later in a separate Cortana Intelligence Suite End-to-End Training.
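
Lab 3 consumes the published web service from Visual Studio 2015. For readers who prefer a script, the same kind of call can be sketched in Python. This is a minimal sketch, assuming the request-response JSON layout used by Azure ML Studio web services (an "Inputs" object with column names and string values, plus a Bearer API key). The endpoint URL, the API key, the input name "input1" and the column names are placeholders that depend on your own experiment.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, api_key, column_names, rows):
    """Build (but do not send) a request-response call to a published experiment.

    endpoint_url and api_key come from the web service dashboard; the input
    name "input1" is an assumption and must match your predictive experiment.
    """
    body = {
        "Inputs": {
            "input1": {
                "ColumnNames": list(column_names),
                # Values are sent as strings, one inner list per row to score.
                "Values": [[str(value) for value in row] for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To actually score a row, pass the returned object to urllib.request.urlopen and parse the JSON response; the shape of that response depends on the output ports of your experiment.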

If you need more detailed instructions for self-paced training, you may also use the Hands-on Labs from edX courses (videos with theory and quizzes are included).

  1. DAT203.1x Data Science Essentials
  2. DAT203.2x Principles of Machine Learning
  3. DAT203.3x Applied Machine Learning

Prerequisites

Please complete the following steps before the training:

  • Activate your Azure account and bring your Microsoft account credentials. Don’t have a Microsoft account? Sign up now.
  • If you do not have a Microsoft Azure account, activate a free 30-day trial Microsoft Azure account, or if you subscribe to MSDN, activate your free Azure MSDN subscriber benefits.
  • Preferred OS is Windows 10.
  • Make sure that Visual Studio 2015 Community, Pro, or Enterprise is installed. Make sure that Office 2013 or later is installed. (Optional; alternatively, you may use the Windows Data Science Virtual Machine in Azure).
  • Create an Azure ML workspace for free by signing up here.

Additional resources:

  1. Azure Machine Learning (ML)
  2. Cortana Intelligence Suite: Big Data and Advanced Analytics
  3. Big Data Presentation Deck
  4. Azure ML Data Camp Deck
  5. Detailed Azure ML Hands-on-Labs

Next Steps:

  1. Cortana Intelligence Suite End-to-End Training (Using the Flight Delay Prediction model in Azure-based solution).
  2. Data Science with Microsoft R Hands-on Labs (Different ways of using R language).

Cortana Intelligence Suite: Big Data and Advanced Analytics

In this post we will discuss a reference architecture for Big Data and Advanced Analytics using Cortana Intelligence Suite. The architecture is relevant for organizations looking to fully manage big data and advanced analytics to transform all enterprise information into intelligent action. This allows you to act ahead of your competitors by going beyond looking in the rearview mirror to predicting what’s next.

In general, in such solutions you use relational and semi-structured data from business and custom applications, and also semi-structured or unstructured data from sensors, devices, web sites, social networks and other sources.

Big Data flow

The Big Data flow includes the following steps:

  • Ingesting data, either in bulk mode or in event-based/real-time mode.
  • Processing data to prepare it for storage.
  • Storing data in relational or unstructured storage.
  • Processing data for analytics, such as data aggregation, complex calculations, and predictive or statistical modeling.
  • Visualizing data and data discovery using BI tools or custom applications.
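
The five steps above can be sketched as a chain of small functions. This is only an illustration in plain Python: the "source,value" record format, the in-memory list standing in for a data store, and all function names are assumptions made for the example, not part of any Azure service.

```python
from collections import defaultdict


def ingest(raw_lines):
    """Ingestion: accept a batch of raw text records (bulk mode), dropping blanks."""
    return [line.strip() for line in raw_lines if line.strip()]


def prepare(records):
    """Processing for storage: parse 'source,value' text into typed rows."""
    rows = []
    for record in records:
        source, value = record.split(",")
        rows.append({"source": source, "value": float(value)})
    return rows


def store(rows, storage):
    """Storage: append rows to a list that stands in for a real data store."""
    storage.extend(rows)
    return storage


def analyze(storage):
    """Processing for analytics: compute the average value per source."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in storage:
        totals[row["source"]][0] += row["value"]
        totals[row["source"]][1] += 1
    return {source: total / count for source, (total, count) in totals.items()}


def visualize(summary):
    """Visualization: render a plain-text report instead of a BI dashboard."""
    return [f"{source}: {average:.1f}" for source, average in sorted(summary.items())]


def run_flow(raw_lines):
    """Run all five steps in order on one batch."""
    return visualize(analyze(store(prepare(ingest(raw_lines)), [])))
```

In a real solution each function would be replaced by a service (for example, an ingestion service, a store, and a BI tool), but the order of the stages stays the same.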

big-data-flow

Big Data Reference Architecture

The Big Data reference architecture shows the most important components and data flows. It allows you to do the following:

  • Track Azure data (an Azure website generating web logs) and store it in Azure Data Lake Store (ADLS)
  • Track real-time data from IoT Suite: collect data from IoT Suite in a permanent store (ADLS)
  • Run Machine Learning through R Server for HDInsight to find patterns in data
  • Show results in BI tools (Power BI)

big-data-ra

There are a lot of different options for storing data, processing data, and machine learning. You may use the Big Data and Machine Learning decision trees as a first aid in choosing the most relevant components for your solution. (I will also write about information management components like Azure Data Factory, Azure Data Catalog, Sqoop, Pig, Oozie etc. in one of the next posts.)

Example of Big Data Solution

To show a simple example of a Big Data architecture, we will use the following artificial scenario.

  • AdventureWorks Travel (AWT) provides concierge services for business travelers. In an increasingly crowded market, they are always looking for ways to differentiate themselves and provide added value to their corporate customers.
  • They are looking to pilot a web app that their internal customer service agents can use to provide additional information useful to the traveler during the flight booking process. They want to enable their agents to enter the flight information and produce a prediction of whether the departing flight will encounter a delay of 15 minutes or longer, taking into account the weather forecast for the departure hour.
  • The data platform team prefers to use open source technologies for data processing tasks.
  • Developers will need an easy way to create prediction experiments.
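
The prediction target in this scenario is a yes/no label: was the departure delayed by 15 minutes or more? A minimal sketch of how historical flights could be labeled and joined with weather for the departure hour follows. The field names (origin, departure_hour, departure_delay_minutes, wind_speed, precipitation) are hypothetical and only illustrate the idea; the actual datasets used in the labs may be shaped differently.

```python
DELAY_THRESHOLD_MINUTES = 15  # from the scenario: a delay of 15 minutes or longer


def is_delayed(departure_delay_minutes):
    """Return True when the departure delay meets the scenario's threshold."""
    return departure_delay_minutes >= DELAY_THRESHOLD_MINUTES


def to_training_row(flight, weather_by_airport_hour):
    """Join one historical flight with the weather for its departure hour.

    weather_by_airport_hour maps (origin, departure_hour) to a weather record;
    all field names here are hypothetical.
    """
    weather = weather_by_airport_hour[(flight["origin"], flight["departure_hour"])]
    return {
        "origin": flight["origin"],
        "departure_hour": flight["departure_hour"],
        "wind_speed": weather["wind_speed"],
        "precipitation": weather["precipitation"],
        # Binary label that a classification experiment would learn to predict.
        "delayed": int(is_delayed(flight["departure_delay_minutes"])),
    }
```

At booking time the same row would be built from the forecast instead of observed weather, with the label left for the model to predict.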

Here is an example of an architecture that solves the scenario described above. The selected components of Cortana Intelligence Suite are highlighted.

cis-example

A demonstration of the described solution is available in the MTC Studio webcast: 2016-12-08 | Cortana Intelligence Suite: Big Data and Advanced Analytics.

Data Science with Microsoft R Hands-on Labs

In this post I will provide a list of the most important publicly available Data Science with Microsoft R Hands-on Labs, which we use at MTC New York for Microsoft R workshops.

Before starting the labs below, it is a good idea to have general knowledge of predictive and classification statistics, and a basic understanding of machine learning and the Open R language. (For this you may use DAT204x Introduction to R for Data Science, DAT209x Programming in R for Data Science, and other courses from the Microsoft Data Science specialization.)

Microsoft R Hands-on Labs

  1. Exploring SQL Server 2016 R Services and Microsoft R Client with R Tools for Visual Studio. (3 hours; manual is available, all necessary tools and files are included; uses New York Taxi dataset; when you see “Times Squire” in the code, change it to “New York” and save)
  2. MTC Microsoft R training by Jarek Kazmierczak. (1-2 hours; contains source file and R scripts)
  3. edX: DAT213x Analyzing Big Data with Microsoft R Server by Seth Mottaghinejad. (16 hours; contains videos, scripts; you may also earn Microsoft certificate; uses New York Taxi dataset; please let me know if you experience any issues with ggplot2 and ggrepel).
  4. Flight delay prediction with Azure ML (90 minutes; exercise 1 from Cortana Intelligence Suite End-to-End Training by Todd Kitta)
  5. Text Mining with R with Azure ML by Seayoung Rhee. (1 hour)
  6. edX. DAT203.1x Data Science Essentials
  7. edX. DAT203.2x Principles of Machine Learning
  8. edX. DAT203.3x Applied Machine Learning
  9. HDInsight Spark MLlib (placeholder)
  10. Cognitive Toolkit (CNTK) Deep Dive and Hands-on (tutorial; video).

Here is one of the screenshots from the first (highly recommended) training, based on the New York Taxi dataset.

sqlrserviceslabnyc

Prerequisites to use Data Science Virtual Machine

The Data Science Virtual Machine has all of the tools you will need to work with the materials. You will need a Microsoft Azure subscription for this.

  1. To get a Microsoft Azure subscription, you can sign up for a free account here or use your MSDN subscription.
  2. To create the Data Science Virtual Machine in Azure, log in to the Azure Portal and create the virtual machine. (New -> Search for “data science” -> select “Data Science Virtual Machine” -> Create).
  3. Optionally, you may test your Microsoft R code on top of an HDInsight Spark cluster created in the Azure Portal.

Prerequisites to use your local machine

If you would like to work with some of the tools locally, please install the following components.

  1. Visual Studio (the free Community Edition is acceptable; version 2015 is preferable).
  2. Install R Tools for Visual Studio.
  3. Optionally you may use RStudio.
  4. Optionally you may install SQL Server Developer Edition for SQL Server related content.

Cortana Intelligence Suite End-to-End Training

I am very excited to share information about excellent end-to-end hands-on labs training on Cortana Intelligence Suite. This training covers Azure Machine Learning, Azure Data Factory, HDInsight Spark, Power BI, and Intelligent Apps.

cis-ete

The course was developed by MTC Architect Todd Kitta. All training materials are available in his GitHub repository. If you need to provide this training to your team of data platform specialists, please contact your Microsoft representative to initiate the training, or leave a comment here.

Alternatively, you may register for Cortana Intelligence Suite End to End live event. (December 6, 2016, 9am – 4pm PST)

Course Outline

  • Building a Machine Learning Model and Operationalizing. (This part takes 90 minutes, so if you are not a data scientist, feel free to deploy the experiment from the template).
  • Setting Up Azure Data Factory
  • Developing a Data Factory Pipeline for Data Movement
  • Operationalizing Machine Learning Scoring with Azure Machine Learning and Data Factory
  • Summarizing Data Using HDInsight Spark
  • Visualizing Spark Data in Power BI
  • Deploying an Intelligent Web App
  • Wrap-up and Cleanup of Azure Resources

Requirements

  • Microsoft Azure Subscription should be pay-as-you-go, MSDN, or Enterprise Agreement. If you are using your company’s Azure subscription and your company requires that you be connected to your corporate network (through a VPN or otherwise), we recommend that you use a Trial or MSDN subscription for this workshop. This is due to the fact that you will be connecting to your subscription inside of a VM that is not connected to your corporate network.
  • Setup is required before performing the steps in these exercises. Please see the setup instructions before going any further.
  • Please keep in mind that the HDInsight cluster and VM you provision as setup for this workshop will incur charges, so provision these resources as close to the workshop date as possible, preferably the afternoon or night before the workshop.

Machine Learning Solutions Decision Tree

New version is available: Artificial Intelligence Decision Tree

Machine learning is a technique of data science that helps computers learn from existing data in order to forecast future behaviors, outcomes, and trends. Currently there are a lot of products that can be used for this, on-premises or in the cloud, on a single node or multiple nodes, in a relational database or in Hadoop-based storage.

This article will help you choose the right Machine Learning solution based on specific requirements. We will discuss open source products, which can be deployed in Microsoft Cloud (Azure), and Microsoft products, which can be deployed on-premises.

Disclaimer. In this article I present the most important decision points based on my experience. You may use it as a first approximation before looking deeper into the described solutions and others.

The decision on which product to select also depends on the development platform used by specialists in the organization, and on which Big Data solution is already in use or planned. Key questions here are “Do you already use Hadoop or a Data Warehouse?” (SMP or MPP?), “In the cloud or on-premises?”, and “How much data is needed for machine learning?” (If the storage and the ML engine are separated, what will be the cost and latency of data transfer?)

The process of selecting a machine learning solution may influence the selection of the Big Data solution itself. (Please see the decision tree on Big Data solutions in a separate article.)

The complexity and uniqueness of the machine learning problem are also important, as is how much effort the team is ready to invest in developing the ML solution. Some products are much easier to use (Azure Machine Learning), and for some tasks there are standard APIs available (Azure Cognitive Services).

Please note that some products can be deployed on top of the same platform. (For example, MLlib and R Server can both be deployed on top of a Spark cluster.)

So let’s look at the decision tree first. Below I will provide some comments on each of the products. (You may also download a high-resolution printable version of the decision tree.)

machine-learning-dt-v1-02

Azure Machine Learning

Azure Machine Learning (ML) is a cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions.

azure-ml

In Machine Learning Studio, you can create predictive models by dragging, dropping, and connecting modules. Studio also provides a library of algorithms and samples to get you started. You may create a new ML experiment using sample experiments, R and Python packages, standard algorithms (modules), and custom R and Python scripts.

In the Cortana Intelligence Gallery, you can try analytics solutions authored by others or contribute your own.

Data Science development: Visual, R language and Python.

Advantages: Graphical representation of experiments; easy to learn; quick deployment; Excel integration; scalable in terms of multiple experiments.

Concerns: May not be the fastest solution for processing a large amount of data in a single experiment; make sure that all components of your experiment can scale.

Cognitive Services

Cognitive Services are a collection of artificial intelligence REST APIs. With Cognitive Services, developers can easily add intelligent features into their applications.

cognitive-services

Cognitive Services include:

  • Vision: From faces to feelings, allow apps to understand images and video
  • Speech: Hear and speak to users by filtering noise, identifying speakers, and understanding intent
  • Language: Process text and learn how to recognize what users want
  • Knowledge: Tap into rich knowledge amassed from the web, academia, or your own data
  • Search: Access billions of web pages, images, videos, and news with the power of Bing APIs

The collection will continuously improve, adding new APIs and updating existing ones.

Development: REST APIs.

Advantages: Quick to use; platform independent; uses some publicly available data from Bing.

Concerns: Can be used only for a subset of machine learning tasks.
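
Because Cognitive Services are plain REST APIs, a call is just an HTTP request with a subscription key. The sketch below only assembles the pieces of such a request and sends nothing. It assumes the key is passed in the Ocp-Apim-Subscription-Key header and that the API accepts a JSON body; the endpoint URL and the payload fields are placeholders, since they differ for each API in the collection.

```python
import json


def build_cognitive_request(endpoint_url, subscription_key, payload):
    """Assemble the parts of a JSON POST to a Cognitive Services style REST API.

    endpoint_url and payload are placeholders: each API documents its own
    URL, query parameters, and request body.
    """
    return {
        "method": "POST",
        "url": endpoint_url,
        "headers": {
            "Content-Type": "application/json",
            # Header carrying the subscription key (assumed for this sketch).
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
        "body": json.dumps(payload).encode("utf-8"),
    }
```

The returned dictionary can be handed to any HTTP client on any platform, which is what makes this family of services platform independent.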

Microsoft R Server Family

Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R; it is scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics – exploration, analysis, visualization and modeling. It is compatible with the entire collection of open source algorithms, connectors, visualization tools shared openly via CRAN, Bioconductor and other shared resources like GitHub. At the same time key extensions enable R to tackle big data challenges that exceed the capacity of open source R. Scripts can be developed on the desktop and immediately deployed to RDBMS – SQL Server, EDW (SQL Server & Teradata) or Hadoop (Microsoft, Cloudera, Hortonworks and MapR).

r-server

Data Science development: R language (Open R, Scale R).

Advantages: Distributes work across cores and nodes (if multiple nodes are available); R scripts built using R Server can easily be run on multiple platforms running R Server, on-premises and in the cloud (important for hybrid scenarios).

SQL Server R Services (R Server for Windows)

SQL Server R Services (also known as R Server for Windows) is Advanced Analytics and Stand Alone Server Capability built into SQL Server Enterprise Edition. It brings the perfect mix of fast querying and In-Memory OLTP optimization from SQL Server 2016, as well as data exploration, predictive modeling, scoring, and visualization from the R Services family of products. It delivers speed and performance for advanced analytics using near-database analytics and parallel threading and processing. It is integrated with SQL Server: T-SQL can call a Stored Procedure with R code, R scripts can run in SQL through extensibility model, and result sets can be sent through Web API to database or applications.

Data Science development: R language (Open R, Scale R).

Advantages: Included in SQL Server; distributes work across cores; no database data movement, which means it will work much faster; data generated by data scientists using R language can be secured and managed by DBAs and queried by data analysts.

Concerns: Uses only resources of one physical server.

r-services

R Server for MapReduce

R Server for MapReduce uses Apache MapReduce nodes for R computations.

Running R Server on MapReduce eliminates data movement latency and data duplication if your data already resides in the Hadoop cluster that runs MapReduce.

Supported platforms: HDInsight Premium, Hortonworks, Cloudera, MapR.

Data Science development: R language (Open R, Scale R)

Advantages: Distributes work across cores and nodes; if you have a lot of MapReduce code and no plans to move off MapReduce, deploying R Server on top of it will eliminate data movement for machine learning.

Concerns: Uses MapReduce which is slower than Spark.

R Server for Spark

R Server for Spark uses Apache Spark nodes for R computations at in-memory speeds using Spark RDDs. R Server for Spark leverages the Spark DAG (Directed Acyclic Graph) to distribute work across the cluster, and persistence for computation (the task can be left running, waiting for new requests). In this scenario you can develop models using larger amounts of data with better performance.

Running R Server on Spark eliminates data movement latency and data duplication if your data already resides in the cluster that runs Spark.

Supported platforms: HDInsight Premium with Spark and R Server, Spark on Hortonworks, Spark on Cloudera, Spark on MapR.

Data Science development: R language (Open R, Scale R)

Advantages: uses Spark, which means fast in-memory computations; distributes work across cores and nodes.

R Server for Teradata DB

R Server for Teradata DB uses MPP architecture for R computations.

Data Science development: R language (Open R, Scale R)

Advantages: works with Teradata DB; distributes work across cores; no database data movement, which means it will work much faster; data generated by data scientists using R language can be secured and managed by DBAs and queried by data analysts.

R Server for Linux

Data Science development: R language (Open R, Scale R)

Advantages: distributes work across cores.

Concerns: uses only resources of one physical server; additional time will be used to copy or stream data to Linux machine from HDFS.

Mahout MapReduce

Mahout MapReduce is a collection of machine learning algorithms based on Hadoop MapReduce framework.

Platform: Hadoop MapReduce, Java.

Advantages: Mahout MapReduce comes with many ML algorithms to choose from; MapReduce is a much more mature framework than Spark, and therefore more stable.

Concerns: Slow and does not handle iterative jobs very well (constrained by disk accesses due to MapReduce).

Mahout Samsara

Mahout Samsara is a Scala-based programming environment that runs on different distributed engines (Spark and H2O) and also contains machine learning algorithms. It expresses all algebraic operations in an R-like Scala DSL, which means that the code is readable by R programmers and in general is easier to understand.

Platform: Hadoop Spark, Scala.

Advantages: Fast due to use of Spark.

Concerns: Currently under development and unstable.

MLlib

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

MLlib ships with Spark as a standard component, so it works seamlessly with SparkSQL, Spark Streaming and Spark GraphX. Additionally, you may deploy R Server on top of a Spark cluster.

spark-platform

Platform: Spark.

Data Science language: Python and Scala/Java.

Advantages: due to in-memory capabilities MLlib runs iterative algorithms 5-10 times faster than Mahout based on Hadoop MapReduce; efficient and interoperable with SparkSQL, Spark Streaming & Spark GraphX; clear and consistent APIs.

Concerns: not all algorithms are implemented, though MLlib is growing very rapidly.
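
The per-product notes above can be collected into a small lookup table for a first pass at narrowing down the options. This is a sketch that only restates the development options and concerns listed in this article; it is not the decision tree itself, and the function name and the shortened concern texts are my own.

```python
# (product, development options, concern noted in this article or None)
PRODUCTS = [
    ("Azure Machine Learning", ("Visual", "R", "Python"), "a single large experiment may not scale"),
    ("Cognitive Services", ("REST",), "covers only a subset of ML tasks"),
    ("SQL Server R Services", ("R",), "uses one physical server"),
    ("R Server for MapReduce", ("R",), "MapReduce is slower than Spark"),
    ("R Server for Spark", ("R",), None),
    ("R Server for Teradata DB", ("R",), None),
    ("R Server for Linux", ("R",), "uses one physical server; data must be copied from HDFS"),
    ("Mahout MapReduce", ("Java",), "slow for iterative jobs"),
    ("Mahout Samsara", ("Scala",), "under development, unstable"),
    ("MLlib", ("Python", "Scala", "Java"), "not all algorithms are implemented"),
]


def candidates(language, without_listed_concerns=False):
    """Return products that support `language`, in the order discussed above.

    With without_listed_concerns=True, keep only products for which this
    article lists no concern (which does not mean they have none).
    """
    matches = []
    for name, languages, concern in PRODUCTS:
        if language not in languages:
            continue
        if without_listed_concerns and concern is not None:
            continue
        matches.append(name)
    return matches
```

For example, candidates("Python") narrows the list to Azure Machine Learning and MLlib, after which the platform questions (cloud or on-premises, where the data lives) decide between them.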

Vehicle Health & Driving Pattern Analysis using Cortana Analytics with Power BI

Last changes: January 13, 2016

In the following advanced analytics scenario we will show how car dealers, insurers, and automobile manufacturers can use Cortana Analytics, including Power BI, to gain real-time and predictive insights into vehicle health and driving patterns.

The solution can be applied to the following business use cases:

  • Usage-based insurance
  • Vehicle diagnostics
  • Engine emission control
  • Engine performance remapping
  • Eco-driving
  • Roadside assistance calls
  • Fleet management

auto-scenarios-by-microsoft

Since December 1, 2015, the solution, called the Vehicle Telemetry Analytics template, has been available in the Cortana Analytics Gallery.

The text below gives some details on the solution architecture, which includes the following technologies: Event Hub, Azure Stream Analytics, Azure Machine Learning, Azure Data Factory, HDInsight, Azure Storage, Azure SQL DW, and Power BI.

Let’s look at the data flow and solution components.

cc-arch

The Event Hub is used to ingest a huge number of events from the vehicles into Azure for real-time and batch analytics.

The Stream Analytics job ingests data in real time into long-term storage for batch analytics and prepares data for real-time predictive insights.

Below is a description of the three queries processed in Stream Analytics. (All three queries are enriched with detailed data on each vehicle from Blob Storage.)

Query #1 joins the stream with reference data from Azure Blob Storage and accumulates the resulting data in a different Blob Storage container for rich batch analytics.

Query #2 publishes the data as-is to the output Event Hub so that it can be consumed by the RealtimeDashboard app, which invokes the machine learning request/response endpoint for real-time anomaly detection and pushes the results to the Power BI live dashboard.

Query #3 performs aggregations on the data within a 3-second tumbling window and publishes the result to an Azure SQL instance that was provisioned as part of the deployment.
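
To make the reference-data join and the tumbling window more concrete, here is a minimal sketch in plain Python. It is not the Stream Analytics query used in the template: the field names (vin, model, timestamp, speed) are hypothetical, and the window boundary convention (an event exactly on a boundary starts a new window) is an assumption of this sketch that may differ from the service.

```python
from collections import defaultdict

WINDOW_SECONDS = 3  # window size mentioned for Query #3


def enrich(events, reference):
    """Join each telemetry event with static vehicle data by VIN (Query #1 style).

    Events whose VIN is missing from the reference data are passed through
    unchanged in this sketch.
    """
    return [dict(event, **reference.get(event["vin"], {})) for event in events]


def tumbling_average(events, field, window_seconds=WINDOW_SECONDS):
    """Average `field` per (window start, model) over fixed, non-overlapping windows."""
    buckets = defaultdict(list)
    for event in events:
        # Every timestamp falls into exactly one window: tumbling windows never overlap.
        window_start = int(event["timestamp"] // window_seconds) * window_seconds
        buckets[(window_start, event["model"])].append(event[field])
    return {key: sum(values) / len(values) for key, values in sorted(buckets.items())}
```

Each key in the result corresponds to one row that would be written to the output store per window and vehicle model.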

Data Factory is used for:

  • Orchestration, monitoring and management of the batch analytics pipeline
  • Transformation of the data in an on-demand HDInsight cluster for rich insights on Driving Behavior Pattern and Vehicle Health Trending
  • Data movement across the various data stores

cc-datafactory

All data in the source datasets is processed using Hive queries, in which we describe data structures based on CSV files. Additionally, we define new tables and calculate aggregations using INSERT statements.
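
The pattern described here (declare a table over CSV data, then fill a new table with an aggregating INSERT ... SELECT) can be tried locally. The sketch below uses SQLite from the Python standard library purely as a stand-in for Hive; the table and column names are hypothetical, and real Hive tables would be declared over files in storage rather than loaded row by row.

```python
import csv
import io
import sqlite3


def summarize_speed_by_model(csv_text):
    """Load 'vin,model,speed' CSV rows, then aggregate them into a new table."""
    conn = sqlite3.connect(":memory:")
    # Step 1: describe the structure of the CSV data as a table.
    conn.execute("CREATE TABLE telemetry (vin TEXT, model TEXT, speed REAL)")
    rows = csv.reader(io.StringIO(csv_text))
    conn.executemany("INSERT INTO telemetry VALUES (?, ?, ?)", rows)
    # Step 2: define a new table and fill it with an aggregating INSERT ... SELECT.
    conn.execute("CREATE TABLE speed_by_model (model TEXT, avg_speed REAL, events INTEGER)")
    conn.execute(
        "INSERT INTO speed_by_model "
        "SELECT model, AVG(speed), COUNT(*) FROM telemetry GROUP BY model"
    )
    result = conn.execute(
        "SELECT model, avg_speed, events FROM speed_by_model ORDER BY model"
    ).fetchall()
    conn.close()
    return result
```

The aggregated table plays the role of the batch insight tables that the pipeline later moves to the warehouse.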

cc-hive

In this solution, we are targeting the following batch insights:

  • Aggressive driving behavior (Identifies the trend of the models, locations, driving conditions, and time of the year to gain insights on aggressive driving pattern allowing Contoso Motors to use it for marketing campaigns, driving new personalized features and usage based insurance.)
  • Fuel efficient driving behavior (Identifies the trend of the models, locations, driving conditions, and time of the year to gain insights on fuel efficient driving pattern allowing Contoso Motors to use it for marketing campaigns, driving new features and proactive reporting to the drivers for cost effective and environment friendly driving habits.)
  • Recall models (Identifies models requiring recalls by anomaly detection trend and correlation with driving habits)

An anomaly detection Azure Machine Learning model is used in this demo to detect safety issues for vehicle recall and to identify vehicles requiring maintenance. This model is published in an existing subscription, and the web service endpoint is used in both request/response and batch mode for operationalization in real-time and batch processing.

CC-demo-AML

Aggregated data from Blob Storage is moved to Azure SQL Data Warehouse for historical storage.

Power BI dashboards contain historical data from Azure SQL DW and real-time data from Azure Stream Analytics and the Event Hub.

CC-demo-powerbi

Special thanks to the authors of the demo scenario: Anand Subbaraj, Sanjay Soni, Christoph Schuler, Santosh Waghmare, Shashank Khedikar, and Sam Istephan.