PPT: Accelerate Academic Research with Cloud Computing

In this deck we discuss how Microsoft Azure can support academic research and meet the broad needs of researchers. We cover Azure Machine Learning, HDInsight, HPC, and other Azure services.

2016-12-08 – Academic Research – MTC Studio

Reference materials:

 

Webcast: Predictive Data Warehouse with Datameer

In the following webcast, we talk to Andrew Brust, Senior Director of Market Strategy and Intelligence at Datameer.

We will learn about the Hadoop ecosystem and PaaS options in Azure, the difference between a Data Lake and a Data Warehouse, and the added value of unstructured data streams. We will discuss the Hadoop learning curve for professionals with an OLTP database and BI background, and how Datameer can help to create big data solutions and future-proof them against change.

Technologies: HDInsight, Stream Analytics, Azure Data Lake Store and Analytics, Azure Machine Learning and Power BI.

To access the webcast, you will need to fill in a short registration form.

Webcast: Data warehouse migration to Azure with Hortonworks

A modern EDW should be able to manage both structured and unstructured data to realize the full value of data. Security, consistency, and credibility of data are also very important. Data warehouse and big data solutions from Microsoft provide a trusted infrastructure that can handle all types of data, and scale from terabytes to petabytes, with real-time performance.

In this webcast, with the participation of Mark Lochbihler (Director of Partner Engineering, Hortonworks), we discuss modern enterprise data warehouses (EDW) and migration to the Microsoft Cloud (Azure). We will learn about the process, tools, and reference architectures for data warehouse migration.

To access the webcast, you will need to fill in a short registration form.

Additional resources:

Cortana Intelligence Suite: Big Data and Advanced Analytics

In this post we discuss a reference architecture for Big Data and Advanced Analytics using Cortana Intelligence Suite. The architecture is relevant for organizations that want to fully manage big data and advanced analytics and transform all enterprise information into intelligent action. This lets you act ahead of your competitors by going beyond looking in the rearview mirror to predicting what’s next.

In general, in such solutions you use relational and semi-structured data from business and custom applications, and also semi-structured or unstructured data from sensors, devices, web sites, social networks and other sources.

Big Data flow

The Big Data flow includes the following steps:

  • Ingesting data, either in bulk mode or event-based (real-time).
  • Processing data to prepare for storage.
  • Storing data in relational or unstructured storage.
  • Processing data for analytics, such as data aggregation, complex calculations, and predictive or statistical modeling.
  • Visualizing data and data discovery using BI tools or custom applications.
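The five steps above can be sketched as a chain of small functions. This is only a toy illustration in plain Python: the record layout (`sensor, value`), the function names, and the in-memory list standing in for storage are all assumptions made for the example, not part of any Azure service.

```python
from collections import defaultdict
from statistics import mean

def ingest(raw_lines):
    """Step 1: ingest records (bulk mode here; a stream would deliver them one by one)."""
    for line in raw_lines:
        yield line

def prepare(lines):
    """Step 2: parse and clean records before storage, skipping malformed rows."""
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        sensor, value = parts
        try:
            yield {"sensor": sensor, "value": float(value)}
        except ValueError:
            continue

def store(records):
    """Step 3: persist records (a list stands in for relational or unstructured storage)."""
    return list(records)

def analyze(stored):
    """Step 4: aggregate for analytics (average value per sensor)."""
    groups = defaultdict(list)
    for rec in stored:
        groups[rec["sensor"]].append(rec["value"])
    return {sensor: mean(values) for sensor, values in groups.items()}

def visualize(summary):
    """Step 5: render results (plain text instead of a BI tool)."""
    return [f"{sensor}: {avg:.1f}" for sensor, avg in sorted(summary.items())]

# Hypothetical input: two valid sensors plus two malformed rows
raw = ["s1, 20.0", "s2, 30.5", "bad row", "s1, 22.0", "s2, n/a"]
report = visualize(analyze(store(prepare(ingest(raw)))))
print("\n".join(report))
```

In a real solution each function would be replaced by a dedicated service, but the order of the stages stays the same.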

big-data-flow

Big Data Reference Architecture

The Big Data reference architecture shows the most important components and data flows. It allows you to do the following:

  • Track Azure data (for example, web logs generated by an Azure website) and store it in Azure Data Lake Store (ADLS)
  • Track real-time data from IoT Suite and collect it in a permanent store (ADLS)
  • Run Machine Learning through R Server for HDInsight to find patterns in data
  • Show results in BI tools (Power BI)

big-data-ra

There are many options for storing data, processing data, and machine learning. You may use the Big Data and Machine Learning decision trees as a first aid in choosing the most relevant components for your solution. (I will also write about information management components such as Azure Data Factory, Azure Data Catalog, Sqoop, Pig, and Oozie in one of the next posts.)

Example of Big Data Solution

To show a simple example of a Big Data architecture, we will use the following fictional scenario.

  • AdventureWorks Travel (AWT) provides concierge services for business travelers. In an increasingly crowded market, they are always looking for ways to differentiate themselves and provide added value to their corporate customers.
  • They are looking to pilot a web app that their internal customer service agents can use to provide additional information useful to the traveler during the flight booking process. They want to enable their agents to enter the flight information and get a prediction of whether the departing flight will encounter a delay of 15 minutes or longer, taking into account the weather forecast for the departure hour.
  • The data platform team prefers to use open source technologies for data processing tasks.
  • Developers will need an easy way to create prediction experiments.
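The prediction target in this scenario can be made concrete with a small sketch. The 15-minute threshold comes from the scenario itself; the function names and the idea of truncating the departure time to the hour (to join it with an hourly weather forecast) are assumptions made for illustration.

```python
from datetime import datetime

DELAY_THRESHOLD_MINUTES = 15  # from the scenario: a delay of 15 minutes or longer

def is_delayed(scheduled, actual, threshold=DELAY_THRESHOLD_MINUTES):
    """Label a historical flight: True if it departed at least `threshold` minutes late."""
    minutes_late = (actual - scheduled).total_seconds() / 60
    return minutes_late >= threshold

def departure_hour(scheduled):
    """Truncate to the hour, a possible key for looking up the hourly weather forecast."""
    return scheduled.replace(minute=0, second=0, microsecond=0)

scheduled = datetime(2016, 12, 8, 9, 40)
print(is_delayed(scheduled, datetime(2016, 12, 8, 9, 55)))  # exactly 15 minutes late
print(is_delayed(scheduled, datetime(2016, 12, 8, 9, 54)))  # 14 minutes late
print(departure_hour(scheduled))
```

Labels like this, computed over historical flights, are what a binary classification experiment would be trained on.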

Here is an example of an architecture that addresses the scenario described above. The selected components of Cortana Intelligence Suite are highlighted.

cis-example

A demonstration of the described solution is available in the MTC Studio webcast: 2016-12-08 | Cortana Intelligence Suite: Big Data and Advanced Analytics.

Additional materials

Cortana Intelligence Suite End-to-End Training

I am very excited to share information about excellent end-to-end hands-on labs training on Cortana Intelligence Suite. This training covers Azure Machine Learning, Azure Data Factory, HDInsight Spark, Power BI, and Intelligent Apps.

cis-ete

The course was developed by MTC Architect Todd Kitta. All training materials are available in his GitHub repository. If you need to provide this training to your team of data platform specialists, please contact your Microsoft representative to initiate the training, or leave a comment here.

Alternatively, you may register for the Cortana Intelligence Suite End to End live event (December 6, 2016, 9am – 4pm PST).

Course Outline

  • Building a Machine Learning Model and Operationalizing. (This part takes 90 minutes, so if you are not a data scientist, feel free to deploy the experiment from the template.)
  • Setting Up Azure Data Factory
  • Developing a Data Factory Pipeline for Data Movement
  • Operationalizing Machine Learning Scoring with Azure Machine Learning and Data Factory
  • Summarizing Data Using HDInsight Spark
  • Visualizing Spark Data in Power BI
  • Deploying an Intelligent Web App
  • Wrap-up and Cleanup of Azure Resources

Requirements

  • Microsoft Azure Subscription should be pay-as-you-go, MSDN, or Enterprise Agreement. If you are using your company’s Azure subscription and your company requires that you be connected to your corporate network (through a VPN or otherwise), we recommend that you use a Trial or MSDN subscription for this workshop. This is due to the fact that you will be connecting to your subscription inside of a VM that is not connected to your corporate network.
  • Setup is required before performing the steps in these exercises. Please see the setup instructions before going any further.
  • Please keep in mind that the HDInsight cluster and VM you provision as setup for this workshop will incur charges, so provision these resources as close to the workshop date as possible, preferably the afternoon or night before the workshop.

Machine Learning Solutions Decision Tree

A new version is available: Artificial Intelligence Decision Tree

Machine learning is a technique of data science that helps computers learn from existing data in order to forecast future behaviors, outcomes, and trends. Currently there are many products that can be used for this on-premises or in the cloud, on a single node or multiple nodes, in a relational database or in Hadoop-based storage.

This article will help you choose the right Machine Learning solution based on specific requirements. We will discuss open source products that can be deployed in the Microsoft Cloud (Azure), and Microsoft products that can be deployed on-premises.

Disclaimer. In this article I present the most important decision points based on my experience. You may use it as a first approximation before looking deeper into the described solutions and others.

The choice of product also depends on the development platform used by specialists in the organization, and on which Big Data solution is already in use or planned. Key questions here are “Do you already use Hadoop or a Data Warehouse?” (SMP or MPP?), “In the cloud or on-premises?”, and “How much data is needed for machine learning?” (If storage and the ML engine are separated, what will be the cost and latency of data transfer?)

The process of selecting a machine learning solution may influence the selection of the Big Data solution itself. (Please see the decision tree on Big Data solutions in a separate article.)

The complexity and uniqueness of the machine learning problem also matter, as does how much effort the team is ready to invest in developing an ML solution. Some products are much easier to use (Azure Machine Learning), and for some tasks standard APIs are available (Azure Cognitive Services).

Please note that some products can be deployed on top of the same platform. (For example, MLlib and R Server can both be deployed on top of a Spark cluster.)

So let’s see the decision tree first. Below I provide some comments on each of the products. (You may also download a high-resolution printable version of the decision tree.)

machine-learning-dt-v1-02
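As a rough companion to the figure, the platform question can be written down as a lookup. This is a sketch based only on the product notes in this article, not a transcription of the decision tree; the platform keys, the function name, and its parameters are hypothetical labels chosen for the example.

```python
# Candidate products per platform, taken from the product notes in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
CANDIDATES_BY_PLATFORM = {
    "spark": ["MLlib", "R Server for Spark", "Mahout Samsara"],
    "mapreduce": ["Mahout MapReduce", "R Server for MapReduce"],
    "sql_server": ["SQL Server R Services"],
    "teradata": ["R Server for Teradata DB"],
    "linux_single_server": ["R Server for Linux"],
}

def suggest_products(platform=None, standard_api_fits=False, wants_visual_authoring=False):
    """Return a shortlist to investigate further; it does not make the final choice."""
    suggestions = []
    if standard_api_fits:
        # Some tasks are already covered by ready-made REST APIs
        suggestions.append("Cognitive Services")
    if wants_visual_authoring:
        # Drag-and-drop experiments with quick deployment
        suggestions.append("Azure Machine Learning")
    # Products that run where the data already lives avoid data movement
    suggestions.extend(CANDIDATES_BY_PLATFORM.get(platform, []))
    return suggestions

print(suggest_products(platform="spark"))
print(suggest_products(platform="teradata", wants_visual_authoring=True))
```

The shortlist is only a starting point; data volume, team skills, and the cost of moving data still decide between the candidates.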

Azure Machine Learning

Azure Machine Learning (ML) is a cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions.

azure-ml

In Machine Learning Studio, you can create predictive models by dragging, dropping, and connecting modules. Studio also provides a library of algorithms and samples to get you started. You may create a new ML experiment using sample experiments, R and Python packages, standard algorithms (modules), and custom R and Python scripts.

In Cortana Intelligence Gallery, you can try analytics solutions authored by others or contribute your own.

Data Science development: Visual, R language and Python.

Advantages: Graphical representation of experiments; easy to learn; quick deployment; Excel integration; scalable in terms of multiple experiments.

Concerns: May not be the fastest solution for processing large amounts of data in a single experiment; make sure that all components of your experiment can scale.

Cognitive Services

Cognitive Services are a collection of artificial intelligence REST APIs. With Cognitive Services, developers can easily add intelligent features into their applications.

cognitive-services

Cognitive Services include:

  • Vision: From faces to feelings, allow apps to understand images and video
  • Speech: Hear and speak to users by filtering noise, identifying speakers, and understanding intent
  • Language: Process text and learn how to recognize what users want
  • Knowledge: Tap into rich knowledge amassed from the web, academia, or your own data
  • Search: Access billions of web pages, images, videos, and news with the power of Bing APIs

The collection will continuously improve, adding new APIs and updating existing ones.

Development: REST APIs.

Advantages: Quick to use; platform independent; uses some publicly available data from Bing.

Concerns: Can be used only for a subset of machine learning tasks.
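Because the services are plain REST APIs, calling one amounts to building an HTTP request. The sketch below only constructs such a request with the Python standard library and does not send it. The endpoint URL and the `{"url": ...}` payload shape are placeholders assumed for illustration (each API documents its own), while `Ocp-Apim-Subscription-Key` is the header in which the subscription key is passed.

```python
import json
import urllib.request

# Placeholder values: use the endpoint and key from your own subscription.
ENDPOINT = "https://example.invalid/vision/analyze"  # hypothetical URL
SUBSCRIPTION_KEY = "<your-subscription-key>"

def build_request(image_url):
    """Build (but do not send) a JSON POST request for an image-analysis style API."""
    body = json.dumps({"url": image_url}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        },
    )

request = build_request("https://example.invalid/photo.jpg")
print(request.get_method(), request.full_url)

# Sending is left out on purpose; with real values it would look like:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Nothing here depends on the client platform, which is the "platform independent" advantage in practice: any language that can issue an HTTP request can use the services.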

Microsoft R Server Family

Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R; it is scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics – exploration, analysis, visualization and modeling. It is compatible with the entire collection of open source algorithms, connectors, visualization tools shared openly via CRAN, Bioconductor and other shared resources like GitHub. At the same time key extensions enable R to tackle big data challenges that exceed the capacity of open source R. Scripts can be developed on the desktop and immediately deployed to RDBMS – SQL Server, EDW (SQL Server & Teradata) or Hadoop (Microsoft, Cloudera, Hortonworks and MapR).

r-server

Data Science development: R language (Open R, Scale R).

Advantages: Distributes work across cores and nodes (if multiple nodes available); R Scripts built using R Server can be easily run on multiple platforms running R Server, on-premises and in the cloud (important for hybrid scenarios).

SQL Server R Services (R Server for Windows)

SQL Server R Services (also known as R Server for Windows) is Advanced Analytics and Stand Alone Server Capability built into SQL Server Enterprise Edition. It brings the perfect mix of fast querying and In-Memory OLTP optimization from SQL Server 2016, as well as data exploration, predictive modeling, scoring, and visualization from the R Services family of products. It delivers speed and performance for advanced analytics using near-database analytics and parallel threading and processing. It is integrated with SQL Server: T-SQL can call a Stored Procedure with R code, R scripts can run in SQL through extensibility model, and result sets can be sent through Web API to database or applications.

Data Science development: R language (Open R, Scale R).

Advantages: Included in SQL Server; distributes work across cores; no database data movement, which means it will work much faster; data generated by data scientists using R language can be secured and managed by DBAs and queried by data analysts.

Concerns: Uses only resources of one physical server.

r-services

R Server for MapReduce

R Server for MapReduce uses Apache MapReduce nodes for R computations.

Using R Server in MapReduce eliminates data movement latency and removes data duplication if your data already resides in the Hadoop cluster where MapReduce runs.

Supported platforms: HDInsight Premium, Hortonworks, Cloudera, MapR.

Data Science development: R language (Open R, Scale R)

Advantages: Distributes work across cores and nodes; if you have a lot of MapReduce code and no plans to move off MapReduce, deploying R Server on top of it will eliminate data movement for machine learning.

Concerns: Uses MapReduce which is slower than Spark.

R Server for Spark

R Server for Spark uses Apache Spark nodes for R computations at in-memory speeds using Spark RDDs. R Server for Spark leverages the Spark DAG (Directed Acyclic Graph) to distribute work across the cluster, and persistence for computation (we may leave the task running and waiting for new requests). In this scenario you can develop models using larger amounts of data with better performance.

Using R Server in Spark eliminates data movement latency and removes data duplication if your data already resides in the cluster where Spark runs.

Supported platforms: HDInsight Premium with Spark and R Server, Spark on Hortonworks, Spark on Cloudera, Spark on MapR.

Data Science development: R language (Open R, Scale R)

Advantages: uses Spark, which means fast in-memory computations; distributes work across cores and nodes.

R Server for Teradata DB

R Server for Teradata DB uses MPP architecture for R computations.

Data Science development: R language (Open R, Scale R)

Advantages: works with Teradata DB; distributes work across cores; no database data movement, which means it will work much faster; data generated by data scientists using R language can be secured and managed by DBAs and queried by data analysts.

R Server for Linux

Data Science development: R language (Open R, Scale R)

Advantages: distributes work across cores.

Concerns: uses only the resources of one physical server; additional time is needed to copy or stream data from HDFS to the Linux machine.

Mahout MapReduce

Mahout MapReduce is a collection of machine learning algorithms based on Hadoop MapReduce framework.

Platform: Hadoop MapReduce, Java.

Advantages: Mahout MapReduce comes with many ML algorithms to choose from; MapReduce is a much more mature framework than Spark, and therefore more stable.

Concerns: Slow and does not handle iterative jobs very well (constrained by disk accesses due to MapReduce).
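To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not Mahout or Hadoop code; the function names and the basket-counting job are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record; each call emits zero or more (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Example job: count how often each item appears across baskets
baskets = [["milk", "bread"], ["milk"], ["bread", "eggs", "milk"]]
counts = run_job(
    baskets,
    mapper=lambda basket: ((item, 1) for item in basket),
    reducer=lambda item, ones: sum(ones),
)
print(counts)
```

In Hadoop, each such job reads its input from disk and writes its output back to disk, so an iterative algorithm that chains many jobs pays that cost on every pass, which is the concern noted above.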

Mahout Samsara

Mahout Samsara is a Scala-based programming environment that runs on different distributed engines (Spark and H2O) and also contains machine learning algorithms. It expresses algebraic operations in an R-like Scala DSL, which means that it can be read by R programmers and in general is easier to understand.

Platform: Hadoop Spark, Scala.

Advantages: Fast due to use of Spark.

Concerns: Currently under development and unstable.

MLlib

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

MLlib ships with Spark as a standard component, so it works seamlessly with SparkSQL, Spark Streaming and Spark GraphX. Additionally, you may deploy R Server on top of Spark cluster.

spark-platform

Platform: Spark.

Data Science language: Python and Scala/Java.

Advantages: due to in-memory capabilities MLlib runs iterative algorithms 5-10 times faster than Mahout based on Hadoop MapReduce; efficient and interoperable with SparkSQL, Spark Streaming & Spark GraphX; clear and consistent APIs.

Concerns: not all algorithms are implemented, though MLlib is growing very rapidly.

Additional information

Analysis of Big Data for Financial Services Institutions

In this blog post, we will look at analysis of stock prices and dividends by industry. This task is important to all participants of Stock Market including individual retail investors, institutional investors such as mutual funds, banks, insurance companies and hedge funds, and publicly traded corporations trading in their own shares.

In this demo, a team at a stock trading company analyzes semi-structured stock data from the New York Stock Exchange (NYSE).

  1. The Data Architect collects data and makes information accessible to the business. He will use the Hadoop-based distribution on Windows Azure and Hive queries to aggregate stock and dividend data by year.
  2. The Financial Analyst will analyze stock data and prepare ad-hoc reports to support trading and management processes. She will use the Power Query add-in for Excel to join aggregated data from Hadoop with additional information on the top 500 S&P companies from Azure Marketplace Datamarket. Additionally, she will create ad-hoc reports with Power View for Excel.
  3. The Trading Executive is responsible for understanding key decision makers and suggesting the best product mix of securities. He will make some modifications to the Power View reports provided by the Financial Analyst.

Details on how the Data Architect aggregates data in Hadoop are available in a separate blog post.

Below you can see some screenshots from the demo.

role1

role1-1

role1-2

role2

role2-1

role2-2

role3

role3-1

role3-2

Aggregating Big Data with HDInsight (Hadoop) on Azure

When we are talking about Big Data we may mean huge amounts of data (high Volume), data in any format (high Variety), and streaming data (appearing with high Velocity). Microsoft provides solutions for all of these “3V” tasks under unified monitoring, management and security, as well as unified data movement technologies. These workloads are supported, respectively, by SQL Server Database and Parallel Data Warehouse, HDInsight (Hadoop for Windows or Azure), and Microsoft SQL Server StreamInsight.

big-data-technologies

Let us talk about Microsoft Big Data technology for Non-Relational data.

Microsoft’s adaptation of Hadoop technology can be deployed in a cloud-based environment or on-premises. The Hadoop-based service on the Windows Azure platform is a cloud-based service that offers elastic (in terms of data volumes) analytics on Microsoft’s cloud platform. For customers who want to keep the data within their data centers, Microsoft provides a Hadoop-based distribution on Windows Server.

In this blog post, we will start diving into Hadoop in Azure technology and Hive queries to analyze semi-structured data in Hadoop.

In addition to traditional data warehousing, where operational data is stored in special structures in the Enterprise Data Warehouse, we can store all other raw data in a “Store it All” cluster. At any moment, we can query this data to answer a business question. (In addition, we may store the answer in the Data Warehouse if necessary.)

additional-flow

Let me introduce the first part of the Big Data demonstration, where the Data Architect stores log files with stock prices and dividends in Azure Blob Storage and uses Hive queries to aggregate data by year and stock ticker into a separate file.

store-and-aggregate
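The aggregation by year and stock ticker can be illustrated on a tiny scale in plain Python. This is not the Hive query from the demo; the column names (`ticker`, `date`, `close`), the sample rows, and the choice of an average closing price as the aggregate are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real log files and their columns may differ
SAMPLE = """ticker,date,close
AAA,2009-01-05,10.0
AAA,2009-06-01,14.0
AAA,2010-03-15,20.0
BBB,2009-02-10,5.0
"""

def average_close_by_year(csv_text):
    """Group rows by (ticker, year) and average the closing price, like a GROUP BY."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of prices, row count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["ticker"], int(row["date"][:4]))
        totals[key][0] += float(row["close"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}

result = average_close_by_year(SAMPLE)
for (ticker, year), avg in result.items():
    print(ticker, year, avg)
```

Hive expresses the same grouping declaratively and runs it across the cluster, which is what makes it practical for log files far larger than one machine's memory.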

Here is the video:

Additional materials: Windows Azure Storage Architecture Overview