What does “near 100% compatibility” of Azure SQL DB Managed Instance actually mean?

Azure SQL Database Managed Instance (Azure SQL DB MI) is a fully managed SQL Server Database Engine instance hosted in the Azure cloud. It is the most compatible PaaS option for migrating on-premises SQL Server databases to the cloud. (PaaS is a good fit if you want capabilities such as automatic patching and version updates, automated backups, and built-in high availability to reduce management overhead and TCO.)

So what does “near 100% compatibility of Azure SQL DB MI with the latest SQL Server on-premises (Enterprise Edition) Database Engine” actually mean?

First, components of SQL Server that are not related to the Database Engine are not available in Azure SQL DB MI: Reporting Services, Integration Services, Analysis Services, Master Data Services, and Data Quality Services are not there.

Second, some SQL Server EE features needed for enterprise database workloads are still not available in Azure SQL DB MI, probably because of their complexity, low demand or impact, or the availability of similar or better capabilities in Azure.

A feature comparison of SQL Server and Azure SQL DB MI can be found in the official documentation.

Below is a graphical representation of the most important differences between SQL Server, Azure SQL DB MI, and some other PaaS offerings in Azure.

In the picture, features on the borders are partially compatible.

Business Intelligence Solutions Decision Tree

In this article we will cover the most important Business Intelligence components based on the Microsoft Data Platform. One week ago there were announcements on Power BI Premium and Power BI Report Server that require some clarification, so I decided to create another decision tree describing the available Microsoft analytical modeling and visualization tools and covering the Power BI related components in more detail.

For the purposes of this article we will define Business Intelligence narrowly, as the top and middle layers of the BI stack: Analytical Modeling, Data Visualization, and Collaboration. We will also cover Sites and Apps Integration as an important part of BI functionality.

  1. Analytical Modeling solutions let you load data from different data sources, combine the data in one model, and create calculations.
  2. Data Visualization and Collaboration solutions let users create, change, manage, and share reports and dashboards built on top of analytical models or data sources.
  3. Sites and Apps Integration solutions let you create applications on top of data sources, embed analytical reports into applications and web sites, and create data-driven workflows.

Here is the decision tree, which maps these areas to specific solutions. Below I will provide some comments on each of them.

Analytical Modeling

  • Azure Analysis Services is an Azure PaaS offering built on the proven analytics engine in Microsoft SQL Server Analysis Services. Azure Analysis Services provides enterprise-grade tabular data modeling in the cloud.
  • SQL Server Analysis Services (SSAS) is a part of SQL Server which contains engines for multidimensional (OLAP) and tabular analytical models, and for data mining.
  • SQLBI DAX Studio is a tool to write, execute, and analyze DAX queries in Power BI Designer, Power Pivot for Excel, and Analysis Services Tabular. It includes an Object Browser, query editing and execution, formula and measure editing, syntax highlighting and formatting, integrated tracing and query execution breakdowns.
  • Microsoft Excel is a spreadsheet application with cell-based calculations. It includes Pivot Tables, Pivot Charts, and Power View for data visualization; Power Query for data transformation; and Power Pivot to create in-memory tabular models and calculations. Excel is a component of the Microsoft Office application suite and is also available in Office 365 subscriptions.
  • Power BI Desktop is a visual data exploration tool for data analysis and report creation. It lets you load data from multiple sources, establish the data structure, transform data, create an analytical tabular model, visualize and explore data interactively, and publish to Power BI Service.

Visualization and collaboration

  • Power BI is a set of tools for self-service and traditional business intelligence. It uses tabular analytical models, lets you build interactive reports and dashboards, and features mobile reports, collaboration, and application embedding.
  • Power BI Mobile is a set of free Windows, iOS, and Android applications for viewing and exploring personalized dashboards and reports created in Power BI Service. It also keeps users up to date with data-driven alerts.
  • Power BI Service (or powerbi.com) is the SaaS part of the Power BI offering. It lets you create interactive reports and datasets, build dashboards, update data with real-time, automatic, and scheduled refreshes, share dashboards with other people in your organization, ask questions of data with Natural Language Query, and stay connected to data with mobile applications.
    • Power BI Free is a free version of Power BI Service intended for personal report authoring. Currently this service is in transition to the same functionality as Power BI Pro, but with limited sharing and collaboration features. (This will be effective June 1st.)
    • Power BI Pro is the professional version of Power BI Service, intended for report authoring, sharing, and collaboration. Power BI Pro is paid per user, per month.
    • Power BI Premium is dedicated capacity for large-scale BI deployments, with enhanced performance and larger data volumes, without requiring per-user licenses. Power BI Premium builds on the existing Power BI portfolio with a capacity-based licensing model that increases flexibility in how users access, share, and distribute content. Power BI Premium is paid per node, per month.
  • Power BI Report Server is an on-premises server for deploying and distributing interactive Power BI reports, as well as traditional paginated reports, completely within the boundaries of the organization’s firewall. Power BI Report Server is available as part of Power BI Premium or with SQL Server Enterprise Edition with Software Assurance (SQL EE SA).
  • SQL Server Reporting Services is a solution for creating, publishing, and managing reports, and for delivering them to users in a web browser, on a mobile device, or by email. Supported report types: “traditional” paginated reports, mobile reports (AKA DataZen), and Power BI reports (through Power BI Report Server or Power BI Service).

Sites and Apps development

  • Power BI Embedded is a PaaS offering in Azure that provides interactive data visualizations in customer-facing apps without the time and expense of building them from the ground up. In the future it will be converged with the Power BI Service to deliver one API surface, a consistent set of capabilities, and access to the latest features.
  • Microsoft Flow is a component of Office 365 that offers a user-friendly, intuitive way to create automated workflows between applications and services to generate notifications, synchronize files, collect data, and perform other actions.
  • Microsoft PowerApps is a component of Office 365 with a user-friendly, intuitive interface for building applications without writing code, connecting to data sources and creating new data, and publishing and using the created apps on the web and on mobile devices. PowerApps lets business experts in the organization create the apps they need to support their business requirements with drag-and-drop simplicity.
  • SharePoint in Office 365 lets you integrate Power BI interactive reports into SharePoint web pages.
  • SharePoint Server, the on-premises solution, also includes BI-related functionality: integration with SQL Server Reporting Services and the creation of PerformancePoint dashboards.

Data Science with Microsoft R Hands-on Labs

In this post I provide a list of the most important publicly available Data Science with Microsoft R Hands-on Labs that we use at MTC New York for Microsoft R workshops.

Before starting the labs below, it is a good idea to have general knowledge of statistics for prediction and classification, and a basic understanding of Machine Learning and the Open R language. (For this you may use DAT204x Introduction to R for Data Science, DAT209x Programming in R for Data Science, and other courses from the Microsoft Data Science specialization.)

Microsoft R Hands-on Labs

  1. Exploring SQL Server 2016 R Services and Microsoft R Client with R Tools for Visual Studio. (3 hours; manual is available, all necessary tools and files are included; uses New York Taxi dataset; when you see “Times Squire” in the code, change it to “New York” and save)
  2. MTC Microsoft R training by Jarek Kazmierczak. (1-2 hours; contains source file and R scripts)
  3. edX: DAT213x Analyzing Big Data with Microsoft R Server by Seth Mottaghinejad. (16 hours; contains videos, scripts; you may also earn Microsoft certificate; uses New York Taxi dataset; please let me know if you experience any issues with ggplot2 and ggrepel).
  4. Flight delay prediction with Azure ML (90 minutes; exercise 1 from Cortana Intelligence Suite End-to-End Training by Todd Kitta)
  5. Text Mining with R with Azure ML by Seayoung Rhee. (1 hour)
  6. edX. DAT203.1x Data Science Essentials
  7. edX. DAT203.2x Principles of Machine Learning
  8. edX. DAT203.3x Applied Machine Learning
  9. HDInsight Spark MLlib (placeholder)
  10. Cognitive Toolkit (CNTK) Deep Dive and Hands-on (tutorial; video).

Here is one of the screenshots from the first (highly recommended) training, based on the New York Taxi dataset.

[Screenshot: SQL Server R Services lab (NYC)]

Prerequisites to use Data Science Virtual Machine

The Data Science Virtual Machine has all of the tools you will need to work with the materials. You will need a Microsoft Azure subscription for this.

  1. To get a Microsoft Azure subscription, you can sign up for a free account here or use your MSDN subscription.
  2. To create the Data Science Virtual Machine in Azure, log in to the Azure Portal and create the virtual machine. (New -> Search for “data science” -> select “Data Science Virtual Machine” -> Create).
  3. Optionally, you may test your Microsoft R code on top of an HDInsight Spark cluster created in the Azure Portal.

Prerequisites to use your local machine

If you would like to work with some of the tools locally, please install the following components.

  1. Visual Studio – the Community Edition (free) is acceptable – Version 2015 preferable.
  2. Install R Tools for Visual Studio.
  3. Optionally you may use RStudio.
  4. Optionally you may install SQL Server Developer Edition for SQL Server related content.

Materials from Mission Critical Performance Workshop

Today at MTC New York I delivered the workshop “Always On: Mission Critical Performance”, dedicated to some of the new features of SQL Server 2016. (And this time SQL Server AlwaysOn technology actually was covered, but it was only a fraction of the whole content 😉 ).

Here you can find presentation decks from this workshop:

  1. SQL Server 2016 Evolution
  2. SQL Server 2016 Performance (Here I additionally included slides on in-memory OLTP and ColumnStore from SQL Server 2014)
  3. SQL Server 2016 Security and Compliance
  4. SQL Server 2016 Availability
  5. SQL Server 2016 Scalability
  6. SQL Server 2016 Cloud Service (Bonus topic)

Additional materials are available on the official site of SQL Server 2016.

You may also try Virtual Labs. (Please filter by “SQL Server 2016”.)

Machine Learning @ 1 million predictions per second and more

Watch recordings of the keynote and session previews of the Microsoft Machine Learning & Data Science Summit 2016 on the latest Big Data, Machine Learning, Artificial Intelligence, and Open Source techniques and technologies.

Some take-aways from the keynote:

  1. A combination of in-memory technologies and in-database analytics with R at scale in SQL Server 2016 can make 1 million fraud predictions per second.
  2. U-SQL in combination with Cognitive APIs and Azure ML can significantly extend datasets, making it possible to analyze large volumes of images (different objects and complexity) and text (subjects, key phrases, sentiments, story).
  3. In the future, Azure Data Lake Analytics will support Hive and Spark.
  4. Microsoft ResNet (a Deep Learning solution) is built using 152 neural network layers.
  5. Azure N-series Virtual Machines with GPUs for Deep Learning are available in preview. For example, the Tesla K80 delivers 4992 CUDA cores with a dual GPU design, up to 2.91 Teraflops of double-precision and up to 8.93 Teraflops of single-precision performance.

Case Studies:

  1. Student Drop-Out Prediction Service in Indian schools uses Azure ML.
  2. PROS used Azure and R in SQL Server for airlines to recommend prices in milliseconds. For another customer they moved an R-based solution to SQL Server 2016 to generate renewals automatically “faster in a factor of a hundred”.
  3. Dyxia used a combination of Microsoft Band, the MS Health application, Azure IoT Hub, Stream Analytics, Power BI, Machine Learning, and other services to monitor and predict anxiety in children with autism.
  4. eSmart Systems created the Connected Drone solution, combining drones with Deep Learning in Azure to automate inspections of power lines.
  5. CrowdFlower uses crowdsourcing (Human-in-the-Loop) to train machine learning models for non-confident predictions.

Below are some screenshots from the keynote.

[Screenshots: intelligence; in-mem-r-sql; mln-predictions; war-and-peace; deep-learning]

Machine Learning Solutions Decision Tree

New version is available: Artificial Intelligence Decision Tree

Machine learning is a data science technique that helps computers learn from existing data in order to forecast future behaviors, outcomes, and trends. Currently there are a lot of products that can be used for this on-premises or in the cloud, on a single node or multiple nodes, in a relational database or in Hadoop-based storage.

This article will help you choose the right Machine Learning solution based on specific requirements. We will discuss open source products that can be deployed in the Microsoft cloud (Azure) and Microsoft products that can be deployed on-premises.

Disclaimer. In this article I present the most important decision points based on my experience. You may use it as a first approximation before looking deeper into the described solutions and others.

The choice of product also depends on the development platform used by the specialists in the organization, and on which Big Data solution is already in use or planned. Key questions here are “Do you already use Hadoop or a Data Warehouse?” (SMP or MPP?), “In the cloud or on-premises?”, and “How much data is needed for machine learning?” (If storage and the ML engine are separated, what will be the cost and latency of data transfer?)
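The data transfer question can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration: the function name, the 500 GB data size, and the 1 Gbit/s link speed are assumptions chosen for the example, not measurements of any product discussed here.

```python
def transfer_time_seconds(data_gb: float, bandwidth_gbit_per_s: float) -> float:
    """Estimate how long it takes to move data from storage to an ML engine.

    Assumes decimal units (1 GB = 8 Gbit) and a fully utilized link, so the
    result is a lower bound; real transfers are slower.
    """
    if bandwidth_gbit_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_gb * 8 / bandwidth_gbit_per_s


# Hypothetical scenario: 500 GB of training data over a 1 Gbit/s link.
seconds = transfer_time_seconds(500, 1.0)
print(f"{seconds / 3600:.1f} hours")  # prints "1.1 hours"
```

If an estimate like this is large compared with the training time itself, that is an argument for running the ML engine where the data already lives.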

The process of selecting a machine learning solution may influence the selection of the Big Data solution itself. (Please see the decision tree on Big Data solutions in a separate article.)

The complexity and uniqueness of the machine learning problem also matter, as does how much effort the team is ready to invest in developing an ML solution. Some products are much easier to use (Azure Machine Learning), and for some tasks there are standard APIs available (Azure Cognitive Services).

Please note that some products can be deployed on top of the same platform. (For example, MLlib and R Server can both be deployed on top of a Spark cluster.)

So let’s see the decision tree first. Below I will provide some comments on each of the products. (You may also download a high-resolution printable version of the decision tree.)

[Figure: Machine Learning Solutions Decision Tree]

Azure Machine Learning

Azure Machine Learning (ML) is a cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions.

In Machine Learning Studio, you can create predictive models by dragging, dropping, and connecting modules. Studio also provides a library of algorithms and samples to get you started. You may create a new ML experiment using sample experiments, R and Python packages, standard algorithms (modules), and custom R and Python scripts.

In Cortana Intelligence Gallery, you can try analytics solutions authored by others or contribute your own.

Data Science development: Visual, R language and Python.

Advantages: Graphical representation of experiments; easy to learn; quick deployment; Excel integration; scalable in terms of multiple experiments.

Concerns: May not be the fastest solution for processing a large amount of data in a single experiment, so make sure that all components of your experiment can scale.

Cognitive Services

Cognitive Services are a collection of artificial intelligence REST APIs. With Cognitive Services, developers can easily add intelligent features into their applications.

Cognitive Services include:

  • Vision: From faces to feelings, allow apps to understand images and video
  • Speech: Hear and speak to users by filtering noise, identifying speakers, and understanding intent
  • Language: Process text and learn how to recognize what users want
  • Knowledge: Tap into rich knowledge amassed from the web, academia, or your own data
  • Search: Access billions of web pages, images, videos, and news with the power of Bing APIs

The collection will continuously improve, adding new APIs and updating existing ones.

Development: REST APIs.

Advantages: Quick to use; platform independent; uses some publicly available data from Bing.

Concerns: Can be used only for a subset of machine learning tasks.
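Because these services are plain REST APIs, calling one comes down to an HTTP request with a JSON body and a subscription key. Here is a minimal sketch that only builds such a request without sending it. The endpoint URL and the shape of the JSON body are placeholders I made up for illustration, so check the documentation of the specific API for the real ones.

```python
import json
import urllib.request

# Placeholder values: substitute the endpoint and key of your own subscription.
ENDPOINT = "https://example.invalid/text/analytics/sentiment"  # hypothetical URL
SUBSCRIPTION_KEY = "<your-subscription-key>"


def build_request(text: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a REST-style AI API."""
    # Assumed body shape, for illustration only.
    body = json.dumps({"documents": [{"id": "1", "text": text}]}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Header used to pass the subscription key to the service.
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        },
    )


request = build_request("The workshop was great.")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), with a real endpoint and key.
```

The same request can be issued from any language with an HTTP client, which is what makes these APIs platform independent.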

Microsoft R Server Family

Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R; it is scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics – exploration, analysis, visualization and modeling. It is compatible with the entire collection of open source algorithms, connectors, visualization tools shared openly via CRAN, Bioconductor and other shared resources like GitHub. At the same time key extensions enable R to tackle big data challenges that exceed the capacity of open source R. Scripts can be developed on the desktop and immediately deployed to RDBMS – SQL Server, EDW (SQL Server & Teradata) or Hadoop (Microsoft, Cloudera, Hortonworks and MapR).

Data Science development: R language (Open R, Scale R).

Advantages: Distributes work across cores and nodes (if multiple nodes are available); R scripts built using R Server can easily be run on multiple platforms running R Server, on-premises and in the cloud (important for hybrid scenarios).
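To give a feel for what “distributes work across cores and nodes” means, here is a small sketch of the general chunk-and-combine pattern: each worker summarizes its own chunk of data, and the small partial results are merged at the end. This is my own illustration in Python, not R Server's actual implementation, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Sequence, Tuple

Partial = Tuple[int, float, float]  # (count, sum, sum of squares)


def summarize_chunk(chunk: Sequence[float]) -> Partial:
    """Compute partial statistics for one chunk of data."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(partials: Iterable[Partial]) -> Tuple[float, float]:
    """Merge per-chunk partials into the overall mean and population variance."""
    n = total = total_sq = 0
    for count, s, sq in partials:
        n += count
        total += s
        total_sq += sq
    mean = total / n
    # Simple sum-of-squares formula; fine for a sketch, not numerically robust.
    return mean, total_sq / n - mean * mean


def chunked_mean_variance(
    chunks: List[Sequence[float]], workers: int = 4
) -> Tuple[float, float]:
    # Threads stand in for the cores or cluster nodes that would each own a chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return combine(pool.map(summarize_chunk, chunks))


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(chunked_mean_variance(chunks))  # prints (3.0, 2.0)
```

Only the small partial results travel between workers, so the full dataset never has to fit in the memory of a single process.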

SQL Server R Services (R Server for Windows)

SQL Server R Services (also known as R Server for Windows) is Advanced Analytics and Stand Alone Server Capability built into SQL Server Enterprise Edition. It brings the perfect mix of fast querying and In-Memory OLTP optimization from SQL Server 2016, as well as data exploration, predictive modeling, scoring, and visualization from the R Services family of products. It delivers speed and performance for advanced analytics using near-database analytics and parallel threading and processing. It is integrated with SQL Server: T-SQL can call a Stored Procedure with R code, R scripts can run in SQL through extensibility model, and result sets can be sent through Web API to database or applications.

Data Science development: R language (Open R, Scale R).

Advantages: Included in SQL Server; distributes work across cores; no database data movement, which means it will work much faster; data generated by data scientists using the R language can be secured and managed by DBAs and queried by data analysts.

Concerns: Uses only resources of one physical server.
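To illustrate how T-SQL can run R code, the sketch below builds a T-SQL batch around the sp_execute_external_script system stored procedure, which is the entry point of the extensibility model in SQL Server 2016. The Python helper, the table name, and the column name are hypothetical and serve only to show the shape of the call. In the R script, InputDataSet and OutputDataSet are the default names for the input and output data frames.

```python
def build_r_services_call(r_script: str, input_query: str) -> str:
    """Return a T-SQL batch that runs an R script via sp_execute_external_script."""

    def quote(value: str) -> str:
        # Escape single quotes and wrap the value as a T-SQL Unicode literal.
        return "N'" + value.replace("'", "''") + "'"

    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'R',\n"
        f"    @script = {quote(r_script)},\n"
        f"    @input_data_1 = {quote(input_query)};"
    )


# Hypothetical table and column names, used only for illustration.
r_script = "OutputDataSet <- data.frame(mean_fare = mean(InputDataSet$fare_amount))"
tsql = build_r_services_call(r_script, "SELECT fare_amount FROM dbo.nyctaxi_sample")
print(tsql)
```

The generated batch can be run from any SQL client against an instance where R Services is installed and external scripts are enabled. The query result is passed to R inside the database server, so the data does not have to leave SQL Server.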

R Server for MapReduce

R Server for MapReduce uses Hadoop MapReduce nodes for R computations.

Using R Server in MapReduce eliminates data movement latency and removes data duplication if your data is already stored in the Hadoop cluster that runs MapReduce.

Supported platforms: HDInsight Premium, Hortonworks, Cloudera, MapR.

Data Science development: R language (Open R, Scale R)

Advantages: Distributes work across cores and nodes; if you have a lot of MapReduce code and have no plans to move off MapReduce, deploying R Server on top of it will eliminate data movement for machine learning.

Concerns: Uses MapReduce which is slower than Spark.

R Server for Spark

R Server for Spark uses Apache Spark nodes for R computations at in-memory speeds using Spark RDDs. R Server for Spark leverages the Spark DAG (Directed Acyclic Graph) to distribute work across the cluster, and persistence for computation (we may leave the task running and waiting for new requests). In this scenario you can develop models using larger amounts of data with better performance.

Using R Server in Spark eliminates data movement latency and removes data duplication if your data is already stored in the cluster that runs Spark.

Supported platforms: HDInsight Premium with Spark and R Server, Spark on Hortonworks, Spark on Cloudera, Spark on MapR.

Data Science development: R language (Open R, Scale R)

Advantages: uses Spark, which means fast in-memory computations; distributes work across cores and nodes.

R Server for Teradata DB

R Server for Teradata DB uses MPP architecture for R computations.

Data Science development: R language (Open R, Scale R)

Advantages: works with Teradata DB; distributes work across cores; no database data movement, which means it will work much faster; data generated by data scientists using R language can be secured and managed by DBAs and queried by data analysts.

R Server for Linux

Data Science development: R language (Open R, Scale R)

Advantages: distributes work across cores.

Concerns: uses only the resources of one physical server; additional time will be needed to copy or stream data from HDFS to the Linux machine.

Mahout MapReduce

Mahout MapReduce is a collection of machine learning algorithms based on Hadoop MapReduce framework.

Platform: Hadoop MapReduce, Java.

Advantages: Mahout MapReduce comes with many ML algorithms to choose from; MapReduce is a much more mature framework than Spark and therefore more stable.

Concerns: Slow and does not handle iterative jobs very well (constrained by disk accesses due to MapReduce).

Mahout Samsara

Mahout Samsara is a Scala-based programming environment built on different distributed engines (Spark and H2O), which also contains machine learning algorithms. It expresses algorithms as algebraic expressions in an R-like Scala DSL, which means the code is readable by R programmers and in general is easier to understand.

Platform: Hadoop Spark, Scala.

Advantages: Fast due to use of Spark.

Concerns: Currently under development and unstable.

MLlib

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

MLlib ships with Spark as a standard component, so it works seamlessly with SparkSQL, Spark Streaming and Spark GraphX. Additionally, you may deploy R Server on top of Spark cluster.

Platform: Spark.

Data Science language: Python and Scala/Java.

Advantages: due to its in-memory capabilities, MLlib runs iterative algorithms 5-10 times faster than Mahout based on Hadoop MapReduce; efficient and interoperable with SparkSQL, Spark Streaming & Spark GraphX; clear and consistent APIs.

Concerns: not all algorithms are implemented, though MLlib is growing very rapidly.

Additional information