Artificial Intelligence Decision Tree

Let’s discuss the decision points for selecting the right components for Artificial Intelligence (AI) solutions. This is also an update to the Machine Learning Decision Tree (v1). Keep in mind that AI is a broader term than Machine Learning.

Artificial Intelligence functionality in the decision tree is divided into the following groups:

  • AI in Business Applications: Dynamics 365 AI (AI for Customer Service, Market Insights, Sales), Microsoft 365 AI (Office 365 Workplace Analytics, ML in Power BI, O365 Search).
  • Knowledge Mining: O365 Search, Azure Search.
  • AI Apps and Agents: Azure Bot Service (Framework), Cognitive Services (Vision APIs, Speech APIs, Language APIs, Search APIs, Custom APIs).
  • Machine Learning Tools: Azure Notebooks, Jupyter Notebooks, Visual Studio Code, PyCharm, Visual Studio, Azure ML Studio.
  • Cloud-based Machine Learning: Azure ML Service, Azure ML Studio, Data Preparation (Azure Data Factory, Azure Databricks), Model Training/Testing (Azure Databricks, Azure HDInsight, Data Science VM), Container Registry, Model Deployment (Azure Container Instances, Azure Kubernetes Service, Azure Batch, Azure IoT Edge), Azure Infrastructure (CPUs, GPUs, FPGAs).
  • On-premises Machine Learning: Edge Devices, Cognitive Services Containers, SQL Server ML Services, On-prem Hadoop.
  • Machine Learning Frameworks: Deep Learning (ONNX, PyTorch, TensorFlow), General ML (Spark MLlib, SparkR, SparklyR, MMLSpark).

The text description of the decision points will be available in a few days…

Big Data Decision Tree v4

This is the 4th version of the Big Data Decision Tree (Mind Map), which reflects the latest changes in Microsoft products.

As usual, a disclaimer: the process of solution selection for Big Data projects is very complex, with a lot of factors involved. That’s why you should use this decision tree only as a first approximation before looking deeper into the described and other solutions. Also, please double-check the information provided here against the official documentation.

In the decision tree, Big Data is divided according to the “three V’s”: volume, velocity, and variety. How we choose the right solution depends on which of these problems we are trying to solve first:

  • Volume: need to store and query hundreds of terabytes of data or more, and the total volume is growing. Processing systems must be scalable to handle increasing volumes of data, typically by scaling out across multiple machines.
  • Velocity: need to collect data at an increasing rate from many new types of devices, from a fast-growing number of users, and from an increasing number of devices and applications per user. Processing systems must be able to return results within an acceptable timeframe, often almost in real-time.
  • Variety: the situation when data does not match any existing data schema, as with semi-structured or unstructured data.

There are three groups of solutions addressing the described areas:

  • Complex event processing (CEP) is a method of tracking and processing streams of data about events from multiple sources, identifying meaningful events, deriving conclusions from them, and responding to them as quickly as possible. Use CEP if you need to process hundreds of thousands of events per second.
  • Data Warehouses (DWHs) are central relational repositories of integrated data from one or more disparate sources. They store current and historical data and are used for different analytical tasks in organizations. Use a DWH if you have structured relational data with a defined schema.
  • NoSQL systems provide a mechanism for storage and retrieval of data without tabular relations. Characteristics of NoSQL include simplicity of design and simpler “horizontal” scaling to clusters of machines. The data structures used by NoSQL databases (e.g. key-value, wide-column, graph, or document) are more flexible, and therefore difficult to store in relational databases. Use NoSQL systems if you have non-relational, semi-structured, or unstructured data with no defined schema.

Here is the decision tree, which maps the three types of problems to specific solutions.

Here are the most important comments on each of the big data components and the corresponding decision points.

Complex event processing

Azure Event Hub

Azure Event Hubs is a highly scalable data streaming platform and event ingestion service, capable of receiving and processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored using any real-time analytics provider or batching/storage adapters. Event Hubs provides publish-subscribe capabilities with low latency at massive scale, which makes it appropriate for big data scenarios.

Integration: supports AMQP and HTTPS protocols.

Advantages: easy to use.

Disadvantages: limited access revocation through publisher policies.
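
To make the ingestion model concrete, here is a minimal sketch of publishing a batch of events with the Python azure-eventhub SDK (v5); the connection string, hub name, and payload are placeholders, not values from this article:

```python
# Minimal Event Hubs producer sketch; the connection string and hub
# name are placeholders you would take from the Azure portal.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="telemetry",  # hypothetical event hub
)

with producer:
    batch = producer.create_batch()  # batches respect the hub's size limit
    batch.add(EventData('{"deviceId": "dev1", "temperature": 21.5}'))
    producer.send_batch(batch)       # one call sends the whole batch
```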

Azure IoT Hub

Azure IoT Hub is a managed service that enables reliable and secure bidirectional communications between millions of IoT devices and a cloud-based back end.

Features of IoT Hub include: multiple options for device-to-cloud and cloud-to-device communication; message routing to other Azure services; a queryable store for device metadata and synchronized state information; secure communications and access control using per-device security keys or X.509 certificates; and monitoring of device connectivity and device identity management events.

In terms of message ingestion, IoT Hub is similar to Event Hubs. However, it was specifically designed for managing IoT device connectivity, not just message ingestion.

Integration: supports MQTT, AMQP, HTTPS protocols.

Advantages: supports cloud-to-device communications, device-initiated file upload, and device state information using device twins; per-device identity; revocable access control.
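
As a hedged illustration of device-to-cloud messaging, here is a sketch using the Python azure-iot-device SDK; the per-device connection string comes from the IoT Hub device registry and is a placeholder here:

```python
# Minimal device-to-cloud telemetry sketch with azure-iot-device.
from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string(
    "HostName=<hub>.azure-devices.net;DeviceId=dev1;SharedAccessKey=<key>"  # placeholder
)
client.connect()
client.send_message(Message('{"temperature": 21.5}'))  # authenticated per device
client.disconnect()
```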

Azure HDInsight Kafka

Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams.

Programmability and integration: Kafka is often used with Apache Storm or Spark for real-time stream processing; the Kafka 0.10.0.0 streaming API allows you to build streaming solutions without requiring Storm or Spark; supports the Kafka protocol.

Advantages: simplified configuration process; 99.9% SLA on Kafka uptime; scaling (changing the number of worker nodes) and rebalancing of Kafka partitions and replicas using Update Domains (UD) and Fault Domains (FD); monitoring of Kafka using Azure Log Analytics; integration with external authentication services.

Disadvantages: complexity.
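
The publish/subscribe model looks roughly like this in Python with the kafka-python library; the broker address and topic name are placeholders for the HDInsight Kafka worker nodes:

```python
# Minimal Kafka publish/subscribe sketch using kafka-python.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers=["wn0-kafka:9092"])  # placeholder broker
producer.send("clickstream", b'{"user": 42, "page": "/home"}')  # named data stream
producer.flush()

consumer = KafkaConsumer("clickstream",
                         bootstrap_servers=["wn0-kafka:9092"],
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop iterating when idle
for record in consumer:
    print(record.value)  # each subscriber reads the stream independently
```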

Azure Stream Analytics (ASA)

Azure Stream Analytics (ASA) may be used for real-time insights from devices, sensors, infrastructure, and applications. Scenarios: real-time remote management and monitoring. ASA is optimized to get streaming data from Azure Event Hubs and Azure Blob Storage. ASA SQL-like queries run continuously against the stream of incoming events. The results can be stored in Blob Storage, Event Hubs, Azure Tables, and Azure SQL Database. If the output is stored in an Event Hub, it can become the input to another ASA job, chaining together multiple real-time queries.

Programmability: Stream Analytics query language, JavaScript; declarative programming paradigm.

Advantages: SQL-like query language; cloud-based, close to globally distributed data; largest number of supported data sinks.

Disadvantages: supports only Avro, JSON or CSV, UTF-8 encoded input data formats.

Notes: priced by streaming units; scaled by query partitions.

Azure HDInsight with Apache Storm

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG). Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. Storm topologies run indefinitely until “killed”. Storm uses Zookeeper to manage its processes. Storm can read and write files to HDFS.

Architecture: Storm processes the events one at a time.

Performance: millisecond latency.

Programmability: Java, C#, Python; imperative paradigm; HDInsight Tools for Visual Studio; integrates with Azure Event Hubs, Azure SQL DB, Azure Storage, and Azure Data Lake Storage.

Advantages: complete stream processing engine with micro-batching support; 99% Service Level Agreement (SLA) on Storm uptime; dynamic scaling.

Disadvantages: supports only streaming data; not integrated with the Azure platform.

Notes: priced per cluster hour.

Azure HDInsight with Spark Streaming

Spark Streaming is used to build interactive and analytical applications, for example to create low-latency dashboards and security alert systems, to optimize operations, or to prevent specific outcomes. It includes high-level operators to read streaming data from Apache Flume, Apache Kafka, and Twitter, and historical data from HDFS.

Architecture: Spark collects incoming events into small batches over a short time window before processing them.

Programmability: Scala, Python, Java; DStreams; a mixture of declarative and imperative paradigms.

Performance: hundreds of MB/s with low latency (a few seconds).

Disadvantages: not integrated with the Azure platform.

Notes: priced per cluster hour
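
A minimal DStream sketch in PySpark illustrates the micro-batch model described above; the socket source host and port are placeholders:

```python
# Word counts over 10-second micro-batches read from a text socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, batchDuration=10)     # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's result

ssc.start()
ssc.awaitTermination()
```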

Azure App Service WebJobs

WebJobs is a feature of Azure App Service that enables you to run a program or script in the same context as a web app, API app, or mobile app.

Programmability: C#, Node.js, PHP, Java, Python; imperative paradigm.

Advantages: more control over JobHost behavior in the host.json file (for example, to configure a custom retry policy for Azure Storage).

Disadvantages: no built-in temporal/windowing support; no late-arrival or out-of-order event handling support.

Notes: priced per app service plan hour.

Azure Functions

Azure Functions is a solution for easily running small pieces of code, or “functions,” in the cloud. Azure Functions lets you respond to events delivered to an Azure Event Hub. It is useful in application instrumentation, user experience or workflow processing, and Internet of Things (IoT) scenarios.

Programmability: C#, F#, Node.js; imperative paradigm.

Advantages: you pay only for the time your code runs and trust Azure to scale as needed.

Disadvantages: no built-in temporal/windowing support; limited to 200 function app instances processing in parallel; no late-arrival or out-of-order event handling support.

Notes: priced per function execution and resource consumption.
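
For illustration, here is a sketch of an Event Hub-triggered function in Python; the binding details (hub name, connection setting, cardinality) live in function.json and are assumed here:

```python
# Event Hub-triggered Azure Function sketch (Python programming model v1).
import logging
import azure.functions as func

def main(event: func.EventHubEvent):
    # The platform invokes this per event and scales instances as needed.
    payload = event.get_body().decode("utf-8")
    logging.info("Processed event: %s", payload)
```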

Big data warehouses

Azure SQL DW Gen1

Azure SQL Data Warehouse (DW) is the massively parallel processing (MPP) version of SQL Server in Azure for data warehousing workloads. It allows you to quickly run complex queries across petabytes of data, allows resizing of compute nodes in about a minute, and is integrated with the Azure platform.

Advantages: highly scalable; MPP architecture; lower-cost relational storage than Blobs; supports pausing compute; relational store; T-SQL; flexible indexing; security.

Disadvantages: 4-5 times less powerful than Azure SQL DW Gen2; cannot query external relational stores; no row-level security; no dynamic data masking.

Azure SQL DW Gen2

Azure SQL DW Gen2 comes with five times the compute capacity and four times the concurrent queries of the Gen1 offering. The enhanced storage architecture on Gen2 introduces unlimited columnar storage capacity, while maintaining the ability to independently scale compute and storage.

The Azure SQL Data Warehouse Compute Optimized Gen2 tier comes with up to 5 times better query performance, 4 times more concurrency, and 5 times higher computing power compared to Gen1. It can serve 128 concurrent queries from a single cluster.

Powering these performance gains is adaptive caching technology that understands where data needs to be and when it needs to be there for the best possible performance. Azure SQL Data Warehouse takes a blended approach of using remote storage in combination with a fast SSD cache layer (using NVMe) that places data next to compute based on user access patterns and frequency.

An automatic upgrade from Gen1 to Gen2 is available from the Azure portal.

Programmability: T-SQL.

Advantages: 4-5 times better performance, concurrency, and compute compared to Gen1; supports pausing compute; Transparent Data Encryption with customer-managed keys.

Disadvantages: no row-level security; no dynamic data masking.
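
As a hedged sketch of the MPP-specific DDL, here is how a hash-distributed table might be created from Python via pyodbc; the server, database, credentials, and table are placeholders:

```python
# Creating a hash-distributed columnstore table in Azure SQL DW via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<dw>;"
    "UID=<user>;PWD=<password>"  # placeholders
)
conn.execute("""
    CREATE TABLE dbo.FactSales (SaleId INT, Amount MONEY)
    WITH (DISTRIBUTION = HASH(SaleId),   -- spread rows across compute nodes
          CLUSTERED COLUMNSTORE INDEX);  -- default columnar storage
""")
conn.commit()
```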

Microsoft APS/PDW

Microsoft Analytics Platform System (APS) is a combination of the massively parallel processing (MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data technologies. It uses the Hortonworks Data Platform (HDP) to provide an on-premises solution that contains a region for Hadoop-based processing, together with PolyBase, a connectivity mechanism that integrates the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It allows data in Hadoop to be queried and combined with on-premises relational data, and data to be moved into and out of Hadoop.

Advantages: very cost-effective fast MPP architecture if constantly and fully used.

Disadvantages: make sure that most of your queries don’t initiate data movement between the nodes, which is an expensive operation; increasing capacity requires buying an additional rack and reconfiguring it manually.

NoSQL on-premises or IaaS in the cloud

SQL Server Big Data Cluster

Starting with the SQL Server 2019 preview, SQL Server big data clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side to enable you to read, write, and process big data from Transact-SQL or Spark, allowing you to easily combine and analyze your high-value relational data with high-volume big data.

In SQL Server big data clusters, Kubernetes is responsible for the state of the cluster: it builds and configures the cluster nodes, assigns pods to nodes, and monitors the health of the cluster. This means that a SQL Server Big Data Cluster can easily be deployed to any cloud supporting Kubernetes.

Advantages: on-premises/IaaS deployment allows customization; SQL Server supports dynamic data masking and row-level security; PolyBase can query and join with external data sources without moving or copying the data; scalable HDFS storage pool; scale-out data marts; integrated AI and Machine Learning; can be deployed in non-Microsoft clouds supporting Kubernetes clusters.

Disadvantages: currently in public preview.

Hadoop on-prem or in VMs

Apache Hadoop is the original open-source framework for distributed processing and analysis of big data sets on clusters. The Hadoop technology stack includes related software and utilities, including Apache Hive, Apache HBase, Spark, Kafka, and many others.

Hortonworks, Cloudera, and MapR implementations of Hadoop are available in the Azure Marketplace.

Advantages: flexibility (can be used on-premises or in an IaaS cloud environment, enabling easy migration); full control over deployment.

Disadvantages: complexity of deployment; need to manage updates.

Notes: can be deployed in Azure IaaS using custom Docker images with the Azure Distributed Data Engineering Toolkit (AZTK).

NoSQL Processing in Azure

Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Programmability: Python, Scala, Java, R, SQL.

Advantages: user-friendly UI for collaboration and experimentation (notebooks, cluster creation, etc.); fast cluster start times, auto-termination, auto-scaling; supports pausing compute; supports fast scale-out (less than 1 minute); supports GPU-enabled clusters; security with native AD integration.

Disadvantages: cannot act as a relational data store; no row-level security.

Notes: priced by Databricks Unit (DBU) and cluster hour; does not support firewalls.
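
A typical notebook cell is just PySpark code; here is a minimal sketch assuming a CSV file at a hypothetical mount point (the `spark` session is provided automatically in a Databricks notebook):

```python
# Read a CSV from mounted storage and aggregate; the path and column
# names are placeholders, and `spark` is predefined by Databricks.
df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/data/sales.csv"))   # hypothetical mount point

(df.groupBy("region")
   .sum("amount")
   .orderBy("region")
   .show())                              # rendered as a table in the notebook
```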

Azure Data Lake Analytics

Azure Data Lake Analytics (ADLA) is a distributed analytics service built on Apache YARN. It handles jobs of any scale instantly: you simply set how much compute power is needed. It allows you to run analytics on exabytes of data, and you pay only for the cost of the query. ADLA supports Azure Active Directory for access control, roles, and integration with on-premises identity systems. It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#; the runtime processes data across multiple Azure data sources. ADLA allows you to compute on data anywhere and join data from multiple cloud sources such as Azure SQL DW, Azure SQL DB, ADLS, Azure Storage Blobs, and SQL Server in an Azure VM.

Programmability: U-SQL.

Advantages: easy start on big data leveraging SQL and C# skills; AAD integration; security; integrated with the Azure platform and Visual Studio; can be priced per job; supports fast scale-out (less than 1 minute).

Disadvantages: currently only batch mode is supported — you may use HDInsight for other types of workloads; no clear roadmap; not compatible with ADLS Gen2; no in-memory caching of data; no row-level security; no dynamic data masking.

Azure HDInsight

HDInsight is a cloud-hosted service available to Azure subscribers that uses Azure clusters to run HDP (Hortonworks’ distribution of Hadoop), and integrates with Azure storage.

Supports a variety of open source analytics engines such as Hive LLAP, Storm, Kafka, HBase, and Spark.

Advantages: cloud-based, which means a cluster can be created in approximately 15 minutes; scales nodes on demand; fully managed by Microsoft (upgrades, patching); some Visual Studio and IntelliJ integration; 99.9% SLA.

Concerns: Ranger (Kerberos-based) security; requires manual configuration and scaling; cannot act as a relational data store.

Azure HDInsight with Spark

Apache Spark is an open source cluster computing framework. It provides an API based on the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines. RDDs function as a working set for distributed programs, offering a form of distributed shared memory. Components built on top of Spark include Spark SQL, Spark Streaming, MLlib, and GraphX.

Programmability: Python, Scala, Java, R, SQL.

Advantages: in-memory, fast (5-7 times faster than MapReduce).

Disadvantages: fewer components compared with the MapReduce-based ecosystem; no row-level security.
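
The RDD model described above looks like this in PySpark; a minimal sketch, with transformations staying lazy until an action runs:

```python
# Lazy transformations over a partitioned, read-only dataset (RDD).
from pyspark import SparkContext

sc = SparkContext(appName="RddSketch")
rdd = sc.parallelize(range(1_000_000), numSlices=8)            # distributed multiset
evens = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)  # lazy pipeline
print(evens.take(5))                                           # action triggers execution
sc.stop()
```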

Azure HDInsight Hadoop

Apache Hadoop includes two core components: the Apache Hadoop Distributed File System (HDFS) that provides storage, and Apache Hadoop Yet Another Resource Negotiator (YARN) that provides processing. With storage and processing capabilities, a cluster becomes capable of running MapReduce programs to perform the desired data processing.

Advantages: a lot of open source components on top of MapReduce.

Disadvantages: much slower than Spark; not in-memory.

Azure HDInsight with Hive LLAP

Interactive Query (also called Apache Hive LLAP, or Low Latency Analytical Processing) is an Azure HDInsight cluster type. Interactive Query supports in-memory caching, which makes Apache Hive queries faster and much more interactive. An Interactive Query cluster contains only the Hive service.

Programmability: HiveQL (can be executed from Power BI, Apache Zeppelin, Visual Studio, Visual Studio Code, Apache Ambari Hive View, Beeline, Hive ODBC).

Advantages: fast performance with intelligent caching (low latency); supports ACID transactions; scalable query concurrency; optimized as a serving layer for speed.

Disadvantages: cannot query external relational stores (like Azure SQL DB, SQL Server in VM, Azure SQL DW); manual configuration and scaling; no redundant regional servers for high availability.

Azure HDInsight with ML Services

Azure HDInsight with ML Services (also known as an R Server cluster on HDInsight) provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. The service provides the latest capabilities for R-based analytics on datasets of virtually any size, loaded into either Azure Blob or Data Lake storage. Since the ML Services cluster is built on open-source R, the R-based applications you build can leverage any of the 8000+ open-source R packages. The routines in ScaleR, Microsoft’s big data analytics package, are also available. ML Services bridges these Microsoft innovations and contributions from the open-source community (R, Python, and AI toolkits), all on top of a single enterprise-grade platform.

Programmability: ML Services includes a highly scalable, distributed set of algorithms (such as RevoScaleR, revoscalepy, and MicrosoftML) that can work on data sizes larger than physical memory and run on a wide variety of platforms in a distributed manner. It also includes Microsoft’s custom R and Python packages.

Advantages: can run AI packages from Microsoft and open source; RevoScaleR allows you to apply ML algorithms to data that cannot fit in the memory of the cluster.

NoSQL Database in Azure

Azure Cosmos DB

Azure Cosmos DB is a globally distributed database service designed to elastically and independently scale throughput and storage across any number of geographical regions with a comprehensive SLA. It supports document, key/value, or graph databases leveraging popular APIs and programming models: DocumentDB API, MongoDB API, Graph API, and Table API.

Development: SQL query and transactions over JSON documents, REST; SDKs: .NET, Node.js, Java, JavaScript, Python, Xamarin, Gremlin.

Advantages: different storage formats, global distribution, elastic scale-out, low latency, 5 consistency models, automatic indexing, schema-agnostic, native JSON, stored procedures.

Disadvantages: no support for in-memory caching of data; no dynamic data masking.
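
A minimal sketch with the azure-cosmos Python SDK (v4) shows the SQL-over-JSON model; the endpoint, key, database, and container names are placeholders, and the container is assumed to be partitioned on /id:

```python
# Upsert a JSON document and run a SQL query over it with azure-cosmos.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")  # placeholders
container = client.get_database_client("travel").get_container_client("flights")

container.upsert_item({"id": "f1", "origin": "SEA", "delayMinutes": 17})
for item in container.query_items(
        query="SELECT c.id FROM c WHERE c.delayMinutes > 15",
        enable_cross_partition_query=True):
    print(item)
```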

Azure HDInsight with HBase

Apache HBase is a NoSQL wide-column store for writing large amounts of unstructured or semi-structured application data and running analytical processes using Hadoop. It provides random access and strong consistency for large amounts of unstructured and semi-structured data in a schemaless database organized by column families. Data is stored in the rows and columns of a table, and data within a row is grouped by column family.

Programmability and integration: Phoenix, OpenTSDB, Kiji, Titan, and others can run on top of HBase by using it as a datastore; Apache Hive, Apache Pig, Solr, Apache Storm, Apache Flume, Apache Impala, Apache Spark, Ganglia, and Apache Drill can also integrate with HBase.

Advantages: NoSQL wide-column store; can be used as a key-value store, for sensor data, and for real-time querying; optimized as a serving layer for speed.

Disadvantages: manual configuration and scaling; no support for in-memory caching of data.

Notes: supports SQL language using Phoenix JDBC driver.
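
The wide-column model is easiest to see in code; here is a hedged sketch using the happybase library over HBase’s Thrift interface, with the host, table, and column family as placeholders:

```python
# Row key plus columns grouped by a column family in HBase.
import happybase

connection = happybase.Connection("hbase-head-node")  # placeholder host
table = connection.table("sensor_data")               # hypothetical table

table.put(b"device1-20190101", {b"reading:temperature": b"21.5"})
row = table.row(b"device1-20190101")
print(row[b"reading:temperature"])
```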

NoSQL storage

Azure Data Lake Store Gen1

Azure Data Lake Store (ADLS) is a distributed, parallel file system in the cloud, performance-tuned and optimized for analytics on different data types. It is supported by leading Hadoop distributions (Hortonworks, Cloudera, MapR) as well as HDInsight and Azure Data Lake Analytics (ADLA).

Development: WebHDFS protocol (behaves like HDFS); REST API over HTTPS.

Advantages: hierarchical file system; optimized performance for parallel analytical workloads; high throughput and IOPS; no limit on account sizes, file sizes, or number of files.

Disadvantages: only locally redundant; not available in some regions.

Azure Blob Storage

Azure Blob Storage is a general-purpose object store for a wide variety of storage scenarios. It is highly available, secure, durable, scalable, and redundant. It provides hot, cool, and archive storage tiers for different use cases.

Administrative tools: PowerShell, AzCopy.

Development: .NET, Java, Android, C++, Node.js, PHP, Ruby, and Python; REST API with HTTP/HTTPS requests.

Advantages: most compatible; globally redundant; lowest storage costs; better for simple non-hierarchical storage; client-side encryption.

Disadvantages: flat namespace; not optimized for analytical workloads; max 500 TB per account and 4.75 TB per file.
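
A minimal sketch with the current azure-storage-blob Python SDK illustrates the flat namespace (a blob “path” is just part of its name); the connection string, container, and blob names are placeholders:

```python
# Upload and read back an object with azure-storage-blob (v12).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")  # placeholder
blob = service.get_blob_client(container="raw-data", blob="2019/01/events.json")

blob.upload_blob(b'{"event": "click"}', overwrite=True)  # "2019/01/" is only a name prefix
print(blob.download_blob().readall())
```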

Azure Table Storage

Azure Table Storage allows you to store petabytes of semi-structured data while keeping costs down, without manual sharding. Using geo-redundant storage, stored data is replicated three times within a region and an additional three times in another region.

Development: OData-based queries.

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 combines features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, with the low-cost, tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

Advantages: able to store and serve many exabytes of data with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS); near-constant per-request latencies; the hierarchical namespace significantly improves the overall performance of many analytics jobs.

Disadvantages: slightly more expensive transaction costs compared to Gen1.
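
In contrast to Blob Storage’s flat namespace, the hierarchical namespace supports real directories; here is a hedged sketch with the azure-storage-file-datalake Python SDK, with the account, key, and paths as placeholders:

```python
# Create a directory tree and write a file with azure-storage-file-datalake.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>")              # placeholders
fs = service.get_file_system_client("analytics")

fs.create_directory("curated/sales/2019")    # a real directory, not a name prefix
file = fs.get_file_client("curated/sales/2019/day1.csv")
file.upload_data(b"region,amount\nwest,100\n", overwrite=True)
```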

Reference

  1. Choosing a real-time message ingestion technology in Azure
  2. Choosing a stream processing technology in Azure
  3. Choosing an analytical data store in Azure
  4. Choosing a batch processing technology in Azure
  5. Choosing a big data storage technology in Azure
  6. Apache Kafka on HDInsight
  7. Apache Storm on Azure HDInsight
  8. ML Services and open-source R capabilities on HDInsight
  9. Interactive Query with HDInsight
  10. SQL Server 2019 big data clusters
  11. Apache HBase in HDInsight
  12. Azure Databricks Documentation
  13. Connecting IoT Devices to Azure: IoT Hub and Event Hubs
  14. Melissa Coates. Data Lake Use Cases and Planning Considerations
  15. Big Data Architectures
  16. James Serra. SQL Server 2019 Big Data Clusters
  17. James Serra. Azure Data Lake Store Gen2 is GA

What does “near 100% compatibility” of Azure SQL DB Managed Instance actually mean?

Azure SQL Database Managed Instance (Azure SQL DB MI) is a fully managed SQL Server Database Engine instance hosted in the Azure cloud. This is the most compatible PaaS option for migrating on-premises SQL Server databases to the cloud (PaaS is a good choice if you want to use capabilities like automatic patching and version updates, automated backups, and built-in high availability to reduce management overhead and TCO).

So, what does “near 100% compatibility of Azure SQL DB MI with the latest SQL Server on-premises (Enterprise Edition) Database Engine” actually mean?

First, components of SQL Server that are not related to the Database Engine are not available in Azure SQL DB MI: Reporting Services, Integration Services, Analysis Services, Master Data Services, and Data Quality Services are not there.

Second, some of the features of SQL Server EE needed for enterprise database workloads are still not available in Azure SQL DB MI, probably due to complexity considerations, insufficient demand, or the availability of similar or better capabilities in Azure.

Feature comparison of SQL Server and Azure SQL DB MI can be found in the official documentation.

Below is a graphical representation of the most important differences between SQL Server, Azure SQL DB MI, and some other PaaS offerings in Azure.

In the picture, features on the borders are partially compatible.

Azure Search with Knowledge-based Cognitive Capabilities

Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.

In the context of organizations, applications with search capabilities can be used by external customers or internal business users. An example is Cognitive Search on top of the publicly available JFK files (see the JFK Files Public Site and JFK Files Project).

When you use full-text search, query execution is done over a user-defined index built on top of files and searchable datasets.

Cognitive search

Cognitive search can be added to create searchable information out of non-searchable content by attaching AI algorithms to an indexing pipeline. AI integration is provided through cognitive skills, enriching source documents before creating a search index.

Cognitive Skills are based on the same AI algorithms used in Cognitive Services APIs:

  1. Natural language processing skills include entity recognition, language detection, key phrase extraction, text manipulation, and sentiment detection. With these skills, unstructured text becomes structured, mapped to searchable and filterable fields in an index.
  2. Image processing skills include OCR and identification of visual features, such as facial detection, image interpretation, image recognition (famous people and landmarks) or attributes like colors or image orientation. You can create text-representations of image content, searchable using all the query capabilities of Azure Search.
  3. Custom skills are a way to insert transformations unique to application content. A custom skill executes independently, applying whatever enrichment step you require. For example, you could define field-specific custom entities, build custom classification models to differentiate business and financial contracts and documents, or add a speech recognition skill to reach deeper into audio files for relevant content.

Skills can be chained. For instance, you may want to use the language you detected to improve the accuracy of the key-phrase extractor.
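
Once the enriched index is built, querying it is plain Azure Search; here is a hedged sketch with the Python azure-search-documents SDK, where the endpoint, key, index name, and the skill-populated keyPhrases field are all assumptions:

```python
# Query an index whose keyPhrases field was populated by a cognitive skill.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="jfk-files",                          # hypothetical index
    credential=AzureKeyCredential("<api-key>"))

for doc in client.search(search_text="oswald", select=["id", "keyPhrases"]):
    print(doc["id"], doc["keyPhrases"])
```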

Video from Ignite: AI for Knowledge Mining, by Luis Cabrera.

Azure Search Global Distribution

To reduce latency for remote users (in the case of geo-distributed workloads), it makes sense to create search services in each corresponding region (in closer proximity to those users). For example, you may use multiple Azure Search indexers in different regions pointing to the same datastore. To route requests to multiple geo-located websites backed by multiple Azure Search services, you may use Azure Traffic Manager. This approach also provides high availability and load balancing.

Monitoring Azure SQL Databases

In this article we will discuss different ways of monitoring data solutions, including different types of Azure SQL databases such as Azure SQL DB, Azure SQL DW, and Azure SQL Managed Instance. This can be done using two approaches:

  • Single-database monitoring using internal SQL Server features (SQL Server Query Store, SQL Server dynamic management views, SQL Server Extended Events), DTU consumption in the Azure portal, Query Performance Insight, and SQL Database Advisor;
  • Multiple-database monitoring and reaction using Azure Monitor with Azure SQL Analytics, Event Hubs, Logic Apps, and Power BI.

Monitoring and troubleshooting single database performance:

  • DTU consumption in Azure portal
  • Query Performance Insight
  • SQL Database Advisor
  • Azure SQL Intelligent Insights
  • Real-time monitoring: dynamic management views (DMVs), extended events, and the Query Store.

Monitoring Multiple databases and reacting to events can be done using Azure Monitor with Azure SQL Analytics, Event Hubs, Logic Apps, and Power BI.

DTU consumption in Azure portal

DTU consumption in the Azure portal: for each SQL database, use the Monitoring chart to look for resources approaching their maximum.

Query Performance Insight

Query Performance Insight lets you spend less time troubleshooting database performance by providing the following:

  • Deeper insight into your database’s resource (DTU) consumption
  • The top CPU-consuming queries, which can potentially be tuned for improved performance
  • The ability to drill down into the details of a query

SQL Database Advisor

SQL Database Advisor allows you to view recommendations for creating and dropping indexes, parameterizing queries, and fixing schema issues. The advisor assesses performance by analyzing the SQL database’s usage history, and offers the recommendations best suited to the database’s typical workload.

Azure SQL Intelligent Insights

Azure SQL Intelligent Insights can be used for automatic monitoring of database performance. Once a performance issue is detected (for example, performance degradation), a diagnostic log is generated with details and a Root Cause Analysis (RCA) of the issue. A performance improvement recommendation is provided when possible.

Real-time monitoring

You can also use dynamic management views (DMVs), extended events, and the Query Store to get performance parameters in real time. See the performance guidance to find techniques that you can use to improve the performance of Azure SQL Database if you identify issues using these reports or views.

SQL Server Query Store (QDS) allows you to ask questions about workloads by gathering a history of compilation and runtime metrics throughout query executions:

  • Compile-time statistics: query text; semantic-affecting settings; containing objects (SP, TVF, trigger); parameterization type; compilation, binding, and optimization stats; query plan plus initial and last compile/execute times.
  • Run-time stats (aggregated over an interval): count of executions and first/last execution time; AVG, LAST, MIN, MAX, and STDEV for metrics such as duration, CPU time, logical I/O reads and writes, physical I/O reads, DOP, memory grants, and number of rows.

Dynamic management views (DMVs) – Microsoft Azure SQL Database enables a subset of dynamic management views to diagnose performance problems, which might be caused by blocked or long-running queries, resource bottlenecks, poor query plans, and so on. SQL Database partially supports database-related, execution-related, and transaction-related DMVs.
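
For example, the execution-related DMVs can be polled from any SQL client; here is a minimal sketch via pyodbc, with connection details as placeholders:

```python
# List the longest-running active requests using sys.dm_exec_requests.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;"
    "UID=<user>;PWD=<password>")  # placeholders
for row in conn.execute("""
        SELECT TOP 10 session_id, status, cpu_time, total_elapsed_time
        FROM sys.dm_exec_requests
        ORDER BY total_elapsed_time DESC"""):
    print(row.session_id, row.status, row.cpu_time, row.total_elapsed_time)
```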

Extended Events (XE) allow you to monitor and troubleshoot performance issues, SQL statement executions, and full-text related errors. The results can be captured to a ring buffer target (briefly holds event data in memory), an event counter target (counts all events that occur during an extended events session), or an event file target (writes complete buffers to an Azure Storage container).

Azure Monitor

Azure Monitor maximizes the availability and performance of your applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments. It helps you understand how your applications are performing and proactively identifies issues affecting them and the resources they depend on.

Azure Monitor uses two data stores (for metrics and for logs). It takes data from the sources that collect telemetry from different monitored resources and populate the data stores. Azure Monitor provides analysis, alerting, and streaming to external systems of the collected data.

The most important services related to SQL monitoring using Azure Monitor are the following:

  • Azure SQL Analytics – provides SQL-related dashboards
  • Logic Apps – allows creating automated workflows reacting to events in Azure Monitor
  • Event Hubs – allows streaming of monitoring data to partner monitoring tools
  • Power BI – allows you to create custom visualizations.

Azure SQL Analytics is a cloud monitoring solution for monitoring performance of Azure SQL databases, elastic pools, and Managed Instances at scale and across multiple subscriptions. It collects and visualizes important Azure SQL Database performance metrics with built-in intelligence for performance troubleshooting.

Logic Apps is a service that allows you to automate tasks and business processes using workflows that integrate with different systems and services. Activities are available that read and write metrics and logs in Azure Monitor, which allows you to build workflows integrating with a variety of other systems.

Azure Event Hubs is a streaming platform and event ingestion service that can transform and store data using any real-time analytics provider or batching/storage adapters. Use Event Hubs to stream log data from Azure Monitor to partner SIEM and monitoring tools.

Power BI is a business analytics service that provides interactive visualizations across a variety of data sources and is an effective means of making data available to others within and outside your organization. You can configure Power BI to automatically import log data from Azure Monitor to take advantage of these additional visualizations. Note: this is especially convenient when you also monitor SaaS solutions, for example Power BI report usage with Power BI Premium Capacities; in this case you will have all dashboards and interactive reports in one collaboration environment.

Important Notes

Log Analytics and Application Insights have been consolidated into Azure Monitor. The Operations Management Suite (OMS) brand, a combination (for licensing purposes) of Application Insights, Azure Automation, Azure Backup, Log Analytics, and Site Recovery, is retired.

Reference materials

  1. Azure Monitor overview
  2. Monitor Azure SQL Database using Azure SQL Analytics (Preview)
  3. Monitoring Azure SQL Database using dynamic management views
  4. Announcing the Power BI Solution Template for Azure Activity Log Analytics
  5. New monitoring capabilities for Power BI Premium Capacities
  6. Intelligent Insights
  7. SQL Database Service: Monitoring and performance tuning

Modern Data Platform Map and Video

Last update: Dec 4, 2018

The Modern Data Platform Map represents a reference organizational layout of the most important data pillars and services, and the corresponding groups of specialists in enterprises.

In the following video I give a quick overview of the Microsoft Data Platform. I will provide more details in subsequent posts and videos. Please post your questions, suggestions, and feedback below.

You may also check the following data pillars and products for details:

Business Intelligence Solutions Decision Tree

In this article we will cover the most important Business Intelligence components based on the Microsoft Data Platform. One week ago there were announcements about Power BI Premium and Power BI Report Server which require some clarification, so I decided to create another decision tree describing the available Microsoft analytical modeling and visualization tools, and covering the Power BI related components in more detail.

For the purposes of this article we will define Business Intelligence in a narrow way, as the top and middle layers of the BI stack, so it will include Analytical Modeling, Data Visualization, and Collaboration. We will also cover Sites and Apps integration as an important part of BI functionality.

  1. Analytical Modeling solutions allow you to load data from different data sources, combine data in one model, and create calculations.
  2. Data Visualization and Collaboration solutions allow users to create, change, manage and share reports and dashboards built on top of analytical models or data sources.
  3. Sites and Apps Integration solutions allow you to create applications on top of data sources, embed analytical reports into applications and web sites, and create data-driven workflows.

Here is the decision tree, which maps these areas to specific solutions. Below I will provide some comments on each of them.

Analytical Modeling

  • Azure Analysis Services is an Azure PaaS offering built on the proven analytics engine in Microsoft SQL Server Analysis Services. Azure Analysis Services provides enterprise-grade tabular data modeling in the cloud.
  • SQL Server Analysis Services (SSAS) is a part of SQL Server which contains engines for multidimensional (OLAP) and tabular analytical models, and for data mining.
  • SQLBI DAX Studio is a tool to write, execute, and analyze DAX queries in Power BI Designer, Power Pivot for Excel, and Analysis Services Tabular. It includes an Object Browser, query editing and execution, formula and measure editing, syntax highlighting and formatting, integrated tracing and query execution breakdowns.
  • Microsoft Excel is a spreadsheet application with cell-based calculations. It includes Pivot Tables, Pivot Charts and Power View for data visualization; Power Query for data transformation; Power Pivot to create in-memory tabular models and calculations. Excel is a component of Microsoft Office applications package, and is also available in Office 365 subscriptions.
  • Power BI Desktop is a visual data exploration tool for data analysis and report creation. It allows you to load multiple data sources, define the data structure, transform data, create an analytical tabular model, visualize and explore data interactively, and publish to the Power BI Service.

Visualization and collaboration

  • Power BI is a set of tools for self-service and traditional business intelligence. It uses tabular analytical models, allows you to build interactive reports and dashboards, and features mobile reports, collaboration, and application embedding.
  • Power BI Mobile is a set of free Windows, iOS, and Android applications for viewing and exploring personalized dashboards and reports created in the Power BI Service. It also keeps users up to date with data-driven alerts.
  • Power BI Service (powerbi.com) is the SaaS part of the Power BI offering. It allows you to create interactive reports and dashboards, build reports and datasets, update data with real-time, automatic, and scheduled refreshes, share dashboards easily with other people in your organization, ask questions of your data with natural language queries, and stay connected to your data with mobile applications.
    • Power BI Free is a free version of the Power BI Service intended for report authoring (personal use). This service is transitioning to have the same functionality as Power BI Pro, but with limited sharing and collaboration features. (This will be effective June 1st.)
    • Power BI Pro is a professional version of the Power BI Service intended for report authoring, sharing, and collaboration. Power BI Pro is paid per user, per month.
    • Power BI Premium is dedicated capacity for large-scale BI deployments, with enhanced performance and larger data volumes, without requiring per-user licenses. Power BI Premium builds on the existing Power BI portfolio with a capacity-based licensing model that increases flexibility for how users access, share, and distribute content. Power BI Premium is paid per node, per month.
  • Power BI Report Server is an on-premises server that allows the deployment and distribution of interactive Power BI reports – and traditional paginated reports – completely within the boundaries of the organization’s firewall. Power BI Report Server is available as part of Power BI Premium or with SQL EE SA.
  • SQL Server Reporting Services is a solution for creating, publishing, and managing reports, and delivering them to users in a web browser, on a mobile device, or as email. Supported report types: “traditional” paginated reports, mobile reports (AKA DataZen), and Power BI reports (through Power BI Report Server or the Power BI Service).

Sites and Apps development

  • Power BI Embedded is a PaaS offering in Azure that provides interactive data visualizations in customer-facing apps without the time and expense of building them from the ground up. In the future it will be converged with the Power BI Service to deliver one API surface, a consistent set of capabilities, and access to the latest features.
  • Microsoft Flow is a component of Office 365 that provides a user-friendly and intuitive way of creating automated workflows between applications and services to generate notifications, synchronize files, collect data, and perform other actions.
  • Microsoft PowerApps is a component of Office 365 with a user-friendly and intuitive interface that allows you to build applications without writing code, connect to data sources and create new data, and publish and use the created apps on the web and mobile devices. PowerApps allows business experts in the organization to create the apps they need to support their business requirements with drag-and-drop simplicity.
  • SharePoint in Office 365 allows you to integrate Power BI interactive reports into SharePoint web pages.
  • The SharePoint Server on-premises solution also includes BI-related functionality, such as integration with SQL Server Reporting Services and the creation of PerformancePoint dashboards.

Decision Tree for Enterprise Information Management (EIM)

In continuation of the Big Data Solutions Decision Tree, it makes sense to provide additional details on Enterprise Information Management (EIM). In this article we will define EIM as the set of solutions that enable optimal use of information within organizations to support decision-making processes or day-to-day operations that require the availability of knowledge.

For this purpose, we will look into the following aspects of EIM:

  1. Master data management (MDM) is a method of enabling an enterprise to link all of its critical data to one file, called a master file, that provides a common point of reference. MDM streamlines data sharing among personnel and departments, and can facilitate computing in multiple system architectures, platforms and applications. MDM is used for quality improvement, to provide the end user community with a “trusted single version of the truth” from which to base decisions.
  2. Data Cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
  3. Extract, Transform, Load (ETL) refers to a process in database usage and especially in data warehousing that performs: data extraction (extracts data from homogeneous or heterogeneous data sources), data transformation (transforms the data for storing it in the proper format or structure for the purposes of querying and analysis), and data loading (loads data into the operational data store, data mart, or data warehouse).
  4. Metadata management is an end-to-end process and governance framework for creating, controlling, enhancing, attributing, defining, and managing a metadata schema, model, or other structured aggregation system, either independently or within a repository, together with the associated supporting processes (often to enable the management of content).
  5. Streaming Data Processing will be covered in a separate post and decision tree.

Here is the decision tree, which maps these areas to specific solutions. Below I will provide some comments on each of them.

Master data management (MDM):

  • SQL Server Master Data Services (MDS) is the SQL Server solution, which can be used by organizations to discover and define non-transactional lists of data, with the goal of compiling maintainable master lists. You can use MDS to manage any subject domain, create hierarchies, define granular security, log transactions, manage data versioning, and create business rules.
  • Profisee Master Data Maestro is an enterprise-grade master data management software suite designed to deliver powerful data stewardship and data quality capabilities to customers deploying multi-domain MDM solutions. The Maestro suite delivers a best-in-class user interface to ensure optimal efficiency and productivity for data stewards, innovative large-volume match-merge capabilities for authoritative Golden Record Management, and integrated data quality services to standardize and verify location and contact data across domains. Combined with Microsoft MDS as a core platform, they provide a world-class out-of-the-box software solution for enterprise-grade master data management applications.

Data Cleansing:

  • SQL Server Data Quality Services (DQS) allows a data steward or IT professional to create solutions that maintain the quality of their data and ensure that the data is suited for its business usage. DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data-quality project.

Extract, Transform, Load (ETL):

  • SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformations solutions. It includes a rich set of built-in tasks and transformations; tools for constructing packages; and the Integration Services service for running and managing packages.
  • Azure Data Factory (ADF) is a cloud-based data integration service that orchestrates and automates movement and transformation of data. Data Factory works across on-premises and cloud data sources and SaaS to ingest, prepare, transform, analyze, and publish data.
  • Datameer takes full advantage of the scalability, security, and schema-on-read power of Hadoop, providing an elegant front end that reinvents the entire user experience, making the previously linear steps of data integration, preparation, analytics, and visualization a single, fluid interaction. It provides Smart Execution technology on top of MapReduce, Tez, and Spark, which frees users from having to determine which compute framework is optimal for their various big data analytics jobs by automatically optimizing performance across both small and large data.
  • U-SQL is the new big data query language of the Azure Data Lake Analytics service. It combines a familiar SQL-like declarative language with the extensibility and programmability provided by C# types and the C# expression language, and big data processing concepts such as “schema on read”, custom processors, and reducers. It also provides the ability to query and combine data from a variety of data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
  • Spark SQL is a Spark module for structured data processing. Spark SQL uses information about the structure of both the data and the computation to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API/language you are using to express the computation.
  • Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
  • Apache Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database. Sqoop got its name from sql+hadoop.
  • Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.
  • Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The structure of Pig programs is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist. Pig’s language layer currently consists of a textual language called Pig Latin, with properties like ease of programming, optimization opportunities, and extensibility.

Metadata management:

  • Azure Data Catalog is an enterprise-wide catalog in Azure that enables self-service discovery of data from any source. The key component of Azure Data Catalog is a metadata repository that allows users to register, enrich, understand, discover, and consume data sources. It uses a crowdsourcing model, which means that any member of the organization can contribute.
  • HCatalog is a table storage management tool for Hadoop that exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.
  • PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
  • Sapient Synapse is a centralized platform that helps organizations efficiently manage and capture data requirements and metadata using a series of highly visual, web-based tools. Key capabilities: information mapping, data requirements management, view transparency and lineage, research metadata, impact assessment, and data mapping across sources.

Cortana Intelligence Suite: Big Data and Advanced Analytics

In this post we will discuss a reference architecture for Big Data and Advanced Analytics using the Cortana Intelligence Suite. The architecture is relevant for organizations looking to fully manage big data and advanced analytics and to transform all enterprise information into intelligent action. This allows you to act ahead of your competitors by going beyond looking in the rearview mirror to predicting what’s next.

In general, such solutions use relational and semi-structured data from business and custom applications, as well as semi-structured or unstructured data from sensors, devices, web sites, social networks, and other sources.

Big Data flow

Big Data flow includes the following steps:

  • Ingestion of data, which can be bulk or event-based/real-time.
  • Processing data to prepare for storage.
  • Storing data in relational or unstructured storage.
  • Processing data for analytics like data aggregation, complex calculations, predictive or statistical modeling etc.
  • Visualizing data and data discovery using BI tools or custom applications.

(Figure: Big Data flow)

Big Data Reference Architecture

The Big Data reference architecture represents the most important components and data flows, allowing you to do the following:

  • Track Azure data (an Azure Website generating web logs) and store it in ADLS
  • Track real-time data from IoT Suite: collect data from IoT Suite in a permanent store (ADLS)
  • Run Machine Learning through R Server for HDInsight to find patterns in data
  • Show results in BI tools (Power BI)

(Figure: Big Data reference architecture)

There are a lot of different options for storing data, processing data, and machine learning. You may use the Big Data and Machine Learning decision trees as a first help in choosing the most relevant components for your solution. (I will also write about information management components like Azure Data Factory, Azure Data Catalog, Sqoop, Pig, Oozie, etc. in one of the next posts.)

Example of Big Data Solution

To show a simple example of Big Data architecture, we will use the following artificial scenario:

  • AdventureWorks Travel (AWT) provides concierge services for business travelers. In an increasingly crowded market, they are always looking for ways to differentiate themselves and provide added value to their corporate customers.
  • They are looking to pilot a web app that their internal customer service agents can use to provide additional information useful to the traveler during the flight booking process. They want to enable their agents to enter the flight information and produce a prediction as to whether the departing flight will encounter a 15-minute or longer delay, taking into account the weather forecast for the departure hour.
  • The data platform team prefers to use open source technologies for data processing tasks.
  • Developers will need an easy way to create prediction experiments.

Here is an example of an architecture that solves the scenario described above. The selected components of Cortana Intelligence Suite are highlighted.

[Figure: Example architecture based on Cortana Intelligence Suite]

A demonstration of the described solution is available in the MTC Studio webcast: 2016-12-08 | Cortana Intelligence Suite: Big Data and Advanced Analytics.

Additional materials

Machine Learning Solutions Decision Tree

New version is available: Artificial Intelligence Decision Tree

Machine learning is a technique of data science that helps computers learn from existing data in order to forecast future behaviors, outcomes, and trends. Currently there are a lot of products that can be used for this on-premises or in the cloud, on a single node or multiple nodes, in a relational database or in Hadoop-based storage.

This article will help you choose the right Machine Learning solution based on your specific requirements. We will discuss open source products that can be deployed in the Microsoft Cloud (Azure), as well as Microsoft products that can be deployed on-premises.

Disclaimer: In this article I present the most important decision points based on my experience. You may use it as a first approximation to start looking deeper into the described and other solutions.

The decision on which product to select also depends on the development platform used by specialists in the organization, and on which Big Data solution is already in use or planned. Key questions here are: “Do you already use Hadoop or a Data Warehouse?” (SMP or MPP?), “In the cloud or on-premises?”, and “How much data is needed for machine learning?” (If storage and the ML engine are separated, what will be the cost and latency of data transfer?)

The process of machine learning solution selection may influence the selection of the Big Data solution itself. (Please see the decision tree for Big Data solutions in a separate article.)

Also important are the complexity and uniqueness of the machine learning problem, and how much effort the team is ready to invest in developing the ML solution. Some products are much easier to use (Azure Machine Learning), and for some tasks there are standard APIs available (Azure Cognitive Services).

Please note that some products can be deployed on top of the same platform. (For example, MLlib and R Server can both be deployed on top of a Spark cluster.)

So let’s see the decision tree first. Below I will provide some comments on each of the products. (You may also download a high-resolution printable version of the decision tree.)

[Figure: Machine Learning decision tree v1]

Azure Machine Learning

Azure Machine Learning (ML) is a cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions.

[Figure: Azure Machine Learning]

In Machine Learning Studio, you can create predictive models by dragging, dropping, and connecting modules. Studio also provides a library of algorithms and samples to get you started. You may create a new ML experiment using sample experiments, R and Python packages, standard algorithms (modules), and custom R and Python scripts.

In Cortana Intelligence Gallery, you can try analytics solutions authored by others or contribute your own.

Data Science development: Visual, R language and Python.

Advantages: Graphical experiment representation; easy to learn; quick deployment; Excel integration; scales well across multiple experiments.

Concerns: May not be the fastest solution for processing large amounts of data within one single experiment – make sure that all components of your experiment can scale.
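
Once an experiment is deployed as a web service, any application can score new data with a simple REST call. Here is a minimal sketch in R using httr, assuming a hypothetical service URL, API key, and input schema (copy the real values from your service’s API help page):

    library(httr)
    library(jsonlite)

    # Hypothetical values: take the real URL and API key from the service dashboard
    service_url <- "https://services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"
    api_key     <- "YOUR_API_KEY"

    # The input schema is hypothetical and must match your published experiment
    request <- list(
      Inputs = list(input1 = list(
        ColumnNames = list("OriginAirport", "DepHour", "DayOfWeek"),
        Values      = list(list("SEA", "8", "4"))
      )),
      GlobalParameters = setNames(list(), character(0))  # serializes to an empty JSON object
    )

    response <- POST(
      service_url,
      add_headers(Authorization = paste("Bearer", api_key)),
      content_type_json(),
      body = toJSON(request, auto_unbox = TRUE)
    )
    content(response)  # scored results, including the predicted value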

Cognitive Services

Cognitive Services are a collection of artificial intelligence REST APIs. With Cognitive Services, developers can easily add intelligent features into their applications.

[Figure: Cognitive Services]

Cognitive Services include:

  • Vision: From faces to feelings, allow apps to understand images and video
  • Speech: Hear and speak to users by filtering noise, identifying speakers, and understanding intent
  • Language: Process text and learn how to recognize what users want
  • Knowledge: Tap into rich knowledge amassed from the web, academia, or your own data
  • Search: Access billions of web pages, images, videos, and news with the power of Bing APIs

The collection will continuously improve, adding new APIs and updating existing ones.

Development: REST APIs.

Advantages: Quick to use; platform independent; uses some publicly available data from Bing.

Concerns: Can be used only for a subset of machine learning tasks.
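
To illustrate the REST style, here is a minimal sketch in R calling the Text Analytics sentiment API with httr. The region, API version, and key below are assumptions – check the current documentation for your subscription:

    library(httr)
    library(jsonlite)

    # Hypothetical endpoint and key: substitute your region and subscription key
    endpoint <- "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
    api_key  <- "YOUR_SUBSCRIPTION_KEY"

    payload <- list(documents = list(
      list(id = "1", language = "en", text = "The flight was delayed, but the crew was great.")
    ))

    response <- POST(
      endpoint,
      add_headers(`Ocp-Apim-Subscription-Key` = api_key),
      content_type_json(),
      body = toJSON(payload, auto_unbox = TRUE)
    )

    # Each document gets a sentiment score between 0 (negative) and 1 (positive)
    content(response)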

Microsoft R Server Family

Microsoft R Server is a broadly deployable, enterprise-class analytics platform based on R; it is scalable and secure. Supporting a variety of big data statistics, predictive modeling, and machine learning capabilities, R Server covers the full range of analytics – exploration, analysis, visualization, and modeling. It is compatible with the entire collection of open source algorithms, connectors, and visualization tools shared openly via CRAN, Bioconductor, and other shared resources like GitHub. At the same time, key extensions enable R to tackle big data challenges that exceed the capacity of open source R. Scripts can be developed on the desktop and immediately deployed to an RDBMS – SQL Server, an EDW (SQL Server & Teradata), or Hadoop (Microsoft, Cloudera, Hortonworks, and MapR).

[Figure: Microsoft R Server family]

Data Science development: R language (Open R, Scale R).

Advantages: Distributes work across cores and nodes (if multiple nodes are available); R scripts built using R Server can easily be run on any platform running R Server, on-premises and in the cloud (important for hybrid scenarios).
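
To make that portability concrete, here is a minimal ScaleR sketch using the RevoScaleR package (the data file and columns are hypothetical). The model code stays the same; only the compute context changes when you scale out:

    library(RevoScaleR)

    # Train a linear model locally on a CSV file (hypothetical path and columns)
    flights <- RxTextData("flights.csv")
    model <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepHour, data = flights)
    summary(model)

    # To scale out, switch the compute context and rerun the same call, e.g.:
    #   rxSetComputeContext(RxSpark(...))       # R Server for Spark
    #   rxSetComputeContext(RxHadoopMR(...))    # R Server for MapReduce
    #   rxSetComputeContext(RxInSqlServer(...)) # SQL Server R Services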

SQL Server R Services (R Server for Windows)

SQL Server R Services (also known as R Server for Windows) is an advanced analytics and standalone server capability built into SQL Server Enterprise Edition. It brings together fast querying and In-Memory OLTP optimization from SQL Server 2016 with data exploration, predictive modeling, scoring, and visualization from the R Services family of products. It delivers speed and performance for advanced analytics using near-database analytics and parallel threading and processing. It is integrated with SQL Server: T-SQL can call a stored procedure with R code, R scripts can run in SQL Server through the extensibility model, and result sets can be sent through a Web API to databases or applications.

Data Science development: R language (Open R, Scale R).

Advantages: Included in SQL Server; distributes work across cores; no data movement out of the database, which means it works much faster; data generated by data scientists using R can be secured and managed by DBAs and queried by data analysts.

Concerns: Uses only resources of one physical server.

[Figure: SQL Server R Services]
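
Here is a minimal sketch of near-database analytics from the R side, using RevoScaleR with a hypothetical connection string and table:

    library(RevoScaleR)

    # Hypothetical connection string - point it at your SQL Server instance
    conn <- "Driver=SQL Server;Server=myserver;Database=Flights;Trusted_Connection=True"

    # Push the computation into SQL Server, next to the data
    rxSetComputeContext(RxInSqlServer(connectionString = conn))

    flights <- RxSqlServerData(table = "dbo.Flights", connectionString = conn)
    model <- rxLogit(Delayed ~ DayOfWeek + DepHour, data = flights)

    rxSetComputeContext("local")  # switch back to local execution when done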

R Server for MapReduce

R Server for MapReduce uses Apache MapReduce nodes for R computations.

Using R Server on MapReduce eliminates data movement latency and removes data duplication if your data is already stored in the Hadoop cluster.

Supported platforms: HDInsight Premium, Hortonworks, Cloudera, MapR.

Data Science development: R language (Open R, Scale R)

Advantages: Distributes work across cores and nodes; if you have a lot of MapReduce code and have no plans to move off MapReduce, deploying R Server on top of it will eliminate data movement for machine learning.

Concerns: Uses MapReduce, which is slower than Spark.

R Server for Spark

R Server for Spark uses Apache Spark nodes for R computations at in-memory speeds using Spark RDDs. R Server for Spark leverages the Spark DAG (Directed Acyclic Graph) to distribute work across the cluster, as well as persistence for computation (we may leave the task running and waiting for new requests). In this scenario you can develop models using larger amounts of data with better performance.

Using R Server on Spark eliminates data movement latency and removes data duplication if your data is already stored in the Spark cluster.

Supported platforms: HDInsight Premium with Spark and R Server, Spark on Hortonworks, Spark on Cloudera, Spark on MapR.

Data Science development: R language (Open R, Scale R)

Advantages: Uses Spark, which means fast in-memory computations; distributes work across cores and nodes.
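
A minimal sketch of pointing ScaleR at a Spark cluster (RevoScaleR; the HDFS path is hypothetical). Note persistentRun, which keeps the Spark application alive between requests, as described above:

    library(RevoScaleR)

    # Persistent Spark compute context: the Spark application stays up between jobs
    rxSetComputeContext(RxSpark(persistentRun = TRUE, consoleOutput = TRUE))

    # Hypothetical HDFS path: data is read where it lives, with no movement off the cluster
    flights <- RxTextData("/data/flights", fileSystem = RxHdfsFileSystem())
    model <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepHour, data = flights)

    rxSetComputeContext("local")  # release the cluster when done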

R Server for Teradata DB

R Server for Teradata DB uses MPP architecture for R computations.

Data Science development: R language (Open R, Scale R)

Advantages: Works with Teradata DB; distributes work across cores; no data movement out of the database, which means it works much faster; data generated by data scientists using R can be secured and managed by DBAs and queried by data analysts.

R Server for Linux

Data Science development: R language (Open R, Scale R)

Advantages: Distributes work across cores.

Concerns: Uses only the resources of one physical server; additional time will be spent copying or streaming data from HDFS to the Linux machine.

Mahout MapReduce

Mahout MapReduce is a collection of machine learning algorithms based on the Hadoop MapReduce framework.

Platform: Hadoop MapReduce, Java.

Advantages: Mahout MapReduce comes with many ML algorithms to choose from; MapReduce is a much more mature framework than Spark, and therefore more stable.

Concerns: Slow and does not handle iterative jobs very well (constrained by disk accesses due to MapReduce).

Mahout Samsara

Mahout Samsara is a Scala-based programming environment running on different distributed engines (Spark and H2O) that also contains machine learning algorithms. It expresses algebraic operations in an R-like Scala DSL, which means it is readable by R programmers and in general easier to understand.

Platform: Hadoop Spark, Scala.

Advantages: Fast due to use of Spark.

Concerns: Currently under development, so it may be unstable.

MLlib

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

MLlib ships with Spark as a standard component, so it works seamlessly with SparkSQL, Spark Streaming, and Spark GraphX. Additionally, you may deploy R Server on top of a Spark cluster.

[Figure: Spark platform]

Platform: Spark.

Data Science language: Python and Scala/Java.

Advantages: Due to its in-memory capabilities, MLlib runs iterative algorithms 5-10 times faster than Mahout on Hadoop MapReduce; efficient and interoperable with SparkSQL, Spark Streaming, and Spark GraphX; clear and consistent APIs.

Concerns: Not all algorithms are implemented yet, though MLlib is growing very rapidly.

Additional information