Big Data Decision Tree v4

This is the 4th version of the Big Data Decision Tree (Mind Map), which reflects the latest changes in Microsoft products.

As usual, a disclaimer: the process of selecting a solution for a Big Data project is very complex, with a lot of factors. That’s why you should use this decision tree only as a first approximation to start looking deeper into the described and other solutions. Also, please double-check the information provided here against the official documentation.

In the decision tree, Big Data is divided into the “three V’s”: velocity, volume, and variety. How we choose the right solution depends on which one of these problems we are trying to solve first:

  • Volume: need to store and query hundreds of terabytes of data or more, and the total volume is growing. Processing systems must be scalable to handle increasing volumes of data, typically by scaling out across multiple machines.
  • Velocity: need to collect data at an increasing rate from many new types of devices, from a fast-growing number of users, and from an increasing number of devices and applications per user. Processing systems must be able to return results within an acceptable timeframe, often almost in real-time.
  • Variety: the situation when data does not match any existing data schema – semi-structured or unstructured data.

There are three groups of solutions addressing the described areas:

  • Complex event processing (CEP) is a method of tracking and processing streams of data about events from multiple sources, identifying meaningful events, deriving conclusions from them, and responding to them as quickly as possible. Use CEP if you need to process hundreds of thousands of events per second.
  • Data Warehouses (DWHs) are central relational repositories of integrated data from one or more disparate sources. They store current and historical data and are used for different analytical tasks in organizations. Use a DWH if you have structured relational data with a defined schema.
  • NoSQL systems provide a mechanism for storage and retrieval of data without tabular relations. Characteristics of NoSQL: simplicity of design and simpler “horizontal” scaling to clusters of machines. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are more flexible and therefore difficult to store in relational databases. Use NoSQL systems if you have non-relational, semi-structured, or unstructured data with no schema defined.

Here is the decision tree, which maps the three types of problems to specific solutions.

Here are the most important comments on each of the big data components and the corresponding decision points.

Complex event processing

Azure Event Hubs

Azure Event Hubs is a highly scalable data streaming platform and event ingestion service, capable of receiving and processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored using any real-time analytics provider or batching/storage adapters. Event Hubs provides publish-subscribe capabilities with low latency at massive scale, which makes it appropriate for big data scenarios.

Integration: supports AMQP and HTTPS protocols

Advantages: easy to use.

Disadvantages: limited access revocation through publisher policies.
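
For illustration, here is a minimal sketch of publishing events with the Event Hubs SDK for Python (azure-eventhub); the connection string, hub name, and payload are placeholders.

```python
# A minimal sketch of publishing to Event Hubs with the azure-eventhub Python SDK;
# connection string, hub name, and payload are placeholders.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="telemetry")

with producer:
    batch = producer.create_batch()                 # batches respect the hub's size limit
    batch.add(EventData('{"deviceId": "d1", "temp": 21.5}'))
    producer.send_batch(batch)                      # one round-trip for the whole batch
```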

Azure IoT Hub

Azure IoT Hub is a managed service that enables reliable and secure bidirectional communications between millions of IoT devices and a cloud-based back end.

Features of IoT Hub include: multiple options for device-to-cloud and cloud-to-device communication; message routing to other Azure services; a queryable store for device metadata and synchronized state information; secure communications and access control using per-device security keys or X.509 certificates; monitoring of device connectivity and device identity management events.

In terms of message ingestion, IoT Hub is similar to Event Hubs. However, it was specifically designed for managing IoT device connectivity, not just message ingestion.

Integration: supports MQTT, AMQP, HTTPS protocols.

Advantages: supports cloud-to-device communications, device-initiated file upload, device state information using Device twins; per-device identity; revocable access control.
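
As a sketch of the device side, assuming the Azure IoT device SDK for Python (azure-iot-device) and a placeholder per-device connection string:

```python
# Device-to-cloud messaging sketch with the azure-iot-device SDK;
# the connection string is a per-device placeholder.
from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string(
    "HostName=<hub>.azure-devices.net;DeviceId=d1;SharedAccessKey=<key>")
client.connect()
client.send_message(Message('{"temp": 21.5}'))      # delivered to IoT Hub message routing
client.disconnect()
```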

Azure HDInsight Kafka

Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams.

Programmability and integration: Kafka is often used with Apache Storm or Spark for real-time stream processing; the Kafka 0.10.0.0 Streams API allows you to build streaming solutions without requiring Storm or Spark; supports the Kafka protocol

Advantages: simplified configuration process; 99.9% SLA on Kafka uptime; scaling (changing number of worker nodes) and rebalancing Kafka partitions and replicas using Update Domains (UD) and Fault Domains (FD); monitor Kafka using Azure Log Analytics; integration with external authentication services supported.

Disadvantages: complexity.
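
A minimal publish/subscribe sketch using the open-source kafka-python client; broker address and topic name are placeholders:

```python
# Publish and consume one message on a Kafka topic with kafka-python;
# broker address and topic name are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers=["<broker>:9092"])
producer.send("clickstream", b'{"page": "/home"}')
producer.flush()                                    # block until the broker acknowledges

consumer = KafkaConsumer("clickstream",
                         bootstrap_servers=["<broker>:9092"],
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.value)                             # messages arrive in partition order
    break
```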

Azure Stream Analytics (ASA)

Azure Stream Analytics (ASA) may be used for real-time insights from devices, sensors, infrastructure, and applications. Scenarios: real-time remote management and monitoring. ASA is optimized to get streaming data from Azure Event Hubs and Azure Blob Storage. ASA SQL-like queries run continuously against the stream of incoming events. The results can be stored in Blob Storage, Event Hubs, Azure Tables, and Azure SQL Database. So, if the output is stored in Event Hubs, it can become the input to another ASA job to chain together multiple real-time queries.

Programmability: Stream analytics query language, JavaScript; Declarative programming paradigm

Advantages: SQL-like query language; cloud-based: close to globally distributed data; largest number of supported data sinks.

Disadvantages: supports only Avro, JSON, or CSV input data formats (UTF-8 encoded).

Notes: priced by Streaming units; scaled by Query partitions
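
To give a feel for the query language, here is a hypothetical tumbling-window query. It is shown as a Python string only to keep a single language for the examples in this post; in practice it is pasted into the ASA job definition, and the input/output aliases are placeholders for configured sources and sinks.

```python
# Hypothetical ASA query: count events per device in one-minute tumbling windows.
# [input-hub] and [output-blob] are aliases configured on the ASA job.
asa_query = """
SELECT deviceId, COUNT(*) AS events
INTO [output-blob]
FROM [input-hub] TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(minute, 1)
"""
```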

Azure HDInsight with Apache Storm

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG). Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. Storm topologies run indefinitely until “killed”. Storm uses Zookeeper to manage its processes. Storm can read and write files to HDFS.

Architecture: Storm processes events one at a time.

Performance: millisecond latency.

Programmability: Java, C#, Python; Imperative paradigm; HDInsight Tools for Visual Studio; integrates with Azure Event Hubs, Azure SQL DB, Azure Storage, and Azure Data Lake Storage

Advantages: complete stream processing engine with micro-batching support; 99.9% Service Level Agreement (SLA) on Storm uptime; dynamic scaling.

Disadvantages: supports only streaming data; limited native integration with the Azure platform.

Notes: priced per cluster hour.

Azure HDInsight with Spark Streaming

Spark Streaming is used to build interactive and analytical applications. It is used to create low-latency dashboards and security alert systems, and to optimize operations or prevent specific outcomes. It includes high-level operators to read streaming data from Apache Flume, Apache Kafka, and Twitter, and historical data from HDFS.

Architecture: Spark collects events into small batches over a short time window before processing them (micro-batching).

Programmability: Scala, Python, Java; DStreams; a mixture of declarative and imperative paradigms

Performance: 100s of MB/s with low latency (few seconds).

Disadvantages: not integrated with Azure platform.

Notes: priced per cluster hour
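
A minimal DStream sketch in PySpark, counting words arriving on a TCP socket in 10-second micro-batches; the host and port are placeholders for a test source:

```python
# Word count over a socket stream with 10-second micro-batches (PySpark DStreams).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 10)                      # 10-second batch interval

lines = ssc.socketTextStream("localhost", 9999)     # placeholder test source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()
```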

Azure App Service WebJobs

WebJobs is a feature of Azure App Service that enables you to run a program or script in the same context as a web app, API app, or mobile app.

Programmability: C#, Node.js, PHP, Java, Python; imperative paradigm.

Advantages: more control over JobHost behavior in the host.json file (for example, to configure a custom retry policy for Azure Storage).

Disadvantages: no built-in temporal/windowing support; no late arrival and out-of-order event handling support.

Notes: priced per app service plan hour.

Azure Functions

Azure Functions is a solution for easily running small pieces of code, or “functions,” in the cloud. Azure Functions lets you respond to events delivered to an Azure Event Hub. Useful in application instrumentation, user experience or workflow processing, and internet-of-things (IoT) scenarios.

Programmability: C#, F#, Node.js; imperative paradigm.

Advantages: pay only for the time your code runs and trust Azure to scale as needed.

Disadvantages: no built-in temporal/windowing support; limited to up to 200 function app instances processing in parallel; no late arrival and out-of-order event handling support.

Notes: priced per function execution and resource consumption.
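
The list above names C#, F#, and Node.js; purely for consistency with the other examples in this post, here is a sketch in the (since-added) Python programming model, assuming the Event Hubs binding is configured in the accompanying function.json:

```python
# Event Hubs-triggered function sketch; the hub name and connection are
# assumed to be declared in function.json, not in code.
import logging
import azure.functions as func

def main(event: func.EventHubEvent):
    # Each invocation delivers one event from the hub
    logging.info("Processing event: %s", event.get_body().decode("utf-8"))
```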

Big data warehouses

Azure SQL DW Gen1

Azure SQL Data Warehouse (DW) is the MPP version of SQL Server in Azure for data warehousing workloads. It allows you to quickly run complex queries across petabytes of data, supports resizing compute nodes in about a minute, and is integrated with the Azure platform.

Advantages: highly scalable, MPP architecture, lower-cost relational storage than Blobs, pausing compute is available, relational store, T-SQL, flexible indexing, security.

Disadvantages: 4-5 times less powerful than Azure SQL DW Gen2; cannot query external relational stores; no row-level security; no dynamic data masking.

Azure SQL DW Gen2

The Azure SQL Data Warehouse Compute Optimized Gen2 tier comes with up to 5 times better query performance, 4 times more concurrency (it can serve 128 concurrent queries from a single cluster), and 5 times higher computing power compared to the Gen1 offering. The enhanced storage architecture of Gen2 introduces unlimited columnar storage capacity, while maintaining the ability to independently scale compute and storage.

Powering these performance gains is an adaptive caching technology that understands where data needs to be, and when it needs to be there, for the best possible performance. Azure SQL Data Warehouse takes a blended approach, using remote storage in combination with a fast SSD cache layer (based on NVMe) that places data next to compute based on user access patterns and frequency.

Automatic upgrade from Gen1 to Gen2 is available from the Azure portal.

Programmability: T-SQL.

Advantages: 4-5 times higher performance, concurrency, and compute compared to Gen1; supports pausing compute; Transparent Data Encryption with customer-managed keys.

Disadvantages: no row-level security; no dynamic data masking

Microsoft APS/PDW

Microsoft Analytics Platform System (APS) is a combination of the massively parallel processing (MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data technologies. It uses the Hortonworks Data Platform (HDP) to provide an on-premises solution that contains a region for Hadoop-based processing, together with PolyBase – a connectivity mechanism that integrates the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It allows data in Hadoop to be queried and combined with on-premises relational data, and data to be moved into and out of Hadoop.

Advantages: very cost-effective fast MPP architecture if constantly and fully used.

Disadvantages: make sure that most of your queries don’t initiate data movement between the nodes, which is an expensive operation; increasing capacity requires buying an additional rack and reconfiguring manually.

NoSQL on-premises or IaaS in the cloud

SQL Server Big Data Cluster

Starting with SQL Server 2019 preview, SQL Server big data clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components are running side by side to enable you to read, write, and process big data from Transact-SQL or Spark, allowing you to easily combine and analyze your high-value relational data with high-volume big data.

In SQL Server big data clusters, Kubernetes is responsible for the state of the SQL Server big data clusters; Kubernetes builds and configures the cluster nodes, assigns pods to nodes, and monitors the health of the cluster. This means that SQL Server Big Data Cluster can be easily deployed in any cloud supporting Kubernetes clusters.

Advantages: on-premises/IaaS allows customization; SQL Server supports dynamic data masking and row-level security; using PolyBase, it can query and join with external data sources without moving or copying the data; scalable HDFS storage pool; scale-out data marts; integrated AI and Machine Learning; can be deployed in non-Microsoft clouds supporting Kubernetes clusters.

Disadvantages: currently in a public preview.

Hadoop on-prem or in VMs

Apache Hadoop is the original open-source framework for distributed processing and analysis of big data sets on clusters. The Hadoop technology stack includes related software and utilities, including Apache Hive, Apache HBase, Spark, Kafka, and many others.

Hortonworks, Cloudera, and MapR implementations of Hadoop are available in the Azure Marketplace.

Advantages: flexibility: can be used on-premises or in IaaS cloud environment (easy migration); full control over deployment.

Disadvantages: complexity of deployment; need to manage updates.

Notes: can be deployed in Azure IaaS using custom Docker images with the Azure Distributed Data Engineering Toolkit (AZTK).

NoSQL Processing in Azure

Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Programmability: Python, Scala, Java, R, SQL.

Advantages: user friendly UI for collaboration and experimentation (notebooks, cluster creation etc.); fast cluster start times, auto-termination, auto-scaling; supports pausing compute; supports fast scale-out (less than 1 minute); supports GPU-enabled clusters; security with native AD integration.

Disadvantages: cannot act as a relational data store; no row-level security.

Notes: priced by Databricks Unit (DBU) and cluster hour; does not support firewalls.

Azure Data Lake Analytics

Azure Data Lake Analytics (ADLA) is a distributed analytics service built on Apache YARN. It handles jobs of any scale instantly: you simply set how much compute power is needed. It allows you to run analytics on exabytes of data, while paying only for the cost of each query. ADLA supports Azure Active Directory for access control, roles, and integration with on-premises identity systems. It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#; the runtime processes data across multiple Azure data sources. ADLA allows you to compute on data anywhere and join data from multiple cloud sources such as Azure SQL DW, Azure SQL DB, ADLS, Azure Storage Blobs, and SQL Server in an Azure VM.

Programmability: U-SQL.

Advantages: easy to start on big data leveraging SQL and C# skills; AAD; security; integrated with Azure platform and Visual Studio; can be priced per job; supports fast scale-out (less than 1 minute).

Disadvantages: currently only batch mode is supported — you may use HDInsight for other types of workloads; no clear roadmap; not compatible with ADLS Gen2; no in-memory caching of data; no row-level security; no dynamic data masking.
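
To illustrate U-SQL, here is a hypothetical script, kept in a Python string to stay with one language for the examples in this post; it would be submitted as an ADLA job from the portal, Visual Studio, or the CLI. Paths and schema are placeholders.

```python
# Hypothetical U-SQL job: extract a TSV file, aggregate, and write a CSV.
usql_script = """
@searchlog =
    EXTRACT UserId int, Query string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

@result =
    SELECT Query, COUNT(*) AS Cnt
    FROM @searchlog
    GROUP BY Query;

OUTPUT @result
    TO "/output/query_counts.csv"
    USING Outputters.Csv();
"""
```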

Azure HDInsight

HDInsight is a cloud-hosted service available to Azure subscribers that uses Azure clusters to run HDP (Hortonworks’ distribution of Hadoop), and integrates with Azure storage.

Supports a variety of open-source analytics engines, such as Hive LLAP, Kafka, HBase, Storm, and Spark.

Advantages: cloud-based which means that the cluster can be created approximately in 15 minutes; scale nodes on demand; fully managed by Microsoft (upgrades, patching); some Visual Studio and IntelliJ integration; 99.9% SLA.

Concerns: Ranger (Kerberos-based) security; requires manual configuration and scaling; cannot act as a relational data store

Azure HDInsight with Spark

Apache Spark is an open source cluster computing framework. It provides API based on resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines. RDDs function as a working set for distributed programs that offers a form of distributed shared memory. Components built on top of Spark: Spark SQL, Spark Streaming, MLlib, GraphX.

Programmability: Python, Scala, Java, R, SQL.

Advantages: in-memory, fast (5-7 times faster than MapReduce).

Disadvantages: fewer components compared with the MapReduce-based ecosystem; no row-level security.
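
A minimal RDD sketch in PySpark against cluster storage; the wasbs:/// path is a placeholder for the cluster's default Azure Storage container:

```python
# Lazy RDD transformation followed by an action that triggers the computation.
from pyspark import SparkContext

sc = SparkContext(appName="RDDExample")
rdd = sc.textFile("wasbs:///example/data/events.log")   # placeholder path

errors = rdd.filter(lambda line: "ERROR" in line)       # lazy transformation
print(errors.count())                                   # action: runs the job
```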

Azure HDInsight Hadoop

Apache Hadoop includes two core components: the Apache Hadoop Distributed File System (HDFS) that provides storage, and Apache Hadoop Yet Another Resource Negotiator (YARN) that provides processing. With storage and processing capabilities, a cluster becomes capable of running MapReduce programs to perform the desired data processing.

Advantages: a lot of open source components on top of MapReduce.

Disadvantages: much slower than Spark; not in-memory.

Azure HDInsight with Hive LLAP

Interactive Query (also called Apache Hive LLAP, or Low Latency Analytical Processing) is an Azure HDInsight cluster type. Interactive Query supports in-memory caching, which makes Apache Hive queries faster and much more interactive. An Interactive Query cluster contains only the Hive service.

Programmability: HiveQL (can be executed from Power BI, Apache Zeppelin, Visual Studio, Visual Studio Code, Apache Ambari Hive View, Beeline, Hive ODBC).

Advantages: fast performance with intelligent caching (low latency); supports ACID transactions; scalable query concurrency; optimized as a fast serving layer.

Disadvantages: cannot query external relational stores (like Azure SQL DB, SQL Server in VM, Azure SQL DW); manual configuration and scaling; no redundant regional servers for high availability.
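
A sketch of running a HiveQL query from Python via the open-source PyHive client; the host, port, and credentials are placeholders and depend on how the cluster endpoint is configured:

```python
# Query Hive (LLAP) through HiveServer2 using PyHive; endpoint details are placeholders.
from pyhive import hive

conn = hive.connect(host="<hiveserver2-host>", port=10000,
                    username="admin", database="default")
cursor = conn.cursor()
cursor.execute("SELECT deviceId, COUNT(*) FROM events GROUP BY deviceId")
for device_id, cnt in cursor.fetchall():
    print(device_id, cnt)
```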

Azure HDInsight with ML Services

Azure HDInsight with ML Services (aka R Server cluster on HDInsight) provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. The service provides the latest capabilities for R-based analytics on datasets of virtually any size, loaded to either Azure Blob or Data Lake storage. Since the ML Services cluster is built on open-source R, the R-based applications you build can leverage any of the 8000+ open-source R packages. The routines in ScaleR, Microsoft’s big data analytics package, are also available. ML Services bridges these Microsoft innovations and contributions coming from the open-source community (R, Python, and AI toolkits), all on top of a single enterprise-grade platform.

Programmability: ML Services includes a highly scalable, distributed set of algorithms (such as RevoScaleR, revoscalepy, and MicrosoftML) that can work on data sizes larger than the size of physical memory and run on a wide variety of platforms in a distributed manner. Also includes Microsoft’s custom R and Python packages.

Advantages: can run AI packages from Microsoft and open source; RevoScaleR allows you to apply ML algorithms to data that cannot fit in the memory of the cluster.

NoSQL Database in Azure

Azure Cosmos DB

Azure Cosmos DB is a globally distributed database service designed to elastically and independently scale throughput and storage across any number of geographical regions with a comprehensive SLA. It supports document, key/value, or graph databases leveraging popular APIs and programming models: DocumentDB API, MongoDB API, Graph API, and Table API.

Development: SQL query and transactions over JSON documents, REST; SDKs: .NET, Node.js, Java, JavaScript, Python, Xamarin, Gremlin.

Advantages: different formats of storage, global distribution, elastic scale out, low latency, 5 consistency models, automatically indexed, schema agnostic, native JSON, stored procedures.

Disadvantages: no support for in-memory caching of data; no dynamic data masking.
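
A minimal sketch with the azure-cosmos Python SDK (v4-style API); the endpoint, key, and database/container names are placeholders:

```python
# Upsert a JSON document and run a SQL query over it with azure-cosmos.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("travel").get_container_client("bookings")

container.upsert_item({"id": "b1", "city": "NYC", "delayRisk": 0.12})
for item in container.query_items(
        query="SELECT c.id, c.delayRisk FROM c WHERE c.city = 'NYC'",
        enable_cross_partition_query=True):
    print(item)
```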

Azure HDInsight with HBase

Apache HBase is a NoSQL wide-column store for writing large amounts of unstructured or semi-structured application data to run analytical processes using Hadoop. It provides random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families. Data is stored in the rows and columns of a table, and data within a row is grouped by column family.

Programmability and integration: Phoenix, OpenTSDB, Kiji, Titan, etc. can run on top of HBase by using it as a datastore; Apache Hive, Apache Pig, Solr, Apache Storm, Apache Flume, Apache Impala, Apache Spark, Ganglia, and Apache Drill can also integrate with HBase.

Advantages: NoSQL wide-column store; can be used as a key-value store, for sensor data, or for real-time querying; optimized as a fast serving layer.

Disadvantages: manual configuration and scaling; no support for in-memory caching of data.

Notes: supports the SQL language via the Phoenix JDBC driver.
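
A key-value access sketch using the open-source HappyBase client (Thrift-based); the host is a placeholder, and this assumes the HBase Thrift server is enabled:

```python
# Put and get a row keyed by device ID + timestamp, using a column family "cf".
import happybase

connection = happybase.Connection("<hbase-thrift-host>")
table = connection.table("sensor_data")

table.put(b"device1-20190114", {b"cf:temp": b"21.5"})   # row key + cf:qualifier layout
row = table.row(b"device1-20190114")
print(row[b"cf:temp"])
```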

NoSQL storage

Azure Data Lake Store Gen1

Azure Data Lake Store (ADLS) is a distributed, parallel file system in the cloud, performance-tuned and optimized for analytics on different data types. It is supported by the leading Hadoop distributions (Hortonworks, Cloudera, MapR) as well as by HDInsight and Azure Data Lake Analytics (ADLA).

Development: WebHDFS protocol (behaves like HDFS); REST API over HTTPS.

Advantages: hierarchical file system; optimized performance for parallel analytical workloads; high throughput and IOPS; no limit on account sizes, file sizes, or number of files.

Disadvantages: only locally redundant; not available in some regions.
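
A sketch with the azure-datalake-store Python package using service principal authentication; the tenant, app, and store names are placeholders:

```python
# List and read files over the WebHDFS-style interface of ADLS Gen1.
from azure.datalake.store import core, lib

token = lib.auth(tenant_id="<tenant>", client_id="<app-id>", client_secret="<secret>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account>")

print(adls.ls("/"))                                 # hierarchical, HDFS-style paths
with adls.open("/data/events.csv", "rb") as f:
    print(f.readline())
```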

Azure Blob Storage

Azure Blob Storage is a general-purpose object store for a wide variety of storage scenarios. It is highly available, secure, durable, scalable, and redundant. It provides hot, cool, and archive storage tiers for different use cases.

Administrative tools: PowerShell, AzCopy.

Development: .NET, Java, Android, C++, Node.js, PHP, Ruby, and Python; REST API with HTTP/HTTPS requests.

Advantages: most compatible; globally redundant; lowest storage costs; better for simple non-hierarchical storages; client-side encryption.

Disadvantages: flat namespace; not optimized for analytical workloads; max 500 TB per account and 4.75 TB per file.
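
A minimal upload/download sketch with the azure-storage-blob Python SDK (v12-style API); the connection string and names are placeholders:

```python
# Upload and read back a blob; note the flat namespace: the "folders" in the
# blob name are just part of the name, not real directories.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="rawdata", blob="2019/01/14/events.json")

blob.upload_blob(b'{"deviceId": "d1"}', overwrite=True)
print(blob.download_blob().readall())
```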

Azure Table Storage

Azure Table Storage allows you to store petabytes of semi-structured data while keeping costs down, without manual sharding. Using geo-redundant storage, stored data is replicated 3 times within a region—and an additional 3 times in another region.

Development: OData-based queries.
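
A sketch of an OData filter query using the azure-cosmosdb-table package (the Table Storage SDK of that era); the account details and table name are placeholders:

```python
# Insert an entity and query it back with an OData filter expression.
from azure.cosmosdb.table.tableservice import TableService

tables = TableService(account_name="<account>", account_key="<key>")
tables.insert_or_replace_entity("flights",
    {"PartitionKey": "JFK", "RowKey": "DL123", "delayMinutes": 18})

for e in tables.query_entities("flights", filter="PartitionKey eq 'JFK'"):
    print(e.RowKey, e.delayMinutes)
```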

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. In Data Lake Storage Gen2, features from Azure Data Lake Storage Gen1 – such as file system semantics and directory- and file-level security and scale – are combined with the low-cost, tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

Advantages: able to store and serve many exabytes of data with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS); near-constant per-request latencies; the hierarchical namespace significantly improves the overall performance of many analytics jobs.

Disadvantages: slightly higher transaction costs compared to Gen1.

Reference

  1. Choosing a real-time message ingestion technology in Azure
  2. Choosing a stream processing technology in Azure
  3. Choosing an analytical data store in Azure
  4. Choosing a batch processing technology in Azure
  5. Choosing a big data storage technology in Azure
  6. Apache Kafka on HDInsight
  7. Apache Storm on Azure HDInsight
  8. ML Services and open-source R capabilities on HDInsight
  9. Interactive Query with HDInsight
  10. SQL Server 2019 big data clusters
  11. Apache HBase in HDInsight
  12. Azure Databricks Documentation
  13. Connecting IoT Devices to Azure: IoT Hub and Event Hubs
  14. Melissa Coates. Data Lake Use Cases and Planning Considerations
  15. Big Data Architectures
  16. James Serra. SQL Server 2019 Big Data Clusters
  17. James Serra. Azure Data Lake Store Gen2 is GA

Modern Data Platform Map and Video

Last update: Dec 4, 2018

The Modern Data Platform Map represents a reference organizational layout of the most important data pillars and services, and the corresponding groups of specialists in enterprises.

In the following video I give a quick overview of the Microsoft Data Platform. I will provide more details in subsequent posts and videos. Please post your questions, suggestions, and feedback below.

You may also check the following data pillars and products for details:

Webcast: Predictive Data Warehouse with Datameer

In the following webcast, we talk to Andrew Brust, Senior Director of Market Strategy and Intelligence at Datameer.

We will learn about the Hadoop ecosystem and PaaS options in Azure, the difference between a Data Lake and a Data Warehouse, and the added value of unstructured data streams. We will discuss the Hadoop learning curve for professionals with an OLTP database and BI background, and how Datameer can help to create big data solutions and future-proof against change.

Technologies: HDInsight, Stream Analytics, Azure Data Lake Store and Analytics, Azure Machine Learning and Power BI.

To access the webcast, you will need to fill in a small registration form.

Webcast: Data warehouse migration to Azure with Hortonworks

A modern EDW should be able to manage both structured and unstructured data to realize the full value of data. Security, consistency, and credibility of data are also very important. Data warehouse and big data solutions from Microsoft provide a trusted infrastructure that can handle all types of data, and scale from terabytes to petabytes, with real-time performance.

In this webcast, with the participation of Mark Lochbihler (Director of Partner Engineering, Hortonworks), we discuss modern enterprise data warehouses (EDWs) and migration to the Microsoft Cloud (Azure). We will learn about the process, tools, and reference architectures for data warehouse migration.

To access the webcast, you will need to fill in a small registration form.

Additional resources:

Cortana Intelligence Suite: Big Data and Advanced Analytics

In this post we will discuss a reference architecture for Big Data and Advanced Analytics using the Cortana Intelligence Suite. The architecture can be relevant for organizations looking to fully manage big data and advanced analytics in order to transform all enterprise information into intelligent action. This allows you to take action ahead of your competitors by going beyond looking in the rearview mirror to predicting what’s next.

In general, such solutions use relational and semi-structured data from business and custom applications, as well as semi-structured or unstructured data from sensors, devices, web sites, social networks, and other sources.

Big Data flow

The Big Data flow includes the following steps:

  • Ingestion of data, which can be bulk-mode or event-based/real-time.
  • Processing data to prepare for storage.
  • Storing data in relational or unstructured storage.
  • Processing data for analytics, like data aggregation, complex calculations, predictive or statistical modeling, etc.
  • Visualizing data and data discovery using BI tools or custom applications.

[Figure: Big Data flow]

Big Data Reference Architecture

The Big Data Reference Architecture represents the most important components and data flows, allowing you to do the following.

  • Track Azure data (an Azure Website generating web logs) and store it in ADLS
  • Track real-time data from IoT Suite: collect data from IoT Suite in a permanent store (ADLS)
  • Run Machine Learning through R Server for HDInsight to find patterns in data
  • Show results in BI tools (Power BI)

[Figure: Big Data reference architecture]

There are a lot of different options for storing data, processing data, and machine learning. You may use the Big Data and Machine Learning decision trees as a first help to choose the most relevant components for your solution. (I will also write about information management components like Azure Data Factory, Azure Data Catalog, Sqoop, Pig, Oozie, etc. in one of the next posts.)

Example of Big Data Solution

To show you a simple example of a Big Data architecture, we will use the following artificial scenario.

  • AdventureWorks Travel (AWT) provides concierge services for business travelers. In an increasingly crowded market, they are always looking for ways to differentiate themselves and provide added value to their corporate customers.
  • They are looking to pilot a web app that their internal customer service agents can use to provide additional information useful to the traveler during the flight booking process. They want to enable their agents to enter the flight information and produce a prediction as to whether the departing flight will encounter a delay of 15 minutes or longer, taking into account the weather forecast for the departure hour.
  • The data platform team prefers to use open source technologies for data processing tasks.
  • Developers will need an easy way to create prediction experiments.

Here is an example of an architecture that addresses the scenario described above. The selected components of the Cortana Intelligence Suite are highlighted.

[Figure: Cortana Intelligence Suite example architecture]

A demonstration of the described solution is available in the MTC Studio webcast: 2016-12-08 | Cortana Intelligence Suite: Big Data and Advanced Analytics.

Additional materials

Materials from Modern Data Warehouse Workshop

Today at the MTC New York I delivered the workshop “Always On: Modern Data Warehouse”. (Not to be confused with the SQL Server AlwaysOn technology 😉.)

Here you can find presentation decks from this workshop: Modern Data Warehouse Architecture and Data Warehouse Technology Deck. Additional materials are available on the Microsoft Modern Data Warehouse site.

[Figure: reference architecture]

Next data platform workshops at the MTC New York:

  • Nov 30, 2016. Always On: Dashboard in a Day (Power BI)
  • Dec 7, 2016. Always On: Mission Critical Performance

Machine Learning @ 1 million predictions per second and more

Watch recordings of the keynote and session previews from the Microsoft Machine Learning & Data Science Summit 2016 on the latest Big Data, Machine Learning, Artificial Intelligence, and Open Source techniques and technologies.

Some take-aways from the keynote:

  1. A combination of in-memory technologies and in-database analytics with R at scale using SQL Server 2016 can make 1 million fraud predictions per second.
  2. U-SQL in combination with Cognitive APIs and Azure ML can significantly extend datasets, making it possible to analyze large volumes of images (different objects and complexity) and text (subjects, key phrases, sentiments, story).
  3. In the future, Azure Data Lake Analytics will support Hive and Spark.
  4. Microsoft ResNet (a solution for Deep Learning) is built using 152 neural network layers.
  5. Azure N-series Virtual Machines with GPUs, to be used for Deep Learning, are available in preview. For example, the Tesla K80 delivers 4992 CUDA cores with a dual-GPU design, up to 2.91 teraflops of double-precision and up to 8.93 teraflops of single-precision performance.

Case Studies:

  1. Student Drop-Out Prediction Service in Indian schools uses Azure ML.
  2. PROS used Azure and R in SQL Server for airlines to recommend prices in milliseconds. For another customer, they moved an R-based solution to SQL Server 2016 to generate renewals automatically, “faster by a factor of a hundred”.
  3. Dyxia used a combination of Microsoft Band, the MS Health application, Azure IoT Hub, Stream Analytics, Power BI, Machine Learning, and other services to monitor and predict anxiety in children with autism.
  4. eSmart Systems created the Connected Drone solution, combining drones with Deep Learning in Azure to automate inspections of power lines.
  5. CrowdFlower uses crowdsourcing (human-in-the-loop) to train machine learning models on non-confident predictions.

Below are some screenshots from the keynote.

[Keynote screenshots: intelligence; in-memory R in SQL Server; 1M predictions per second; War and Peace demo; deep learning]

List of available recordings:

Big Data Solutions Decision Tree

Last update: May 30, 2017

New version: Big Data Decision Tree v4 (Jan 14th, 2019).

Currently there are a lot of existing solutions for Big Data storage and analysis. In this article, I will describe a generic decision tree for choosing the right solution to achieve your goals.

Disclaimer: the process of solution selection for Big Data projects is very complex, with a lot of factors, so I present the most important decision points based on my experience. You may use it as a first approximation to start looking deeper into the described and other solutions.

Let’s look at the variety of products proposed for Big Data by Microsoft on-premises and in the Cloud: Analytics Platform System (APS), Apache HBase, Apache Spark, Apache Storm, Azure Data Lake Analytics (ADLA), Azure Data Lake Store (ADLS), Azure Cosmos DB, Azure Stream Analytics (ASA), Azure SQL DB, Azure SQL DW, Hortonworks Data Platform (HDP), HDInsight, Spark Streaming, etc. With so many big data solutions, we need a structured method to choose the right one for a Big Data problem.

Big Data is often described as a solution to the “three V’s problem”, and how we choose the right solution depends on which one of these problems we are trying to solve first:

  • Volume: need to store and query hundreds of terabytes of data or more, and the total volume is growing. Processing systems must be scalable to handle increasing volumes of data, typically by scaling out across multiple machines.
  • Velocity: need to collect data at an increasing rate from many new types of devices, from a fast-growing number of users, and from an increasing number of devices and applications per user. Processing systems must be able to return results within an acceptable timeframe, often almost in real-time.
  • Variety: the situation when data does not match any existing data schema – semi-structured or unstructured data.

Often you will use a combination of solutions from all or some of these areas.

Here is the decision tree, which maps the three types of problems to specific solutions. Below I will provide some comments on each of them.

There are three groups of solutions:

  • Data Warehouses (DWHs) are central relational repositories of integrated data from one or more disparate sources. They store current and historical data and are used for different analytical tasks in organizations. Use a DWH if you have structured relational data with a defined schema.
  • Complex event processing (CEP) is a method of tracking and processing streams of data about events from multiple sources, identifying meaningful events, deriving conclusions from them, and responding to them as quickly as possible. Use CEP if you need to process hundreds of thousands of events per second.
  • NoSQL systems provide a mechanism for storage and retrieval of data without tabular relations. Characteristics of NoSQL: simplicity of design and simpler “horizontal” scaling to clusters of machines. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are more flexible and therefore difficult to store in relational databases. Use NoSQL systems if you have non-relational, semi-structured, or unstructured data with no schema defined.

Data Warehouses (DWHs)

  • SQL Server is the industry-leading database for OLTP mission-critical applications, data warehouses, enterprise information management, and analytics. Updatable ColumnStore indexes allow you to achieve a significant increase in performance for analytical workloads. A combination of b-tree, in-memory tables, and ColumnStore indexes allows running analytics queries concurrently with operational workloads using the same schema. PolyBase for SQL Server 2016 allows executing T-SQL queries against semi-structured data in Hadoop or Azure Blob Storage, in addition to relational data in SQL Server. SQL Server also includes components for Enterprise Information Management (SSIS, DQS, MDS), business intelligence (SSAS, SSRS), and Machine Learning (R Services). Advantages: relational store, full transactional support, T-SQL, flexible indexing, security, EIM/Analytics components included, operational analytics on top of OLTP systems. Concerns: on a single node it is hard to achieve high performance for over 100 terabytes of data; sharding and custom code may be used to run multiple instances of SQL Server, which means much more development and administrative effort.
  • Azure SQL Database (DB) is a relational database service in the cloud based on the Microsoft SQL Server DBMS engine. SQL Database delivers predictable performance, scalability with no downtime, business continuity, and data protection. Advantages: relational store, full transactional support, T-SQL, flexible indexing, security. Concerns: the current maximum data volume in Azure SQL DB is 1 TB, so probably even with sharding and custom code this solution is not applicable for big volumes of relational data.
  • Azure SQL Data Warehouse (DW) is the MPP version of SQL Server in Azure. It scales to petabytes of data, allows resizing of compute nodes in a minute, and is integrated with the Azure platform. Advantages: highly scalable, MPP architecture, lower-cost relational storage than Blobs, pausable compute, relational store, T-SQL, flexible indexing, security.
  • Microsoft Analytics Platform System (APS) is a combination of the massively parallel processing (MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data technologies. It uses HDP to provide an on-premises solution that contains a region for Hadoop-based processing, together with PolyBase – a connectivity mechanism that integrates the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It allows data in Hadoop to be queried and combined with on-premises relational data, and data to be moved into and out of Hadoop. Advantages: very cost-effective fast MPP architecture if constantly and fully used. Concerns: make sure that most of your queries don’t initiate data movement between the nodes, which is an expensive operation.

Complex event processing (CEP)

  • Azure Stream Analytics (ASA) may be used for real-time insights from devices, sensors, infrastructure, and applications. Scenarios: real-time remote management and monitoring. ASA is optimized to get streaming data from Azure Event Hubs and Azure Blob Storage. ASA SQL-like queries run continuously against the stream of incoming events. The results can be stored in Blob Storage, Event Hubs, Azure Tables, and Azure SQL Database; so if the output is stored in Event Hubs, it can become the input to another ASA job to chain together multiple real-time queries. Advantages: SQL-like query language; cloud-based: close to globally distributed data.
  • Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG). Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. Storm topologies run indefinitely until “killed”. Storm uses Zookeeper to manage its processes. Storm can read and write files to HDFS. Architecture: Storm processes events one at a time. Performance: millisecond latency. Advantages: complete stream processing engine with micro-batching support. Concerns: supports only streaming data; not integrated with the Azure platform.
  • Spark Streaming is used to build interactive and analytical applications, for example to create low-latency dashboards and security alert systems, and to optimize operations or prevent specific outcomes. It includes high-level operators to read streaming data from Apache Flume, Apache Kafka, and Twitter, and historical data from HDFS. Architecture: Spark collects events into small batches over a short time window before processing them. Development: Scala + DStreams. Performance: 100s of MB/s with low latency (a few seconds). Concerns: not integrated with the Azure platform.

NoSQL systems

On-premises NoSQL:

  • Hortonworks Data Platform (HDP) is enterprise-ready Hadoop, supported by Hortonworks for business clients. It contains a collection of open source software frameworks for the distributed storage and processing of Big Data. It is scalable and fault-tolerant, and works with commodity hardware. HDP for Windows is a complete package that you can install on Windows Server to build your own fully configurable big data clusters based on Hadoop. It can be installed on physical on-premises hardware, or in virtual machines in the cloud. Development: Java for MapReduce, Pig, Hive. Advantages: enterprise-ready; full control of managing and running clusters. Concerns: installation on multiple nodes and managing versions is a complex process; you will need Hadoop admin experience.

NoSQL for storage and querying:

  • Azure Blob Storage easily and cost-effectively stores hundreds of objects, or hundreds of millions. It allows petabytes of capacity and massive scalability. You pay only for what you use, saving you more than on-premises storage options. It’s designed for applications that require performance, high availability, and security—offering both client-side encryption and server-side encryption. Blob storage is ideal for storing big data for analysis, whether using an Azure analytics service or your own on-premises solution. Advantages: geo-replication; low storage costs; scalability and data sharing outside of the Hadoop cluster. Administrative tools: PowerShell, AzCopy. Development: .NET, Java, Android, C++, Node.js, PHP, Ruby, and Python; REST API with HTTP/HTTPS requests.
  • Azure Data Lake Store (ADLS) is a distributed, parallel file system in the cloud, performance-tuned and optimized for analytics on different data types. It is supported by the leading Hadoop distributions (Hortonworks, Cloudera, MapR) as well as by HDInsight and Azure Data Lake Analytics (ADLA). Development: WebHDFS protocol (behaves like HDFS). Advantages: low-cost data store with high throughput, role-based security, AAD integration, larger storage limits than Blob Store.
  • Azure Table Storage allows you to store petabytes of semi-structured data while keeping costs down, without manual sharding. Using geo-redundant storage, stored data is replicated 3 times within a region—and an additional 3 times in another region. Development: OData-based queries.
  • Apache HBase is a NoSQL wide-column store for writing large amounts of unstructured or semi-structured application data to run analytical processes using Hadoop (like HDP or HDInsight).
  • Azure Cosmos DB is a globally distributed database service designed to elastically and independently scale throughput and storage across any number of geographical regions with a comprehensive SLA. It supports document, key/value, or graph databases leveraging popular APIs and programming models: DocumentDB API, MongoDB API, Graph API, and Table API. Development: SQL query and transactions over JSON documents, REST; SDKs: .NET, Node.js, Java, JavaScript, Python, Xamarin, Gremlin. Advantages: different formats of storage, global distribution, elastic scale out, low latency, 5 consistency models, automatically indexed, schema agnostic, native JSON, stored procedures.

NoSQL for Advanced Analytics:

  • Azure Data Lake Analytics (ADLA) is a distributed analytics service built on Apache YARN. It handles jobs of any scale instantly: you simply set how much compute power is needed. It allows you to run analytics on exabytes of data, while paying only for the cost of each query. ADLA supports Azure Active Directory for access control, roles, and integration with on-premises identity systems. It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#; the runtime processes data across multiple Azure data sources. ADLA allows you to compute on data anywhere and join data from multiple cloud sources such as Azure SQL DW, Azure SQL DB, ADLS, Azure Storage Blobs, and SQL Server in an Azure VM. Advantages: U-SQL, AAD, security, integrated with the Azure platform and Visual Studio. Concerns: currently only batch mode is supported — you may use HDInsight for other types of workloads.
  • HDInsight. This is a cloud-hosted service available to Azure subscribers that uses Azure clusters to run HDP, and integrates with Azure storage. Supports MapReduce, Spark, Storm frameworks. Advantages: cloud-based which means that the cluster can be created approximately in 15 minutes; scale nodes on demand; fully managed by Microsoft (upgrades, patching); some Visual Studio and IntelliJ integration; 99.9% SLA. Concerns: if you use some specific Hadoop-based components, make sure of compatibility.
  • Apache Spark is an open source cluster computing framework. It provides an API based on the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines. RDDs function as a working set for distributed programs that offers a form of distributed shared memory. Components built on top of Spark: Spark SQL, Spark Streaming, MLlib, GraphX. It also can work with R Server. Advantages: in-memory, fast (5-7 times faster than MapReduce). Concerns: fewer components compared with the MapReduce-based ecosystem.
  • Machine Learning technologies will be covered in the next article.

Any comments are appreciated. To be continued…

Additional information:

Analysis of Big Data for Financial Services Institutions

In this blog post, we will look at the analysis of stock prices and dividends by industry. This task is important to all participants of the Stock Market, including individual retail investors; institutional investors such as mutual funds, banks, insurance companies, and hedge funds; and publicly traded corporations trading in their own shares.

In this demo, a team at a Stock Trading Company analyzes semi-structured stock data from the New York Stock Exchange (NYSE).

  1. The Data Architect collects data and makes information accessible to the business. He will use a Hadoop-based distribution on Windows Azure and Hive queries to aggregate stock and dividend data by year.
  2. The Financial Analyst will analyze stock data and prepare ad-hoc reports to support trading and management processes. She will use the Power Query add-in for Excel to join aggregated data from Hadoop with additional information on the top 500 S&P companies from the Azure Marketplace DataMarket. Additionally, she will create ad-hoc reports with Power View for Excel.
  3. The Trading Executive is responsible for understanding key decision makers and suggesting the best product mix of securities. He will make some modifications to the Power View reports provided by the Financial Analyst.

Details on how the Data Architect aggregates data in Hadoop are available in a separate blog post.

Below you can see some screenshots from the demo.

[Screenshots: Data Architect (role 1), Financial Analyst (role 2), and Trading Executive (role 3) experiences]

Aggregating Big Data with HDInsight (Hadoop) on Azure

When we are talking about Big Data, we may mean huge amounts of data (high Volume), data in any format (high Variety), and streaming data (appearing with high Velocity). Microsoft provides solutions for all of these “3V” tasks under unified monitoring, management, and security, as well as unified data movement technologies. These workloads are supported correspondingly by SQL Server Database and Parallel Data Warehouse, HDInsight (Hadoop for Windows or Azure), and Microsoft SQL Server StreamInsight.

[Figure: Microsoft big data technologies]

Let us talk about Microsoft Big Data technology for Non-Relational data.

Microsoft’s adaptation of Hadoop technology can be deployed in a cloud-based environment or on-premises. The Hadoop-based service on the Windows Azure platform is a cloud-based service that offers elastic (in terms of data volume) analytics on Microsoft’s cloud platform. For customers who want to keep the data within their data centers, Microsoft provides a Hadoop-based distribution on Windows Server.

In this blog post, we will start diving into Hadoop in Azure technology and Hive queries to analyze semi-structured data in Hadoop.

In addition to traditional data warehousing, where operational data is stored in special structures in the Enterprise Data Warehouse, we can store all other raw data in a “Store it All” cluster. At any moment, we are able to query this data to answer some business question. (In addition, we may store the answer in the Data Warehouse if necessary.)

[Figure: additional data flow]

Let me introduce the first part of the Big Data Demonstration, where the Data Architect stores log files with stock prices and dividends in Azure Blob Storage and uses Hive queries to aggregate data by year and stock ticker into a separate file.

[Figure: store and aggregate]

Here is the video:

Additional materials: Windows Azure Storage Architecture Overview