Big Data Solutions Decision Tree

Last update: May 30, 2017

New version: Big Data Decision Tree v4 (Jan 14th, 2019).

Currently there are many existing solutions for Big Data storage and analysis. In this article, I will describe a generic decision tree to help you choose the right solution to achieve your goals.

Disclaimer: the process of selecting a solution for a Big Data project is very complex, with many factors involved. Here I present the most important decision points based on my experience. You may use this as a first approximation before looking deeper into the solutions described here and others.

Let’s look at the variety of products proposed for Big Data by Microsoft on-premises and in the Cloud: Analytics Platform System (APS), Apache HBase, Apache Spark, Apache Storm, Azure Data Lake Analytics (ADLA), Azure Data Lake Store (ADLS), Azure Cosmos DB, Azure Stream Analytics (ASA), Azure SQL DB, Azure SQL DW, Hortonworks Data Platform (HDP), HDInsight, Spark Streaming, etc. With so many big data solutions available, we need a structured method to choose the right one for a given Big Data problem.

Big Data is often described as a solution to the “three V’s” problem, and how we choose the right solution depends on which of these problems we are trying to solve first:

  • Volume: the need to store and query hundreds of terabytes of data or more, with the total volume growing. Processing systems must scale to handle increasing volumes of data, typically by scaling out across multiple machines.
  • Velocity: the need to collect data at an increasing rate from many new types of devices, from a fast-growing number of users, and from an increasing number of devices and applications per user. Processing systems must return results within an acceptable timeframe, often almost in real time.
  • Variety: the situation when data does not match any existing data schema, i.e. semi-structured or unstructured data.

Often you will use a combination of solutions from some or all of these areas.

Here is the decision tree, which maps the three types of problems to specific solutions. Below I will provide some comments on each of them.

There are three groups of solutions:

  • Data Warehouses (DWHs) are central relational repositories of integrated data from one or more disparate sources. They store current and historical data and are used for a variety of analytical tasks in organizations. Use a DWH if you have structured relational data with a defined schema.
  • Complex event processing (CEP) is a method of tracking and processing streams of event data from multiple sources, identifying meaningful events, deriving conclusions from them, and responding to them as quickly as possible. Use CEP if you need to process hundreds of thousands of events per second.
  • NoSQL systems provide a mechanism for storage and retrieval of data without tabular relations. Characteristics of NoSQL include simplicity of design and simpler “horizontal” scaling to clusters of machines. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are more flexible, and therefore more difficult to store in relational databases. Use NoSQL systems if you have non-relational, semi-structured, or unstructured data with no defined schema.
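As a toy illustration of these first decision points, the mapping from data characteristics to a solution group can be sketched in a few lines of Python. This is my simplification for illustration only; the function name and the event-rate threshold are invented, and real selection involves many more factors:

```python
def choose_solution_group(relational, schema_defined, events_per_second=0):
    """Map the first decision points of the tree to a solution group.

    A deliberately simplified sketch: cost, skills, existing platform,
    SLAs, and many other factors matter in a real selection.
    """
    # Velocity first: hundreds of thousands of events per second -> CEP
    if events_per_second >= 100_000:
        return "CEP"
    # Structured relational data with a defined schema -> Data Warehouse
    if relational and schema_defined:
        return "DWH"
    # Non-relational, semi-structured, or unstructured data -> NoSQL
    return "NoSQL"

print(choose_solution_group(relational=True, schema_defined=True))    # DWH
print(choose_solution_group(relational=False, schema_defined=False))  # NoSQL
```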

Data Warehouses (DWHs)

  • SQL Server is the industry-leading database for mission-critical OLTP applications, data warehouses, enterprise information management, and analytics. Updatable ColumnStore indexes provide a significant performance increase for analytical workloads. The combination of b-tree, in-memory tables, and ColumnStore indexes makes it possible to run analytics queries concurrently with operational workloads using the same schema. PolyBase for SQL Server 2016 allows executing T-SQL queries against semi-structured data in Hadoop or Azure Blob Storage, in addition to relational data in SQL Server. SQL Server also includes components for Enterprise Information Management (SSIS, DQS, MDS), business intelligence (SSAS, SSRS), and Machine Learning (R Services). Advantages: relational store, full transactional support, T-SQL, flexible indexing, security, EIM/analytics components included, operational analytics on top of OLTP systems. Concerns: on a single node it is hard to achieve high performance for over 100 terabytes of data; sharding and custom code may be used to spread data across multiple instances of SQL Server, which means much more development and administrative effort.
  • Azure SQL Database (DB) is a relational database service in the cloud based on the Microsoft SQL Server DBMS engine. SQL Database delivers predictable performance, scalability with no downtime, business continuity, and data protection. Advantages: relational store, full transactional support, T-SQL, flexible indexing, security. Concerns: the current maximum database size in Azure SQL DB is 1 TB, so even with sharding and custom code this solution is probably not applicable for large volumes of relational data.
  • Azure SQL Data Warehouse (DW) is an MPP version of SQL Server in Azure. It scales to petabytes of data, allows compute nodes to be resized in about a minute, and is integrated with the Azure platform. Advantages: highly scalable, MPP architecture, lower-cost relational storage than Blobs, pausable compute, relational store, T-SQL, flexible indexing, security.
  • Microsoft Analytics Platform System (APS) is a combination of the massively parallel processing (MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data technologies. It uses HDP to provide an on-premises solution that contains a region for Hadoop-based processing, together with PolyBase, a connectivity mechanism that integrates the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It allows data in Hadoop to be queried and combined with on-premises relational data, and data to be moved into and out of Hadoop. Advantages: a very cost-effective, fast MPP architecture if constantly and fully used. Concerns: make sure that most of your queries do not initiate data movement between nodes, which is a more expensive operation.
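The sharding mentioned in the SQL Server and Azure SQL DB concerns above is typically driven by a routing function that maps a shard key to an instance. Here is a minimal Python sketch of hash-based routing; the instance names are hypothetical, and a production design would also handle resharding:

```python
import hashlib

SHARDS = ["sqlsrv-01", "sqlsrv-02", "sqlsrv-03", "sqlsrv-04"]  # hypothetical instances

def shard_for(customer_id: str) -> str:
    """Route a row to a SQL Server instance by hashing its shard key.

    A stable hash (not Python's randomized built-in hash()) keeps the
    mapping consistent across processes and restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always lands on the same shard
print(shard_for("customer-42") == shard_for("customer-42"))  # True
```

This is exactly the kind of custom code the concern refers to: every query and every schema change must now be aware of the routing layer.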

Complex event processing (CEP)

  • Azure Stream Analytics (ASA) may be used for real-time insights from devices, sensors, infrastructure, and applications. Scenarios: real-time remote management and monitoring. ASA is optimized to get streaming data from Azure Event Hubs and Azure Blob Storage. ASA SQL-like queries run continuously against the stream of incoming events. The results can be stored in Blob Storage, Event Hubs, Azure Tables, or an Azure SQL database; if the output is stored in an Event Hub, it can become the input to another ASA job, chaining together multiple real-time queries. Advantages: SQL-like query language; cloud-based, close to globally distributed data.
  • Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG). Edges of the graph are named streams and direct data from one node to another; together, the topology acts as a data transformation pipeline. Storm topologies run indefinitely until “killed”. Storm uses ZooKeeper to manage its processes and can read and write files to HDFS. Architecture: Storm processes events one at a time. Performance: millisecond latency. Advantages: a complete stream processing engine with micro-batching support. Concerns: supports only streaming data; not integrated with the Azure platform.
  • Spark Streaming is used to build interactive and analytical applications, such as low-latency dashboards and security alert systems, or to optimize operations or prevent specific outcomes. It includes high-level operators to read streaming data from Apache Flume, Apache Kafka, and Twitter, and historical data from HDFS. Architecture: Spark collects events into small batches over a short time window before processing them. Development: Scala + DStreams. Performance: hundreds of MB/s with low latency (a few seconds). Concerns: not integrated with the Azure platform.
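The architectural difference between Storm (one event at a time) and Spark Streaming (micro-batches) comes down to how events are grouped before processing. The grouping step can be sketched in plain Python; the `ts` field name is an assumption for this sketch:

```python
def micro_batches(events, batch_window):
    """Group a time-ordered event stream into micro-batches, Spark
    Streaming style: events whose timestamps fall into the same fixed
    window are processed together as one small batch.

    Each event is a dict with a 'ts' field in seconds (name assumed).
    """
    batches = {}
    for event in events:
        window = int(event["ts"] // batch_window)
        batches.setdefault(window, []).append(event)
    return [batches[w] for w in sorted(batches)]

stream = [{"ts": 0.1}, {"ts": 0.4}, {"ts": 1.2}, {"ts": 2.9}]
print([len(b) for b in micro_batches(stream, batch_window=1.0)])  # [2, 1, 1]
```

A Storm-style engine would instead invoke the processing function once per event, which is what buys it millisecond rather than multi-second latency.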

NoSQL systems

On-premises NoSQL:

  • Hortonworks Data Platform (HDP) is an enterprise-ready Hadoop distribution, supported by Hortonworks for business clients. It contains a collection of open-source software frameworks for distributed storage and processing of Big Data. It is scalable and fault-tolerant, and works on commodity hardware. HDP for Windows is a complete package that you can install on Windows Server to build your own fully configurable big data clusters based on Hadoop. It can be installed on physical on-premises hardware or in virtual machines in the cloud. Development: Java for MapReduce, Pig, Hive. Advantages: enterprise-ready; full control over managing and running clusters. Concerns: installation on multiple nodes and managing versions is a complex process; you will need Hadoop administration experience.

NoSQL for storage and querying:

  • Azure Blob Storage easily and cost-effectively stores anywhere from hundreds of objects to hundreds of millions. It provides petabytes of capacity and massive scalability. You pay only for what you use, saving more than with on-premises storage options. It is designed for applications that require performance, high availability, and security, offering both client-side and server-side encryption. Blob storage is ideal for storing big data for analysis, whether using an Azure analytics service or your own on-premises solution. Advantages: geo-replication; low storage costs; scalability and data sharing outside of the Hadoop cluster. Administrative tools: PowerShell, AzCopy. Development: .NET, Java, Android, C++, Node.js, PHP, Ruby, and Python; REST API with HTTP/HTTPS requests.
  • Azure Data Lake Store (ADLS) is a distributed, parallel file system in the cloud, performance-tuned and optimized for analytics over different data types. It is supported by the leading Hadoop distributions (Hortonworks, Cloudera, MapR) and works with HDInsight and Azure Data Lake Analytics (ADLA). Development: WebHDFS protocol (behaves like HDFS). Advantages: low-cost data store with high throughput, role-based security, AAD integration, larger storage limits than Blob Storage.
  • Azure Table Storage allows you to store petabytes of semi-structured data while keeping costs down, without manual sharding. With geo-redundant storage, stored data is replicated three times within a region and an additional three times in another region. Development: OData-based queries.
  • Apache HBase is a NoSQL wide-column store for writing large amounts of unstructured or semi-structured application data to run analytical processes using Hadoop (like HDP or HDInsight).
  • Azure Cosmos DB is a globally distributed database service designed to elastically and independently scale throughput and storage across any number of geographical regions with a comprehensive SLA. It supports document, key/value, and graph databases leveraging popular APIs and programming models: DocumentDB API, MongoDB API, Graph API, and Table API. Development: SQL queries and transactions over JSON documents, REST; SDKs: .NET, Node.js, Java, JavaScript, Python, Xamarin, Gremlin. Advantages: multiple storage formats, global distribution, elastic scale-out, low latency, five consistency models, automatic indexing, schema agnostic, native JSON, stored procedures.
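To make the schema-agnostic, automatically indexed document model concrete, here is a toy in-memory sketch. This is not Cosmos DB SDK code; the class and its methods are invented to show the data-model idea only (a real store would also update the index on overwrite):

```python
class TinyDocumentStore:
    """A toy document store: documents of any shape, keyed by id, with
    a simple automatic index over scalar fields -- the flavor of model
    a service like Azure Cosmos DB provides at global scale."""

    def __init__(self):
        self._docs = {}   # id -> document (arbitrary JSON-like shape)
        self._index = {}  # (field, value) -> set of ids

    def upsert(self, doc_id, doc):
        self._docs[doc_id] = doc
        for field, value in doc.items():
            if isinstance(value, (str, int, float, bool)):
                self._index.setdefault((field, value), set()).add(doc_id)

    def find(self, field, value):
        return [self._docs[i] for i in sorted(self._index.get((field, value), ()))]

store = TinyDocumentStore()
store.upsert("1", {"type": "car", "model": "Contoso X"})
store.upsert("2", {"type": "car", "sensors": 12})  # different shape: no schema change
print(len(store.find("type", "car")))  # 2
```

Notice that the two documents have different fields; nothing resembling an ALTER TABLE was needed, which is the essence of "schema agnostic."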

NoSQL for Advanced Analytics:

  • Azure Data Lake Analytics (ADLA) is a distributed analytics service built on Apache YARN. It handles jobs of any scale instantly; you simply set how much compute power is needed. It allows you to run analytics on exabytes of data while paying only for the cost of the query. ADLA supports Azure Active Directory for access control, roles, and integration with on-premises identity systems. It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#; its runtime processes data across multiple Azure data sources. ADLA allows you to compute on data anywhere and to join data from multiple cloud sources such as Azure SQL DW, Azure SQL DB, ADLS, Azure Storage Blobs, and SQL Server in an Azure VM. Advantages: U-SQL, AAD, security, integration with the Azure platform and Visual Studio. Concerns: currently only batch mode is supported; you may use HDInsight for other types of workloads.
  • HDInsight is a cloud-hosted service available to Azure subscribers that uses Azure clusters to run HDP and integrates with Azure storage. It supports the MapReduce, Spark, and Storm frameworks. Advantages: cloud-based, which means a cluster can be created in approximately 15 minutes; scales nodes on demand; fully managed by Microsoft (upgrades, patching); some Visual Studio and IntelliJ integration; 99.9% SLA. Concerns: if you use specific Hadoop-based components, make sure they are compatible.
  • Apache Spark is an open-source cluster computing framework. It provides an API based on the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines. RDDs function as a working set for distributed programs, offering a form of distributed shared memory. Components built on top of Spark: Spark SQL, Spark Streaming, MLlib, GraphX. It can also work with R Server. Advantages: in-memory, fast (5-7 times faster than MapReduce). Concerns: fewer components compared with the MapReduce-based ecosystem.
  • Machine Learning technologies will be covered in the next article.
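The RDD model described above can be imitated in a few lines of single-process Python. This is a conceptual sketch only (real Spark distributes partitions across a cluster and evaluates lazily), but it shows why partitioned, immutable datasets parallelize well:

```python
from functools import reduce

class TinyRDD:
    """A minimal imitation of Spark's RDD idea: an immutable dataset
    split into partitions, with map/filter transformations and a
    reduce action."""

    def __init__(self, partitions):
        self.partitions = partitions  # list of lists of items

    def map(self, fn):
        return TinyRDD([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return TinyRDD([[x for x in p if pred(x)] for p in self.partitions])

    def reduce(self, fn):
        # Reduce within each partition first, then combine the partial
        # results -- how a cluster avoids shipping all data to one node.
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)

rdd = TinyRDD([[1, 2, 3], [4, 5], [6]])
print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))  # 91
```

Each partition can be transformed independently, which is what lets Spark scale the same three-method program from one laptop to thousands of cores.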

Any comments are appreciated. To be continued…

Additional information:

Vehicle Health & Driving Pattern Analysis using Cortana Analytics with Power BI

Last changes: January 13, 2016

In the following advanced analytics scenario, we will show how car dealers, insurers, and automobile manufacturers can use Cortana Analytics, including Power BI, to gain real-time and predictive insights into vehicle health and driving behavior patterns.

The solution can be applied to the following business use cases:

  • Usage-based insurance
  • Vehicle diagnostics
  • Engine emission control
  • Engine performance remapping
  • Eco-driving
  • Roadside assistance calls
  • Fleet management

auto-scenarios-by-microsoft

Starting December 1, 2015, the solution, called the Vehicle Telemetry Analytics template, is available in the Cortana Analytics Gallery. Here is a quick promotional video:

In the following video and the text below, you will see some details of the solution architecture, which includes the following technologies: Event Hub, Azure Stream Analytics, Azure Machine Learning, Azure Data Factory, HDInsight, Azure Storage, Azure SQL DW, and Power BI.

Let’s look at the data flow and solution components.

cc-arch

The Event Hub is used to ingest huge volumes of events from the vehicles into Azure for real-time and batch analytics.

The Stream Analytics job performs real-time data ingestion into long-term storage for batch analytics, and prepares data for real-time predictive insights.

Below is a description of the three queries processed in Stream Analytics. (All three queries are enriched with detailed data on each vehicle from Blob Storage.)

Query #1 joins the stream with reference data from Azure Blob Storage and accumulates the resulting data into a separate container in Blob Storage for rich batch analytics.

Query #2 publishes the data as-is to the output Event Hub so that it can be consumed by the RealtimeDashboard app, which invokes the machine learning request/response endpoint for real-time anomaly detection and pushes the results to the Power BI live dashboard.

Query #3 performs aggregations on the data within a 3-second tumbling window and publishes them to an Azure SQL instance provisioned as part of the deployment.
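The tumbling-window semantics of Query #3 can be sketched in plain Python. Tumbling windows are fixed-length and non-overlapping, so each event contributes to exactly one aggregate; the `ts` and `speed` field names here are assumptions, not the demo's actual schema:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds=3):
    """Average a metric over fixed, non-overlapping (tumbling) windows,
    the way the Stream Analytics query aggregates vehicle telemetry.
    Each event: {"ts": seconds, "speed": value} (field names assumed)."""
    sums = defaultdict(lambda: [0.0, 0])
    for e in events:
        window_start = int(e["ts"] // window_seconds) * window_seconds
        acc = sums[window_start]
        acc[0] += e["speed"]
        acc[1] += 1
    return {w: s / n for w, (s, n) in sorted(sums.items())}

events = [{"ts": 0.5, "speed": 60}, {"ts": 2.0, "speed": 70}, {"ts": 3.5, "speed": 80}]
print(tumbling_window_avg(events))  # {0: 65.0, 3: 80.0}
```

In ASA itself this is expressed declaratively with `TumblingWindow` in the SQL-like query language rather than written by hand.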

Data Factory is used for:

  • Orchestration, monitoring and management of the batch analytics pipeline
  • Transformation of the data in an on-demand HDInsight cluster for rich insights on Driving Behavior Pattern and Vehicle Health Trending
  • Data movement across the various data stores

cc-datafactory

All data in the source datasets is processed using Hive queries that describe data structures over the CSV files. Additionally, we define new tables and calculate aggregations using INSERT statements.

cc-hive
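Conceptually, each of these Hive aggregations is a GROUP BY over CSV-backed tables. In plain Python the same logic looks like this; the column names are placeholders, not the demo's actual schema, and Hive of course runs the equivalent query at scale on the cluster:

```python
import csv
import io
from collections import defaultdict

def avg_metric_by_model(csv_text, metric="speed"):
    """Mimic a Hive aggregation such as
    'SELECT model, AVG(speed) ... GROUP BY model' over a CSV-backed
    telemetry table. Column names are placeholders for this sketch."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        acc = totals[row["model"]]
        acc[0] += float(row[metric])
        acc[1] += 1
    return {model: s / n for model, (s, n) in totals.items()}

telemetry = "model,speed\nSedan,60\nSedan,70\nSUV,50\n"
print(avg_metric_by_model(telemetry))  # {'Sedan': 65.0, 'SUV': 50.0}
```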

In this solution, we are targeting the following batch insights:

  • Aggressive driving behavior (identifies trends across models, locations, driving conditions, and times of the year to gain insights on aggressive driving patterns, allowing Contoso Motors to use them for marketing campaigns, driving new personalized features, and usage-based insurance)
  • Fuel-efficient driving behavior (identifies trends across models, locations, driving conditions, and times of the year to gain insights on fuel-efficient driving patterns, allowing Contoso Motors to use them for marketing campaigns, driving new features, and proactive reporting to drivers for cost-effective and environmentally friendly driving habits)
  • Recall models (identifies models requiring recalls through anomaly detection trends and correlation with driving habits)

An anomaly detection Azure Machine Learning model is used in this demo to detect safety issues for vehicle recall and to identify vehicles requiring maintenance. The model is published in an existing subscription, and its web service endpoint is used in both request/response and batch mode for operationalization in the real-time and batch processing paths.

CC-demo-AML
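The actual Azure ML model is not reproduced here; as a stand-in for the idea, a minimal z-score detector over a telemetry metric could look like this. The threshold, data, and function name are all illustrative, not the demo's:

```python
import statistics

def detect_anomalies(values, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from the
    mean -- a simple stand-in for the anomaly-detection web service used
    in the demo's real-time and batch paths."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

readings = [90, 91, 89, 92, 90, 250, 91]  # one faulty sensor spike
print(detect_anomalies(readings, threshold=2.0))  # [5]
```

In the demo, the same scoring logic sits behind a web service endpoint, so both the real-time dashboard (request/response) and the Data Factory pipeline (batch) can call it.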

Aggregated data from Blob Storage is moved to Azure SQL Data Warehouse for historical storage.

Power BI dashboards combine historical data from Azure SQL DW with real-time data from Azure Stream Analytics and the Event Hub.

CC-demo-powerbi

Special thanks to the authors of the demo scenario: Anand Subbaraj, Sanjay Soni, Christoph Schuler, Santosh Waghmare, Shashank Khedikar, and Sam Istephan.

 

Cortana Intelligence Suite

Cortana Intelligence Suite is a fully managed big data and advanced analytics suite that enables organizations to transform data into intelligent action.

Cortana Intelligence Suite includes the following products and capabilities (see also the figure below):

cortana-analytics-suite

Details: Official site of Cortana Analytics Suite


 

Analysis of Big Data for Financial Services Institutions

In this blog post, we will look at an analysis of stock prices and dividends by industry. This task is important to all participants in the stock market, including individual retail investors; institutional investors such as mutual funds, banks, insurance companies, and hedge funds; and publicly traded corporations trading in their own shares.

In this demo, a team at a stock trading company analyzes semi-structured stock data from the New York Stock Exchange (NYSE).

  1. The Data Architect collects data and makes information accessible to the business. He will use a Hadoop-based distribution on Windows Azure and Hive queries to aggregate stock and dividend data by year.
  2. The Financial Analyst analyzes stock data and prepares ad-hoc reports to support trading and management processes. She will use the Power Query add-in for Excel to join aggregated data from Hadoop with additional information on the top 500 S&P companies from the Azure Marketplace DataMarket. Additionally, she will create ad-hoc reports with Power View for Excel.
  3. The Trading Executive is responsible for understanding key decision makers and suggesting the best product mix of securities. He will make some modifications to the Power View reports provided by the Financial Analyst.
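The merge the Financial Analyst performs in Power Query is an ordinary inner join on the ticker symbol. Sketched in Python (the field names and sample values are assumed for illustration):

```python
def inner_join(stock_aggregates, company_info, key="ticker"):
    """Join Hadoop-aggregated stock rows with company reference data on
    a shared key -- what Power Query's Merge does interactively in Excel."""
    lookup = {row[key]: row for row in company_info}
    return [
        {**agg, **lookup[agg[key]]}
        for agg in stock_aggregates
        if agg[key] in lookup
    ]

aggs = [{"ticker": "ABC", "avg_price": 12.5}, {"ticker": "XYZ", "avg_price": 7.1}]
companies = [{"ticker": "ABC", "sector": "Energy"}]  # reference data
joined = inner_join(aggs, companies)
print(joined)  # [{'ticker': 'ABC', 'avg_price': 12.5, 'sector': 'Energy'}]
```

Because it is an inner join, tickers missing from the reference data (XYZ above) drop out of the result, which is worth checking before building reports on the merged table.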

Details on how the Data Architect aggregates data in Hadoop are available in a separate blog post.

Below you can see some screenshots from the demo.

[Screenshots for each of the three roles described above]

Aggregating Big Data with HDInsight (Hadoop) on Azure

When we are talking about Big Data, we may mean huge amounts of data (high Volume), data in any format (high Variety), or streaming data (appearing with high Velocity). Microsoft provides solutions for all of these “3V” tasks under unified monitoring, management, and security, as well as unified data movement technologies. These workloads are supported, correspondingly, by SQL Server Database and Parallel Data Warehouse, HDInsight (Hadoop for Windows or Azure), and Microsoft SQL Server StreamInsight.

big-data-technologies

Let us talk about Microsoft Big Data technology for Non-Relational data.

Microsoft’s adaptation of Hadoop technology can be deployed in a cloud-based environment or on-premises. The Hadoop-based service on the Windows Azure platform offers elastic (in terms of data volume) analytics on Microsoft’s cloud platform. For customers who want to keep their data within their own data centers, Microsoft provides a Hadoop-based distribution on Windows Server.

In this blog post, we will start diving into the Hadoop-on-Azure technology and Hive queries for analyzing semi-structured data in Hadoop.

In addition to traditional data warehousing, where operational data is stored in special structures in an Enterprise Data Warehouse, we can store all other raw data in a “Store it All” cluster. At any moment, we can run a query against this data to answer a business question. (We may also store the answer in the Data Warehouse if necessary.)

additional-flow

Let me introduce the first part of the Big Data demonstration, where the Data Architect stores log files with stock prices and dividends in Azure Blob Storage and uses Hive queries to aggregate the data by year and stock ticker into a separate file.

store-and-aggregate
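The aggregation by year and ticker is, at its core, a two-key GROUP BY that writes its result into a new file. Here is a Python sketch of that shape; the real work is done by Hive over Blob Storage, and the column names and the max-price statistic are placeholders:

```python
from collections import defaultdict

def aggregate_stocks(rows):
    """Group stock-price rows by (year, ticker) and keep the maximum
    closing price per group -- the shape of the Hive aggregation that
    writes its result into a separate file."""
    highs = defaultdict(float)
    for r in rows:
        key = (r["year"], r["ticker"])
        highs[key] = max(highs[key], r["close"])
    # Emit CSV-like lines, as the Hive INSERT writes a new file
    return [f"{y},{t},{price}" for (y, t), price in sorted(highs.items())]

rows = [
    {"year": 2000, "ticker": "ABC", "close": 10.0},
    {"year": 2000, "ticker": "ABC", "close": 12.0},
    {"year": 2001, "ticker": "ABC", "close": 9.0},
]
print(aggregate_stocks(rows))  # ['2000,ABC,12.0', '2001,ABC,9.0']
```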

Here is the video:

Additional materials: Windows Azure Storage Architecture Overview

Microsoft Customer Analytics Solution for Banking

Understanding customers is the foundation of a sustainable competitive advantage in banking. Problems such as customer churn are very difficult to mitigate; it is much better to prevent them by analyzing customer behavior and preferences. To make this possible, it is necessary to extract more value from internal and external data sources and to be able to analyze the data interactively and intuitively.

Microsoft Business Intelligence helps solve the customer churn problem, provide better customer service, improve cross-selling and up-selling, and enhance share of wallet.

Understanding the importance of these tasks, Microsoft created the Customer Analytics Solution for Banking. In this blog post, I present a brief demonstration of this solution. After watching the video recording of the demonstration (see below), you will find that marketers and other business users can easily use this solution to answer complex inquiries.

During the demonstration, I will act from the points of view of different personas.

  1. The goal of Chief Retail Executive is to develop more profitable customer relationships with end-to-end customer insight with following tasks.
    1. Engage and retain the most profitable customers
    2. Assess segmentation performance and opportunities
    3. Optimize product mix and pricing strategy
  2. The goal of Chief Marketing Officer is to understand and improve brand perception by integrating a broader set of data with following tasks.
    1. Understand brand perception & consumer sentiment.
    2. Gain insight into customer preferences.
    3. Identify key influencers & social media channels.
  3. The goal of Operator of Call Center is to provide support having complete view of the customer with following tasks.
    1. Provide best customer service quality.
    2. Combine customer sentiment analysis and profitability data.
    3. Track personal performance.
  4. The goal of Contact Center Executive is to maximize customer retention and satisfaction through improved service with following tasks.
    1. Improve customer service quality.
    2. Track call center performance for each Agent.
    3. Combine customer sentiment analysis and profitability data.

[Screenshots for each of the four roles described above]

Microsoft Business Intelligence Solutions Builder

If your organization needs to plan a BI deployment or upgrade, you may use the BI Solutions Builder tool at www.bisolutionbuilder.com. This tool helps you understand which products you need, in addition to the products already in your organization, to build the desired BI solution.

To use this service, you need to enter some basic information about your company or organization.

In the first step, you select the products you have already purchased (for example, in the screenshot below, SharePoint 2010 Enterprise and Office 2007 Professional without an Enterprise Agreement).

BI Solutions Builder. First step – Choose Products. Source: www.bisolutionbuilder.com

In the second step, you choose the desired capabilities from the list: Predictive Analytics, In-Memory BI, Self-Service BI – End User Analysis, Self-Service BI – Governance and Control, Geospatial Data Visualization, Big Data Analytics, Dashboards and Scorecards, eDiscovery Capability, Collaboration, Cloud-Based Reporting Services, Cloud-Based Data Synchronization, Hadoop Integration, Cloud Storage, Data Warehousing, Analysis Services – OLAP, Power View Visualization, Data Driven Diagrams, Modeling and BI Development, Operational Reporting, Mobile BI.

BI Solutions Builder. Second step – Choose Capabilities. Source: www.bisolutionbuilder.com

In the third step, you get a report with the products you currently own, your preferred capabilities, upgrade paths with the products and capabilities gained, resources related to your preferred BI capabilities, information on Microsoft appliances, and resources specific to your industry, use cases, and BI tools.

BI Solutions Builder. Third step – View Report. Source: www.bisolutionbuilder.com