Azure Search with Knowledge-based Cognitive Capabilities

Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich search experience over private, heterogenous content in web, mobile, and enterprise applications.

In a context of organizations, applications with search capabilities can be used by External Customers or Internal Business Users. Example of Cognitive Search on top of publicly available JFK files (see JFK Files Public Site and JFK Files Project).

Full Text Search

When you use Full Text Search, query execution is done over a user-defined index on top files and searchable datasets.

Video: Azure Search Overview by Liam Cavanagh:

Cognitive search

Cognitive search can be added to create searchable information out of non-searchable content by attaching AI algorithms to an indexing pipeline. AI integration is provided through cognitive skills, enriching source documents before creating a search index.

Cognitive Skills are based on the same AI algorithms used in Cognitive Services APIs:

  1. Natural language processing skills include entity recognition, language detection, key phrase extraction, text manipulation, and sentiment detection. With these skills, unstructured text becomes structured, mapped to searchable and filterable fields in an index.
  2. Image processing skills include OCR and identification of visual features, such as facial detection, image interpretation, image recognition (famous people and landmarks) or attributes like colors or image orientation. You can create text-representations of image content, searchable using all the query capabilities of Azure Search.
  3. Custom skills are a way to insert transformations unique to application content. A custom skill executes independently, applying whatever enrichment step you require. For example, you could define field-specific custom entities, build custom classification models to differentiate business and financial contracts and documents, or add a speech recognition skill to reach deeper into audio files for relevant content.

Skills can be chained. For instance, you may want to use the language you detected to improve the accuracy of the key-phrase extractor.

Video from Ignite: AI for Knowledge Mining by Luis Cabrera:

Azure Search Global Distribution

To reduce latency for remote users (in a case of geo-distributed workloads) it makes sense to create search services in each corresponding region (that is in closer proximity to these users). For example, you may use multiple Azure Search indexers in different regions that will point to the same datastore. To route requests to multiple geo-located websites that are then backed by multiple Azure Search Services, you may use Azure Traffic Manager. This approach also provides high availability and load balancing.

List of Azure Search Features

Reference

Monitoring Azure SQL Databases

In this article we will discuss different ways of monitoring data solutions including different types of Azure SQL databases like Azure SQL DB, Azure SQL DW, Azure SQL Managed Instance. This can be done using 2 approaches:

  • Single database monitoring using Internal SQL Server features (SQL Server Query Data Store, SQL Server Dynamic Management Views, SQL Server Extended Events), DTU consumption in Azure portal, Query Performance Insight, SQL Database Advisor;
  • Multiple database monitoring and reacting using Azure Monitor with Azure SQL Analytics, Event Hubs, Logic Apps, and Power BI.

Monitoring and troubleshooting single database performance:

  • DTU consumption in Azure portal
  • Query Performance Insight
  • SQL Database Advisor
  • Azure SQL Intelligent Insights
  • Real-time monitoring: dynamic management views (DMVs), extended events, and the Query Store.

Monitoring Multiple databases and reacting to events can be done using Azure Monitor with Azure SQL Analytics, Event Hubs, Logic Apps, and Power BI.

DTU consumption in Azure portal

DTU consumption in Azure portal: for each SQL database use Monitoring chart to look for resources approaching their maximum.

Query Performance Insight

Query Performance Insight lets you spend less time troubleshooting database performance by providing the following:

  • Deeper insight into your databases resource (DTU) consumption
  • The top CPU-consuming queries, which can potentially be tuned for improved performance
  • The ability to drill down into the details of a query

SQL Database Advisor

SQL Database Advisor allows to view recommendations for creating and dropping indexes, parameterizing queries, and fixing schema issues. The advisor assesses performance by analyzing SQL database’s usage history. The recommendations that are best suited for running your database’s typical workload are recommended.

Azure SQL Intelligent Insights

Azure SQL Intelligent Insights can be used for automatic monitoring of database performance. Once a performance issue is detected (for example, performance degradation), a diagnostic log is generated with details and Root Cause Analysis (RCA) of the issue. Performance improvement recommendation is provided when possible.

Real-time monitoring

You also can use dynamic management views (DMVs), extended events, and the Query Store to get performance parameters in real time. See performance guidance to find techniques that you can use to improve performance of Azure SQL Database if you identify some issue using these reports or views.

SQL Server Query Data Store (QDS) allows to ask questions about workloads gathering a history of compilation and runtime metrics throughout query executions:

  • Compile-time statistics: Query text; Semantic-affecting settings; Containing objects: SP, TVF, trigger; Parametrization type; Compilation, binding, and optimization stats; Query plan plus initial and last compile/execute times.
  • Run-time stats (aggregated on an interval): Count of executions and first/last execution time; AVG, LAST, MIN, MAX, and STDEV for metrics Duration, CPU time, Logical I/O reads and writes, Physical I/O reads, DOP, Memory grants, Number of rows.

Dynamic Management Views (DMVs) – Microsoft Azure SQL Database enables a subset of dynamic management views to diagnose performance problems, which might be caused by blocked or long-running queries, resource bottlenecks, poor query plans, and so on. This topic provides information on how to detect common performance problems by using dynamic management views. SQL Database partially supports database-related, execution-related and transaction-related DMVs.

Extended Events (XE) allows to monitor and troubleshoot performance issues, SQL statement executions and full-text related errors. The results can be captured to Ring buffer target (briefly holds event data in memory), Event counter target (counts all events that occur during an extended events session), or Event file target (Writes complete buffers to an Azure Storage container).

Azure Monitor

Azure Monitor maximizes the availability and performance of your applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments. It helps you understand how your applications are performing and proactively identifies issues affecting them and the resources they depend on.

Azure Monitor uses two data stores (for metrics and for logs). It takes data from the sources that collect telemetry from different monitored resources and populate the data stores. Azure Monitor provides analysis, alerting, and streaming to external systems of the collected data.

Most important services related to SQL monitoring using Azure monitor are following:

  • Azure SQL Analytics – provides SQL-related dashboards
  • Logic Apps – allows creating automated workflows reating to events in Azure Monitor
  • Event Hubs – allows streaming of monitoring data to partner monitoring tools
  • Power BI – allows to create and customize custom visualizations.

Azure SQL Analytics is a cloud monitoring solution for monitoring performance of Azure SQL databases, elastic pools, and Managed Instances at scale and across multiple subscriptions. It collects and visualizes important Azure SQL Database performance metrics with built-in intelligence for performance troubleshooting.

Logic Apps is a service that allows you to automate tasks and business processes using workflows that integrate with different systems and services. Activities are available that read and write metrics and logs in Azure Monitor, which allows you to build workflows integrating with a variety of other systems.

Azure Event Hubs is a streaming platform and event ingestion service that can transform and store data using any real-time analytics provider or batching/storage adapters. Use Event Hubs to stream log data from Azure Monitor to partner SIEM and monitoring tools.

Power BI is a business analytics service that provides interactive visualizations across a variety of data sources and is an effective means of making data available to others within and outside your organization. You can configure Power BI to automatically import log data from Azure Monitor to take advantage of these additional visualizations. Note. This is especially convenient when you also monitor SaaS solutions, for example Power BI reports usage with Power BI Premium Capacities – in this case you will have all dashboards and interactive reports in one collaboration environment.

Important Notes

Log Analytics and Application Insights have been consolidated into Azure Monitor. Operations Management Suite (OMS) brand, as a combination of Application Insights, Azure Automation, Azure Backup, Log Analytics, Site Recovery (for licensing purposes) is retired.

Reference materials

  1. Azure Monitor overview
  2. Monitor Azure SQL Database using Azure SQL Analytics (Preview)
  3. Monitoring Azure SQL Database using dynamic management views
  4. Announcing the Power BI Solution Template for Azure Activity Log Analytics
  5. New monitoring capabilities for Power BI Premium Capacities
  6. Intelligent Insights
  7. SQL Database Service: Monitoring and performance tuning

Modern Data Platform Map and Video

Last update: Dec 4, 2018

Modern Data Platform Map represents reference organizational layout of most important data pillars and services and corresponding groups of specialists in enterprises.

In the following video I make a quick overview of Microsoft Data Platform. I will provide more details in subsequent posts and videos. Please post your questions, suggestions and feedback below.

You may also check for details following data pillars and products:

Business Intelligence Solutions Decision Tree

In this article we will cover most important Business Intelligence components based on Microsoft Data Platform. One week ago there were announcements on Power BI Premium and Power BI Report Server which will require some clarification, so I decided to create another decision tree describing available Microsoft analytical modeling and visualization tools, and covering Power BI related components in more detailed way.

For the purposes of this article we will define Business Intelligence in a narrow way, as top and middle layers of BI stack, so it will include Analytical Modeling, Data Visualization, and Collaboration. We will also cover Sites and Apps integration as important part of BI functionality.

  1. Analytical Modeling solutions allow to load data from different data sources, combine data in one model and create calculations.
  2. Data Visualization and Collaboration solutions allow users to create, change, manage and share reports and dashboards built on top of analytical models or data sources.
  3. Sites and Apps Integration solutions allow to create applications of top of data sources, embed analytical resorts into applications and web sites, and create data driven workflows.

Here is the decision tree, which maps these areas to specific solutions. Below I will provide some comments on each of them.

Analytical Modeling

  • Azure Analysis Services is Azure PaaS offering built on the proven analytics engine in Microsoft SQL Server Analysis Services. Azure Analysis Services provides enterprise-grade tabular data modeling in the cloud.
  • SQL Server Analysis Services (SSAS) is a part of SQL Server which contains engines for multidimensional (OLAP) and tabular analytical models, and for data mining.
  • SQLBI DAX Studio is a tool to write, execute, and analyze DAX queries in Power BI Designer, Power Pivot for Excel, and Analysis Services Tabular. It includes an Object Browser, query editing and execution, formula and measure editing, syntax highlighting and formatting, integrated tracing and query execution breakdowns.
  • Microsoft Excel is a spreadsheet application with cell-based calculations. It includes Pivot Tables, Pivot Charts and Power View for data visualization; Power Query for data transformation; Power Pivot to create in-memory tabular models and calculations. Excel is a component of Microsoft Office applications package, and is also available in Office 365 subscriptions.
  • Power BI Desktop is a visual data exploration tool for data analysis and reports creation. It allows to load multiple data sources, establish data structure, transform, create analytical tabular model, visualize and explore data in interactive way, and also publish to Power BI Service.

Visualization and collaboration

  • Power BI is a set of tools for self-service and traditional business intelligence, which uses tabular analytical models, allows to build interactive reports and dashboards, and features mobile reports, collaboration and application embedding.
  • Power BI Mobile is a set of free Windows, iOS, and Android applications allowing to view and explore personalized dashboards and reports created in Power BI Service. Also it allows users to be up-to-date with data-driven alerts.
  • Power BI Service (or powerbi.com), is a SaaS part of Power BI offering allowing to create interactive reports, build dashboards, create reports & datasets, update data with real-time, automatic and scheduled refreshes, share dashboards easily with other people in your organization, ask questions of data with Natural Language Query, stay connected to data all the time with mobile applications.
    • Power BI Free is a free version of Power BI Service intended for report authoring (personal use). Currently this service is in transition to have the same functionality as Power BI Pro, but with limited sharing and collaboration features. (This will be effective June 1st)
    • Power BI Pro is a professional version of Power BI Services intended for report authoring, sharing and collaboration. Power BI Pro is payed per user, per month.
    • Power BI Premium is dedicated capacity for large-scale BI deployments, with enhanced performance and larger data volumes, without requiring to purchase per-user licenses. Power BI Premium builds on the existing Power BI portfolio with a capacity-based licensing model that increases flexibility for how users access, share and distribute content. Power BI Premium is payed per node, per month.
  • Power BI Report Server is an on-premises server that allows the deployment and distribution of interactive Power BI reports – and traditional paginated reports – completely within the boundaries of the organization’s firewall. Power BI Report Server is available as part of Power BI Premium or with SQL EE SA.
  • SQL Server Reporting Services is a solution for creating, publishing, managing reports, and delivering reports to users in web browser, on mobile device, or as an email. Types of supported reports: “traditional” paginated reports, mobile reports (AKA DataZen), and Power BI reports (through Power BI Server of Power BI Service).

Sites and Apps development

  • Power BI Embedded is PaaS offering in Azure, which provides interactive data visualizations in customer-facing apps without the time and expense of having to build it from the ground up. In future it will be converged with the Power BI Service to deliver one API surface, a consistent set of capabilities and access to the latest features.
  • Microsoft Flow is a component of Office 365 which represents user friendly and intuitive way of creating automated workflows between applications and services to generate notifications, synchronize files, collect data, and produce other actions.
  • Microsoft PowerApps is a component of Office 365 with user friendly and intuitive interface allowing to build applications without writing code, connect to data sources and create new data, publish and use created apps on web and mobile devices. Power Apps allows business experts in the organization to create the apps they need to support their business requirements with drag and drop simplicity.
  • SharePoint in Office 365 allows you to integrate Power BI interactive reports into SharePoint web pages.
  • SharePoint Server on-premises solution also includes BI-related functionality of SharePoint Server. It includes integration with SQL Server Reporting Services, and also creating PerformancePoint dashboards.

Reference materials:

PPT: Accelerate Academic Research with Cloud Computing

In this deck we will discuss how Microsoft Azure can be used to help Academic Research, and satisfy broad requirements and needs of researchers. We will cover Azure Machine Learning, HDInsight, HPC and other Azure services.

2016-12-08 – Academic Research – MTC Studio

Reference materials:

 

Decision Tree for Enterprise Information Management (EIM)

In continuation to Big Data Solutions Decision Tree, it makes sense to provide additional details on Enterprise Information Management (EIM). In this article we will define EIM as solutions making possible optimal use of information within organizations to support decision-making processes or day-to-day operations that require the availability of knowledge.

For this purpose, we will look into following aspects of EIM:

  1. Master data management (MDM) is a method of enabling an enterprise to link all of its critical data to one file, called a master file, that provides a common point of reference. MDM streamlines data sharing among personnel and departments, and can facilitate computing in multiple system architectures, platforms and applications. MDM is used for quality improvement, to provide the end user community with a “trusted single version of the truth” from which to base decisions.
  2. Data Cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
  3. Extract, Transform, Load (ETL) refers to a process in database usage and especially in data warehousing that performs: data extraction (extracts data from homogeneous or heterogeneous data sources), data transformation (transforms the data for storing it in the proper format or structure for the purposes of querying and analysis), and data loading (loads data into the operational data store, data mart, or data warehouse).
  4. Metadata management is end-to-end process and governance framework for creating, controlling, enhancing, attributing, defining and managing a metadata schema, model or other structured aggregation system, either independently or within a repository and the associated supporting processes (often to enable the management of content).
  5. Streaming Data Processing will be covered in a separate post and decision tree.

Here is the decision tree, which maps these areas to specific solutions. Below I will provide some comments on each of them.

Master data management (MDM):

  • SQL Server Master Data Services (MDS) is the SQL Server solution, which can be used by organizations to discover and define non-transactional lists of data, with the goal of compiling maintainable master lists. You can use MDS to manage any subject domain, create hierarchies, define granular security, log transactions, manage data versioning, and create business rules.
  • Profisee Master Data Maestro is an enterprise-grade master data management software suite designed to deliver powerful data stewardship and data quality capabilities to customers deploying multi-domain MDM solutions. The Maestro suite delivers a best-in-class user interface to ensure optimal efficiency and productivity for data stewards, innovative large-volume match-merge capabilities for authoritative Golden Record Management, and integrated data quality services to standardize and verify location and contact data across domains. Combined together with Microsoft MDS as a core platform, they provide a world-class out-of-the-box software solution for enterprise-grade master data management applications.

Data Cleansing:

  • SQL Server Data Quality Services (DQS) allows data steward or IT professional to create solutions to maintain the quality of their data and ensure that the data is suited for its business usage. DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data-quality project.

Extract, Transform, Load (ETL):

  • SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformations solutions. It includes a rich set of built-in tasks and transformations; tools for constructing packages; and the Integration Services service for running and managing packages.
  • Azure Data Factory (ADF) is a cloud-based data integration service that orchestrates and automates movement and transformation of data. Data Factory works across on-premises and cloud data sources and SaaS to ingest, prepare, transform, analyze, and publish data.
  • Datameer takes full advantage of the scalability, security and schema-on-read power of Hadoop providing an elegant front end that reinvents the entire user experience, making the previously linear steps of data integration, preparation, analytics and visualization a single, fluid interaction. It provides Smart Execution technology on top of MapReduce, Tez, and Spark, which frees users from having to determine what compute framework is optimal for their various big data analytics jobs by automatically optimizing performance across both small and large data.
  • U-SQL is the new big data query language of the Azure Data Lake Analytics service. It combines a familiar SQL-like declarative language with the extensibility and programmability provided by C# types and the C# expression language and big data processing concepts such as “schema on reads”, custom processors and reducers. It also provides the ability to query and combine data from a variety of data sources, including Azure Data Lake Storage, Azure Blob Storage, and Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
  • Spark SQL is a Spark module for structured data processing. Spark SQL uses information about the structure of both the data and the computation to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API/language you are using to express the computation.
  • Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
  • Apache Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free form SQL query as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database. Sqoop got the name from sql+hadoop.
  • Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.
  • Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Structure of Pig programs is amenable to substantial parallelization, which in turns enables them to handle very large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist. Pig’s language layer currently consists of a textual language called Pig Latin with properties like ease of programming, optimization opportunities, and extensibility.

Metadata management:

  • Azure Data Catalog is an enterprise-wide catalog in Azure that enables self-service discovery of data from any source. Key component of Azure Data Catalog is a metadata repository that allow users to register, enrich, understand, discover, and consume data sources. It uses crowd sourcing model, which means that any member of organization can contribute.
  • HCatalog is a table storage management tool for Hadoop that exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.
  • PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
  • Sapient Synapse is a centralized platform that helps organizations efficiently manage and capture data requirements and metadata using a series of highly visual, web-based tools. Key capabilities: information mapping, data requirements management, view transparency and lineage, research metadata, impact assessment, and data mapping across sources.

Additional resources:

Webcast: Enabling student success with cloud computing

In this webcast you will learn how to:

  • Access data analytics tools to enable real-time and predictive analytics
  • Improve student success through measurable results
  • Make the future become less about student grades and more about measuring and customizing education to the needs of the individual student

We will also cover following examples and case studies:

  • Cleveland Metropolitan Case Study
  • Predicting student dropout risks, increasing graduation rates with cloud analytics in Tacoma Public Schools
  • Predicting Student Success using Azure Machine Learning in Northeast Wisconsin Technical College (Proof of Concept)
  • Restart Academy of Missouri (Envisioning Demo by Neal Analytics)
  • Education Data Management showcase (Power BI model by Dell)

To access the webcast, you will need to fill small registration form.

Technologies: Azure Machine Learning and Power BI.

Reference materials: