PPT: Accelerate Academic Research with Cloud Computing

In this deck we will discuss how Microsoft Azure can support academic research and satisfy the broad requirements and needs of researchers. We will cover Azure Machine Learning, HDInsight, HPC, and other Azure services.

2016-12-08 – Academic Research – MTC Studio

Reference materials:

 

Decision Tree for Enterprise Information Management (EIM)

In continuation of the Big Data Solutions Decision Tree, it makes sense to provide additional detail on Enterprise Information Management (EIM). In this article we will define EIM as the set of solutions that make optimal use of information within an organization to support decision-making processes or day-to-day operations that require the availability of knowledge.

For this purpose, we will look into the following aspects of EIM:

  1. Master data management (MDM) is a method of enabling an enterprise to link all of its critical data to one file, called a master file, that provides a common point of reference. MDM streamlines data sharing among personnel and departments, and can facilitate computing across multiple system architectures, platforms, and applications. MDM is used for quality improvement, providing the end-user community with a “trusted single version of the truth” on which to base decisions.
  2. Data Cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data (a minimal sketch of this idea follows the list).
  3. Extract, Transform, Load (ETL) refers to a process in database usage and especially in data warehousing that performs: data extraction (extracts data from homogeneous or heterogeneous data sources), data transformation (transforms the data for storing it in the proper format or structure for the purposes of querying and analysis), and data loading (loads data into the operational data store, data mart, or data warehouse).
  4. Metadata management is the end-to-end process and governance framework for creating, controlling, enhancing, attributing, defining, and managing a metadata schema, model, or other structured aggregation system, either independently or within a repository, together with the associated supporting processes (often to enable the management of content).
  5. Streaming Data Processing will be covered in a separate post and decision tree.
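
As a simple illustration of the data cleansing step described in item 2, here is a minimal sketch in Python using pandas. The data, column names, and rules are hypothetical and are not tied to any specific product discussed below; they only show the detect-and-correct pattern.

```python
import pandas as pd

# Hypothetical raw customer records with typical quality problems:
# inconsistent casing, stray whitespace, an unparseable age, and a duplicate.
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name": ["  Alice Smith", "BOB JONES", "BOB JONES", "carol white "],
    "age": ["34", "not available", "not available", "29"],
})

cleaned = (
    raw
    .drop_duplicates(subset="customer_id")  # remove duplicate records
    .assign(
        # standardize names: trim whitespace, normalize casing
        name=lambda df: df["name"].str.strip().str.title(),
        # mark unparseable ages as missing instead of keeping dirty values
        age=lambda df: pd.to_numeric(df["age"], errors="coerce"),
    )
)

# Records whose age could not be parsed are flagged for review rather than silently dropped.
needs_review = cleaned[cleaned["age"].isna()]
print(cleaned)
print(needs_review)
```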

Here is the decision tree, which maps these areas to specific solutions. Below I will provide some comments on each of them.

Master data management (MDM):

  • SQL Server Master Data Services (MDS) is the SQL Server solution that organizations can use to discover and define non-transactional lists of data, with the goal of compiling maintainable master lists. You can use MDS to manage any subject domain, create hierarchies, define granular security, log transactions, manage data versioning, and create business rules.
  • Profisee Master Data Maestro is an enterprise-grade master data management software suite designed to deliver powerful data stewardship and data quality capabilities to customers deploying multi-domain MDM solutions. The Maestro suite delivers a best-in-class user interface to ensure optimal efficiency and productivity for data stewards, innovative large-volume match-merge capabilities for authoritative Golden Record Management, and integrated data quality services to standardize and verify location and contact data across domains. Combined with Microsoft MDS as a core platform, it provides a world-class out-of-the-box solution for enterprise-grade master data management applications.

Data Cleansing:

  • SQL Server Data Quality Services (DQS) allows a data steward or IT professional to create solutions that maintain the quality of their data and ensure that the data is suited to its business usage. DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data quality project.

Extract, Transform, Load (ETL):

  • SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformation solutions. It includes a rich set of built-in tasks and transformations, tools for constructing packages, and the Integration Services service for running and managing packages.
  • Azure Data Factory (ADF) is a cloud-based data integration service that orchestrates and automates movement and transformation of data. Data Factory works across on-premises and cloud data sources and SaaS to ingest, prepare, transform, analyze, and publish data.
  • Datameer takes full advantage of the scalability, security, and schema-on-read power of Hadoop, providing an elegant front end that reinvents the user experience by turning the previously linear steps of data integration, preparation, analytics, and visualization into a single, fluid interaction. Its Smart Execution technology, which runs on top of MapReduce, Tez, and Spark, frees users from having to determine which compute framework is optimal for their big data analytics jobs by automatically optimizing performance across both small and large data.
  • U-SQL is the new big data query language of the Azure Data Lake Analytics service. It combines a familiar SQL-like declarative language with the extensibility and programmability provided by C# types and the C# expression language, and with big data processing concepts such as “schema on read”, custom processors, and reducers. It also provides the ability to query and combine data from a variety of data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL Database, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
  • Spark SQL is a Spark module for structured data processing. Spark SQL uses information about the structure of both the data and the computation to perform extra optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used regardless of which API or language you use to express the computation (a minimal PySpark sketch of the ETL pattern follows this list).
  • Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
  • Apache Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase, and exports can be used to put data from Hadoop into a relational database. The name Sqoop comes from “SQL-to-Hadoop”.
  • Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable, and extensible system.
  • Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The structure of Pig programs is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist. Pig’s language layer currently consists of a textual language called Pig Latin with properties such as ease of programming, optimization opportunities, and extensibility.
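
To make the extract-transform-load flow concrete, below is a minimal PySpark sketch of the pattern Spark SQL supports: extract from CSV, transform with a declarative SQL query, and load the result into Parquet. The file paths, table name, and columns are placeholders for illustration only, not part of any product described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw source data (path and schema are hypothetical).
orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

# Transform: aggregate with a Spark SQL query.
daily_totals = spark.sql("""
    SELECT order_date, customer_id, SUM(amount) AS total_amount
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY order_date, customer_id
""")

# Load: write the transformed result to a columnar format for analysis.
daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")
```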

Metadata management:

  • Azure Data Catalog is an enterprise-wide catalog in Azure that enables self-service discovery of data from any source. A key component of Azure Data Catalog is a metadata repository that allows users to register, enrich, understand, discover, and consume data sources. It uses a crowdsourcing model, which means that any member of the organization can contribute.
  • HCatalog is a table storage management tool for Hadoop that exposes the tabular data of the Hive metastore to other Hadoop applications. It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored, whether RCFile format, text files, SequenceFiles, or ORC files (a small sketch of browsing this metadata follows the list).
  • PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
  • Sapient Synapse is a centralized platform that helps organizations efficiently manage and capture data requirements and metadata using a series of highly visual, web-based tools. Key capabilities: information mapping, data requirements management, view transparency and lineage, research metadata, impact assessment, and data mapping across sources.
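
As a small, hedged illustration of metastore-backed metadata, the PySpark sketch below lists the databases, tables, and columns that the Hive metastore (the same store HCatalog exposes to Pig and MapReduce) surfaces through Spark's catalog API. The table and database names are placeholders.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects Spark's catalog to the Hive metastore,
# so the listings below reflect the shared table metadata.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

for db in spark.catalog.listDatabases():
    print("database:", db.name)
    for table in spark.catalog.listTables(db.name):
        print("  table:", table.name, "| type:", table.tableType)

# Column-level metadata for a hypothetical table in the default database.
for col in spark.catalog.listColumns("orders", dbName="default"):
    print(col.name, col.dataType, "nullable" if col.nullable else "not null")
```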

Additional resources:

Webcast: Enabling student success with cloud computing

In this webcast you will learn how to:

  • Access data analytics tools to enable real-time and predictive analytics
  • Improve student success through measurable results
  • Shift the focus away from student grades and toward measuring and customizing education to the needs of the individual student

We will also cover the following examples and case studies:

  • Cleveland Metropolitan Case Study
  • Predicting student dropout risks, increasing graduation rates with cloud analytics in Tacoma Public Schools
  • Predicting Student Success using Azure Machine Learning in Northeast Wisconsin Technical College (Proof of Concept)
  • Restart Academy of Missouri (Envisioning Demo by Neal Analytics)
  • Education Data Management showcase (Power BI model by Dell)

To access the webcast, you will need to fill out a short registration form.

Technologies: Azure Machine Learning and Power BI.

Reference materials:

Azure Machine Learning Hands-on Labs

Last update: Oct 17, 2017

In this post I will provide information on the Azure Machine Learning (ML) Hands-on Labs training for developers, which we will be delivering in New York and other technology centers. After this training you will know how to create an Azure Machine Learning experiment, select the best ML model, convert the training experiment into a predictive experiment, and create an application that uses the model.

The training consists of the following labs.

  1. Predict Individual’s Income >50K (Estimated: 1 hour).
  2. Convert a training experiment into a predictive experiment in Azure ML by Mostafa Elzoghbi (Estimated: 30 minutes).
  3. Consume an Azure ML web service using Visual Studio 2015 by Mostafa Elzoghbi (Estimated: 30 minutes). A Python sketch of the underlying web service call is shown after this list.
  4. Flight delay prediction by Todd Kitta. (Estimated: 3 hours) Start from Task 2. This model can be reused later in a separate Cortana Intelligence Suite End-to-End Training.
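
Lab 3 uses C# in Visual Studio, but the same request-response call can be made from any language. Below is a hedged Python sketch of calling an Azure ML (classic) web service; the endpoint URL, API key, and input columns are placeholders that you would replace with the values from your own published predictive experiment (for example, the income-prediction model from lab 1).

```python
import json
import urllib.request

# Placeholders: copy these from the web service dashboard of your experiment.
url = ("https://<region>.services.azureml.net/workspaces/<workspace-id>"
       "/services/<service-id>/execute?api-version=2.0&details=true")
api_key = "<your-api-key>"

# Request-response schema: named inputs with column names and rows of values.
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["age", "education", "hours-per-week"],
            "Values": [["39", "Bachelors", "40"]],
        }
    },
    "GlobalParameters": {},
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    },
)

with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
    print(json.dumps(result, indent=2))
```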

If you need more detailed instructions for self-paced training, you may also use the Hands-on Labs from the edX courses (videos with theory and quizzes are included).

  1. DAT203.1x Data Science Essentials
  2. DAT203.2x Principles of Machine Learning
  3. DAT203.3x Applied Machine Learning

Prerequisites

Please complete the following prerequisites:

  • Activate your Azure account and bring your Microsoft account credentials. Don’t have a Microsoft account? Sign up now.
  • If you do not have a Microsoft Azure account, activate a free 30-day trial Microsoft Azure account, or if you subscribe to MSDN, activate your free Azure MSDN subscriber benefits.
  • Preferred OS is Windows 10.
  • Make sure that Visual Studio 2015 Community, Pro, or Enterprise is installed, and that Office 2013 or later is installed. (Optional; alternatively, you may use the Windows Data Science Virtual Machine in Azure.)
  • Create an Azure ML workspace for free by signing up here.

Additional resources:

  1. Azure Machine Learning (ML)
  2. Cortana Intelligence Suite: Big Data and Advanced Analytics
  3. Big Data Presentation Deck
  4. Azure ML Data Camp Deck
  5. Detailed Azure ML Hands-on-Labs

Next Steps:

  1. Cortana Intelligence Suite End-to-End Training (using the Flight Delay Prediction model in an Azure-based solution).
  2. Data Science with Microsoft R Hands-on Labs (Different ways of using R language).

Webcast: Predictive Data Warehouse with Datameer

In the following webcast, we will talk to Andrew Brust, Senior Director of Market Strategy and Intelligence at Datameer.

We will learn about the Hadoop ecosystem and PaaS options in Azure, the difference between a data lake and a data warehouse, and the added value of unstructured data streams. We will discuss the Hadoop learning curve for professionals with an OLTP database and BI background, and how Datameer can help create big data solutions and future-proof them against change.

Technologies: HDInsight, Stream Analytics, Azure Data Lake Store and Analytics, Azure Machine Learning and Power BI.

To access the webcast, you will need to fill out a short registration form.

Webcast: Data warehouse migration to Azure with Hortonworks

A modern EDW should be able to manage both structured and unstructured data to realize the full value of that data. Security, consistency, and credibility of data are also very important. Data warehouse and big data solutions from Microsoft provide a trusted infrastructure that can handle all types of data and scale from terabytes to petabytes with real-time performance.

In this webcast, with the participation of Mark Lochbihler (Director of Partner Engineering, Hortonworks), we discuss modern enterprise data warehouses (EDW) and migration to the Microsoft Cloud (Azure). We will learn about the process, tools, and reference architectures for data warehouse migration.

To access the webcast, you will need to fill out a short registration form.

Additional resources:

Empowering Insurance Risk Modeling

In today’s global environment, volatile financial markets and natural catastrophes have created a fast-moving risk landscape in both life and non-life insurance. In addition, many insurers must comply with regulatory regimes to show they can cope with the risks they face.

Using Azure’s virtually limitless capacity and infrastructure resources, insurance organizations can run their workloads faster and more frequently than on-premises. Cloud compute makes it possible to handle larger peaks at higher frequencies with lower TCO and to access the compute power needed for even the most complex models (for example, on G-series virtual machines). Azure meets a broad set of international and industry compliance standards for risk modeling solutions in insurance.


In this MTC Studio recording we discuss insurance risk modeling scenarios with Jonathan Silverman, Director of Business Development for Financial Services at Microsoft. We cover Azure and hybrid architectures for risk modeling, case studies, partner solutions, and the regulatory compliance of Microsoft Azure.

To access the webcast, you will need to fill out the registration form.


Additional materials:

Risk Modeling Partner Applications:

Risk Modeling Case Studies: