Azure Databricks

Azure Databricks is a fully managed, cloud-based Big Data and Machine Learning platform, which empowers developers to accelerate AI and innovation. Azure Databricks provides data science and engineering teams with a single platform for big data processing and Machine Learning. The Azure Databricks managed Apache Spark platform makes it simple to run large-scale Spark workloads.

Azure Databricks consists of the following components:

  • Control Plane: Hosts Databricks jobs, notebooks with query results, and the cluster manager. The control plane also contains the web application, Hive metastore, security access control lists (ACLs), and user sessions. These components are managed by Microsoft in collaboration with Databricks and don’t reside within your Azure subscription.
  • Data Plane: Contains all the Azure Databricks runtime clusters that are hosted within the workspace. All data processing and storage take place within the client subscription. No data processing ever takes place within the Microsoft/Databricks-managed subscription.
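
The split described above can be written down as a small lookup table. This is only an illustrative sketch: the component names are informal labels taken from the two bullets, and `Plane`, `COMPONENT_PLANE`, `plane_of` and `in_customer_subscription` are hypothetical helpers made up for this example, not part of any Azure or Databricks API.

```python
from enum import Enum


class Plane(Enum):
    CONTROL = "control plane (Microsoft/Databricks-managed subscription)"
    DATA = "data plane (customer's Azure subscription)"


# Informal labels taken from the description above; these are not
# identifiers from any Azure or Databricks API.
COMPONENT_PLANE = {
    "jobs": Plane.CONTROL,
    "notebooks": Plane.CONTROL,
    "cluster manager": Plane.CONTROL,
    "web application": Plane.CONTROL,
    "hive metastore": Plane.CONTROL,
    "access control lists": Plane.CONTROL,
    "user sessions": Plane.CONTROL,
    "runtime clusters": Plane.DATA,
    "data processing": Plane.DATA,
    "data storage": Plane.DATA,
}


def plane_of(component: str) -> Plane:
    """Return the plane that hosts the given component (case-insensitive)."""
    key = component.strip().lower()
    if key not in COMPONENT_PLANE:
        raise KeyError(f"unknown component: {component!r}")
    return COMPONENT_PLANE[key]


def in_customer_subscription(component: str) -> bool:
    """True if the component lives inside the customer's own subscription."""
    return plane_of(component) is Plane.DATA


if __name__ == "__main__":
    for name in ("Notebooks", "Runtime clusters"):
        print(f"{name}: {plane_of(name).value}")
```

A table like this is handy when reasoning about data residency: anything that maps to the data plane stays inside your own subscription, while control-plane components are operated on your behalf.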

Azure Databricks environments:

  • Databricks SQL – provides an easy-to-use platform for analysts who want to run SQL queries on their data lake.
  • Databricks Data Science & Engineering – an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers.
  • Databricks Machine Learning – an integrated end-to-end machine learning environment. It incorporates managed services for experiment tracking, model training, feature development and management, and feature and model serving.

Stream processing with Azure Databricks

Sample reference architecture which shows an end-to-end stream processing pipeline (see deep dive here):

Reference Materials
