Once the Data Lake storage is mounted, we can read the files present in the Data Lake using Spark DataFrames.
We can also write the output to files and save them into a different folder with a specific name.
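For example, a minimal sketch of that read/write cycle, assuming a mount point of /mnt/datalake, a CSV file with a header, and an amount column (all of these names are placeholders for illustration):

# Read a CSV file from the mounted path into a Spark DataFrame
df = spark.read.option("header", "true").csv("/mnt/datalake/raw/sales.csv")

# Keep only positive amounts and write the result to a different folder
(df.filter(df["amount"] > 0)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("/mnt/datalake/processed/sales_clean"))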
Mounting an Azure Data Lake using OAuth with a Service Principal and Azure Key Vault
This is the most secure way to access the Data Lake.
It avoids storing access keys directly in the code, which is an insecure practice.
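A minimal sketch of such a mount, assuming an ADLS Gen2 account, a Key Vault-backed secret scope named kv-scope, and secrets holding the service principal's client id, client secret and tenant id (every name here is a placeholder):

# OAuth configuration for the service principal; secrets come from Key Vault via the secret scope
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="kv-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/" + dbutils.secrets.get(scope="kv-scope", key="tenant-id") + "/oauth2/token",
}

# Mount the container so it is reachable like a local path under /mnt
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)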
Mounting Azure Data Lake Storage to Azure Databricks
Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.
It is as if we are accessing the Databricks File System (DBFS) directly.
We can mount the file system using dbutils.fs.mount().
We can mount Azure Blob Storage either by using an Account Key or by using a SAS key.
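For example, a sketch of an Account Key based mount, with the key pulled from a secret scope rather than hard coded (container, storage account and secret names are placeholders); for a SAS key the configuration key fs.azure.sas.<container>.<storage-account>.blob.core.windows.net would be used instead:

dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/blobstorage",
    extra_configs={
        # Account Key based authentication; swap in the fs.azure.sas.* key for a SAS token
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="kv-scope", key="storage-account-key")
    },
)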
Widget commands help to add parameters to your notebook. Sometimes values cannot be hard coded;
we need a mechanism to take values from other Azure services dynamically, and for that we have widget commands.
They are:
text
dropdown
combobox
multiselect
Let us see all of these utilities with an example.
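A small sketch of the four widget types, plus reading a value back with dbutils.widgets.get() (the widget names, defaults and choices are made up for illustration):

# Create one widget of each type
dbutils.widgets.text("file_name", "sales.csv", "File name")
dbutils.widgets.dropdown("env", "dev", ["dev", "test", "prod"], "Environment")
dbutils.widgets.combobox("country", "US", ["US", "IN", "UK"], "Country")
dbutils.widgets.multiselect("columns", "id", ["id", "name", "amount"], "Columns")

# Read a widget value inside the notebook
file_name = dbutils.widgets.get("file_name")
print(file_name)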
Azure Key Vault is a service in Azure where we can store our secrets, certificates and keys.
We store the access keys there and link them to Databricks using a Secret Scope, created at:
http://<databricks_Instance_Link>#secrets/createScope
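Once the scope is created, a secret can be read in the notebook without the raw value ever appearing in code (the scope and key names below are placeholders; printed secret values are redacted by Databricks):

# List the available scopes and fetch a secret from the Key Vault-backed scope
dbutils.secrets.listScopes()
storage_key = dbutils.secrets.get(scope="kv-scope", key="storage-account-key")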
Azure Databricks File System
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters.
DBFS is an abstraction on top of scalable object storage.
The default storage location in DBFS is known as the DBFS root.
/FileStore: Imported data files, generated plots, and uploaded libraries.
/databricks-datasets: Sample public datasets.
/databricks-results: Files generated by downloading the full results of a query.
The video below shows an example of importing a CSV file into DBFS and performing some transformations on it.
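As a rough sketch of that flow (the path under /FileStore and the column names salary and department are assumptions):

# Files uploaded through the UI land under /FileStore/tables by default
df = (spark.read.option("header", "true")
                .option("inferSchema", "true")
                .csv("/FileStore/tables/employees.csv"))

# A couple of simple transformations: filter rows, then count per group
result = (df.filter(df["salary"] > 50000)
            .groupBy("department")
            .count())
result.show()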
Azure Databricks
Azure Databricks is a platform that provides computational resources and an integrated interface to write code to perform data transformations.
It saves the time needed to set up environments for Python, R, Scala and SQL; it provides all of them for us with zero configuration.
It contains a Workspace, Cluster and Notebook to write your code.
1. Workspace is an environment provided to you by Azure Databricks
2. Cluster is a set of computation resources and configurations on which we can run workloads
3. Notebook is a web-based interface to a document that contains code, visualizations and narrative text
The video below shows a simple way to create a DataFrame in Azure Databricks.
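For reference, a minimal way to create a DataFrame directly in a notebook, using the spark session that Databricks provides (the sample data is made up):

# Build a small DataFrame from in-memory data and display it
data = [("Alice", 34), ("Bob", 45), ("Carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()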
DERIVED COLUMN Transformation
This helps to modify or generate a new column based on a condition we define. It can also generate a new column based on the data of existing columns.
SORT Transformation
As the name suggests, this helps to sort the data based on the column we provide to it.
The video below has a complete explanation of both, with examples.
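Data Flows are authored graphically, but since they run on Spark, a rough PySpark analogy gives a feel for these two transformations (the toy data and column names are assumptions, not Data Factory syntax):

from pyspark.sql import functions as F

# Toy data standing in for a Data Flow source
df = spark.createDataFrame(
    [("pen", 10, 1.5), ("book", 2, 12.0)],
    ["item", "quantity", "unit_price"])

# Derived Column: compute a new column from existing columns
df = df.withColumn("total_price", F.col("quantity") * F.col("unit_price"))

# Sort: order the rows by the derived column
df.orderBy(F.col("total_price").desc()).show()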
UNION Transformation
This helps to combine multiple streams of data. It is similar to UNION in SQL, but here we can combine N number of streams.
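Again only as a PySpark analogy: unlike a two-table UNION in SQL, several streams with the same schema can be chained together, for example:

# Three streams with the same schema, combined into one
stream_a = spark.createDataFrame([(1, "a")], ["id", "val"])
stream_b = spark.createDataFrame([(2, "b")], ["id", "val"])
stream_c = spark.createDataFrame([(3, "c")], ["id", "val"])

combined = stream_a.union(stream_b).union(stream_c)
combined.show()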
Data flows in Azure Data Factory
Data Flows help to apply transformation logic with a code-free graphical interface.
They help to build the logic with ease.
The video below shows, with an example, how to perform JOIN, SELECT, FILTER and AGGREGATE transformations.
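For intuition, the same four transformations expressed as a PySpark sketch (toy data and column names are assumptions; in Data Factory these steps are configured graphically, not coded):

from pyspark.sql import functions as F

orders = spark.createDataFrame([(1, 101, 250.0), (2, 102, 90.0)],
                               ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([(101, "Alice"), (102, "Bob")],
                                  ["customer_id", "name"])

result = (orders.join(customers, "customer_id")   # JOIN
                .select("name", "amount")         # SELECT
                .filter(F.col("amount") > 100)    # FILTER
                .groupBy("name")                  # AGGREGATE
                .agg(F.sum("amount").alias("total")))
result.show()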
Azure Data Factory - We can incrementally copy files in a folder that were created or modified today to another folder in Azure Data Lake.
Dynamic data loading helps us to load data into resources without the effort of giving hard coded names, by parameterizing the values.
In this video we will see how to load data from different CSV files in different folders into different tables in an Azure SQL Database. We used the Lookup and ForEach activities to achieve the dynamic data load, and the Copy activity to copy the resources.
Tumbling window triggers help to execute a pipeline at a specified interval.
That interval is known as a window. The pipeline will be triggered based on the given recurrence value.
Schedule triggers help to execute a pipeline at a particular scheduled time (Start Time, Recurrence, End Time). We can define an end date for a schedule trigger.
Features:
a. You can specify a specific date, month, or days of the week on which to run this trigger within a given duration.
b. A trigger can be attached to multiple pipelines.
c. Many triggers can also be attached to a single pipeline.
d. We cannot schedule for a past date (which wouldn't make any sense).
Below is a video demonstrating how it works.
Storage event-based triggers help to execute a pipeline when a particular event occurs in the storage account configured for the trigger.
AVRO is a row-based storage format.
PARQUET is a columnar-based storage format.
PARQUET is much better for analytical querying, i.e., reads and querying are much more efficient than writing.
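For example, writing the same DataFrame in both formats from a Databricks notebook (the output paths are placeholders; the Avro source is bundled with Databricks runtimes, while plain Spark may need the external spark-avro package):

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Row-based Avro: suited to write-heavy, record-at-a-time workloads
df.write.format("avro").mode("overwrite").save("/mnt/datalake/out/events_avro")

# Columnar Parquet: suited to analytical reads that scan only a few columns
df.write.mode("overwrite").parquet("/mnt/datalake/out/events_parquet")

# Reading back a single column benefits from Parquet's columnar layout
spark.read.parquet("/mnt/datalake/out/events_parquet").select("val").show()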
Building a pipeline with the Copy activity to extract files from a zipped folder in Azure Data Lake