Once the Data Lake storage is mounted, we can read the files present in the Data Lake using Spark DataFrames.
We can also write the output to files and save them into a different folder with a specific name.
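For example, a minimal sketch of that read/write cycle, assuming a mount point of /mnt/datalake, a CSV file with a header, and an amount column (all of these names are placeholders for illustration):

# Read a CSV file from the mounted path into a Spark DataFrame
df = spark.read.option("header", "true").csv("/mnt/datalake/raw/sales.csv")

# Keep only positive amounts and write the result to a different folder
(df.filter(df["amount"] > 0)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("/mnt/datalake/processed/sales_clean"))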
Mounting an Azure Data Lake using OAuth with a Service Principal and Azure Key Vault
This is the most secure way to access the Data Lake.
It avoids storing access keys directly in the code, which is an insecure practice.
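A minimal sketch of such a mount, assuming an ADLS Gen2 account, a Key Vault-backed secret scope named kv-scope, and secrets holding the service principal's client id, client secret and tenant id (every name here is a placeholder):

# OAuth configuration for the service principal; secrets come from Key Vault via the secret scope
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="kv-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/" + dbutils.secrets.get(scope="kv-scope", key="tenant-id") + "/oauth2/token",
}

# Mount the container so it is reachable like a local path under /mnt
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)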
Mounting Azure Data Lake Storage to Azure Databricks
Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.
It is as if we are accessing the Databricks File System (DBFS) directly.
We can mount the file system using dbutils.fs.mount().
We can mount Azure Blob Storage either by using an Account Key or by using a SAS key.
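For example, a sketch of an Account Key based mount, with the key pulled from a secret scope rather than hard coded (container, storage account and secret names are placeholders); for a SAS key the configuration key fs.azure.sas.<container>.<storage-account>.blob.core.windows.net would be used instead:

dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/blobstorage",
    extra_configs={
        # Account Key based authentication; swap in the fs.azure.sas.* key for a SAS token
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="kv-scope", key="storage-account-key")
    },
)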
Widget commands help to add parameters to your notebook. Sometimes values cannot be hard coded;
we need a mechanism to take values from other Azure services dynamically, and for that we have widget commands.
They are:
text
dropdown
combobox
multiselect
Let us see all of these utilities with an example.
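A small sketch of the four widget types, plus reading a value back with dbutils.widgets.get() (the widget names, defaults and choices are made up for illustration):

# Create one widget of each type
dbutils.widgets.text("file_name", "sales.csv", "File name")
dbutils.widgets.dropdown("env", "dev", ["dev", "test", "prod"], "Environment")
dbutils.widgets.combobox("country", "US", ["US", "IN", "UK"], "Country")
dbutils.widgets.multiselect("columns", "id", ["id", "name", "amount"], "Columns")

# Read a widget value inside the notebook
file_name = dbutils.widgets.get("file_name")
print(file_name)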
Azure Key Vault is a service in Azure where we can store our secrets, certificates and keys.
We store the access keys there and link them to Databricks using a Secret Scope, created at:
http://<databricks_Instance_Link>#secrets/createScope
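Once the scope is created, a secret can be read in the notebook without the raw value ever appearing in code (the scope and key names below are placeholders; printed secret values are redacted by Databricks):

# List the available scopes and fetch a secret from the Key Vault-backed scope
dbutils.secrets.listScopes()
storage_key = dbutils.secrets.get(scope="kv-scope", key="storage-account-key")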
Azure Databricks File System
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters.
DBFS is an abstraction on top of scalable object storage.
The default storage location in DBFS is known as the DBFS root.
/FileStore: Imported data files, generated plots, and uploaded libraries.
/databricks-datasets: Sample public datasets.
/databricks-results: Files generated by downloading the full results of a query.
The video below shows an example of importing a CSV file into DBFS and performing some transformations on it.
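As a rough sketch of that flow (the path under /FileStore and the column names salary and department are assumptions):

# Files uploaded through the UI land under /FileStore/tables by default
df = (spark.read.option("header", "true")
                .option("inferSchema", "true")
                .csv("/FileStore/tables/employees.csv"))

# A couple of simple transformations: filter rows, then count per group
result = (df.filter(df["salary"] > 50000)
            .groupBy("department")
            .count())
result.show()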
Azure Databricks
Azure Databricks is a platform that provides computational resources and an integrated interface to write code to perform data transformations.
It saves the time needed to set up environments for Python, R, Scala and SQL; it provides all of them for us with zero configuration.
It contains a Workspace, Cluster and Notebook to write your code.
1. Workspace is an environment provided to you by Azure Databricks
2. Cluster is a set of computation resources and configurations on which we can run workloads
3. Notebook is a web-based interface to a document that contains code, visualizations and narrative text
The video below shows a simple way to create a DataFrame in Azure Databricks.
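For reference, a minimal way to create a DataFrame directly in a notebook, using the spark session that Databricks provides (the sample data is made up):

# Build a small DataFrame from in-memory data and display it
data = [("Alice", 34), ("Bob", 45), ("Carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()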
DERIVED COLUMN Transformation
This helps to modify or generate a new column based on a condition we define. It can also generate a new column based on the data of existing columns.
SORT Transformation
As the name suggests, this helps to sort the data based on the column we provide to it.
The video below has a complete explanation of both, with examples.
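Data Flows are authored graphically, but since they run on Spark, a rough PySpark analogy gives a feel for these two transformations (the toy data and column names are assumptions, not Data Factory syntax):

from pyspark.sql import functions as F

# Toy data standing in for a Data Flow source
df = spark.createDataFrame(
    [("pen", 10, 1.5), ("book", 2, 12.0)],
    ["item", "quantity", "unit_price"])

# Derived Column: compute a new column from existing columns
df = df.withColumn("total_price", F.col("quantity") * F.col("unit_price"))

# Sort: order the rows by the derived column
df.orderBy(F.col("total_price").desc()).show()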
UNION Transformation
This helps to combine multiple streams of data. It is similar to UNION in SQL, but here we can combine N number of streams.
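Again only as a PySpark analogy: unlike a two-table UNION in SQL, several streams with the same schema can be chained together, for example:

# Three streams with the same schema, combined into one
stream_a = spark.createDataFrame([(1, "a")], ["id", "val"])
stream_b = spark.createDataFrame([(2, "b")], ["id", "val"])
stream_c = spark.createDataFrame([(3, "c")], ["id", "val"])

combined = stream_a.union(stream_b).union(stream_c)
combined.show()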
Data flows in Azure Data Factory
Data Flows help to apply transformation logic with a code-free graphical interface.
They help to build the logic with ease.
The video below shows, with an example, how to perform JOIN, SELECT, FILTER and AGGREGATE transformations.
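For intuition, the same four transformations expressed as a PySpark sketch (toy data and column names are assumptions; in Data Factory these steps are configured graphically, not coded):

from pyspark.sql import functions as F

orders = spark.createDataFrame([(1, 101, 250.0), (2, 102, 90.0)],
                               ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([(101, "Alice"), (102, "Bob")],
                                  ["customer_id", "name"])

result = (orders.join(customers, "customer_id")   # JOIN
                .select("name", "amount")         # SELECT
                .filter(F.col("amount") > 100)    # FILTER
                .groupBy("name")                  # AGGREGATE
                .agg(F.sum("amount").alias("total")))
result.show()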
Azure Data Factory - We can incrementally copy files in a folder that were created or modified today to another folder in Azure Data Lake.
Dynamic data loading helps us to load data into resources without the effort of giving hard coded names, by parameterizing the values.
In this video we will see how to load data from different CSV files in different folders into different tables in an Azure SQL Database. We used the Lookup and ForEach activities to achieve the dynamic data load, and the Copy activity to copy the resources.
Tumbling window triggers help to execute a pipeline at a specified interval.
That interval is known as a window. The pipeline will be triggered based on the given recurrence value.
Schedule triggers help to execute a pipeline at a particular scheduled time (Start Time, Recurrence, End Time). We can define an end date for a schedule trigger.
Features:
a. You can specify a specific date, month, or days of the week on which to run this trigger within a given duration.
b. A trigger can be attached to multiple pipelines.
c. Many triggers can also be attached to a single pipeline.
d. We cannot schedule for a past date (which wouldn't make any sense).
Below is a video demonstrating how it works.
Storage event-based triggers help to execute a pipeline when a particular event occurs in the storage account configured for the trigger.
AVRO is a row-based storage format.
PARQUET is a columnar-based storage format.
PARQUET is much better for analytical querying, i.e., reads and querying are much more efficient than writing.
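For example, writing the same DataFrame in both formats from a Databricks notebook (the output paths are placeholders; the Avro source is bundled with Databricks runtimes, while plain Spark may need the external spark-avro package):

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Row-based Avro: suited to write-heavy, record-at-a-time workloads
df.write.format("avro").mode("overwrite").save("/mnt/datalake/out/events_avro")

# Columnar Parquet: suited to analytical reads that scan only a few columns
df.write.mode("overwrite").parquet("/mnt/datalake/out/events_parquet")

# Reading back a single column benefits from Parquet's columnar layout
spark.read.parquet("/mnt/datalake/out/events_parquet").select("val").show()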
Building a pipeline with the Copy activity to extract files from a zipped folder in Azure Data Lake