Automating Data Processing with Azure Data Factory using Python

Introduction: Azure Data Factory is a powerful cloud-based data integration and orchestration service provided by Microsoft Azure. It enables organizations to efficiently manage and process data from various sources. With the help of Azure Data Factory, you can create data-driven workflows, also known as pipelines, to orchestrate and monitor your data processing tasks. In this blog, we will explore how to utilize Python to interact with Azure Data Factory and perform key operations such as creating pipeline runs, retrieving run details, and querying pipeline runs.

Prerequisites: To follow along with the code examples in this blog, make sure you have the following:

  1. An Azure subscription: You need an active Azure subscription to create and manage an Azure Data Factory instance.
  2. Python and Azure SDK: Ensure that Python is installed on your machine, along with the azure-mgmt-datafactory and azure-identity packages. You can install them using pip: pip install azure-mgmt-datafactory azure-identity.

Importing the necessary libraries: Before we dive into interacting with Azure Data Factory, let's import the required libraries:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

Authentication: To authenticate with Azure and access your Data Factory instance, you can use the DefaultAzureCredential class, which supports multiple authentication methods (e.g., Azure CLI, managed identity, service principal).

# Create the Data Factory management client
credential = DefaultAzureCredential()
subscription_id = 'YOUR_SUBSCRIPTION_ID'
data_factory_client = DataFactoryManagementClient(credential, subscription_id)
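If DefaultAzureCredential does not fit your environment, you can authenticate explicitly with a service principal using ClientSecretCredential from the same azure-identity package. A minimal sketch, assuming you have an app registration with the placeholder tenant ID, client ID, and client secret shown below:

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder values - replace with your own app registration details
tenant_id = 'YOUR_TENANT_ID'
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'
subscription_id = 'YOUR_SUBSCRIPTION_ID'

# Authenticate explicitly with a service principal instead of DefaultAzureCredential
credential = ClientSecretCredential(
    tenant_id=tenant_id,
    client_id=client_id,
    client_secret=client_secret
)
data_factory_client = DataFactoryManagementClient(credential, subscription_id)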

Creating a Pipeline Run: To initiate the execution of a pipeline, you can use the create_run() method of the pipelines operations on the Data Factory client. This method requires the name of the resource group, the Data Factory name, and the pipeline name.

resource_group_name = 'my-resource-group'
data_factory_name = 'my-data-factory'
pipeline_name = 'my-pipeline'

# Trigger a new run of the pipeline (note: create_run() lives on the
# pipelines operations, not on pipeline_runs)
run_response = data_factory_client.pipelines.create_run(
    resource_group_name=resource_group_name,
    factory_name=data_factory_name,
    pipeline_name=pipeline_name
)
run_id = run_response.run_id
print("Pipeline run created with ID:", run_id)
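If your pipeline defines parameters, create_run() also accepts a parameters dictionary whose keys match the pipeline's parameter names. A brief sketch, assuming a pipeline that declares parameters named input_path and batch_size (hypothetical names used for illustration):

# Sketch: passing runtime parameters, assuming the pipeline declares
# parameters named 'input_path' and 'batch_size' (hypothetical names)
run_response = data_factory_client.pipelines.create_run(
    resource_group_name=resource_group_name,
    factory_name=data_factory_name,
    pipeline_name=pipeline_name,
    parameters={
        "input_path": "raw/2023/05/",
        "batch_size": 500
    }
)
print("Parameterized run created with ID:", run_response.run_id)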

Retrieving Run Details: To retrieve the details of a specific pipeline run, you can use the get() method of the pipeline_runs operations. This method requires the resource group name, Data Factory name, and the run ID.

pipeline_run = data_factory_client.pipeline_runs.get(
    resource_group_name=resource_group_name,
    factory_name=data_factory_name,
    run_id=run_id
)
print("Run ID:", pipeline_run.run_id)
print("Status:", pipeline_run.status)
print("Start Time:", pipeline_run.run_start)
print("End Time:", pipeline_run.run_end)
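Because a pipeline run executes asynchronously, a common pattern is to poll get() until the run reaches a terminal state. Here is a minimal polling sketch; the terminal status names and the 30-second poll interval are assumptions you can tune to your workloads:

import time

# Poll the run status until it reaches a terminal state
# (terminal status names and poll interval are assumptions - adjust as needed)
terminal_statuses = {"Succeeded", "Failed", "Cancelled"}
while True:
    pipeline_run = data_factory_client.pipeline_runs.get(
        resource_group_name=resource_group_name,
        factory_name=data_factory_name,
        run_id=run_id
    )
    if pipeline_run.status in terminal_statuses:
        break
    print("Current status:", pipeline_run.status, "- waiting...")
    time.sleep(30)

print("Final status:", pipeline_run.status)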

Querying Pipeline Runs:

To query pipeline runs in Azure Data Factory using Python, you can use the query_by_factory() method of the pipeline_runs operations. This method takes a RunFilterParameters object that specifies a last-updated time window and optional filters, allowing you to retrieve a list of pipeline runs based on criteria such as status.

Here's an example of how to query pipeline runs in Azure Data Factory using Python:

from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

# Create the Data Factory management client
credential = DefaultAzureCredential()
subscription_id = 'YOUR_SUBSCRIPTION_ID'
data_factory_client = DataFactoryManagementClient(credential, subscription_id)

# Define the query parameters
resource_group_name = 'my-resource-group'
data_factory_name = 'my-data-factory'
query_start_time = datetime(2023, 5, 1, 0, 0, 0, tzinfo=timezone.utc)   # Start of the time window
query_end_time = datetime(2023, 5, 31, 23, 59, 59, tzinfo=timezone.utc) # End of the time window

# Build the filter: runs updated within the window with status 'Succeeded'
filter_parameters = RunFilterParameters(
    last_updated_after=query_start_time,
    last_updated_before=query_end_time,
    filters=[
        RunQueryFilter(
            operand="Status",
            operator="Equals",
            values=["Succeeded"]
        )
    ]
)

# Query pipeline runs
pipeline_runs = data_factory_client.pipeline_runs.query_by_factory(
    resource_group_name=resource_group_name,
    factory_name=data_factory_name,
    filter_parameters=filter_parameters
)

# Print the details of the queried pipeline runs
for run in pipeline_runs.value:
    print("Run ID:", run.run_id)
    print("Status:", run.status)
    print("Start Time:", run.run_start)
    print("End Time:", run.run_end)
    print("---")

Make sure to replace 'YOUR_SUBSCRIPTION_ID', 'my-resource-group', and 'my-data-factory' with your own values. Also, adjust the query_start_time and query_end_time values to specify the desired time range for the query.

In the example above, we are querying pipeline runs that have a status of "Succeeded" within the specified time range. You can modify the filters to query runs based on different criteria such as status, activity name, or any other relevant properties.
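As a sketch of one such variation, the snippet below filters on pipeline name rather than status and sorts the results by start time, newest first; it assumes the RunQueryOrderBy model from the same models module, reuses the time window defined above, and uses 'my-pipeline' as a placeholder name:

from azure.mgmt.datafactory.models import (
    RunFilterParameters,
    RunQueryFilter,
    RunQueryOrderBy
)

# Filter on pipeline name instead of status, newest runs first
# ('my-pipeline' is a placeholder pipeline name)
filter_parameters = RunFilterParameters(
    last_updated_after=query_start_time,
    last_updated_before=query_end_time,
    filters=[
        RunQueryFilter(
            operand="PipelineName",
            operator="Equals",
            values=["my-pipeline"]
        )
    ],
    order_by=[
        RunQueryOrderBy(order_by="RunStart", order="DESC")
    ]
)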

By leveraging the query_by_factory() method in Python, you can retrieve pipeline run information programmatically and perform further analysis or reporting on your data integration and processing workflows in Azure Data Factory.
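One practical detail: for factories with many runs, query_by_factory() may return results in pages, and the response carries a continuation_token that you pass back on the next call. A minimal paging sketch under that assumption, reusing the variables defined above:

# Collect all matching runs across result pages using the continuation token
all_runs = []
continuation_token = None
while True:
    page = data_factory_client.pipeline_runs.query_by_factory(
        resource_group_name=resource_group_name,
        factory_name=data_factory_name,
        filter_parameters=RunFilterParameters(
            last_updated_after=query_start_time,
            last_updated_before=query_end_time,
            continuation_token=continuation_token
        )
    )
    all_runs.extend(page.value)
    continuation_token = page.continuation_token
    if not continuation_token:
        break

print("Total runs found:", len(all_runs))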


To query activity runs within a specific pipeline run in Azure Data Factory using Python, you can utilize the query_by_pipeline_run() method of the activity_runs operations. This method takes the pipeline run ID together with a RunFilterParameters object specifying a last-updated time window, and returns the activity runs belonging to that pipeline run.

Here's an example of how to query activity runs in Azure Data Factory using Python:

from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Create the Data Factory management client
credential = DefaultAzureCredential()
subscription_id = 'YOUR_SUBSCRIPTION_ID'
data_factory_client = DataFactoryManagementClient(credential, subscription_id)

# Define the query parameters
resource_group_name = 'my-resource-group'
data_factory_name = 'my-data-factory'
pipeline_run_id = '12345678-1234-5678-1234-1234567890AB'  # Replace with the pipeline run ID

# The API requires a last-updated time window covering the pipeline run
filter_parameters = RunFilterParameters(
    last_updated_after=datetime(2023, 5, 1, tzinfo=timezone.utc),
    last_updated_before=datetime(2023, 5, 31, tzinfo=timezone.utc)
)

# Query activity runs
activity_runs = data_factory_client.activity_runs.query_by_pipeline_run(
    resource_group_name=resource_group_name,
    factory_name=data_factory_name,
    run_id=pipeline_run_id,
    filter_parameters=filter_parameters
)

# Print the details of the queried activity runs
for activity_run in activity_runs.value:
    print("Activity Run ID:", activity_run.activity_run_id)
    print("Activity Name:", activity_run.activity_name)
    print("Status:", activity_run.status)
    print("Start Time:", activity_run.activity_run_start)
    print("End Time:", activity_run.activity_run_end)
    print("---")

Make sure to replace 'YOUR_SUBSCRIPTION_ID', 'my-resource-group', 'my-data-factory', and '12345678-1234-5678-1234-1234567890AB' with your own values. The pipeline_run_id variable should contain the ID of the pipeline run for which you want to query activity runs, and the filter window should cover the time when that run executed.

In the example above, we retrieve activity runs within a specific pipeline run and print their details, including the activity run ID, activity name, status, start time, and end time. You can perform further analysis or processing on the activity run data based on your specific requirements.
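For example, a common follow-up is to surface the errors from failed activities; each activity run exposes an error property you can inspect. A brief sketch, reusing the query result from above:

# Report only the activities that failed, along with their error payloads
for activity_run in activity_runs.value:
    if activity_run.status == "Failed":
        print("Failed activity:", activity_run.activity_name)
        print("Error details:", activity_run.error)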

By utilizing the query_by_pipeline_run() method in Python, you can programmatically access activity run information and gain insights into the execution and status of individual activities within your Azure Data Factory pipelines.


Happy Learning!! Happy Coding!!
