Azure Data Factory is one of the most common data integration (ETL) services used by developers, data engineers, and data scientists worldwide. This cloud-based Microsoft tool gathers raw business data and transforms it into usable information through extraction, transformation, and loading. This article looks at some common Azure Data Factory interview questions. We hope they will help you prepare for your upcoming interview and ace it. Let’s get started.
1. Define Azure Data Factory
Azure Data Factory is a cloud-based data integration service that allows developers to create data-driven workflows. It orchestrates and automates data movement and transformation. Common uses of Azure Data Factory include processing and transforming data through compute services such as Azure Data Lake Analytics, Spark, and HDInsight Hadoop. In short, Azure Data Factory enables data extraction, transformation, and loading.
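In practice, an Azure Data Factory pipeline is authored as a JSON document (the visual designer generates the same JSON behind the scenes). Below is a minimal sketch of a pipeline with a single copy activity; all names are illustrative placeholders, not real resources:

```json
{
  "name": "SamplePipeline",
  "properties": {
    "description": "Copies data from a source dataset to a sink dataset",
    "activities": [
      {
        "name": "CopySourceToSink",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ]
      }
    ]
  }
}
```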
2. Explain What Integration Runtime Is
Integration runtime is the compute infrastructure Azure Data Factory uses to provide data integration capabilities across different network environments. There are three types of integration runtime: the self-hosted integration runtime, the Azure integration runtime, and the Azure-SSIS integration runtime.
3. Define What Self-Hosted Integration And Azure SSIS Integration Runtimes Are
A self-hosted integration runtime uses the same code as the Azure integration runtime. However, it has to be installed on an on-premises machine or a virtual machine inside a virtual network so it can copy data between private and public cloud data stores. The main reason Azure offers self-hosted integration runtimes is that on-premises data stores sit behind a firewall and cannot be accessed directly. This type of runtime can also dispatch transformation activities, provided they run in a private network. The Azure-SSIS integration runtime, on the other hand, allows the execution of SSIS packages in a managed environment and is what you need to lift and shift SSIS packages into the data factory.
4. Differentiate Azure Data Lake From Azure Data Warehouse
The data warehouse is a traditional data storage mechanism that a data lake can complement. However, there are four main differences between the two, which are:
- The data lake complements the data warehouse, while the data warehouse can be sourced from the data lake.
- The data lake uses U-SQL to process any data type, while the data warehouse uses SQL.
- The schemas on the data lake are unstructured and can therefore be defined in several ways, while the ones on the data warehouse exist in a structured form.
- A data lake contains detailed or raw data, which can be in any particular form, while the data in a data warehouse is summarized, refined, and filtered.
5. Walk Us Through How You Normally Create An ETL Process In Azure Data Factory
The ETL process covers the extraction, transformation, and storage of data in Azure Data Factory. Six main steps are required to extract and process data from an Azure SQL database. They are:
- Creating a linked service for the SQL database, which is the source data store
- Creating a dataset for the source data
- Creating a linked service for the Azure Data Lake store, the destination data store in this case.
- Creating a dataset for saving the extracted data
- Creating a reliable pipeline and adding copy activity
- Adding a trigger to schedule the pipeline
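The linked services in the steps above are JSON definitions that hold connection information. As a hedged sketch, the source-side linked service for the SQL database might look like the following, with the server, database, and credential details replaced by placeholders:

```json
{
  "name": "SourceSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<db>;..."
    }
  }
}
```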
6. Mention The Top-Level Concepts Of Azure Data Factory
There are four top-level concepts in Azure Data Factory, namely:
- Linked services, which store important information for connecting to a data source
- Datasets, which are the data sources or data structures for holding data
- Pipelines, which are logical groupings of activities that together perform a unit of work
- Activities, which are the processing steps within a pipeline. A pipeline can contain one or several activities.
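To illustrate how these concepts connect: a dataset definition references a linked service, and a pipeline's activities reference datasets. A sketch of a delimited-text dataset follows; the names, container, and file are placeholders, not real resources:

```json
{
  "name": "SourceDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "SourceBlobLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "data.csv"
      }
    }
  }
}
```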
7. What Are The Security Levels In ADLS Gen2?
There are two levels of security in ADLS Gen2, just like in ADLS Gen1. They include:
- Access Control Lists, popularly known as ACLs. They specify which objects a user can access, read, write, or execute, and will be familiar to Unix and Linux users. It’s also important to note that ACLs are POSIX-compliant.
- Role-Based Access Control (RBAC), which includes built-in owner, contributor, and reader roles as well as custom roles. This level of security specifies who can update storage account settings and properties; using the built-in data explorer tools generally requires at least reader permissions.
8. Walk Us Through How One Can Access Data Using The 80 Dataset Types In Data Factory
There are two main ways of accessing data using the 80 dataset types:
- Using the Copy Activity- The copy activity is generally used to stage data from different connectors before executing a Data flow activity that transforms data once it’s been staged.
- Using Mapping Data Flow- The mapping data flow feature lets you source and sink delimited text and Parquet files from Azure Blob Storage and Azure Data Lake Storage Gen2.
9. Mention How To Schedule A Pipeline And Pass Parameters To A Pipeline Run
The easiest way to schedule a pipeline is with the schedule trigger or the tumbling window trigger, both of which use a wall-clock calendar schedule. A trigger can run pipelines periodically or in calendar-based recurrent patterns. As for the second part of the question, parameters can be passed to a pipeline run because they are top-level, first-class data factory concepts. You define parameters at the pipeline level and pass arguments when executing the pipeline run, either through the trigger or on demand.
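Putting both parts of the answer together, a schedule trigger definition might look like the sketch below: the trigger fires on a wall-clock recurrence and passes an argument for a pipeline-level parameter. The pipeline name, parameter name, and start time are illustrative assumptions:

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "SamplePipeline", "type": "PipelineReference" },
        "parameters": { "sourceFolder": "daily-drop" }
      }
    ]
  }
}
```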
10. Define Blob Storage In Azure
Blob storage is an Azure service that stores huge amounts of unstructured object data, mostly binary and text data. It is mainly used to expose data publicly or store application data privately. Uses of blob storage include storing files for distributed access, storing backup, restore, and disaster recovery data, streaming video and audio, serving documents or images directly to a browser, and storing data meant for analysis by an Azure-hosted or on-premises service.
11. Differentiate Azure Data Lake Store From Blob Storage
There are five main differences between Azure Data Lake Storage and Azure Blob Storage, namely:
- Azure Data Lake Storage is optimized for big data analytics workloads, while Azure Blob Storage is a general-purpose object store that serves many functions, including big data analytics.
- Azure Data Lake Storage has a hierarchical file system, while Azure Blob Storage has a flat namespace.
- Azure Data Lake Storage accounts have folders that store data as files, while Blob Storage accounts have containers that store data as blobs.
- The server-side API of Azure Data Lake Storage is a WebHDFS-compatible REST API, while that of Azure Blob Storage is the Azure Blob Storage REST API.
- Data Lake Storage use cases include interactive, batch, and streaming analytics and machine learning data such as IoT data, log files, large datasets, and clickstreams, while Azure Blob Storage caters to all types of text or binary data.
12. Differentiate Between Azure Integration Runtime And Azure SSIS Integration Runtime
The Azure integration runtime and the Azure-SSIS integration runtime are the two most common integration runtimes. The former copies data between cloud data stores and can dispatch activities to several compute services, including HDInsight, for transformation. The Azure-SSIS integration runtime, on the other hand, natively executes SSIS packages in a managed environment and is normally used to lift and shift SSIS packages into a data factory.
13. Differentiate Between HDInsight And Azure Data Lake Analytics
There are three main differences between HDInsight and Azure Data Lake Analytics, which are:
- While Azure Data Lake Analytics is software as a service (SaaS), HDInsight is platform as a service (PaaS).
- To process a dataset with HDInsight, you must configure the cluster with predefined nodes and then use languages like Hive to process the data. With Azure Data Lake Analytics, you simply pass data-processing queries, and it creates the needed compute nodes on demand based on the given instruction.
- HDInsight is more flexible than Azure Data Lake Analytics since you can configure the cluster yourself.
14. Explain What Mapping Data Flows And Wrangling Data Flows Are In Azure Data Factory
Mapping data flows allow the visual transformation of data without writing any code. You concentrate on the transformation logic and let Azure Data Factory handle the rest, converting that logic into code that runs on scaled-out Azure Databricks clusters. Wrangling data flows, on the other hand, let you visually explore and prepare datasets through the Power Query Online mashup editor. The developer only needs to focus on the modeling and logic while Azure Data Factory translates the M code into Spark.
15. Do You Know How To Create A Data Factory That Copies Data From One Folder To Another?
After you have created a data factory instance and a blob container, three main steps are needed. Go to the Azure portal and do the following:
- On the homepage of Azure Data Factory, start the copy data tool.
- Use Azure blob storage to create a new connection
- Pick the file from the directory and select the destination you would like to copy it to in your blob storage container, and you are done.
Remember to select Monitor on the deployment complete page for pipeline monitoring, as this is usually considered good practice.
16. Mention The Triggers Supported By The Azure Data Factory
Azure Data Factory supports three main trigger types, namely:
- Event-based trigger- This common Azure Data Factory trigger runs data pipelines in response to blob events, such as the creation or deletion of a blob file. It allows the scheduling of data pipeline execution in response to Azure Blob Storage events.
- Schedule trigger- The schedule trigger runs data pipelines on a predefined schedule: every few minutes, hours, days, weeks, or months. It also allows the specification of start and end dates, giving the trigger a running timeline.
- Tumbling window trigger- This type of trigger runs data pipelines at predetermined, fixed-size time intervals. It works better than the schedule trigger when it comes to copying historical data.
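As a hedged example of the first type, a blob event trigger is defined in JSON roughly as follows; the subscription, resource group, storage account, path, and pipeline name are all placeholders:

```json
{
  "name": "OnBlobCreated",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "blobPathBeginsWith": "/input/blobs/"
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "SamplePipeline", "type": "PipelineReference" }
      }
    ]
  }
}
```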
17. Differentiate Mapping Data Flows From Wrangling Data Flows
There are three main differences between the mapping and wrangling data flows, which are:
- File and table management- File management is built into mapping data flows but not into wrangling data flows. It’s also possible to create sink tables with mapping data flows.
- Transformations- Besides grouping, sorting, merging datasets, filtering rows, and renaming columns, mapping data flows also handle upsert, update, insert, and delete transformations.
- Schema drift- Mapping data flows can automatically handle schema drift for frequently changing sources; with wrangling data flows, such changes have to be made manually.
18. Do You Know How To Copy Data From A Text File To An Azure SQL Database Table?
Six steps are involved in copying data from a text file in Azure Blob Storage to an Azure SQL Database table. They are:
- Creating a data factory using simple C# code
- Creating an Azure storage-linked service by adding a specific C# code
- Creating an Azure SQL database linked service through the addition of a specific code
- Creating a dataset for the source Azure blob, representing the source data in blob storage
- Creating a dataset for sink Azure SQL database
- Creating a pipeline with a copy activity, then monitoring the pipeline and activity runs until they complete successfully
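Whether authored through the C# SDK or directly in JSON, the steps above produce the same underlying definitions. A sketch of the resulting copy activity, with a delimited-text source and an Azure SQL sink (the dataset names are assumptions):

```json
{
  "name": "CopyBlobToSql",
  "type": "Copy",
  "inputs": [ { "referenceName": "BlobSourceDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SqlSinkDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" }
  }
}
```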
19. Can You Programmatically Monitor Azure Data Factory Through SDK?
It’s possible to programmatically monitor a pipeline using the REST API, the Azure PowerShell Az module, Python, and .NET. This matters because pipeline runs have different statuses, including Succeeded, Failed, Cancelled, Cancelling, In Progress, and Queued. It’s also worth noting that Data Factory stores pipeline-run data for only 45 days.
20. Mention The Data Integration Capabilities Permitted By The Azure Data Factory Integration Runtime
The Azure Data Factory integration runtime enables four data integration capabilities, which are:
- Dispatch activities- The runtime can dispatch and monitor transformation activities running on other compute services
- Data flow- A data flow can be executed in a managed Azure compute environment
- Data movement- The integration runtime can move data between the source and destination data stores. It supports format conversion, built-in connectors, scalable data transfer, and column mapping.
- SSIS package execution- SSIS packages can be executed natively in managed Azure compute environments. The runtime also supports monitoring and dispatching transformation activities running on services such as SQL Server, Azure HDInsight, and Azure Machine Learning.
21. Mention The Different Azure Data Factory Components
Six common Azure Data Factory components include:
- Trigger- The trigger specifies the time for executing the pipeline
- Mapping data flow- The mapping data flow provides a visual, code-free UI for building data transformation logic
- Pipeline- The pipeline contains the processes, made up of activities
- Control Flow- Control flow dictates the pipeline activities execution flow
- Linked service- The linked service stores the connection information for the data sources used by pipeline activities
- Activity- An activity dictates the execution step in a data factory pipeline
22. Mention The Cross-Platform SDKs Offered By Azure Data Factory V2
Azure Data Factory V2 offers four main options for writing, managing, and monitoring pipelines from the IDE of your choice. These cross-platform SDKs include the Python SDK, the C# (.NET) SDK, PowerShell, and documented REST APIs that can interface with Azure Data Factory V2.
23. Mention The Different Types Of Compute Environment Supported By Azure Data Factory For Executing Transform Activities
The two main types of compute environment supported by Azure Data Factory for executing transform activities are:
- Self-created environment- Here you create and manage the compute environment yourself and use it with Azure Data Factory.
- On-demand compute environment- This is a compute environment fully managed by Azure Data Factory. A cluster is created to execute the transform activity and is automatically removed once the activity completes.
24. Can You Define Azure Data Factory As An ETL Tool?
Yes. Azure Data Factory can be defined as an ETL tool; it is, in fact, one of the best tools for ETL processes. It simplifies data migration without requiring complex algorithms: its data integration process combines data from several sources into a consistent data store before loading it into data warehouses or other target systems.
25. Mention The Steps Involved In The ETL Process
Although ETL stands for extract, transform, and load, the process in Azure Data Factory depends on the following four steps:
- Connect and collect- This step moves data from on-premises and cloud source data stores
- Transform- This second step processes the collected data using compute services such as Spark and HDInsight Hadoop
- Publish- The publish step loads the data into Azure SQL Database, Azure Cosmos DB, or Azure SQL Data Warehouse
- Monitor- This step supports pipeline monitoring through Log Analytics, Azure Monitor, health panels on the Azure portal, the API, and PowerShell
These are some of the most common questions in Azure Data Factory interviews. Take some time to go through them and brainstorm some of the best answers that can impress the interviewers. Additionally, work on your grooming or first impression if you attend a physical interview. We wish you all the best and hope you will ace your upcoming interview.