Data Factory is a fully managed, highly available, highly scalable, and easy-to-use service for creating integration solutions and implementing ETL phases. Data Factory lets you build pipelines through a drag-and-drop user interface without writing any code; it also lets you write code in your preferred language when you need to.
There are a few important concepts to learn about before using the Data Factory service, which we will be looking into in the following sections:
- Activities: Activities are the individual tasks that execute and process logic within a Data Factory pipeline. There are multiple types of activities: data movement activities, data transformation activities, and control activities. Each activity has a policy that controls its retry behavior, including the number of retries and the interval between them.
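As an illustration, the retry policy lives inside the activity's definition. The sketch below expresses an activity as a Python dict mirroring the JSON that Data Factory pipelines use; the activity name is hypothetical and the source/sink details are omitted:

```python
# Sketch of a Data Factory activity definition, expressed as a Python dict
# that mirrors the pipeline JSON. The name "CopyStagingToWarehouse" is
# hypothetical; source and sink settings are omitted for brevity.
copy_activity = {
    "name": "CopyStagingToWarehouse",
    "type": "Copy",                       # a data movement activity
    "policy": {
        "timeout": "0.12:00:00",          # give up after 12 hours
        "retry": 3,                       # retry up to 3 times on failure
        "retryIntervalInSeconds": 30,     # wait 30 seconds between retries
    },
}

print(copy_activity["policy"]["retry"])   # → 3
```

With this policy, a transient failure (for example, a brief network outage) causes the activity to be re-run automatically instead of failing the whole pipeline.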
- Pipelines: Pipelines in Data Factory are groups of activities and are responsible for bringing those activities together. Pipelines are the workflows and orchestrators that execute the ETL phases. They weave activities together and let you declare dependencies between them; using these dependencies, some activities can run in parallel while others run in sequence.
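A minimal sketch of a pipeline grouping two activities, again as a Python dict mirroring the pipeline JSON (all names are hypothetical). The second activity declares a dependency on the first, so it runs only after the first succeeds; an activity with no `dependsOn` entry is free to run in parallel:

```python
# Hypothetical two-activity pipeline: "LoadWarehouse" waits for
# "CopyStaging" to succeed before it starts.
pipeline = {
    "name": "NightlyEtlPipeline",
    "properties": {
        "activities": [
            {"name": "CopyStaging", "type": "Copy"},
            {
                "name": "LoadWarehouse",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {
                        "activity": "CopyStaging",
                        "dependencyConditions": ["Succeeded"],
                    }
                ],
            },
        ]
    },
}

# Activities that nothing depends on, and that depend on nothing,
# are candidates for parallel execution.
independent = [a["name"] for a in pipeline["properties"]["activities"]
               if not a.get("dependsOn")]
print(independent)   # → ['CopyStaging']
```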
- Datasets: Datasets are the sources and destinations of data. These could be Azure Storage accounts, Data Lake Storage, or a host of other sources.
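A dataset definition points at concrete data and references a linked service for connectivity. The sketch below, with hypothetical names, shows a delimited-text (CSV) file in Blob storage expressed as a Python dict mirroring the dataset JSON:

```python
# Hypothetical dataset: a CSV file in a Blob storage container.
# "StorageLinkedService" is assumed to be a separately defined
# linked service; the container and file name are placeholders.
dataset = {
    "name": "StagingSalesDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "staging",
                "fileName": "sales.csv",
            }
        },
    },
}

print(dataset["properties"]["linkedServiceName"]["referenceName"])
```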
- Linked services: These contain the connection and connectivity information for datasets and are used by individual activities to connect to those data stores. They are similar to the connection strings you might already use to connect to data stores, with the additional responsibility of actually making the connection.
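The dataset above would rely on a linked service such as the following sketch, again as a Python dict mirroring the linked service JSON. The name is hypothetical and the connection string is a placeholder; in practice, secrets would come from a secure store rather than being embedded inline:

```python
# Hypothetical linked service for a Blob storage account. The
# connection string is a placeholder; real credentials should be
# kept in a secret store, not in the definition itself.
linked_service = {
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": (
                "DefaultEndpointsProtocol=https;"
                "AccountName=<account>;AccountKey=<key>"
            )
        },
    },
}

print(linked_service["properties"]["type"])   # → AzureBlobStorage
```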
- Integration runtime: The main engine responsible for the execution of Data Factory is called the integration runtime. The integration runtime is available in the following three configurations:
- Azure: In this configuration, the data factory executes on compute resources provided by Azure.
- Self-hosted: In this configuration, the data factory executes on compute resources that you bring yourself. These could be on-premises servers or cloud-based virtual machines.
- Azure-SSIS: This configuration allows the execution of traditional SQL Server Integration Services (SSIS) packages written using SQL Server.
- Versions: Data Factory is available in two versions, V1 and V2. It is important to understand that all new development happens on V2, and that V1 will stay as is or be retired at some point. V2 is preferred for the following reasons:
- It can execute SQL Server Integration Services (SSIS) packages.
- It has enhanced functionalities compared to V1.
- It comes with enhanced monitoring, which is missing in V1.