Ingest Data with Dataflows Gen2 in Microsoft Fabric

Microsoft Fabric offers a unified solution for data engineering, integration, and analytics. A crucial step in end-to-end analytics is data ingestion. Dataflows (Gen2) are used to ingest and transform data from multiple sources and then land the cleansed data in another destination. They can be incorporated into data pipelines for additional data movement and used as a data source in Power BI.

Imagine you work for a retail company with stores across the globe. As a data engineer, you need to prepare and transform data from various sources into a format that is suitable for data analysis and reporting. The business requests a data model that consolidates disparate data sources from the different stores. Dataflows (Gen2) allow you to prepare the data to ensure consistency, and then stage the data in the preferred destination. They also enable reuse and make it easy to update the data. Without a dataflow, you’d have to manually extract and transform the data from every source, which is time-consuming and prone to errors.

In this module, we explain how to use Dataflows (Gen2) in Microsoft Fabric to meet your data ingestion needs.

Dataflow in Microsoft Fabric

Understand Dataflows (Gen2) in Microsoft Fabric

In our scenario, you need to develop a data model that can standardize the data and provide access to the business. By using Dataflows (Gen2), you can connect to the various data sources, and then prep and transform the data. To allow access, you can land the data directly into your Lakehouse or use a data pipeline for other destinations.

What is a dataflow?

Dataflows are a type of cloud-based ETL (Extract, Transform, Load) tool for building and executing scalable data transformation processes.

Dataflows (Gen2) allow you to extract data from various sources, transform it using a wide range of transformation operations, and load it into a destination. Power Query Online also provides a visual interface for performing these tasks.

Fundamentally, a dataflow encapsulates all of the transformation logic, which reduces data prep time. Its output can then be loaded into a new table, included in a Data Pipeline, or used as a data source by data analysts.
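
Behind the visual editor, each dataflow query is defined in Power Query's M language. As a rough sketch of the extract and transform steps of a single query (the CSV URL and column names here are made up for illustration):

    let
        // Extract: read a CSV file over HTTP (hypothetical URL)
        Source = Csv.Document(
            Web.Contents("https://example.com/sales.csv"),
            [Delimiter = ",", Encoding = 65001]
        ),
        // Promote the first row to column headers
        Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        // Transform: assign types, then drop cancelled orders
        Typed = Table.TransformColumnTypes(
            Promoted,
            {{"OrderID", Int64.Type}, {"Amount", type number}, {"Status", type text}}
        ),
        Result = Table.SelectRows(Typed, each [Status] <> "Cancelled")
    in
        Result

Each named step corresponds to an applied step you'd see in the Power Query Online editor.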

How to use Dataflows (Gen2)

Traditionally, data engineers spend significant time extracting, transforming, and loading data into a consumable format for downstream analytics. The goal of Dataflows (Gen2) is to provide an easy, reusable way to perform ETL tasks using Power Query Online.

If you choose to use only a Data Pipeline, you copy the data and then use your preferred coding language to extract, transform, and load it. Alternatively, you can first create a Dataflow (Gen2) to extract and transform the data, and then load it into a Lakehouse or another destination. The business can then easily consume the curated data model.
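
To make the retail scenario concrete, an M query that consolidates two store databases before landing the result in the Lakehouse might look like the following sketch (the server, database, table, and column names are all assumptions):

    let
        // Hypothetical store databases with identical Orders schemas
        StoreEU = Sql.Database("eu-sql.example.com", "SalesDB"){[Schema = "dbo", Item = "Orders"]}[Data],
        StoreUS = Sql.Database("us-sql.example.com", "SalesDB"){[Schema = "dbo", Item = "Orders"]}[Data],
        // Tag each row with its region so the origin survives the union
        TaggedEU = Table.AddColumn(StoreEU, "Region", each "EU", type text),
        TaggedUS = Table.AddColumn(StoreUS, "Region", each "US", type text),
        // Union the stores into one consolidated table for the destination
        Combined = Table.Combine({TaggedEU, TaggedUS})
    in
        Combined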

Adding a data destination to your dataflow is optional, and the dataflow preserves all transformation steps. To perform other tasks or load data to a different destination after transformation, create a Data Pipeline and add the Dataflow (Gen2) activity to your orchestration.

Another option is to use a Data Pipeline and a Dataflow (Gen2) together for an ELT (Extract, Load, Transform) process. In this order, you'd use a Pipeline to extract and load the data into your preferred destination, such as the Lakehouse. Then you'd create a Dataflow (Gen2) that connects to the Lakehouse data to cleanse and transform it. In this case, you'd offer the Dataflow as a curated dataset for data analysts to develop reports.
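
In the ELT variant, the dataflow's source is the Lakehouse itself. A sketch using the Lakehouse connector in Power Query Online (the GUID placeholders, table name, and columns are assumptions):

    let
        // Connect to Fabric Lakehouses via the Lakehouse connector
        Source = Lakehouse.Contents(null),
        // Navigate to a hypothetical workspace, lakehouse, and raw table
        Workspace = Source{[workspaceId = "<workspace-guid>"]}[Data],
        Sales = Workspace{[lakehouseId = "<lakehouse-guid>"]}[Data],
        RawOrders = Sales{[Id = "raw_orders", ItemKind = "Table"]}[Data],
        // Transform the already-loaded data: deduplicate, keep completed orders
        Deduped = Table.Distinct(RawOrders, {"OrderID"}),
        Completed = Table.SelectRows(Deduped, each [Status] = "Completed")
    in
        Completed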

Dataflows can also be horizontally partitioned. Once you create a global dataflow, data analysts can build specialized dataflows on top of it to create datasets for specific needs.
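
Because one query can reference another by name, a specialized dataflow can simply build on the global one. A minimal sketch, where GlobalSales is a hypothetical query name and the Region, StoreID, and Amount columns carry over from the earlier example:

    let
        // Reference the global dataflow's consolidated query by name (hypothetical)
        Source = GlobalSales,
        // Specialize: keep only the EU region for that analyst team
        EUOnly = Table.SelectRows(Source, each [Region] = "EU"),
        // Pre-aggregate to a per-store summary for reporting
        Summary = Table.Group(
            EUOnly,
            {"StoreID"},
            {{"TotalAmount", each List.Sum([Amount]), type number}}
        )
    in
        Summary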

Dataflows let you promote reusable ETL logic, removing the need to create additional connections to your data source. They offer a wide variety of transformations and can be run manually, on a refresh schedule, or as part of a Data Pipeline orchestration.

Benefits and limitations

There’s more than one way to perform ETL or ELT on data in Microsoft Fabric. Consider the benefits and limitations of using Dataflows (Gen2).

Benefits:

  • Extend data with consistent data, such as a standard date dimension table (see the M sketch after this list).
  • Allow self-service users access to a subset of the data warehouse separately.
  • Optimize performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources.
  • Reduce data source complexity by exposing only the dataflows, not the underlying sources, to larger analyst groups.
  • Ensure consistency and quality of data by enabling users to clean and transform data before loading it to a destination.
  • Simplify data integration by providing a low-code interface that ingests data from various sources.
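
To illustrate the first benefit, a standard date dimension is a classic reusable query to keep in a dataflow. A minimal M sketch (the fixed date range is an assumption; in practice you'd derive it from your fact data):

    let
        // Hypothetical fixed range; derive from fact data in practice
        StartDate = #date(2023, 1, 1),
        EndDate = #date(2025, 12, 31),
        DayCount = Duration.Days(EndDate - StartDate) + 1,
        // Generate one row per calendar day
        Dates = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
        AsTable = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),
        Typed = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
        // Add standard dimension attributes
        WithYear = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
        WithMonth = Table.AddColumn(WithYear, "Month", each Date.Month([Date]), Int64.Type),
        WithMonthName = Table.AddColumn(WithMonth, "MonthName", each Date.MonthName([Date]), type text)
    in
        WithMonthName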

Limitations:

  • Not a replacement for a data warehouse.
  • Row-level security isn’t supported.
  • Fabric capacity workspace is required.