Extract, Transform, Load (ETL)

Extract, Transform, Load (ETL) is the data integration process used to combine data from multiple sources into a single, centralized repository.

What is Extract, Transform, Load?

Extract, Transform, Load (ETL) is the three-stage data integration process used to combine structured and unstructured data from multiple sources into a single, centralized repository, such as a data warehouse, data mart, or data lake. The three steps involved are:

  • Extract: Raw data is accessed and moved from its source system to a staging area
  • Transform: In the staging area, business rules are applied to cleanse, transform and organize the data to ensure quality and consistency
  • Load: The data is then added to the repository
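As a minimal illustration of these three steps, the sketch below chains them together in Python. The CSV source file, the cleansing rules and the SQLite target are hypothetical placeholders standing in for real source systems and a real data warehouse, not part of any specific ETL tool.

```python
import csv
import sqlite3

def extract(path):
    """Extract: copy raw rows from a hypothetical CSV source into a staging list."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: apply simple cleansing and standardization rules in the staging area."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):                     # drop records missing a key field
            continue
        row["country"] = row["country"].strip().upper()    # standardize a country code
        row["amount"] = float(row["amount"] or 0)          # enforce a numeric type
        cleaned.append(row)
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: add the cleaned rows to a centralized repository (SQLite stands in here)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, country TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(r["customer_id"], r["country"], r["amount"]) for r in rows],
    )
    con.commit()
    con.close()

load(transform(extract("sales_export.csv")))
```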

Carried out using software tools, the ETL process can be automated to run on a regular basis through batch updates, or in real time when data in a source system changes. These ETL pipelines (or data pipelines) bring together tools and activities to automate data integration, increasing efficiency and enabling them to be replicated, monitored and improved.
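As a simple sketch of batch automation, assuming a hypothetical run_pipeline function wraps the extract, transform and load steps above, a pipeline might be re-run on a fixed schedule. In practice this role is usually played by a scheduler such as cron or an orchestration tool rather than a loop like this.

```python
import time
from datetime import datetime

def run_pipeline():
    # Placeholder wrapping the extract -> transform -> load steps sketched above.
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] ETL batch run completed")

# Minimal batch scheduler: re-run the whole pipeline once every 24 hours.
while True:
    run_pipeline()
    time.sleep(24 * 60 * 60)
```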

ETL vs ELT

ETL has been in use since the 1970s. More recently it has been joined by Extract, Load, and Transform (ELT). As the name suggests, ELT reverses the order of the last two steps: data is loaded into the target repository first and then transformed there. This removes the need for a staging area and adds flexibility to data analytics, which means ELT works well for high-volume, unstructured big data, such as that stored in data lakes.

However, ELT requires the central repository to have both data mapping capabilities and access to sufficient processing power to carry out the transformation step efficiently. This makes it best deployed in the cloud, which provides the scalable, on-demand computing power required for these transformations.
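To make the difference in ordering concrete, the sketch below follows the ELT pattern: raw records are loaded into the repository first and then transformed in place using the repository's own compute. Plain SQL on SQLite is used here as a stand-in, and the table names and rules are illustrative assumptions; in practice this would run in a cloud data warehouse.

```python
import sqlite3

con = sqlite3.connect("analytics.db")   # stand-in for a cloud data warehouse

# Load: raw, untransformed records land directly in the target repository.
con.execute("CREATE TABLE IF NOT EXISTS raw_sales (customer_id TEXT, country TEXT, amount TEXT)")
con.execute("INSERT INTO raw_sales VALUES ('42', ' fr ', '19.90')")

# Transform: cleansing and standardization happen afterwards, inside the repository itself.
con.execute("""
    CREATE TABLE IF NOT EXISTS sales AS
    SELECT customer_id,
           UPPER(TRIM(country)) AS country,
           CAST(amount AS REAL) AS amount
    FROM raw_sales
    WHERE customer_id IS NOT NULL
""")
con.commit()
con.close()
```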

Why is ETL important?

Structured and unstructured data is now produced by multiple systems inside and outside the organization, in a large number of different formats. To create value, this data needs to be made widely available, especially to underpin business intelligence and enable better, data-driven decision making. This requires it to be centralized and standardized to ensure accessibility, quality and governance. Extract, Transform, Load processes enable this, making them central to:

  • Effective, informed data analysis and reporting through a consolidated view of information
  • Ensuring data quality and governance by applying consistent standards
  • Delivering a single version of the truth for everyone
  • Providing historical context for business intelligence
  • Increasing efficiency by replacing manual processes

What are the steps in Extract, Transform, Load (ETL)?

Extract

In the data extraction phase, raw data is copied or exported from source locations to a staging area. Data can be structured or unstructured and can come from sources such as relational databases, data storage platforms, cloud storage, or business applications (such as CRM or ERP systems).
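As an illustration, the sketch below pulls raw records from two hypothetical sources, a relational database table and a business application's JSON API, into a simple staging list. The database file, table name and URL are assumptions made for the example.

```python
import json
import sqlite3
import urllib.request

def extract_from_database(db_path, table):
    """Copy raw rows from a relational source into the staging area (a plain list here)."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    rows = [dict(r) for r in con.execute(f"SELECT * FROM {table}")]
    con.close()
    return rows

def extract_from_api(url):
    """Pull raw records from a hypothetical business application's JSON endpoint."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Both extracts land in the same staging area before any transformation is applied.
staging = extract_from_database("crm.db", "contacts") + extract_from_api(
    "https://erp.example.com/api/orders"   # illustrative endpoint, not a real system
)
```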

Transform

This is the most important part of the ETL process, as it ensures that the raw, extracted data is checked and transformed in the staging area so that it meets corporate guidelines around standards, quality and accessibility.

Data transformation normally includes:

  • Cleansing — removing/resolving inconsistencies and missing values
  • Standardization — applying standard governance rules to the dataset
  • Deduplication — discarding redundant or duplicate data
  • Verification — removing unusable data and flagging anomalies
  • Encrypting/protecting data — such as by anonymizing sensitive data
  • Sorting — organizing data by type
  • Improving — enriching datasets with additional data (such as reference data) to increase their value
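The sketch below applies several of the transformations listed above (cleansing, standardization, deduplication, anonymization and enrichment) to a small staging dataset. The field names, rules and reference table are illustrative assumptions rather than fixed ETL conventions.

```python
import hashlib

reference_countries = {"FR": "France", "US": "United States"}   # illustrative reference data

def transform(rows):
    seen, result = set(), []
    for row in rows:
        # Cleansing / verification: drop records with missing or unusable key values
        if not row.get("email") or not row.get("country"):
            continue
        # Standardization: apply a consistent format to the country code
        row["country"] = row["country"].strip().upper()
        # Deduplication: keep only the first occurrence of each email address
        if row["email"] in seen:
            continue
        seen.add(row["email"])
        # Protecting data: anonymize the sensitive field with a one-way hash
        row["email"] = hashlib.sha256(row["email"].encode()).hexdigest()
        # Improving: enrich the record with reference data
        row["country_name"] = reference_countries.get(row["country"], "Unknown")
        result.append(row)
    return result

print(transform([
    {"email": "ada@example.com", "country": " fr "},
    {"email": "ada@example.com", "country": "FR"},    # duplicate, discarded
    {"email": "", "country": "US"},                   # unusable, discarded
]))
```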

Load

Once the data is transformed, it is then loaded into a centralized repository, such as a data lake or data warehouse. All data can be loaded at once (full load) or at scheduled intervals/when a record changes (incremental load). Typically, this final stage takes place outside peak business hours when traffic on the source systems and the data warehouse is at its lowest.
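A rough sketch of the two loading modes is shown below, assuming a hypothetical sales table and an updated_at field used as the change marker for incremental loads.

```python
import sqlite3

def full_load(rows, con):
    """Full load: replace the entire target table with the freshly transformed data."""
    con.execute("DELETE FROM sales")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def incremental_load(rows, con, last_run):
    """Incremental load: append only the records that changed since the previous run."""
    changed = [r for r in rows if r[2] > last_run]   # r[2] is the assumed updated_at value
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", changed)

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL, updated_at TEXT)")
full_load([("42", 19.90, "2024-05-01")], con)
incremental_load([("42", 19.90, "2024-05-01"), ("7", 5.00, "2024-06-01")], con, last_run="2024-05-15")
con.commit()
con.close()
```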

 
