Data lineage

Data lineage (or data traceability) provides full visibility of the data lifecycle inside and outside the organization, including any changes made.

What is data lineage?

As organizations become increasingly data-driven, they have to trust the data that they are working with. Data lineage (also known as data traceability) aims to build this trust by ensuring that there is a full picture of where particular data has come from, how it has been changed, processed, or enriched, where it has been used, who has used it, and where it will go in the future.

Companies need to be able to trace data upstream to its original source and downstream to the end of its lifecycle to ensure quality, good governance, and regulatory compliance. This also helps them see how data is being reused, both inside and outside the organization.

Data lineage covers the full data lifecycle:

  • The origins of the data, and whether it is internal or external
  • The level of sensitivity of the data (such as if it contains personal customer information)
  • The systems it has flowed through
  • Any changes that have been made, including enrichment and standardization to meet governance requirements
  • Who it is shared with (internally and externally) and how this is used (such as for business intelligence, and within operational systems)
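
As a rough illustration, the lifecycle attributes listed above could be captured in a single record per dataset. The sketch below is a minimal, hypothetical model and not the schema of any particular lineage tool: the class name `LineageRecord`, its fields, and the example dataset and system names are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PERSONAL = "personal"  # e.g. contains personal customer information


@dataclass
class LineageRecord:
    """Hypothetical lineage record covering the lifecycle points listed above."""
    dataset: str
    origin: str                  # where the data was first created
    is_external: bool            # internal or external origin
    sensitivity: Sensitivity
    systems: List[str] = field(default_factory=list)   # systems it has flowed through, in order
    changes: List[str] = field(default_factory=list)   # enrichment, standardization, etc.
    shared_with: Dict[str, str] = field(default_factory=dict)  # consumer -> purpose

    def record_change(self, system: str, description: str) -> None:
        # Add the system once per consecutive run of changes, then log the change itself.
        if not self.systems or self.systems[-1] != system:
            self.systems.append(system)
        self.changes.append(f"{system}: {description}")

    def share(self, consumer: str, purpose: str) -> None:
        self.shared_with[consumer] = purpose


# Example usage with made-up names.
record = LineageRecord(
    dataset="customer_orders",
    origin="crm_export",
    is_external=False,
    sensitivity=Sensitivity.PERSONAL,
)
record.record_change("staging_db", "standardized country codes")
record.record_change("staging_db", "removed duplicate rows")
record.record_change("warehouse", "enriched with sales region")
record.share("bi_dashboard", "business intelligence")
```

Keeping the systems and changes in arrival order means the record can be read as a simple timeline of what happened to the dataset and where.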

Data lineage solutions provide a visual representation of the data lifecycle, enabling data administrators to drill down into how data has been created, transformed, moved, and used throughout the organization and the wider external ecosystem.

What is the difference between data lineage and data traceability?

The terms data lineage and data traceability are often used interchangeably as there is no real difference between them. They both describe the same process of understanding the data lifecycle and providing full visibility across it.

A third term – data provenance – refers to the origin of the data, i.e. how and where it was created.

Data lineage/data traceability can be broken down into two areas:

  • Business lineage: Looking at how data has been changed from a business perspective. It provides a simplified view of where data comes from, the policies/processes/standards that were applied to it and how it has been used. This gives business users trust in the data when using it in, for example, decision making.
  • Technical lineage: A more in-depth view of how data moves and transforms between systems, tables and columns, which is normally only understandable by technical/IT users. It covers areas such as the applications data flows through, technical transformations, lookups and staging tables. While too complex for business users, it is vital to ensuring technical data quality and debugging errors in the data sharing process.
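
To make the two views concrete, the sketch below treats technical lineage as a column-level dependency map and derives a simplified, system-level view from it for business users. It is an illustration under stated assumptions only: the `DERIVED_FROM` mapping, the `system.table.column` naming scheme, and the function names are hypothetical and do not come from any specific product.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the columns it is derived from.
DERIVED_FROM = {
    "report.summary.revenue_by_region": [
        "warehouse.orders.amount",
        "warehouse.customers.region",
    ],
    "warehouse.orders.amount": ["staging.orders_raw.amount"],
    "warehouse.customers.region": ["staging.customers_raw.postcode"],
    "staging.orders_raw.amount": ["erp.orders.amount"],
    "staging.customers_raw.postcode": ["crm.customers.postcode"],
}


def trace_upstream(column, derived_from):
    """Technical view: every column that `column` depends on, in breadth-first order."""
    seen = []
    queue = deque(derived_from.get(column, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(derived_from.get(current, []))
    return seen


def original_sources(column, derived_from):
    """Columns with no recorded parents, i.e. where the trace ends."""
    return sorted(c for c in trace_upstream(column, derived_from) if c not in derived_from)


def business_view(column, derived_from):
    """Business view: collapse column-level detail to the systems involved."""
    systems = []
    for name in [column] + trace_upstream(column, derived_from):
        system = name.split(".")[0]
        if system not in systems:
            systems.append(system)
    return systems


target = "report.summary.revenue_by_region"
sources = original_sources(target, DERIVED_FROM)
systems = business_view(target, DERIVED_FROM)
```

Both views are computed from the same underlying map, which is one way to keep the simplified business picture consistent with the technical detail.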

Why is data lineage important?

Data lineage is vital to delivering confidence in the data that is used to power a business. Strong data lineage allows organizations to:

  • Have trust that the data being used for business operations is accurate and high quality, so that decisions based on it are sound. As companies increasingly introduce advanced analytics and AI that automate decision-making, traceability becomes even more critical.
  • Ensure data governance by tracking and monitoring how data is used (and by whom).
  • Support compliance by being able to prove that data meets both organizational policies and external privacy regulations, such as GDPR. This makes data lineage a key part of risk management when it comes to data.
  • Protect data by understanding the systems it flows through and who has access to it.
  • Enable debugging by highlighting errors that potentially impact data use and flow.
  • Manage technical migrations, such as to the cloud, by modeling data flows and the impact of any technology/system changes on downstream solutions.
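
For the migration case in the last point, impact analysis amounts to walking the data flows downstream from the system being changed. The following is a minimal sketch assuming a hypothetical `FEEDS` map of which system feeds which; the system names are invented for the example.

```python
# Hypothetical system-level data flows: each system maps to the systems it feeds.
FEEDS = {
    "crm": ["staging_db"],
    "erp": ["staging_db"],
    "staging_db": ["warehouse"],
    "warehouse": ["bi_dashboard", "pricing_service"],
    "bi_dashboard": [],
    "pricing_service": [],
}


def downstream_impact(system, feeds):
    """Return every system that directly or indirectly consumes data from `system`."""
    impacted = set()
    stack = list(feeds.get(system, []))
    while stack:
        current = stack.pop()
        if current in impacted:
            continue  # already visited, also guards against cycles
        impacted.add(current)
        stack.extend(feeds.get(current, []))
    return sorted(impacted)


# Which solutions need checking if staging_db is migrated to the cloud?
affected = downstream_impact("staging_db", FEEDS)
```

The same traversal can support debugging: starting from the system where an error was introduced, the result lists every downstream consumer that may have received bad data.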

What are the challenges to data lineage?

Organizations generate enormous amounts of data, and increasingly add to this with information from partners and their wider ecosystems. This brings five key challenges to data lineage:

  1. Volume and range: the number of different data sources continues to grow as organizations digitize and add more data-producing devices (such as IoT sensors) to their infrastructure. The amount of data an organization has to manage is therefore growing rapidly, and all of it needs to be fully traceable across its lifecycle.
  2. Speed: data now moves at a much greater velocity within organizations. Whereas in the past weekly or monthly reporting was standard, users now need access to trusted data on a real-time basis.
  3. Compliance: regulators (and consumers) are increasingly focused on ensuring that information, particularly personal data, is used and protected in line with legislation such as the CCPA and GDPR. This makes traceability more important still, because it provides the audit trail that regulators may require.
  4. Complexity: all of these factors mean that organizations have a much more complex data environment to manage, again making traceability key.
  5. Collaboration: monitoring data across the organization and, more importantly, with external partners requires open collaboration between departments and organizations to break down silos.
