What is a data catalog - practical guide

1. What is a data catalog?

Defining a data catalog

A data catalog, or data asset catalog, provides an inventory of all of an organization’s data, including where it is stored, and the formats that it is in. It identifies, organizes and describes these data assets through comprehensive metadata. This provides organizations with a complete view of their entire data estate.

A data catalog provides technical and data experts with the ability to identify, in real-time, all of the data an organization possesses. This helps them find and focus on the data that matters, thus driving its greater reuse and improving its management. Data cataloging is therefore a critical part of data management.

A data catalog works like an index in a book or a catalog in a library: it lets a user find where a specific data asset is located by searching or browsing. It is important to understand that a traditional data catalog does not store data or give direct access to it. It only tells a user where the data can be found, through search, filters and themes.

Data catalogs were originally deployed as part of data governance and compliance initiatives, enabling organizations to see exactly what was in their data estate and ensure that it was well-managed, stored securely and free of duplication. Once this inventory was in place, many organizations extended its functionality to try to connect users with data.

2. Why should you use a data catalog?

Today, every organization generates and collects growing volumes of data, from a wide range of sources, spanning enterprise applications such as ERP and CRM, websites, ad platforms, Internet of Things (IoT) sensors, data warehouses and partner/customer data. This is normally distributed across the business, with data often generated within specific departments or business units. Data can be in the form of raw datasets or as data assets including data products or data visualizations. All of this means there is no single, cohesive picture of the entire data landscape.

This brings three major challenges:

Compliance. It is impossible to ensure that data is being stored and protected correctly in line with regulations such as the GDPR or CCPA if the organization does not know where it is located or who can access it. For example, data on customers may be stored on a standalone server in a branch office, without security or access management in place if the central IT team is not aware of it.
Duplication. Similar or even the same data may be stored in different locations within the business. This takes up storage space unnecessarily, reducing efficiency as well as making data more difficult to protect. More importantly, it means there is no single, agreed version of the truth – different teams may be relying on different versions of a data asset, leading to confusion and inconsistency. Users simply won’t trust data, and therefore are unlikely to rely on it.
Access. Data is stored within silos inside expert tools and applications and is not available to, or even known about by, other business users. It cannot be easily found or used, holding back consumption and data democratization. If users don’t know that a specific data asset exists, they cannot use it.

The data catalog is designed to overcome these challenges, by providing a centralized inventory of all data. Users can browse or search in order to find relevant information, as well as examining the metadata of specific datasets and other data assets. They can see who owns the data and where it is stored. Data catalogs aim to make data management seamless and efficient, by providing technical teams with a complete, up-to-date list of all of their data.

However, data catalogs do not provide direct access to the underlying data, meaning that once a user has located data, they must apply to the specific data owner or producer to request access. This means that data cataloging has to be the first step in data sharing and maximizing data value. On its own it does not drive consumption at scale by the business.

3. What features should a data catalog have?

To ensure that it meets the needs of data teams, an effective data catalog has to possess five key capabilities:

It must be comprehensive, covering all of an organization’s data. An incomplete data catalog that misses specific data assets does not help data teams.
It must be trustworthy. As well as covering all data, it has to provide sufficient details on the data it contains (such as where it was generated, who owns it and the frequency of updates) so that users are confident about what a data asset contains and its quality.
It must be continually updated to reflect and include the latest data assets within the business.
It must be easily searchable so that relevant data can be found quickly. Primarily this is achieved by describing every data asset using accurate metadata that follows agreed standards.
It must be accessible by both humans and AI. AI data catalogs help organizations to find and select relevant data for training AI agents and models to maximize accuracy and value.

To deliver these capabilities, data catalogs should include these features:

Comprehensive metadata

This “data about data” needs to describe what a dataset contains, in order to simplify the understanding and organization of information. It is vital that this metadata is comprehensive and complete to provide full background on a dataset and make it easier to find. Technical, business, and operational metadata should include information on its format, origin, date of creation, owner, and any transformations that have been applied. It should follow an agreed metadata schema, whether developed internally or externally (such as Dublin Core) to ensure consistency both within the organization and with partners outside the business. The data catalog should apply quality checks to metadata to ensure that it is complete, correctly tagged and accurate.
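As an illustration of the metadata quality check described above, here is a minimal sketch in Python. The `CatalogEntry` class, its field names and the `REQUIRED` list are assumptions made for this example, only loosely modeled on Dublin Core elements such as creator, format, source and date. They are not the schema of any particular data catalog product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog record (field names are illustrative only)."""
    title: str
    description: str = ""
    owner: str = ""                  # comparable to Dublin Core "creator"
    data_format: str = ""            # comparable to Dublin Core "format"
    origin: str = ""                 # comparable to Dublin Core "source"
    created: Optional[date] = None   # comparable to Dublin Core "date"
    transformations: List[str] = field(default_factory=list)


# Assumption: a dataset may have no transformations, so that field is optional.
REQUIRED = ("title", "description", "owner", "data_format", "origin", "created")


def missing_metadata(entry: CatalogEntry) -> List[str]:
    """Return the names of required fields that are empty or unset."""
    return [name for name in REQUIRED if not getattr(entry, name)]


entry = CatalogEntry(title="Sales 2024", owner="finance")
print(missing_metadata(entry))
```

A check like this could run whenever an entry is created or updated, so that incomplete records are flagged before they reach users.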

Powerful search functionality

Given the large number of data assets inventoried in the data catalog, it has to be easy for users to find those that they are looking for. The data catalog must make it easy to search by keywords, themes and business terms or to apply filters (such as data owner, format or time of update) to narrow down their search. There should also be the opportunity to browse data, especially by theme, in order to show all data around a specific subject or area.
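The combination of keyword search and filters described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Entry` fields and the `search` function are hypothetical, and a real data catalog would typically use a dedicated search index rather than a linear scan.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical, minimal catalog entry used only for this example."""
    title: str
    description: str
    theme: str
    owner: str
    data_format: str


def search(entries: List[Entry], keyword: str = "",
           theme: Optional[str] = None, owner: Optional[str] = None,
           data_format: Optional[str] = None) -> List[Entry]:
    """Keyword match on title/description, narrowed by optional filters."""
    kw = keyword.lower()
    results = []
    for e in entries:
        if kw and kw not in (e.title + " " + e.description).lower():
            continue
        if theme is not None and e.theme != theme:
            continue
        if owner is not None and e.owner != owner:
            continue
        if data_format is not None and e.data_format != data_format:
            continue
        results.append(e)
    return results


catalog = [
    Entry("Store sales", "Daily sales per store", "sales", "finance", "csv"),
    Entry("Web traffic", "Sessions by channel", "marketing", "web team", "parquet"),
    Entry("Online sales", "Orders from the web shop", "sales", "web team", "csv"),
]
# Browsing by theme is a search with no keyword and only a theme filter.
print([e.title for e in search(catalog, theme="sales")])
```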

4. Who uses a data catalog?

Traditional data catalogs are technical tools, used by experts such as data stewards, data governance teams and data engineers. They require training and knowledge of data terms in order to find and manage data within the data catalog, and often have a basic, technical user interface that is neither intuitive nor easy to use. This makes navigation hard for non-specialists, such as business users, preventing them from using the data catalog confidently.

Essentially, they are simply too complex to be used by business teams, hampering wider adoption. This is particularly true as data catalogs don’t provide direct access to data, meaning that business users struggle to benefit from using them.

5. What are the benefits of a data catalog?

By providing a complete view of all the organization’s data, a data catalog makes it easier to understand and manage your entire data estate. This understanding of the data that an organization owns is the first step to being able to use it effectively. A comprehensive data catalog therefore benefits organizations in multiple ways:

It supports robust governance through a central catalog of all data, increasing control and security and reducing risk
It enables regulatory compliance by providing a full record of your data landscape
It helps identify data duplication, increasing operational efficiencies
It breaks down barriers and silos between teams and departments, enabling data to be transparently shared across the business
It creates a single, agreed language for data and how it is described, driving greater consistency
It makes data more accessible across the organization, starting the process of data democratization
It builds confidence and trust in data by making it easier to discover and understand
It saves time for employees, especially data experts, by simplifying access to data
It standardizes data, ensuring it is displayed using consistent terms and formats to make it easy to understand by all
It supports better decision-making by humans and AI by providing information on relevant data from across the business
It gives greater insight into data flows and how data is used within the business through data lineage
It can automate data discovery, profiling and the addition of metadata, saving time and ensuring quality through data catalog tools

Data and business glossaries

Different departments and technical solutions describe data and its attributes in different ways. These definitions and descriptions need to be standardized across the organization to avoid confusion when searching for and accessing data. At a technical level a data glossary sets out the key terms that are used to describe data assets so that they are consistent – for example, defining what “real-time” data actually means.

On the business side, terms and concepts must also be agreed and standardized to ensure consistency and understanding. This business glossary is made up of definitions of the main business terms used to describe data by different teams within an organization, and acts as a centralized source of knowledge around definitions. For example, without it different departments might refer to a “customer” in different ways – for some it could be the overall client organization, while for others it might be the specific contact or department.

Data glossaries must be created in conjunction with all data owners from across the business to gain their buy-in and agreement as part of creating the data catalog. Likewise, the business glossary should be co-developed with business users, defining the business terms and concepts used in the organization. Business glossaries have a wider usage than the data catalog and help to break down silos between departments and build a consistent picture of operations across the organization.
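To make the idea of a business glossary concrete, here is a minimal sketch in which different departmental terms resolve to one agreed definition. The terms, synonyms and definitions below are invented for illustration. In practice they would be agreed with business users as described above.

```python
from typing import Dict, Optional

# Hypothetical glossary content; every definition here is an example only.
GLOSSARY: Dict[str, dict] = {
    "customer": {
        "definition": "The client organization that holds a contract with us.",
        "synonyms": {"client", "account"},
    },
    "contact": {
        "definition": "An individual person working at a customer.",
        "synonyms": {"buyer", "end user"},
    },
}


def resolve_term(term: str, glossary: Dict[str, dict] = GLOSSARY) -> Optional[str]:
    """Map a term used by any team to the agreed glossary term, if one exists."""
    key = term.strip().lower()
    for canonical, entry in glossary.items():
        if key == canonical or key in entry["synonyms"]:
            return canonical
    return None


print(resolve_term("Client"))    # resolves to the agreed term "customer"
print(resolve_term("supplier"))  # not in the glossary, so None
```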

Data dictionary

The data catalog’s data dictionary provides detailed technical information about data assets. Unlike a data glossary, which defines terms used to describe data, the data dictionary focuses on how technical elements are described and documented. These could include listing data elements (names, definitions, purpose) and their properties, reference data, the data source, where it is stored and how data elements are related to each other.
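By way of contrast with the glossary, a data dictionary works at the level of individual data elements. The following sketch assumes a hypothetical `DataElement` structure and example table and column names. It shows how documented elements, their properties and their relationships might be recorded, and how undocumented columns could be detected.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataElement:
    """Hypothetical data dictionary record for one column."""
    name: str
    data_type: str
    definition: str
    source: str
    references: Optional[str] = None  # related element, as "table.column"


# Example content only; keys use the form "table.column".
DICTIONARY: Dict[str, DataElement] = {
    "customers.customer_id": DataElement(
        "customer_id", "integer", "Unique customer identifier", "crm"),
    "orders.order_id": DataElement(
        "order_id", "integer", "Unique order identifier", "erp"),
    "orders.customer_id": DataElement(
        "customer_id", "integer", "Customer who placed the order", "erp",
        references="customers.customer_id"),
}


def undocumented(table: str, columns: List[str],
                 dictionary: Dict[str, DataElement] = DICTIONARY) -> List[str]:
    """Return the columns of a table that have no data dictionary entry."""
    return [c for c in columns if f"{table}.{c}" not in dictionary]


print(undocumented("orders", ["order_id", "customer_id", "discount"]))
```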

Data lineage

Data is not static – it flows through the organization, undergoes transformations and can be enriched or combined with other data assets during its lifecycle. That means that as well as showing the origin of data, the data catalog has to be able to map its journey and how it has changed or moved over time. Data lineage tools within data catalogs therefore provide a clear visualization of the lifecycle of data assets through accurate mapping. This can be used to identify and reduce data duplication and also to track the relationship between data assets. For example, if a specific data asset is used within a business application, it is vital to identify this in case the data asset becomes unavailable, is retired or its attributes change. This helps trace errors more quickly and keep users informed about availability and usage.
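The impact analysis described above, finding everything that depends on a data asset before it is retired or changed, can be sketched as a walk over a lineage graph. The asset names and the `LINEAGE` mapping are hypothetical, and real lineage tools usually capture these links automatically rather than from a hand-written dictionary.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets that consume it directly.
LINEAGE: Dict[str, List[str]] = {
    "crm_export": ["customers_clean"],
    "customers_clean": ["customer_360", "churn_model_input"],
    "customer_360": ["sales_dashboard"],
    "churn_model_input": [],
    "sales_dashboard": [],
}


def downstream(asset: str, lineage: Dict[str, List[str]] = LINEAGE) -> List[str]:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen = set()
    order = []
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


# Which assets are affected if "customers_clean" changes or is retired?
print(downstream("customers_clean"))
```

Owners of the returned assets are the people who would need to be informed before the upstream asset changes.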

Validation and automation tools

The data catalog must be kept continually updated when new data assets are created within the business. When this new data is added to the data catalog, or existing catalog entries are updated, it is vital that they follow the standards set out in data catalog tools such as the data dictionary and data glossary. However, very often, data formats and sources are heterogeneous, coming from different business applications, databases and storage solutions. The data catalog must harmonize data to make it usable, based on understanding and documenting the content, structure and quality.

Much of this can be automated. Data catalog tools can search for and discover new data assets, and then automatically profile and document their exact content, structure, and quality as they are onboarded. As part of the process, the data catalog should generate and harvest relevant metadata, reducing the burden on administrators, particularly when it comes to first creating the data catalog.
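As a rough illustration of automated profiling, the sketch below reads a small CSV file and derives basic metadata about its structure and quality. The function names and the choice of statistics (row count, inferred type, null count, distinct count) are assumptions for this example. Commercial tools typically profile far more than this.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Guess a column type by trying integer, then float, else string."""
    if not values:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text: str) -> Dict:
    """Derive simple structural and quality metadata from CSV content."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    rows = list(reader)
    profile = {"row_count": len(rows), "columns": {}}
    for col in columns:
        values = [row[col] for row in rows]
        non_empty = [v for v in values if v not in ("", None)]
        profile["columns"][col] = {
            "inferred_type": infer_type(non_empty),
            "null_count": len(values) - len(non_empty),
            "distinct_count": len(set(non_empty)),
        }
    return profile


sample = "id,city,amount\n1,Paris,10.5\n2,,7\n3,Lyon,3.25\n"
print(profile_csv(sample))
```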

Connectors to add all data

Data from different sources follows its own formats and specifications. Built-in connectors within a data catalog make the discovery and addition of new data assets simpler by automatically linking to multiple sources across the organization, such as databases, internal files, external sources, and Internet of Things (IoT) sensors, collecting their metadata in real-time. They contribute to the creation of a comprehensive and centralized data catalog repository of the entire data estate.
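One way to picture a connector is as a small adapter that knows how to read metadata from one type of source and return it in a common shape. The sketch below assumes a hypothetical `Connector` interface with a single `harvest()` method and implements it for SQLite using Python's standard `sqlite3` module. It is not the connector API of any specific data catalog.

```python
import sqlite3
from typing import Dict, List, Protocol


class Connector(Protocol):
    """Hypothetical interface: one implementation per source type."""
    def harvest(self) -> List[Dict]: ...


class SQLiteConnector:
    """Collects table and column metadata from a SQLite database."""

    def __init__(self, conn: sqlite3.Connection, source_name: str):
        self.conn = conn
        self.source_name = source_name

    def harvest(self) -> List[Dict]:
        tables = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        entries = []
        for (table,) in tables:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            cols = self.conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            entries.append({
                "source": self.source_name,
                "asset": table,
                "columns": [{"name": c[1], "type": c[2]} for c in cols],
            })
        return entries


# Usage with an in-memory database standing in for a real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
print(SQLiteConnector(conn, "shop_db").harvest())
```

Only metadata is read here, not the rows themselves, which mirrors the point that a catalog inventories data rather than storing it.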

Data consumption

While data catalogs began as solutions to simply inventory data, their role has evolved to support greater data consumption across the business. This means that catalog entries must provide more detail to help build trust in data among end users, particularly within the business, and make it as easy as possible for them to then download data to encourage reuse. Connecting data catalogs to tools such as data product marketplaces helps make consumption seamless by enabling users to visualize or download data, such as via APIs or in common file formats. As part of this, data product marketplaces support data governance by controlling and managing access, especially to sensitive data.

6. What are the limitations of a data catalog?

As organizations look to increase the sharing and consumption of data with non-technical business users, they reach the limits of what can be achieved with a traditional data catalog.

This is because a data catalog has been designed for a specific purpose and audience – creating an inventory of an organization’s data by a group of technical users. As a technical tool the data catalog improves data management, compliance and governance, but has not been created to aid data sharing. For example:

It provides a list of available data, but does not include the data itself – users have to follow up with data owners to actually access and consume data
It does not provide a seamless, intuitive interface for non-technical users, making it difficult for those in the business to use it confidently
It does not allow business users to operate independently and use data seamlessly: the data catalog’s technical design means they are likely to rely on data experts for support
It just describes data via metadata, rather than necessarily deploying the terms that business users actually understand

Essentially a data catalog alone is not enough to support data consumption at scale – as a technical tool it simply inventories data rather than making it easily available to non-technical users. This means it does not deliver consumption, just compliance.

7. Why should you combine a data catalog with a data product marketplace?

Data product marketplaces have many similarities with data catalogs. They provide a single, centralized space to share data, ensuring consistency, comprehensiveness, and a single version of the truth. All types of data asset can be searched for and discovered through a data product marketplace, just as in a data catalog.

What is different is that data product marketplaces are focused on increasing the operational use of data by the business via self-service. To do this they essentially provide an e-commerce style storefront to data catalogs in order to make it easy and seamless for all users to both discover and consume data through a single tool, without requiring technical skills or support. They integrate capabilities such as AI-powered semantic search, personalized recommendations and the ability for data consumers to directly collaborate with data producers, as well as including full documentation on the data itself. This increases confidence and user independence. Granular access management capabilities control access to data through role and request-based permissions, supporting data governance objectives.
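The mix of role-based and request-based permissions mentioned above could look something like the following sketch. The `AccessPolicy` class, its method names and the example roles are hypothetical, chosen only to show how the two mechanisms might combine for a single data asset.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AccessPolicy:
    """Hypothetical per-asset policy combining role and request-based access."""
    allowed_roles: Set[str]
    approved_users: Set[str] = field(default_factory=set)
    pending_requests: Set[str] = field(default_factory=set)

    def can_access(self, user: str, role: str) -> bool:
        """Access is granted by role, or by an approved individual request."""
        return role in self.allowed_roles or user in self.approved_users

    def request_access(self, user: str) -> None:
        """A data consumer asks the data producer for access."""
        self.pending_requests.add(user)

    def approve(self, user: str) -> bool:
        """The data owner approves a pending request; False if none exists."""
        if user not in self.pending_requests:
            return False
        self.pending_requests.remove(user)
        self.approved_users.add(user)
        return True


policy = AccessPolicy(allowed_roles={"data_steward"})
print(policy.can_access("ben", "analyst"))  # no matching role, no approval yet
policy.request_access("ben")
policy.approve("ben")
print(policy.can_access("ben", "analyst"))  # granted through the approved request
```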

Here is a full comparison of the differences and similarities between a data catalog and a data product marketplace:

[Figure: Data catalog vs data product marketplace]

All of this means that before launching a data product marketplace and making data available to all, organizations need to create a complete catalog of all of their data assets. Consistent data cataloging requires assets to be thoroughly documented, with definitions captured in a business glossary to ensure consistency across the organization. An effective governance strategy should either combine both tools or choose the one that best meets the organization’s specific needs.

This means that organizations have two options for their data catalog when implementing a data product marketplace:

Use the data product marketplace as a data catalog

Data product marketplaces include and integrate the essential features of data catalog solutions, such as a business glossary, connectivity to data assets, metadata management and data lineage. If an organization has not yet deployed a data catalog, the capabilities within a data product marketplace will enable them to inventory and catalog data to underpin greater data consumption alongside compliance.

Integrate an existing data catalog with the data product marketplace

For those organizations that have invested in deploying a data catalog solution and are happy with it, the easiest option is to simply integrate it with the data product marketplace. Technical users are still able to use the data catalog, whereas business users benefit from the intuitive, self-service consumption experience that the data product marketplace provides.

This approach also has the benefit of increasing ROI from the existing data catalog implementation itself. Data catalogs can be time-consuming and expensive to deploy, with their impact not fully seen by the business. Complementing them with a data marketplace transforms the value they bring by providing access for all to the data they have inventoried. Data marketplaces put information directly in the hands of business users who can then use it to create tangible value for the organization.

8. Data catalog: Creating a comprehensive view of all your data

Data catalogs are essential parts of the data management stack. They provide a complete and comprehensive view of all of an organization’s data, wherever it is stored and however it has been generated. By applying descriptive metadata they ensure that this data can be understood and categorized, ensuring compliance and enabling technical teams to increase efficiency, reduce duplication and begin the process of data democratization. Combining a data catalog with a data product marketplace then accelerates data value by enabling data consumption at scale through an intuitive self-service experience for all users.
