I have been fascinated by the number of data catalogs available, from new startup disrupters to the long-established platforms that first introduced the idea of what a data catalog is!
What is a data catalogue?
Simply put, a data catalog is an organized inventory of an organization's data assets. It uses metadata to help the organization manage its data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.
Now, having defined what a data catalog is, it is also good to define some of the terminology that comes up in a data catalog description.
What is metadata?
You cannot talk about a data catalog without talking about metadata.
Fundamentally, metadata is data that provides information about other data. In other words, it’s “data about data”. It consists of labels or markers that describe information, making it easier to find, understand, organize, and use. Metadata can be applied to a wide range of data formats, including documents, images, videos, and databases.
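To make “data about data” concrete, here is a minimal Python sketch of a metadata record for a single table. The class name, the field names (name, owner, description, tags, columns) and the example values are all illustrative assumptions, not a standard schema or any particular catalog's format.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record: it describes a dataset, it does not hold its rows."""
    name: str
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)
    columns: dict[str, str] = field(default_factory=dict)  # column name -> meaning


# Made-up example values, for illustration only.
sales_meta = DatasetMetadata(
    name="sales_2023",
    owner="finance-team",
    description="Daily sales totals per store",
    tags=["sales", "finance"],
    columns={"store_id": "Store identifier", "total": "Sales total in USD"},
)

print(sales_meta.name, sales_meta.tags)
```

Notice that nothing here contains an actual sales figure. The record only tells you what the data is, who owns it and how to read it.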
When I began my journey in this data and AI space, I was amazed by how many terms kept being referenced that I thought I knew but actually had to go back and look up to get a fuller understanding. I mention some of them in this blog.
Below are the functions a data catalog performs best:
1. Manage diverse data assets
For instance, a data catalog could store and interlink an SQL query used for customer segmentation, the Python script used to process the query’s output, and a Tableau dashboard that displays the processed data.
2. End-to-end data visibility
A data catalog could provide information about a data asset from its inception to its current state.
This includes its origin, transformations it underwent, who accessed it, and how it was used in various analyses and reports, all accessible through a single interface.
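One way to picture this end-to-end view is as a lineage graph. The sketch below is a rough illustration, assuming a hypothetical mapping from each asset to the assets it was directly built from; the asset names and the trace_origins helper are made up for the example.

```python
# Hypothetical lineage: each asset maps to the assets it was directly built from.
UPSTREAM = {
    "sales_dashboard": ["sales_clean"],
    "sales_clean": ["sales_raw", "store_lookup"],
    "sales_raw": [],
    "store_lookup": [],
}


def trace_origins(asset, upstream=UPSTREAM):
    """Return every upstream asset that feeds `asset`, nearest first."""
    seen = []
    queue = list(upstream.get(asset, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


print(trace_origins("sales_dashboard"))
```

Walking the graph from a dashboard back to its raw sources is the kind of question a catalog's lineage view answers in a single interface.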
3. Handle large scale metadata
For example, a data catalog could analyze the metadata from thousands of SQL queries executed in a month.
This helps determine the most frequently accessed tables and commonly joined columns, and deduce the potential data owners or subject-matter experts based on query patterns.
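As a rough sketch of the idea, the Python below counts how often each table shows up in a small query log. The queries are made up, and the regular expression is a naive assumption that only looks at the name after FROM or JOIN; a real catalog would rely on a proper SQL parser and the warehouse's own query history.

```python
import re
from collections import Counter

# Made-up query log; a real catalog would read this from the warehouse's query history.
QUERY_LOG = [
    "SELECT * FROM customers c JOIN orders o ON c.id = o.customer_id",
    "SELECT count(*) FROM orders",
    "SELECT name FROM customers WHERE country = 'KE'",
    "SELECT * FROM orders o JOIN payments p ON o.id = p.order_id",
]

# Naive pattern: the identifier that follows FROM or JOIN.
# It ignores subqueries, CTEs and quoted names, so treat it as a rough heuristic.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def table_usage(queries):
    """Count how many times each table is referenced across the queries."""
    counts = Counter()
    for query in queries:
        counts.update(name.lower() for name in TABLE_PATTERN.findall(query))
    return counts


usage = table_usage(QUERY_LOG)
print(usage.most_common(2))
```

The same counting approach could be extended to join columns or to the users who ran each query, which is where hints about likely owners and subject-matter experts would come from.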
4. Facilitate embedded collaboration
When a data analyst finds an issue with a certain data asset in the catalog, they could directly create a ticket in the organization’s JIRA system from within the catalog, attaching relevant metadata and context, thereby speeding up issue resolution.
5. Contextualize data
A data catalog could provide a complete context of a sales data table - explaining what each column represents, how the sales figures were calculated, any adjustments made, and perhaps a glossary of sales-related terms.
6. Data discovery
Users can leverage the catalog’s intelligent search feature to find specific datasets.
For instance, typing “customer” might suggest “customer demographics”, “customer transaction history” or “customer behavior analysis” based on the metadata and previous search patterns.
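A very simplified version of this search can be sketched in a few lines of Python. The catalog entries and the search helper are hypothetical, and this is plain keyword matching over names and descriptions; the "intelligent" part of a real catalog, such as ranking by previous search patterns, is not shown.

```python
# Hypothetical catalog entries: dataset name -> short description.
CATALOG = {
    "customer_demographics": "Age, region and segment for each customer",
    "customer_transaction_history": "Every purchase made by a customer",
    "store_inventory": "Stock levels per store",
}


def search(term, catalog=CATALOG):
    """Return dataset names whose name or description mentions the term."""
    term = term.lower()
    return sorted(
        name
        for name, description in catalog.items()
        if term in name.lower() or term in description.lower()
    )


print(search("customer"))
```

Because the descriptions are searched too, a term like "purchase" still finds the transaction history dataset even though the word is not in its name.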
7. Data governance
The catalog could enforce a policy that all data assets must have a designated owner and documented business definitions.
It could alert when assets do not comply, promoting adherence to data governance standards.
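The policy described here can be sketched as a simple compliance check. This is a minimal illustration under stated assumptions: the asset records, the required field names and the find_violations helper are all hypothetical, and a real catalog would send its alerts through its own notification system rather than print them.

```python
# Hypothetical asset records; None or an empty string marks a missing field.
ASSETS = [
    {"name": "sales_2023", "owner": "finance-team", "definition": "Daily sales totals"},
    {"name": "web_clicks", "owner": None, "definition": "Raw clickstream events"},
    {"name": "tmp_export", "owner": "data-eng", "definition": ""},
]

REQUIRED_FIELDS = ("owner", "definition")


def find_violations(assets, required=REQUIRED_FIELDS):
    """Map each non-compliant asset name to the required fields it is missing."""
    violations = {}
    for asset in assets:
        missing = [f for f in required if not asset.get(f)]
        if missing:
            violations[asset["name"]] = missing
    return violations


for name, missing in find_violations(ASSETS).items():
    print(f"ALERT: {name} is missing {', '.join(missing)}")
```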
8. Data security and privacy
The catalog could maintain information about who has access to sensitive data, such as personally identifiable information (PII).
It can even alert when such data is accessed without proper authorization.
9. Promote data literacy
An analyst who is unfamiliar with a particular area, say “logistics data”, could use the data catalog to learn about the datasets available, their relevance, context, and appropriate use cases.
10. Efficient data utilization
A data analyst looking to understand customer behavior might discover through the catalog that a colleague in another department has already created a customer segmentation model.
They could then leverage this existing model, saving time and effort.
All these definitions and uses give more reasons for data catalog adoption in industry.
Traditional Data Catalogs
These are defined by their on-premise and legacy nature. They’re optimized for on-premise deployments and designed to be run by IT professionals. As such, they are challenging for business users, and they are limited in their capabilities for cloud-based storage. Traditional data catalogs are ideal for large companies with a robust IT team that are operating on-premise infrastructure. Because IT resources are available, challenges with business users are easier to mitigate.
“The metadata and semantic data integration visionaries of 20 or 30 years ago — who toiled in frustration due to the limitations of the data management systems of their era — would be excited by the potential of today’s technologies.” - David Stodder,TDWI
Open Source Catalogs
These are built by engineers, for engineers, and require significant time and resources to develop into a functioning data catalog for your organization, but they are free of licensing costs, come with documentation, and allow you to build on work others have done.
Modern Data Catalogs
These are flexible, comprehensive catalogs that are designed for business users working in a cloud-native environment. They are ideal for distributed, digital- and remote-first businesses that wish to make it easy for non-technical business team members to use the data effectively. Their extensible nature means they can connect all types of data assets in a single source of truth, intelligently leverage metadata as a form of big data, and integrate seamlessly into data consumers’ current workflows. Modular and scalable pricing models are characteristic of modern catalogs, but they are not optimized for on-premise data sources.
I have done a presentation that explains what a data catalog is, what to look for when going to market for your next catalog, and how it would benefit your business.
Get in touch and we will gladly give you access to this video link.
The definitions of the types of data catalogs were referenced from the Atlan link below, rather than me trying to explain these terms again!