“90% of the world’s data was produced in the last two years. While data is rapidly expanding, our capacity to derive value from the data is not keeping pace.” This quote from data intelligence company Collibra describes the data deluge in Big Data. A data deluge occurs when the volume of data being generated exceeds an organisation’s capacity to manage and analyse it. As a result, a lot of data is collected and stored but never used. Data lineage has been suggested as a solution to the data deluge. This article introduces data lineage and its role in organisations.
What is data lineage?
Data lineage is the visual representation of the data’s lifecycle, or journey, from source to destination.
In other words, it demonstrates how data flows and transforms through its lifecycle across interactions with systems, applications, application programming interfaces (APIs) and reports.
The following 5W questions are useful for tracing the data’s journey:
- When was the data created or transformed?
- Where does the data come from?
- Why was the data created?
- Who is using the data?
- What does the data say?
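One way to make the 5W questions concrete is to store the answers alongside each dataset. The sketch below is only an illustration: the `LineageRecord` class, its field names and the sample values are assumptions made for this example, not part of any standard or tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class LineageRecord:
    """Answers to the 5W questions for one dataset (illustrative fields)."""
    when: datetime          # When was the data created or transformed?
    where: str              # Where does the data come from?
    why: str                # Why was the data created?
    who: Tuple[str, ...]    # Who is using the data?
    what: str               # What does the data say?


def unanswered(record: LineageRecord) -> list:
    """Return the names of the 5W fields that are still empty."""
    return [name for name, value in vars(record).items() if not value]


# Hypothetical record in which the "why" has not been documented yet.
record = LineageRecord(
    when=datetime(2023, 1, 15, 9, 30),
    where="ticketing system",
    why="",
    who=("finance team",),
    what="daily ticket sales per route",
)
print(unanswered(record))
```

A check like `unanswered` shows where the documentation of a dataset's journey still has gaps.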
Since data lineage visually demonstrates the data’s journey, lineage diagrams are usually used. There are two main types of lineage diagrams:
- A business lineage diagram shows how data flows from a data source to a report without the technical details.
- A technical lineage diagram lets data architects view the technical details of transformations, drill down into table, column and query-level lineage, and navigate through data pipelines.
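The relationship between the two diagram types can be sketched in a few lines: a business view is roughly a technical view with the column-level detail rolled up. The example below assumes a hypothetical list of column-level edges named in a `system.table.column` style; the names are invented for illustration.

```python
# Hypothetical column-level (technical) lineage: (source column, target column).
COLUMN_EDGES = [
    ("crm.customers.email", "warehouse.dim_customer.email"),
    ("crm.customers.id", "warehouse.dim_customer.customer_id"),
    ("ticketing.sales.amount", "warehouse.fact_sales.amount"),
    ("warehouse.dim_customer.customer_id", "reports.monthly_revenue.customer_id"),
    ("warehouse.fact_sales.amount", "reports.monthly_revenue.revenue"),
]


def table_of(column: str) -> str:
    """Drop the column name, keeping only 'system.table'."""
    return column.rsplit(".", 1)[0]


def business_view(column_edges):
    """Roll column-level edges up into distinct table-to-table flows."""
    flows = {(table_of(src), table_of(dst)) for src, dst in column_edges}
    # Ignore transformations that stay within a single table.
    return sorted(flow for flow in flows if flow[0] != flow[1])


for source, target in business_view(COLUMN_EDGES):
    print(f"{source} -> {target}")
```

The five column-level edges collapse into four table-level flows, which is the level of detail a business reader usually needs.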
According to senior data management professional Dr Irina Steenbeek, data lineage is recorded as a set of linked components.
She noted that, although there is no agreed list of data lineage components, there are a few essential components including:
- Data elements: These are the individual pieces of data whose journey the lineage traces.
- Business processes: These are the business activities related to the analysis of data.
- IT systems: Depending on the business processes, the data flows and transforms through systems such as the customer relationship management (CRM) system.
- Data control: This covers the controls related to the regulatory requirements that data lineage helps an organisation comply with.
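To illustrate what "a set of linked components" might look like in practice, here is a minimal sketch in which each step of the journey records the four components listed above and links to the steps that follow it. The `LineageStep` class and the sample values are assumptions made for this example; they are not taken from Dr Steenbeek's work.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    """One stop on a data element's journey (illustrative structure)."""
    data_element: str
    business_process: str
    it_system: str
    data_controls: list = field(default_factory=list)
    next_steps: list = field(default_factory=list)


def walk(step):
    """Yield the given step and every step linked after it, depth-first."""
    yield step
    for following in step.next_steps:
        yield from walk(following)


# Hypothetical journey of a customer email address, built from the end backwards.
report = LineageStep("customer email", "campaign reporting", "BI dashboard")
warehouse = LineageStep("customer email", "data consolidation", "data warehouse",
                        next_steps=[report])
source = LineageStep("customer email", "customer onboarding", "CRM",
                     data_controls=["GDPR consent check"],
                     next_steps=[warehouse])

for step in walk(source):
    print(f"{step.it_system}: {step.business_process}")
```

Following the links from the source reproduces the journey from the CRM system to the final report.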
Importance of data lineage
The usual process starts with the data being gathered from a variety of sources and stored in a data lake or data warehouse.
Before the data is given to the user, it is transformed through Extract-Transform-Load (ETL) tools, spreadsheets or ad hoc queries to help the user make decisions.
Data errors can occur during the transformation process, leading to a lack of trust in the data’s quality.
So, mapping out the full context of the data through its lifecycle allows an organisation to assess the data’s quality before analysing it.
The main dimensions of data quality that benefit from data lineage are accuracy, reliability and completeness.
Metadata management can help with this by locating the data presented in the lineage.
Besides that, data lineage saves time by allowing impact analysis to be done automatically and at a granular level rather than manually.
(Note: In the context of data lineage, impact analysis means identifying which downstream datasets, reports and systems would be affected by a change to a given piece of data.)
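Once lineage is stored as a graph, automated impact analysis amounts to following the edges downstream from the asset that is about to change. The sketch below assumes a hypothetical `DOWNSTREAM` mapping from each asset to the assets that read from it; real lineage tools keep this information in their own metadata stores.

```python
from collections import deque

# Hypothetical lineage graph: asset -> assets that consume it directly.
DOWNSTREAM = {
    "crm.customers": ["warehouse.dim_customer"],
    "ticketing.sales": ["warehouse.fact_sales"],
    "warehouse.dim_customer": ["reports.customer_360", "reports.monthly_revenue"],
    "warehouse.fact_sales": ["reports.monthly_revenue"],
}


def impacted_by(asset, downstream=DOWNSTREAM):
    """Return every asset reachable downstream of `asset` (breadth-first)."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for consumer in downstream.get(current, []):
            if consumer not in seen:  # visit each asset only once
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


print(impacted_by("crm.customers"))
```

Changing `crm.customers` touches the customer dimension and both reports, while `ticketing.sales` only reaches the sales fact table and the revenue report.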
Another reason to employ data lineage is its ability to help with regulatory compliance.
Mapping data traceability for regulations like the General Data Protection Regulation (GDPR), the Basel Committee on Banking Supervision’s standard number 239 (BCBS239) and the California Consumer Privacy Act (CCPA) can take many hours.
Getting this task wrong can result in fines and penalties. Data lineage helps prevent this by tracking how data flows and transforms through various systems.
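For traceability, the question usually runs in the opposite direction: given a figure in a report, which sources did it come from, and by which route? The sketch below assumes a hypothetical `UPSTREAM` mapping from each asset to the assets it is built from, and it assumes the lineage contains no cycles. It illustrates the idea only and is not a compliance tool.

```python
# Hypothetical lineage graph: asset -> assets it is derived from.
UPSTREAM = {
    "reports.risk_summary": ["warehouse.exposures"],
    "warehouse.exposures": ["loans.contracts", "trading.positions"],
}


def trace_to_sources(asset, upstream=UPSTREAM):
    """Return every path from an original source to `asset`.

    Assumes the lineage graph is acyclic.
    """
    parents = upstream.get(asset, [])
    if not parents:
        # Nothing feeds this asset, so it is an original source.
        return [[asset]]
    paths = []
    for parent in parents:
        for path in trace_to_sources(parent, upstream):
            paths.append(path + [asset])
    return paths


for path in trace_to_sources("reports.risk_summary"):
    print(" -> ".join(path))
```

Each printed path is one documented route from a source system to the report, which is the kind of evidence a traceability review asks for.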
A real-life application of data lineage in public transportation
Keolis, a multinational transportation company headquartered in Paris, manages a fleet of 23,000 buses and coaches, 240 km of classic or automated subway lines, 660 km of tram lines and 5,800 km of railways on behalf of 300 municipalities in 16 countries.
Over 65% of Keolis users book their tickets, raise queries and send other inquiries using their smartphones, resulting in more than 2.5 million messages per day across 70 systems.
Keolis wanted to collect all that data and process it through a single platform, so it teamed up with Talend to develop a data lineage system that keeps track of all the data and segregates it by query, preference and so on.
As a result, Keolis is able to offer travelers new multi-device services (website, mobile, tablet, e-shop, and ticket dispensers).
With the Plan/Book/Ticket project, a single app brings together all of the daily services a traveler needs: finding the right route, buying a ticket and validating it.
Keolis has also been implementing a customer data lake to gain a 360-degree view of each customer using the information the customer fills in under “My account”.
The customer data lake paves the path to the analysis of customers’ routes.
“In the past, there was no way to avoid going through point of sale to buy a ticket or reload a pass.
We had no record of the transactions, or the customer’s identity, and so we had no way of following our customer’s routes in detail with these manual processes.
With our data lake, we were able to level out this complexity and create interconnections between the ticketing, CRM, payment and other systems so as to give us an overview of customer routes and in return, propose services better tailored to users’ needs.”
This is how Keolis’ Head of IT, Emmanuel Yon, explained the significance of the customer data lake.
No wonder data lineage is a big deal for Big Data
The world has reached a point where data generation is overwhelming our capacity to handle it carefully and efficiently.
We are struggling to keep up with this vast ocean of data as we drown in it, making it ever more challenging yet crucial to get a full picture of the data’s lifecycle.
That’s why data lineage is said to be the solution to this big problem.
It visually represents what happens to the data from when it was created at its source to when it serves the specific purposes of its users at its final destination.
As we’ve seen above, using data lineage can help an organisation serve its customers much more efficiently through a single platform by making the data traceable.
Summary
What is data lineage?
Data lineage is a visual representation of the data’s lifecycle or journey as the data interacts with IT systems from its source to its destination.
What are the 5W questions for data lineage?
- When was the data created or transformed?
- Where does the data come from?
- Why was the data created?
- Who is using the data?
- What does the data say?
What are the 2 types of lineage diagrams?
Business and technical.
What are the main components of data lineage?
Data elements, business processes, IT systems and data control.
Why is data lineage important?
It improves data quality, saves time and enables regulatory compliance.