Tracking Data Lineage
Understanding the downstream impact of your data issues
Databand lets you analyze the impact of the alerts generated by your data pipelines. It provides both lists and graphical representations of the pipelines and datasets that are potentially affected by an alert. This information is available on the Alert Details screen within Databand.
Viewing Lineage in Databand
From the Alerts screen, click the View Details button for your alert.
From the Overview tab, you can review the datasets and pipelines that are potentially affected by your alert. Additionally, you can review any dataset operations that were missed as a result of your alert.
Failed Operations vs. Missing Operations
When logging dataset operations with Databand, our logging tools will capture any failures that occurred during the reading or writing of your datasets. In some cases, your pipeline may fail before it reaches certain dataset operations. Databand will distinguish between these two types of issues when reporting on your datasets.
If a dataset operation is attempted but an issue during the read or write causes it to fail, Databand reports it as a failed operation. If a pipeline fails before reaching a dataset operation, or the task that executes the operation is skipped due to some other condition, Databand reports it as a missing operation in the current run.
During each run of a pipeline, the previous run of that pipeline will be evaluated to determine which dataset operations should be logged. Therefore, missing operations are always relative to the previous run of a pipeline.
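The distinction can be illustrated with a minimal sketch. This is not Databand's implementation; the `DatasetOp` class, the `classify_operations` function, and the sample paths are hypothetical names used only to show how failed and missing operations differ when the previous run serves as the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetOp:
    """Hypothetical record of one dataset operation."""
    path: str
    op_type: str  # "read" or "write"


def classify_operations(previous_run, current_run):
    """Split operations into failed and missing sets.

    previous_run: set of DatasetOp logged in the previous run (the baseline).
    current_run: dict mapping each DatasetOp logged in this run to True
                 (succeeded) or False (failed).
    """
    # Failed: the operation was attempted in this run but did not succeed.
    failed = {op for op, succeeded in current_run.items() if not succeeded}
    # Missing: the operation was logged last run but never reached this run.
    missing = set(previous_run) - set(current_run)
    return failed, missing


# Assumed example data: the pipeline fails while reading a.csv,
# so the later write of b.csv is never attempted.
previous = {
    DatasetOp("s3://bucket/a.csv", "read"),
    DatasetOp("s3://bucket/b.csv", "write"),
}
current = {DatasetOp("s3://bucket/a.csv", "read"): False}

failed, missing = classify_operations(previous, current)
print("failed:", failed)
print("missing:", missing)
```

Because the baseline is the previous run, an operation that has never been logged before cannot appear as missing in this sketch.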
The Lineage tab provides a graphical representation of how your alert may impact other pipelines, tasks, and datasets. The task or dataset that is the origin of your alert is marked with a red error icon. The names of all pipelines, tasks, and datasets that are potentially impacted are written in red text. Additionally, clicking any item in the graph draws a red box around the downstream items that are directly related to it.
How Does Databand Determine Lineage?
Databand infers lineage across your pipelines by identifying cases where a dataset is written as part of one task and then read by a subsequent task.
For example, suppose one pipeline transforms data and writes it as a file to S3, and a second pipeline has a task that copies that same file from S3 to a table in Redshift. In this scenario, the second pipeline depends on the first pipeline completing its write operation so that the file can be read as part of the copy activity. Because of this dependency, lineage is inferred and reported in Databand.
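The write-then-read matching described above can be sketched roughly as follows. This is an illustrative simplification, not Databand's actual algorithm: the `infer_lineage` function, the tuple layout, and the pipeline, task, and dataset names are all assumptions, and the sketch ignores the timing of operations.

```python
from collections import defaultdict


def infer_lineage(operations):
    """Infer lineage edges from logged dataset operations.

    operations: iterable of (pipeline, task, dataset_path, op_type) tuples,
                where op_type is "read" or "write".
    Returns a set of (writer, dataset_path, reader) edges, where writer and
    reader are (pipeline, task) pairs.
    """
    writers = defaultdict(set)
    readers = defaultdict(set)
    for pipeline, task, path, op_type in operations:
        if op_type == "write":
            writers[path].add((pipeline, task))
        elif op_type == "read":
            readers[path].add((pipeline, task))

    edges = set()
    for path, writing_tasks in writers.items():
        for writer in writing_tasks:
            for reader in readers.get(path, ()):
                # A task that reads its own output is not a dependency edge.
                if reader != writer:
                    edges.add((writer, path, reader))
    return edges


# Assumed example mirroring the S3-to-Redshift scenario.
ops = [
    ("transform_pipeline", "write_to_s3", "s3://bucket/out.csv", "write"),
    ("load_pipeline", "copy_to_redshift", "s3://bucket/out.csv", "read"),
    ("load_pipeline", "copy_to_redshift", "redshift://db/table", "write"),
]

for writer, path, reader in infer_lineage(ops):
    print(writer, "->", path, "->", reader)
```

Only the S3 file produces an edge here, because the Redshift table is written but never read by another logged task.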
Prerequisites for Lineage
Because lineage is derived from inferred relationships between datasets, you must be logging datasets as part of your pipeline monitoring. Without logged datasets, Databand has no way of determining these relationships. Please see Dataset Logging for more information on how you can achieve greater visibility into your data operations and enable lineage tracking across your pipelines.