[deprecated] Data Tracking Templates

To integrate data monitoring and observability into your pipelines, you can use our data monitoring templates. The templates are built as Airflow DAGs so they can run directly in your Airflow environment, but they can also be adapted to run from any other scheduler or cron system.

All templates can be modified to better suit the needs of your pipelines or data. The templates are designed to be lightweight on your database resources (compute time, query time, credit usage, and so on) and non-intrusive in your Airflow environments.

To use the templates, copy the desired template into your Airflow DAGs directory, export the necessary environment variables, and start using the monitor.
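
As a rough illustration of the "export the necessary environment variables" step, the sketch below shows one way a DAG file might read its settings and fail early when a required variable is missing. It uses the variable names listed for the Redshift Database Monitoring template further down this page; the `DatabaseMonitorConfig` class and `load_database_monitor_config` function are hypothetical names chosen for this example and are not part of the templates.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Required variable names, taken from the Redshift Database Monitoring section.
REQUIRED_VARS = (
    "REDSHIFT_CONNECTION_ID",
    "REDSHIFT_CLUSTER_NAME",
    "REDSHIFT_TABLE_MONITOR_SCHEDULE",
)


@dataclass(frozen=True)
class DatabaseMonitorConfig:
    """Hypothetical container for the monitor settings (illustration only)."""

    connection_id: str
    cluster_name: str
    schedule: str
    schema: str = "public"


def load_database_monitor_config(
    env: Optional[Mapping[str, str]] = None,
) -> DatabaseMonitorConfig:
    """Read the monitor settings from the environment, failing fast if any are unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return DatabaseMonitorConfig(
        connection_id=env["REDSHIFT_CONNECTION_ID"],
        cluster_name=env["REDSHIFT_CLUSTER_NAME"],
        schedule=env["REDSHIFT_TABLE_MONITOR_SCHEDULE"],
        # REDSHIFT_SCHEMA is optional; fall back to the public schema.
        schema=env.get("REDSHIFT_SCHEMA") or "public",
    )
```

Calling `load_database_monitor_config()` with no arguments reads from `os.environ`; passing a plain dictionary instead makes the function easy to exercise outside Airflow.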

Example Templates

  • Redshift Database Monitoring
  • Redshift Table Monitoring

AWS Redshift

  • Redshift Database Monitoring

    • Template code

    • Template includes the following metrics:

      • Number of tables in cluster
      • Minimum number of rows in cluster tables
      • Maximum number of rows in cluster tables
      • Mean number of rows in cluster tables
      • Median number of rows in cluster tables
      • Shape of all tables in the cluster (columns, rows)
      • Largest table by row count
      • Largest table by column count
      • Disk usage of cluster (Capacity, Free, and Used in GB)
      • Percentage of disk used
    • To use this template DAG, specify the following environment variables:

      export REDSHIFT_CONNECTION_ID="<your Redshift connection ID>"
      export REDSHIFT_CLUSTER_NAME="<your Redshift Cluster name>"
      export REDSHIFT_TABLE_MONITOR_SCHEDULE="<cron format schedule to run monitor>"
      
      # optional variable; defaults to the public schema
      export REDSHIFT_SCHEMA="<target Redshift schema to monitor>"
      
  • Redshift Table Monitoring

    • Template code

    • Template includes the following metrics:

      • Record count of the target table
      • Null/NaN record count for each column in the target table
      • Duplicate record count (all columns match)
      • Minimum of numeric columns in the target table
      • Maximum of numeric columns in the target table
      • Mean of numeric columns in the target table
      • Median of numeric columns in the target table
    • To use this template DAG, specify the following environment variables:

      export REDSHIFT_CONNECTION_ID="<your Redshift connection ID>"
      export REDSHIFT_MONITOR_TARGET_TABLE="<target Redshift table name>"
      export REDSHIFT_TABLE_MONITOR_SCHEDULE="<cron format schedule to run monitor>"
      
      # optional variable; defaults to the last 1000 rows
      export REDSHIFT_MONITOR_TABLE_LIMIT="<number of rows to monitor>"
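
To make the table-level metrics listed above concrete, here is a minimal sketch that computes them over rows already fetched into memory as dictionaries. It is not the template's code: the `table_metrics` function and its output keys are hypothetical, and it assumes that "duplicate record count" means the number of rows beyond the first occurrence of each identical row.

```python
import math
import statistics
from collections import Counter
from typing import Any, Dict, List


def _is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _is_number(value: Any) -> bool:
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def table_metrics(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute record, null, duplicate, and numeric summary metrics for a table sample."""
    columns = list(rows[0].keys()) if rows else []
    metrics: Dict[str, Any] = {"record_count": len(rows)}

    # Null/NaN count for each column.
    metrics["null_counts"] = {
        col: sum(_is_missing(row[col]) for row in rows) for col in columns
    }

    # Duplicate rows: every column matches an earlier row.
    # repr() is used so that NaN values compare equal to each other.
    seen = Counter(tuple(repr(row[col]) for col in columns) for row in rows)
    metrics["duplicate_count"] = sum(count - 1 for count in seen.values())

    # Min, max, mean, and median for columns whose non-missing values are all numeric.
    numeric: Dict[str, Dict[str, float]] = {}
    for col in columns:
        values = [row[col] for row in rows if not _is_missing(row[col])]
        if values and all(_is_number(v) for v in values):
            numeric[col] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
            }
    metrics["numeric"] = numeric
    return metrics
```

In a real monitor the `rows` argument would come from a query against the target table, limited to the number of rows set by `REDSHIFT_MONITOR_TABLE_LIMIT`; for large tables, computing these aggregates in SQL on the Redshift side would likely use fewer resources than fetching rows.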
      
