Data Reliability Engineering for an ETL System


A $100B NYSE-listed retail company needed a data reliability engineering team to stabilize its ETL system. The goals were to reduce issues, gain complete control of the end-to-end data flow, improve time-to-repair, and ensure the timely arrival of quality data.

Client Challenges and Requirements

  • The system suffered from incorrect SLAs, bad data, slow issue detection, and repeated major incidents
  • The stage environment was unstable and hard to monitor because its code was out of sync with production, making it unsuitable for integration testing
  • Operations and development teams did not collaborate effectively to improve production stability

Bitwise Solution

The data reliability team adopted the following practices to bring about the change.


  • Ran weekly problem management for issues occurring in the system and initiated change management for permanent fixes.
  • Collaborated with the source team and development SMEs to check whether SLAs could be revisited for recurring failures. For bad data from a source, determined whether it could be fixed at the source; otherwise, implemented data quality checks after the file was received.
  • Documented major incidents and tracked data lineage to reduce time-to-detect and time-to-repair for future major incidents.
  • Reduced vulnerabilities on stage by implementing gatekeeping measures and continuous monitoring to keep it in sync with production.
  • Implemented data observability for critical business areas to proactively catch any data misses.
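
The case study does not describe how the post-receipt data quality checks were built, so the following is only an illustrative Python sketch of the idea, not the Informatica implementation. The column names and the `validate_file` helper are hypothetical: the function inspects a delimited file after it arrives and reports missing columns, empty required fields, and an unexpectedly low row count before the file is allowed into the load.

```python
import csv
import io

# Hypothetical schema for an incoming source file (assumption for illustration).
REQUIRED_COLUMNS = ["order_id", "store_id", "amount"]


def validate_file(text, required_columns=REQUIRED_COLUMNS, min_rows=1):
    """Return a list of issues found in the file; an empty list means it passes."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []

    # A file without the expected columns cannot be checked row by row.
    missing = [c for c in required_columns if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    issues = []
    row_count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        row_count += 1
        for col in required_columns:
            # Short rows yield None for absent fields, so normalise to "".
            if not (row[col] or "").strip():
                issues.append(f"line {line_no}: empty {col}")

    if row_count < min_rows:
        issues.append(f"expected at least {min_rows} rows, got {row_count}")
    return issues
```

A file that returns a non-empty list could be quarantined and reported back to the source team rather than loaded.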

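As a rough illustration of what catching data misses can look like, the sketch below flags feeds that have not loaded recently or whose volume has dropped well below a baseline. It is a minimal example under assumed inputs: the feed names, the `find_data_misses` function, and the thresholds are hypothetical and are not taken from the engagement.

```python
from datetime import datetime, timedelta


def find_data_misses(loads, now, max_age, min_ratio=0.5):
    """Flag feeds that are stale or far below their usual volume.

    loads maps feed name -> (last_loaded_at, row_count, baseline_row_count).
    """
    alerts = []
    for feed, (loaded_at, rows, baseline) in sorted(loads.items()):
        if now - loaded_at > max_age:
            alerts.append((feed, "stale"))
        elif baseline > 0 and rows < baseline * min_ratio:
            alerts.append((feed, "low_volume"))
    return alerts


now = datetime(2024, 1, 2, 6, 0)
loads = {
    "sales": (datetime(2024, 1, 2, 5, 0), 1000, 1000),
    "inventory": (datetime(2024, 1, 1, 0, 0), 900, 1000),
    "returns": (datetime(2024, 1, 2, 5, 30), 100, 1000),
}
alerts = find_data_misses(loads, now, max_age=timedelta(hours=6))
```

Running a check like this on a schedule, ahead of the business SLA, is one way to surface a missing or partial load before downstream consumers notice it.
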
Tools & Technologies We Used

Informatica PowerCenter & IDMC

Key Results

Incidents reduced by more than 60%

Stage kept in sync with Prod, enabling integration testing

Quicker response to major incidents and a reduced blast radius

Incident Time-to-Detect and Time-to-Repair reduced by 40%
