About the Role:
Diamond Trust Bank is seeking a Senior Site Reliability Engineer to ensure 24/7 reliability, security, and regulatory compliance of our digital banking services across a multi-cloud environment.
This role reports directly to the Head of Cloud Engineering and collaborates closely with Cloud Engineering, DevSecOps, Architecture, ENOC, Cybersecurity, Product Delivery, and Engineering teams.
The ideal candidate is a seasoned reliability engineer who is passionate about proactive observability, automation, operational excellence, and compliance alignment. They will help define SLOs, ensure platform resilience, automate recovery, and support continuous improvement of mission-critical banking systems.
Together, we protect the trust and availability of digital financial services that millions rely on.
Key Responsibilities:
Define and enforce SLOs and error budgets for mission-critical banking channels, ensuring compliance with Central Bank of Kenya (CBK) and business continuity directives.
Implement, maintain, and enhance observability stacks for traceability across inter-bank transactions and payment APIs.
Automate operational workflows, infrastructure provisioning, and recovery processes using Terraform, Crossplane, and ArgoCD.
Integrate anomaly detection insights with SIEM platforms (e.g., Microsoft Sentinel) to support unified reliability and security monitoring.
Conduct chaos engineering and resilience testing to validate RTO/RPO and high-availability commitments.
Lead and document incident post-mortems, ensuring corrective actions inform continuous improvement and regulatory audit readiness.
Personal Competencies:
Strong analytical, diagnostic, and troubleshooting skills
Ability to collaborate with cross-functional teams in high-stakes environments
Clear communicator who can translate technical findings into business impact
Proactive mindset with strong ownership and accountability
Ability to thrive under regulated operational processes and controls
Skills & Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field
5+ years of experience managing large-scale production systems in cloud environments
Proven experience maintaining uptime and latency SLAs for digital banking or financial systems
Expert proficiency in observability tools (Dynatrace, Grafana, OpenTelemetry, Jaeger)
Familiarity with CBK ICT Risk Management Guidelines, Basel III Operational Risk Principles, and PCI DSS
Hands-on experience with Terraform, Crossplane, GitOps workflows, and automated deployment pipelines
Closing Date: November 28, 2025