Senior Site Reliability Engineer at PesaLink
PesaLink
The Senior SRE will be responsible for driving the maturity of SRE practices such as SLAs, SLOs, incident reporting, RCA documentation, and problem management. They will use their skills to troubleshoot and resolve issues (within SLAs and with a view to reducing MTTR) and put in place measures to prevent recurrence (thereby increasing MTBF). They will also help ensure that observability spans all systems and make recommendations on how we can improve our observability and monitoring posture. They will also support API integrations, our developer portal and sandbox, and the general improvement of our integration workflow.
DUTIES AND RESPONSIBILITIES
Investigate, troubleshoot and resolve incidents; identify root causes through log analysis, using a mix of observability tools and manual log review; and produce reports on lessons learned.
Support workflows and service management, including closing incidents within service level agreements (SLAs), engaging third-line support where necessary, and managing problems by keeping bug records and root cause analyses (RCAs) up to date.
Assist developers consuming various APIs (synchronous, asynchronous, REST and SOAP), and enhance our integration workflows, documentation, developer portal and sandbox.
Work in collaboration with other engineering/IT teams to provide 24/7 support in line with ITIL and SRE principles.
Take part in testing, such as UAT, SIT and unit testing, before rollout.
Plan and execute business continuity planning (BCP) and disaster recovery activities.
Stay up to date with industry best practices and emerging technologies in APIs, infrastructure management, and monitoring.
Create, develop, and maintain comprehensive documentation for payment systems, including detailed architectural diagrams, technical specifications, integration plans, user guides, and troubleshooting procedures, ensuring clear and up-to-date resources for system implementation, integration and support.
Ensure the stability and performance of all platforms by making sure that monitoring is designed into each solution from the start, from logging through to coverage of all monitored items, rather than added as an afterthought.
Help ensure that observability and monitoring span all our systems, allowing for easy correlation of faults and identification of cascading failures.
Take the lead in complex integrations of payment systems with internal and external applications, including financial institutions, banks, and third-party payment processors.
System Design/Solution Architecture: Provide input into designs for availability, scalability, resilience, fault tolerance and elasticity.
Train and mentor new engineers, ensuring that they develop and grow their expertise in reliability engineering and IT governance principles such as change and incident management.
Evangelize reliability practices across the organization so that teams are familiar with reliability engineering.
Support containerised workloads, troubleshoot issues, and make recommendations on how we can improve our systems.
Develop automation scripts using any scripting language (e.g., Python, Bash, Ruby, or others) to streamline deployment, monitoring, and management tasks.
Troubleshoot and resolve API issues, security concerns, and system failures.
EDUCATION, SKILLS & COMPETENCIES REQUIRED
Bachelor's degree in Computer Science, Software Engineering, Information Technology or a related field.
Proficiency in APIs and API integrations, and experience supporting API-first solutions.
5+ years of experience troubleshooting and resolving production issues, particularly for API-based systems.
Knowledge of ITIL and SRE principles such as change management, incident management, SLAs, SLOs, blameless postmortems and problem management.
Good understanding of API technologies such as REST/JSON, REST/XML and SOAP.