Lead Site Reliability Engineer
Our client is seeking an experienced Lead Site Reliability Engineer to drive reliability strategy, operational excellence, and automation across global cloud infrastructure. This role is critical in ensuring platforms remain highly available, scalable, secure, and performant even during extreme traffic spikes or infrastructure failures.
As the Lead Site Reliability Engineer, you will combine deep technical expertise with leadership capability to build resilient distributed systems, lead incident response, and define reliability standards.
Key Responsibilities
Global Reliability Strategy
- Define and implement SRE vision, principles, and governance across global and regional teams.
- Establish enterprise SLIs, SLOs, SLAs, and error budget frameworks aligned with business impact.
- Standardize production readiness reviews and reliability assessments.
Platform Architecture & Scalability
- Architect and govern highly available, multi-AZ / multi-region cloud infrastructure (AWS).
- Lead Kubernetes platform strategy and container orchestration standards.
- Drive Infrastructure as Code adoption (Terraform preferred) across regions.
- Design global disaster recovery (DR) and business continuity strategies.
- Ensure resilience and elasticity during high-traffic events.
Operational Leadership
- Own the global incident management framework (SEV classification, escalation, communication).
- Lead major incident response and executive stakeholder updates.
- Conduct root cause analysis (RCA) and champion blameless postmortems.
- Drive measurable improvements in MTTD and MTTR.
- Reduce operational toil through automation and platform engineering best practices.
- Define enterprise observability strategy (metrics, logs, tracing).
- Standardize monitoring frameworks and alert quality across regions.
- Lead performance optimization initiatives (latency, throughput, resilience).
- Improve deployment reliability using progressive delivery models (Blue/Green, Canary).
Security, Risk & Compliance
- Enforce Least Privilege and Defense-in-Depth principles.
- Partner with Security teams to embed DevSecOps practices.
- Ensure compliance with global regulatory standards (SOC 2, ISO, PCI where applicable).
Key Qualifications
- At least 5 years of experience in SRE or a related field (DevOps, Platform Engineering, or Cloud Infrastructure).
- Proven leadership experience in global or multi-region environments.
- Strong track record managing high-availability, mission-critical production systems.
- Hands-on expertise with AWS and cloud-native architectures.
- Deep knowledge of Kubernetes and container orchestration.
- Hands-on experience with Infrastructure as Code (Terraform preferred).
- Strong understanding of CI/CD, GitOps, and automation-first practices.
- Experience with observability platforms (Prometheus, Grafana, ELK, Datadog, OpenTelemetry).
- Strong networking and distributed systems knowledge.
- Experience handling major incident management in high-pressure environments.
- Excellent stakeholder communication skills with global and regional teams.
Due to the high volume of applications, our team will only be in touch if your application is shortlisted.
Robert Walters Recruitment (Thailand) Limited
Recruitment License No.: น. 1188 / 2551
About the job
Contract Type: Perm
Specialism: Tech & Transformation
Focus: Architecture
Industry: IT
Salary: Performance Bonus
Workplace Type: Hybrid
Experience Level: Senior Management
Location: Bangkok
Employment Type: Full-time
Job Reference: 6NF05Z-1E344C75
Date posted: 27 February 2026
Consultant: Supapuck Siriprayoon