Site Reliability Engineer (Thai Nationality)
Our client is currently looking for a Site Reliability Engineer to own the reliability, scalability, and performance of high-throughput, transaction-driven production systems. This role will focus on ensuring service availability, operational resilience, and safe delivery through strong observability, automation, and incident leadership practices.
The ideal candidate has deep hands-on experience operating complex distributed systems in production, leading incident response, and embedding reliability and security into system design and day-to-day operations.
Key Responsibilities
- Own service reliability for transaction platforms by defining and maintaining SLOs, SLIs, and error budgets.
- Ensure high availability, performance, and scalability of high-throughput, transaction-driven systems.
- Identify, troubleshoot, and remediate capacity, performance, and scalability constraints.
- Perform capacity planning and optimize systems for peak and sustained traffic.
- Design, implement, and operate observability solutions (monitoring, alerting, logging, tracing, dashboards).
- Analyze transaction flows to detect bottlenecks, failure modes, and systemic risks.
- Act as primary responder for production incidents, leading diagnosis, mitigation, and recovery.
- Define, execute, and continuously improve incident response, escalation, and post-incident review processes.
- Lead root cause analysis and implement preventive measures to reduce recurrence.
- Design and operate CI/CD pipelines to support safe, repeatable, low-risk deployments.
- Automate deployments, rollbacks, recovery, and operational workflows.
- Ensure services comply with security, access control, and regulatory requirements.
- Partner with security teams to implement encryption, IAM, and platform safeguards.
- Embed reliability, security, and operational resilience into system design and operations.
Key Qualifications
- Hands-on experience designing, operating, and scaling large-scale production systems with a strong focus on reliability, performance, and resilience.
- Strong analytical and troubleshooting capabilities across distributed systems, infrastructure, networking, and applications.
- Practical use of SLOs and error budgets to guide engineering priorities and operational trade-offs.
- Experience building and maintaining observability platforms, including metrics, logging, and distributed tracing.
- Proven ability to anticipate, assess, and reduce risks related to availability, latency, and capacity.
- High proficiency in automation and scripting to eliminate manual work and improve system stability.
- Hands-on experience operating workloads on cloud platforms such as AWS, Azure, or GCP.
- Solid operational knowledge of containerization and orchestration technologies, including Docker and Kubernetes.
- Experience designing, implementing, and governing CI/CD pipelines for production-grade systems.
- Working knowledge of security controls, IAM, and compliance requirements in transaction-heavy environments.
Due to the high volume of applications, our team will only be in touch if your application is shortlisted.
Robert Walters Recruitment (Thailand) Limited
Recruitment License No.: น. 1188 / 2551
About the job
Contract Type: Perm
Specialism: Tech & Transformation
Focus: Systems Integration
Industry: IT
Salary: Performance Bonus
Workplace Type: On-site
Experience Level: Mid Management
Location: Bangkok Province
Job Reference: JFP4GQ-9C336D6D
Date posted: 23 January 2026
Consultant: Supapuck Siriprayoon