Senior Site Reliability Engineer

8-10 years | México/LATAM | Remote/Full Time | Advanced

Description:

We are hiring a highly experienced, development-oriented, collaborative, and detail-focused Senior Site Reliability Engineer (SRE) responsible for solving complex operational, scalability, and reliability challenges. In this role, you will apply software engineering methodologies to system administration processes while partnering closely with software engineers and product teams to improve platform stability, performance, and operational excellence.

We are looking for someone with extensive experience operating and scaling production systems who can confidently lead technical initiatives, automate infrastructure, improve reliability, and drive best practices across engineering teams. This is a senior-level position intended for engineers who have already managed large-scale production environments and can make sound technical decisions with minimal guidance.

Candidates must have a solid background as a Software Engineer before transitioning into Site Reliability Engineering or DevOps. We are specifically looking for engineers with hands-on software development experience who now apply software engineering principles to build reliable, scalable, and automated infrastructure.

What will you do?

Gain a deep understanding of our platform, how it serves our clients, and how they interact with it.
Monitor and maintain system availability, performance, and overall health.
Build tools and systems to automate infrastructure management and operational processes.
Operate production environments with a strong focus on reliability, scalability, performance, and uptime.
Design, implement, and maintain Infrastructure as Code (IaC) using Terraform.
Develop and improve CI/CD pipelines for efficient and reliable software delivery.
Build and maintain monitoring, alerting, and logging solutions using tools such as New Relic, Sumo Logic, Pingdom, CloudWatch, and CloudTrail.
Lead incident response, perform root cause analysis, and implement long-term corrective and preventative solutions.
Participate in on-call rotations and ensure effective incident management and escalation.
Collaborate closely with development teams to improve deployment strategies, testing, and release processes.
Create and maintain operational documentation, runbooks, and disaster recovery procedures.
Define, measure, and optimize system reliability using SLOs, SLIs, and key operational metrics.
Mentor other engineers and promote reliability best practices across the organization.

Requirements:

8-10 or more years of professional experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or closely related roles.
Bachelor's Degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Advanced English communication skills, both verbal and written.
Previous professional experience as a Software Engineer before transitioning into Site Reliability Engineering or DevOps.
Strong hands-on programming experience with the ability to design, develop, debug, and maintain production-quality software.
Ability to apply software engineering principles to infrastructure, automation, and operational challenges.
Proven experience operating, maintaining, and scaling large-scale production environments.
Expert-level experience with AWS and cloud-native technologies.
Deep understanding of Linux system internals, networking, distributed systems, and service-oriented architectures.
Strong experience with containerization and orchestration technologies, including Docker and Kubernetes.
Extensive hands-on experience with Infrastructure as Code (IaC), particularly Terraform.
Experience working with relational databases such as MSSQL, MySQL, and Aurora MySQL, as well as NoSQL databases, especially DynamoDB.
Strong knowledge of observability concepts, including monitoring, metrics, logging, tracing, SLOs, and SLIs.
Experience designing, building, and maintaining CI/CD pipelines using tools such as Jenkins, CodePipeline, or CodeDeploy.
Experience leading incident response, root cause analysis, and post-incident improvements.
Strong automation mindset with scripting or programming experience.
Ability to lead technical initiatives, influence architectural decisions, and collaborate effectively across cross-functional teams.
Excellent analytical, troubleshooting, and problem-solving skills.
Passion for continuous improvement, operational excellence, and platform reliability.

Nice to have:

Experience supporting large-scale SaaS platforms.
Experience with high-availability and mission-critical production environments.
Experience defining reliability standards and operational best practices.
Experience with capacity planning, performance optimization, and cost optimization in AWS.