Site Reliability Engineer

  • Experience: 5-7 years
  • Location: México/LATAM
  • Contract: Remote/Full Time
Description:

We are hiring a development-oriented, collaborative, and detail-focused Site Reliability Engineer (SRE) responsible for solving operational, scalability, and reliability challenges. In this role, you will apply software engineering methodologies to system administration processes and collaborate with software engineers and product developers to optimize system performance, stability, and reliability.

The ideal candidate will focus on improving and automating operational tasks while ensuring system availability and scalability. Alongside your team, you will be responsible for latency, performance, efficiency, monitoring, emergency response, capacity planning, and change management.

We are seeking a proactive individual with strong leadership, resource management, and communication skills who thrives in a team-oriented environment. A background in development, combined with hands-on SRE or DevOps experience, is essential.

What will you do?

  • Gain a deep understanding of our platform, how it serves our clients, and how they interact with it.

  • Monitor and maintain system availability, performance, and overall health.

  • Build tools and systems to automate infrastructure management and operations.

  • Run production environments with a holistic view of reliability, uptime, and scalability.

  • Implement Infrastructure as Code (IaC) using tools like Terraform.

  • Develop and manage CI/CD pipelines for seamless code integration and deployment.

  • Create and maintain robust monitoring, alerting, and logging frameworks using tools such as New Relic, SumoLogic, Pingdom, CloudWatch, and CloudTrail.

  • Lead incident response efforts, perform root cause analysis, and implement preventative measures.

  • Participate in on-call rotations and ensure proper incident management and escalation.

  • Collaborate with developers to enhance release processes, testing, and deployment automation.

  • Document operational processes and create detailed runbooks/playbooks for emergency response.

  • Measure and optimize system performance using SLOs, SLIs, and key metrics.

Requirements:
  • 5–7 years of proven experience in a Site Reliability Engineering or DevOps role.
  • Bachelor’s Degree in Computer Science, Engineering, or related field, or equivalent practical experience.
  • Advanced English communication skills, both verbal and written.
  • Background in software development: hands-on past experience as a developer, even if you no longer work in a full-time development role.
  • Expert-level experience with AWS (mandatory) and cloud-native technologies.
  • Strong understanding of Linux system internals, networking, distributed systems, and service-oriented architectures.
  • Proficiency in containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Hands-on experience with Infrastructure as Code (IaC), particularly with Terraform.
  • Experience with relational databases (MSSQL, MySQL, Aurora MySQL) and NoSQL databases (especially DynamoDB).
  • Knowledge of observability concepts, including metrics, logging, tracing, SLOs, and SLIs.
  • Familiarity with CI/CD tools (e.g., Jenkins, CodePipeline, CodeDeploy).
  • Ability to lead and influence technical decisions in a cross-functional team environment.
  • Proactive mindset with strong problem-solving and automation skills.
  • Passion for continuous improvement, scalability, and operational excellence.

Why work with us?