What you will do
- Design, build, and maintain scalable backend and platform components;
- Implement and manage observability solutions across distributed systems;
- Configure dashboards, alerts, and APM for tracing, metrics, and logging;
- Monitor and improve system reliability, scalability, and performance;
- Deploy, operate, and maintain services in Kubernetes environments;
- Integrate observability tools into CI/CD pipelines and cloud infrastructure;
- Automate monitoring and operational workflows using scripting;
- Provide operational and training support for observability platforms, especially Datadog;
- Collaborate with engineering teams to improve system visibility and reliability practices.
Must haves
- 4+ years of experience with Python, Node.js, or Java;
- Hands-on experience with API integrations;
- Strong experience in Kubernetes environments;
- Experience with Datadog or similar tools such as Prometheus and Grafana;
- Ability to configure dashboards, alerts, and APM;
- Experience monitoring containerized and microservices architectures;
- Hands-on experience with AWS;
- Experience integrating observability tools into cloud environments;
- Experience with CI/CD integrations for observability;
- Ability to automate monitoring and operational tasks using scripting;
- Upper-intermediate English level.
Nice to haves
- Experience owning and operating an internal engineering platform, especially observability platforms;
- Demonstrated ownership of reliability, scalability, and performance;
- Ability to proactively lead maintenance and platform improvements;
- Experience installing and configuring Datadog agents and integrations;
- Experience managing API keys and secure configurations;
- Experience managing user roles and access controls;
- Familiarity with Go (Golang);
- Experience with additional observability tools such as New Relic, Dynatrace, Elastic Stack, or Splunk.
About the role
We are looking for a Senior Site Reliability Engineer to strengthen our platform reliability and observability capabilities. You will own the design and operation of monitoring infrastructure, including Datadog APM, alerting, and distributed tracing, across Kubernetes-based microservices on AWS. The role spans backend engineering and SRE practice in roughly a 65/35 split, with direct involvement in CI/CD integration and observability automation. You will also support internal teams in adopting monitoring best practices as we modernize our R&D platform.
The benefits of joining us
Professional growth
Accelerate your professional journey with mentorship, TechTalks, and personalized growth roadmaps
Competitive compensation
We match your ever-growing skills, talent, and contributions with competitive USD-based compensation and budgets for education, fitness, and team activities
A selection of exciting projects
Join projects that build modern solutions for top-tier clients, including Fortune 500 enterprises and leading product brands
Flextime
Tailor your schedule for an optimal work-life balance: work from home or from the office, whichever makes you happiest and most productive.
Your AgileEngine journey starts here
2 min
Tell us about yourself
2 sec
Confirm requirements
30-60 min
Pass a short test
5 min
Record a short video
→ Introduce yourself on video instead of waiting for an interview
Live interview
Ace the technical interview with our team
→ Schedule the call yourself as soon as your video is reviewed
Live interview
Final interview with your team
→ Get to know the team you will be working with
Get an offer
As quickly as possible







