Certified Remote
Published: Oct 7, 2025
Join Okta as a Principal Site Reliability Engineer specializing in Observability to lead the design and implementation of advanced monitoring and telemetry systems that ensure the reliability and scalability of our identity management platform. Collaborate with cross-functional teams to proactively identify and resolve issues, driving innovation in observability practices for a seamless user experience.
Okta is seeking a Principal Site Reliability Engineer with a focus on Observability to join our infrastructure team. In this leadership role, you will architect and optimize observability solutions that provide deep insights into our global, cloud-native identity platform, serving millions of users worldwide.
You will drive the evolution of our monitoring, logging, and tracing capabilities, ensuring we can detect, diagnose, and mitigate issues before they impact customers. You will also collaborate with engineering teams to embed observability best practices into the software development lifecycle, using tools like OpenTelemetry and advanced analytics to improve system performance and reliability.
Key responsibilities include leading observability initiatives, conducting root cause analyses on complex incidents, and scaling our telemetry infrastructure to support rapid growth. You will also mentor team members, contribute to open-source projects, and represent Okta at industry events on SRE and observability topics.
If you are passionate about building resilient systems and thrive in a fast-paced environment, this role offers the opportunity to make a significant impact on Okta's mission to secure and enable digital interactions.
The employer recommends obtaining this certification to validate your skills and enhance your application.
Note: You can still apply for this position without the certification. Having it will make your profile stand out, and it may be required to move forward in the hiring process.