Position: Site Reliability Engineer (Job Code - RR4157)
Location: Dallas, TX, USA
Company Description: Apexon is a digital-first technology services firm backed by Goldman Sachs Asset Management and Everstone Capital. We specialize in accelerating business transformation and delivering human-centric digital experiences. For over 17 years, Apexon has been meeting customers wherever they are in the digital lifecycle and helping them outperform their competition through speed and innovation.
Required Skills & Experience
Experience with monitoring tools (Datadog, Splunk, New Relic, Prometheus, Grafana, Nagios, etc.). Any experience with or exposure to modern DevOps tooling (AWS, Kubernetes, Terraform) is a plus.
Utilize existing tools to create telemetry streams from each system that DevOps maintains.
Track trends of key metrics to build a repeatable snapshot of the current state of all systems within DevOps and predict failures.
Correlate data from disparate systems to determine the underlying causes of issues that may be occurring in seemingly unrelated parts of the enterprise.
Monitor existing logging and monitoring systems, reducing unnecessary logging and correcting improperly tuned monitor probes.
Develop a suite of dashboards and tools that enable the SRE to track all incoming metrics and surface the most pressing issues.
Continually improve these dashboards to make their information more useful in real time as well as for after-the-fact analysis.
Generate "Postmortem" reports for unplanned outages or system failures.
Prepare "Scope of Impact" reports for upcoming planned outages or system changes.
Work with the other members of DevOps and the Infrastructure team to ensure that underlying resources are ready for failover and to help plan for future growth.
Maintain failover documentation and S.O.P.s.
Perform regularly scheduled failover testing in conjunction with the rest of the DevOps team, Infrastructure, and our business teams.
Continually seek to improve our failover procedures.
Desired Skills & Experience
Mastery of at least two programming languages (e.g., Python, Java, Go) with respect to designing, coding, testing, and software delivery.
At least two years of experience working with data systems.
The SRE is the "Control Tower" of DevOps. As such, they need to be familiar with how our data systems work and interact with one another.
The candidate should have a basic understanding of computer programming and data systems architecture.
Ability to interact with various groups within the business to inform them of the basic details of upcoming changes or to communicate the current state of system failures or outages.
Ability to interact with other developers and management to help define, implement, and enforce patterns for proper metric telemetry from systems, proper logging, and resilient failover patterns.
A drive to continually improve our system telemetry, uptime, and recoverability.
Disclaimer: If you feel that this is a good match for your skill set, please submit a current Word version of your resume along with a cover letter describing your skills, experience, and salary expectations. We are an Equal Opportunity Employer (EOE). You can read our job applicant privacy policy here.