Site Reliability Engineer (ITSM, Azure DevOps), Amsterdam
For our client in Amsterdam, we are looking for a Site Reliability Engineer (ITSM, Azure DevOps).
A collaborative, communicative Site Reliability Engineer will change the way we’re working.
You will be working in a customer-centric company where applications are essential, so we want to guarantee their full availability. You will join a team in which you play an important role in a global digital transformation.
The globally scalable platform will create a differentiating customer experience and cater for growth by leveraging the innovation and development power within the company.
The Site Reliability Engineering (SRE) team is a multidisciplinary team of senior engineers with proven track records in development and operations across applications and infrastructure. The primary goal is to continuously and structurally improve the reliability and maintainability of the IT environments involved with the Platform, delivered and managed from different (international) client domains.
Team vision: The Platform is our product, in Production, Acceptance and Test. Our technology setup, Ways of Working (WoW) and practices ensure our platform is highly available, always responsive, and scalable towards client entities around the globe without sacrificing velocity and agility. As we move forward, we are setting the benchmark of excellence for operating platforms.
Responsibilities & Activities:
• Ensure Service Level Objective (SLO) levels are set and met
• Drive an Always Available mindset and behavior within the organization. Be able to recognize shortcomings in knowledge and expertise, and deliver the necessary resources, skills, guidance and training to DevOps teams where needed.
• Define and enhance standards for logging, monitoring and alerting, and actively monitor end-to-end platform performance through white-box and black-box monitoring tools.
• Improve incident response practices and be actively engaged in the response to escalated and critical incidents. On-call duty is currently not part of the job, but should not be an objection if and when required.
• Participate in Root Cause Analysis (RCA). Prioritize and implement the RCA recommendations through improvement plans with the responsible Squads / DevOps teams.
• Drive Continuous improvement on all services in the Platform through analysis of the current level of service, functional and technical setup, code, dev/ops practices and the underlying causes of incidents, underperformance, etc.
• Organize and coordinate platform tests such as DDoS, DR, ceiling/break, and penetration tests.
• Set up and maintain automatic reporting and feedback loops
• Contribute to automating Build, Test and Deployment practices through the CI/CD pipeline
• Contribute to tuning application resources and updating highly available deployment patterns of (mostly) container-based and VM-based environments.
• Initiate and contribute to new SRE initiatives like AI Ops, Chaos Engineering, migrations to Public Cloud, and Error Budgeting
• Participate in and initiate experiments with new tools and concepts, and evaluate their value against set goals
You are an enthusiastic Software and/or Reliability Engineer with a focus on creating amazing solutions and frameworks. You have solid technical knowledge and use it to formulate solutions and to support and coach other engineers. You have a passion for highly resilient and reliable software and really hate repetitive manual tasks that keep you from doing really cool stuff! You are able to inspire squads to spread the SRE mindset. You are enthusiastic about transferring your knowledge to others within your team, as well as to all DevOps teams in the Tribe and the rest of the company.
• Operations expert: 5+ years of experience working according to Agile DevOps principles
• Solid understanding of how technology setup and ITSM processes relate to service level objectives like Availability (time-based, successful call rate, response times), MTTR, and MTBF.
• Good understanding of microservices architecture and related high availability / resilience patterns, and experience building systems with multiple layers of redundancy to withstand failures in software, hardware, and network infrastructure.
• Proven experience:
o worked as a Site Reliability Engineer or DevOps engineer
o scripted in at least one of the following: Ruby, Python, Bash, PowerShell
o set up Build and Deployment pipelines in Azure DevOps (ADO)
o set up white-box monitoring and formulated meaningful metrics for monitoring and reporting
• Able to coordinate/lead incident response and root cause analysis activities
• Understanding of IT Service Management processes
Prior work experience with tools:
• CI/CD Pipeline: Azure DevOps / Jenkins / GitLab
• Cloud computing and container orchestration: Linux VMs and Kubernetes container platforms. Knowledge of OpenShift and AKS, and related certifications, is a plus.
• Service mesh and SDKs
• Logging/monitoring/alerting: Kafka, ELK, and Prometheus. Experience with black-box monitoring tools like Rigor/Splunk and AI Ops tools like Loom is a plus.
• Backlog management: Azure Boards
• ITSM: ServiceNow (SNOW)
The ideal candidate has:
• A Bachelor’s or Master’s degree in computer science or a related field
• Experience coaching and training DevOps engineers on technical subjects
• Previous experience working in a DevOps team
• Understanding of application risk journeys