Project Description
Our client is a Danish jewelry brand and one of the most famous in the world. On our side, we are building a team to improve automation processes and develop a long-term partnership.
As a Site Reliability Engineer, you’ll work with the team to create an SRE process from scratch for one of the biggest jewelry e-commerce projects in Europe. This includes assessing process maturity across several Dev teams and implementing observability with tools like NewRelic and OpsGenie for a range of existing on-premise and cloud (Azure) applications: e-com SFCC/SFRA, IBM Sterling OMS, Data & Analytics, and ERP.
Responsibilities
- Creating metric- and log-based monitors and dashboards (NewRelic) and alerting capabilities (OpsGenie)
- Defining SLOs and measuring SLIs and error budgets for production applications/services
- Improving operational KPIs like MTTD/MTTR, service availability, and reliability
- Onboarding production applications/services to the SRE process
- Ensuring site performance and capabilities by participating in performance, load, and stress testing
- Evangelizing the SRE mission across the company, including cloud engineering best practices and operational readiness
- Working with engineering teams to refine deployment and release processes
- Monitoring and stress testing systems to collect metrics for tuning and capacity planning
- Automating detection and resolution of recurring issues (problem management)
- Ensuring safety, predictability, repeatability, and suitability of all build and deploy processes
- Automating repetitive tasks and preventing incident recurrence
Skills Required
- Experience with Azure Cloud
- Experience with one or more of: Salesforce Commerce Cloud (SFCC, Demandware), IBM Sterling OMS, MS Dynamics ERP, web applications, REST APIs, event-driven architecture (Kafka)
- Experience with either .NET or Java software and systems
- Expert knowledge in all aspects of designing, developing, and managing large real-time systems
- Comfortable scripting and debugging distributed web-based applications
- Natural collaboration skills and an eye for continuous improvement
- Fluent in scalability and root cause analysis exercises (blameless RCA, postmortems)
- Dedicated to continuous integration and improving processes (ADO Pipelines creation & improvement)
- Strong hands-on technical experience in software deployment and operations on public cloud platforms, CI/CD, deployment automation, and pipelines
Nice to Have
- Experience with incident command/management (ServiceNow), ITSM and ITIL frameworks
- Experience training and educating the engineering organization as a whole on infrastructure and internal tooling (Azure cloud, NewRelic, Azure DevOps Pipelines), and writing runbooks/SOPs and ‘solution design articles’ for SRE/Support cases
- Proactiveness and persistence in driving the team’s tasks to completion with stakeholders inside the company as well as with third-party vendors
- Extreme ownership and knowledge sharing within the organization
- Ability to explain complex technical problems in simple words