• 10+ years of Infrastructure Engineering experience
• 5+ years managing Site Reliability, NOC, or mixed engineering teams and/or managers in globally distributed environments
• Experience in incident management and a strong understanding of ITIL service operations and Scrum methodologies
• Experience growing high-performing, globally distributed engineering teams
• Passionate about employee development with experience successfully coaching and managing managers and individuals to achieve goals
• Strong communication, organizational, analytical and problem-solving skills and attention to detail
• Experience in a large-scale Linux data center environment, with knowledge of administration and troubleshooting
• Process improvement and change management
• CI/CD mindset
• A passion for teamwork and collaboration, adaptability, communication, problem solving, customer focus, results, and innovation
• Entrepreneurial spirit, results-driven, strong communicator, aloha spirit
• High-level compensation and regular performance-based salary and career development reviews;
• Medical insurance (health), employee assistance program;
• Paid vacation, holidays, and sick leave;
• Sport compensation;
• English classes with native speakers, training, and conference participation;
• Team-building activities and corporate events.
This role will reshape, innovate, and restructure our globally distributed Site Reliability teams at Aspire Global. You will be responsible for building a resilient platform with customer experience at the center of what we do. You will be empowered to reimagine incident response by building out best-in-class tools, diagnostics, configurations, processes, and partnerships with a CI/CD mindset.
This role will balance technical expertise, influential leadership, and managerial skill. You will proactively set technical direction on incident bridges and marshal resources accordingly. You will also ensure that investigations follow appropriate troubleshooting paths and that monitoring, triage, and change execution remain optimal. This position will involve fostering and maintaining strong cross-functional relationships by ensuring the SRE team is a vital stakeholder in any process and procedural enhancements, including M&A. As a managerial leader, you will inspire, coach, and mentor your managers and individual contributors to turn their career aspirations into reality.
Lead and manage a team responsible for incident management, detection, change execution and approvals, and maintenance for all integrated properties, as well as root cause analysis and remediation and other proactive measures to improve the stability of customer performance and minimize the risk of impact to customers. The team works collaboratively with internal R&D teams and partners closely with various other teams to drive resiliency improvements and reduce our MTTD and MTTR. You will manage a highly skilled team that currently works on a shift rotation.
• Proactively ensure visibility across diagnostics, detection, configuration, and applications, and develop service ownership to fill gaps and detect issues in customer experience
• Create capabilities that enable the SR team to respond to incidents in a timely manner and find root causes
• Work successfully with other cross-cloud service owners (developers, DBAs, network engineers, etc.), maintaining positive relationships while exercising influence
• Take proactive measures that benefit customers beyond the scope of the current NOC SRE team: we want to actually solve the problems and configure visibility
• Be involved in public cloud tooling in Linux environments
• Collaborate on SR dashboards and analytics to give customers predictive insights into data center environments
• Passionate about engineering productivity, service ownership, and customer success
• Passionate about Continuous Integration and Delivery and driving teams to adopt this delivery model
• Excited by building reliable, self-healing services on unreliable hardware
• Experience designing, developing, debugging, and operating resilient distributed systems that run across thousands of compute nodes in multiple data centers