• 6+ years of experience leading technical operations for a high-usage, web-based software service, ideally built using open-source software components
• Understanding of a broad range of systems development, infrastructure transformation, and technical operations management tools in Ubuntu Linux environments (including Passenger, Nginx, Ansible, Datadog, Cloudflare, MySQL/PostgreSQL, Apache, iptables, ELK logging, Docker, Rancher, LDAP, OpenVPN, etc.)
• Understanding of a broad range of tools and systems within the Docker environment, including Docker itself and Rancher.
• Fully internalised process for deploying, changing, documenting and managing projects for technical operations, including various structured, standards-based and agile development methods. A deep understanding of, and experience with, setting standards and developing procedures that deliver an end-to-end, tightly monitored systems infrastructure.
• Experience in developing and deploying emergency backup and recovery, implementing a good security culture across server, application and physical security levels, and architecting fault-tolerant solutions.
• Proactive ability to define and document efficient and replicable processes, systems and structures to manage incidents, communications and planning.
• Exceptional communication skills, with the ability to prioritise and convey information clearly for a given context and audience.
Strong preference for on-site work at the Ivano-Frankivsk office, with relocation expenses reimbursed.
— Competitive salary.
— Career and professional growth.
— Cozy fully-equipped office.
— Great work-life balance with flexible working hours and free office lunches.
— Paid vacation, sick leave and a stipend for language classes, gym, IT events, etc.
— Regular performance reviews
• Dedicated AWS account (or bare metal servers, per your choice) for infrastructure automation testing, development and general learning.
• Retina MacBook Pro or another laptop of your specification, peripherals and displays included.
• Books, library & conference budget.
1. High Quality of Service & Incident Response
• Develop well-thought-out, detailed, numbered checklists to manage servers and perform maintenance and routine tasks. Attention to routine deadlines (domain renewals, SSL renewals, pen tests, server renewals, and so on) is an absolute must.
• Respond to operational issues within the agreed-upon response time based on severity (immediately if fully down; within 24 hours or the same business day for routine, non-urgent requests). Incident response must be managed according to our ISO 27001 plan.
• Maintain our ISO 27001 certification; ensure adherence to the ISMS standard across all teams within Faria through employee training and compliance monitoring; update the ISMS documentation to reflect new practices.
• Proper, end-to-end Incident Management Cycle (e.g. when all servers are down: respond and acknowledge first; formulate the plan based on established procedures; communicate during the incident; once the incident is resolved, post the reasons and an incident report for the application developer, and reflect on whether the checklist procedure can be improved).
• 3rd Party Incidents (e.g. Cloudflare): proactively work with the vendor until the issue is resolved; do not simply close the issue and wait for someone else to reach a resolution; communicate resolution attempts and the vendor dialogue to keep the application owner and product teams aware of progress.
• Ensure that SLAs are > 99.5% through the above actions (Cloudflare unavailability counts towards the SLA).
2. Server Management
• Maintaining our inventory list of servers, domains, and SSL certificates, with renewal dates and usage.
• Ensuring that servers are utilised; if a server is not being utilised, raise it with the application owner so it can be shut down to save costs.
• Audit & cross-check on a monthly basis to ensure that any new servers have been added to the master list and that any consolidation of applications onto single servers is reflected, with the aim of optimising server infrastructure for cost efficiency.
• Provide tools via Docker infrastructure to allow development teams to scale development and staging instances as needed for efficient work while monitoring costs and usage.
• Proper remote presence & etiquette (acknowledging requests in a timely fashion over Slack, never leaving requests unacknowledged).
• Tagging the appropriate person and persistently reminding them every 24 hours until full resolution is achieved (not letting things fall through the cracks).
• Effective adherence to Basecamp procedures (organising day-to-day work and large-scale tasks in a calm manner with priority-driven sequencing)
• Tracking of BAU and automation work within DevOps, with the goal of reducing BAU to 50% or less of daily operations on an ongoing basis.
• Manage the on-call list for operational incidents, with established procedures for detecting and notifying staff of critical changes to infrastructure.
4. Manage Domain & SSL certificates
• Configuring DNS and ensuring that domains are routed correctly across application and public websites (e.g. blog.faria.co was still active for 5 months after the faria.co public site relaunch).
• Proactively monitoring domain & SSL expiry to ensure successful renewal, with an eye towards cost management and 100% on-time renewals.
• Ensuring WHOIS details are kept up to date in line with ICANN requirements and reflect the correct legal entity owning the domains.
• Audit & cross-check on a monthly basis to ensure that any newly purchased domains have been added to the master Excel list.
Faria was founded in 2006 to transition schools off paper onto a Curriculum First learning platform, which acts as the core repository of a school’s curriculum and academic records, including attendance, assessment, coursework, exams and activities management.
Today, we serve over 2,300 international schools and over 600,000 students, including 4 in 5 IB Diploma students, in 120 countries with a distributed global team. Our graduating student cohort equals ~12% of all inbound international students attending university in the US, UK, Australia and Canada.
Our service commitment to schools encompasses global