Expertise in designing, analyzing and troubleshooting large-scale distributed systems.
Good understanding of cloud technologies
Experience with algorithms, data structures, complexity analysis and software design.
Good understanding of Java and hands-on experience in troubleshooting nontrivial problems such as multithreading race conditions, memory leaks, and cache issues
Good understanding of SQL, experience with query optimization and performance tuning
Good understanding of high-load systems development practices, reliability measurement, and failover processes
Understanding of microservices architecture, containers, orchestration frameworks
Deep understanding of Unix/Linux systems administration
Knowledge and understanding of network theory (MAC addresses, IP packets, DNS, OSI layers, and load balancing).
Ability to get to the root cause of problems and facilitate this approach within the team
Ability to conduct post mortems and learn from past failures.
Ability to drive a continuous, measurable system improvement process
Good English communication and interpersonal skills
Experienced colleagues who are ready to share knowledge;
The ability to switch projects and try out different roles;
More than 150 workplaces for advanced training;
Study and practice of English: courses and communication with colleagues and clients from different countries;
Support for speakers who present at conferences and technology community meetings;
The ability to focus on your work: no bureaucracy or micromanagement, and convenient corporate services;
No dress code, a friendly atmosphere, and attention to the comfort of specialists;
Flexible schedule and the ability to work remotely;
The ability to work in any of our development centers.
Analyze and improve the availability, latency, performance, and efficiency of the applications
Proactively support production applications (both in-office and out of hours) across a range of domains; the applications are mainly written in Java and use Oracle databases.
Improve the monitoring and alerting of the applications
Capacity planning and provisioning
Improve and standardize build pipelines; identify areas of manual toil and reduce them through automation.
Consult in areas of reliability and scalability for the development of new applications.
Work together with teams in other departments to find solutions
Conduct periodic on-call duties
THE POSITION IS OPEN IN WROCLAW, LUBLIN, and SOFIA. RELOCATION OPPORTUNITIES AVAILABLE.
Our client is one of the biggest online retailers worldwide, with an annual revenue of £1 billion. Over the years we have helped the client develop web portals, mobile apps, delivery control systems, staff management tools, data storage and much more. The systems we’ve built together are in operation 24/7, contributing to the client’s success.
Site Reliability Engineering is a new role, first introduced by Google, that combines the skills of developers and ops to deliver more reliable, scalable software. The goal is to analyze a diverse set of applications (primarily built using Java, Oracle, AWS, Google Cloud services and a number of other technologies) and bind them into a reliable self-healing suite, working within defined reliability requirements. This requires proactive work to ensure observability, analyze potential bottlenecks, and suggest fixes before they cause production incidents.