Patrianna is a super fast-growing product development company headquartered in Gibraltar, with colleagues around the world. We are looking for exceptional, smart people who strive to be number one: motivated and capable of scaling up business functions at pace through domain expertise and a desire to continuously improve.
13 December 2024

Monitoring Specialist (vacancy inactive)

Remote

Dive into the pulse of cutting-edge solutions with Patrianna LTD! 🚀

Are you ready to dive into the dynamic world of social gaming and be part of a rapidly expanding team? We’re on the lookout for a talented Monitoring Specialist (support) to join our Patrianna LTD team on a full-time basis.

🌟 What You Gain?

Dynamic Environment: Step into the heart of a super fast-growing social gaming company, where innovation and creativity thrive.
Global Impact: Be at the forefront of crafting a global social entertainment platform, with a primary focus on captivating the North American market.
Limitless Growth: Take your career to new heights with opportunities for advancement and personal development. Join us in the exhilarating journey of continuous growth.
Massive Reach: Contribute to the development of client web and mobile apps that engage with up to 150 million customers worldwide.
Commitment to Excellence: We’re dedicated to delivering high-quality code, ensuring predictable behavior in production, seamless scaling, and automation every step of the way.

We are looking for a skilled Monitoring Specialist to join our 24×7 SRE team. The ideal candidate will work outside business hours, European time, to ensure seamless operations and system reliability. This role focuses on monitoring and diagnostics across a multi-site production environment, primarily for Java-based applications on Google Cloud Platform (GCP). Using modern monitoring tools, the specialist will proactively identify, analyze, and resolve issues, maintaining high service performance and reliability.

Key Responsibilities:

  • Production Monitoring & Alerting
    • Oversee multi-site production environments using tools like Prometheus, Grafana, and Sentry to monitor application performance, database health, and event streams.
    • Continuously monitor performance metrics, setting up alerts to identify potential issues before they impact system availability.
  • Log Analysis & Diagnostics
    • Analyze logs across applications, databases, and event streaming services (Kafka) to detect irregularities and gain insights into root causes.
    • Use tools like ELK and GCP-native monitoring solutions to maintain visibility and optimize system behavior.
  • Database & Event Stream Monitoring
    • Monitor and tune performance for databases like PostgreSQL/AlloyDB and Spanner, focusing on query optimization, performance metrics, and troubleshooting.
    • Manage and monitor Kafka clusters, including consumer lag tracking and data pipeline health, to ensure continuous data processing.
  • Error Tracking & Troubleshooting
    • Use Sentry and similar tools to track, document, and resolve errors, escalating issues to the engineering team when necessary.
    • Follow troubleshooting protocols and assist in root cause analysis to resolve incidents in a structured and efficient manner.
  • Network & Security Insights
    • Use Cloudflare tools to monitor network performance and uphold security standards, with an emphasis on DDoS protection and latency optimization.
    • Work closely with the Engineering and DevOps teams to develop proactive monitoring and performance strategies.

Required Skills & Qualifications:

  • Cloud Platform Expertise: Knowledge of the Google Cloud Platform and associated services.
  • Monitoring & APM Tools: Experience with Prometheus, Grafana, Sentry, and ELK, plus familiarity with Kubernetes (K8s) and GCP-native monitoring solutions.
  • Database Systems: Knowledge of PostgreSQL/AlloyDB and Spanner, especially for performance tuning, query optimization, and diagnostics.
  • Event Streaming: Hands-on experience with Kafka, including the ability to monitor Kafka clusters, track consumer lag, and manage data pipeline reliability.
  • Networking & Security: Familiarity with Cloudflare, DDoS protection strategies, and network performance monitoring.
  • Problem-Solving Skills: Excellent analytical skills to troubleshoot complex, multi-layered cloud systems, perform root cause analysis, and address issues in a dynamic environment.

Nice-to-Have Skills:

  • Scripting: Experience with Python or Bash for automation and scripting tasks.

Schedule Requirements:

  • This role operates outside business hours, European time, to provide continuous coverage and support for our production environments.