Grafana SRE Architect (Onsite, Basking Ridge, NJ, USA) |
Email: [email protected] |
Grafana SRE Architect
Location: Basking Ridge, NJ (Onsite)

Job Summary
The Grafana SRE Architect will lead the design, implementation, and management of scalable, reliable, and performant Grafana-based observability solutions. This role bridges Site Reliability Engineering (SRE) practices with Grafana's ecosystem (Loki, Mimir, Tempo, etc.) to ensure robust monitoring, logging, tracing, and alerting for mission-critical systems. You will collaborate with DevOps, engineering, and infrastructure teams to align technical strategies with business objectives, driving automation, resilience, and cost efficiency across cloud and on-premises environments.

Key Responsibilities

Architecture & Design
Design end-to-end Grafana solutions for metrics, logs, traces, and dashboards, ensuring scalability, security, and compliance.
Architect integrations with Prometheus, Loki, Mimir, Tempo, and third-party tools (e.g., AWS CloudWatch, Datadog).
Define best practices for Grafana deployment (self-managed vs. Grafana Cloud) and optimize data storage and retention strategies.

SRE Leadership
Implement SRE principles: SLAs/SLOs/SLIs, error budgets, and blameless post-mortems.
Build automated monitoring and alerting systems to identify system bottlenecks and failures before they affect users.
Lead incident response, root cause analysis, and remediation for observability-related outages.

Collaboration & Integration
Partner with DevOps teams to embed Grafana into CI/CD pipelines and automate provisioning via IaC (Terraform, Ansible).
Work with developers to instrument applications for observability (OpenTelemetry, custom exporters).
Advise stakeholders on cost-effective monitoring strategies and resource optimization.

Performance Optimization
Tune Grafana dashboards, queries, and data sources for high-performance environments.
Optimize PromQL and Loki LogQL queries and manage large-scale time-series databases (Mimir).
Conduct capacity planning and disaster recovery testing for Grafana ecosystems.

Governance & Security
Ensure compliance with security policies (RBAC, SSO, encryption) and audit requirements.
Monitor Grafana stack health, perform upgrades, and enforce version control.

Mentorship & Innovation
Mentor SRE and engineering teams on Grafana best practices and SRE culture.
Stay ahead of Grafana and observability trends and pilot new tools (e.g., AI-driven anomaly detection).

Education & Experience
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
10+ years in SRE/DevOps roles, with 5+ years of hands-on Grafana experience.
Proven track record in designing large-scale observability solutions.
Experience managing offshore teams.
Open to working hours that overlap with offshore teams.

Technical Skills
Expertise in Grafana: dashboards, plugins, alerting, and integrations (Prometheus, Loki, Mimir, Tempo).
Cloud Platforms: AWS/GCP/Azure, Kubernetes, and serverless architectures.
Automation: Terraform, Ansible, Python/Go scripting.
Monitoring Tools: Thanos, Cortex, Jaeger, OpenTelemetry.
Database Optimization: time-series data (Mimir), log management (Loki).

Certifications (Preferred)
Grafana Certified: Observability Engineer/Administrator.
AWS/GCP/Azure Architect or DevOps certifications.

Soft Skills
Leadership in cross-functional teams and crisis management.
Strong communication for technical and non-technical audiences.
Analytical problem-solving and strategic thinking.

Preferred Qualifications
Contributions to Grafana/Prometheus open-source projects.
Experience with AI/ML model monitoring.
Knowledge of regulatory frameworks (GDPR, HIPAA).

--
Sai Krishna
Account Manager
[email protected]
LinkedIn: https://www.linkedin.com/in/sai-krishna-putta-845518232/
Email is the best way to reach me.
01:44 AM 28-Feb-25 |