Hybrid Site Reliability Engineer ANY VISA at Atlanta, Georgia, USA
Email: [email protected]
Must have local Atlanta, GA (30354) - Local candidate

Job Title: Site Reliability Engineer (SRE) - AWS and Dynatrace
Location: Atlanta, GA (30354) - Local candidate
Duration: 12+ Months
Visa: No H1B/OPT
MOI: Skype
Need Linkedin-2 candidate

Position Overview:
As a Site Reliability Engineer (SRE), you will be responsible for designing and maintaining high-performance, scalable, and reliable systems. You will lead efforts in observability by building actionable dashboards using AWS CloudWatch and Dynatrace. Your role will also involve defining Service Level Agreements (SLAs) and Service Level Objectives (SLOs), driving automation for monitoring and alerting, and implementing strong practices for root cause analysis in AWS environments.

Key Responsibilities:

Observability & Monitoring:
- Build, maintain, and optimize observability dashboards using Dynatrace and AWS CloudWatch for real-time insights into system performance and reliability.
- Ensure comprehensive monitoring, alerting, and logging strategies are in place, leveraging AWS native services and Dynatrace for end-to-end observability.

SLA/SLO Management:
- Define, monitor, and manage SLAs, SLOs, and SLIs to meet service reliability and performance requirements.
- Work closely with development and operations teams to align on SLO targets and ensure reliability goals are met.

Root Cause Analysis (RCA):
- Conduct in-depth root cause analysis (RCA) for incidents, identifying key bottlenecks and system failures, particularly in AWS cloud technologies.
- Automate RCA processes and contribute to knowledge-sharing by creating postmortem reports.

Cloud Performance & Reliability:
- Improve the reliability, performance, and scalability of AWS-based systems by developing and implementing best practices in AWS architecture.
- Collaborate with DevOps and engineering teams to improve system infrastructure, deployment pipelines, and cloud automation.

Automation & Optimization:
- Automate repetitive tasks in monitoring, incident management, and alerting, reducing operational toil and manual intervention.
- Lead the effort to build self-healing systems by integrating auto-remediation tools and scripts for issue resolution.

Collaboration & Documentation:
- Partner with cross-functional teams to integrate observability into the development lifecycle, ensuring monitoring and logging requirements are met.
- Document system architecture, monitoring practices, and incident resolution playbooks for future reference.

Required Skills & Experience:
- 5+ years of experience as a Site Reliability Engineer (SRE) or related role with a focus on observability, monitoring, and AWS cloud technologies.
- Strong hands-on experience with Dynatrace for monitoring, root cause analysis, and performance optimization.
- Proficiency in AWS CloudWatch for creating dashboards, logs, alarms, and metrics.
- Demonstrated experience in defining and managing SLAs, SLOs, and SLIs to ensure system reliability.
- In-depth understanding of AWS technologies including EC2, Lambda, RDS, VPC, S3, etc.
- Expertise in root cause analysis and troubleshooting large-scale distributed systems in AWS.
- Experience with automation tools (e.g., Terraform, Ansible, or CloudFormation) for infrastructure and incident management.

Nice-to-Have:
- Familiarity with other monitoring tools such as Prometheus, Grafana, or New Relic.
- Knowledge of scripting languages (e.g., Python, Bash) for automating tasks.
- Experience working in a DevOps or Agile environment.

--
Keywords: sthree information technology Georgia Hybrid Site Reliability Engineer ANY VISA
11:33 PM 10-Jan-25