Site Reliability Engineer (SRE) - Remote - 15+ Years of Experience Required - Remote, USA
Email: [email protected]
http://bit.ly/4ey8w48
https://jobs.nvoids.com/job_details.jsp?id=2258923&uid=

From: Laxmivijay, Procorp Systems [email protected]
Reply to: [email protected]

Hello,

Hope you're doing well!

Job Title: Site Reliability Engineer (SRE) / DevOps Engineer (Remote)

Overview:
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) / DevOps Engineer to join our team. In this role, you will be responsible for ensuring the reliability, performance, and scalability of our data infrastructure and applications. You will work closely with development and data teams to automate processes, optimize systems, and maintain a highly available environment. A key component of this role will be expertise in Apache Druid and Apache Airflow.

Responsibilities:

Reliability and Availability:
- Ensure high availability and reliability of production systems.
- Implement and maintain robust monitoring and alerting systems.
- Participate in on-call rotations to respond to incidents and outages.
- Conduct post-incident reviews and implement preventative measures.

Automation and Infrastructure as Code (IaC):
- Automate infrastructure provisioning, configuration, and deployment using IaC tools (e.g., Terraform, Ansible).
- Develop and maintain CI/CD pipelines to streamline software releases.
- Optimize and automate data pipelines and workflows.

Apache Druid Management:
- Manage and optimize Apache Druid clusters for high performance and scalability.
- Troubleshoot Druid performance issues and implement solutions.
- Design and implement Druid data ingestion and query optimization strategies.

Apache Airflow Orchestration:
- Design, develop, and maintain Airflow DAGs for data orchestration and workflow automation.
- Monitor Airflow performance and troubleshoot issues.
- Optimize Airflow workflows for efficiency and reliability.

Monitoring and Logging:
- Implement and maintain comprehensive monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack).
- Analyze metrics and logs to identify performance bottlenecks and potential issues.
- Create and maintain dashboards for visualizing system health and performance.

Collaboration and Communication:
- Collaborate with development, data, and operations teams to ensure smooth operations.
- Communicate effectively with stakeholders regarding system status and incidents.
- Document processes and procedures.

Qualifications:
- Strong experience in SRE or DevOps roles.
- Proficiency in Linux/Unix systems administration.
- Experience with cloud platforms (AWS, GCP, Azure).
- Strong scripting skills (Python, Bash).
- Experience with containerization and orchestration technologies (Docker, Kubernetes).
- Deep understanding of Apache Druid and Apache Airflow.
- Experience with monitoring and logging tools.
- Knowledge of CI/CD pipelines.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.

Thanks & Regards
Laxmivijay
US IT Technical Recruiter
ProCorp Systems Inc
2222 W Spring Creek Pkwy, STE 202, Plano, Texas 75023
Mail: [email protected]
LinkedIn: https://www.linkedin.com/in/laxmi-vijay-a5aa47231/
https://procorpsystems.co/

Keywords: continuous integration, continuous deployment, information technology, Colorado
06:51 PM 17-Mar-25