| Veera - Sr. Site Reliability Engineer |
| [email protected] |
| Location: Dallas, Texas, USA |
| Relocation: |
| Visa: |
| Resume file: Veera B_1773067652540.docx |
|
Veera Bobba
Sr. Site Reliability Engineer
www.linkedin.com/in/veera-bobba-vbw83uug | [email protected] | +1 872-362-6006

PROFESSIONAL SUMMARY
- Senior Site Reliability Engineer with over a decade of experience building resilient, scalable cloud infrastructure across AWS, Azure, and GCP, specializing in reliability, observability, and automation for large-scale, customer-facing platforms.
- Deep experience with AWS services including EC2, S3, RDS, Lambda, EMR, MSK, EKS, Route 53, KMS, VPC, IAM, CloudWatch, CloudTrail, and Step Functions. Built multi-region infrastructure for pharmacy systems with Route 53 failover, developed serverless applications reducing costs by 30%, and managed Apache Spark on EMR and Apache Kafka on MSK.
- Executed an AWS cloud migration moving 50+ legacy applications from VMware vSphere to a microservices architecture on AWS, performed Oracle-to-RDS database migrations using AWS DMS and the Schema Conversion Tool, and designed disaster recovery using Route 53 health checks and RDS cross-region replicas, achieving a 4-hour RTO while maintaining PCI-DSS and SOX compliance.
- Proficient in Azure VMs, AKS, App Service, Azure SQL, Key Vault, Azure Monitor, Azure DevOps, and Data Factory. Automated VM provisioning using PowerCLI and Azure Resource Manager (ARM) templates, reducing setup time from hours to minutes; managed AKS clusters running production workloads; and integrated Azure Active Directory for SSO.
- Managed hybrid AWS/Azure environments maintaining HIPAA, PCI-DSS, SOC 2, and SOX compliance for PHI handling across multi-cloud infrastructure, using AWS PrivateLink and Azure Private Link, AWS KMS and Azure Key Vault for key rotation, unified IAM policies enforcing least-privilege access with MFA, and audit logging via CloudTrail and Azure Activity Logs.
- Hands-on experience with Google Cloud Platform (GCP) operating production workloads on Compute Engine and GKE, leveraging Cloud Functions, Cloud Run, Cloud SQL, BigQuery, and Cloud Storage, with Terraform-based automation and Cloud Monitoring, Pub/Sub, Secret Manager, and Cloud Armor enforcing reliability, observability, and security controls.
- Designed and operated multi-region disaster recovery on GCP using Cloud SQL automatic failover, GCS multi-region storage, and Cloud Load Balancing, achieving a 99.95% SLA for GKE-based microservices serving 500,000+ policyholders.
- Applied chaos engineering practices by simulating infrastructure and application failures in Kubernetes environments, validating failover, auto-healing, recovery automation, and alerting against defined reliability targets.
- Strong Linux/Unix administration background managing SUSE Enterprise Linux, Red Hat, Solaris 10, and HP-UX, with expertise in VXVM, LVM, ZFS, RAID, kernel tuning, patch management, and monitoring supporting high-availability environments.
- Strong Infrastructure as Code expertise using Terraform, CloudFormation, Cloud Deployment Manager, Ansible, Chef, and Puppet, reducing infrastructure setup from 3 days to 2 hours and configuration from 4 hours to 15 minutes while ensuring compliance.
- Built CI/CD pipelines using Jenkins, GitHub Actions, Azure DevOps, Harness, and CircleCI automating build, test, and deployment for Java and Node.js applications, using blue-green and canary deployment strategies enabling zero-downtime releases.
- Implemented GitOps workflows using Argo CD and Flux CD automating deployments for Spark clusters, Databricks workspaces, and Airflow on Kubernetes, reducing deployment time from 2 hours to 15 minutes with rollback capabilities.
- Experience with containerized platforms, building and operating Docker-based applications on Kubernetes (AKS, GKE, EKS), managing multi-namespace environments, autoscaling with the Horizontal Pod Autoscaler (HPA), configuring Ingress controllers, enabling communication between microservices using the Istio service mesh, and deploying workloads using Helm.
- Embedded DevSecOps controls across CI/CD pipelines by integrating SonarQube, Trivy, Snyk, Checkmarx, OWASP ZAP, and Burp Suite, enabling policy-based security gates that block high-risk builds before production.
- Proficient in serverless architectures using AWS Lambda, Azure Functions, Cloud Functions, Cloud Run, S3, SNS, SQS, and Pub/Sub, implementing AWS SAM for IaC and building fault-tolerant event-driven workflows that reduce operational overhead.
- Managed production-grade Apache Spark on AWS EMR and Apache Kafka on MSK, alongside Airflow on Kubernetes, focusing on automation, cost optimization, and observability to improve deployment efficiency and operational stability.
- Observability experience implementing Dynatrace, OpenTelemetry, Splunk, Grafana, Prometheus, Datadog, ELK Stack, and CloudWatch. Defined SLOs/SLIs establishing targets enabling data-driven decisions about stability versus feature velocity.
- Skilled in database administration across MySQL, PostgreSQL, MongoDB, Redis, RDS, Azure SQL, Cloud SQL, BigQuery, and Oracle. Improved Redis cache hit rates, reducing database load by 65% and API response times from 800ms to 250ms.
- Automated data quality validation using Great Expectations; leveraged GenAI tools such as GitHub Copilot and ChatGPT.
- Implemented a comprehensive FinOps strategy through Reserved Instance purchasing for baseline RDS workloads, Spot Instance orchestration for EMR batch jobs with automated fallback to On-Demand, and S3 Intelligent-Tiering for archived prescription data.
- Experience working with ticketing, incident, and collaboration tools including Jira, ServiceNow, PagerDuty, and Confluence, supporting incident management, cost optimization initiatives, and effective knowledge transfer through clear documentation.

TECHNICAL SKILLS
Cloud Platforms: AWS, Azure, GCP
AWS Services: EC2, S3, RDS, Lambda, EMR, MSK, EKS, Route 53, KMS, VPC, CloudWatch, CloudTrail, Step Functions, S3 Glacier, Boto3, DMS, SCT, VPC Peering, VPC Endpoints
Azure Services: AKS, Azure VMs, VM Scale Sets, Azure SQL, Key Vault, Azure Monitor, Data Factory, Azure DevOps, Azure AD, Load Balancers, App Service, Activity Logs, Azure CLI, Private Link
GCP Services: GKE, Compute Engine, Cloud Functions, Cloud Run, BigQuery, Cloud Storage, Cloud SQL, Secret Manager, Pub/Sub, Cloud Armor, Stackdriver, Cloud KMS, Cloud Logging, Cloud Load Balancing, Managed Instance Groups, Cloud Deployment Manager, Artifact Registry, gcloud
Data & Streaming Platforms: Spark, Kafka, Storm, Zookeeper, Airflow, Databricks, Delta Lake, Kafka MirrorMaker, AWS EMR, AWS MSK
ML/AI & Data Quality: MLflow, Great Expectations, Model Registry, Model Versioning
GenAI Tools: GitHub Copilot, ChatGPT
Build Tools: Maven, Gradle, Ant, npm, RPMBuild
Version Control: Git, GitHub, Bitbucket, GitLab, Azure Repos
CI/CD & GitOps: Jenkins, Azure DevOps, GitHub Actions, GitLab CI, Harness, CircleCI, Argo CD, Flux CD, AWS CodePipeline
Infrastructure as Code: Terraform, Ansible, Chef, Puppet, CloudFormation, AWS SAM, Cloud Deployment Manager
Container Orchestration: Docker, Kubernetes, OpenShift, Helm, EKS, AKS, GKE, KubernetesExecutor, Cluster Autoscaler, Karpenter, Ingress Controllers
Monitoring & Observability: Splunk, Grafana, Prometheus, ELK Stack (Elasticsearch, Logstash, Kibana), OpenTelemetry, CloudWatch, Azure Monitor, Stackdriver, Nagios, SLOs/SLIs
APM Tools: Dynatrace, Datadog APM, AppDynamics, New Relic
Databases & Caching: MySQL, PostgreSQL, Oracle, MongoDB, Redis, RDS, Azure SQL, Cloud SQL, Cosmos DB, BigQuery, Apache Cassandra
Web/Application Servers: Nginx, Tomcat, Apache HTTP Server
Operating Systems: Red Hat Linux, Ubuntu, CentOS, SUSE Enterprise Linux, Solaris 10, HP-UX, UNIX
File Systems & Storage: ZFS, LVM (Logical Volume Manager), VXVM (Veritas Volume Manager), Solaris Volume Manager, RAID
Security & Vulnerability Scanning: SonarQube, OWASP ZAP, Snyk, Checkmarx, Trivy, Burp Suite, JFrog X-ray, ECR Inspector
Compliance & Security Frameworks: HIPAA, PCI-DSS, SOC 2, SOX, NIST 800-53, CIS Kubernetes Benchmarks
Programming & Scripting: Python (Boto3), Bash, Shell, Java, PowerShell, Ruby, Perl, Groovy, YAML
Artifact Management: JFrog Artifactory, Nexus Repository
Service Mesh & Security: Istio, mTLS, Open Policy Agent (OPA), Kubernetes RBAC, IAM, Azure AD, Kafka TLS/SASL
Networking: VPC, Security Groups, NACLs, VPC Peering & Endpoints, PrivateLink, Azure Private Link, DNS, DHCP, NTP, Load Balancers
Virtualization: VMware vSphere, vCenter, PowerCLI
Cloud Cost Management: FinOps, Cost Optimization, Spot Instances, Reserved Instances
Collaboration & ITSM: Jira, ServiceNow, Confluence, PagerDuty
Additional Capabilities: Chaos Engineering, Multi-Region Disaster Recovery, RTO/RPO Planning, Blue-Green Deployments, Canary Deployments, GitOps

CERTIFICATIONS
- AWS Security Champion
- Microsoft Azure AZ-400
- GCP Professional Cloud DevOps Engineer
- GitHub Professional Certificate
- Docker Foundations Certificate
- Databricks Fundamentals Certificate

PROFESSIONAL EXPERIENCE
Walgreens: June 2023 – Current (Deerfield, IL)
Sr. Site Reliability Engineer / Platform Engineer
- Maintained 99.9% uptime for a pharmacy prescription data platform processing 50,000+ daily prescription transactions across 9,000 retail locations, managing infrastructure across AWS and Azure including Kafka, Spark, and Airflow for workflow orchestration.
- Implemented multi-region disaster recovery across AWS and Azure using Route 53 DNS failover, Kafka MirrorMaker for data replication, and RDS cross-region replicas, achieving a 4-hour RTO and 15-minute RPO validated through quarterly DR drills.
- Optimized Kafka clusters on AWS MSK handling 100,000+ messages/hour for prescription data streams, reducing consumer lag from 5 minutes to 30 seconds through partition rebalancing, consumer group tuning, and increased broker resources.
- Improved Redis caching for pharmacy portal APIs from a 60% to a 92% hit rate by implementing cache warming and optimizing TTL policies, reducing database queries by 65% and API response times from 800ms to 250ms.
- Built Terraform modules for deploying Spark clusters on EMR, Airflow on Kubernetes, and Kafka topics with standardized configurations, reducing environment provisioning from 3 days to 4 hours and eliminating drift across dev/staging/prod.
- Designed and operated multi-node Apache Cassandra clusters for high-volume prescription and patient activity data, handling millions of records per day with tunable consistency levels to balance latency and durability across regions.
- Implemented GitOps using Argo CD for Kubernetes deployments on EKS and AKS, enabling declarative infrastructure management and automated rollbacks, reducing deployment time from 2 hours to 20 minutes and improving deployment consistency.
- Deployed and scaled Apache Airflow on Kubernetes (EKS) coordinating 500+ DAGs for ETL pipelines and data workflows, implementing KubernetesExecutor for dynamic pod scaling, which reduced task queue backlogs and decreased job failures by 35%.
- Automated Spark job deployments on EMR using Terraform and AWS Step Functions, implementing auto-scaling policies based on queue depth and cluster utilization, reducing manual cluster management by 70%.
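The cache-warming and TTL work described above follows the standard cache-aside pattern. A minimal sketch, using an in-memory store as a stand-in for Redis (the class name, TTL value, and key names are illustrative, not from the original system):

```python
import time

class CacheAside:
    """Minimal cache-aside store with a uniform TTL (in-memory stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}          # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        """Return a cached value, or fall through to the backing store on miss/expiry."""
        entry = self._data.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = loader(key)                     # e.g. a database query
        self._data[key] = (value, now + self.ttl)
        return value

    def warm(self, keys, loader):
        """Cache warming: pre-load hot keys so their first reads are hits."""
        for key in keys:
            self._data[key] = (loader(key), self.clock() + self.ttl)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Warming hot prescription keys before peak traffic is what moves the measured hit rate; tuning `ttl_seconds` trades freshness against database load.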
- Optimized Spark workloads across EMR (batch processing) and Databricks (ML model training), tuning configurations such as executor memory and shuffle partitions, which improved query execution times by 40% for large-scale data transformations.
- Configured Databricks infrastructure and integrated MLflow for ML experiment tracking, model registry, and versioning, setting up S3 storage and IAM access controls, enabling data scientists to track 50+ ML experiments and promote models to production.
- Built and maintained Apache Storm topologies for real-time processing of prescription event streams from Kafka, enabling sub-second validation, enrichment, and routing of pharmacy transactions before persistence to downstream systems.
- Scaled the EKS cluster running Airflow workloads using Cluster Autoscaler and Karpenter, handling 2-3x traffic spikes during flu season by dynamically scaling executor pods and achieving 30% cost savings by running worker nodes on Spot Instances.
- Configured Delta Lake on S3 for ACID-compliant data lake storage, enabling reliable processing with schema evolution and time-travel capabilities for historical analysis and ML feature engineering pipelines while ensuring data quality.
- Established SLOs for critical data pipelines with 99.5% availability and implemented monitoring using Dynatrace for application metrics, Prometheus for infrastructure metrics, and Grafana dashboards tracking job success rates, latency, and throughput.
- Implemented Dynatrace APM observability for Spark workloads and API services, leveraging automatic instrumentation, service flow mapping, and Davis AI analytics to detect latency and error anomalies proactively, cutting MTTD by 30%.
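A 99.5% availability SLO like the one above implies a concrete error budget. A short sketch of the arithmetic (the function names are illustrative; the math itself is standard SLO accounting):

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability SLO over a window.
    A 99.5% SLO over 30 days permits 0.5% of the window as downtime (~3.6 hours)."""
    return window * (1.0 - slo)

def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    return 1.0 - downtime / error_budget(slo, window)
```

Tracking the remaining fraction is what lets a team make the stability-versus-velocity trade-off mentioned in the summary: freeze risky releases when the budget nears zero, ship freely when it is healthy.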
- Deployed Splunk Enterprise for centralized logging across all services, onboarding application and infrastructure logs via Splunk Universal Forwarders and HTTP Event Collector (HEC), building custom field extractions, sourcetypes, and index-time parsing, and implementing log retention policies to speed incident investigation and support compliance auditing.
- Integrated ServiceNow with monitoring alerts and PagerDuty for incident management, creating runbooks for common Kafka, Spark, and Airflow failures that reduced MTTR from 45 to 20 minutes through faster escalation and documented procedures.
- Conducted chaos engineering and resilience testing by simulating region failures, database corruption, and Kafka broker outages; validated automated failover and recovery workflows, documented lessons learned, and updated runbooks, consistently meeting 4-hour RTO and 15-minute RPO targets across multiple exercises with zero data loss.
- Configured AWS IAM and Azure AD with least-privilege policies for access to S3, RDS, Databricks, Kafka, and Airflow resources, implementing RBAC and MFA requirements to ensure HIPAA compliance for PHI data handling.
- Implemented encryption at rest using AWS KMS for RDS databases and S3 buckets, encryption in transit using TLS 1.3 for all APIs and Kafka connections (SASL/SSL), and VPC endpoints to keep PHI traffic off the public internet.
- Enabled AWS CloudTrail and Azure Activity Logs for audit compliance, capturing all API calls and administrative actions with 7-year retention in S3 Glacier, passing external HIPAA audits with zero critical findings.
- Automated security compliance checks using Python scripts that validate encryption status, public exposure, and IAM policies across all resources, generating weekly reports and auto-remediating common issues such as unencrypted S3 buckets.
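The encryption-status check described above can be sketched as pure policy logic. This is an illustrative example, not the original script: the input dict mirrors the shape of an S3 `GetBucketEncryption` response (which in production would come from boto3's `s3.get_bucket_encryption()`), and the "KMS key required" rule is an assumed example policy:

```python
def bucket_encryption_findings(bucket_name, encryption_config):
    """Evaluate one bucket's server-side encryption settings.

    `encryption_config` is shaped like an S3 GetBucketEncryption response body,
    or None when the API reports no encryption configuration. Passing the dict
    in (rather than calling boto3 here) keeps the policy logic testable offline.
    """
    findings = []
    if encryption_config is None:
        findings.append((bucket_name, "no default encryption configured"))
        return findings
    rules = encryption_config.get("ServerSideEncryptionConfiguration", {}).get("Rules", [])
    for rule in rules:
        sse = rule.get("ApplyServerSideEncryptionByDefault", {})
        algo = sse.get("SSEAlgorithm")
        if algo == "aws:kms" and not sse.get("KMSMasterKeyID"):
            # Example policy: KMS encryption must name an explicit customer key.
            findings.append((bucket_name, "KMS encryption without an explicit CMK"))
        elif algo not in ("aws:kms", "AES256"):
            findings.append((bucket_name, f"unexpected SSE algorithm: {algo}"))
    return findings
```

A weekly report is then just this function mapped over every bucket, with non-empty findings feeding the auto-remediation step.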
- Reduced monthly AWS and Azure costs by 18% through rightsizing EMR clusters based on utilization metrics, using Spot Instances for Spark batch jobs, and automating shutdown of idle development resources with Lambda functions.
- Developed Python automation scripts using Boto3 and the Airflow APIs for infrastructure monitoring, automated remediation of failed Airflow tasks, and proactive Kafka consumer lag alerts, reducing manual operational work by 50%.
- Implemented data quality validation using Great Expectations integrated into Airflow DAGs, catching schema changes and critical data anomalies before they impacted downstream ML models and reducing data quality incidents by 30%.
- Leveraged GitHub Copilot and ChatGPT to generate Terraform modules and Python automation scripts, accelerating development of infrastructure code and reducing time spent on repetitive boilerplate configuration.

Environment: Apache Spark, Apache Kafka, Apache Cassandra, Apache Storm, Apache Airflow, Databricks, Delta Lake, Redis, MLflow, Great Expectations, AWS (EC2, S3, RDS, Lambda, EMR, MSK, EKS, Route 53, KMS, VPC, CloudTrail, Step Functions, S3 Glacier, Boto3), Azure (AKS, Azure AD, Activity Logs), Kubernetes, Docker, Helm, Argo CD, GitOps, Terraform, Cluster Autoscaler, Karpenter, KubernetesExecutor, Spot Instances, Dynatrace, Splunk, Grafana, ServiceNow, PagerDuty, Python, Kafka MirrorMaker, TLS 1.3, SASL/SSL, Kubernetes RBAC, IAM, MFA, SLOs, HIPAA, Multi-Region DR, Chaos Engineering, RTO/RPO, GitHub Copilot, ChatGPT

Cisco: August 2020 – May 2023 (San Jose, CA)
Site Reliability Engineer / Observability Engineer
- Built a multi-cloud observability platform across Azure and GCP ingesting metrics, logs, and traces from 200+ microservices and 50+ AKS/GKE clusters, processing 10TB+ of telemetry daily with sub-minute latency while maintaining 99.9% availability.
- Managed Azure VM Scale Sets with Load Balancers and GCP Managed Instance Groups with Cloud Load Balancing, implementing health probes, autoscaling policies, and rolling updates to maintain availability during node failures and zone outages.
- Built data ingestion workflows using Azure Data Factory and GCP pipelines migrating telemetry and configuration data between systems, implementing validation checks and deduplication logic to ensure consistency without data loss during cross-cloud transfers.
- Provisioned AKS and GKE clusters using Terraform, implementing VPC/VNet design, private clusters, node pool separation, and namespace isolation for observability, ingestion, and analytics services, ensuring secure multi-tenant operation across environments.
- Developed reusable Terraform and Puppet modules abstracting cloud differences between Azure and GCP, standardizing naming conventions and tagging strategies, and enabling teams to deploy stacks without cloud-specific rewrites or manual adjustments.
- Automated VM and cluster lifecycle operations using vCenter, PowerCLI, Azure CLI, and gcloud, implementing health checks and compliance scanning, reducing manual provisioning effort and maintenance overhead across hybrid environments.
- Standardized networking, IAM/RBAC, tagging, and quota enforcement across Azure and GCP through policy-driven Terraform modules with validation and drift detection, reducing configuration drift and accelerating security review cycles.
- Designed CI/CD pipelines using Harness, Azure DevOps, and GitHub Actions automating deployment of observability agents, dashboards, and alert rules, incorporating automated validation, canary releases, and rollback triggers for safe deployments.
- Integrated Gradle-based automated testing into CI pipelines validating ingestion logic, alerting rules, and dashboard rendering before deployment, implementing unit and integration tests to prevent faulty configurations from reaching production.
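The canary-release gate mentioned above boils down to comparing canary metrics against a baseline before promoting. A minimal sketch of that decision, with illustrative threshold ratios (the function name and metric keys are assumptions, not from the original pipelines):

```python
def promote_canary(canary, baseline, max_error_ratio=1.5, max_latency_ratio=1.3):
    """Promote a canary only when its error rate and p95 latency stay within
    tolerated ratios of the baseline deployment (thresholds are illustrative)."""
    if baseline["error_rate"] > 0:
        if canary["error_rate"] / baseline["error_rate"] > max_error_ratio:
            return False                 # canary errors regressed too far
    elif canary["error_rate"] > 0:
        return False                     # baseline was error-free; canary is not
    return canary["p95_ms"] <= baseline["p95_ms"] * max_latency_ratio
```

In a real pipeline the two metric dicts would be pulled from the monitoring backend for the canary and stable deployments; a `False` result triggers the rollback path instead of promotion.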
- Built custom Java microservices for telemetry enrichment, aggregation, and alert routing deployed on AKS and GKE, designed to be stateless and horizontally scalable, enabling elastic scaling during traffic spikes and ensuring high availability.
- Implemented Helm-based deployment patterns for Alertmanager, Grafana, Fluent Bit, and Elasticsearch, supporting versioned releases, rollback procedures, environment-specific overrides, and GitOps-driven workflows for consistent deployments.
- Implemented PostgreSQL with automated backups for historical metric persistence and Redis for hot-path caching, reducing alert query latency from seconds to milliseconds and improving dashboard load times by 60% under peak usage.
- Defined SLIs and SLOs for ingestion latency (sub-minute p95), data freshness (under 2-minute lag), and alert delivery (99.9% success), enforcing 99.9% availability targets and using error budgets to drive reliability improvements.
- Integrated the ELK Stack with Azure Monitor and GCP Cloud Logging to correlate application logs with infrastructure events, accelerating root-cause analysis during incidents and reducing mean time to recovery (MTTR) from 45 to 20 minutes.
- Integrated Datadog APM for distributed tracing across AKS and GKE workloads, implementing trace sampling and service dependency mapping, enabling p95/p99 latency analysis and early detection of performance regressions before customer impact.
- Conducted current-state assessments across 200+ microservices, identifying blind spots in distributed tracing and log correlation, and designed and executed a target-state strategy moving from fragmented tools to a unified Datadog and OpenTelemetry framework.
- Performed gap analysis on existing telemetry ingestion pipelines, identifying redundant metric streams and implementing filtering at the OTel Collector level to reduce ingestion noise by 30% while improving fidelity for critical SLIs/SLOs.
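SLIs like "sub-minute p95 ingestion latency" come down to percentile computation over latency samples. A small sketch using the nearest-rank method (one of several percentile definitions; monitoring backends may interpolate differently):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample at or above p% of the distribution."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

def sli_met(latencies_ms, p, target_ms):
    """True when the pXX latency is within the SLI target."""
    return percentile(latencies_ms, p) <= target_ms
```

Evaluating `sli_met(window_samples, 95, 60_000)` over each rolling window is the shape of the "sub-minute p95" check; breaches feed the error-budget accounting.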
- Implemented automated anomaly detection on error rates and latency thresholds using machine learning models trained on historical patterns, enabling proactive incident detection, reducing mean time to detection (MTTD) by 40%, and minimizing false positives.
- Maintained ISO 27001 compliance by implementing information security controls including data classification, access management, incident response procedures, and regular vulnerability scanning, and conducting quarterly compliance audits.
- Ensured SOC 2 compliance for the observability platform by implementing encryption key rotation policies, maintaining access logs, enforcing MFA requirements, and documenting change management procedures, passing annual audits with zero critical findings.
- Enforced service-to-service security using Istio with mTLS encryption and Open Policy Agent (OPA) policies controlling workload access to telemetry data, establishing a zero-trust security model preventing unauthorized access to monitoring data.
- Secured telemetry pipelines using Azure Key Vault and GCP Cloud KMS for encryption key management, secret rotation, and certificate automation, implementing encryption at rest and in transit in line with Cisco security compliance requirements.
- Implemented Azure AD and GCP IAM-backed RBAC for AKS and GKE with namespace-level permissions, enforcing least-privilege access to sensitive monitoring data and production operations, and conducting quarterly access reviews.
- Created Bash, Python, PowerShell, and Groovy automation for self-healing workflows including automatic restarts of failed collectors, proactive gap alerts, disk cleanup, and compliance checks, reducing manual operational work by 60%.
- Integrated health-check automation with ServiceNow, creating priority-based incidents when SLIs breached thresholds and enriching tickets with diagnostics and intelligent routing to on-call engineers, improving SLA tracking and response times.
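The "restart failed collectors" self-healing loop above typically follows a probe/restart/backoff pattern. A hedged sketch with the probe and restart actions injected as callables (all names here are hypothetical; real automation would wrap a systemd unit, Kubernetes pod delete, or similar):

```python
import time

def heal(check_healthy, restart, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Self-healing loop: probe a collector and restart it with exponential
    backoff, returning False (escalate to a human/ticket) when retries run out.
    `check_healthy` and `restart` are injected so the policy stays testable."""
    for attempt in range(max_attempts):
        if check_healthy():
            return True
        restart()
        sleep(base_delay * (2 ** attempt))   # 1s, 2s, 4s, ...
    return check_healthy()
```

Returning `False` is the hook where the ServiceNow incident creation described above would fire, with diagnostics attached.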
- Authored Confluence runbooks covering Azure, GCP, observability troubleshooting, incident response playbooks, and disaster recovery procedures, reducing on-call escalations, accelerating new team member onboarding, and improving knowledge sharing.

Environment: Azure (VM Scale Sets, Load Balancers, AKS, Key Vault, Azure Monitor, Data Factory, Azure DevOps, Azure AD, Azure CLI), GCP (Managed Instance Groups, Cloud Load Balancing, GKE, Cloud KMS, Cloud Logging, IAM, gcloud), vCenter, PowerCLI, Terraform, Puppet, Python, PowerShell, Groovy, Bash, Docker, Kubernetes, Helm, Harness, GitHub Actions, Gradle, Java, Alertmanager, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Datadog APM, OpenTelemetry, SLIs/SLOs, PostgreSQL, Redis, ServiceNow, Confluence, Istio, mTLS, Open Policy Agent (OPA), RBAC, ISO 27001, SOC 2, VPC/VNet, GitOps

Progressive: October 2017 – July 2020 (Mayfield Village, OH)
DevOps Engineer / DevSecOps Engineer
- Modernized Progressive's cloud foundation by maintaining GCP environments supporting claims processing and policy systems; standardized infrastructure using Terraform and infrastructure-as-code best practices, improving consistency, reliability, and security while enabling teams to provision new environments 40% faster and accelerate delivery of critical insurance features.
- Ensured reliability for auto insurance underwriting applications by automating end-to-end provisioning on GCP, utilizing Terraform, Cloud Deployment Manager templates, and Chef to eliminate manual configuration errors and guarantee that every environment met the strict hardening requirements for processing sensitive insurance quotes and regulatory compliance.
- Developed event-driven serverless solutions using GCP Cloud Functions and Cloud Run to reduce operational toil for the engineering team, automating repetitive tasks such as resource tagging and security remediation and allowing the team to focus on innovation.
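Event-driven handlers like the serverless automation described in this section generally run under at-least-once delivery (Pub/Sub, for instance, can redeliver messages), so they need to be idempotent. A minimal dedupe sketch; the class and key names are illustrative, and a production version would persist seen IDs rather than hold them in memory:

```python
class DedupingConsumer:
    """Apply each event exactly once under at-least-once delivery by
    tracking processed message IDs (in-memory for illustration only)."""

    def __init__(self, handler):
        self.handler = handler
        self._seen = set()

    def receive(self, message_id, payload):
        if message_id in self._seen:
            return False               # duplicate redelivery: ack and skip
        self.handler(payload)
        self._seen.add(message_id)
        return True
```

The same idea, backed by a durable store keyed on the message ID, is what keeps a retried claims event from being processed twice.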
- Implemented GCP Pub/Sub to decouple claims processing workflows and handle traffic spikes gracefully.
- Balanced performance with cost-efficiency by managing complex data workloads across Azure Cosmos DB and GCP BigQuery, optimizing resource allocation so fraud detection analytics remained fast while keeping monthly cloud spend within budget.
- Supported a GitOps culture by establishing GitHub as the single source of truth for all infrastructure and application code, using Flux CD to automatically synchronize Helm charts with Google Kubernetes Engine clusters, enforcing declarative deployments, reducing manual intervention, and keeping production environments consistent, version-controlled, and audit-ready.
- Empowered development teams through high-velocity CI/CD pipelines built with GitHub Actions, Jenkins, and CircleCI, streamlining Gradle, npm, and RPMBuild workflows so Java and Node.js microservices could be promoted to production safely, consistently, and at scale with automated testing, security checks, release controls, and rollback support.
- Ensured software quality by embedding automated testing frameworks such as JUnit, Mocha, Selenium, and PyTest directly into release cycles, ensuring every claim-tracking update met high functional standards before reaching customers.
- Strengthened application stability on GKE by implementing Namespaces and Resource Quotas, preventing service interference and keeping high-traffic portals responsive during major weather events and peak renewal seasons.
- Improved application speed by optimizing RESTful API calls and tuning Tomcat configurations, reducing latency for high-traffic endpoints and directly improving the user experience for quote engines and customer dashboards.
- Maintained smooth access for millions of policyholders by using Nginx as a reverse proxy and ingress controller in GKE.
  This ensured high-traffic portals remained fast and responsive even during peak insurance renewal seasons or major weather events.
- Enforced a DevSecOps zero-trust security model by configuring granular IAM and Kubernetes RBAC permissions, strictly applying the principle of least privilege to protect sensitive PII and maintain compliance with PCI-DSS and HIPAA regulations.
- Prioritized policyholder privacy by integrating SAST tools (Snyk, SonarQube, Checkmarx) and DAST scanners (OWASP ZAP, Burp Suite) into Jenkins and GitLab CI/CD pipelines, establishing automated security gates that blocked vulnerable code from reaching production and ensuring continuous protection of sensitive customer data.
- Hardened the software supply chain by using Trivy to systematically identify and patch CVEs within Docker images, ensuring every artifact in the Artifact Registry complied with Progressive's strict security standards before reaching production.
- Defended against DDoS attacks by configuring Cloud Armor and VPC Firewalls, and moved the organization away from insecure practices by using Secret Manager to automate rotation of sensitive API keys and database credentials.
- Deployed New Relic APM to monitor Tomcat servers, capturing JVM performance metrics such as heap memory, garbage collection, thread pools, and transaction traces across distributed microservices; built custom dashboards tracking real-time metrics such as Apdex, error rates, and throughput, enabling proactive detection of performance degradation before customer impact.
- Boosted system visibility for the operations center by connecting Stackdriver and Grafana to Kubernetes clusters; proactive alerting strategies helped reduce system downtime by 30%, catching issues before they impacted policy issuance.
- Simplified troubleshooting by centralizing all logs through the ELK Stack (Elasticsearch, Logstash, Kibana).
  This gave the team a clear view of distributed system health and provided the audit trails needed to meet HIPAA and PCI-DSS compliance.
- Streamlined incident response by integrating ServiceNow for automated ticketing and Jira for tracking infrastructure blockers, ensuring production issues affecting policyholders were communicated clearly and resolved within strict SLAs.
- Eliminated manual toil by writing Python, Bash, and Ruby scripts to automate routine operations, from validating claim data to cleaning up idle resources, saving the team hours of manual work every week and reducing the risk of human error.
- Built a multi-region DR strategy on GCP leveraging Terraform for infrastructure replication, Cloud SQL automatic failover, GCS multi-region storage, and Cloud Load Balancing global traffic routing, conducting bi-annual disaster recovery tests validating the 99.95% availability SLA for insurance policy management systems serving 500,000+ policyholders.

Environment: GCP, GKE, Cloud SQL, BigQuery, Pub/Sub, Cloud Functions, Cloud Armor, VPC, VPC Firewalls, IAM, Cloud Deployment Manager, Artifact Registry, Secret Manager, Stackdriver, Terraform, Chef, Python, Bash, Ruby, Jenkins, GitHub Actions, CircleCI, Flux CD, Git, Docker, Kubernetes, Helm, Nginx, Java, Tomcat, RESTful APIs, New Relic, ELK Stack, Grafana, ServiceNow, Jira, DevSecOps, Trivy, Snyk, SonarQube, Checkmarx, OWASP ZAP, Burp Suite, Kubernetes RBAC, Service Account Key Rotation, PCI-DSS, HIPAA

Paychex: December 2015 – September 2017 (Rochester, NY)
Cloud DevOps Engineer
- Migrated legacy monolithic Java applications from on-premises VMware vSphere infrastructure to a decoupled microservices architecture on AWS, improving system reliability and scalability to handle loads during high-volume payroll processing cycles.
- Automated provisioning of AWS cloud infrastructure, including VPCs, EC2 instances, and RDS databases, using Terraform modules.
  This eliminated configuration drift, minimized human error, and reduced environment setup time from weeks to hours.
- Developed reusable CloudFormation templates to standardize application stacks across Development, QA, and Production, using parameters for environment-specific configuration and ensuring consistency to eliminate configuration drift.
- Wrote Ansible playbooks to automate post-provisioning configuration of EC2 instances and OpenShift nodes, ensuring all migrated servers consistently met strict internal security baselines and OS hardening requirements without manual intervention.
- Simplified the build process by setting up robust CI/CD pipelines in Jenkins, using Maven and Ant to compile, test, and package Java applications, delivering significantly faster feedback loops and eliminating manual handoffs between development and operations teams.
- Integrated automated testing frameworks such as JUnit and Selenium into the Jenkins pipeline and established strict quality gates using SonarQube, ensuring code vulnerabilities and bugs were detected and resolved early in the development lifecycle.
- Managed source code repositories in Bitbucket and Git by designing and enforcing branching strategies such as GitFlow and pull request policies, enabling parallel feature development, minimizing merge conflicts, and ensuring code stability.
- Executed the migration of on-premises legacy data from Oracle DB to Amazon RDS by configuring AWS Database Migration Service (DMS) and the Schema Conversion Tool (SCT), ensuring data consistency and minimal downtime during cutover.
- Managed highly available database layers using Amazon RDS (MySQL/PostgreSQL) with Multi-AZ deployments and read replicas, ensuring synchronous data replication and automatic failover to support critical 24/7 payroll processing.
- Managed JFrog Artifactory as the centralized binary repository for versioned Java artifacts and Docker images, creating a single source of truth that ensured consistency across environments and a secure, reliable chain for all build dependencies.
- Containerized microservices using Docker to ensure consistent execution and used Helm charts to manage complex deployments on OpenShift Kubernetes, standardizing configuration templates and drastically simplifying version control and release rollbacks.
- Executed a Blue/Green deployment strategy using OpenShift Routes to seamlessly switch traffic between versions, ensuring zero downtime for critical payroll applications and allowing for immediate, low-risk rollbacks if issues were detected in production.
- Collaborated closely with developers in Jira to maintain visibility into migration sprints and project status, facilitating daily stand-ups to identify and resolve infrastructure blockers before they impacted release deadlines.
- Monitored Java application performance using AppDynamics to analyze critical JVM metrics such as heap memory and garbage collection, while tracking OpenShift cluster health with Prometheus and Grafana to identify issues before users were affected.
- Configured AWS CloudWatch to centrally aggregate logs and metrics from all AWS resources, defining precise thresholds and alerts for critical infrastructure failures to ensure rapid incident response and reduce Mean Time to Resolution (MTTR).
- Developed automation scripts using Python (Boto3) to interact with AWS APIs and automate routine cloud operations, such as identifying orphaned EBS snapshots and enforcing resource tagging, removing the need for manual inventory checks.
- Wrote robust Perl and Bash scripts to automate routine system administration tasks such as custom log rotation and file cleanup, which prevented disk space exhaustion incidents and reduced the manual effort required for daily maintenance by 30%.
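The orphaned-snapshot check mentioned above boils down to flagging snapshots whose source EBS volume no longer exists. A minimal sketch of that logic, assuming the snapshot and volume inventories have already been fetched (in production they would come from Boto3's `ec2.describe_snapshots` and `ec2.describe_volumes`; the sample records below are hypothetical):

```python
# Illustrative sketch: a snapshot is "orphaned" when the EBS volume it was
# taken from is no longer present in the account's volume inventory.

def find_orphaned_snapshots(snapshots, existing_volume_ids):
    """Return snapshot IDs whose source volume no longer exists."""
    live = set(existing_volume_ids)  # set lookup keeps this O(n)
    return [s["SnapshotId"] for s in snapshots if s["VolumeId"] not in live]

if __name__ == "__main__":
    # Hypothetical inventory data shaped like the EC2 API responses.
    snapshots = [
        {"SnapshotId": "snap-001", "VolumeId": "vol-aaa"},
        {"SnapshotId": "snap-002", "VolumeId": "vol-gone"},
    ]
    volumes = ["vol-aaa", "vol-bbb"]
    print(find_orphaned_snapshots(snapshots, volumes))  # prints ['snap-002']
```

Keeping the decision logic separate from the API calls makes the cleanup rule easy to test without AWS credentials.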
- Secured the AWS environment by defining IAM policies that enforced the principle of least privilege and configuring restrictive Security Groups, while enabling CloudTrail to keep a full history of API changes for troubleshooting and security auditing.
- Designed a secure multi-VPC network architecture by using VPC Peering connections to separate shared services from application environments and VPC Endpoints to access S3 and DynamoDB, ensuring that data traffic never traversed the public internet.
- Ensured compliance with strict PCI-DSS and SOX regulations by implementing encryption at rest for all RDS databases and S3 buckets using AWS KMS and enforcing strict network segmentation to isolate and protect payroll data from unauthorized access.

Environment: AWS (EC2, S3, RDS, VPC, VPC Peering, VPC Endpoints, CloudFormation, CloudWatch, CloudTrail, IAM, Security Groups, KMS, EBS, DMS, SCT), VMware vSphere, OpenShift, Kubernetes, Docker, Helm, Terraform, Ansible, Jenkins, Maven, Ant, Git, Bitbucket, MySQL, PostgreSQL, Oracle, JFrog Artifactory, JUnit, Selenium, SonarQube, AppDynamics, Prometheus, Grafana, Python, Perl, Bash/Shell, Jira, PCI-DSS, SOX.

Ceva Logistics: May 2014 - October 2015 (Location: Mumbai, India)
Linux System Administrator
- Set up and managed file systems and RAID volumes using VxVM and Solaris Volume Manager on Solaris, plus LVM on Linux and HP-UX, to ensure high uptime and efficient disk space utilization, while working with ZFS on Solaris 10 for advanced storage management, including creating pools, taking snapshots, making clones, and migrating ZFS datasets between zones.
- Supported platform, application, and QA teams by maintaining their SUSE and Solaris environments, keeping configurations consistent and ensuring teams had the resources needed for testing and deployments, while tuning Linux and Unix systems by adjusting kernel parameters to optimize performance based on application requirements and resource utilization patterns.
- Built depot files for patches and managed package installations using depot tools on HP-UX, while building RPM packages with RPMBuild on Linux to standardize and simplify software rollouts across multiple servers, ensuring version consistency and reducing deployment errors through automated package management.
- Installed Nagios to monitor system health and send alerts when issues were detected, while troubleshooting VM and network problems to minimize downtime and maintain service availability, giving the operations team visibility into infrastructure status and enabling faster incident response across all critical systems.
- Kept DNS and NTP services running properly to maintain time synchronization and name resolution across the network, while maintaining MySQL servers to ensure high availability and optimal performance for applications relying on database services.
- Applied critical security patches and updates on Linux servers to maintain a current security posture and protect against vulnerabilities, while validating patch installations and conducting post-patch testing to ensure systems remained stable and applications continued functioning correctly after updates.
- Wrote shell scripts to automate routine administrative tasks including service restarts, log rotation, and system health checks, saving significant time and reducing manual operational overhead while ensuring consistent execution of maintenance procedures across multiple servers.
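The log-rotation and cleanup scripts described above ultimately come down to a simple decision: which files have grown past a size threshold. A minimal sketch of that decision rule (the original scripts were Bash/Perl cron jobs; the paths and the 100 MB threshold here are illustrative):

```python
# Hypothetical sketch of the rotation decision used in log-cleanup scripts:
# select the log files whose size exceeds a threshold.

def files_to_rotate(file_sizes: dict, max_bytes: int = 100 * 1024 * 1024):
    """Given {path: size_in_bytes}, return the sorted paths due for rotation."""
    return sorted(p for p, size in file_sizes.items() if size > max_bytes)

if __name__ == "__main__":
    # Illustrative sizes; a real script would stat the files on disk.
    sizes = {
        "/var/log/app/server.log": 250 * 1024 * 1024,  # over threshold
        "/var/log/app/audit.log": 10 * 1024 * 1024,    # under threshold
    }
    print(files_to_rotate(sizes))  # prints ['/var/log/app/server.log']
```

On most Linux distributions this job is normally handled by `logrotate`; a custom script like this is only needed for applications with non-standard rotation requirements.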
- Compiled and installed PostgreSQL 8.3.1 from source on SUSE Enterprise Linux 10 SP1, while developing shell scripts to handle automated startup procedures for different application environments, ensuring database services initialized correctly and application dependencies were met during system boot sequences.

Environment: Solaris 10, SUSE Enterprise Linux, Linux, HP-UX, VxVM (Veritas Volume Manager), Solaris Volume Manager, LVM (Logical Volume Manager), ZFS, Depot Tools, RPM/RPMBuild, Nagios, DNS, NTP, MySQL, PostgreSQL 8.3.1, Shell Scripting, Bash, Kernel Tuning, RAID, System Administration, Patch Management, Log Rotation, Service Management

EDUCATION
JNTUH, Bachelor's in Computer Science, June 2010 - April 2014