| Manoj Kumar Vallabhaneni - Data Engineer |
| [email protected] |
| Location: Texas City, Texas, USA |
| Relocation: Yes |
| Visa: Green Card |
| Resume file: Manoj Sr Lead Data_Engineer _1765555985446.docx |
Manoj Vallabhaneni
Sr. Data Engineer
E: [email protected] | Phone: 502-230-2077 | LinkedIn: https://www.linkedin.com/in/manoj-v-23a418377/

PROFESSIONAL SUMMARY
- Data Engineer with 12+ years of IT experience delivering data-driven solutions across the healthcare, insurance, finance, and software industries.
- Specialized in Big Data and Cloud Engineering, building high-performance ETL pipelines, streaming systems, and enterprise data lakes across AWS, Azure, and GCP.
- Skilled in Spark (PySpark, Scala, Databricks), Kafka (MSK, Kinesis), and the Hadoop ecosystem for large-scale batch and real-time data processing.
- Proficient in data warehousing and integration using Snowflake, Redshift, Oracle, SQL Server, Informatica, Azure Data Factory, dbt, and Cloud Data Fusion.
- Strong expertise in data modeling (Star/Snowflake schemas) and BI/reporting with Power BI, QuickSight, Looker, and QlikView, delivering actionable insights to stakeholders.
- Hands-on experience in performance engineering using JMeter, LoadRunner, BlazeMeter, Dynatrace, and AppDynamics, optimizing systems under high concurrency and reducing latency by up to 40%.
- Experienced with Kubernetes, Airflow, Jenkins, and GitLab CI/CD, enabling containerization, orchestration, and automated workflows for data pipelines.
- Proven track record of achieving 20% infrastructure cost savings and 40% faster ETL runtimes, and of driving $35M YoY revenue uplift through data engineering optimizations.
- Designed and delivered HIPAA-compliant healthcare pipelines and led cross-domain migration projects, ensuring secure, scalable, and compliant analytics platforms.
- Adept at processing structured, semi-structured, and unstructured data (CSV, Parquet, ORC, Avro, JSON, XML, TXT) across high-volume ingestion pipelines.
- Extensive experience with on-prem to cloud migrations, modernizing legacy ETL and warehouse platforms into Snowflake, Redshift, and cloud-native data lakes.
- Skilled in data quality, governance, and security, with expertise in row-level security, column masking, IAM policies, encryption, and GDPR/HIPAA compliance.
- Strong knowledge of NoSQL databases (MongoDB, Cassandra, HBase, Cosmos DB) to support real-time, schema-flexible data solutions.
- Proficient in monitoring and observability frameworks, including the ELK Stack, Splunk, Grafana, and CloudWatch, for proactive troubleshooting and optimization.
- Delivered BI dashboards, APIs, and SQL-driven analytics on Snowflake/Redshift to support diverse consumption patterns across business units.
- Highly collaborative, working closely with business analysts, actuaries, data scientists, and product owners to design scalable solutions aligned to business needs.
- Demonstrated ability to mentor junior engineers, review ETL designs, and enforce best practices in performance tuning and data engineering standards.
- Designed and implemented real-time and batch ingestion frameworks using Kafka, Spark Streaming, and cloud-native services to support enterprise analytics.
- Led data migration projects from legacy RDBMS to cloud platforms (AWS, Azure, GCP), ensuring scalability, cost efficiency, and zero data loss.
- Delivered enterprise data models and compliant ETL pipelines supporting regulated industries, with a focus on accuracy, lineage, and audit readiness.
- Recognized for performance optimization: reducing ETL runtimes, improving query performance, and cutting infrastructure overhead through tuning and automation.
- Adept at building enterprise data models and developing high-quality ETL pipelines that ensure data accuracy, lineage, and compliance in regulated industries such as healthcare and insurance.
- Hands-on experience with data governance and security controls, including IAM, encryption, and row/column-level security, to meet HIPAA and GDPR standards.

CERTIFICATIONS
- Databricks Certified Data Professional
- AWS Certified Data Analytics - Specialty
- Azure Data Engineer (DP-203)
- Google Cloud Professional Data Engineer
- Snowflake SnowPro Core
- Databricks Data Engineer Professional

TECHNICAL SKILLS
Big Data & Processing: Apache Spark (PySpark, Scala, Spark SQL, Databricks), Hadoop (HDFS, Hive), Delta Lake (time travel, OPTIMIZE/VACUUM), Apache Hudi/Iceberg, IBM Streams
Cloud Platforms: AWS (S3, Glue, EMR, Redshift, Lambda, Athena, DMS, SNS/SQS, CloudWatch/CloudTrail, Lake Formation, Secrets Manager); Azure (Data Factory, Synapse, Cosmos DB, Data Lake, Blob Storage, Purview); GCP (BigQuery, Cloud Storage, Data Fusion)
Streaming & Messaging: Apache Kafka (MSK), Kafka Connect, Schema Registry, ksqlDB, AWS Kinesis
ETL & Data Integration: Informatica PowerCenter, AWS Glue, Azure Data Factory, Cloud Data Fusion, dbt (macros/tests), SQL-based ELT, Shell/Python frameworks
Databases & Warehousing: Snowflake (Streams/Tasks/Snowpipe), Amazon Redshift (Spectrum), BigQuery (partitioned/clustered tables), Oracle (PL/SQL), SQL Server (T-SQL), PostgreSQL, Teradata, MongoDB, Cassandra, HBase
Data Modeling: Dimensional (Star/Snowflake), OLTP/OLAP, canonical/curated data layers, data contracts
Orchestration: Airflow (DAG SLAs/retries), Azure Data Factory (parameterized templates), Prefect/Dagster
Observability & Logging: ELK (Elasticsearch/Logstash/Kibana), Grafana, CloudWatch, OpenTelemetry; DLQs, replay strategies
Programming & Scripting: Python (pandas, PySpark, NumPy, REST), Scala, SQL, Java, Bash/PowerShell
Containers & Platform: Docker, Kubernetes (autoscaling, secrets/configs), Helm; CI/CD with Jenkins, GitLab CI, GitHub Actions
Security & Compliance: IAM/RBAC, KMS encryption, tokenization, row/column-level security, audit trails; HIPAA/GDPR/SOX alignment
Performance Engineering: Load/stress/scalability testing (JMeter, LoadRunner, BlazeMeter), SQL/ETL tuning, CPU/memory profiling
BI & Consumption: Power BI, QuickSight, Looker, Tableau; HTML/JS dashboards

WORK EXPERIENCE

Norton Healthcare, Kentucky | May 2024 - Present
Lead Data Engineer
- Defined the Big Data strategy and roadmap; transformed complex business problems into scalable solutions using Spark, Kafka, Hadoop, AWS, and Azure pipelines, reducing data delivery delays by 35%.
- Installed, configured, and maintained enterprise ETL/data pipelines across AWS (Glue, EMR, Redshift, S3) and Azure Data Factory (ADF), ensuring 99.9% availability in production.
- Authored PySpark UDFs in Python for row/column transformations, aggregations, data labeling, and cleansing; improved pipeline reusability and reduced code duplication by 25%.
- Developed batch and streaming Spark applications (PySpark, Spark SQL, Scala) to process multi-TB datasets from S3, HDFS, and NoSQL stores (HBase, Cassandra), cutting ETL runtime by 40%.
- Built real-time streaming pipelines using Kafka (MSK), Spark Streaming, and AWS Kinesis to capture and process millions of events per day with latency under 2 seconds, stored as Parquet in HDFS/S3 (see the streaming sketch after this role).
- Automated Kafka regression validation scripts for ETL testing across Redshift, Oracle, MongoDB, and SQL Server using Python and shell scripts, reducing manual QA effort by 50%.
- Managed and optimized Kubernetes clusters (pods, config maps, secrets) with autoscaling to meet workload spikes, achieving 20% infrastructure cost savings.
- Designed and tuned IBM Streams applications integrated with Hadoop and Spark, handling low-latency, high-throughput clinical and operational streaming workloads.
- Converted SQL Server and Oracle stored procedures into distributed ETL on Hadoop (Hive, Spark SQL, Scala, Java), reducing dependency on legacy RDBMS.
- Deployed an Enterprise Data Lake on GCP (Google Cloud Storage + BigQuery) using Informatica and Cloud Data Fusion, enabling a centralized single source of truth for analytics.
- Implemented ETL pipelines into Azure Cosmos DB for semi-structured/unstructured data ingestion, leveraging NoSQL models to support real-time query access.
- Built monitoring/alerting frameworks with CloudWatch, CloudTrail, and Linux shell scripts, cutting incident resolution time by 30%.
- Ingested logs into Elasticsearch via Logstash and Kinesis Firehose and visualized metrics in Kibana dashboards, reducing job-failure detection time from hours to minutes.
- Developed REST APIs in Java, distributed ETL in Python, and SQL pipelines to integrate source and target systems; ensured end-to-end lineage and compliance.
- Processed hourly/daily batch jobs from diverse sources (Adobe, Kafka, AWS Glue, NoSQL) and unified them into common data models persisted in HDFS.
- Built Databricks Spark SQL jobs with advanced optimizations (distributed cache, partitioning, bucketing, map-side joins), improving Glue job performance by 35%.
- Created and deployed ADF pipelines (JSON templates) connecting Azure SQL, Blob Storage, ADLS, and Synapse DW; reduced ingestion latency by 40%.
- Modeled data using Star/Snowflake schemas in Erwin; improved BI query performance and reduced redundancy across OLTP/OLAP systems.
- Designed workflows in Apache Airflow for authoring, scheduling, and monitoring DAGs, providing observability and recovery for critical pipelines.
- Preprocessed near-real-time data from S3 via Spark Streaming and shell scripts, transformed it on the fly, and persisted it to HDFS, achieving sub-second latency SLAs.
- Built and optimized real-time dashboards in Power BI/HTML/JavaScript on top of BigQuery, Redshift, and Elasticsearch, providing stakeholders with actionable KPIs.
Environment: Spark (Scala, PySpark, Spark SQL, Databricks), Kafka (MSK), AWS (S3, Glue, Redshift, EMR, Lambda, Athena, Kinesis, SNS, SQS, DMS, CloudWatch, CloudTrail), Azure (ADF, Cosmos DB, Data Lake, Blob Storage, SQL DW, Synapse), GCP (BigQuery, GCS, Data Fusion), Kubernetes, IBM Streams, Informatica, Snowflake, Hive, HBase, Cassandra, MongoDB, Elasticsearch, Logstash, Kibana, Airflow, Erwin, Python, Java, SQL, Shell, Linux
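Illustrative sketch (not part of the original resume): a minimal PySpark Structured Streaming job of the kind described in this role, reading events from a Kafka (MSK) topic and landing them as Parquet on S3. The broker list, topic name, event schema, and S3 paths are hypothetical placeholders.

# Minimal Kafka-to-Parquet streaming sketch; all names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clinical-events-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("patient_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw Kafka stream; the value column arrives as bytes and is decoded to JSON.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
       .option("subscribe", "clinical-events")                          # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Micro-batch Parquet writes, partitioned by event type, with checkpointing for recovery.
query = (events.writeStream.format("parquet")
         .option("path", "s3a://example-bucket/curated/clinical_events/")            # placeholder path
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/clinical_events/")
         .partitionBy("event_type")
         .trigger(processingTime="30 seconds")
         .start())

query.awaitTermination()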
Brown-Forman, Kentucky | November 2022 - April 2024
Lead Data Engineer
- Orchestrated the setup of a centralized Data Lake, integrating Teradata and multiple source systems into the Hadoop ecosystem using Sqoop and Hive, enabling seamless enterprise-wide data access.
- Designed and implemented end-to-end data ingestion pipelines with PySpark and Scala, sourcing data from APIs, RDBMS, files, and Kafka streams to support both real-time and batch processing needs.
- Enhanced ingestion performance by 20x compared to legacy tools (StreamSets) by leveraging Spark's multi-core parallelism, reducing data availability SLAs from hours to minutes.
- Optimized Hive table structures for incremental data loads, including schema evolution, data type conversions, and automated error handling to ensure consistent ingestion quality.
- Developed partitioned SQL queries and Spark transformations to maximize performance on large-scale datasets stored in Parquet and Avro formats.
- Engineered secure and scalable data pipelines on AWS (EMR, S3, Redshift, Glue, Data Pipeline), ensuring fault-tolerant data transfers and compliance with governance standards.
- Automated and orchestrated workflows using Airflow, Oozie, and Autosys, scheduling Spark and shell jobs for efficient utilization of compute resources and timely data delivery (see the Airflow sketch after this role).
- Built Spark-based data quality frameworks with validation, reconciliation, and monitoring scripts, improving data reliability for downstream reporting and analytics.
- Leveraged NoSQL (HBase) for semi-structured data management and automated cleansing processes with Python and Hive scripts to handle diverse data sources.
- Designed MapReduce and Spark jobs for complex aggregations and transformations, supporting advanced analytics and reporting use cases across finance domains.
- Collaborated with cross-functional teams to deliver self-service analytics by preparing curated datasets in Hive/S3 and exposing them via Power BI and Tableau dashboards.
- Contributed to modernization efforts by migrating HiveQL pipelines to PySpark, improving maintainability, scalability, and execution efficiency.
- Applied best practices in Git, Jenkins, Maven, and Agile to streamline CI/CD workflows and accelerate delivery cycles across data engineering projects.
- Partnered with data scientists and business analysts to design reusable data models, accelerating delivery of predictive analytics and risk-scoring use cases.
- Conducted performance tuning workshops, mentoring junior engineers on optimizing Spark and SQL jobs, which reduced recurring processing bottlenecks by 30%.
- Implemented row-level and column-level security in Hive and AWS environments, ensuring compliance with internal policies and regulatory requirements (SOX/HIPAA).
- Spearheaded a data migration initiative from on-prem Teradata to AWS Redshift, reducing infrastructure costs while improving scalability and query response times.
Environment: Spark (Scala, PySpark, SQL), Hadoop (Hive, Sqoop, MapReduce, YARN), AWS (S3, EMR, Glue, Redshift), Airflow, Oozie, Autosys, HBase, Teradata, PL/SQL, Python, Git, Snowflake, Jenkins, IntelliJ, Tableau, Power BI, Unix Shell
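Illustrative sketch (not part of the original resume): a minimal Airflow DAG in the spirit of the orchestration described in this role, scheduling a nightly Spark ingestion job followed by a shell-based reconciliation step. The DAG id, schedule, SLA, and script paths are hypothetical placeholders.

# Minimal Airflow DAG sketch; DAG id, schedule, and job/script paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                          # retry transient failures
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),             # flag runs that miss the delivery SLA
}

with DAG(
    dag_id="nightly_sales_ingestion",      # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",         # nightly at 02:00
    catchup=False,
    default_args=default_args,
) as dag:

    ingest = BashOperator(
        task_id="spark_ingest",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/ingest_sales.py --run-date {{ ds }}"   # placeholder Spark job
        ),
    )

    reconcile = BashOperator(
        task_id="reconcile_counts",
        bash_command="/opt/jobs/reconcile_counts.sh {{ ds }}",  # placeholder shell check
    )

    ingest >> reconcile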
State of Florida Department of Health, FL | July 2021 - October 2022
Sr. Data Engineer
- Partnered with business and technical stakeholders to define non-functional requirements (NFRs) and created a performance testing strategy aligned to SLAs (p95 < 2.5s, error rate < 1%); see the latency-check sketch after this role.
- Conducted load, stress, endurance, and scalability tests using JMeter and LoadRunner, validating system resilience under more than 1,000 concurrent virtual users.
- Developed parameterized and modular test scripts in JMeter and LoadRunner (HTTP, WebSocket, REST protocols), reducing script maintenance time by 30%.
- Designed and deployed Kafka topics, producers, and consumers with optimized partitioning and replication; achieved stable throughput of 8,000 messages/sec with sub-second latency.
- Enhanced CI/CD pipelines (Jenkins, GitLab) with integrated data validation checks and automated regression test triggers, improving release confidence and cutting post-release defects by 35%.
- Created performance regression suites covering capacity, scalability, and stability testing, ensuring system reliability during quarterly release cycles.
- Tuned performance scripts by leveraging correlation, dynamic data handling, and think-time logic; improved test efficiency and reduced false failures by 25%.
- Implemented BlazeMeter for continuous performance testing; integrated it with CI/CD pipelines to auto-trigger tests during builds, cutting defect detection time from days to hours.
- Monitored performance metrics via APM tools (Dynatrace, AppDynamics), correlating system resource utilization with response times; resolved root causes of thread contention and memory leaks, improving p95 latency by 28%.
- Provided recommendations for Kafka consumer group scaling and partition strategy, improving system fault tolerance and reducing lag by 40% under peak loads.
- Authored comprehensive performance reports with graphs and trend analysis in Grafana/Excel, presenting findings to stakeholders to guide scaling and optimization decisions.
- Delivered continuous improvements by aligning test automation with DevOps, enabling faster feedback loops and supporting Agile delivery across multiple sprints.
- Collaborated closely with development and infrastructure teams to identify performance bottlenecks early in the SDLC, reducing costly production escalations.
- Designed performance baselines and benchmarks for critical insurance applications, ensuring consistent system behavior during peak policy renewal periods.
- Automated performance result validation and reporting pipelines, cutting manual reporting effort by ~40% and enabling quicker stakeholder sign-off.
- Implemented proactive monitoring dashboards in Grafana and Dynatrace that provided real-time visibility into system health, reducing incident response times.
Environment: JMeter, LoadRunner, BlazeMeter, Kafka, Jenkins, GitLab CI/CD, Dynatrace, AppDynamics, Grafana, REST/HTTP/WebSocket, Linux
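Illustrative sketch (not part of the original resume): a small Python check in the spirit of the SLA validation described in this role, measuring p95 latency and error rate for an endpoint under concurrent requests. The URL, request count, and concurrency are hypothetical placeholders; the actual testing in this role used JMeter, LoadRunner, and BlazeMeter rather than a script like this.

# Minimal p95/error-rate check; endpoint and load levels are placeholders.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://example.internal/api/health"   # placeholder endpoint
REQUESTS = 500
CONCURRENCY = 50

def timed_call(_):
    start = time.perf_counter()
    try:
        ok = requests.get(URL, timeout=10).status_code < 400
    except requests.RequestException:
        ok = False
    return time.perf_counter() - start, ok

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(timed_call, range(REQUESTS)))

latencies = [lat for lat, _ in results]
errors = sum(1 for _, ok in results if not ok)

p95 = statistics.quantiles(latencies, n=100)[94]   # 95th-percentile latency
error_rate = errors / REQUESTS

print(f"p95 latency: {p95:.2f}s (target < 2.5s)")
print(f"error rate: {error_rate:.2%} (target < 1%)")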
Nationwide Insurance, California | Jul 2019 - June 2021
Data Engineer
- Conducted data quality assessments on legacy systems, applying SQL validations and reconciliation scripts to sign off on migration readiness with 99.9% accuracy.
- Designed, migrated, and maintained ETL pipelines from legacy on-premise systems to AWS (Glue, PySpark, S3, Lambda), processing 3-5 TB/day and reducing batch cycle times by 42%.
- Optimized AWS Lambda functions (memory, concurrency, cold-start reduction) to handle high-volume event streams (500K+ events/day), improving throughput by 30%.
- Automated PySpark ETL jobs in the AWS Glue Catalog with partitioning and job bookmarks, cutting job runtimes by 35% and reducing downstream SLA violations (see the Glue sketch after this role).
- Collaborated with actuarial teams to measure Policy Renewal Rate post-migration; the uplift in retention drove $35M YoY revenue growth attributable to improved data accuracy and availability.
- Integrated BlazeMeter performance tests with Grafana dashboards, creating real-time visualizations of throughput, response times, and error rates for continuous performance monitoring.
- Diagnosed production performance bottlenecks using APM tools (Dynatrace, AppDynamics); resolved thread contention and inefficient SQL queries, improving average response time by 28%.
- Implemented ELK (Elasticsearch, Logstash, Kibana) and Splunk log analysis to detect anomalies, reducing critical incidents by 25% per quarter.
- Conducted CPU/memory profiling using PerfMon and custom monitoring scripts to identify resource bottlenecks; tuning efforts cut infrastructure overhead by 15%.
- Delivered comprehensive performance reports from load/stress tests, presenting findings to stakeholders and influencing system scaling policies that reduced costs by 20%.
Environment: AWS (Glue, S3, Lambda, CloudWatch), PySpark, SQL, BlazeMeter, Grafana, Dynatrace, AppDynamics, ELK, Splunk, Linux, Git
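Illustrative sketch (not part of the original resume): a minimal AWS Glue PySpark job in the spirit of the bookmark-enabled ETL described in this role, reading from the Glue Catalog and writing partitioned Parquet to S3. The catalog database, table, partition column, and output path are hypothetical placeholders.

# Minimal Glue job sketch; database, table, column, and path names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # required for job bookmarks when enabled on the job

# Incremental read: transformation_ctx lets the bookmark track already-processed data.
claims = glue_context.create_dynamic_frame.from_catalog(
    database="legacy_claims_db",          # placeholder catalog database
    table_name="policy_claims",           # placeholder table
    transformation_ctx="claims_source",
)

df = claims.toDF().repartition("claim_year")   # align file layout with the partition key

(df.write.mode("append")
   .partitionBy("claim_year")
   .parquet("s3://example-curated-bucket/claims/"))   # placeholder output path

job.commit()   # persists bookmark state for the next incremental run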
Autodesk, California | Sep 2017 - June 2019
Data Engineer
- Defined and tracked KPIs for custom case support efficiency (average resolution time, SLA compliance %), enabling support teams to improve case turnaround and reducing backlog by 18%.
- Collected user feedback on dashboard usability and satisfaction; iterated on design changes that drove an 11.5% net profit improvement through better decision-making visibility.
- Built SQL views and LookML Explores for support case data, ensuring accurate metrics and faster query performance (40% reduction in query time).
- Designed and deployed interactive dashboards in Looker and QlikView, now live in production, providing executives with real-time insights on customer support KPIs.
- Modeled data for efficient storage and retrieval (fact/dimension schema design), improving dashboard refresh performance by ~30%.
- Preprocessed structured/unstructured data using Python (pandas) and SQL for cleaning, normalization, aggregation, and transformation; increased data quality to 98%+ accuracy.
- Developed drill-through and drill-down reports with parameters and filters in Looker/QlikView, enabling multi-level KPI analysis for business users.
- Applied advanced calculations (window functions, custom measures) in dashboards to uncover performance bottlenecks and patterns in case management.
- Partnered with business stakeholders to translate reporting needs into technical requirements, ensuring dashboards reflected accurate KPIs and supported faster decision-making.
- Introduced version control (Git) and peer-review processes for SQL/LookML changes, improving collaboration and reducing production defects.
- Streamlined ETL workflows by documenting best practices and automating data validation checks, which reduced manual QA efforts by 20%.
Environment: Looker, QlikView, SQL, Python (pandas, NumPy), Data Modeling (Star/Snowflake), Git, Jira, Linux

Client: USAA, USA | Dec 2015 - Nov 2017
Role: Data Engineer
- Partnered with business analysts and technical teams to translate business requirements into scalable architectural solutions, ensuring seamless alignment of ETL workflows, data warehouse design, and reporting needs.
- Designed and developed ETL processes using Talend and shell scripts, automating data ingestion from SFTP into Amazon S3 and HDFS, including metadata table updates, file validation, and reconciliation checks for improved data quality.
- Implemented a multi-layered Amazon Redshift data warehouse (Staging, Consolidated, Published, and Outbound layers), enabling efficient data transformation, historical data retention, and business-rule application for reporting and analytics.
- Developed transformation logic in Talend to process historical datasets, apply business rules, and prepare outbound tables that supported three years of reporting for business users.
- Leveraged Snowflake as a high-performance analytical database for business stewards, supporting ad-hoc analysis while enabling BI teams to build interactive dashboards in Tableau.
- Coordinated with operations and architecture teams to validate technical changes and manage Talend job deployments, regression testing, and DB object modifications, ensuring compliance with enterprise standards.
- Provided go-live support and hyper-care for production deployments, collaborating with stakeholders to ensure timely delivery of data products and quick resolution of post-implementation issues.
Environment & Tools: Talend, Amazon Redshift, Snowflake, Tableau, Shell scripting, SQL, S3, HDFS, Metadata Management, ETL Development, Data Warehousing, Regression Testing, Agile

B2B Software Technologies, India | July 2013 - November 2015
Data Engineer
- Designed and implemented Informatica mappings and reusable mapplets to extract, transform, and load data from Oracle relational sources, ensuring 99.8% data accuracy in downstream reporting systems.
- Collaborated with business analysts to create Source-to-Target Mapping (STM) documents, translating business rules into ETL logic and reducing rework in development by 20%.
- Developed database objects including tables, stored procedures, functions, and triggers in Oracle PL/SQL, streamlining data validation workflows and improving execution efficiency by 30%.
- Optimized SQL queries through advanced indexing, partitioning, and query refactoring, reducing query response times by up to 45% for analytical reporting workloads.
- Applied performance tuning techniques at the session, mapping, and database levels in Informatica to improve ETL runtime by 40% for large datasets (100M+ records).
- Utilized LoadRunner (VuGen & Controller) to design and execute performance test scripts simulating 500-1,000 concurrent users, uncovering bottlenecks in CPU and memory utilization.
- Analyzed CPU, memory, and I/O utilization metrics in production and staging environments using PerfMon and AWR reports, resolving contention issues and eliminating memory leaks.
- Designed KPIs for sales and revenue tracking using SQL and Informatica transformations, providing actionable insights that improved forecasting accuracy by 15%.
- Executed stress, load, and endurance testing scenarios in LoadRunner, ensuring system stability under peak transaction loads and validating SLAs for response time (<3 seconds).
- Provided post-deployment production support, performing root-cause analysis of ETL and performance defects and reducing incident resolution time by 25%.
Environment: Informatica PowerCenter 10.x, Oracle 11g/12c, PL/SQL, LoadRunner 12.x, SQL Developer, Windows/Linux, Agile-Scrum