
Jithendra Bojedla - Senior Data Engineer
[email protected]
Location: Remote, USA
Relocation: Yes
Visa:
Resume file: Jithendra Bojedla_1779291389723.docx
JITHENDRA BOJEDLA
LinkedIn: https://www.linkedin.com/in/jithendra7915/ | Phone: (971) 340-7915 | [email protected]
SR. BIG DATA ENGINEER
PROFESSIONAL SUMMARY:
Bring 10 years of hands-on experience in information technology specializing in Big Data Engineering, delivering impactful data solutions across healthcare, retail, banking, logistics, and manufacturing sectors. Demonstrated expertise in building enterprise-scale data platforms that transform raw information into actionable business intelligence for diverse stakeholders.
Designed enterprise data pipelines supporting batch and real-time processing using Apache Spark, PySpark, Spark SQL, and Apache Kafka for streaming ingestion, implementing Python and SQL transformation logic to ensure high throughput and low-latency delivery across distributed systems.
Architected scalable data lake and warehouse solutions on AWS using S3, Redshift, Glue, EMR, and Lambda, integrated with Azure Data Factory, Synapse Analytics, Azure Data Lake Storage, GCP BigQuery, Dataflow, and Cloud Storage for petabyte-scale processing.
Built optimized ETL/ELT workflows using Python and SQL with Apache Airflow orchestration, implementing Star and Snowflake data modeling schemas to ensure efficient organization, query performance optimization, and seamless integration with business intelligence tools and analytical workloads.
Implemented streaming architectures using Apache Kafka integrated with Apache Spark Structured Streaming, PySpark, and Hadoop ecosystem components for high-volume ingestion, ensuring fault tolerance, exactly-once processing semantics, and maintaining data consistency from source to target systems.
Optimized query performance in Snowflake, Amazon Redshift, and Google BigQuery through advanced indexing, partition pruning, and materialized views, conducting comprehensive SQL tuning and Python transformation optimization to reduce processing time and minimize compute costs.
Developed CI/CD pipelines using Docker containerization and Kubernetes orchestration for automated deployment, implementing infrastructure as code using shell scripting and Python to ensure consistent provisioning, version control, and seamless promotion across development, staging, and production environments.
Established data quality frameworks and governance standards implementing automated validation rules, schema enforcement, and lineage tracking integrated with security controls, ensuring regulatory compliance, maintaining audit trails, and implementing role-based access controls across AWS, Azure, and GCP platforms.
Led technical design discussions and mentored engineers on distributed systems architecture, performance optimization, and cloud-native patterns, conducting code reviews focused on Apache Spark optimization, Kafka streaming patterns, and SQL tuning to improve team productivity and solution reliability.
Collaborated with data scientists and business stakeholders to design scalable solutions supporting AI/ML pipelines by building feature engineering workflows using Python and PySpark, implementing preprocessing pipelines that ensured clean, structured, and timely data for model training and inference.
Managed large-scale distributed systems within Hadoop ecosystem by configuring HDFS, YARN, MapReduce, and Apache Spark, implementing partitioning strategies, compression techniques, and cluster tuning using SQL and Python to maximize resource utilization and minimize job execution times.
Built data lakehouse architectures using Delta Lake and Apache Iceberg on AWS S3 and Azure Data Lake Storage, implementing ACID transactions, schema evolution, and time travel features with Spark and SQL to enable unified batch and streaming analytical workloads.
Implemented automated orchestration workflows using Apache Airflow integrated with AWS Glue, Azure Data Factory, and GCP Dataflow, developing custom operators in Python for retry logic, error handling, and alerting to ensure reliable end-to-end processing with comprehensive observability.
Developed real-time processing solutions using Apache Kafka and Apache Spark Streaming for low-latency analytics, implementing event-driven architectures with AWS Lambda and Azure Functions for serverless Java and Python data processing with automatic scaling capabilities.
Designed cloud-native transformation pipelines on AWS using EMR and Glue integrated with Azure Synapse, Spark pools, GCP Dataflow, and BigQuery, handling JSON, Parquet, Avro, and CSV formats using Python, SQL, and Java transformations.
Applied Agile and Scrum methodologies delivering projects through sprint planning and retrospectives, maintaining stakeholder communication to gather requirements, providing technical guidance on Snowflake, Redshift, SQL optimization, and ensuring alignment between technical implementations and business objectives.
Implemented advanced performance tuning for Snowflake using clustering keys and micro-partitions, optimizing Amazon Redshift with distribution styles and sort keys, and tuning Google BigQuery with partitioned tables using SQL to reduce query costs and improve response times.
Developed comprehensive security and compliance frameworks implementing encryption using AWS KMS and Azure Key Vault, establishing row-level security in Snowflake and BigQuery, implementing data masking with SQL and Python, and ensuring GDPR and HIPAA compliance.
Built containerized data processing applications using Docker and deployed on Kubernetes for scalable execution of Python, Java, and Spark jobs, implementing CI/CD automation using Jenkins and Azure DevOps for streamlined deployment and automated testing.
Architected Java-based microservices on AWS using Lambda, EMR, and Glue integrated with Apache Kafka and Spark, developing real-time data processing applications with Java frameworks, SQL, and Python for cloud-native execution and orchestration.
Demonstrated strong analytical and problem-solving abilities troubleshooting pipeline failures, investigating performance bottlenecks in Hadoop and Spark distributed systems, using advanced SQL debugging and Python profiling tools to identify inefficiencies and implement corrective measures.
TECHNICAL SKILLS:
Big Data Ecosystem Tools: Apache Spark, Databricks, Flink, Hive, HBase, Hadoop, MapReduce, YARN, Oozie, Zookeeper, Kafka, NiFi, Elastic MapReduce (EMR), Sqoop, Pig, Flume, Fact and Dimension Tables, Hortonworks Ambari, Cloudera Distribution, Scala.
Relational and NoSQL Databases: Snowflake, Amazon Redshift, BigQuery, Oracle, SQL Server, PostgreSQL, DB2, DynamoDB, Cosmos DB, MongoDB, Cassandra, HBase, RDBMS.
Programming Languages: Python, Scala, SQL, PL/SQL, Java, JavaScript, Shell Scripting, HTML, SAS, Perl.
ETL and Data Integration: AWS Glue, Azure Data Factory, Informatica, Talend, IBM DataStage, Apache Airflow, dbt, Spring Boot, Apache NiFi, Jenkins, TOAD, Teradata SQL Assistant, GitHub, Confluence, Shell Scripts, Notepad++, PuTTY, Visual SourceSafe.
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).
Services: Amazon Web Services S3, Glue, Redshift, Redshift Spectrum, Athena, Kinesis, MSK, Lambda, Step Functions, Lake Formation, SageMaker, OpenSearch Service, QuickSight. Microsoft Azure ADF, Databricks, Synapse, ADLS. Google Cloud Platform BigQuery, Dataflow, Pub/Sub, Dataproc.
Visualization and Reporting: Power BI, Tableau, Amazon QuickSight, MS Excel, SPSS.
Containerization & Orchestration Tools: Docker, Docker Hub, Docker Registry, Kubernetes, Terraform, CloudFormation.
Methodologies: Agile, Scrum, Kanban, Waterfall.
Libraries & ML Tools: Pandas, NumPy, SciPy, PyArrow, Matplotlib, Seaborn, Scikit-learn, TensorFlow.
Algorithms: Logistic Regression, Random Forest, Decision Trees, Support Vector Machine (SVM), Gradient Boosting, K-Nearest Neighbor (KNN), A/B Testing.
IDEs & Frameworks: Jupyter Notebook, PyCharm, Anaconda, Django, Flask, MLflow.

PROFESSIONAL EXPERIENCE:
Huntington Columbus, OH
Role: Sr. Big Data Engineer March 2024 - Current
Responsibilities:
Designed enterprise-scale ETL pipelines using Apache Spark, PySpark, and Spark SQL to process over 500 million daily banking transactions, implementing batch processing and real-time streaming with Apache Kafka for fraud detection while ensuring PCI-DSS and SOX compliance across AWS S3 and Amazon Redshift.
Built cloud-native data lake solutions on AWS utilizing AWS S3 for raw storage, AWS Glue for ETL orchestration and metadata cataloging, Amazon Redshift for analytics, and AWS EMR clusters running Apache Spark for distributed processing of transaction datasets.
Implemented Star schema and Snowflake schema designs within Amazon Redshift and Snowflake data warehouses, optimizing dimensional models for retail banking analytics while enforcing data governance policies through Collibra for lineage tracking and metadata management across all systems.
Developed complex ETL workflows using Python, SQL, and Shell scripting to transform financial data from core banking systems into Amazon Redshift and Snowflake, applying data quality validation rules and audit controls for SOX compliance requirements.
Engineered real-time streaming pipelines using Apache Kafka and PySpark Structured Streaming to capture high-velocity transaction events, enabling near-instant fraud detection while maintaining event ordering and exactly-once processing semantics across distributed Kafka clusters and AWS EMR environments.
Optimized query performance in Amazon Redshift and Snowflake through distribution key selection, sort key optimization, materialized views, and SQL query refactoring, reducing execution times by implementing columnar storage strategies and partition pruning for financial datasets.
Established CI/CD pipelines using Git for version control, Jenkins for automated deployments, Docker for containerizing Python and PySpark applications, and Kubernetes on AWS EKS to manage scalable data processing workloads across development and production environments.
Orchestrated end-to-end workflows using Apache Airflow, defining DAGs to schedule ETL jobs, coordinate dependencies between AWS Glue jobs, AWS EMR clusters, and AWS Lambda functions, implementing retry logic with Prometheus and Grafana monitoring integration.
Developed serverless solutions using AWS Lambda functions in Python for event-driven transformations triggered by S3 bucket events, integrating with AWS Glue data catalog and Amazon Redshift for automated ingestion from external banking partners and regulatory systems.
Implemented comprehensive data governance frameworks using Collibra for business glossary management and policy enforcement, establishing data lineage tracking across ETL pipelines, defining data quality rules in Python and SQL for GLBA and PCI-DSS compliance.
Built AI/ML pipelines preparing training datasets using PySpark and Pandas for credit risk modeling, integrating LangChain and Hugging Face Transformers for natural language processing, deploying LLMs for sentiment analysis with data stored in Amazon Redshift and Snowflake.
Maintained Hadoop ecosystem components including HDFS and Hive, migrating legacy on-premises Hadoop workloads to cloud-native AWS EMR and AWS Glue while preserving partitioning strategies and optimizing PySpark jobs using EC2 Spot Instances and auto-scaling.
Developed transformation logic using advanced SQL techniques including window functions and CTEs within Amazon Redshift and Snowflake, implementing slowly changing dimensions for customer history tracking and creating aggregate tables for regulatory capital adequacy reporting requirements.
Configured monitoring solutions using Prometheus for metrics collection from Kubernetes pods running PySpark applications, Grafana dashboards for pipeline performance visualization, and AWS CloudWatch for tracking AWS Glue, AWS Lambda, and Amazon Redshift cluster health.
Collaborated with data scientists to deliver scalable solutions for risk analytics and customer modeling, providing clean datasets through Amazon Redshift and Snowflake views, enabling self-service analytics through Tableau dashboards connected to optimized SQL data marts.
Mentored engineers on Apache Spark optimization, AWS best practices, Python standards, and data modeling principles, conducting code reviews through Git pull requests, leading technical design discussions for ETL pipelines, and establishing data governance and security standards.
Implemented security controls including AWS IAM role-based access, S3 bucket policies with encryption, Amazon Redshift row-level security, Snowflake data masking for PII protection, and audit logging to satisfy PCI-DSS, SOX, and GLBA compliance audits.
Utilized Erwin Data Modeler for designing Star schema and Snowflake schema architectures, maintaining enterprise data models for core banking entities, generating DDL scripts for Amazon Redshift and Snowflake deployment and physical implementation.
Participated in Agile development using JIRA for sprint planning and backlog refinement, delivering iterative enhancements to ETL processes and data pipelines, presenting technical solutions to risk officers and compliance teams in bi-weekly reviews.
Designed disaster recovery architectures implementing cross-region replication for AWS S3 data lakes, automated backup strategies for Amazon Redshift and Snowflake warehouses, and Apache Kafka topic replication, and documented incident response runbooks aligned with operational resilience requirements.
Processed banking transactions using Java-based Apache Spark jobs deployed on AWS EMR, integrating Java applications with AWS Lambda for serverless processing and using the AWS SDK for Java to interact with S3 and Amazon Redshift for real-time ingestion.
Developed Java utilities for data quality validation integrated with PySpark pipelines, deployed Java microservices on AWS EKS using Kubernetes and Docker, interfaced with Apache Kafka using Java consumer and producer APIs for streaming transaction processing.
Built Java-based ETL components using AWS Glue custom classifiers, integrated Java applications with Snowflake JDBC drivers for data loading, orchestrated Java batch jobs through Apache Airflow and monitored execution using Prometheus and Grafana dashboards.

Environments: AWS, S3, EMR, EC2, EKS, Glue, Lambda, Amazon Redshift, Snowflake, Apache Spark, PySpark, Spark SQL, Apache Kafka, Hadoop, Airflow, Python, SQL, Shell Scripting, Pandas, LangChain, Hugging Face Transformers, Docker, Kubernetes, Git, Jenkins, Prometheus, Grafana, Collibra, Erwin, Tableau, Jira
Johnson & Johnson New Brunswick, NJ
Role: Sr Data Engineer May 2021 - February 2024
Responsibilities:
Designed scalable data pipelines using Azure Data Factory, Apache Spark, and PySpark to process over fifteen million HIPAA-compliant patient records daily from HL7 FHIR interfaces while maintaining data lineage in Alation for data governance across Azure cloud infrastructure.
Built enterprise data lake solutions on Microsoft Azure using Azure Data Lake Storage, Azure Synapse Analytics, and Snowflake with Star schema modeling patterns to support clinical analytics workloads and optimized query performance using SQL across multi-terabyte healthcare datasets.
Developed real-time streaming architectures leveraging Apache Kafka and Apache Spark Streaming to ingest HL7 and FHIR messages from EHR systems, transformed patient data using PySpark and Spark SQL while ensuring HIPAA compliance throughout ETL workflows with validation checkpoints.
Orchestrated batch and real-time workflows using Apache Airflow with Python and Shell scripting, managed dependencies across one hundred fifty DAGs coordinating Azure Databricks jobs and Apache NiFi dataflows while maintaining CI/CD pipelines through GitHub Actions for version-controlled deployments.
Implemented data transformation logic using SQL, Python, and dbt within Azure Synapse to cleanse patient demographics and claims data from disparate systems, applied data quality frameworks and documented transformation rules in Confluence for regulatory compliance.
Optimized query performance in Azure Synapse Analytics and Snowflake warehouses by implementing indexing strategies and partition pruning using SQL, reduced query execution time by half for Star schema fact tables while collaborating with Data Scientists for AI/ML predictive modeling.
Built cloud-native solutions on Microsoft Azure utilizing Azure Data Factory, Azure Synapse Analytics, ADLS Gen2, Azure Databricks, and Azure Kubernetes Service to create comprehensive data lakehouse architecture implementing data governance policies through Alation with HIPAA-compliant access controls.
Deployed Docker containerized Spark applications orchestrated on Azure Kubernetes Service clusters, managed resource allocation for variable processing loads from batch jobs and real-time streams while implementing CI/CD automation using GitHub and GitHub Actions for consistent deployments.
Developed AI/ML data pipelines supporting predictive healthcare analytics by engineering feature stores using PySpark, NumPy, and Azure Databricks, created reusable transformation functions tracked in MLflow while ensuring PHI data adhered to HIPAA de-identification standards throughout machine learning workflows.
Implemented comprehensive ETL workflows using Apache NiFi for legacy system ingestion and Azure Data Factory for cloud processing, handled over two hundred data sources applying Shell scripting and Python for extraction while maintaining data lineage documentation throughout integration layers.
Established Data Governance and Data Quality frameworks by implementing validation rules using Python and SQL within ADF pipelines, created profiling reports in Power BI, and cataloged assets in Alation with metadata enabling cross-functional teams to discover HIPAA-compliant datasets.
Optimized distributed processing on Hadoop ecosystem and Apache Spark clusters by tuning Spark SQL execution plans and implementing broadcast joins for Snowflake schema dimension tables while using PySpark APIs to process nested HL7 FHIR JSON structures from healthcare APIs.
Designed CI/CD pipelines using GitHub Actions integrated with Azure DevOps for automated testing and deployment of ADF pipelines, Databricks notebooks, dbt models, and Docker images while implementing code reviews and compliance checks for HIPAA-regulated healthcare data.
Monitored production data pipelines using ELK Stack for centralized logging, implemented custom Python scripts for anomaly detection across Azure infrastructure, Apache Kafka brokers, and Spark applications ensuring rapid incident response and maintaining HIPAA compliance audit logs.
Collaborated with Data Scientists and stakeholders in Scrum ceremonies translating business needs into technical designs for Azure Synapse and Snowflake reporting layers, provided mentorship on PySpark optimization and SQL practices while leading design reviews for ETL solutions.
Engineered reusable Python libraries and PySpark transformation functions for HL7 message parsing and FHIR resource validation, packaged components as Docker containers deployed on Azure Kubernetes Service while maintaining version control in GitHub and documentation in Confluence.
Implemented data security controls across Azure Data Lake and Azure Synapse by configuring role-based access control and dynamic data masking for PHI fields ensuring HIPAA compliance while conducting regular access audits documented in Alation for data governance oversight.
Built automated data quality monitoring frameworks using Python, SQL, and Azure Data Factory to validate incoming healthcare datasets, implemented completeness and accuracy checks with alerts configured for data stewards and maintained detailed quality metrics in Power BI dashboards.
Designed dimensional data models using Star schema and Snowflake schema patterns in ER/Studio for clinical analytics domains, implemented slowly changing dimensions to maintain historical accuracy while optimizing physical implementation in Azure Synapse and Snowflake through distribution strategies.
Led implementation of Apache Kafka streaming platform for real-time patient event processing, designed topic structures integrating Kafka Connect with Azure services and implemented Spark Structured Streaming consumers written in PySpark ensuring HIPAA-compliant message encryption throughout streaming architecture.
Integrated Java-based custom processors with Apache NiFi and Azure services to handle complex HL7 message transformations, deployed Java applications on Azure Kubernetes Service for real-time validation and utilized Java libraries for FHIR resource parsing within cloud-native healthcare interoperability workflows.
Built Java microservices deployed on Azure Kubernetes Service consuming Apache Kafka streams for patient data processing, utilized Java Spring Boot frameworks integrated with Azure Data Lake Storage and Azure Synapse Analytics for seamless data ingestion and transformation across cloud infrastructure.
Developed Java-based data validation utilities integrated with Azure Data Factory and Apache Spark, implemented custom Java UDFs for Spark SQL processing and deployed containerized Java applications using Docker on Azure cloud ensuring HIPAA compliance throughout ETL workflows.

Environments: Microsoft Azure, Azure Data Factory, Azure Synapse, Azure Data Lake Storage, Azure Databricks, Azure Kubernetes Service (AKS), Snowflake, Apache Spark, PySpark, Spark SQL, Apache Kafka, Apache NiFi, Hadoop, Airflow, dbt, Python, SQL, Shell Scripting, NumPy, Scikit-learn, TensorFlow, MLflow, Docker, Kubernetes, GitHub, GitHub Actions, ELK Stack, Alation, ER/Studio, Power BI, Confluence
Home Depot Atlanta, GA
Role: Data Engineer July 2018 - April 2021
Responsibilities:
Designed scalable batch pipelines using Apache Spark, PySpark, and Python on Google Cloud Platform, processing over fifty million daily retail transactions while ensuring ETL workflows maintained data quality standards and PCI-DSS compliance requirements across distributed environments.
Built enterprise data warehouse solutions on GCP BigQuery and Snowflake implementing Star schema models using SQL and PowerDesigner, optimizing query performance through partitioning and clustering to support real-time retail analytics dashboards accessed by stakeholders.
Implemented streaming architectures using Apache Kafka, GCP Pub/Sub, and Apache Beam on GCP Dataflow to ingest point-of-sale transaction streams, enabling merchandising teams to respond to inventory fluctuations with sub-minute latency using PySpark Structured Streaming.
Developed complex ETL workflows using Python, Apache Spark, and Spark SQL to extract data from legacy retail systems, transform records, and load cleansed datasets into GCP BigQuery and Snowflake while applying Data Quality validation through Great Expectations.
Orchestrated end-to-end pipelines using Apache Airflow deployed on GCP GKE, defining DAGs with task dependencies for batch ETL jobs processing nightly sales aggregations, while integrating CI/CD pipelines via GitLab CI for automated Python and PySpark deployment.
Optimized query performance in Snowflake and GCP BigQuery through advanced SQL tuning including query rewriting, partitioning on transaction dates, and clustering on customer identifiers, reducing execution times and enabling near-real-time reporting for retail operations teams.
Established CI/CD pipelines using GitLab, GitLab CI, Docker, and Kubernetes to automate build and deployment processes for PySpark applications, containerizing Apache Spark jobs and deploying workloads on GCP GKE aligned with DevOps best practices.
Engineered data transformation logic using Python, SQL, and PySpark to cleanse retail transaction data and customer loyalty interactions, applying business rules for tax calculations and fraud detection while ensuring PCI-DSS compliance for sensitive cardholder information.
Implemented Data Governance frameworks by defining metadata standards and validation rules using Great Expectations integrated within Apache Airflow pipelines, ensuring GDPR compliance and maintaining audit trails for regulatory reporting across all retail data assets.
Integrated Apache Kafka event streaming with GCP Pub/Sub to build hybrid cloud ingestion patterns processing real-time inventory updates and customer actions, routing processed messages to GCP BigQuery and Snowflake using Apache Beam on GCP Dataflow.
Developed monitoring solutions using Splunk for centralized logging and Slack for automated alerting, tracking Apache Spark job metrics, Airflow DAG failures, and Kafka consumer lag to maintain SLA commitments for critical retail reporting workloads.
Collaborated with data scientists and business analysts within Agile environments, participating in sprint planning to deliver scalable solutions supporting merchandising optimization and demand forecasting using Python-based pipelines feeding feature stores in GCP BigQuery and Snowflake.
Built serverless workflows on GCP Cloud Run using containerized Python applications and Apache Beam runners, processing incremental retail data updates and executing transformations, reducing infrastructure overhead while maintaining Docker-based portability and Kubernetes-compatible deployment patterns.
Designed data lakehouse architectures combining GCP BigQuery for structured queries and GCP Cloud Storage for raw storage, enabling unified access for SQL-based business intelligence and PySpark-based analytics supporting retail transaction history analysis.
Optimized Hadoop ecosystem integrations by migrating legacy clusters to cloud-native GCP services, refactoring Apache Spark jobs to leverage GCP Dataflow for managed processing, improving job performance through autoscaling and integration with GCP BigQuery and Snowflake.
Mentored junior engineers conducting technical reviews, sharing expertise in Apache Spark optimization, Python best practices, SQL tuning, Kafka stream processing, and GCP architecture while fostering knowledge transfer around Data Quality frameworks and CI/CD automation standards.
Implemented security controls across pipelines ensuring PCI-DSS and GDPR adherence, configuring GCP IAM policies, Snowflake role-based access controls, field-level encryption using Python cryptography libraries, and audit logging via Splunk for comprehensive compliance monitoring.
Performed advanced query optimization in Snowflake and GCP BigQuery, analyzing execution plans and implementing incremental patterns with clustering keys and partition pruning, improving concurrency for retail dashboards supporting daily sales and weekly inventory reconciliations.
Developed comprehensive testing frameworks using Python unit tests, PySpark integration tests, and Great Expectations validation suites within GitLab CI, ensuring code quality and data contract compliance across Apache Airflow workflows and SQL stored procedures.
Deployed containerized Apache Spark applications using Docker on GCP GKE, configuring resource quotas and pod autoscaling, while integrating Kubernetes-native monitoring with Splunk and establishing automation for cluster provisioning across multi-tenant retail analytics environments.

Environments: Google Cloud Platform (GCP), GCP BigQuery, GCP Dataflow, GCP Pub/Sub, GCP GKE, GCP Cloud Run, Snowflake, Apache Spark, PySpark, Spark SQL, Apache Kafka, Apache Beam, Hadoop, Flume, Airflow, Python, SQL, Shell Scripting, Great Expectations, Docker, Kubernetes, GitLab, GitLab CI, Splunk, Looker, PowerDesigner
Honeywell Charlotte, NC
Role: Data Engineer December 2016 - June 2018
Responsibilities:
Designed scalable manufacturing data pipelines to ingest machine sensor data and production metrics into a centralized cloud data lake using AWS S3. Developed transformation logic using Python, processed high volume datasets with Apache Spark, and performed analytical queries using SQL.
Built enterprise ETL workflows integrating manufacturing execution systems and operational databases into cloud storage. Utilized AWS Glue for data integration and catalog management, implemented transformation scripts using Python, processed large datasets with Apache Spark, and ensured data consistency using SQL.
Developed batch processing frameworks for manufacturing production data including equipment logs and operational metrics. Leveraged AWS EMR for distributed processing, implemented transformation pipelines using PySpark, stored processed datasets in AWS S3, and performed aggregation queries using SQL.
Implemented real time ingestion pipelines capturing machine telemetry and factory floor events. Utilized Amazon Kinesis for streaming ingestion, processed event streams using Apache Spark Streaming, implemented transformation logic using Python, and stored processed operational datasets in AWS S3.
Designed scalable data warehouse solutions to support manufacturing analytics and operational reporting. Loaded curated datasets into Amazon Redshift, implemented dimensional models using SQL, orchestrated ETL pipelines using Apache Airflow, and enabled efficient analytical queries for production performance monitoring.
Built secure data ingestion frameworks integrating production systems, sensor platforms, and enterprise applications. Implemented event driven integrations using AWS Lambda, processed incoming datasets using Python, stored standardized data within AWS S3, and executed transformations using Apache Spark.
Implemented dimensional data models supporting manufacturing analytics including production efficiency and equipment utilization reporting. Designed fact and dimension tables using Amazon Redshift, managed metadata using AWS Glue Data Catalog, implemented transformation workflows with Apache Spark, and validated datasets using SQL.
Developed automated data orchestration pipelines ensuring reliable processing of manufacturing operational datasets. Managed workflow scheduling using Apache Airflow, executed ETL transformations using Python, triggered data integration jobs through AWS Glue, and stored curated datasets within AWS S3 data lake environment.
Implemented data quality validation frameworks ensuring accuracy of production and operational datasets. Developed validation scripts using Python, executed reconciliation queries using SQL, monitored pipeline performance using Amazon CloudWatch, and maintained consistent storage of validated datasets in AWS S3.
Optimized large scale manufacturing data processing workloads by implementing distributed processing solutions. Executed complex transformations using Apache Spark, processed workloads on AWS EMR, stored optimized columnar datasets in AWS S3, and improved analytical query performance using SQL.
Implemented secure governance and access control policies for manufacturing analytics datasets. Managed permissions using AWS IAM, applied encryption using AWS KMS, monitored infrastructure activity using AWS CloudTrail, and ensured secure storage and controlled access to datasets stored in AWS S3.
Collaborated with analytics and operations teams to deliver manufacturing insights supporting production monitoring and operational optimization. Provided curated datasets through Amazon Redshift, implemented transformation logic using Python, performed analytical queries using SQL, and enabled integration with enterprise reporting tools.

Environments: AWS S3, AWS Glue, AWS EMR, Amazon Kinesis, AWS Lambda, Amazon Redshift, AWS Glue Data Catalog, Amazon CloudWatch, AWS IAM, AWS KMS, AWS CloudTrail, Apache Spark, Apache Spark Streaming, PySpark, Apache Airflow, Python, SQL, ETL, Data Warehousing, Dimensional Modeling, Data Lake, Batch Processing, Real-time Data Processing, Manufacturing Data Analytics
Mastek Mumbai, India
Role: ETL Developer May 2015 - October 2016
Responsibilities:
Developed scalable ETL pipelines to ingest application and operational datasets into centralized cloud storage supporting enterprise analytics. Utilized Azure Data Factory for orchestrating workflows, implemented transformation logic using SQL, performed data processing using Python, and stored curated datasets within Azure Data Lake Storage.
Designed robust data integration frameworks connecting enterprise software applications and relational databases into analytical environments. Leveraged Azure SQL Database for structured storage, implemented complex transformations using SQL, orchestrated pipelines with Azure Data Factory, and maintained scalable storage using Azure Data Lake Storage.
Built automated data ingestion pipelines integrating multiple enterprise systems and external data sources for reporting and analytics. Developed transformation scripts using Python, scheduled workflows using Azure Data Factory, validated datasets with SQL, and stored processed data within Azure Data Lake Storage.
Developed scalable batch processing workflows to transform large application datasets for analytics and reporting. Leveraged distributed processing capabilities of Azure Databricks, implemented transformation pipelines using PySpark, orchestrated workflows using Azure Data Factory, and persisted processed datasets in Azure Data Lake Storage.
Implemented enterprise data warehouse solutions supporting business intelligence and application analytics. Loaded curated datasets into Azure Synapse Analytics, designed dimensional models using SQL, orchestrated data ingestion pipelines through Azure Data Factory, and ensured scalable data storage using Azure Data Lake Storage.
Designed secure data integration pipelines connecting application logs, operational databases, and enterprise systems. Implemented event-driven ingestion using Azure Functions, processed incoming data using Python, staged intermediate datasets within Azure Blob Storage, and stored curated outputs in Azure Data Lake Storage.
Implemented metadata management and schema governance supporting enterprise data integration processes. Managed metadata definitions using Azure Purview, executed data transformation queries using SQL, orchestrated ETL workflows through Azure Data Factory, and stored standardized datasets within Azure Data Lake Storage.
Developed automated orchestration frameworks to manage reliable execution of ETL pipelines across enterprise data platforms. Scheduled workflows using Azure Data Factory, implemented transformation scripts using Python, processed large datasets with Azure Databricks, and stored curated data within Azure Data Lake Storage.
Implemented data validation and monitoring frameworks ensuring reliability and accuracy of integrated enterprise datasets. Developed validation scripts using Python, executed reconciliation checks using SQL, monitored pipeline execution through Azure Monitor, and ensured reliable dataset storage within Azure Data Lake Storage.
Optimized large-scale data processing workloads by implementing distributed transformation strategies. Processed complex data transformations using Azure Databricks, implemented analytical queries using SQL, orchestrated ETL pipelines through Azure Data Factory, and maintained scalable storage within Azure Data Lake Storage.
Implemented secure access management and governance for enterprise data platforms. Managed identity access using Azure Active Directory, applied encryption using Azure Key Vault, monitored operational metrics through Azure Monitor, and ensured secure dataset storage within Azure Data Lake Storage.
Collaborated with analytics and application teams to deliver reliable reporting datasets supporting operational insights and business analytics. Provided curated datasets using Azure Synapse Analytics, implemented transformation logic using SQL, orchestrated pipelines with Azure Data Factory, and ensured scalable storage within Azure Data Lake Storage.
Environments: Azure Data Factory, Azure Data Lake Storage, Azure SQL Database, Azure Synapse Analytics, Azure Databricks, Azure Blob Storage, Azure Functions, Azure Purview, Azure Monitor, Azure Active Directory, Azure Key Vault, Python, PySpark, SQL, ETL, Data Warehousing, Dimensional Modeling, Data Integration, Data Transformation, Batch Processing, Enterprise Data Analytics.

EDUCATION:
Bachelor's in Computer Science: VR Siddhartha Engineering College, July 2011 - April 2015
