
pankaj bora - Data Engineer
[email protected]
Location: Fenwick Island, Delaware, USA
SAI CHARAN
PROFESSIONAL SUMMARY

Accomplished Data Engineer with 8+ years of experience architecting and implementing high-performance data pipelines, enabling seamless integration and optimization of scalable analytics solutions that support critical business insights.
Proficient in Big Data technologies such as Hadoop, Hive, Spark, Sqoop, and NiFi, with a proven ability to design and optimize workflows for ingesting, transforming, and processing large-scale datasets efficiently.
Extensive expertise in cloud-based data engineering using AWS services such as S3, EC2, EMR, Glue, Redshift, and Step Functions to design cost-effective, scalable, and high-performing data solutions.
Proficient in designing robust ETL workflows, seamlessly integrating data from diverse sources such as relational databases (MySQL, PostgreSQL) and APIs into distributed storage systems, ensuring high data quality and optimized performance.
Adept at ensuring data quality, integrity, and compliance with industry standards like PCI-DSS through effective data masking and validation.
Proficient in developing and implementing Spark RDD-based data processing workflows using Scala, Java, or Python programming languages.
Strong understanding of Spark RDD integration with other big data technologies, such as Hadoop, Hive, and Kafka, and their impact on data processing workflows and performance.
Familiar with Spark RDD-based data processing libraries and frameworks, such as Spark SQL, MLlib, and GraphX, and with their features and limitations.
Proficient in developing and implementing Spark DataFrame-based data processing workflows using Scala, Java, or Python programming languages.
Experienced in optimizing Spark DataFrame performance by tuning configuration settings such as memory allocation, caching, and serialization.
Familiar with Spark DataFrame schema and data type operations, such as adding, renaming, and dropping columns, casting data types, and handling null values (illustrated in the sketch after this list).
Ability to troubleshoot common issues with Spark DataFrame, such as data processing errors, performance bottlenecks, and scalability limitations.
Experience working with Spark DataFrame in production environments and implementing performance monitoring and alerting systems to detect and resolve performance issues proactively.
Demonstrated success in cross-functional collaboration, working with stakeholders to deliver actionable insights and support data-driven decision-making.
Expertise in using Spark SQL to process large-scale structured and semi-structured data sets, including querying, filtering, mapping, reducing, grouping, and aggregating data.
Skilled in using Spark SQL persistency and caching mechanisms to reduce data processing overhead and improve query performance.
Utilized Python libraries (Pandas, NumPy, PySpark) to analyze and process large-scale data in healthcare and insurance domains.
Experienced in implementing Data Quality checks and Governance frameworks to ensure data accuracy, consistency, and compliance with industry standards and regulations.
Proficient in using ETL testing tools and frameworks, such as QuerySurge, Talend Data Quality and Informatica Data Validation Option.
Well-versed in Agile development methodologies, promoting iterative and incremental approaches to manage data engineering projects efficiently.
Adept at using Git for Version Control, ensuring efficient collaboration, code management, and tracking changes in the development process.
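
The following is a minimal, illustrative PySpark sketch of the DataFrame schema and data type operations referenced above; the column names and sample values are hypothetical placeholders, not taken from any specific project.

# Minimal sketch of common DataFrame column and schema operations (hypothetical data).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-ops-sketch").getOrCreate()

# Hypothetical sample rows: one has a null amount, the other a string amount.
df = spark.createDataFrame(
    [("1001", "2024-01-15", None), ("1002", "2024-01-16", "42.5")],
    ["member_id", "visit_date", "charge_amount"],
)

df = (
    df.withColumn("charge_amount", F.col("charge_amount").cast("double"))  # cast a data type
      .withColumnRenamed("visit_date", "service_date")                     # rename a column
      .withColumn("load_ts", F.current_timestamp())                        # add a column
      .fillna({"charge_amount": 0.0})                                      # handle null values
      .drop("load_ts")                                                     # drop a column
)

df.printSchema()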

TECHNICAL SKILLS

Big Data: Hadoop, Apache Spark, Apache Kafka, Apache Sqoop, Hive.
Amazon Web Services (AWS): EMR, EC2, RDS, Step Functions, Redshift, Athena, Glue.
Databases: MySQL, PostgreSQL, MariaDB.
NoSQL Databases: MongoDB, Cassandra.
Programming Languages: Python, Java, SQL.
BI Tools: Tableau, Power BI.
DevOps Tools: Docker, Git.

WORKING EXPERIENCE

Kaiser Permanente, Frisco, TX | May 2023 - Present
Data Engineer (Remote)

Designed and developed batch processing data pipelines on Amazon EMR using Apache Spark, Python, and Scala to process terabytes of data in a cost-effective and scalable manner.
Developed large-scale distributed data pipelines using PySpark on AWS EMR.
Optimized PySpark applications running on AWS EMR for efficient data processing.
Integrated AWS S3 with PySpark jobs to handle large datasets in a distributed environment.
Implemented distributed data processing using PySpark on AWS EMR for batch workflows.
Automated Spark job executions on AWS EMR clusters using AWS Step Functions.
Integrated AWS EC2 instances for managing and deploying AWS EMR clusters.
Configured Spark applications on AWS EMR to efficiently handle data transfers between AWS S3 and EC2 instances.
Implemented caching strategies in PySpark jobs on AWS EMR to reduce data load times from AWS S3 (see the sketch after this section).
Engineered the loading of transformed data from the processed S3 bucket into Amazon Redshift using AWS EMR.
Developed and optimized Spark-based workflows to process serialized data in formats including Avro, Parquet, ORC, and Protobuf, leveraging their features and addressing limitations to ensure efficient data storage and retrieval.
Debugged complex PySpark issues on AWS EMR by analyzing detailed error logs stored in AWS S3.
Collaborated with DevOps teams for cluster provisioning and maintenance.
Debugged and optimized Spark code for performance bottlenecks.
Tuned resource allocation on AWS EMR to reduce costs while running large-scale PySpark jobs.
Conducted data profiling and data quality checks in Spark.
Implemented Spark on Kubernetes for containerized deployments.
Hands-on experience managing Amazon EC2 instances, including AMI creation, VPC network configuration and auto-scaling for optimized performance and cost efficiency.
Strong experience in integrating Sqoop with data quality tools such as Trifacta and Talend.
Proficient in using Sqoop to automate data migrations between Hadoop clusters in different geographical regions.
Skilled in conducting data migration testing and data conversion validation in ETL processes.
Efficiently transferred refined data to Redshift Data Warehouse, meeting specific business needs and constraints.
Utilized Amazon Redshift Spectrum to seamlessly query and process data stored in Amazon S3, leveraging its SQL capabilities optimized for high-performance analytical processing.
Automated infrastructure provisioning and management on EC2 using Infrastructure as Code (IAC) tools like Terraform.
Implemented data masking and access control policies in Snowflake for compliance with healthcare data privacy regulations.
Integrated Python with Apache Airflow for workflow orchestration, enabling better tracking and scalability of data pipelines.
Experienced in identifying and reporting data quality issues and data anomalies during ETL testing.

Environment: Spark, Zookeeper, AWS EC2, AWS S3, AWS EMR, AWS Redshift, AWS Step Functions, ETL, Python, SQL, Tableau, XML, JSON, Shell Scripting, AWS IAM, Agile, QA.
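
The pipeline pattern described above can be summarized with a minimal PySpark sketch: read raw data from S3, cache a DataFrame that several aggregations reuse, and write the result back to a processed bucket. Bucket names, paths, and column names are hypothetical placeholders.

# Batch job sketch for an EMR cluster (hypothetical buckets and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-batch-sketch").getOrCreate()

raw = spark.read.json("s3://example-raw-bucket/claims/2024/01/")

# Cache the cleaned DataFrame so downstream aggregations do not re-read S3.
cleaned = raw.filter(F.col("claim_status").isNotNull()).cache()

daily_totals = (
    cleaned.groupBy("service_date", "plan_code")
           .agg(F.sum("claim_amount").alias("total_amount"),
                F.count(F.lit(1)).alias("claim_count"))
)

daily_totals.write.mode("overwrite") \
    .partitionBy("service_date") \
    .parquet("s3://example-processed-bucket/claims_daily/")

spark.stop()

From the processed bucket, the data can then be loaded into Amazon Redshift (for example via a COPY command), matching the S3-to-Redshift flow described in the bullets above.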

USAA, Hyderabad, India | Jan 2018 - Jun 2022
Data Engineer

Worked effectively in an Agile environment, integrating iterative and incremental development practices into data engineering projects to improve responsiveness to changing requirements.
Designed and implemented end-to-end data integration solutions using Sqoop for large-scale data migrations from on-premises databases to Hadoop clusters.
Developed custom Sqoop connectors to support data transfers between Hadoop and proprietary data sources.
Optimized Sqoop jobs to achieve high throughput and low latency by fine-tuning parameters such as batch size, number of mappers, and fetch size, enabling efficient data ingestion from MySQL into HDFS (see the sketch after this section).
Proficient in performing data validation and cleansing during data transfer using Sqoop's validation and cleansing options.
Adept in scheduling and automating Sqoop jobs for incremental runs.
Designed and managed Hive tables including creating managed, external, and partitioned tables to optimize query performance and support efficient data storage and retrieval.
Ability to troubleshoot common issues with Hive tables, such as data skew, table corruption, and query optimization.
Implemented performance monitoring and alerting systems for Hive in a production environment, proactively detecting and resolving performance bottlenecks to ensure optimal query execution and system reliability.
Designed and developed Spark applications to implement complex data transformations and aggregations for batch processing jobs, leveraging Spark SQL and DataFrames.
Designed and implemented data lake architectures on Amazon S3, leveraging partitioning and columnar formats such as Parquet to optimize query performance and minimize storage costs.
Optimized Spark jobs and data processing workflows for scalability, performance, and cost efficiency using techniques such as partitioning, compression, and caching.
Knowledgeable about testing data migration from legacy systems to modern platforms using ETL processes.
Automated repetitive data engineering tasks such as file ingestion and metadata management using Python-based frameworks.
Developed SQL scripts and stored procedures in Snowflake to handle complex business logic, ensuring high efficiency in data processing.

Environment: Spark, Apache Sqoop, Apache Hive, Hadoop, Zookeeper, AWS S3, ETL, Python, SQL, Shell Scripting, Agile, QA.
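
As a reference for the Sqoop work described above, the following minimal Python sketch wraps an incremental Sqoop import from MySQL into HDFS so it can be scheduled; connection details, table, and check column are hypothetical placeholders.

# Incremental Sqoop import wrapped in Python for scheduling (hypothetical names).
import subprocess

def run_incremental_import(last_value: str) -> None:
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://example-host:3306/policy_db",
        "--username", "etl_user",
        "--password-file", "/user/etl/.mysql.password",
        "--table", "policy_transactions",
        "--target-dir", "/data/raw/policy_transactions",
        "--incremental", "append",
        "--check-column", "transaction_id",
        "--last-value", last_value,
        "--num-mappers", "8",        # parallelism tuned to what the source DB can serve
        "--fetch-size", "10000",     # rows fetched per round trip from MySQL
        "--as-avrodatafile",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # In practice the last processed value would come from a metadata store or Sqoop job state.
    run_incremental_import("0")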

Lowe's, Hyderabad, India | May 2016 - Dec 2017
Data Analytics Engineer

Proficient in developing and optimizing Spark-based data processing workflows using RDDs, DataFrames, and SQL in Scala, Java or Python.
Skilled in using Spark transformations (e.g., filtering, mapping, reducing, grouping, and aggregating) and actions to process structured and unstructured datasets.
Implemented data cleansing, transformation, and validation scripts in Python to ensure data accuracy for analytics in platforms like Snowflake and Redshift.
Developed Spark pipelines for feature engineering, data cleansing, and aggregation tasks.
Designed and implemented Spark jobs using Python.
Hands-on experience with Spark SQL for querying structured and semi-structured data and working with DataFrame APIs for analytics.
Built Spark applications for geospatial data analysis, customer segmentation, and clickstream analysis.
Developed PySpark applications to process structured and unstructured data stored in AWS S3.
Utilized AWS S3 for intermediate and final storage of datasets processed by PySpark.
Built PySpark pipelines on AWS EMR for data aggregation and transformation tasks.
Used Hive on AWS EMR to perform SQL queries on datasets processed by PySpark jobs.
Automated PySpark job executions on AWS EMR using AWS Step Functions.
Debugged PySpark jobs to resolve errors and improve processing efficiency.
Tuned PySpark applications for optimized resource usage and memory management on AWS EMR.
Configured Spark jobs to efficiently handle data transfers between AWS S3 and EMR clusters.
Proficient in creating and managing Hive tables, including managed, external, and partitioned tables.
Skilled in querying Hive tables using SQL-like syntax for data analysis and integration with tools like Apache Spark.
Familiarity with Hive table formats (e.g., ORC, Parquet, Avro) and their advantages for specific use cases.
Experienced in leveraging Hive partitioning and bucketing to optimize query performance (see the sketch at the end of this section).
Ability to troubleshoot common issues with Hive queries, such as data skew, query hangs, and slow execution times.
Proficient in using Hive explain plans to diagnose query performance issues and optimize execution.
Knowledge of Hive query tuning best practices, such as minimizing data transfers, avoiding unnecessary data conversions, and using efficient compression codecs.
Proficient in importing and exporting data between Hadoop and relational databases such as MySQL and SQL Server using Sqoop.
Skilled in using Sqoop to import and export data in file formats such as CSV, Avro, and Parquet.
Adept in loading data into Hive tables using Sqoop for processing and analysis.
Proficient in using Sqoop for incremental data transfers with Sqoop's incremental import feature.
Experienced in troubleshooting common Sqoop job failures and resolving performance issues.
Familiar with scheduling and automating Sqoop jobs for regular data imports.
Skilled in validating and cleansing data during transfers using Sqoop's built-in features.
Proficient in exporting Hadoop data to external databases for reporting and analytics.
Experienced in ETL testing methodologies and processes, including data validation and reconciliation.
Proficient in SQL queries and scripting for validating data transformations and business logic in ETL pipelines.
Skilled in testing data extraction processes from databases, files, and APIs.
Familiar with data profiling techniques to identify patterns and anomalies in ETL processes.
Knowledgeable about data quality and cleansing techniques in ETL testing.
Experienced in documenting test plans, test cases, and test results for ETL testing.
Proficient in testing ETL transformations, such as aggregation, filtering, sorting, and joining.
Familiar with ETL tools like Informatica, Talend, DataStage, or SSIS.
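
The Hive table design and partition-pruning points above can be illustrated with a short PySpark sketch that creates an external, partitioned table and queries it with a partition filter; database, paths, and columns are hypothetical placeholders.

# External partitioned Hive table queried through Spark SQL (hypothetical names).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-partition-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.store_orders (
        order_id     BIGINT,
        store_id     INT,
        order_amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
    LOCATION '/data/warehouse/store_orders'
""")

# Register partitions already present on disk, then filter on the partition
# column so only the matching order_date directories are scanned.
spark.sql("MSCK REPAIR TABLE sales.store_orders")
spark.sql("""
    SELECT store_id, SUM(order_amount) AS daily_sales
    FROM sales.store_orders
    WHERE order_date = '2017-11-01'
    GROUP BY store_id
""").show()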


EDUCATION

University of Illinois at Springfield, IL | Aug 2022 - Dec 2023
Master's in Computer Science

Jawaharlal Nehru Technological University, Hyderabad, India | Aug 2012 - May 2016
Bachelor's in Computer & Information Science

CERTIFICATIONS

Google Advanced Data Analytics Professional