Resume of GCP/Azure Data Engineer | 13+ years | H1B
[email protected]
Location: USA
Relocation: Yes
Visa: H1B
Professional Summary:
Over 13 years of experience as a Senior Data Engineer, specializing in cloud platforms (GCP and Azure), with expertise in building and deploying scalable data pipelines using big data technologies such as Apache Spark, Hive, Hadoop, Python, and PySpark.
Proficient in utilizing GCP services including Cloud Storage, DataProc, Data Flow, Big Query, Cloud Composer, and Cloud Pub/Sub for efficient data management and pipeline orchestration.
Hands-on experience in developing Spark applications using PySpark DataFrames, RDDs, and Spark SQL for large-scale data processing.
Expertise in creating, debugging, scheduling, and monitoring tasks with Airflow in Cloud Composer (a minimal DAG sketch follows this summary).
Adept in using Cloud Pub/Sub to replicate data in real-time from source systems to Big Query for timely data availability.
Strong understanding of GCP service accounts, billing projects, authorized views, datasets, GCS buckets, and gsutil commands for data management and cloud resource management.
Experience in deploying Spark applications on Hadoop platforms such as Hortonworks Data Platform and AWS EMR for high-performance data processing.
Involved in the design, architecture, and implementation of cloud-based solutions, ensuring the scalability and robustness of systems.
Worked on infrastructure management, ensuring system availability, performance, and capacity across cloud environments, addressing incidents and optimizing efficiency.
Extensive experience with Spark using Scala for distributed computational analytics on large data sets.
Designed and maintained scalable, fault-tolerant cloud environments on AWS and Azure, spanning multiple availability zones using infrastructure-as-code tools like Terraform and CloudFormation.
Created and maintained infrastructure for development, staging, production, and disaster recovery environments using Terraform scripts.
Engaged in the deployment of cloud services across the AWS stack, including EC2, Route53, S3, RDS, DynamoDB, SNS, SQS, and IAM, with a focus on high availability, fault tolerance, and auto-scaling.
Continuously keep up to date with new services and technology stacks released on Google Cloud Platform.
Experience in working with both structured and semi-structured data formats such as Parquet, ORC, CSV, and JSON, ensuring compatibility and performance across data workflows.
Hands-on experience with a variety of GCP services including Big Query, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, GSUTIL, and Operations Suite (Stackdriver).
Practical experience in designing and implementing data engineering pipelines in AWS using EMR, AWS Glue, EC2, Lambda, Athena, Redshift, Sqoop, and Hive.
Expertise in programming with Python, Scala, Java, and SQL to develop scalable data processing and analysis solutions.
In-depth knowledge of distributed systems architecture and parallel processing frameworks to optimize large-scale data workflows.
Developed end-to-end data pipelines to extract, cleanse, process, and analyze large behavioral and log data sets for actionable insights.
Used application monitoring tools such as Splunk, Elasticsearch, Logstash, and Kibana for efficient logging and real-time monitoring of applications.
Experienced with Teradata utilities (BTEQ, FLOAD, MLOAD, TPUMP, Analyst) and SQL Server for large-scale data management and performance tuning.
Expertise in query optimization and performance tuning of stored procedures and functions to improve system efficiency.
Well-versed in data warehousing techniques including data cleansing, Slowly Changing Dimensions (SCDs), and Change Data Capture (CDC) for effective data management.
Integrated various data sources from databases such as Teradata, Oracle, SQL Server, and flat files for comprehensive data processing.
Developed complex SQL code in Teradata using OLAP functions, aggregate functions, and BTEQ scripts for data manipulation and transformation.
Extensive experience with Teradata SQL Assistant, writing BTEQ scripts to load data from staging to data warehouse, and optimizing performance through best practices.
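
Illustrative only: a minimal Airflow DAG sketch of the kind of Cloud Composer orchestration described above, assuming the Google provider package is installed; the project, bucket, and table names are placeholders, not taken from any of the engagements below.

```python
# Minimal sketch of an Airflow DAG of the kind run in Cloud Composer.
# All names (DAG id, bucket, dataset, table) are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_gcs_to_bq_load",        # hypothetical DAG name
    schedule_interval="0 6 * * *",        # run once a day at 06:00 UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    # Load newly landed Parquet files from a GCS bucket into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_events_to_bq",
        bucket="example-landing-bucket",
        source_objects=["events/{{ ds }}/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="example-project.analytics.events",
        write_disposition="WRITE_APPEND",
    )

    # Simple downstream bookkeeping step once the load completes.
    mark_done = BashOperator(
        task_id="mark_done",
        bash_command="echo 'load complete for {{ ds }}'",
    )

    load_to_bq >> mark_done
```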

TECHNICAL SKILLS
Programming Languages: Java, PL/SQL, Python, R, HiveQL, Scala, SQL, Spark, T-SQL
Data Modelling & Architecture: Dimensional Modeling, Star and Snowflake schemas, OLTP/OLAP
Cloud Services: AWS (S3, Glue, Redshift), GCP (Big Query, Cloud Storage, Dataflow, Cloud Functions, Pub/Sub, Cloud Composer), Azure (Azure Data Factory, ADLS, Azure Databricks, Azure Synapse Analytics, Azure Monitor, Key Vault)
Data Warehousing: Snowflake, Teradata, Amazon Redshift
Data Visualization: Tableau, Power BI, Looker
Development Tools: Eclipse, Maven, Git, Jenkins
Operating Systems: Linux, Windows, UNIX, Ubuntu
Machine Learning: Big Query ML, TensorFlow, PyTorch
Containerization: Docker, Kubernetes
CI/CD & Automation: Jenkins, Terraform
Data Processing: Apache Spark, Dataflow, Cloud Dataflow, Dataproc

EDUCATION
Master's degree in Business Analytics (Applied Machine Learning), JSOM, UT Dallas, 2023
Bachelor of Science in Economics, BITS Pilani, India, June 2011
AWS Certified Solutions Architect (Certification No. QE1J11JCSEVQQPGS)

WORK EXPERIENCE

Client: Walmart, Bentonville, AR Apr 2025-Present
Role: Senior Data Engineer
Developed the Auto Purge framework using Java, Spark, and Spring Boot to automate data retention and deletion workflows across multiple Hive and Big Query data assets.
Built and deployed Spark and PySpark applications on GCP Dataproc and serverless Spark for data processing, validation, and compliance auditing.
Integrated Airflow DAGs with GCS and Hive metadata for orchestrating purge tasks, checkpoint tracking, and automated recovery logic.
Developed RESTful APIs in Java and Spring Boot to fetch retention metadata, process correlation IDs, and update purge completion status.
Implemented audit and validation pipelines using Spark and Big Query to compare pre- and post-purge data, capturing metrics in GCS for compliance reporting (see the sketch at the end of this section).
Designed schema validation and dynamic configuration modules to support environment-based execution across Dev, Stg, and Prod.
Optimized Spark performance for partitioned Hive tables by tuning file I/O, caching, and broadcast joins for large-scale purges.
Developed CI/CD pipelines in GitHub and Jenkins for automated code packaging, versioning, and Airflow deployment across environments.
Built monitoring scripts in Python and Bash for tracking purge metrics, audit log completion, and Airflow DAG run statuses.
Collaborated with Privacy, Legal, and Data Governance teams to align purge logic with regulatory compliance and retention policies.
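
Illustrative only: a hedged Python sketch of a pre-/post-purge audit check like the one referenced above, assuming the google-cloud-bigquery and google-cloud-storage clients; the table, column, and bucket names are hypothetical.

```python
# Sketch of an audit check that counts rows older than the retention window
# after a purge and persists the result to GCS for compliance reporting.
# Table, column, and bucket names are hypothetical placeholders.
import json
from datetime import datetime, timezone

from google.cloud import bigquery, storage


def audit_purge(project: str, table: str, retention_days: int, bucket_name: str) -> dict:
    """Verify no rows older than the retention window remain, then write an audit record."""
    bq = bigquery.Client(project=project)

    # Rows that should no longer exist once the purge has completed.
    sql = f"""
        SELECT COUNT(*) AS stale_rows
        FROM `{table}`
        WHERE event_date < DATE_SUB(CURRENT_DATE(), INTERVAL {retention_days} DAY)
    """
    stale_rows = list(bq.query(sql).result())[0]["stale_rows"]

    record = {
        "table": table,
        "retention_days": retention_days,
        "stale_rows_remaining": stale_rows,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "status": "PASS" if stale_rows == 0 else "FAIL",
    }

    # Persist the audit record alongside other compliance artifacts in GCS.
    gcs = storage.Client(project=project)
    blob = gcs.bucket(bucket_name).blob(f"purge-audits/{table}/{record['checked_at']}.json")
    blob.upload_from_string(json.dumps(record), content_type="application/json")
    return record
```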

Client: Southern California Edison (SCE), Dallas, TX Mar 2024 - Feb 2025
Role: Senior GCP Data Engineer
Optimized complex SQL queries to enhance performance and enable detailed analytical reporting for business stakeholders.
Designed and implemented scalable data pipelines using Google Cloud Platform (GCP) services, including Cloud DataProc, Google Cloud Storage (GCS), and Big Query, to handle large-scale data processing and analytics.
Migrated legacy Teradata data warehouse to Google Big Query, ensuring consistent data mapping, schema alignment, and validation during the process.
Transformed HQL, SAS scripts, and SQL queries into Big Query SQL scripts, improving query performance and adhering to cloud-native best practices.
Created validation queries to ensure the accuracy and integrity of migrated datasets, implementing thorough data quality checks.
Developed and automated ETL pipelines using Airflow DAGs for data ingestion, transformation, and loading, streamlining the workflow process.
Wrote Python and PySpark scripts to process large-scale datasets, generate reports, and automate FTP file transfers, ensuring seamless delivery.
Designed scalable solutions to handle diverse data formats and volumes, achieving a 99.9% success rate in ensuring data integrity during migration.
Worked with cross-functional teams to modernize legacy data structures and integrate them into Big Query, utilizing advanced cloud capabilities.
Established automated data validation frameworks to compare on-prem and Big Query datasets, proactively identifying and correcting discrepancies (a minimal validation sketch follows this section).
Configured and optimized Airflow DAGs to integrate Big Query operations with FTP processes, reducing manual intervention by 40%.
Optimized data pipelines for cost and performance efficiency using GCP tools such as Dataflow and Big Query partitions.
Developed custom SQL Looker views to transform raw data into actionable reports, analyzing metrics such as medication adherence, insurance claims, and patient care.
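
Illustrative only: a minimal row-count validation sketch comparing an on-prem table with its migrated Big Query copy, assuming a generic DB-API connection to the legacy warehouse (e.g. Teradata); table names and the connection itself are placeholders.

```python
# Compare row counts between a legacy warehouse table and its BigQuery copy.
# `source_conn` is any DB-API 2.0 connection; names are hypothetical.
from google.cloud import bigquery


def validate_row_counts(source_conn, source_table: str, bq_table: str) -> bool:
    """Return True when the on-prem table and its migrated BigQuery copy match."""
    # Count rows on the legacy side through a standard DB-API cursor.
    cur = source_conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {source_table}")
    source_count = cur.fetchone()[0]

    # Count rows on the BigQuery side.
    bq = bigquery.Client()
    bq_count = list(bq.query(f"SELECT COUNT(*) AS c FROM `{bq_table}`").result())[0]["c"]

    if source_count != bq_count:
        print(f"MISMATCH {source_table}: source={source_count}, bigquery={bq_count}")
        return False
    print(f"OK {source_table}: {source_count} rows")
    return True
```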

Client: Etsy, Dallas, TX Jun 2023 - Feb 2024
Role: Sr GCP Data Engineer
Migrated numerous complex data processing jobs from Scalding to Spark Scala, improving performance and maintainability in cloud environments, with PostgreSQL/Alloy DB integration for optimized data storage.
Refactored ML workflows to optimize training efficiency, leveraging PostgreSQL and Alloy DB for data storage, improving scalability and data handling.
Integrated new Scala-based ML jobs into Airflow DAGs, improving processing efficiency for large datasets, with data stored in PostgreSQL/Alloy DB for quick retrieval.
Automated code migration using LLM frameworks, simplifying the transition from legacy ML patterns to new models, with integration to Alloy DB for improved performance.
Tuned and optimized workflows in Alloy DB, enhancing computational efficiency and reducing resource utilization for large-scale retail data.
Incorporated changes into Airflow DAGs to integrate the new Scala jobs, automating the data pipeline orchestration.
Conducted comprehensive testing of the migrated jobs and DAGs to ensure reliability and accuracy, identifying and addressing potential issues early in the process.
Fine-tuned job configurations and resources to optimize performance, resulting in faster data processing times and reduced operational costs.
Designed and implemented data extraction, transformation, and loading (ETL) processes to migrate data between on-premises systems and GCP environments, ensuring data accuracy and integrity.
Designed and implemented real-time data pipelines using GCP, leveraging Big Query, Dataflow, Cloud Composer, and Pub/Sub to enable seamless data ingestion, transformation, and processing at scale.
Developed and optimized data architecture and schema models for high-performance data retrieval, leveraging Big Query's partitioning and clustering capabilities for large-scale datasets (a partitioning sketch follows this section).
Integrated GCP to create real-time ingestion frameworks for capturing logs and user behavior data, supporting data-driven decisions.
Led migration of legacy data systems to GCP, creating ETL pipelines with Cloud Composer, Dataflow, and Big Query, improving data processing efficiency and scalability.
Implemented real-time data validation frameworks using Dataflow and dbt, ensuring consistent data quality across systems, with data stored in PostgreSQL/Alloy DB.
Spearheaded the creation of data lake architecture using GCS and Big Query to efficiently store and retrieve structured and semi-structured data.
Automated incremental data loads from on-premises systems to Big Query using Cloud Functions and Cloud Scheduler, ensuring the efficient processing of daily transactions.
Developed machine learning models in Big Query ML to identify trends in course recommendations, student engagement, and subscription renewals.
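
Illustrative only: a short sketch of creating a date-partitioned, clustered Big Query table of the kind described above, using the Python client library; the schema, dataset, and table name are hypothetical.

```python
# Create a day-partitioned, clustered BigQuery table so queries prune
# partitions and scan less data. All identifiers are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
]

table = bigquery.Table("example-project.analytics.user_events", schema=schema)

# Partition by day on the event timestamp and cluster on common filter columns.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
table.clustering_fields = ["user_id", "event_type"]

client.create_table(table, exists_ok=True)
```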

Client: Walmart, Dallas, TX Oct 2022 - May 2023
Role: GCP Data Engineer
Implemented an automated data purge protocol within the Google Cloud Storage (GCS) data lake using Scala, in compliance with CCPA regulations, ensuring secure and compliant data storage and handling (a Python sketch of the pattern appears at the end of this section).
Migrated 600 Automic jobs to Airflow, automating job orchestration and significantly improving workflow efficiency while ensuring data quality and accuracy.
Designed and developed UAT mapping tools for automating the testing and deployment of data workflows, reducing manual effort and improving deployment accuracy.
Created comprehensive Confluence and requirements documents for resolving complex issues related to Airflow DAGs, enabling effective communication and ensuring smooth team collaboration.
Executed backfill operations using Apache Spark SQL in the GCP environment, ensuring data completeness and maintaining data integrity across pipelines.
Used GCP Dataproc as a compute engine for running Spark applications, such as an indexer and a log converter, to enhance data flowing into Snowflake.
Used GCP Storage as the target storage location for data arriving from different advertising companies, including Circana and MMP partners.
Used GCP Big Query as a warehouse for ad hoc analysis while working with ML engineers.
Used Snowflake with GCP as storage since Samsung migrated to GCP from AWS.
Leveraged GCP Dataproc to scale and manage Spark clusters, improving computational efficiency and significantly reducing processing time for large data sets.
Led the migration of data storage from Azure Blob Storage to GCS, ensuring seamless data transfer and improving storage cost efficiency by 20%.
Re-engineered ETL pipelines, migrating from ADF to Apache Airflow, reducing pipeline execution time by 25%.
Migrated machine learning models from Azure Machine Learning to Google AI Platform, achieving faster model training times.
Replaced Azure Synapse Analytics with Google Big Query for data warehousing, optimizing query performance and reducing costs through efficient data partitioning.
Successfully moved containerized workloads from Azure Kubernetes Service (AKS) to Google Kubernetes Engine (GKE), improving scalability and deployment.
Environment: Scala, Spark, Python, Airflow, GCP (GCS, Dataproc, Big Query), Bash, Jenkins, Git
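
Illustrative only: a Python sketch of the prefix-based GCS purge pattern described above (the project implementation used Scala); the bucket and prefix values are placeholders.

```python
# Delete all objects under a prefix in a GCS data lake, with a dry-run guard.
# Bucket and prefix values are hypothetical placeholders.
from google.cloud import storage


def purge_prefix(bucket_name: str, prefix: str, dry_run: bool = True) -> int:
    """Remove every object under `prefix`, returning the number of objects handled."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    handled = 0
    for blob in client.list_blobs(bucket, prefix=prefix):
        if dry_run:
            print(f"would delete gs://{bucket_name}/{blob.name}")
        else:
            blob.delete()
        handled += 1
    return handled


# Example: purge a user's folder once a verified CCPA deletion request arrives.
# purge_prefix("example-datalake-bucket", "users/12345/", dry_run=False)
```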


Client: TechnipFMC, Hyderabad, India Feb 2017 - Dec 2021
Role: Data Engineer
Proficient in working with the Azure cloud platform (Databricks, Data Factory, HDInsight, Data Lake, Blob Storage, Synapse Analytics, Azure SQL, SQL Pool, Azure serverless apps).
Involved in building Enterprise Data Lake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
Designed and developed Azure Data Factory (ADF) pipelines extensively for ingesting data from relational and non-relational source systems to meet business functional requirements.
Created and provisioned multiple Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
Set up separate application and reporting data tiers across servers using geo-replication functionality and failover groups.
Used GCP Big Query as a warehouse for ad hoc analysis while working with data scientists.
Used GCP Dataproc as a compute engine for running Spark applications.
Used GCP Storage as the target storage location for supply chain data arriving from sources.
Extensively used Databricks notebooks for data processing and interactive analytics using Spark APIs.
Extensive knowledge in data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting clusters.
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (see the sketch at the end of this section).
Worked on creating tabular models on Azure analysis services for meeting business reporting requirements.
Worked extensively with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (DW).
Implemented Disaster Recovery and Failover servers in Cloud by replicating data across regions.
Performed service notifications (SN) data classification for subsea components using Natural Language Processing (NLP), leading to a design improvement project based on field issues.
Implemented a machine learning-based classification system for failure reports, improving maintenance planning and reducing costs.
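
Illustrative only: a minimal Databricks-style PySpark sketch of the ingestion pattern above, reading raw files from ADLS/Blob and writing a Delta table; paths and table names are hypothetical, and `spark` is the session Databricks provides.

```python
# Read raw CSV from ADLS and persist it as a Delta table in the curated zone.
# Paths and table names are placeholders; `spark` is the Databricks-provided
# SparkSession (Delta support is preconfigured on Databricks clusters).
raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/supply_chain/orders/"

orders_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Light cleansing before landing the data for downstream reporting.
cleaned_df = orders_df.dropDuplicates().na.drop(subset=["order_id"])

(
    cleaned_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("curated.orders")  # registered for downstream Synapse/reporting use
)
```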

Client: FMC Technologies, Hyderabad, India June 2011 - Jan 2017
Role: Big Data Engineer
Developed a predictive maintenance model for offshore drilling equipment, reducing downtime by 25%.
Built a real-time data ingestion pipeline using Spark and Hive to process sensor data from oil rigs.
Automated the deployment of data processing workflows, integrating Jenkins with Hadoop for seamless execution.
Also worked on file-based ingestion of CSV, JSON, TXT, and compressed archives (tsv.gz, tar.gz), loading them into Azure Data Lake and creating raw tables in Delta format.
Implemented an ELK stack to monitor log data, improving issue detection in production systems.
Designed an Airflow-based workflow to automate data validation and scheduling across multiple sources.
Led a team in optimizing a legacy SQL-based reporting system, reducing query execution time by 40%.
Utilized AWS CLI to manage cloud resources like EC2, S3, and RDS, improving system scalability.
Created Hive tables to store the processed results in a tabular format (a minimal sketch follows this section).
Extensively worked on moving data across cloud architectures, including Redshift, Hive, and S3 buckets.
Involved in working with Hive for the data retrieval process and writing Hive queries to load and process data in the Hadoop file system.
Developed data pipeline using Flume, Sqoop, Hive, and Spark to ingest subscriber data, provider data, and claims into HDFS for analysis.
Working knowledge of the Spark RDD, DataFrame, Dataset, Data Source, Spark SQL, and Spark Streaming APIs.
Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.
Experienced and confident working as a SQL Server database administrator.
Deployed SSIS packages from Solution Explorer to the SSIS catalog in SQL Server Management Studio.
Created a constraints table in the data warehouse to catch load errors such as data type violations, NULL constraint violations, foreign key violations, and duplicate data.
Generated reports to maintain zero percent errors in all the data warehouse tables.
Developed SSIS Packages for migrating data from the Staging Area of SQL Server 2008 to SQL Server 2012.
Created tables, ran ETL jobs, and wrote dynamic scripts to generate stored procedures; built packages in Solution Explorer for extract, transform, and load (ETL), writing SQL queries in data flow tasks, creating variables, implementing error handling, and writing functions.
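
Illustrative only: a small PySpark sketch of persisting processed results into a Hive table, as in the pipeline work above; the input path, columns, and table name are illustrative assumptions.

```python
# Persist processed results into a Hive table for downstream queries.
# The HDFS path, columns, and table name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("claims-to-hive")
    .enableHiveSupport()  # write managed tables into the Hive metastore
    .getOrCreate()
)

# Read claims data previously landed in HDFS (e.g. via Sqoop/Flume ingestion).
claims_df = spark.read.parquet("hdfs:///data/processed/claims/")

# Aggregate into the tabular shape consumed by downstream Hive queries.
summary_df = (
    claims_df.groupBy("provider_id", "claim_month")
    .count()
    .withColumnRenamed("count", "claim_count")
)

summary_df.write.mode("overwrite").saveAsTable("analytics.claims_monthly_summary")
```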
