| Veda Anand - Sr. Data Engineer |
| [email protected] |
| Location: Booneville, Arkansas, USA |
| Relocation: Yes |
| Visa: H4EAD |
| Resume file: Navya DE Resume (1)_1763045170345.docx |
|
NAVYASRI A
Sr. Data Engineer
Email: [email protected] | LinkedIn: https://www.linkedin.com/in/navyasri-a-5ab819211/

PROFESSIONAL SUMMARY:
- Senior Data Engineer with 8+ years of experience designing, developing, and optimizing large-scale data pipelines and cloud-native architectures across AWS, Azure, and GCP, with a strong focus on performance, scalability, and reliability.
- Extensive hands-on expertise in Python and Java, building modular, production-ready data engineering solutions and API-driven workflows in GCP, Databricks, and on-premise environments.
- Proven experience in developing and orchestrating complex ETL/ELT frameworks using Python, DBT, Talend, Matillion, and Databricks Notebooks to support batch and streaming workloads across multi-cloud platforms, including GCP.
- Strong command of Apache Spark (Core, SQL, PySpark, Streaming) and Databricks for processing high-volume structured and semi-structured data, leveraging GCP Dataproc and BigQuery for large-scale distributed computing.
- Expertise in building robust, cloud-native data pipelines using GCP Dataflow, Apache Beam, Cloud Composer (Airflow), Cloud Storage, and Pub/Sub, supporting real-time and batch ingestion pipelines.
- Skilled in implementing real-time event streaming solutions using Apache Kafka, GCP Pub/Sub, and Databricks Structured Streaming, enabling event-driven analytics and time-sensitive processing.
- Deep experience with modern data warehousing platforms such as BigQuery (GCP), Snowflake, Redshift, and Azure Synapse, including performance tuning, cost optimization, and data lifecycle management.
- Proficient in data modeling using the Kimball methodology, designing scalable Star and Snowflake schemas for BI tools, and implementing model-driven designs in Python, SQL, and Java pipelines.
- Strong SQL, PL/SQL, and Python scripting capabilities, with expertise in window functions, CTEs, stored procedures, and performance optimization for both traditional and cloud-native databases.
- Experience integrating machine learning models into production using Python, Databricks ML, and Vertex AI (GCP) to support predictive analytics and intelligent automation.
- Experienced in developing RESTful APIs using Python (Flask/FastAPI) and Java (Spring Boot) for automated data delivery, service integration, and microservices architecture within GCP.
- Familiar with modern DevOps and CI/CD tools for data engineering, including Terraform for GCP, GitHub Actions, and Docker, streamlining deployment pipelines for Databricks and GCP-based workflows.
- Expertise in dashboarding and data storytelling using Power BI and D3.js, and in integrating visualizations into web applications powered by Python or Java backends.
- Strong advocate for data security and governance in GCP, implementing IAM, VPC-SC, data encryption, and access control policies to ensure compliance with regulatory standards.
- Collaborative team player and communicator, frequently partnering with data scientists, ML engineers, and GCP solution architects to build scalable, cloud-native systems in Databricks and GCP.
- Passionate about continuous learning, always exploring new capabilities within GCP, Python, Java, and Databricks to improve efficiency, performance, and team productivity in Agile environments.
TECHNICAL SKILLS:
Cloud Platforms: AWS (S3, EC2, RDS, Redshift, Lambda, Glue, Athena, EMR, Lake Formation, Kinesis, Step Functions, CodePipeline, IAM); Azure (Data Factory, ADLS, Synapse Analytics, Stream Analytics, Functions, Monitor, DevOps, RBAC); GCP (BigQuery, Cloud Storage, Dataflow, Cloud Functions, Dataproc)
Big Data & Distributed Computing: Apache Spark (Core, SQL, PySpark, Streaming), Apache Kafka, Apache Beam, Hadoop (MapReduce, Hive, HDFS, HBase)
ETL & Data Integration: DBT, Matillion, Talend, Informatica
Programming & Scripting: Python (Pandas, NumPy, Matplotlib), Scala, Java, PL/SQL, SQL, PowerShell
Orchestration & Workflow: Apache Airflow, GitHub Actions
Data Warehousing & Storage: Azure Synapse Analytics, Snowflake, Redshift, BigQuery
Data Modeling & Governance: Kimball Methodology, Star & Snowflake Schemas, Erwin Data Modeler, Data Lineage, Encryption, Audit Logging
DevOps & CI/CD: Jenkins, Terraform, AWS CloudFormation, Git
Visualization & BI: Power BI, D3.js, Excel (Pivot Tables)
APIs & Integration: RESTful APIs, Internal APIs, Event-Driven Architecture
Security & Access Control: IAM, RBAC
Other Tools & Practices: Bugzilla, Azure Monitor, Agile (Kanban, Scrum), JIRA

CERTIFICATIONS:
- Freedom with AI - AI Tools, Prompt Engineering, Content Creation
- AWS: Certified Solutions Architecture Job Simulation
- Data Visualisation: Empowering Business with Effective Insights - TATA Consultancy Services

EXPERIENCE:
Client: InfoSmart Technologies (Remote)
Data Engineer | Dec 2022 to Present
Essential Contributions:
- Created and sustained scalable data pipelines for the Equinix Digital Ecosystem Platform, facilitating real-time analytics and smooth data transfer across worldwide data centers.
- Developed data quality frameworks and automated processes utilizing Apache Airflow, improving the reliability and efficiency of enterprise data management.
Roles & Responsibilities:
- Developed, maintained, and automated data processing workflows on the Google Cloud Platform (GCP) to ensure efficient data handling and transformation.
- Built and optimized scalable ETL pipelines on GCP using Spark and Airflow to support large-scale data processing and analytics.
- Partnered with data science teams to prepare data sets and support ML model deployment for personalization use cases.
- Developed Airflow DAGs to orchestrate and monitor data workflows, improving pipeline reliability and transparency.
- Designed and implemented scalable data pipelines for data ingestion and transformation using Spark.
- Performed data cleaning and transformation tasks to ensure data quality and consistency, using SQL and Python for data manipulation and preparation.
- Engaged with stakeholders to understand data needs and deliver tailored insights, effectively communicating findings and recommendations to drive strategic decisions.
- Validated data quality and completeness using SQL and BigQuery, ensuring consistent, trustworthy outputs for downstream analytics.
- Created detailed reports and dashboards utilizing Power BI and Looker for comprehensive data analysis and insights.
- Optimized SQL performance through indexing, filtering, and efficient joins to reduce query time and enhance data accessibility across business units on platforms like Redshift, BigQuery (GCP), Snowflake, and Databricks.
- Engineered real-time data pipelines using AWS Kinesis, Apache Beam, and GCP Pub/Sub to capture and process high-velocity event streams for immediate downstream consumption.
- Designed and implemented OLTP systems focusing on transaction throughput and consistency, supporting real-time operations across critical GCP- and AWS-based platforms.
- Applied AWS Lambda and Google Cloud Functions to build lightweight, serverless functions for data transformations using Python, improving flexibility and reducing infrastructure overhead.
- Used Presto, BigQuery, and Snowflake to unify querying across distributed sources, improving access to cross-platform data with minimal performance lag.
- Developed and maintained ETL frameworks using DBT, AWS Glue, and Databricks Notebooks built with Python to ensure timely data delivery, clean transformations, and scalable logic across AWS, GCP, and Snowflake environments.
- Created and managed Kimball-style data marts on Redshift, BigQuery, Snowflake, and Databricks to support reporting and dashboarding use cases for various business domains.
- Leveraged Spark-SQL and Databricks with advanced SQL constructs such as window functions and CTEs to streamline logic and optimize data transformations.
- Utilized Pandas and NumPy in Python to perform complex data cleaning, feature engineering, and analytical preprocessing for both batch and streaming pipelines on GCP, Databricks, and Snowflake.
- Integrated RESTful APIs and internal systems using Python, Java, and Google Cloud Endpoints to automate ingestion workflows and expand data connectivity across platforms.
- Developed interactive data visualizations using D3.js, JavaScript, and Python libraries to present real-time insights clearly to business stakeholders, deployed via GCP App Engine.
- Streamlined governance practices using data lineage tracking tools like Databricks Unity Catalog, Google Cloud Data Catalog, and Snowflake to ensure compliance, traceability, and trust in analytical outputs.
- Employed MapReduce processing on Hadoop clusters and Dataproc (GCP) to efficiently handle large-scale datasets and reduce batch runtime.
- Applied bucketing and partitioning strategies in Apache Hive, BigQuery, and Snowflake to reduce scan overhead and accelerate query execution.
- Constructed data pipelines using Apache Spark, Apache Beam, and Databricks, enabling high-volume batch and streaming transformations in GCP, Snowflake, and multi-cloud environments.
- Used Amazon Redshift, BigQuery, and Snowflake to build performant data warehouse environments, optimize schema design, and improve analytical response times.
- Enforced data governance policies using AWS Lake Formation, GCP IAM, and Databricks Access Control to maintain secure, well-cataloged data lakes.
- Utilized Databricks and Snowflake extensively for collaborative engineering, version control, and efficient execution of Python- and SQL-based analytics workflows across GCP and AWS.
- Collaborated with DevOps teams to implement end-to-end CI/CD automation via AWS CodePipeline, Cloud Build (GCP), and Databricks Repos, enabling smooth code releases.
- Automated infrastructure setup and configuration through AWS CloudFormation, Terraform for GCP, and Databricks CLI, ensuring consistent, repeatable deployments across cloud environments.
- Built and deployed containerized data applications using Docker, orchestrated via Kubernetes (GKE in GCP) for high availability, auto-scaling, and simplified microservices management.
Environment: SQL, GCP, AWS Kinesis, AWS Lambda, Presto, DBT, AWS Glue, Databricks, Spark-SQL, Python (Pandas, NumPy), Java, RESTful APIs, D3.js, JavaScript, Hadoop, MapReduce, Apache Hive, Apache Spark, Apache Beam, Amazon Redshift, BigQuery, AWS Lake Formation, AWS CodePipeline, AWS CloudFormation, Terraform, Docker, Kubernetes.

Client: Capgemini (Johnson and Johnson)
Sr Analyst/Software Engineer | Dec 2018 to Dec 2021
Main Contributions:
- Created and enhanced data pipelines on AWS with PySpark and SQL to analyze large-scale customer behavior and transaction data for immediate insights.
- Developed automated workflows for data validation and transformation with Apache Airflow and DBT, enhancing data accuracy and decreasing manual intervention by 40%.
Roles & Responsibilities:
- Utilized Hadoop technologies such as Hive and Spark to construct efficient data pipelines, significantly improving data flow and processing within the company.
- Collaborated with team members to design and implement data models for efficient data processing, contributing to improved business intelligence and decision-making.
- Assisted in migrating on-premises data to Google Cloud, ensuring a seamless transition with minimal downtime, and scheduled the migration workflows using Airflow.
- Worked with various teams to understand their data needs, using this information to develop effective data pipelines that supported business goals.
- Collected and organized data from various sources, including the Learning Management System.
- Resolved data processing problems as they arose, ensuring the delivery of high-quality, reliable data.
- Committed to continuous learning of new data engineering methodologies and tools, contributing to the team's technical knowledge and proficiency.
- Developed interactive dashboards and reports using Power BI and Tableau to visualize key metrics.
- Communicated key findings from data to multiple stakeholders to facilitate data-driven decisions.
- Built scalable ETL workflows in Matillion to extract, transform, and load data across cloud environments, including AWS, GCP, and Databricks, accelerating pipeline development.
- Developed distributed batch processing jobs on Amazon EMR and Dataproc (GCP) using Apache Spark, Hive, and Python, efficiently handling large-scale transformations and joins.
- Designed and optimized advanced PL/SQL procedures to streamline critical business operations, ensuring efficient data transformation and reliable reporting delivery across cloud and hybrid platforms.
- Applied Apache Spark and Databricks for high-performance data aggregation, cleansing, and shaping across structured and semi-structured data sources on GCP, AWS, and on-prem environments.
- Configured Informatica workflows and reusable mappings to automate ingestion from diverse enterprise systems, enhancing reusability, modularity, and consistency across cloud platforms like GCP and AWS.
- Developed and integrated RESTful APIs using Python and Java to connect enterprise data systems with external platforms, enabling real-time data exchange and automation.
- Created dashboards and performance visualizations using Python (Matplotlib), Databricks, and GCP Looker Studio to monitor data pipeline health and optimize runtime efficiency.
- Automated job execution on Amazon EC2 and GCP Compute Engine, scheduling compute-intensive processes during off-peak hours for cost efficiency.
- Enforced data security using AWS IAM policies and GCP IAM, implementing least-privilege access and role-based controls across multi-cloud environments.
- Implemented centralized data governance using AWS Lake Formation, Google Cloud Data Catalog, and Databricks Unity Catalog to define fine-grained access policies and ensure regulatory compliance.
- Built real-time streaming data pipelines using Apache Kafka, GCP Pub/Sub, and Databricks Structured Streaming, facilitating rapid, reliable data exchange across distributed microservices.
- Designed event-triggered data processing solutions using AWS Lambda, Cloud Functions (GCP), and Python, connecting real-time ingestion with downstream transformation layers.
- Coordinated serverless data workflows using AWS Step Functions and GCP Workflows, improving fault tolerance and workflow orchestration across dependent cloud services.
- Led migration of legacy data warehouses to Snowflake, BigQuery, and Databricks Delta Lake, reducing query times and simplifying data access for business stakeholders.
- Provisioned cloud infrastructure using Terraform for AWS, GCP, and Databricks, enabling repeatable, version-controlled deployments across staging and production.
- Deployed CI/CD pipelines using Jenkins, GitHub Actions, and Databricks Repos to automate data pipeline builds, testing, and releases, improving code reliability and reducing deployment risks.
- Designed dimensional models in Snowflake using star and snowflake schemas to support high-performance analytics and self-service BI tools like Power BI and Looker (GCP).
- Queried petabyte-scale datasets using Amazon Athena, BigQuery, and Presto, enabling analysts to gain near real-time insights without heavy infrastructure overhead.
- Structured raw and curated datasets in Amazon S3, Google Cloud Storage, and Delta Lake (Databricks), creating a robust, scalable data foundation for analytics and long-term archiving.
- Managed processing of unstructured big data using Hadoop-based architectures, GCP Dataproc, and Apache Spark, enabling downstream analysis in BI and reporting platforms.
- Established data quality checks and exception handling frameworks using Python, Databricks Workflows, and Airflow to track anomalies, validate integrity, and ensure reliable reporting.
Environment: Matillion, Amazon EMR, Apache Spark, Hive, PL/SQL, Informatica, RESTful APIs, Python (Matplotlib), Amazon EC2, AWS IAM, AWS Lake Formation, Apache Kafka, AWS Lambda, AWS Step Functions, Snowflake, Terraform, Jenkins, Amazon Athena, Amazon S3, Hadoop.

Client: Capgemini (Johnson and Johnson)
Software Analyst/Engineer | Sep 2016 to Nov 2018
Roles & Responsibilities:
- Designed and implemented a microservices architecture using Spring Boot.
- Assisted in the development of a healthcare management system using Spring MVC.
- Implemented RESTful APIs for user authentication and progress tracking, adhering to REST principles and best practices.
- Utilized Hibernate for object-relational mapping, facilitating efficient interaction with the MySQL database.
- Communicated with the end client to support the application and to analyze and fix issues.
- Collaborated with cross-functional teams in an agile development environment, ensuring the timely delivery of project milestones.
- Extensively used Java OOP concepts to develop automation frameworks using Eclipse, Selenium WebDriver, Cucumber, and TestNG.
- Implemented POM and data-driven frameworks, and executed automation scripts and manual test cases in different environments.
- Developed automated solutions to expedite testing, addressing unit testing, regression testing, negative testing, and bug retests.
- Managed full-lifecycle data pipelines in Azure Data Factory, coordinating data ingestion, transformation, and validation across hybrid cloud sources.
- Structured multi-zone data architecture in Azure Data Lake Storage (ADLS), organizing raw, refined, and curated layers to meet enterprise reporting needs.
- Automated infrastructure provisioning with PowerShell scripts, accelerating environment setup and minimizing human error.
- Built and optimized high-performance data marts using Azure Synapse Analytics, enabling large-scale analytics through serverless SQL pools.
- Designed and deployed real-time analytics pipelines using Azure Stream Analytics to monitor IoT telemetry and detect anomalies, enhancing operational visibility.
- Developed and optimized data transformation scripts with PySpark, enabling machine learning readiness and reducing ETL latency.
- Built scalable ETL workflows in Azure Databricks, leveraging Delta Lake to streamline batch and streaming processes for improved data availability.
- Automated event-driven processing using Azure Functions, enhancing scalability and reducing response time in real-time data applications.
- Deployed Azure services focusing on high availability and compliance, using Azure Monitor and Application Insights for end-to-end observability.
- Implemented RBAC (Role-Based Access Control) to secure sensitive data access and enforce governance policies across environments.
- Designed dimensional models using star schema principles, improving BI performance and simplifying data navigation for analysts.
Environment: ADF, ADLS, PowerShell, Azure Synapse Analytics, Azure Stream Analytics, PySpark, Azure Databricks, Delta Lake, Azure Functions, Azure Monitor, Application Insights, RBAC, Apache Spark.