| Julia - Data Engineer |
| [email protected] |
| Location: Little Elm, Texas, USA |
| Relocation: Yes |
| Visa: GC |
| Resume file: Julia_Reddy_Data_Engineer(1)_1772743485136.docx |
|
Julia Reddy | Sr Data Engineer | Phone: 312-554-5568
Email: [email protected]

SR DATA ENGINEER

PROFESSIONAL SUMMARY
- Over 10 years of professional IT experience, with more than 5 years in Data Engineering and 4 years in Data Warehousing.
- Extensive hands-on experience with Azure Cloud (Azure Data Factory, Azure Databricks, Azure Synapse, Azure Logic Apps, and Azure DevOps) and AWS Cloud (AWS S3, Redshift, Glue, Lambda, SNS, SQS, Athena).
- Expertise in designing, implementing, and optimizing end-to-end data pipelines using PySpark, Spark, and SQL, ensuring efficient data extraction, transformation, and loading (ETL) from multiple data sources.
- Proven experience in leveraging Snowflake for scalable data solutions, including data modeling, partitioning, and using Snowpipe for real-time data ingestion.
- Hands-on experience in SnowSQL, advanced SQL queries, and performance tuning techniques to optimize large-scale data workloads on Snowflake.
- Strong proficiency in data streaming and event-driven architectures, utilizing Apache Kafka, Spark Streaming, and AWS Kinesis for real-time data processing and analytics.
- Skilled in data integration, automating workflows with Azure Logic Apps, AWS Glue, and Apache Airflow, and integrating event processing with AWS SNS and SQS.
- Expertise in building and maintaining data architectures with high performance and scalability, using technologies such as Hadoop, Hive, MapReduce, and HDFS.
- Experience with containerization and deploying scalable data processing applications using Docker, Kubernetes, and cloud infrastructure tools like Terraform and ARM templates.
- Knowledge of using Azure Functions and AWS Lambda for serverless computing and automation of data transformation tasks.
- Proficient in developing real-time data processing pipelines and improving system efficiency using Apache Spark, PySpark, HDFS, and Apache Sqoop.
- Experienced in data security, implementing measures such as encryption, data masking, and access control using Azure Key Vault, AWS IAM, and Snowflake roles for data governance.
- Skilled in CI/CD pipeline development and automated deployments using Azure DevOps, AWS CodePipeline, and Git for continuous integration and deployment of data solutions.
- Hands-on experience in monitoring and performance optimization using tools like AWS CloudWatch, Azure Monitor, and Snowflake's caching mechanisms for efficient data processing and storage.
- Expertise in working with large datasets and performing data transformations, cleaning, and validation using Python, SQL, Scala, and HiveQL.
- Strong background in data warehousing solutions, including the creation and optimization of data marts, Enterprise Data Warehouses (EDWs), and star/snowflake schemas.
- Experience with Agile methodologies and collaborating with cross-functional teams for the seamless development of data solutions.

EDUCATION
Bachelor's in Computer Science, Jawaharlal Nehru Technological University, Hyderabad, India.
TECHNICAL SKILLS
Azure Services: Azure Data Factory, Azure Databricks, Azure Synapse, Azure Logic Apps, Azure Functions, Azure DevOps, Snowflake
AWS Services: AWS S3, Redshift, EMR, SNS, SQS, Athena, Glue, CloudWatch, Kinesis, Route 53, IAM, Lambda
Cloud Technologies: Microsoft Azure, Google Cloud Platform (GCP), PCF
Big Data Technologies: MapReduce, Hive, HBase, PySpark, Spark, Spark Streaming, Kafka, Sqoop, Zookeeper, Oozie, Flume, Pig, Tez
Hadoop Distributions: Cloudera, Hortonworks
Languages: Java, SQL, PL/SQL, Python, HiveQL, Scala, Bash
Web Technologies: HTML, CSS, JavaScript, XML, JSP, RESTful APIs, SOAP
Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS
Build Automation Tools: Ant, Maven, Jenkins
Version Control: Git, GitHub, GitLab, VSS
Methodology: Agile, Scrum, Continuous Integration/Continuous Deployment (CI/CD)
IDE & Build Tools: Eclipse, Visual Studio, IntelliJ IDEA
Databases: MS SQL Server (2016/2014/2012), Azure SQL Database, Azure Synapse Analytics, Cosmos DB, Oracle 11g/12c, Teradata
Data Warehousing & ETL: Snowflake, Redshift, HDFS, Hive, Spark SQL, T-SQL, ETL workflows (AWS Glue, Azure Data Factory)
Additional Skills: Oracle Database Administration, Oracle Data Guard, Oracle Recovery Manager, Oracle Application Server, MS SQL Server Administration, T-SQL programming, SQL Server Performance Tuning, SQL Server Jobs, SSRS, Unix Shell Scripting, Docker, Kubernetes, CI/CD Pipelines (Azure DevOps, Jenkins), Incident Reporting Systems (ServiceNow, JIRA), PowerShell (version 5.1+)

CERTIFICATIONS
- Microsoft Certified: Azure Data Engineer Associate (DP-203)
- AWS Certified Data Analytics - Specialty
- Google Professional Machine Learning Engineer
- Snowflake SnowPro Core Certification
- Databricks Certified Data Engineer Associate
- Microsoft Certified: Azure AI Fundamentals
- Microsoft Certified: Data Analyst Associate (Power BI)
- AWS Certified Solutions Architect - Associate
- Microsoft Certified: Azure DevOps Engineer Expert (AZ-400)
- Python for Data Science and Machine Learning Bootcamp - Udemy
- Complete Python Developer in 2023 - Udemy
- Python Fundamentals - Pluralsight
- Google Cloud Professional Data Engineer

PROFESSIONAL EXPERIENCE
- Managed end-to-end Big Data flow within the application, including data ingestion from upstream sources to HDFS, as well as processing and analysis of data in HDFS and Azure Data Lake.
- Analyzed data from Azure data stores using Databricks and Spark cluster capabilities, extracting valuable insights from large datasets.
- Designed and deployed scalable data science applications in Python 3 using AWS Lambda, EventBridge, and Step Functions to automate eligibility scoring workflows.
- Integrated AWS S3, Aurora, and API Gateway into claims processing pipelines, enabling secure access to ML insights from internal dashboards.
- Used AWS Cloud Development Kit (CDK) to provision serverless infrastructure, including Lambda functions, IAM roles, and SQS queues.
- Supported end-to-end deployment automation using GitLab CI/CD, managing pipeline triggers for staging and production environments.
- Worked extensively with Azure Data Factory, Azure Databricks, and PySpark to develop and maintain ETL data pipelines for seamless data ingestion, transformation, and loading into Azure Synapse Analytics.
- Developed custom activities using Azure Functions, Azure Databricks, and PowerShell scripts to perform data transformations, data cleaning, and validation processes.
- Developed real-time data processing pipelines using Apache Flink to handle member eligibility and claim event streams.
- Integrated Flink with Kafka and Azure Event Hub for stream enrichment and real-time joins, improving processing latency and SLA adherence.
- Tuned Flink job configurations for checkpointing, fault tolerance, and backpressure handling in a distributed Azure environment.
- Implemented Azure Event Hub for real-time data ingestion, enabling efficient streaming and processing of high-volume data, and integrated Azure Stream Analytics for real-time analytics.
- Integrated Azure API Management to expose secure data services across internal apps and APIs.
- Configured Azure Private Link and Service Endpoints for secure data access to ADF, Blob, and SQL resources.
- Deployed Kafka Connectors for CDC pipelines from transactional DBs to Snowflake via Spark Streaming.
- Built end-to-end real-time ingestion using Azure Event Hub → Stream Analytics → Snowflake → Power BI.
- Designed ARM + Bicep templates for reusable infrastructure provisioning (ADF, Databricks, Synapse).
- Used Azure Purview for data cataloging, lineage tracking, and PII classification.
- Created data quality dashboards using Great Expectations and Azure Monitor alerts.
- Integrated MLOps pipelines using Databricks MLflow + Azure ML endpoints for batch scoring.
- Leveraged AWS S3, Redshift, and AWS Glue for data integration, moving data between Azure and AWS for hybrid cloud solutions.
- Proficient in AWS EC2, ECS, EKS, CloudWatch, SNS, SQS, Lambda, and AWS RDS for cloud infrastructure, ensuring scalable, secure, and efficient deployment of applications.
- Hands-on experience using Kafka and Spark Streaming to process real-time data, improving the speed and efficiency of data processing pipelines.
- Optimized data fetching and querying in Azure Synapse by using indexing, partitioning, and performance tuning strategies, ensuring efficient access to large datasets.
- Developed and deployed SSIS/SSRS packages for data extraction, transformation, and reporting, significantly improving data accuracy and reducing reporting time.
- Utilized Google Cloud Platform (GCP) for deploying Java full-stack applications, integrating services such as Compute Engine, App Engine, and Cloud Storage for a unified data platform.
- Used Terraform to automate infrastructure provisioning and management, ensuring consistent and reproducible deployments in both Azure and AWS environments.
- Leveraged Azure DevOps for CI/CD pipeline automation, streamlining the development and deployment of data pipelines and applications across environments.
- Designed and implemented data pipeline orchestration using YAML pipelines in Azure DevOps, ensuring efficient execution of workflows.
- Developed HiveQL scripts for data analysis and processing, utilizing Apache Oozie and Apache Airflow for workflow management and scheduling.
- Participated in Agile ceremonies, including daily stand-ups, sprint planning, and PI Planning, ensuring project deadlines were met and cross-functional collaboration was maintained.
- Implemented data security measures in both Azure and AWS environments, including data encryption, data masking, and access control using Azure Key Vault and AWS IAM.
Environment: Azure Databricks, Azure Data Factory, Snowflake, AWS, Google Cloud Platform (GCP), Azure Logic Apps, Terraform, Azure Event Hub, Docker, Kubernetes, Kafka, PySpark, Spark Streaming, HDFS, MapReduce, YARN, SQL, Hive, Jenkins, Git, Power BI, JIRA.
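Below is a minimal, illustrative PySpark sketch of the Databricks/ADF ETL step described in this engagement (ingest raw claim events, clean and validate them, and persist a curated table). The ADLS path, column names, and target table are hypothetical placeholders, and a Databricks runtime with Delta Lake is assumed.

```python
# Illustrative only: hypothetical paths, columns, and table names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_etl_sketch").getOrCreate()

# Read raw claim events landed by the upstream ingestion pipeline (assumed ADLS container).
raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/claims/")

# Basic cleaning and validation before loading downstream consumers.
clean = (
    raw.dropDuplicates(["claim_id"])
       .filter(F.col("claim_amount").isNotNull())
       .withColumn("ingest_date", F.current_date())
)

# Persist as a partitioned Delta table for Synapse / Power BI consumption.
(clean.write.format("delta")
      .mode("append")
      .partitionBy("ingest_date")
      .saveAsTable("curated.claims_events"))
```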
- Developed and implemented SSIS and SSRS packages to extract, transform, and load data from multiple sources including DB2, SQL, Oracle, flat files (CSV, delimited), APIs, XML, and JSON, improving data accuracy and reporting efficiency.
- Designed and implemented real-time data pipelines using Azure Event Hub for data streaming and processing large volumes of data, enabling immediate data availability for analysis.
- Designed, built, and deployed a range of applications utilizing the full AWS stack (including EC2, ECS, EKS, SQS, SNS, CloudFront, MongoDB, Docker) to ensure high availability, fault tolerance, and auto-scaling for dynamic workloads.
- Leveraged containerization technologies such as Docker to deploy and manage scalable data processing environments, ensuring easy portability and deployment consistency across environments.
- Implemented Spark Streaming for real-time data processing and analysis, enabling quick insights from streaming data sources and driving faster decision-making.
- Implemented and maintained end-to-end data pipelines, ensuring the smooth flow, transformation, and integration of data across AWS, Azure, and GCP platforms.
- Utilized Terraform to automate infrastructure provisioning, ensuring consistent and reproducible deployments in both AWS and Azure environments.
- Automated Redshift + Snowflake data loads with Airflow DAGs and Lambda-triggered jobs (see the Airflow sketch below).
- Configured AWS Glue jobs + DataBrew for data cleaning, wrangling, and cataloging.
- Set up EMR clusters with bootstrap actions for dynamic Spark job configurations.
- Integrated AWS Lake Formation for fine-grained security across curated and raw zones.
- Developed Kinesis Data Firehose → S3 → Athena pipelines for operational log analytics.
- Used Terraform with the AWS and AzureRM providers for hybrid IaC provisioning.
- Implemented AWS Step Functions + Lambda for orchestration of micro ETL jobs.
- Managed DNS configurations and routing using AWS Route 53 to ensure efficient application and service deployment.
- Designed and implemented data streaming solutions using AWS Kinesis for real-time data processing and improved business intelligence reporting.
- Proficient in using AWS CloudWatch for monitoring resources, setting up alarms, and collecting metrics for proactive system management.
- Integrated AWS SNS and SQS for efficient real-time event processing and messaging, improving system responsiveness.
- Integrated Azure DevOps for CI/CD pipeline management, automating the testing, building, and deployment of data solutions across environments, ensuring continuous delivery and quality assurance.
- Created and managed YAML pipelines in Azure DevOps to automate the build, test, and deployment workflows for data processing systems.
- Designed and implemented data ingestion and storage solutions using AWS S3, Redshift, and AWS Glue, ensuring efficient data storage and transformation for downstream applications.
- Conducted performance tuning and optimization of data pipelines, queries, and data models, improving overall system efficiency and throughput.
- Worked with AWS Glue to implement ETL workflows, automating the extraction, transformation, and loading of data into Redshift and other cloud data storage solutions.
- Wrote HiveQL queries for data analysis, ensuring the business's requirements were met by creating and working with Hive tables to simulate MapReduce functionalities.
- Created partitions and buckets based on data attributes to enhance performance and enable efficient processing using Hive joins.
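As a rough illustration of the Airflow-plus-Lambda load automation noted above, the sketch below defines a daily DAG that asynchronously invokes a warehouse-load Lambda. The DAG id, Lambda function name, and schedule are assumptions, not the original implementation.

```python
# Sketch of a daily Airflow DAG triggering a (hypothetical) warehouse-load Lambda.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

def trigger_load_lambda(**_context):
    # Fire-and-forget invocation of a Lambda that copies staged S3 files into Redshift/Snowflake.
    boto3.client("lambda").invoke(
        FunctionName="example-warehouse-load",  # placeholder name
        InvocationType="Event",
    )

with DAG(
    dag_id="warehouse_load_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="trigger_load", python_callable=trigger_load_lambda)
```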
- Developed real-time Python 3 applications for inventory analytics and product personalization, leveraging AWS DynamoDB, Lambda, and Step Functions.
- Built API-first microservices using AWS API Gateway + Lambda to expose machine learning results to customer-facing web apps.
- Implemented IaC using CloudFormation templates to deploy data ingestion and processing infrastructure across dev/test/prod.
- Maintained continuous delivery workflows with GitLab runners, version-controlled codebases, and approval gates for retail analytics products.
- Utilized RDDs and DataFrames (Spark SQL) with PySpark to process and analyze large datasets, ensuring faster data transformations and insights.
- Implemented CI/CD pipelines to build and deploy projects within the Hadoop environment, automating deployment workflows and reducing manual intervention.
- Used Git for version control, managing code repositories, and ensuring collaborative development efforts across teams.
- Collaborated with cross-functional teams to gather business requirements, design data models, and develop solutions that meet business and technical needs.
- Performed data profiling to identify data patterns, anomalies, and actionable insights that contributed to business decision-making.
- Implemented data governance and security measures, ensuring compliance with industry regulations and securing sensitive data.
Environment: AWS, AWS S3, Redshift, Glue, Data Factory, Logic Apps, Azure Event Hub, Spark Streaming, Data Pipeline, Terraform, Azure DevOps, YAML, Oracle, HDFS, MapReduce, YARN, Spark, Hive, SQL, Python, Scala, PySpark, Git, JIRA, Jenkins, Kafka, ADF Pipeline, Power BI, AWS Kinesis, CloudWatch.

- Utilized Sqoop to import data from MySQL to Hadoop Distributed File System (HDFS) on a regular basis, ensuring seamless data integration between relational databases and big data systems.
- Performed aggregations on large datasets using Apache Spark and Scala, storing processed data in the Hive warehouse for further analysis and business insights.
- Built Flink-based stream analytics for processing vehicle telemetry data in real time, supporting predictive maintenance use cases.
- Engineered IoT and vehicle telemetry ingestion pipelines using AWS S3, EventBridge, and Lambda, enabling near real-time fleet analytics.
- Designed serverless ML trigger workflows with Step Functions to orchestrate batch scoring and anomaly detection jobs.
- Provisioned infrastructure using AWS CDK to manage scalable compute and storage resources across data and ML workloads.
- Collaborated with dev teams to implement Git-based CI/CD pipelines using GitLab, automating deployment of containerized Python 3 applications.
- Deployed Flink jobs on Kubernetes using custom Docker containers, integrated with GCP Pub/Sub and BigQuery for downstream analytics.
- Designed Flink stateful operations and sliding windows for advanced time-series analysis across streaming IoT signals.
- Worked extensively with Data Lakes and big data ecosystems, including Hadoop, Spark, Hortonworks, and Cloudera, to leverage their capabilities for efficient data processing and management at scale.
- Successfully loaded and transformed large sets of structured, semi-structured, and unstructured data, enabling effective analysis and actionable insights generation.
- Developed Hive queries to analyze data and meet specific business requirements, utilizing HiveQL to simulate MapReduce functionalities and optimize performance.
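A rough PySpark/HiveQL sketch of the aggregate-into-the-Hive-warehouse pattern described in this block (the original work also used Scala); the database, table, and column names are illustrative only.

```python
# Illustrative aggregation of raw telemetry into a Hive warehouse table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("telemetry_agg_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Summarize vehicle telemetry per vehicle per day and persist the result to Hive.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.vehicle_daily_stats AS
    SELECT vehicle_id,
           to_date(event_ts)  AS event_date,
           AVG(engine_temp)   AS avg_engine_temp,
           MAX(speed_kmh)     AS max_speed_kmh,
           COUNT(*)           AS event_count
    FROM raw.vehicle_telemetry
    GROUP BY vehicle_id, to_date(event_ts)
""")
```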
- Built HBase tables by leveraging HBase integration with Hive, facilitating efficient storage and retrieval of data for scalable data processing.
- Built predictive maintenance pipelines using PySpark + Spark MLlib for vehicle health analytics.
- Created data mesh-style domains with Delta Lake zones and decentralized pipelines.
- Published Kafka messages using the Avro schema registry for schema evolution.
- Used Grafana + Prometheus for monitoring Spark job metrics and system KPIs.
- Integrated Databricks Repos + GitHub Actions for version-controlled notebook CI/CD.
- Tuned long-running PySpark jobs with dynamic allocation and shuffle partition tuning.
- Designed a reusable Python framework for ingesting IoT edge data from Azure IoT Hub.
- Applied Kafka and Spark Streaming to process real-time data streams, enabling immediate insights and decision-making for operational efficiency.
- Designed and implemented data pipelines using Kafka, Spark, and Hive, ensuring seamless data ingestion, transformation, and data analysis across multiple systems.
- Migrated existing data from RDBMS (Oracle) to Hadoop using Sqoop, enhancing the ability to process and analyze large datasets within the Hadoop ecosystem.
- Developed custom PL/SQL scripts and automation tools to handle data validation, cleaning, and transformation processes, ensuring data accuracy and consistency throughout the pipeline.
- Implemented Continuous Integration and Continuous Deployment (CI/CD) pipelines to streamline the development, testing, and deployment of big data projects in a Hadoop environment.
- Utilized JIRA for managing project workflows, tracking issues, and collaborating with cross-functional teams to ensure successful project completion.
- Worked with Spark using Python (PySpark) and Spark SQL to perform data testing and processing, enabling faster and more efficient analysis of large datasets.
- Employed Spark Streaming to divide streaming data into batches as input to the Spark engine, facilitating real-time data processing and analysis.
- Utilized Zookeeper for efficient coordination, synchronization, and serialization of servers within clusters, ensuring reliable distributed data processing.
- Worked with the Oozie workflow engine for scheduling and automating Hadoop jobs, ensuring the seamless execution of complex data processing workflows.
- Leveraged Git as a version control tool for maintaining code repositories, enabling collaborative development and efficient version tracking across team members.
- Collaborated closely with the engineering team to identify and resolve JVM-related issues, ensuring smooth execution of data workflows and optimizing performance.
Environment: Spark, Hive, HBase, Sqoop, Kafka, Zookeeper, Oozie, PySpark, SQL, Scala, HDFS, MapReduce, JIRA, Git, AWS, Azure, CI/CD Pipelines, Terraform, Oracle, Linux.

- Developed ETL jobs using Spark-Scala to migrate data from Oracle to MySQL tables, ensuring seamless data transformation and integration between relational databases and Hadoop (a PySpark equivalent of this pattern is sketched below).
- Utilized Spark-Scala (RDDs, DataFrames, Spark SQL) and the Spark-Cassandra-Connector API for various tasks including data migration and business report generation, significantly improving data processing workflows.
- Developed a Spark Streaming application for real-time sales analytics, enabling immediate insights for decision-making and business operations.
- Analyzed and processed data from multiple sources, including Excel sheets, flat files, and CSV files, and used Power BI to generate ad-hoc reports, providing business stakeholders with timely insights.
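The Oracle-to-MySQL migration jobs above were written in Spark-Scala; the sketch below shows the same JDBC read/write pattern in PySpark under assumed connection details (hosts, schemas, and credentials are placeholders, and the Oracle and MySQL JDBC driver jars are assumed to be on the Spark classpath).

```python
# PySpark sketch of the Spark-Scala JDBC migration pattern; all connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle_to_mysql_sketch").getOrCreate()

# Read the source table from Oracle over JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "REDACTED")
          .option("driver", "oracle.jdbc.OracleDriver")
          .load())

# Write the result into a MySQL staging table.
(orders.write.format("jdbc")
       .option("url", "jdbc:mysql://mysql-host:3306/analytics")
       .option("dbtable", "orders_stg")
       .option("user", "etl_user")
       .option("password", "REDACTED")
       .option("driver", "com.mysql.cj.jdbc.Driver")
       .mode("overwrite")
       .save())
```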
- Analyzed SQL scripts and designed the solution to be implemented using PySpark, optimizing data transformations and performance.
- Extracted data from multiple data sources using Sqoop and loaded it into HDFS, enabling efficient storage and analysis within the Hadoop ecosystem.
- Used Hive ACID tables to ensure transactional update and delete support in regulatory datasets.
- Managed incremental ingestion using Sqoop's last-modified timestamp strategy.
- Implemented Spark Structured Streaming with checkpointing and watermarks (see the streaming sketch below).
- Created Parquet- and ORC-based partitioning logic for cost-effective S3 + HDFS storage.
- Automated spark-submit jobs on YARN via the Airflow BashOperator.
- Designed SCD Type 2 data marts for population and claims datasets.
- Handled data extraction from MySQL into HDFS using Sqoop, streamlining the process of moving large datasets from relational databases to Hadoop.
- Implemented automation for deployment processes by using YAML scripts for managing large-scale builds and releases, ensuring consistency and repeatability in the development cycle.
- Worked extensively with Apache Hive, Apache Pig, HBase, Spark, Zookeeper, Flume, Kafka, and Sqoop to implement complex data processing workflows and handle various data types.
- Implemented data classification algorithms using MapReduce design patterns, enhancing the ability to classify and categorize large datasets efficiently.
- Worked on performance tuning of MapReduce jobs by implementing combiners, partitioning, and distributed cache strategies, improving processing speed and reducing job runtime.
- Used Git for version control to maintain the source code in Git and GitHub repositories, ensuring collaborative development and version tracking across teams.
Environment: Hadoop, Hive, Pig, HBase, Apache Spark, Zookeeper, Flume, Kafka, Sqoop, MySQL, Oracle, Power BI, YAML, PySpark, SQL, MapReduce, Git, GitHub.

- Designed and implemented ETL processes using SQL Server to migrate data from various sources to JetBlue's Data Warehouse, ensuring smooth integration and transformation of data.
- Developed and maintained SQL Server Integration Services (SSIS) and SQL Server Reporting Services (SSRS) packages for data extraction, transformation, and reporting, improving the accuracy and timeliness of business insights.
- Managed and updated Erwin models for logical and physical data modeling of key data repositories such as the Consolidated Data Store (CDS) and Actuarial Data Mart (ADM).
- Utilized TFS (Team Foundation Server) for source control and tracking script deployments across different environments, ensuring consistency and reliability in data integration processes.
- Created and maintained SQL Server jobs, SQL Mail agents, and alerts to automate critical data operations, reducing manual intervention and increasing operational efficiency.
- Wrote and optimized T-SQL scripts for stored procedures, functions, views, and triggers, enabling effective data transformation and processing in a high-performance environment.
- Worked with SQL Server for data profiling, ensuring data integrity and identifying any discrepancies in business-critical data sources.
- Migrated legacy TFS + SSIS workloads to Azure DevOps + ADF pipelines.
- Integrated airline operations data into GCP BigQuery via Transfer Service.
- Used DAX + Power BI row-level security (RLS) for flight-specific reporting access.
- Tuned ETL loads using SSIS buffer tuning and parallel execution paths.
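A minimal sketch of the Spark Structured Streaming checkpointing-and-watermark pattern referenced in the bullets above; the Kafka broker, topic, event schema, and output paths are hypothetical.

```python
# Windowed streaming aggregation with a watermark and checkpointing; all names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_stream_sketch").getOrCreate()

# Consume raw events from a (hypothetical) Kafka topic.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "claim-events")
          .load())

parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.member_id").alias("member_id"),
    F.col("timestamp"),
)

# Events arriving more than 10 minutes late are dropped; counts are per 5-minute window.
counts = (parsed
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "member_id")
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/data/curated/claim_counts")
         .option("checkpointLocation", "/data/checkpoints/claim_counts")
         .start())
query.awaitTermination()
```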
- Created and managed databases such as the Reference Database, which served as the source for the legacy source systems' codes and values, optimizing data storage and retrieval processes.
- Implemented data governance practices, ensuring compliance with industry standards and protecting sensitive data throughout the ETL process.
- Conducted query tuning and performance optimization, ensuring the ETL processes were running efficiently and minimizing data processing time.
- Utilized SSIS for efficient data extraction, transformation, and loading (ETL) from flat files, Excel sheets, and legacy systems, ensuring data was ready for analysis and reporting.
- Assisted in data migration from legacy systems into the Data Warehouse, leveraging ETL tools and ensuring data was correctly transformed and stored for further use.
- Created custom reports using SSRS for various business stakeholders, allowing the team to make data-driven decisions with accurate, real-time information.
- Used Agile methodologies to manage project workflows, participating in daily stand-ups, sprint planning, and project retrospectives to ensure timely delivery of data solutions.
Environment: SQL Server (2016/2014/2012), T-SQL, Team Foundation Server (TFS), Erwin Data Models, Power BI, Agile, SQL, Oracle 10g, Visual Studio, JIRA.

- Developed complex stored procedures, efficient triggers, and necessary functions, along with creating indexes and indexed views to optimize performance in SQL Server.
- Extensive experience in monitoring and tuning SQL Server performance, employing best practices to ensure optimal database performance.
- Expertise in designing ETL data flows using SSIS, creating mappings and workflows for extracting data from SQL Server, as well as performing data migration and transformation from Access/Excel sheets using SQL Server SSIS.
- Proficient in dimensional data modeling for Data Mart design, identifying facts and dimensions, and developing fact tables and dimension tables using Slowly Changing Dimensions (SCD) techniques.
- Skilled in error and event handling techniques, such as precedence constraints, breakpoints, checkpoints, and logging, ensuring reliable and robust ETL processes.
- Developed Tabular SSAS models with DAX measures for advanced banking KPIs.
- Migrated on-prem data marts into Azure SQL + Synapse with PolyBase staging.
- Created an SSIS logging framework with email notifications + error snapshots to shared logs.
- Enforced RBAC + dynamic data masking in SQL Server + Snowflake (an illustrative sketch follows this block).
- Experienced in building cubes and dimensions with different architectures and data sources for business intelligence purposes, including writing MDX scripting.
- Proficient in developing SSAS cubes, implementing aggregations, defining KPIs (Key Performance Indicators), managing measures, partitioning cubes, and creating data mining models, as well as deploying and processing SSAS objects.
- Experience in creating ad hoc reports and reports with complex formulas, utilizing querying capabilities of the database for business intelligence purposes.
- Expertise in developing parameterized, chart, graph, linked, dashboard, scorecard, and drill-down/drill-through reports on SSAS cubes using SSRS (SQL Server Reporting Services).
Environment: MS SQL Server 2016, Visual Studio 2017/2019, SSIS, SharePoint, MS Access, Team Foundation Server, Git.
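To illustrate the dynamic data masking and RBAC bullet above, the sketch below applies a built-in SQL Server mask to a PII column and grants masked read access from Python via pyodbc; the connection string, table, column, and role are placeholders, not the original schema.

```python
# Apply SQL Server dynamic data masking and grant masked read access; all names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-host;DATABASE=DataMart;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Mask the email column so members of non-privileged roles see obfuscated values.
cursor.execute(
    "ALTER TABLE dbo.Members ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');"
)

# The reporting role can read the table but, lacking UNMASK, only sees masked data.
cursor.execute("GRANT SELECT ON dbo.Members TO reporting_role;")
conn.commit()
conn.close()
```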