Shravan Kumar
Senior Data Engineer
Location: Los Angeles, California, USA | Visa: Green Card (GC)
+1 940-277-7950 | [email protected] | LinkedIn

Summary
Over 10 years of software industry experience with a dedicated focus on Microsoft Azure and Amazon Web Services cloud services, a strong background in Big Data, and additional expertise in Data Warehousing.
Accomplished data engineer with an extensive background in end-to-end management of ETL data flows, ensuring flexibility and reliable performance.
Expertise in migrating SQL databases to Azure Data Lake, leveraging Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure Synapse Analytics (formerly SQL Data Warehouse).
Experience with AWS services such as S3, EC2, Lambda, Glue, EMR, SNS, Step Functions, IAM, CloudWatch, Athena, Redshift, QuickSight, and SageMaker.
Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as collecting, aggregating, and moving data using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
Proficient in managing database access control and executing the transition of on-premises databases to Azure Data Lake Store using Azure Data Factory for optimized data integration and analytics.
Leveraged Azure Event Hubs to create real-time dashboards and monitoring solutions, enhancing data-driven decision-making processes.
Implemented Azure Application Insights for comprehensive logging and diagnostics, enabling data-driven decision-making and continuous improvement of cloud resources.
Enhanced data security and compliance by leveraging Azure Databricks' built-in authentication and authorization features.
Orchestrated integration between on-premises and cloud data platforms using Azure Data Factory, applying complex transformations and optimizing data loads into Snowflake.
Designed and implemented high-performance ETL transformations and rigorous validation processes using Spark SQL and DataFrames within Azure Databricks and Azure Data Factory.
Collaborative team player, proactively partnering with Azure Logic Apps administrators and DevOps engineers for monitoring and swift resolution of automation and data pipeline issues.
Proficient in writing and maintaining YAML configuration files for data pipeline orchestration.
Optimized Azure Functions to efficiently extract, transform, and load data from multiple sources, improving ETL performance by 60%.
Designed and deployed an enterprise data catalog, significantly improving data discoverability, organization-wide governance, and data accessibility.
Produced comprehensive end-to-end data lineage documentation, ensuring transparency and adherence to stringent data governance standards.
Expert in optimizing query performance and refining index schemas, significantly enhancing data retrieval efficiency and extract throughput.
Engineered and maintained data lineage visualizations, giving stakeholders a clear view of data flow for informed decision-making.
Proficient in Apache NiFi for designing, developing, and managing data integration and ETL workflows.
Designed and optimized SQL queries with PARTITION BY and ORDER BY clauses for precise partitioning and ordering within window functions (see the sketch following the Education section).
Extensive experience in implementing and managing data workflows using Control-M for job scheduling and automation.
Used data catalog tools to classify, manage, and secure enterprise data assets, improving data quality and compliance standards.
Highly proficient in executing Hive scripts using Hive on Spark and Spark SQL for diverse data processing requirements.
Extensive expertise in the design, development, and upkeep of data integration solutions within Hadoop and RDBMS ecosystems.
Proficient in data warehousing methodologies, encompassing data cleansing, management of Slowly Changing Dimensions, surrogate key allocation, and change data capture techniques for Snowflake modeling.
Proficient in developing and maintaining ETL/ELT workflows using technologies such as Apache Spark, Apache Beam, and Apache Airflow for efficient data extraction, transformation, and loading.
Proficient in enforcing data quality control and cleansing methodologies to maintain data accuracy and integrity across the pipeline.
Collaborated with cross-functional team members to ensure data integrity and build stable data pipelines while actively contributing to ETL tasks, strengthening data reliability and streamlining data processing.
Exposure to Big Data ecosystems including Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, YARN, Oozie, and ZooKeeper.
Designed and implemented scalable HDFS architectures to optimize the storage and processing of multi-petabyte datasets, ensuring data redundancy and high availability across distributed computing environments.
Skilled in writing SQL queries, including DDL, DML, and various database objects, for data manipulation and retrieval.
Used IDEs and editors such as Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
Expertise in scripting languages such as Python and Scala for efficient data processing, manipulation, and analysis, contributing to streamlined data workflows and data-driven decision-making.
Proficient in Git for version control, ensuring efficient source code management across multiple concurrent projects and enhancing team collaboration and code quality.
Experienced in using JIRA for project reporting and task management, ensuring efficient project execution within Agile methodologies.

Skills
Hadoop/Big Data: HDFS, MapReduce, HBase, Hive, Cloudera, Sqoop
Programming Languages: SQL, Python, PySpark
Databases: RDBMS (MySQL, DB2, MS SQL Server, PostgreSQL), NoSQL (HBase, Cassandra), Snowflake virtual warehouse, Azure SQL Warehouse, Azure SQL DB
Python Libraries: NumPy, matplotlib, pandas, scikit-learn
Cloud Technologies: Amazon Web Services (AWS), Microsoft Azure
Azure Cloud: ADFv2, ADLS, Azure SQL DB, SQL Server, Azure Synapse, Azure Analysis Services, Databricks, Mapping Data Flow (MDF), Azure Data Lake (Gen1/Gen2)
AWS Services: S3, EC2, EMR, Lambda, Redshift, Data Pipeline, Athena, AWS Glue, S3 Glacier, CloudFormation
Reporting Tools: JIRA, MS Excel, Tableau, Power BI, SSRS
Tools: Management Studio, Jenkins, Eclipse, Maven, GitHub
Methodologies: Agile, Scrum, Waterfall

Education
MS, Data Engineering - University of North Texas, Denton, USA
B.Tech, Electronics & Communication Engineering - Koneru Lakshmaiah University, Vijayawada, India
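The following is a minimal PySpark/Spark SQL sketch of the window-function pattern referenced in the summary (PARTITION BY and ORDER BY driving a moving average and a cumulative sum). It is illustrative only; the table and column names (daily_sales, region, sale_date, amount) are hypothetical placeholders, not taken from any project above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("window-function-sketch").getOrCreate()

    # Hypothetical input: one row per region per day with a numeric amount.
    spark.createDataFrame(
        [("east", "2024-01-01", 100.0), ("east", "2024-01-02", 150.0),
         ("west", "2024-01-01", 80.0),  ("west", "2024-01-02", 120.0)],
        ["region", "sale_date", "amount"],
    ).createOrReplaceTempView("daily_sales")

    # PARTITION BY groups rows per region; ORDER BY defines the running window.
    result = spark.sql("""
        SELECT
            region,
            sale_date,
            amount,
            AVG(amount) OVER (
                PARTITION BY region ORDER BY sale_date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ) AS moving_avg_7d,
            SUM(amount) OVER (
                PARTITION BY region ORDER BY sale_date
            ) AS cumulative_sum
        FROM daily_sales
    """)
    result.show()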
Work Experience

Client: Southern Co, Atlanta, Georgia, USA
Role: Azure Data Engineer
January 2023 - Present

Executed data transformation in Azure Data Lake Storage Gen2 using the medallion architecture, streamlining workflows, elevating data quality, and supporting efficient decision-making in the Azure cloud environment.
Orchestrated data processing with Azure Databricks and PySpark, focusing on optimization, security, and scalability to improve data efficiency, reliability, and compliance across the organization's cloud infrastructure.
Integrated on-premises (MySQL, Cassandra) and cloud-based (Azure Blob Storage, Azure SQL Database) data using Azure Data Factory, applying transformations and loading data into Snowflake.
Architected and executed a robust Lambda architecture on Azure, enabling both real-time and batch data processing and ensuring performance and scalability for mission-critical business functions.
Implemented automated data catalog processes, drastically reducing manual effort and ensuring real-time metadata accuracy across all data systems.
Integrated the data catalog with existing data lakes and warehouses, enabling unified, streamlined data access and analysis.
Architected and implemented a real-time data pipeline on Azure, leveraging Azure Event Hubs for efficient streaming data ingestion.
Developed complex SQL queries with window functions to calculate moving averages and cumulative sums.
Engineered metadata management strategies with data catalogs, enabling efficient data lineage, impact analysis, and reporting, and strengthening data governance and decision-making.
Automated critical data workflows using Azure Functions, Logic Apps, and pipelines, creating a scalable, reliable framework that streamlined tasks, reduced errors, and cut manual work.
Created 100+ ETL transformations and validations using Spark SQL, processing 5 TB of data weekly in Azure Databricks (see the sketch at the end of this section).
Implemented robust security and monitoring solutions and optimized cost management using Azure services including Azure Key Vault, Azure Security Center, Azure Monitor, and Azure Cost Management.
Orchestrated data routing and flow control with NiFi to ensure reliable, real-time data processing.
Collaborated with Azure Logic Apps administrators, resolving 150+ issues annually and enhancing data pipeline efficiency by 25%.
Utilized Databricks and Spark to transform and prepare data for complex analytical workloads, ensuring high-performance data processing.
Implemented data quality checks, reducing data inconsistencies by 40% and ensuring 99.9% accuracy across all pipeline stages.
Enhanced data pipeline performance through optimized query strategies and indexing techniques.
Orchestrated the movement of over 5 TB of data into SQL databases monthly, reducing data transfer times by 25% using Data Factory pipelines.
Engineered scalable data pipelines leveraging Airflow, Kafka, Spark, and Snowflake, optimized for performance, reliability, and real-time analytics.
Created custom processors and controllers in NiFi to meet specific data integration requirements.
Documented and maintained YAML templates for containerized applications and microservices.
Collaborated with data scientists and analysts to rigorously validate data lineage, ensuring accuracy in analytics and reporting.
Implemented Hive scripts with Hive on Spark and Spark SQL to address diverse data processing needs, optimizing data operations and enhancing analytical capabilities.
Collaborated on ETL tasks, prioritizing data integrity and pipeline reliability, contributing to efficient data management and a steady flow of information for decision-making.
Leveraged GitHub for repository hosting, code reviews, pull requests, and collaboration across geographically dispersed development teams.
Used Spark Core and Spark SQL scripts with Scala to enhance data processing capabilities, optimizing data workflows and accelerating insight extraction.
Utilized JIRA for project reporting, creating subtasks for development, QA, and partner validation.
Collaborated closely with product owners and stakeholders to define user stories and acceptance criteria, ensuring clear project objectives and customer-centric outcomes.
Environment: Azure Data Lake Storage Gen2, Azure Data Factory, Airflow, Azure Databricks, Azure Blob Storage, Azure File Storage, Azure Machine Learning, Azure SQL DB, Azure App Service, Azure IoT, Azure Synapse, Azure Kubernetes Service, Azure Active Directory, Azure Logic Apps, Azure Function Apps, Azure Monitor, Azure Service Bus, Azure Event Hubs, Azure DevOps, Snowflake, Python, Scala, PySpark, Spark performance tuning, data integration, data modeling, shell scripting, Git, JIRA, Power BI.
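A minimal PySpark sketch of the kind of Spark SQL transformation and data-quality validation referenced above for the Databricks medallion layers. The storage paths, schema, and the 99% pass threshold are hypothetical placeholders, not taken from the engagement; the Delta format assumes a Databricks runtime or the delta-lake package.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("bronze-to-silver-sketch").getOrCreate()

    # Hypothetical bronze-layer input in ADLS Gen2 (abfss path is a placeholder).
    bronze = spark.read.format("delta").load("abfss://bronze@<storage>.dfs.core.windows.net/orders")

    # Transformation: standardize types, derive columns, drop obviously bad rows.
    silver = (
        bronze
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("order_id").isNotNull() & (F.col("amount") >= 0))
        .dropDuplicates(["order_id"])
    )

    # Data-quality check: fail the job if too many rows were rejected.
    total, kept = bronze.count(), silver.count()
    if total > 0 and kept / total < 0.99:
        raise ValueError(f"Data-quality check failed: only {kept}/{total} rows passed validation")

    silver.write.format("delta").mode("overwrite").save(
        "abfss://silver@<storage>.dfs.core.windows.net/orders"
    )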
Client: Warner Media, Los Angeles, CA
Role: Azure Data Engineer
July 2020 - January 2023

Led and executed comprehensive data migration strategies, efficiently transferring mission-critical databases to Azure Data Lake and improving data accessibility and scalability.
Enhanced Spark performance by optimizing data processing algorithms and leveraging techniques such as partitioning, caching, and broadcast variables (see the sketch at the end of this section).
Orchestrated the ingestion of 3 billion records into a suite of Azure services, leveraging Azure Databricks to achieve a 40% improvement in data processing speed.
Pioneered the migration of on-premises Oracle ETL processes to Azure Synapse Analytics, substantially improving data processing efficiency and cost-effectiveness.
Directed the migration of SQL databases to Azure Data Lake and SQL Data Warehouse, employing Azure Data Factory to ensure optimal data transfer and storage.
Conducted ETL operations with Azure Databricks and led the migration from on-premises Oracle ETL to Azure Synapse Analytics, optimizing system performance and data management efficiency while achieving scalability and cost reduction.
Engineered and implemented high-performance ETL pipelines within Azure Databricks, processing over 3 terabytes of data daily with 99.99% data integrity.
Drove the utilization of Microsoft Azure services, including Azure HDInsight clusters, Azure Data Factory, Azure Blob Storage, Azure Databricks, Azure Function Apps, Azure Security, and Azure Logic Apps, to advance project initiatives and maximize cloud capabilities.
Implemented job dependencies and triggers in Control-M to optimize data workflow execution.
Designed and executed state-specific data partitioning and bucketing strategies, using bucket-based Hive joins for a significant boost in query performance and data retrieval efficiency, enhancing analytics capabilities.
Assessed and championed enhancements to the data catalog system, keeping it aligned with evolving business demands and driving continuous improvement.
Improved troubleshooting by using data lineage to rapidly pinpoint and resolve issues, increasing operational efficiency.
Implemented YAML-based testing strategies to validate data pipeline configurations and ensure data integrity.
Collaborated closely with data governance teams to set standards and policies within the data catalog, ensuring data consistency and integrity and strengthening data management practices.
Utilized Spark Streaming to handle real-time data streams, processing over 1 million events per day and improving batch processing efficiency by 40%.
Used JIRA for project workflow management and issue tracking, enabling agile, responsive project execution and improving project management efficiency.
Established Git as the standard tool for version control, preserving repository integrity, ensuring traceability of code changes, and supporting collaboration among development teams.
Environment: Azure HDInsight, Azure Event Hubs, Azure Data Factory, Azure Data Lake Storage Gen2, Azure Security, Azure Databricks, Azure Logic Apps, Azure Function Apps, Snowflake, MS SQL, Oracle, Spark, SQL, Python, Scala, PySpark, Spark performance tuning, data integration, data modeling, shell scripting, Git, JIRA, Jenkins, Kafka, ADF pipelines, Power BI.
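A minimal PySpark sketch of the partitioning, caching, and broadcast techniques mentioned above; the input paths, column names, and partition count are hypothetical placeholders.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()

    # Hypothetical large fact table and small dimension table.
    events = spark.read.parquet("/data/events")        # large
    regions = spark.read.parquet("/data/regions")      # small lookup

    # Partitioning: repartition the large table on the join key to balance work.
    events = events.repartition(200, "region_id")

    # Caching: reuse the repartitioned DataFrame across multiple actions.
    events.cache()

    # Broadcast join: ship the small dimension table to every executor
    # instead of shuffling the large table.
    enriched = events.join(broadcast(regions), on="region_id", how="left")

    daily_counts = enriched.groupBy("region_name", F.to_date("event_ts").alias("day")).count()
    daily_counts.show()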
PwC, Los Angeles, CA
Role: Data Engineer
January 2019 - June 2020

Architected and deployed an enterprise-grade data lake, facilitating advanced analytics, processing, storage, and reporting on high-velocity, large-scale datasets.
Ensured the integrity and quality of reference data within source systems through meticulous cleaning and transformation, working in close collaboration with stakeholders and the solution architect.
Crafted and implemented tabular models on Azure Analysis Services, aligning directly with business intelligence and reporting needs.
Utilized Control-M's monitoring and alerting features to proactively identify and address job failures and performance issues.
Directed data ingestion operations across Azure services for a data volume exceeding 200 TB, optimizing processing workflows within Azure Databricks and achieving a 25% improvement in data processing time.
Designed and executed complex data pipelines and transformations using ADF and PySpark within Databricks to meet intricate data flow requirements.
Managed and optimized data storage and loading procedures for Azure Blob Storage and Azure Data Lake storage, ensuring seamless integration with Azure Synapse Analytics.
Conducted comprehensive data lineage and impact analysis using automated tools, ensuring transparency and trust in data transformations and usage.
Developed resilient Python, Spark, and Bash scripts to optimize the transformation and loading of data across complex hybrid environments.
Leveraged Apache Spark's SQL and Streaming modules to support intraday and real-time data processing operations.
Optimized data processing efficiency using Spark SQL in Scala and Python, converting RDDs to DataFrames to accelerate analysis and improve processing speed and reliability (see the sketch at the end of this section).
Implemented automated data validation and cleansing using Apache Spark and Python, enhancing data quality and integrity and reducing errors and inconsistencies.
Applied performance optimization strategies, achieving a 60% improvement in query execution time through distributed caching, data partitioning, and Hive bucketing techniques.
Led the creation of a scalable ETL framework, streamlining future migrations from RDBMS systems to data lakes by leveraging Spark data sources and Hive objects.
Used NoSQL databases such as MongoDB and Cassandra for high-velocity transactions and unstructured data, optimizing scalability and performance and enabling flexible data models for diverse business applications.
Led performance tuning for big data applications, enhancing processing speed and reducing latency to enable real-time analysis and quicker decision-making.
Orchestrated extensive database imports and exports using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS), streamlining data integration pipelines and improving data availability and reliability across the enterprise.
Environment: Azure Analysis Services, Azure Data Factory, Azure Databricks, PySpark, Python, Apache Spark, MongoDB, HBase, MySQL, Hive, Sqoop, Snowflake, SSRS, Tableau.
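A minimal PySpark sketch of the RDD-to-DataFrame conversion pattern mentioned above; the sample records and column names are hypothetical.

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("rdd-to-dataframe-sketch").getOrCreate()

    # Hypothetical raw records arriving as an RDD of delimited strings.
    raw_rdd = spark.sparkContext.parallelize([
        "1001,electronics,250.0",
        "1002,grocery,42.5",
        "1003,electronics,99.0",
    ])

    # Convert each line into a Row, then promote the RDD to a DataFrame so
    # Spark SQL's optimizer can plan the aggregation.
    rows = raw_rdd.map(lambda line: line.split(",")).map(
        lambda parts: Row(order_id=int(parts[0]), category=parts[1], amount=float(parts[2]))
    )
    orders_df = spark.createDataFrame(rows)

    orders_df.createOrReplaceTempView("orders")
    spark.sql("""
        SELECT category, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders
        GROUP BY category
    """).show()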
SunTrust Bank, Atlanta, Georgia
Role: AWS Data Engineer
August 2016 - January 2019

Used Agile software development methodology for defining the problem, gathering requirements, running development iterations, business modeling, and communicating with the technical team during system development.
Worked with project team members to deliver data models that met data requirements and to create documentation and deliverables following enterprise guidelines and standards.
Worked on metadata transfer among various proprietary systems.
Designed and implemented scalable AWS infrastructure utilizing services such as EC2, S3, and Lambda for efficient data processing and storage.
Implemented dbt modeling in a large-scale data warehouse migration project, transforming legacy ETL processes into modern, modular dbt models, improving data quality and reducing processing time by 30%.
Built self-service data pipelines using AWS services such as SNS, Step Functions, Lambda, Glue, EMR, EC2, Athena, SageMaker, QuickSight, and Redshift.
Designed and implemented a robust data pipeline using Snowflake, enabling efficient processing and analysis of large-scale datasets for actionable insights.
Moved large amounts of data from AWS S3 buckets to AWS Redshift using Glue and EMR.
Implemented data lakes on Amazon Web Services (AWS), leveraging Amazon S3 and AWS Glue for data ingestion, storage, and processing.
Analyzed large and critical datasets using EMR, Glue, and Spark.
Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline backbone (see the sketch at the end of this section).
Developed ETL pipelines in and out of data warehouses using Python and AWS Glue.
Integrated AWS Redshift with other AWS services such as S3, Glue, and Lambda to streamline data workflows and automate data processing tasks.
Architected and managed AWS Redshift clusters, optimizing performance and reducing costs for large-scale data warehousing solutions.
Worked on migrating SQL scripts between Redshift and Athena.
Implemented a real-time data processing and analytics pipeline using Python, HBase, and MongoDB, ensuring efficient data storage, retrieval, and analysis for large-scale datasets.
Implemented Spark using Python and Spark SQL for faster testing and processing of data.
Consumed data from Kafka using Apache Spark.
Utilized Databricks on AWS to orchestrate ETL workflows, perform data transformations, and optimize data pipelines, driving data-driven decision-making and business insights.
Built CI/CD pipelines for testing and production environments using Terraform.
Proficient with container systems such as Docker and container orchestration platforms such as Amazon ECS and Kubernetes; worked with Terraform.
Worked on proofs of concept with Apache Spark using Python to introduce Spark into the project.
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Environment: Hadoop, AWS EMR, EC2, S3, Athena, Glue, dbt, Apache Spark, Airflow, Docker, PySpark, SQL, Spark SQL, Python (OOP), Kafka, HBase, Hive, Pig, UNIX, shell scripting, Tableau, Git, Jenkins, JIRA.
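A minimal PySpark Structured Streaming sketch of consuming Kafka events as mentioned above; the broker address, topic name, and checkpoint path are hypothetical placeholders, and the Kafka source requires the spark-sql-kafka connector package on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    # Read a stream of events from a hypothetical Kafka topic.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast the value to a string for parsing.
    parsed = events.select(F.col("value").cast("string").alias("payload"),
                           F.col("timestamp"))

    # Count events per one-minute window and write to the console for illustration.
    counts = parsed.groupBy(F.window("timestamp", "1 minute")).count()

    query = (
        counts.writeStream.outputMode("complete")
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/transactions")
        .start()
    )
    query.awaitTermination()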
Nationwide, San Jose, CA
Role: Associate Data Engineer
August 2014 - July 2016

Used AWS services such as SNS, Step Functions, Lambda, Redshift, Glue, EMR, EC2, and Athena.
Utilized NumPy and pandas to efficiently process and analyze large datasets, enabling streamlined data engineering workflows (see the sketch at the end of this section).
Cleaned and processed third-party spending data into manageable deliverables in specific formats using Excel macros and Python libraries.
Gathered business requirements by organizing and managing regularly scheduled meetings with business stakeholders and IT analysts.
Analyzed business requirements by dividing them into subject areas and mapping the data flow.
Created and updated views using Python view controllers and template language to add new functionality to the website.
Built the development environment using bug-tracking tools such as JIRA and Confluence, and version control systems such as Git and SVN.
Designed, implemented, deployed, and maintained cloud solutions and their underlying technologies.
Developed a real-time data engineering project using Python and Apache Kafka to process high-volume data streams for real-time analytics.
Wrote Spark applications in Python.
Implemented Spark using Python and Spark SQL for faster testing, and used the Spark SQL API for faster data processing.
Environment: Hadoop, AWS (SNS, Step Functions, Lambda, Redshift, Glue, EMR, EC2, Athena), Apache Spark, Docker, Kubernetes, PySpark, Spark SQL, Python, SQL, Git.
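A minimal pandas/NumPy sketch of the kind of spending-data cleanup described above; the file name, columns, and outlier rule are hypothetical placeholders.

    import numpy as np
    import pandas as pd

    # Hypothetical third-party spending extract.
    spend = pd.read_csv("third_party_spend.csv")

    # Standardize column names and types.
    spend.columns = [c.strip().lower().replace(" ", "_") for c in spend.columns]
    spend["transaction_date"] = pd.to_datetime(spend["transaction_date"], errors="coerce")
    spend["amount"] = pd.to_numeric(spend["amount"], errors="coerce")

    # Drop rows with unusable dates or amounts, and flag outliers with NumPy.
    spend = spend.dropna(subset=["transaction_date", "amount"])
    threshold = spend["amount"].mean() + 3 * spend["amount"].std()
    spend["is_outlier"] = np.where(spend["amount"] > threshold, True, False)

    # Summarize spend by vendor and month for the downstream deliverable.
    summary = (
        spend.groupby(["vendor", spend["transaction_date"].dt.to_period("M")])["amount"]
        .sum()
        .reset_index(name="monthly_spend")
    )
    summary.to_csv("spend_summary.csv", index=False)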