Sumanth - Senior Data Engineer
[email protected]
Location: Dallas, Texas, USA
Relocation: Yes
Visa:

Sumanth Sirimalla
Senior Data Engineer
Email: [email protected] | Phone: +1 (403) 252-3936 | LinkedIn: linkedin.com/in/sumanth-siri/

Professional Summary:
Senior Data Engineer with over 9 years of experience in Big Data, cloud technologies (AWS, Azure, GCP), data engineering, ETL, Snowflake, and data warehousing. Expertise in designing and optimizing scalable, high-performance data architectures and enabling real-time analytics for business-critical solutions.
- Big Data Ecosystem & Distributed Processing: Designed and implemented scalable data pipelines using Hadoop, Spark, Hive, HDFS, Kafka, Airflow, and PySpark for real-time and batch processing workloads (a minimal example follows this summary).
- Cloud Platform Expertise: Delivered robust data engineering solutions on AWS, Azure, and GCP, leveraging services such as EMR, S3, Lambda, BigQuery, Databricks, and Redshift for cloud-native analytics.
- ETL & Data Warehousing: Engineered complex ETL and ELT pipelines using dbt, Informatica, Talend, and Snowflake, ensuring high data quality, integrity, and performance.
- Streaming & Real-Time Analytics: Built low-latency streaming architectures using Apache Flink, Kafka Streams, and AWS Kinesis to support real-time, business-critical analytics.
- Data Modeling & Integration: Developed logical and physical data models and integrated data from diverse sources, including AWS S3, Redshift, and Teradata, using PySpark and Scala.
- Programming & Scripting Languages: Proficient in Python, Scala, and SQL, with hands-on experience processing structured and unstructured data in formats such as Parquet, JSON, and text.
- Infrastructure as Code & DevOps: Automated deployments using Terraform and CloudFormation, and implemented CI/CD pipelines with Jenkins, GitHub Actions, and GitLab CI/CD.
- Containerization & Orchestration: Managed scalable, containerized applications using Docker and Kubernetes to support microservices-based architectures and data workflows.
- Workflow Orchestration & Scheduling: Designed and maintained ETL workflows in Apache Airflow and Oozie, using operators and DAGs for dependency management and job orchestration.
- Monitoring & Performance Tuning: Reduced ETL job failures by 50% through robust monitoring, debugging, and tuning strategies, leveraging tools such as CloudWatch and Splunk.
- Serverless Computing: Designed and deployed serverless data ingestion and transformation solutions using AWS Lambda and S3 for event-driven architectures.
- Data Lakes & Delta Architecture: Built and optimized Delta Lake architectures on Databricks to support scalable, ACID-compliant data lakes and advanced analytics use cases.
- Machine Learning & MLOps: Enabled data-driven decisions by developing ML pipelines and deploying models using TensorFlow, MLflow, and Vertex AI in production environments.
- Database & NoSQL Expertise: Worked extensively with relational and NoSQL databases, including Teradata, MongoDB, DynamoDB, Cassandra, and HBase, ensuring performance and scalability.
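A minimal PySpark sketch of the kind of batch cleansing pipeline summarized above; the bucket, prefixes, and column names (order_id, amount, order_ts) are hypothetical placeholders, not a specific production job:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-batch-clean").getOrCreate()

    # Read raw Parquet landed in S3 (placeholder bucket and prefix).
    raw = spark.read.parquet("s3a://example-bucket/raw/orders/")

    # Deduplicate, filter bad records, and derive a partition column.
    clean = (
        raw.dropDuplicates(["order_id"])              # assumed key column
           .filter(F.col("amount") > 0)
           .withColumn("order_date", F.to_date("order_ts"))
    )

    # Write analytics-ready, partitioned output.
    (clean.write.mode("overwrite")
          .partitionBy("order_date")
          .parquet("s3a://example-bucket/curated/orders/"))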
Technical Skills:
Programming Languages: Python, SQL, PL/SQL, shell scripting, Java, Scala, Unix
Big Data Tools: Hadoop, Apache Spark, MapReduce, PySpark, Hive, YARN, Kafka, Flume, Oozie, Airflow, ZooKeeper, Sqoop, HBase, Flink, HDFS, dbt, Delta Lake
Cloud Services: AWS (Glue, S3, Redshift, EC2, EMR, DynamoDB, Data Lake, Lambda, CloudWatch), Azure (Data Factory, Data Lake Storage, Synapse Analytics, Analysis Services, HDInsight, SQL Data Warehouse), GCP (BigQuery, Dataflow, Cloud Composer, Vertex AI)
ETL/Data Warehouse Tools: Snowflake, BigQuery, Redshift, Talend, Informatica, dbt, Apache NiFi
Version Control, CI/CD & Infrastructure Tools: SVN, Git, Bitbucket, Jenkins, GitHub Actions, GitLab CI/CD, Docker, Kubernetes, Terraform, CloudFormation
Databases: Oracle, MySQL, SQL Server, DB2, PostgreSQL, MongoDB, NoSQL (HBase, DynamoDB, Cassandra)
BI, Visualization & Machine Learning: Power BI, Tableau, Looker, Superset, Matplotlib, Plotly, TensorFlow, MLflow, Vertex AI, AutoML, Spark ML
Operating Systems & Methodologies: Ubuntu, Windows, macOS; Agile and Waterfall

Professional Experience:

Client: Verizon, Plano, TX | May 2023 - Present
Role: Azure Data Engineer
Responsibilities:
- Azure Data Engineering: Delivered scalable BI solutions on Azure using Data Lake, Data Factory, Stream Analytics, Azure SQL DW, HDInsight, and Databricks, supporting enterprise-level analytics.
- Cloud-Native Pipelines: Designed and implemented data pipelines using Azure Databricks, Snowflake, and Delta Lake to enable fast, efficient data processing at scale.
- SQL Server Development & Optimization: Managed Microsoft SQL Server 2016/2019 databases, optimized performance, and ensured data integrity for large healthcare datasets.
- ETL & Automation: Built and maintained robust ETL pipelines using SSIS and PowerShell, automating data ingestion and transformation tasks to streamline operations.
- Database Migration & Performance Tuning: Led SQL Server migrations to newer versions, reducing processing time by 40% and improving scalability and maintainability.
- Disaster Recovery & Security: Implemented backup, disaster recovery, and HIPAA-compliant security policies to ensure business continuity and protect sensitive healthcare data.
- Stored Procedure & Query Optimization: Developed efficient stored procedures, views, and indexes to improve database performance and enhance analytical capabilities.
- Real-Time Data Processing: Built real-time data workflows using Apache Flink, Kafka, and AWS Kinesis to support up-to-the-minute business reporting and analytics.
- Workflow Orchestration: Automated ETL pipelines using Apache Airflow and dbt, increasing efficiency and reducing manual intervention in data transformation.
- Azure Data Integration: Ingested and transformed data from multiple sources into Azure SQL Data Warehouse using ADF, maintaining storage in Azure Data Lake.
- Spark & Streaming: Configured Spark Streaming with Kafka to ingest real-time data and persist it in HDFS, supporting real-time insights and batch processing (see the sketch after this section).
- Databricks & Data Transformation: Developed Azure Databricks notebooks for data cleansing and business-logic transformations, enabling high-quality, ready-to-use datasets.
- Big Data Processing: Used Hive SQL and PySpark to create optimized tables with partitioning and clustering; converted Hive/SQL logic into Spark transformations using RDDs.
- Data Ingestion & Migration: Migrated data into Hadoop from Oracle and MySQL using Sqoop with incremental loads; created DAGs in Airflow to orchestrate ETL workflows.
Environment: Python, Hadoop, Spark, Spark SQL, SQL Server 2016/2019, Spark Streaming, PySpark, Hive, Scala, MapReduce, HDFS, Kafka, Sqoop, HBase, MS Azure, Blob Storage, Data Factory, Databricks, SQL Data Warehouse, Apache Airflow, Snowflake, Oracle, MySQL, UNIX shell scripting, Perl, PowerShell, SSIS, Power BI, Vertex AI, HIPAA, FHIR, HL7
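A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion described in this role; the broker, topic, event schema, and paths are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Hypothetical event schema; the real topics and fields differ.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_ts", TimestampType()),
        StructField("payload", StringType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "events")                      # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # Persist micro-batches to HDFS as Parquet with checkpointing.
    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/events")              # placeholder path
        .option("checkpointLocation", "hdfs:///checkpoints/events")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()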
Client: Capital One, Atlanta, GA | Sep 2020 - Feb 2023
Role: GCP Data Engineer
Responsibilities:
- Oracle to BigQuery Migration & Reporting: Led the migration of an entire Oracle database to Google BigQuery and integrated Power BI for scalable, real-time reporting and dashboarding (see the load sketch after this section).
- ETL Pipeline Development with Airflow: Built and orchestrated end-to-end ETL pipelines in GCP using Apache Airflow, leveraging a range of operators for scheduling and automation.
- GCP Ecosystem Expertise: Worked extensively with GCP services including Dataproc, GCS, Cloud Functions, BigQuery, Pub/Sub, and the GCP client libraries to design and implement cloud-native solutions.
- On-Prem to Cloud Migration: Played a key role in migrating on-premises Hadoop systems to Google Cloud Platform, ensuring a seamless transition and improved data accessibility.
- Data Analysis & Integration: Analyzed data from diverse domains and integrated it into the enterprise Data Marketplace to support cross-functional analytics and decision-making.
- PySpark Development: Developed PySpark scripts to create DataFrames, perform transformations, merge static and dynamic data files, and clean datasets for analytics.
- Multi-Cloud Experience (GCP & AWS): Worked across cloud platforms using GCP services such as Dataproc, BigQuery, and Cloud Storage, and AWS services such as EMR, S3, Glacier, and EC2.
- Azure & Hybrid Cloud Tools: Gained experience with Azure services such as Data Lake, Synapse, Data Factory, and Databricks while developing multi-cloud strategies focused on GCP's PaaS capabilities.
- CI/CD & DevOps: Implemented CI/CD pipelines using Jenkins, Maven, Ant, and GCP tools, including designing Jenkins jobs and automating deployments with Docker and Kubernetes.
- Cloud Shell & SDK Usage: Used the GCP Cloud Shell and Cloud SDK to configure and manage services such as Dataproc, BigQuery, and Cloud Storage for efficient data engineering workflows.
- Documentation & Cross-Team Coordination: Authored functional specifications and technical designs, and collaborated with Cloud Security, IAM, Platform, and Network teams to complete intake and accreditation processes.
- Cloud Machine Learning Pipelines: Leveraged GCP compute and GPU resources to design automated machine learning and analytics pipelines, enhancing model performance and operational scalability.
Environment: GCP, PySpark, Dataproc, BigQuery, Hadoop, Hive, GCS, Python, Snowflake, DynamoDB, Oracle Database, Power BI, Cloud SDK, Dataflow, Glacier, EC2, EMR cluster, SQL Database, Synapse, Databricks
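A minimal sketch of the kind of GCS-to-BigQuery load used in migration work like the above, via the google-cloud-bigquery client library; the project, bucket, and table names are hypothetical placeholders:

    from google.cloud import bigquery

    # Hypothetical project, bucket, and table names.
    client = bigquery.Client(project="example-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Load Parquet files staged in GCS (e.g., exported from Oracle)
    # directly into a BigQuery table.
    load_job = client.load_table_from_uri(
        "gs://example-bucket/exports/orders/*.parquet",
        "example-project.analytics.orders",
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes

    table = client.get_table("example-project.analytics.orders")
    print(f"Table now has {table.num_rows} rows")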
Client: Freddie Mac, Chicago, IL | Mar 2018 - Aug 2020
Role: AWS Data Engineer
Responsibilities:
- Requirements Analysis & Estimation: Participated in gathering system requirements and provided estimates for development and testing efforts, supporting enterprise-wide reporting initiatives.
- Cloud Migration & AWS Ecosystem: Migrated on-prem applications to AWS using EC2 and S3; managed Hadoop clusters via AWS EMR for distributed processing of large data sets.
- ETL with AWS Glue & Redshift: Designed and built ETL pipelines in AWS Glue to ingest and transform JSON, Parquet, and text data from S3 into AWS Redshift for analytics.
- Real-Time Data Processing: Developed real-time data workflows using Spark Streaming, Kafka, and Flume; configured pipelines to process and store data in HDFS.
- Data Transformation & Cleansing: Performed advanced data transformation and cleansing operations using Spark and PySpark before loading data into Snowflake and other targets.
- Apache Airflow Orchestration: Built, debugged, and scheduled ETL jobs using Apache Airflow; defined DAGs and dependencies with Python and AWS Lambda for batch processing pipelines (see the DAG sketch after this section).
- Monitoring & Logging: Configured monitoring and alerting with AWS CloudWatch and implemented logging for ETL pipelines to ensure system visibility and performance tracking.
- DataStage ETL Development: Created complex ETL jobs in IBM DataStage using stages such as Transformer, Lookup, Aggregator, Merge, and CDC to support large-scale enterprise transformations.
- Data Analysis in Databricks: Developed Spark SQL applications in Databricks to extract, transform, and analyze data across formats, uncovering customer usage patterns.
- Query Optimization & Data Engineering: Wrote optimized Spark and Hive queries using window functions and fine-tuned Hadoop shuffle settings for better performance.
- Cluster Management & Automation: Estimated Databricks Spark cluster sizes, performed troubleshooting, and automated data loads using UNIX shell scripting.
- DevOps & Monitoring Tools: Implemented containerized monitoring solutions using Docker and Jenkins, supporting CI/CD processes and infrastructure observability.
- Data Visualization & Dashboards: Designed and built interactive dashboards and visual reports in Tableau, enabling real-time business insights for decision-makers.
- Agile Development & Testing: Contributed to Agile/Scrum processes, conducting iterative testing and development to support continuous delivery of data solutions.
Environment: Python, Hive, Spark, AWS EC2, S3, AWS EMR, AWS Glue, HDFS, Spark Streaming, Kafka, Flume, JSON, Parquet, text files, AWS Redshift, DataStage, Airflow, Snowflake, ETL pipelines, Lambda, PySpark, CloudWatch, Spark SQL, UNIX shell, Agile, Tableau, Docker, Jenkins
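A compact Apache Airflow DAG sketch of the batch orchestration described in this role; the DAG id, schedule, and task bodies are hypothetical placeholders rather than the production jobs, which invoked Glue, PySpark, and Lambda-backed steps:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder task bodies standing in for real extract/transform/load logic.
    def extract():
        print("pull source files from S3")

    def transform():
        print("run PySpark transformations")

    def load():
        print("copy curated data into Redshift")

    with DAG(
        dag_id="s3_to_redshift_etl",                 # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
        catchup=False,
    ) as dag:
        extract_t = PythonOperator(task_id="extract", python_callable=extract)
        transform_t = PythonOperator(task_id="transform", python_callable=transform)
        load_t = PythonOperator(task_id="load", python_callable=load)

        extract_t >> transform_t >> load_t           # linear dependency chain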
Client: Sapient, Hyderabad, India | May 2014 - June 2016
Role: Hadoop Engineer/Developer
Responsibilities:
- Data Lake Development & Management: Designed and developed scalable data lake solutions using Hadoop ecosystem tools to ingest, process, and transform large volumes of structured and unstructured data for analytics.
- HDFS Operations & Data Ingestion: Managed HDFS storage, including maintenance, data replication, and efficient loading of various data formats (CSV, TXT, fixed-width) into Hadoop clusters from multiple sources.
- ETL Development with Hive & SQL: Built and optimized ETL pipelines using Hive, writing complex SQL queries to extract, cleanse, and merge data from relational databases such as SQL Server.
- Log Data Ingestion with Flume: Developed Flume agents and configurations to capture and stream logs from various systems into HDFS for monitoring and analytics.
- HBase Implementation: Designed and implemented HBase tables to store and query massive volumes of semi-structured data, integrating them with Hive for downstream aggregation.
- Spark-Based Data Processing: Migrated legacy Hive/SQL scripts to Spark applications using RDDs and Scala, improving data processing performance and scalability.
- Data Transformation with Spark & Python: Used Apache Spark and Python for data cleaning, preprocessing, and modeling tasks to support data science and analytics initiatives.
- Metadata Management & Data Modeling: Conducted data model reviews and captured technical metadata using data modeling tools to support governance and consistency.
- Workflow Automation with Control-M: Automated and scheduled batch workflows using Control-M to ensure reliable, timely execution of ETL jobs across the Hadoop ecosystem.
Environment: Python, SQL, ETL, Hadoop, HDFS, Spark, Scala, Kafka, HBase, MySQL, Netezza, web services, shell scripting, Control-M

Education:
Bachelor of Engineering, SRM, Chennai (2010-2014)
Master's, Northeastern University, Boston, MA (2016-2018)

Academic Projects:
- Real-Time Student Performance Analytics: SQL Server, AWS S3, Java, XML, Android Studio
- Smart Campus Resource Management: Azure Data Factory, Azure SQL Database