
Mahesh - Data Engineer
[email protected]
Location: Cincinnati, Ohio, USA
Relocation: Yes
Visa: GC
PROFESSIONAL SUMMARY:
10+ years of experience as a Senior Data Engineer, with demonstrated expertise in designing and implementing robust data pipelines across GCP, AWS, Azure, and on-premises environments for ingesting, processing, and analyzing both real-time streaming and batch data.
Proficient in leveraging GCP services like Cloud Storage, Pub/Sub, Dataflow, Dataproc, BigQuery, Airflow, and AWS services like S3, Kinesis, EMR, Athena, Redshift, Lambda, as well as Azure services like Data Lake Storage, Event Hubs, Databricks, Synapse Analytics, and Data Factory for building scalable data solutions.
Skilled in developing real-time streaming pipelines using GCP's Dataflow with Apache Beam, AWS's Kinesis Data Analytics with Flink, and Azure's Data Factory with Apache Beam, enabling windowing and session processing for use cases like anomaly detection and predictive maintenance.
Implemented reliable and scalable message ingestion systems using GCP's Pub/Sub, AWS's Managed Streaming for Apache Kafka (MSK), and Azure's Event Hubs with Apache Kafka for seamless data delivery.
Designed and developed batch-processing ETL pipelines on GCP's Dataproc with PySpark, AWS's EMR with Spark, and Azure's Databricks with Spark, utilizing Hadoop components and Spark APIs for complex transformations.
Leveraged Apache Hive with SQL-like querying (HiveQL) on GCP's Dataproc and Azure's Synapse Analytics for processing and analyzing large datasets stored in cloud storage, integrating with Apache Spark.
Built cloud-based data warehouses on GCP's BigQuery, AWS's Redshift, Azure's Synapse Analytics, and Snowflake, optimizing for analytics with partitioning, clustering, and integrating with BI tools.
Implemented CDC solutions for capturing and streaming database changes in real-time using connectors for MySQL, PostgreSQL, Oracle, and AWS's Database Migration Service (DMS).
Enforced robust data governance through metadata management (GCP Data Catalog, AWS Glue Data Catalog, Azure Purview), encryption (GCP KMS, AWS KMS, Azure Key Vault, Snowflake), and access controls (GCP IAM, AWS IAM, Azure Active Directory, Snowflake RBAC).
Utilized IaC tools such as Terraform, AWS CloudFormation and CDK, and Azure Resource Manager templates for provisioning cloud resources.

TECHNICAL SKILLS:

Google Cloud Platform: GCS, Cloud Pub/Sub, Dataflow, Dataproc, BigQuery, Cloud Functions, Cloud Run, Dataprep, Cloud Composer, Data Catalog, KMS, IAM, Cloud Monitoring, Cloud Logging, Cloud Build.
AWS Cloud Platform: S3, Kinesis, EMR, Athena, Redshift, Lambda, Step Functions, DMS, Glue, SageMaker, CloudWatch, IAM, KMS, CloudFormation, CDK, EKS, MSK.
Azure Cloud Platform: Data Factory, Synapse Analytics, Azure Data Lake, Event Hubs, Databricks, Functions, Cosmos DB, Key Vault, Active Directory, Azure Monitor, Log Analytics, AKS.
Hadoop Core Services: HDFS, Apache Hive, HiveQL, Sqoop, Apache Kafka, Apache Flume, MapReduce, Spark, YARN.
Hadoop Distributions: Apache Hadoop, Spark, Flink, Beam, Cloudera, Hortonworks.
On-Premises: SAS, DB2, Teradata, Netezza, Oracle.
Databases: HBase, Spark-Redis, Cassandra, Oracle, MySQL, PostgreSQL, Teradata.
Data Services: Hive, Pig, Impala, Sqoop, Flume, Kafka.
Data Warehousing: BigQuery, Snowflake, Redshift, Azure Synapse Analytics.
Scheduling Tools: Apache Airflow, ZooKeeper, Oozie.
Monitoring Tools: Prometheus, Grafana, Cloud Monitoring, Cloud Logging, CloudWatch, Azure Monitor.
Data Visualization: Looker, Tableau, Data Studio, Power BI.
Cloud Computing Tools: AWS, Azure, GCP.
Programming Languages: Java 17, Python, Scala, SQL, PL/SQL, Pig Latin, HiveQL, Shell Scripting.
Operating Systems: UNIX, Windows, Linux.
Build Tools: Terraform, Docker, Kubernetes, Git, Cloud Build, Jenkins, Maven, Ant.
ETL Tools: Apache NiFi, Talend, Sqoop, SSIS, IBM DataStage, Robot Scheduler.
Development Tools: Eclipse, NetBeans, Microsoft SQL Studio, Toad.

PROFESSIONAL EXPERIENCE:
Client: Verizon, Incedo, TX December 2022 - Present
Senior Data Engineer
Technical Stack: GCP, GCS, Cloud Pub/Sub, Spark, Dataproc, BigQuery, Apache Airflow, MySQL, PostgreSQL, KMS, Prometheus, Terraform, Docker, Kubernetes, Git, NiFi, Flink, Apache Beam, TFX, Data Fusion.
Designed and implemented a robust data pipeline on GCP for ingesting, processing, and analyzing both real-time streaming and batch data from multiple sources, utilizing Cloud Storage buckets for raw data storage.
Leveraged Cloud Dataflow with Apache Beam's Python SDK for building streaming data pipelines, enabling real-time data processing and transformations using ParDo and windowing functions.
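Illustrative sketch of such a Dataflow/Beam streaming pipeline (the Pub/Sub subscription, payload fields, window size, and sink are hypothetical assumptions, not the actual pipeline):

```python
# Minimal sketch of a Beam streaming pipeline; the subscription name,
# payload fields, and 60-second window are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseEvent(beam.DoFn):
    """Decode a Pub/Sub message payload into a (device_id, value) pair."""

    def process(self, message):
        record = json.loads(message.decode("utf-8"))
        yield (record["device_id"], float(record["value"]))


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            | "ParseWithParDo" >> beam.ParDo(ParseEvent())
            | "FixedWindows" >> beam.WindowInto(beam.window.FixedWindows(60))
            | "SumPerDevice" >> beam.CombinePerKey(sum)
            # Placeholder sink; a real pipeline would write to BigQuery or GCS.
            | "LogResults" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()
```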
Implemented Cloud Pub/Sub and Kafka for reliable and scalable message ingestion and delivery using Pub/Sub subscriptions.
Developed batch-processing pipelines using Apache Spark on Cloud Dataproc, leveraging PySpark (the Python API for Apache Spark) and Hadoop components like HDFS and YARN for ETL and data transformations with Spark's RDD and DataFrame APIs.
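Illustrative PySpark batch-ETL sketch of this pattern (GCS paths and column names are hypothetical placeholders):

```python
# Minimal PySpark batch ETL on Dataproc; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

# Read raw JSON events landed in a Cloud Storage bucket.
raw = spark.read.json("gs://my-bucket/raw/events/dt=2023-01-01/")

# DataFrame API transformations: drop bad records and aggregate per device.
cleaned = raw.filter(F.col("device_id").isNotNull())
daily = (
    cleaned.groupBy("device_id")
    .agg(F.count("*").alias("event_count"), F.avg("value").alias("avg_value"))
)

# Write the curated output back to GCS in Parquet for downstream loads.
daily.write.mode("overwrite").parquet("gs://my-bucket/curated/daily_device_stats/")

spark.stop()
```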
Incorporated Apache Hive for processing and analyzing large datasets stored in HDFS, leveraging its SQL-like querying capabilities (HiveQL) and integrating it with other components like Apache Spark.
Employed BigQuery to create a cloud-based data warehouse, enabling efficient analytics and reporting with SQL queries and leveraging partitioning and clustering features.
Integrated Apache Airflow with DAGs and operators to orchestrate and schedule data pipelines, ensuring reliable and timely execution.
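Illustrative Airflow 2.x DAG sketch of this orchestration pattern (task commands and schedule are hypothetical placeholders):

```python
# Minimal Airflow DAG; the task commands and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Operator dependencies define the execution order.
    extract >> transform >> load
```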
Implemented change data capture (CDC) using MySQL and PostgreSQL connectors to capture and stream real-time database changes.
Enforced data governance and security measures, including data lineage tracking with Cloud Data Catalog's metadata management, data encryption with Cloud KMS's key management, and access controls with Cloud IAM's roles and permissions.
Implemented monitoring and observability using Prometheus for metrics collection and the ELK stack for log aggregation and analysis.
Utilized Terraform configuration files and providers for infrastructure as code (IaC) and provisioning GCP resources.
Deployed containerized applications using Docker containers and Kubernetes clusters for scalability and portability.
Integrated Git with branches, commits, and pull requests for version control and collaborative development.
Integrated Prometheus with exporters, service discovery, and alerting rules for monitoring containerized applications and collecting metrics.
Employed Cloud Bigtable as a NoSQL database with Column Families and Row Keys to store and serve large volumes of analytical data.
Leveraged Cloud Datastore's entities, kinds, and queries for storing and querying semi-structured and unstructured data.
Utilized TensorFlow Extended (TFX) with Apache Beam-based pipelines to build and deploy machine learning pipelines on GCP.

Client: Hewlett Packard Enterprise February 2019 - October 2021
Senior Data Engineer
Technical Stack: GCP, Cloud Functions, Cloud Pub/Sub, Cloud Dataflow, Apache Beam, Apache Kafka, Cloud Dataproc, Cloud Composer, BigQuery, Cloud Monitoring, Cloud Logging, Cloud IAM, Cloud KMS, Terraform, Git, Cloud Build.
Designed and implemented a serverless data ingestion pipeline using Cloud Functions and Cloud Run for collecting and processing real-time data streams from various devices and sensors.
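Illustrative sketch of a Pub/Sub-triggered Cloud Function for this kind of serverless ingestion (1st-gen background-function signature; the bucket name and payload fields are hypothetical):

```python
# Minimal Pub/Sub-triggered Cloud Function; bucket and fields are placeholders.
import base64
import json

from google.cloud import storage


def ingest_event(event, context):
    """Decode a Pub/Sub message from a device and land it in Cloud Storage."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    client = storage.Client()
    bucket = client.bucket("my-raw-landing-bucket")
    blob = bucket.blob(f"devices/{payload['device_id']}/{context.event_id}.json")
    blob.upload_from_string(json.dumps(payload), content_type="application/json")
```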
Utilized Cloud Pub/Sub for reliable and scalable message ingestion and delivery, with Cloud Dataflow for real-time stream processing and transformations using Apache Beam's Java SDK.
Employed Apache Kafka on Cloud Dataproc to build a distributed, fault-tolerant messaging system for ingesting and processing high-volume data streams.
Implemented Cloud Dataflow's streaming analytics capabilities with Apache Beam's windowing and session functions for real-time anomaly detection and predictive maintenance use cases.
Integrated Cloud Dataprep for data exploration, profiling, and data quality management, leveraging its visual interface and machine learning capabilities.
Employed Cloud Composer (Apache Airflow) for orchestrating and scheduling complex data pipelines, including batch processing and machine learning workflows.
Utilized Cloud Dataproc to run Apache Spark and Hadoop workloads, leveraging Spark's Structured Streaming and Machine Learning libraries for real-time analytics and model deployment.
Implemented Cloud Dataproc Metastore for managing and governing data lakes, including metadata management, data discovery, and access control.
Employed Cloud Data Fusion for building and managing hybrid and multi-cloud data integration pipelines, leveraging its pre-built connectors and transformations.
Utilized BigQuery for creating a cloud-based data warehouse, leveraging its machine learning capabilities (BigQuery ML) and integrating with tools like Looker for data visualization and reporting.
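Illustrative sketch of training a BigQuery ML model from Python (dataset, table, and feature names are hypothetical placeholders):

```python
# Minimal BigQuery ML training job; names and features are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_days, monthly_usage, support_tickets, churned
FROM `my_dataset.customer_features`
"""

# Runs as a standard BigQuery job; .result() blocks until training completes.
client.query(create_model_sql).result()
```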
Implemented Cloud Dataflow's Apache Beam pipelines for data quality checks, validation, and cleansing, utilizing built-in and custom data transformation functions.
Employed Cloud Dataflow's streaming SQL engine for real-time stream processing and querying using SQL-like syntax.
Utilized Git and Cloud Source Repositories for version control and collaborative development, integrating with Cloud Build for continuous integration and deployment (CI/CD).
Collaborated with cross-functional teams, including data scientists, software engineers, and product managers, using Agile methodologies and project management tools for effective communication and coordination.

Datadot Software Solution, India November 2017 - January 2019
Data Engineer
Technical Stack: AWS, Snowflake, Amazon S3, MSK, Lambda, Kinesis, Flink, EMR, DMS, Glue DataBrew, Tableau, Glue Data Catalog, KMS, IAM, Prometheus, Grafana, CloudFormation, CDK, GitHub, CodePipeline, CodeBuild, Jira.
Designed and implemented a hybrid data platform leveraging both AWS and Snowflake to cater to diverse data processing and analytical requirements.
Utilized Amazon S3 buckets and Snowflake's internal and external stages for ingesting and storing raw data from various sources, including databases, APIs, and file systems.
Implemented Amazon Managed Streaming for Apache Kafka (MSK) for building a distributed, fault-tolerant messaging system for ingesting and processing high-volume, real-time data streams.
Developed streaming data pipelines using AWS Lambda and Amazon Kinesis Data Analytics with Apache Flink, integrating with Amazon MSK for real-time data ingestion and leveraging windowing and session functions for stream processing and analytics.
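Illustrative sketch of the Lambda side of such a pipeline, consuming a Kinesis record batch (the payload fields and threshold check are hypothetical placeholders for the real stream logic):

```python
# Minimal AWS Lambda handler for a Kinesis event source; fields are placeholders.
import base64
import json


def lambda_handler(event, context):
    anomalies = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Simple threshold check standing in for the real detection logic.
        if payload.get("temperature", 0) > 90:
            anomalies.append(payload)
    return {"anomalies_detected": len(anomalies)}
```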
Utilized Apache Spark on Amazon EMR for batch processing and ETL jobs, leveraging Spark's RDD and DataFrame APIs for data transformations and integrating with Amazon S3 and Snowflake stages.
Employed Snowflake's cloud data platform for building a scalable, high-performance data warehouse, leveraging its semi-structured data capabilities, SQL querying, and advanced analytics features.
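Illustrative sketch of loading staged data into Snowflake with the Python connector (account, stage, and table names are hypothetical placeholders):

```python
# Minimal Snowflake load from an external stage; identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)
try:
    cur = conn.cursor()
    # COPY INTO loads Parquet files from an external (S3-backed) stage.
    cur.execute(
        "COPY INTO RAW.EVENTS FROM @S3_EVENTS_STAGE "
        "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
finally:
    conn.close()
```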
Implemented change data capture (CDC) using AWS Database Migration Service (DMS) for capturing and streaming database changes in real-time from Amazon RDS, Amazon Aurora, and self-hosted MySQL, PostgreSQL, and Oracle databases to Amazon MSK topics.
Integrated AWS Glue DataBrew for data exploration, profiling, and data quality management, leveraging its visual interface and machine learning capabilities for data wrangling and feature engineering.
Utilized Amazon QuickSight and Tableau for data visualization and reporting, connecting to both Amazon Athena/Amazon Redshift and Snowflake as data sources.
Implemented data governance and security measures, including data lineage tracking with AWS Glue Data Catalog, data encryption with AWS Key Management Service (KMS) and Snowflake's encryption features, and access controls with AWS Identity and Access Management (IAM) and Snowflake's role-based access control (RBAC).
Employed Amazon Managed Service for Prometheus and Amazon Managed Grafana for monitoring and observability, collecting metrics from AWS resources, Amazon MSK, and Snowflake, and creating custom dashboards and alerts.
Utilized AWS CloudFormation and AWS Cloud Development Kit (CDK) for infrastructure as code (IaC) and provisioning AWS resources, including Amazon EMR clusters, AWS Lambda functions, and Amazon MSK clusters, as well as Snowflake resources such as databases, warehouses, and roles.

Client: Airtel, India December 2014 - November 2017
Data Engineer
Technical Stack: Azure Functions, Event Hubs, Data Factory, Apache Beam, Apache Kafka, Databricks, Synapse Analytics, Azure Log Analytics, Azure Active Directory, Azure Key Vault, Terraform, Azure DevOps.
Leveraged Azure Data Factory and Azure Synapse Analytics for efficient data management, developing complex data pipelines and data flows for seamless data migration from flat files to Azure SQL databases.
Developed and performed Azure Data Factory copy activities to load data from Oracle databases to Azure Data Lake Storage Gen2 (ADLS Gen2); created partitions and buckets based on state for bucket-based Azure Synapse Analytics SQL pool joins.
Created Azure Synapse Analytics SQL pool tables to store processed results in a tabular format; loaded and transformed data using scripting languages like Python and Bash.
Led ETL efforts to integrate, transform, and map data from multiple sources using Python; imported datasets with ingestion tools like Azure Data Factory, Azure Event Hubs, and Azure IoT Hub.
Extensively worked with moving data across Azure cloud services, including Azure Synapse Analytics, Azure Data Lake Storage Gen2, and Azure Blob Storage.
Developed data pipelines using Azure Data Factory, Azure Synapse Analytics, and Azure Databricks to ingest subscriber data, provider data, and claims into ADLS Gen2 for analysis.
Demonstrated working knowledge of Apache Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming on Azure Databricks.
Loaded streaming data into Spark RDD and performed in-memory computations to generate output responses.
Collected data using Spark Structured Streaming from Azure Blob Storage in near-real-time, performing transformations and aggregations to build common learner data models and persisting data in ADLS Gen2.
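Illustrative Spark Structured Streaming sketch of this pattern (the storage paths, schema, watermark, and window size are hypothetical placeholders):

```python
# Minimal Structured Streaming job from Blob Storage to ADLS Gen2; placeholders throughout.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("learner-stream").getOrCreate()

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("score", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read newly arriving JSON files from Blob Storage as a stream.
events = spark.readStream.schema(schema).json(
    "wasbs://events@myaccount.blob.core.windows.net/raw/")

# Windowed aggregation with a watermark so late data stays bounded.
scores = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "learner_id")
    .agg(F.avg("score").alias("avg_score"))
)

# Persist the aggregated learner model to the data lake (ADLS Gen2 path shown).
query = (
    scores.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "abfss://curated@myaccount.dfs.core.windows.net/learner_models/")
    .option("checkpointLocation",
            "abfss://curated@myaccount.dfs.core.windows.net/_checkpoints/learner_models/")
    .start()
)
query.awaitTermination()
```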
Experienced in Microsoft SQL Server database administration, deploying packages in SQL Server Data Tools (SSDT) to SQL Server Integration Services (SSIS) catalogs.
Implemented constraints tables in data warehouses to load data and catch errors like data type violations, NULL constraint violations, foreign key violations, and duplicate data.
Generated reports to maintain zero percent errors in all data warehouse tables.
Developed SSIS packages for migrating data from the staging area of SQL Server 2008 to SQL Server 2012.

EDUCATION:
Graduated from ANDHRA UNIVERSITY, Vishakhapatnam, India - May 2014