| Carl Johnson - Lead AI / Cloud Data Engineer |
| [email protected] |
| Location: Chicago, Illinois, USA |
| Relocation: YES |
| Visa: USC |
| Resume file: Carl Johnson_1778527268531.docx |
|
Lead AI & Cloud Data Engineer
Carl Johnson [email protected]
Results-driven Lead AI & Azure Data Engineer with 10+ years of experience designing and delivering enterprise-scale cloud data platforms, Lakehouse architectures, and AI-enabled analytics solutions across Banking, Healthcare, Retail, Telecom, and Semiconductor domains. Proven expertise in Azure, AWS, and GCP with strong hands-on experience in Azure Databricks, Azure Synapse, Azure Data Factory, AWS Glue, BigQuery, PySpark, Spark SQL, and distributed data processing frameworks. Specialized in building scalable ETL/ELT pipelines, real-time streaming architectures, Medallion/Lakehouse frameworks, and AI/ML-ready data platforms supporting advanced analytics, regulatory reporting, and intelligent automation. Strong background in data governance, cloud security, CI/CD, Infrastructure as Code (Terraform/ARM), and enterprise cloud modernization initiatives.
PROFESSIONAL SUMMARY
Lead AI & Azure Data Engineer with 10+ years of progressive experience designing, developing, and modernizing enterprise-scale cloud data platforms, AI-enabled analytics ecosystems, and Lakehouse architectures across Banking, Healthcare, Retail, Telecom, and Semiconductor domains. Proven expertise in building scalable multi-cloud data engineering solutions across Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP) supporting enterprise analytics, regulatory reporting, AI/ML workloads, and real-time data processing. Hands-on experience delivering end-to-end enterprise data solutions using Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, AWS Glue, Amazon Redshift, GCP Dataflow, BigQuery, and cloud-native analytics services. Strong experience architecting modern Lakehouse and Medallion architectures (Bronze, Silver, Gold layers) enabling scalable ingestion, transformation, governance, and consumption of structured, semi-structured, and streaming datasets.
Specialized in building enterprise ETL/ELT frameworks, metadata-driven ingestion pipelines, distributed data processing systems, and cloud-native analytics platforms using Python, PySpark, Spark SQL, Apache Spark, and Apache Beam. Hands-on expertise implementing scalable batch and real-time streaming pipelines using Azure Event Hub, Amazon Kinesis, Google Cloud Pub/Sub, and Databricks Structured Streaming for low-latency enterprise analytics. Extensive experience in SQL development, Spark optimization, query tuning, dimensional data modeling, partitioning, indexing, and performance optimization across cloud-native and distributed data platforms. Strong background in designing enterprise data warehouses, Lakehouse architectures, dimensional models, fact and dimension tables, data marts, and analytical reporting frameworks following Kimball methodology. Experience implementing scalable Change Data Capture (CDC), watermarking, incremental loading, reconciliation, auditing, and enterprise data quality validation frameworks supporting regulatory and compliance requirements. Skilled in integrating on-premises and cloud-based systems including Oracle, SQL Server, mainframe systems, APIs, flat files, and third-party platforms into modern cloud-native ecosystems. Strong expertise in cloud security, data governance, RBAC, Row-Level Security (RLS), encryption, Azure Key Vault, AWS IAM, AWS KMS, Azure Purview, and enterprise compliance frameworks including HIPAA and GDPR. Experience developing AI/ML-ready data engineering platforms supporting predictive analytics, fraud detection, intelligent automation, semantic search, Retrieval-Augmented Generation (RAG), and Generative AI integration using Azure OpenAI and cloud-native AI services. Hands-on experience implementing DevOps and Infrastructure as Code (IaC) practices using Azure DevOps, AWS CodePipeline, GitHub, Terraform, ARM Templates, CloudFormation, CI/CD pipelines, and automated cloud deployments. 
Proven expertise in enterprise production support, incident management, root cause analysis, operational monitoring, SLA management, and continuous optimization of mission-critical cloud data platforms. Strong leadership and collaboration skills with proven ability to mentor engineers, lead technical initiatives, work with cross-functional stakeholders, and translate complex business requirements into scalable enterprise data and AI solutions.
TECHNICAL SKILLS
CLOUD PLATFORMS: Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure Synapse Analytics, Azure Data Factory (ADF), Azure Data Lake Storage Gen2 (ADLS Gen2), Azure Databricks, Azure Event Hub, Azure Key Vault, Azure Purview, Azure Monitor, Azure Log Analytics, Azure Functions, Azure OpenAI, Azure Cognitive Services, Azure Machine Learning, AWS S3, AWS Glue, AWS Glue Studio, AWS Lambda, AWS Redshift, AWS Lake Formation, AWS DMS, AWS DataSync, AWS Kinesis, AWS CloudWatch, AWS CloudTrail, AWS API Gateway, GCP BigQuery, Google Cloud Storage (GCS), Google Cloud Dataflow, Google Cloud Pub/Sub, Cloud Functions, Cloud Run, Cloud Build, Google Cloud Monitoring, Databricks (Azure/AWS/GCP)
AI & MACHINE LEARNING: Azure OpenAI, Generative AI Integration, AI/ML Data Pipelines, Retrieval-Augmented Generation (RAG), Vector Search Preparation, AI-Ready Data Platforms, Feature Engineering Pipelines, ML Data Preparation, Predictive Analytics Support, AI-Driven Automation, Intelligent Document Processing, Semantic Search Preparation, Azure Cognitive Services, Azure Machine Learning Integration, Fraud Detection Analytics, Risk Scoring Analytics
DATA ENGINEERING & ETL/ELT: Enterprise ETL/ELT Design & Development, Azure Data Factory Pipelines, AWS Glue Jobs, Glue Workflows, GCP Dataflow Pipelines, Metadata-Driven Frameworks, Data Integration, Data Transformation, Data Migration, Workflow Orchestration, Incremental Data Loads, Change Data Capture (CDC), Batch Processing, Real-Time Streaming Pipelines, Data Ingestion Frameworks, Data Lake & Lakehouse Architecture, Medallion Architecture, Distributed Data Processing, Data Reconciliation, Data Validation, Data Cleansing, Data Standardization, Data Profiling, Hybrid Cloud Integration
DATABASES & DATA WAREHOUSING: SQL Server, Azure SQL Database, Azure Synapse Dedicated SQL Pools, Amazon RDS, Amazon Aurora, Amazon Redshift, PostgreSQL, MySQL, Oracle, BigQuery, Snowflake, MongoDB, Cassandra, DynamoDB, NoSQL Databases, Enterprise Data Warehouse (EDW), Operational Data Store (ODS), Data Marts, Data Lake, Lakehouse Architecture
PROGRAMMING & QUERY LANGUAGES: Python, PySpark, Spark SQL, SQL, T-SQL, PL/SQL, Scala (Basic), Shell Scripting, JSON, XML, YAML, REST API Integration, Apache Beam
BIG DATA & STREAMING TECHNOLOGIES: Apache Spark, Databricks, Hadoop Ecosystem, Hive, Delta Lake, Spark Structured Streaming, Apache Beam, Distributed Computing, Real-Time Data Processing, Event-Driven Architecture, Streaming Analytics, Dataflow Streaming Jobs, Kinesis Data Streams, Azure Event Hub, Cloud Pub/Sub
DATA MODELING & ARCHITECTURE: Dimensional Modeling, Star Schema, Snowflake Schema, Fact & Dimension Tables, Kimball Methodology, Data Modeling, Data Architecture, SCD Type 1 & Type 2, Enterprise Data Platform Design, Medallion Architecture, Lakehouse Architecture, Data Lineage, Metadata Management
BUSINESS INTELLIGENCE & ANALYTICS: Power BI, Amazon QuickSight, Tableau, Looker, Dashboard Development, Executive Reporting, Data Visualization, Semantic Modeling, KPI Reporting, Financial Analytics, Healthcare Analytics, Retail Analytics, Risk Analytics, Operational Reporting
CLOUD ANALYTICS SERVICES: Azure Synapse Analytics, Azure Analysis Services, Azure Data Explorer, Amazon Redshift, AWS Athena, BigQuery, Databricks SQL, Power BI Semantic Models
DEVOPS, CI/CD & IaC: Azure DevOps, AWS CodePipeline, AWS CodeBuild, Cloud Build, GitHub Actions, CI/CD Pipelines, Git, Terraform, ARM Templates, AWS CloudFormation, Infrastructure as Code (IaC), YAML Pipelines, Automated Deployments, Release Management, Environment Provisioning
DATA GOVERNANCE & SECURITY: Azure RBAC, AWS IAM, GCP IAM, AWS Lake Formation, Azure Key Vault, AWS KMS, Managed Identities, Role-Based Access Control (RBAC), Row-Level Security (RLS), Data Governance, Data Security, Data Encryption, Data Privacy, Compliance (HIPAA, GDPR), Audit Frameworks, Data Lineage, Azure Purview, Secure API Integration
INTEGRATION & DATA CONNECTIVITY: REST APIs, Web Services, API Gateway, Azure Event Hub, AWS Kinesis, Google Cloud Pub/Sub, JDBC/ODBC Connectivity, Hybrid Data Integration, Third-Party API Integration, Secure Token-Based Authentication, Self-Hosted Integration Runtime
PERFORMANCE OPTIMIZATION: Query Optimization, Performance Tuning, Spark Optimization, Adaptive Query Execution, Broadcast Joins, Partitioning, Clustering, Caching, Indexing Strategies, Materialized Views, PolyBase, COPY INTO, Data Skew Handling, Execution Plan Analysis, Workload Optimization, Cost Optimization, Auto-Scaling Strategies
MONITORING & LOGGING: Azure Monitor, Azure Log Analytics, AWS CloudWatch, AWS CloudTrail, Google Cloud Monitoring, Google Cloud Logging, Pipeline Monitoring, Alerting Frameworks, SLA Monitoring, Incident Management, Operational Dashboards
OPERATING SYSTEMS: Windows, Linux/Unix
PROJECTS & METHODOLOGIES: Agile, Scrum, SDLC, Data Lifecycle Management, DevOps Practices, Continuous Integration, Continuous Deployment, Production Support, Incident Management, Root Cause Analysis, Release Management
TOOLS & UTILITIES: Azure Portal, AWS Console, GCP Console, Databricks Workspace, SQL Server Management Studio (SSMS), Visual Studio Code, Azure Storage Explorer, GitHub, Jira, Postman, WinSCP, PuTTY
LEAD AI/ML / DATA ENGINEER | JP MORGAN | 2023 - PRESENT
Architected and led the implementation of a scalable enterprise AI-enabled data platform on Microsoft Azure using Azure Data Lake Storage Gen2 (ADLS Gen2), Azure Synapse Analytics, and Azure Databricks to support
regulatory reporting, risk analytics, fraud detection, intelligent automation, and enterprise-wide financial data processing. Designed and implemented modern Lakehouse and Medallion architecture (Bronze, Silver, Gold layers) enabling scalable ingestion, transformation, governance, and AI-driven analytics across enterprise financial datasets. Developed metadata-driven and configuration-based Azure Data Factory (ADF) orchestration frameworks enabling dynamic ingestion of 150+ structured, semi-structured, and streaming data sources using reusable and parameterized pipelines. Built scalable ETL/ELT and AI-ready data pipelines leveraging Azure Databricks, PySpark, Spark SQL, Delta Lake, and Azure Synapse for high-volume transactional, customer, market, and risk datasets. Engineered high-performance distributed data transformation pipelines in Azure Databricks processing multi-terabyte financial datasets with optimized Spark execution strategies including partition pruning, broadcast joins, caching, adaptive query execution, and workload parallelization. Leveraged Delta Lake to implement ACID-compliant Lakehouse pipelines supporting schema evolution, time travel, data versioning, auditability, and enterprise governance requirements. Developed scalable Change Data Capture (CDC) frameworks using watermarking, Delta MERGE operations, and hash-based comparisons to support incremental data processing and low-latency ingestion. Implemented secure hybrid-cloud integration using Self-Hosted Integration Runtime, REST APIs, Azure Functions, and secure connectors to extract data from Oracle, SQL Server, mainframe systems, and third-party financial platforms into Azure environments. Designed enterprise-scale dimensional data models including Customer, Trade, Risk, Transaction, Portfolio, and Compliance marts following Kimball and Star Schema methodologies for advanced analytics and reporting. 
Built Slowly Changing Dimension (SCD Type 1 & Type 2) frameworks using Delta MERGE operations to maintain historical data consistency and audit tracking across financial systems. Developed advanced Spark SQL transformation logic using analytical functions, aggregations, ranking, anomaly detection, and statistical calculations to support fraud analytics, compliance monitoring, and financial risk scoring. Implemented near real-time streaming pipelines using Azure Event Hub, Databricks Structured Streaming, and Spark Streaming to process live trading feeds, transaction events, and market activity data. Designed enterprise AI/ML-ready data engineering frameworks supporting downstream machine learning, predictive analytics, intelligent automation, and Generative AI use cases. Collaborated with AI/ML and analytics teams to deliver curated feature-engineered datasets supporting fraud detection models, customer intelligence, forecasting, predictive analytics, and risk modelling initiatives. Integrated Azure OpenAI and AI-enabled services for intelligent document processing, semantic search preparation, AI-assisted analytics workflows, and enterprise automation initiatives. Supported development of Retrieval-Augmented Generation (RAG)-ready data pipelines enabling ingestion, transformation, chunking, metadata enrichment, and vector-search preparation for enterprise AI applications. Assisted in implementing AI-driven automation workflows leveraging Azure Cognitive Services, Azure Machine Learning integration patterns, and cloud-native AI orchestration approaches. Developed enterprise data quality frameworks performing schema validation, reconciliation, referential integrity validation, null checks, anomaly detection, and automated auditing across ingestion and transformation layers. Built automated reconciliation and audit pipelines comparing source financial records with curated analytical datasets ensuring regulatory compliance and enterprise reporting accuracy. 
Secured highly sensitive financial and customer data using Azure Key Vault, Managed Identities, RBAC, Row-Level Security (RLS), encryption, column masking, and enterprise governance controls. Implemented fine-grained security and governance strategies across Azure Synapse, Power BI, Databricks, and ADLS environments supporting enterprise compliance and secure data access management. Developed CI/CD pipelines using Azure DevOps, Git integration, YAML-based deployments, and automated release pipelines for Azure Data Factory, Databricks notebooks, Synapse artifacts, infrastructure resources, and AI pipelines. Automated cloud infrastructure provisioning and environment standardization using ARM templates, Terraform, and Infrastructure as Code (IaC) best practices across development, QA, staging, and production environments. Tuned Azure Synapse dedicated SQL pools using partitioning strategies, materialized views, indexing, workload management, and distribution optimization to improve analytical query performance and scalability. Implemented PolyBase and COPY INTO strategies for high-performance ingestion of large-scale Parquet, Delta, and CSV datasets into Synapse analytical environments. Built centralized monitoring, logging, and observability frameworks using Azure Monitor, Log Analytics, Databricks monitoring, and custom telemetry dashboards to track pipeline execution, Spark performance, SLA adherence, and operational health. Configured proactive alerting and incident management mechanisms integrated with enterprise notification systems to handle failures, latency issues, and production incidents efficiently. Developed enterprise-grade Power BI semantic models, KPIs, and executive dashboards supporting financial analytics, compliance reporting, operational intelligence, and risk exposure monitoring. 
Implemented enterprise data governance, lineage tracking, metadata management, and cataloguing using Azure Purview to support regulatory audits, compliance initiatives, and enterprise data discovery. Led migration of legacy SSIS, Informatica, and traditional ETL workflows into modern cloud-native Azure Data Factory and Databricks Lakehouse architectures improving scalability, maintainability, and operational efficiency. Performed cloud cost optimization initiatives using auto-scaling Azure Databricks clusters, workload optimization, auto-pause/resume strategies, and efficient compute utilization techniques. Led technical design discussions, established enterprise engineering standards, mentored junior engineers, and enforced cloud-native data engineering and AI integration best practices across enterprise programs. Collaborated with enterprise architects, business stakeholders, AI/ML teams, compliance groups, DevOps engineers, and cross-functional teams in Agile/Scrum environments to deliver scalable, secure, and AI-enabled financial data solutions. Played a key role in modernizing enterprise financial ecosystems into cloud-native AI-enabled Lakehouse platforms supporting advanced analytics, automation, and next-generation AI initiatives. 
Environment: Microsoft Azure, Azure Data Factory (ADF), Azure Synapse Analytics, Azure Data Lake Storage Gen2 (ADLS Gen2), Azure Databricks, Delta Lake, Azure Event Hub, Azure Key Vault, Azure Purview, Azure DevOps, ARM Templates, Terraform, Azure OpenAI, Azure Cognitive Services, Azure Machine Learning, Power BI, Python, PySpark, Spark SQL, SQL, REST APIs, Distributed Data Processing, ETL/ELT, Lakehouse Architecture, Medallion Architecture, CDC Pipelines, Databricks Structured Streaming, AI/ML Data Pipelines, RAG Pipelines, Vector Search Preparation, Enterprise AI Integration, Cloud-Native Analytics, Data Governance, RBAC, Row-Level Security (RLS), CI/CD, Git, Agile, SDLC, Financial Analytics, Risk Analytics, Fraud Detection, Enterprise Data Platform Engineering
SENIOR DATA ENGINEER | CVS Health | 2020 - 2023
Architected and implemented a scalable enterprise healthcare data platform on AWS using Amazon S3 as the centralized cloud data lake for ingesting, processing, and analyzing large-scale healthcare datasets including claims, eligibility, pharmacy, provider, member, and clinical data from multiple enterprise source systems. Designed and developed enterprise-grade ETL/ELT pipelines using AWS Glue, Glue Studio, PySpark, Python, and Spark SQL to process billions of healthcare records with complex transformation and validation logic. Built metadata-driven and configuration-based ingestion frameworks using AWS Glue Workflows, reusable pipeline templates, and parameterized job orchestration to onboard new healthcare datasets dynamically without code modifications. Implemented secure and scalable hybrid-cloud data ingestion pipelines using AWS Database Migration Service (DMS), AWS DataSync, JDBC connectors, and API integrations to migrate data from on-premises SQL Server, Oracle, EMR, and legacy healthcare systems into AWS.
Designed and enforced modern Medallion/Lakehouse architecture consisting of raw, curated, and consumption layers using Amazon S3 and AWS Lake Formation to standardize enterprise data processing, governance, and accessibility. Leveraged Amazon Redshift as the enterprise cloud data warehouse to build curated healthcare analytical data marts supporting claims analytics, patient outcome analysis, pharmacy reporting, provider analytics, cost optimization, utilization management, and operational reporting. Developed scalable incremental and CDC-based data processing frameworks using timestamp-based extraction, partitioning strategies, watermarking, and merge operations to optimize ETL performance and reduce processing windows. Implemented complex healthcare transformation logic using Spark SQL, PySpark DataFrames, analytical functions, joins, aggregations, and window functions to process claims transactions, pharmacy data, provider networks, and healthcare utilization datasets. Standardized healthcare coding systems including ICD, CPT, HCPCS, and NDC using enterprise reference mapping frameworks, validation rules, and business transformation logic to ensure data consistency and regulatory compliance. Designed and implemented Slowly Changing Dimension (SCD Type 1 & Type 2) frameworks using AWS Glue and Redshift MERGE operations to maintain historical patient, provider, and member data for enterprise analytical reporting. Built enterprise data quality validation frameworks using AWS Glue Data Quality, Python validation scripts, reconciliation processes, and SQL-based integrity checks to enforce schema consistency, referential integrity, and data reliability. Developed automated reconciliation and audit frameworks comparing source system counts, transactional totals, claims balances, and curated reporting datasets to ensure healthcare reporting accuracy and compliance readiness. 
Integrated third-party healthcare APIs, pharmacy benefit management systems, payer systems, and external healthcare vendors using AWS Lambda, API Gateway, REST APIs, and token-based authentication mechanisms. Implemented near real-time streaming data pipelines using Amazon Kinesis Data Streams, Kinesis Firehose, and Spark Streaming to process healthcare events, patient activities, and prescription transactions with low-latency ingestion. Optimized Spark and Glue job performance by implementing partitioning, bucketing, caching, broadcast joins, dynamic resource allocation, parallel processing, and memory optimization techniques to improve runtime efficiency and reduce cloud costs. Secured sensitive healthcare and PHI data using AWS IAM roles, bucket policies, AWS KMS encryption, S3 security controls, network access policies, and enterprise security best practices to ensure HIPAA and compliance adherence. Implemented fine-grained access control using AWS Lake Formation, Redshift RBAC, row-level security, and policy-based access management to restrict sensitive healthcare data visibility based on user roles and business functions. Developed CI/CD pipelines using AWS CodePipeline, CodeBuild, GitHub, and automated deployment frameworks for Glue jobs, Lambda functions, infrastructure components, and cloud-native data services. Automated infrastructure provisioning and cloud resource deployment using Terraform and AWS CloudFormation templates to standardize environment setup across development, QA, staging, and production environments. Built enterprise monitoring, logging, and operational support frameworks using Amazon CloudWatch, AWS CloudTrail, Glue job metrics, and centralized dashboards to monitor pipeline execution, SLA adherence, and platform health. Configured proactive alerting and notification mechanisms integrated with Amazon SNS, email alerts, and operational dashboards to manage failures, latency issues, and production incidents efficiently. 
Developed healthcare analytics dashboards and reporting solutions using Amazon QuickSight and Redshift datasets to support operational insights, patient analytics, claims analysis, financial reporting, and executive decision-making. Partnered with data science and analytics teams to deliver curated and high-quality healthcare datasets supporting predictive modeling, patient risk scoring, fraud detection, utilization forecasting, and advanced analytics initiatives. Provided production support including incident management, root cause analysis, ETL troubleshooting, performance tuning, operational optimization, and SLA management for critical enterprise healthcare data pipelines. Led technical discussions, established development standards, mentored junior engineers, and enforced cloud engineering best practices across enterprise healthcare data initiatives. Collaborated with cross-functional teams including business stakeholders, healthcare analysts, architects, QA teams, and DevOps engineers in Agile/Scrum environments to deliver scalable and secure healthcare analytics solutions. Played a key role in modernizing legacy healthcare ETL ecosystems into scalable cloud-native AWS Lakehouse architectures supporting enterprise digital transformation initiatives. 
Environment: Amazon Web Services (AWS), AWS S3, AWS Glue, Glue Studio, Amazon Redshift, AWS Lambda, API Gateway, AWS DMS, AWS DataSync, Amazon Kinesis Data Streams, Kinesis Firehose, AWS IAM, AWS KMS, AWS Lake Formation, Amazon CloudWatch, AWS CloudTrail, AWS CodePipeline, AWS CodeBuild, Terraform, AWS CloudFormation, Python, PySpark, Spark SQL, SQL, REST APIs, ETL/ELT, CDC Pipelines, Lakehouse Architecture, Medallion Architecture, Distributed Data Processing, Streaming Pipelines, Healthcare Analytics, Data Warehousing, Redshift Optimization, HIPAA Compliance, SCD Type 1 & 2, CI/CD, GitHub, Agile, SDLC, Data Governance, Cloud Security
SENIOR DATA ENGINEER | Target | 2017 - 2020
Architected and implemented a scalable enterprise retail data platform on Google Cloud Platform (GCP) using Google Cloud Storage (GCS) as the centralized cloud data lake for ingesting and processing high-volume sales, customer, inventory, product, pricing, supply chain, and clickstream datasets. Designed and developed enterprise-grade batch and real-time data ingestion pipelines using Google Cloud Dataflow (Apache Beam), PySpark, and Python to process structured, semi-structured, and streaming retail datasets at scale. Built metadata-driven and configuration-based ingestion frameworks enabling dynamic onboarding of multiple retail source systems with reusable pipeline templates and parameterized workflows. Integrated enterprise retail data from POS systems, ERP platforms, CRM applications, e-commerce systems, third-party vendor feeds, marketing platforms, and flat file sources into GCP analytics environments. Designed and implemented modern Medallion/Lakehouse architecture consisting of raw, curated, and consumption layers on GCS and BigQuery to support scalable enterprise analytics and reporting solutions. Developed cloud-native ELT and ETL processing frameworks leveraging BigQuery SQL, Apache Beam, PySpark, and Dataflow distributed processing capabilities.
Leveraged BigQuery as the enterprise cloud data warehouse to build curated retail analytical data marts supporting sales analytics, customer segmentation, inventory forecasting, supply chain optimization, and operational reporting. Designed and optimized enterprise-scale BigQuery datasets, partitioned tables, clustered tables, materialized views, and query execution strategies to improve performance and reduce processing cost. Implemented advanced transformation logic using BigQuery SQL, analytical functions, aggregations, joins, ranking functions, and window functions to derive business-critical retail insights and KPI reporting. Developed incremental and CDC-style data processing frameworks using timestamp-based extraction, partitioning, watermarking, and merge strategies for efficient large-scale data processing. Designed dimensional data models including Customer, Product, Sales, Inventory, Vendor, Promotion, and Store fact and dimension tables following Kimball and Star Schema methodologies. Built Slowly Changing Dimension (SCD Type 1 & Type 2) frameworks in BigQuery to maintain historical tracking of customer, product, pricing, and inventory changes. Developed enterprise data quality validation frameworks using Python, SQL validation rules, reconciliation logic, and automated data checks to ensure data consistency and reliability. Implemented data profiling, schema validation, null handling, duplicate detection, referential integrity checks, and reconciliation processes across ingestion and transformation layers. Built reconciliation and audit frameworks comparing source system counts, transactional totals, and curated reporting datasets to ensure reporting accuracy and data integrity. Integrated real-time streaming pipelines using Google Cloud Pub/Sub and Dataflow streaming jobs to process clickstream data, transaction feeds, customer activity events, and near real-time retail analytics workloads. 
Developed serverless data processing solutions using Cloud Functions and Cloud Run for API integrations, lightweight transformation services, event-driven ingestion, and automation workflows. Integrated third-party vendor systems, marketing platforms, and external APIs using REST APIs, Cloud Functions, and secure authentication mechanisms for automated data ingestion. Optimized Dataflow jobs by tuning worker configurations, autoscaling parameters, memory utilization, parallelism, windowing strategies, and pipeline execution settings to improve throughput and reduce latency. Implemented distributed data processing and scalable cloud-native architectures supporting high-volume retail transaction processing and analytical workloads. Secured enterprise data environments using Google Cloud IAM roles, service accounts, encryption mechanisms, and security best practices to ensure controlled access and data protection. Implemented fine-grained access controls in BigQuery using authorized views, row-level security, dataset-level permissions, and policy-based access management. Built CI/CD pipelines using Cloud Build, GitHub, and automated deployment practices for Dataflow jobs, BigQuery artifacts, Cloud Functions, and infrastructure deployments. Automated infrastructure provisioning and environment setup using Terraform scripts for development, QA, staging, and production environments. Developed monitoring, logging, and operational support frameworks using Google Cloud Monitoring, Cloud Logging, and alerting mechanisms to track pipeline execution, failures, SLA adherence, and system health. Configured proactive alerting and notification systems using Pub/Sub notifications, email integrations, and monitoring dashboards to manage operational incidents efficiently. Assisted in troubleshooting production issues, performance bottlenecks, pipeline failures, and data inconsistencies through root cause analysis and optimization initiatives. 
Collaborated with cross-functional business stakeholders, analytics teams, product owners, and data science teams to deliver scalable retail analytics and enterprise reporting solutions. Supported Agile-based software development lifecycle activities including sprint planning, code reviews, release management, testing, deployment, and continuous platform enhancements. Mentored junior developers and engineers on GCP best practices, cloud-native development, distributed processing, and enterprise data engineering standards. Played a key role in modernizing legacy retail data processing systems into scalable cloud-native GCP analytics platforms supporting enterprise digital transformation initiatives.
Environment: Google Cloud Platform (GCP), Google Cloud Storage (GCS), BigQuery, Google Cloud Dataflow, Apache Beam, Cloud Pub/Sub, Cloud Functions, Cloud Run, Cloud Build, Google Cloud IAM, Cloud Monitoring, Cloud Logging, Terraform, Python, PySpark, Spark SQL, SQL, REST APIs, Distributed Data Processing, ETL/ELT, Lakehouse Architecture, Medallion Architecture, Big Data Processing, Streaming Pipelines, CDC, Data Warehousing, Star Schema, Kimball Modeling, SCD Type 1 & 2, CI/CD, GitHub, Agile, SDLC, Retail Analytics, Data Governance, Cloud Security
BIG DATA / ETL DEVELOPER | AT&T | 2015 - 2017
Contributed to building a scalable telecom data platform on AWS using Amazon S3 as the centralized data lake for ingesting and processing large-scale telecom datasets including Call Detail Records (CDR), customer usage, billing, and network event data. Developed enterprise ETL and ELT pipelines using AWS Glue, Python, and PySpark to extract, transform, and load data from Oracle, SQL Server, flat files, and telecom operational systems into the AWS ecosystem. Built scalable batch ingestion workflows using AWS Data Pipeline and AWS Glue Workflows for orchestrating large-scale telecom data processing jobs.
Developed PySpark transformation frameworks in AWS Glue to process multi-million record telecom datasets using Spark SQL, DataFrames, joins, aggregations, and window functions. Implemented distributed data processing techniques using Apache Spark to improve ETL scalability, fault tolerance, and execution performance. Designed and maintained raw, staging, and curated data layers in Amazon S3 supporting enterprise analytics and reporting requirements. Worked extensively with Amazon Redshift to build telecom analytical data marts supporting billing analytics, customer usage reporting, network monitoring, and operational insights. Developed SQL transformation logic, views, stored procedures, and aggregation pipelines in Amazon Redshift for analytical and reporting use cases. Built incremental data loading frameworks using timestamp-based extraction, partitioning, and CDC-style processing strategies to optimize ETL performance and reduce processing windows. Created reusable Python utilities and shell scripts for data validation, batch execution, job automation, file handling, reconciliation, and operational monitoring. Implemented data cleansing and validation routines to handle null values, duplicate records, malformed telecom events, and inconsistent source data structures. Supported Slowly Changing Dimension (SCD Type 1 and Type 2) implementations to maintain historical customer profile and telecom service plan changes. Integrated near real-time streaming pipelines using Amazon Kinesis to process live telecom network events and customer activity streams. Developed AWS Lambda functions for lightweight event-driven processing, automation tasks, and serverless ETL operations within the AWS environment. Assisted in migrating legacy ETL workflows and on-premises batch processing jobs into modern cloud-native AWS data pipelines. Optimized Spark and ETL job performance by tuning partitioning strategies, memory allocation, joins, caching, and query execution plans. 
Worked with AWS IAM roles, policies, and encryption mechanisms to implement secure data access and controlled permissions for AWS resources. Built monitoring and logging frameworks using Amazon CloudWatch to track ETL job execution, failures, runtime metrics, and operational health. Configured automated alerting and notification mechanisms using Amazon SNS for proactive monitoring and SLA management. Assisted in troubleshooting production ETL failures, performance bottlenecks, and data inconsistencies through root cause analysis and issue remediation. Participated in CI/CD deployment activities using AWS CodePipeline, Git, and automated deployment practices for ETL workflows and infrastructure components. Collaborated with business analysts, QA teams, and senior engineers in Agile environments to gather requirements, support releases, and enhance enterprise telecom data solutions. Gained strong hands-on experience with cloud-native big data processing, distributed computing, AWS analytics services, and enterprise ETL modernization initiatives.
Environment: AWS S3, AWS Glue, AWS Glue Workflows, AWS Data Pipeline, Amazon Redshift, AWS Lambda, Amazon Kinesis, AWS IAM, AWS CloudWatch, Amazon SNS, AWS CodePipeline, Python, PySpark, Spark SQL, SQL, Oracle, SQL Server, Shell Scripting, Apache Spark, ETL/ELT, Data Warehousing, Telecom Analytics, Distributed Data Processing, Batch Processing, Streaming Pipelines, Git, Unix/Linux, Agile, SDLC
ETL / Python Developer | Intel | 2014 - 2015
Developed and supported enterprise ETL pipelines using SQL Server Integration Services (SSIS) to extract, transform, and load large volumes of structured and semi-structured data from multiple enterprise source systems into centralized data warehouse environments. Built reusable ETL workflows and modular SSIS packages for handling data extraction, transformation, cleansing, validation, and loading processes across operational and analytical systems.
Worked extensively with Python scripting to automate file processing, data validation, log generation, data reconciliation, and batch job execution for ETL operations. Developed Python-based utilities for parsing CSV, XML, and flat file datasets, performing transformation logic, and generating standardized outputs for downstream processing. Wrote complex T-SQL queries, stored procedures, triggers, temporary tables, CTEs, and views to support data transformation, aggregation, and reporting requirements. Assisted in migrating legacy ETL workflows and SQL-based batch processing jobs into optimized SSIS and Python-driven automation frameworks. Integrated data from multiple heterogeneous sources including SQL Server, Oracle, flat files, Excel files, and legacy enterprise applications into staging and Operational Data Store (ODS) layers. Designed and maintained staging tables, intermediate transformation layers, and warehouse schemas to support enterprise reporting and business intelligence solutions. Participated in dimensional data modeling activities including creation of fact tables, dimension tables, star schemas, and snowflake schemas for analytical reporting. Implemented Slowly Changing Dimension (SCD Type 1 and Type 2) logic within ETL processes to maintain historical tracking and auditability of dimensional data. Developed data quality and validation frameworks using SQL and Python to identify null values, duplicate records, invalid formats, referential integrity issues, and inconsistent source data. Performed data profiling, cleansing, normalization, and transformation activities to improve overall data quality and reporting accuracy. Created incremental and full-load ETL strategies using watermarking, batch processing, and change detection logic to optimize data processing performance. Assisted in early cloud migration initiatives by supporting data movement from on-premises databases into Microsoft Azure Blob Storage and Azure SQL Database environments. 
Utilized Python and shell scripting for job scheduling, file transfer automation, process monitoring, and operational support in Unix/Linux environments. Developed automation scripts for handling archival processes, batch executions, error handling, and ETL restart/recovery mechanisms. Worked with SQL Server Agent to schedule, monitor, and troubleshoot ETL workflows and overnight batch jobs supporting enterprise reporting SLAs. Assisted in troubleshooting ETL failures, performance bottlenecks, data mismatches, and load issues by analyzing logs, execution plans, and SQL performance metrics. Optimized SQL queries and ETL workflows using indexing strategies, query tuning, execution plan analysis, and efficient join/filter techniques. Participated in unit testing, integration testing, and UAT validation of ETL workflows to ensure data accuracy and successful deployments. Created detailed technical documentation including source-to-target mappings, ETL design documents, transformation rules, workflow diagrams, and operational runbooks. Collaborated with business analysts, database administrators, QA teams, and senior developers to gather requirements and implement scalable data integration solutions. Supported Agile-based development activities including sprint planning, defect resolution, code reviews, and release deployments. Gained hands-on exposure to enterprise data warehousing concepts, ETL best practices, data governance, and large-scale data integration methodologies. 
Environment: Python, SQL Server, SSIS, T-SQL, Stored Procedures, SQL Server Agent, Azure Blob Storage, Azure SQL Database, Oracle, Unix/Linux, Shell Scripting, ETL Development, Data Warehousing, Data Modeling, ODS, Star Schema, Snowflake Schema, SCD Type 1 & 2, CSV, XML, Flat Files, Data Validation, Data Cleansing, Performance Tuning, Batch Processing, Agile, SDLC, Git, Excel, SQL Optimization
Education:
Bachelor's - Chicago State University, 2008 - 2012
Master's - University of Illinois Chicago, 2012 - 2014 |