| Nikhil - Data Engineer with AI/ML |
| [email protected] |
| Location: Princeton, New Jersey, USA |
| Relocation: Yes |
| Visa: H-1B |
Name: Nikhil Boinapally
Mail Id: [email protected]
Contact No: 732-672-5095

Professional Summary:
- Senior AI/ML Data Engineer with 12+ years of experience delivering enterprise-grade data, machine learning, and generative AI solutions across healthcare, finance, insurance, and supply chain domains.
- Strong expertise in building production-ready LLM platforms, including RAG-based search systems, AI agents, and enterprise NLP pipelines.
- Proven ability to architect scalable AI systems that integrate seamlessly with data lakes, ERP platforms, and cloud-native infrastructure.
- Hands-on experience with Retrieval-Augmented Generation (RAG) using vector databases to improve accuracy, traceability, and trust in AI outputs.
- Extensive background in AI agent orchestration using LangChain and CrewAI for autonomous workflows and decision-support systems.
- Skilled in model training, fine-tuning, and optimization using TensorFlow and PyTorch for NLP, classification, anomaly detection, and forecasting use cases.
- Strong foundation in MLOps practices, including model versioning, CI/CD automation, experiment tracking, and production monitoring.
- Deep experience in feature engineering and data ingestion pipelines supporting both traditional ML and modern GenAI workloads.
- Adept at processing structured and unstructured data, including clinical documents, contracts, claims data, and transactional datasets.
- Proven track record of deploying real-time and batch inference services using REST APIs and event-driven architectures.
- Strong understanding of data governance, explainability, and auditability, ensuring AI systems meet regulatory and compliance standards.
- Experienced in cloud-native AI engineering across AWS environments using containerization and orchestration platforms.
- Effective collaborator with product owners, clinicians, compliance teams, and business stakeholders, translating requirements into AI-driven solutions.
- Demonstrated ability to optimize model performance, cost, and latency in high-scale enterprise environments.
- Background in building enterprise data platforms supporting analytics, ML, BI, and regulatory reporting.
- Known for delivering reliable, secure, and measurable AI solutions aligned with business outcomes and operational KPIs.

Technical Skills:
Programming Languages: Python, SQL, Scala
AI/ML Frameworks: TensorFlow, PyTorch, Scikit-learn, Hugging Face Transformers
Generative AI & LLMs: GPT-4, LLaMA, Mistral, OpenAI APIs, Prompt Engineering
RAG & AI Agents: RAG Pipelines, LangChain, CrewAI, Vector Search
Vector Databases: Pinecone, FAISS
Data Engineering: Apache Spark, PySpark, Feature Engineering Pipelines
Cloud Platforms: AWS (EC2, EKS, Lambda, S3, CloudWatch)
Data Stores: Snowflake, PostgreSQL, MongoDB, Oracle, Redshift
MLOps & Model Management: MLflow, Kubeflow, Model Registry, CI/CD
Containerization & Orchestration: Docker, Kubernetes
Workflow Orchestration: Airflow, Step Functions
APIs & Microservices: REST APIs, FastAPI, Event-Driven Architecture
Monitoring & Observability: Prometheus, Grafana, CloudWatch
Security & Compliance: HIPAA, SOC2, RBAC, Data Encryption, Audit Logging
DevOps & Automation: GitHub Actions, Jenkins, Terraform
Methodologies & Tools: Agile/Scrum, Jira, Confluence

Professional Experience:

Client: Cardinal Health, Dublin, OH. Jan 2025 - Till Date
Role: Sr. AI/ML Data Engineer
Project Scope: Enterprise AI modernization initiative for Cardinal Health focused on improving supply chain forecasting, clinical document intelligence, and internal knowledge discovery across pharmaceutical distribution and healthcare operations. The project involved building LLM-powered platforms, RAG-based search systems, and autonomous AI agents integrated with existing ERP, data lake, and cloud infrastructure. Emphasis was placed on production-ready ML pipelines, governance, scalability, and regulatory compliance (HIPAA, SOC2).
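The RAG systems described in this engagement follow a retrieve-then-generate pattern: embed the query, rank curated knowledge snippets by similarity, and assemble a grounded prompt. A minimal, illustrative sketch; a toy bag-of-words embedding stands in for a real embedding model, in-memory ranking stands in for a Pinecone/FAISS lookup, and the document texts are hypothetical:

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: bag-of-words counts (stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Rank snippets by similarity to the query; in production this is a
    vector-database lookup over dense embeddings."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    """Ground the LLM call in retrieved context so answers stay traceable."""
    context = "\n".join("- " + d for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\nQuestion: {query}"

# Hypothetical knowledge snippets (SOPs, contracts, internal wiki entries).
docs = [
    "SOP 12: cold-chain shipments must stay between 2 and 8 degrees Celsius.",
    "Vendor contract renewals require 90-day written notice.",
    "Inventory exceptions are escalated to the supply chain risk team.",
]
prompt = build_prompt("What temperature range applies to cold-chain shipments?", docs)
```

The grounded prompt is then sent to the LLM; keeping the retrieved snippets in the prompt is what makes the answer auditable.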
Responsibilities:
- Architected enterprise-scale LLM solutions to support supply chain analytics, contract intelligence, and clinical document processing using GPT-based and open-source models (LLaMA, Mistral) aligned with healthcare compliance requirements.
- Led the design of Retrieval-Augmented Generation (RAG) pipelines leveraging vector databases (Pinecone/FAISS) and curated knowledge sources, including SOPs, FDA documents, vendor contracts, and internal wikis, significantly improving answer accuracy and traceability.
- Established MLOps frameworks covering model versioning, experiment tracking, CI/CD automation, and monitoring using MLflow, Docker, Kubernetes, and GitHub Actions, ensuring seamless promotion from development to production.
- Built AI agents for automated data validation, anomaly detection, and workflow orchestration using LangChain/CrewAI, enabling autonomous decision support for supply chain exceptions and inventory risk alerts.
- Developed advanced NLP pipelines for document classification, entity extraction, summarization, and semantic similarity using Transformers, spaCy, and custom TensorFlow models for high-volume healthcare data.
- Directed model training, fine-tuning, and optimization workflows using TensorFlow and PyTorch, including parameter-efficient tuning (LoRA/PEFT) to reduce inference cost and latency.
- Designed scalable real-time and batch inference services exposed through REST and event-driven architectures, supporting downstream applications and internal analytics platforms.
- Implemented data ingestion and feature engineering pipelines integrating structured and unstructured data from Snowflake, S3, APIs, and streaming sources, ensuring data quality and lineage.
- Enforced model governance, explainability, and auditability using SHAP and custom evaluation metrics to meet healthcare regulatory and internal risk standards.
- Collaborated with product owners, data architects, and compliance teams to translate business requirements into AI-driven solutions with measurable KPIs and operational impact.
- Mentored junior engineers and data scientists, conducted design reviews, and set coding, MLOps, and deployment standards across the AI engineering team.
- Supported client stakeholders during production releases, post-deployment monitoring, and performance tuning, ensuring system reliability under enterprise workloads.

Environment: LLMs (GPT-4, LLaMA, Mistral), RAG, AI Agents, TensorFlow, PyTorch, Hugging Face, NLP, Transformers, spaCy, SHAP, MLflow, Docker, Kubernetes, GitHub Actions, CI/CD, Model Registry, Monitoring & Alerting, Snowflake, Amazon S3, Vector DBs (Pinecone, FAISS), SQL, NoSQL, AWS (EC2, EKS, Lambda, S3, CloudWatch), APIs, Microservices, Python, REST APIs, LangChain, FastAPI, Pandas, NumPy, HIPAA, SOC2, Role-Based Access, Data Encryption, Audit Logging.

Client: Express Scripts, St. Louis, MO. Oct 2023 - Dec 2024
Role: Sr. AI/MLOps Engineer
Project Scope: The project focused on building enterprise-grade AI and machine learning solutions to optimize pharmacy benefit management (PBM) operations, improve prescription fulfillment accuracy, reduce fraud and waste, and enhance customer and clinician experiences. The engagement involved deploying large-scale ML models, NLP pipelines, and Generative AI solutions across claims processing, prior authorization automation, member support chat systems, and clinical insights. The scope also included implementing MLOps frameworks, LLM-based RAG architectures, and AI agents integrated with internal healthcare data platforms while ensuring HIPAA compliance, scalability, and production reliability.
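Monitoring and drift detection of the kind this MLOps work required is often implemented with a population stability index (PSI) over feature distributions. A minimal sketch; the equal-width binning and the common 0.2 alert threshold are general heuristics, not a client-specific standard:

```python
from math import log

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline (training) sample and a
    live sample of one feature. Common rule of thumb: > 0.2 means the feature
    has drifted enough to warrant retraining or investigation."""
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / span * bins)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        total = len(values)
        # eps smoothing avoids log(0) when a bin is empty on one side
        return [(c + eps) / (total + eps * bins) for c in counts]

    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * log(pi / qi) for pi, qi in zip(p, q))
```

In a pipeline, this check runs per feature on each scoring batch, and breaches feed the alerting stack (Prometheus/Grafana) rather than being inspected by hand.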
Responsibilities:
- Led end-to-end ownership of AI/ML solutions supporting prescription claims analysis, medication adherence prediction, and prior authorization decisioning workflows used by millions of members.
- Built and optimized NLP pipelines for processing unstructured clinical notes, prescription data, and provider communications using transformer-based models and domain-specific embeddings.
- Implemented LLM-powered Retrieval-Augmented Generation (RAG) solutions to enable accurate, context-aware responses for internal clinical teams and member support applications.
- Integrated vector databases (FAISS/Pinecone) with enterprise data sources to support semantic search and knowledge retrieval across formularies, policies, and clinical guidelines.
- Designed and deployed AI agents capable of orchestrating multi-step tasks such as eligibility checks, formulary validation, and claims exception handling with minimal human intervention.
- Trained, fine-tuned, and evaluated deep learning models using TensorFlow and PyTorch for classification, sequence modeling, and anomaly detection use cases.
- Established robust MLOps pipelines for model versioning, automated retraining, CI/CD, monitoring, and drift detection using tools such as MLflow, Kubeflow, and GitHub Actions.
- Collaborated with data engineering teams to process large-scale healthcare datasets using Spark and cloud-native pipelines on AWS.
- Improved model inference latency and throughput by optimizing feature pipelines, batching strategies, and GPU utilization in production environments.
- Partnered with compliance and security teams to ensure HIPAA-aligned data handling, auditability, and access controls across AI systems.
- Conducted model explainability and bias assessments using SHAP and LIME to support regulatory requirements and clinical transparency.
- Supported A/B testing and real-world performance analysis to measure business impact and continuously improve AI-driven workflows.
- Mentored junior engineers and reviewed model architectures, feature engineering approaches, and deployment strategies.
- Worked closely with product owners, clinicians, and stakeholders to translate business requirements into scalable AI solutions.

Environment: Python, SQL, TensorFlow, PyTorch, Scikit-learn, Hugging Face Transformers, OpenAI APIs, LangChain, RAG, AI Agents, Prompt Engineering, MLflow, Kubeflow, GitHub Actions, Docker, Kubernetes, Apache Spark, Pandas, NumPy, PostgreSQL, MongoDB, Vector DBs (FAISS, Pinecone), AWS (S3, EC2, EKS, Lambda), Grafana, Prometheus, HIPAA, Jira, Confluence, Git.

Client: BNP Paribas, NYC, NY. Aug 2021 - Sep 2023
Role: Machine Learning Engineer
Project Scope: The project focused on building and maintaining enterprise-scale data platforms to support risk analytics, compliance reporting, fraud detection, and trading analytics across BNP Paribas US operations. The team handled high-volume transactional, market, and reference data, integrating multiple upstream systems into a centralized analytics platform. Responsibilities included real-time and batch data ingestion, feature engineering for ML models, regulatory data validation, and enabling downstream consumption for data science, BI, and regulatory teams under strict banking security and governance standards.
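The feature engineering behind fraud detection in platforms like this typically includes trailing-window aggregates per account. A single-process sketch of that logic; in production this ran as Spark window functions, and the event data below is illustrative:

```python
from collections import defaultdict, deque

def window_features(events, window=60):
    """events: (account, ts_seconds, amount) tuples, sorted by timestamp per
    account. For each event, emit the count and sum of that account's prior
    transactions within the trailing window, the kind of velocity feature a
    fraud-scoring model consumes."""
    recent = defaultdict(deque)   # account -> recent (ts, amount) pairs
    out = []
    for acct, ts, amt in events:
        q = recent[acct]
        while q and ts - q[0][0] > window:   # evict events older than the window
            q.popleft()
        out.append((acct, ts, amt, len(q), sum(a for _, a in q)))
        q.append((ts, amt))
    return out

# Illustrative stream: three transactions on account A, one on account B.
feats = window_features([("A", 0, 10.0), ("A", 30, 20.0), ("A", 100, 5.0), ("B", 40, 7.0)])
```

The per-account deque keeps the pass O(n) over sorted events, which mirrors what a Spark `Window.partitionBy(account).rangeBetween(...)` computes at scale.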
Responsibilities:
- Built scalable data pipelines to ingest high-frequency transactional and market data from multiple internal banking systems and third-party vendors into centralized data platforms.
- Managed end-to-end ETL workflows handling structured and semi-structured data to support risk, compliance, and AML use cases.
- Created optimized data models and curated datasets for machine learning teams supporting fraud detection and credit risk scoring.
- Supported feature engineering processes by transforming raw financial data into ML-ready datasets with version control and lineage tracking.
- Tuned Spark jobs to improve performance for large-scale joins, aggregations, and window functions across billions of records.
- Automated data quality checks, reconciliation reports, and validation rules to meet regulatory and audit requirements.
- Integrated real-time data streams using Kafka to support near-real-time fraud alerts and transaction monitoring.
- Collaborated closely with data scientists to operationalize ML models and deploy batch and streaming inference pipelines.
- Implemented data partitioning, indexing, and compression strategies to reduce query latency and storage costs.
- Maintained secure access controls and data masking policies to comply with internal governance and regulatory standards.
- Supported CI/CD pipelines for data workflows, ensuring smooth deployments across development, QA, and production environments.
- Monitored pipeline failures, performance metrics, and SLA compliance using logging and alerting tools.
- Assisted regulatory and audit teams by providing data extracts, lineage documentation, and reproducible reports.
- Coordinated with global teams across the US and Europe to align data standards and ingestion frameworks.
- Handled production support issues, root cause analysis, and preventive fixes for recurring data failures.

Environment: Python, PySpark, SQL, Scala, Apache Spark, Hadoop, Kafka, AWS (S3, EC2, EMR), HDFS, PostgreSQL, Oracle, Snowflake, MLlib, Scikit-learn, Feature Engineering Pipelines, Airflow, Control-M, Git, Jenkins, CI/CD, CloudWatch, Splunk, Tableau, Power BI, Linux, Unix.

Client: S&P Global, NYC, NY. Jun 2020 - Apr 2021
Role: Data Engineer
Project Scope: S&P Global maintains large-scale financial, market intelligence, and ratings data consumed by internal analytics teams, external clients, and regulatory bodies. The project focused on building and maintaining scalable data pipelines to ingest, process, and transform high-volume market data from multiple vendors and internal systems into centralized data platforms. The platform supported real-time analytics, historical reporting, risk modeling, and downstream BI applications while ensuring data accuracy, governance, and regulatory compliance.
Responsibilities:
- Built and maintained end-to-end data pipelines handling structured and semi-structured financial data sourced from vendor feeds, REST APIs, FTP servers, and internal transactional systems.
- Worked closely with product owners, quantitative analysts, and downstream consumers to understand data requirements related to pricing, ratings, and market intelligence datasets.
- Created robust ETL workflows using Python and SQL to cleanse, standardize, and enrich raw market data before loading into enterprise data warehouses.
- Managed ingestion of high-frequency data into cloud-based storage and processing platforms, ensuring low latency and high availability for analytical use cases.
- Optimized complex SQL queries and data models to improve performance of reporting and analytics workloads used by internal research and client-facing teams.
- Supported migration of on-prem data pipelines to AWS, including refactoring jobs to leverage S3, Redshift, and Glue for scalable processing.
- Ensured data quality through validation checks, reconciliation processes, and exception handling to meet regulatory and audit requirements.
- Partnered with DevOps teams to automate deployments and scheduling of batch jobs using CI/CD pipelines and workflow orchestration tools.
- Monitored data pipelines in production, analyzed failures, and resolved performance bottlenecks to meet strict SLA commitments.
- Participated in code reviews and followed enterprise data engineering standards, security policies, and best practices.
- Worked with governance teams to apply metadata management, data lineage, and access controls across sensitive financial datasets.
- Supported UAT and production releases by coordinating with QA and business stakeholders, ensuring accurate data delivery across environments.

Environment: Python, SQL, PySpark, AWS (S3, Redshift, Glue, EC2), Apache Airflow, Informatica, Snowflake, Oracle, PostgreSQL, CSV, JSON, Parquet, XML, Git, Jenkins, AWS CodePipeline, Agile/Scrum, Linux, Unix.

Client: Oracle, Bangalore, India. May 2016 - Jul 2019
Role: Jr. Data Engineer
Project Scope: Enterprise-scale data engineering initiative supporting Oracle's internal analytics and reporting platforms. The project focused on building and maintaining reliable data pipelines to process high-volume transactional, customer, and product data from multiple source systems. The platform enabled business intelligence, operational reporting, and downstream analytics by delivering cleansed, transformed, and performance-optimized datasets to data warehouses and reporting tools. Emphasis was placed on data accuracy, scalability, automation, and adherence to enterprise data governance standards.
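Change data capture (CDC) for incremental loads, as used in this role, amounts to replaying ordered insert/update/delete records onto the current warehouse snapshot. A minimal sketch; the I/U/D encoding and the sample rows are illustrative assumptions:

```python
def apply_cdc(snapshot, changes):
    """Apply a CDC batch to a warehouse snapshot keyed by primary key.
    snapshot: {pk: row}; changes: [(op, pk, row)] in commit order, where op is
    "I" (insert), "U" (update), or "D" (delete). Inserts and updates are both
    treated as upserts, which keeps a replayed batch idempotent."""
    target = dict(snapshot)            # leave the input snapshot untouched
    for op, pk, row in changes:
        if op == "D":
            target.pop(pk, None)       # tolerate deletes of already-absent keys
        else:
            target[pk] = row
    return target

# Illustrative snapshot and change batch.
current = {1: {"name": "acme"}, 2: {"name": "initech"}}
batch = [("U", 1, {"name": "acme corp"}), ("I", 3, {"name": "hooli"}), ("D", 2, None)]
loaded = apply_cdc(current, batch)
```

In a SQL warehouse the same logic is a MERGE keyed on the primary key; the upsert treatment of I and U is what makes reruns after a failed load safe.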
Responsibilities:
- Built end-to-end data pipelines to ingest data from Oracle databases, flat files, and external systems into centralized data warehouse environments.
- Created complex SQL queries, joins, and transformations to handle large datasets and ensure data consistency across multiple business domains.
- Developed ETL workflows using Informatica to extract, transform, and load structured and semi-structured data based on business requirements.
- Handled data cleansing, deduplication, validation rules, and reconciliation checks to improve overall data quality.
- Tuned SQL queries, indexes, and ETL jobs to improve performance for high-volume batch processing.
- Supported incremental and full data loads, including change data capture (CDC) logic for near-real-time reporting needs.
- Coordinated with business analysts and reporting teams to translate functional requirements into scalable data models.
- Maintained dimensional and fact tables following star and snowflake schema designs for analytics use cases.
- Performed root cause analysis for data issues, load failures, and performance bottlenecks, ensuring timely resolution.
- Automated recurring data workflows using shell scripts and scheduling tools to reduce manual intervention.
- Participated in code reviews, data validation cycles, and production deployments following change management processes.
- Ensured compliance with data security, access controls, and internal governance standards.
- Provided ongoing production support, monitoring daily loads, and resolving data anomalies within SLA timelines.

Environment: Oracle Database, Informatica PowerCenter, SQL, PL/SQL, Unix, Linux, Shell Scripting, Data Warehousing, ETL, Star Schema, Snowflake Schema, Jenkins (basic), Control-M, Git, Agile/Scrum.

Client: Thoughtworks, Pune, India. Jan 2014 - Apr 2016
Role: Software Developer
Project Scope: The project supported multiple client engagements focused on data-driven application development and analytics enablement across retail and enterprise domains.
The engagement involved building and maintaining data pipelines, supporting reporting systems, improving data reliability, and assisting senior engineers with scalable data processing solutions. The work emphasized clean data ingestion, transformation, validation, and close collaboration with application and business teams in an Agile delivery environment.
Responsibilities:
- Supported end-to-end data pipelines that ingested data from relational databases, flat files, and external systems for reporting and analytics use cases.
- Handled data extraction and transformation tasks using Python and SQL to prepare datasets for downstream consumption.
- Maintained batch data processing jobs and ensured timely availability of clean, validated data for business users.
- Assisted in writing optimized SQL queries for reporting, reconciliation, and ad-hoc analysis.
- Performed data quality checks, validation rules, and basic anomaly detection to improve data reliability.
- Collaborated with application developers to align data models with application and reporting requirements.
- Supported integration of data pipelines with reporting tools and internal dashboards.
- Monitored scheduled jobs and addressed failures by analyzing logs and correcting data or configuration issues.
- Participated in Agile ceremonies, providing updates on data pipeline status and resolving sprint-level issues.
- Assisted senior data engineers with performance tuning of queries and batch jobs.
- Managed version control for data scripts and SQL artifacts using Git.
- Supported deployment activities by validating data pipelines across development and test environments.
- Documented data flows, table definitions, and transformation logic for knowledge sharing and onboarding.
- Responded to data-related production issues and supported root cause analysis under guidance.
- Followed coding standards, data governance practices, and client-specific compliance requirements.
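Data quality checks of the kind described here can be expressed as named predicate rules that split a batch into clean rows and quarantined rows, with each rejected row carrying the rules it failed. A small sketch; the rule names and sample rows are illustrative, and real pipelines typically drive the rules from configuration rather than literals:

```python
def validate(rows, rules):
    """Split rows into (clean, rejected); each rejected entry pairs the row
    with the names of the rules it failed, for quarantine and reporting."""
    clean, rejected = [], []
    for row in rows:
        failed = [name for name, check in rules if not check(row)]
        if failed:
            rejected.append((row, failed))
        else:
            clean.append(row)
    return clean, rejected

# Illustrative rules for a transactions feed.
rules = [
    ("has_id", lambda r: bool(r.get("id"))),
    ("amount_positive", lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0),
]
rows = [
    {"id": "T1", "amount": 120.0},
    {"id": "", "amount": 45.0},
    {"id": "T3", "amount": -5.0},
]
clean, rejected = validate(rows, rules)
```

Loading only the clean partition while reporting the rejected one keeps batch jobs running on bad input instead of failing outright.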
Environment: Python, SQL, MySQL, PostgreSQL, CSV/Flat Files, Linux, Git, Shell Scripting, ETL Concepts, Data Validation, Batch Processing, Agile/Scrum, Jira, Confluence.