Nikhil Boinapally - Data Engineer with AI/ML
Email: [email protected]
Contact No: 732-672-5095
Location: Princeton, New Jersey, USA
Relocation: Yes
Visa: H1-B


Professional Summary:
Senior AI/ML Data Engineer with 12+ years of experience delivering enterprise-grade data, machine learning, and generative AI solutions across healthcare, finance, insurance, and supply chain domains.
Strong expertise in building production-ready LLM platforms, including RAG-based search systems, AI agents, and enterprise NLP pipelines.
Proven ability to architect scalable AI systems that integrate seamlessly with data lakes, ERP platforms, and cloud-native infrastructure.
Hands-on experience with Retrieval-Augmented Generation (RAG) using vector databases to improve accuracy, traceability, and trust in AI outputs.
Extensive background in AI agent orchestration using LangChain and CrewAI for autonomous workflows and decision support systems.
Skilled in model training, fine-tuning, and optimization using TensorFlow and PyTorch for NLP, classification, anomaly detection, and forecasting use cases.
Strong foundation in MLOps practices, including model versioning, CI/CD automation, experiment tracking, and production monitoring.
Deep experience in feature engineering and data ingestion pipelines supporting both traditional ML and modern GenAI workloads.
Adept at processing structured and unstructured data including clinical documents, contracts, claims data, and transactional datasets.
Proven track record of deploying real-time and batch inference services using REST APIs and event-driven architectures.
Strong understanding of data governance, explainability, and auditability, ensuring AI systems meet regulatory and compliance standards.
Experienced in cloud-native AI engineering across AWS environments using containerization and orchestration platforms.
Effective collaborator with product owners, clinicians, compliance teams, and business stakeholders to translate requirements into AI-driven solutions.
Demonstrated ability to optimize model performance, cost, and latency in high-scale enterprise environments.
Background in building enterprise data platforms supporting analytics, ML, BI, and regulatory reporting.
Known for delivering reliable, secure, and measurable AI solutions aligned with business outcomes and operational KPIs.
Technical Skills:
Programming Languages: Python, SQL, Scala
AI / ML Frameworks: TensorFlow, PyTorch, Scikit-learn, Hugging Face Transformers
Generative AI & LLMs: GPT-4, LLaMA, Mistral, OpenAI APIs, Prompt Engineering
RAG & AI Agents: RAG Pipelines, LangChain, CrewAI, Vector Search
Vector Databases: Pinecone, FAISS
Data Engineering: Apache Spark, PySpark, Feature Engineering Pipelines
Cloud Platforms: AWS (EC2, EKS, Lambda, S3, CloudWatch)
Data Stores: Snowflake, PostgreSQL, MongoDB, Oracle, Redshift
MLOps & Model Management: MLflow, Kubeflow, Model Registry, CI/CD
Containerization & Orchestration: Docker, Kubernetes
Workflow Orchestration: Airflow, Step Functions
APIs & Microservices: REST APIs, FastAPI, Event-Driven Architecture
Monitoring & Observability: Prometheus, Grafana, CloudWatch
Security & Compliance: HIPAA, SOC2, RBAC, Data Encryption, Audit Logging
DevOps & Automation: GitHub Actions, Jenkins, Terraform
Methodologies & Tools: Agile/Scrum, Jira, Confluence

Professional Experience:
Client: Cardinal Health, Dublin, OH. Jan 2025 - Present
Role: Sr. AI/ML Data Engineer
Project Scope: Enterprise AI modernization initiative for Cardinal Health focused on improving supply chain forecasting, clinical document intelligence, and internal knowledge discovery across pharmaceutical distribution and healthcare operations. The project involved building LLM-powered platforms, RAG-based search systems, and autonomous AI agents integrated with existing ERP, data lake, and cloud infrastructure. Emphasis was placed on production-ready ML pipelines, governance, scalability, and regulatory compliance (HIPAA, SOC2).
Responsibilities:
Architected enterprise-scale LLM solutions to support supply chain analytics, contract intelligence, and clinical document processing using GPT-based and open-source models (LLaMA, Mistral) aligned with healthcare compliance requirements.
Led the design of Retrieval-Augmented Generation (RAG) pipelines leveraging vector databases (Pinecone/FAISS) and curated knowledge sources including SOPs, FDA documents, vendor contracts, and internal wikis, significantly improving answer accuracy and traceability (an illustrative retrieval sketch appears after the Environment line for this engagement).
Established MLOps frameworks covering model versioning, experiment tracking, CI/CD automation, and monitoring using MLflow, Docker, Kubernetes, and GitHub Actions, ensuring seamless promotion from development to production.
Built AI agents for automated data validation, anomaly detection, and workflow orchestration using LangChain / CrewAI, enabling autonomous decision support for supply chain exceptions and inventory risk alerts.
Developed advanced NLP pipelines for document classification, entity extraction, summarization, and semantic similarity using Transformers, spaCy, and custom TensorFlow models for high-volume healthcare data.
Directed model training, fine-tuning, and optimization workflows using TensorFlow and PyTorch, including parameter-efficient tuning (LoRA/PEFT) to reduce inference cost and latency.
Designed scalable real-time and batch inference services exposed through REST and event-driven architectures, supporting downstream applications and internal analytics platforms.
Implemented data ingestion and feature engineering pipelines integrating structured and unstructured data from Snowflake, S3, APIs, and streaming sources, ensuring data quality and lineage.
Enforced model governance, explainability, and auditability using SHAP and custom evaluation metrics to meet healthcare regulatory and internal risk standards.
Collaborated with product owners, data architects, and compliance teams to translate business requirements into AI-driven solutions with measurable KPIs and operational impact.
Mentored junior engineers and data scientists, conducted design reviews, and set coding, MLOps, and deployment standards across the AI engineering team.
Supported client stakeholders during production releases, post-deployment monitoring, and performance tuning, ensuring system reliability under enterprise workloads.
Environment: LLMs (GPT-4, LLaMA, Mistral), RAG, AI Agents, TensorFlow, PyTorch, Hugging Face, NLP, Transformers, spaCy, SHAP, MLflow, Docker, Kubernetes, GitHub Actions, CI/CD, Model Registry, Monitoring & Alerting, Snowflake, Amazon S3, Vector DBs (Pinecone, FAISS), SQL, NoSQL, AWS (EC2, EKS, Lambda, S3, CloudWatch), APIs, Microservices, Python, REST APIs, LangChain, FastAPI, Pandas, NumPy, HIPAA, SOC2, Role-based access, Data encryption, Audit logging.
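As a point of reference for the RAG work described above, the following is a minimal, hypothetical sketch of the retrieval step using FAISS and a sentence-transformers embedding model; the documents, query, and model name are illustrative placeholders, not artifacts of the actual implementation.

```python
# Minimal RAG retrieval sketch (illustrative only): embed a few documents,
# index them in FAISS, and fetch the top matches for a user question.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "SOP-114: cold-chain handling for temperature-sensitive pharmaceuticals.",
    "Vendor contract clause: delivery SLAs and penalty thresholds.",
    "FDA guidance excerpt on serialization and track-and-trace requirements.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(docs, normalize_embeddings=True)

# Inner-product search over normalized vectors is cosine-similarity search.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(np.asarray(doc_vectors, dtype="float32"))

query = "What penalties apply when a vendor misses its delivery SLA?"
query_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)

# In a full pipeline, the retrieved passages are injected into the LLM prompt
# as grounding context so answers stay traceable to source documents.
for score, doc_id in zip(scores[0], ids[0]):
    print(f"score={score:.3f} :: {docs[doc_id]}")
```

Swapping FAISS for a managed store such as Pinecone changes only the indexing and query calls; the embed-retrieve-ground pattern stays the same.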

Client: Express Scripts, St. Louis, MO. Oct 2023 - Dec 2024
Role: Sr. AI/MLOps Engineer
Project Scope: The project focused on building enterprise-grade AI and machine learning solutions to optimize pharmacy benefit management (PBM) operations, improve prescription fulfillment accuracy, reduce fraud and waste, and enhance customer and clinician experiences. The engagement involved deploying large-scale ML models, NLP pipelines, and Generative AI solutions across claims processing, prior authorization automation, member support chat systems, and clinical insights. The scope also included implementing MLOps frameworks, LLM-based RAG architectures, and AI agents integrated with internal healthcare data platforms while ensuring HIPAA compliance, scalability, and production reliability.
Responsibilities:
Led end-to-end ownership of AI/ML solutions supporting prescription claims analysis, medication adherence prediction, and prior authorization decisioning workflows used by millions of members
Built and optimized NLP pipelines for processing unstructured clinical notes, prescription data, and provider communications using transformer-based models and domain-specific embeddings
Implemented LLM-powered Retrieval Augmented Generation (RAG) solutions to enable accurate, context-aware responses for internal clinical teams and member support applications
Integrated vector databases (FAISS / Pinecone) with enterprise data sources to support semantic search and knowledge retrieval across formularies, policies, and clinical guidelines
Designed and deployed AI Agents capable of orchestrating multi-step tasks such as eligibility checks, formulary validation, and claims exception handling with minimal human intervention
Trained, fine-tuned, and evaluated deep learning models using TensorFlow and PyTorch for classification, sequence modeling, and anomaly detection use cases
Established robust MLOps pipelines for model versioning, automated retraining, CI/CD, monitoring, and drift detection using tools like MLflow, Kubeflow, and GitHub Actions (an illustrative MLflow tracking sketch appears after the Environment line for this engagement)
Collaborated with data engineering teams to process large-scale healthcare datasets using Spark and cloud-native pipelines on AWS
Improved model inference latency and throughput by optimizing feature pipelines, batching strategies, and GPU utilization in production environments
Partnered with compliance and security teams to ensure HIPAA-aligned data handling, auditability, and access controls across AI systems
Conducted model explainability and bias assessments using SHAP and LIME to support regulatory requirements and clinical transparency
Supported A/B testing and real-world performance analysis to measure business impact and continuously improve AI-driven workflows
Mentored junior engineers and reviewed model architectures, feature engineering approaches, and deployment strategies
Worked closely with product owners, clinicians, and stakeholders to translate business requirements into scalable AI solutions
Environment: Python, SQL, TensorFlow, PyTorch, Scikit-learn, Hugging Face Transformers, OpenAI APIs, LangChain, RAG, AI Agents, Prompt Engineering, MLflow, Kubeflow, GitHub Actions, Docker, Kubernetes, Apache Spark, Pandas, NumPy, PostgreSQL, MongoDB, Vector DBs (FAISS, Pinecone), AWS (S3, EC2, EKS, Lambda), Grafana, Prometheus, HIPAA, Jira, Confluence, Git.
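For context on the MLOps tooling mentioned above, here is a minimal MLflow tracking and model-logging sketch; the experiment name, model, and data are stand-ins chosen only to keep the example self-contained, not Express Scripts assets.

```python
# Minimal MLflow sketch (illustrative only): log params, a metric, and a
# versioned model artifact that a CI/CD pipeline could later promote.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("adherence-risk-demo")  # hypothetical experiment name
with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_params(params)
    mlflow.log_metric("test_auc", auc)
    mlflow.sklearn.log_model(model, "model")  # logged as a versioned artifact
```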

Client: BNP Paribas, New York, NY. Aug 2021 - Sep 2023
Role: Machine Learning Engineer
Project Scope: The project focused on building and maintaining enterprise-scale data platforms to support risk analytics, compliance reporting, fraud detection, and trading analytics across BNP Paribas US operations. The team handled high-volume transactional, market, and reference data, integrating multiple upstream systems into a centralized analytics platform. Responsibilities included real-time and batch data ingestion, feature engineering for ML models, regulatory data validation, and enabling downstream consumption for data science, BI, and regulatory teams under strict banking security and governance standards.
Responsibilities:
Built scalable data pipelines to ingest high-frequency transactional and market data from multiple internal banking systems and third-party vendors into centralized data platforms
Managed end-to-end ETL workflows handling structured and semi-structured data to support risk, compliance, and AML use cases
Created optimized data models and curated datasets for machine learning teams supporting fraud detection and credit risk scoring
Supported feature engineering processes by transforming raw financial data into ML-ready datasets with version control and lineage tracking
Tuned Spark jobs to improve performance for large-scale joins, aggregations, and window functions across billions of records (an illustrative join-tuning sketch appears after the Environment line for this engagement)
Automated data quality checks, reconciliation reports, and validation rules to meet regulatory and audit requirements
Integrated real-time data streams using Kafka to support near-real-time fraud alerts and transaction monitoring
Collaborated closely with data scientists to operationalize ML models and deploy batch and streaming inference pipelines
Implemented data partitioning, indexing, and compression strategies to reduce query latency and storage costs
Maintained secure access controls and data masking policies to comply with internal governance and regulatory standards
Supported CI/CD pipelines for data workflows, ensuring smooth deployments across development, QA, and production environments
Monitored pipeline failures, performance metrics, and SLA compliance using logging and alerting tools
Assisted regulatory and audit teams by providing data extracts, lineage documentation, and reproducible reports
Coordinated with global teams across US and Europe to align data standards and ingestion frameworks
Handled production support issues, root cause analysis, and preventive fixes for recurring data failures
Environment: Python, PySpark, SQL, Scala, Apache Spark, Hadoop, Kafka, AWS (S3, EC2, EMR), HDFS, PostgreSQL, Oracle, Snowflake, MLlib, Scikit-learn, Feature Engineering Pipelines, Airflow, Control-M, Git, Jenkins, CI/CD, CloudWatch, Splunk, Tableau, Power BI, Linux, Unix.
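As an illustration of the Spark join tuning noted above, here is a minimal PySpark sketch that prunes partitions before a join and broadcasts the small reference table; the S3 paths, table names, and columns are assumptions for illustration, not the bank's actual schema.

```python
# Minimal PySpark join-tuning sketch (illustrative only): filter early,
# broadcast the small dimension table, and control shuffle parallelism.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("txn-enrichment-demo")
    .config("spark.sql.shuffle.partitions", "400")  # sized to the cluster, not the default 200
    .getOrCreate()
)

transactions = spark.read.parquet("s3://example-bucket/transactions/")      # large fact table (hypothetical path)
counterparties = spark.read.parquet("s3://example-bucket/counterparties/")  # small reference table (hypothetical path)

enriched = (
    transactions
    .filter(F.col("trade_date") >= "2023-01-01")                 # prune partitions before joining
    .join(F.broadcast(counterparties), on="cp_id", how="left")   # skip the shuffle for the small side
    .withColumn("notional_usd", F.col("notional") * F.col("fx_rate"))
)

enriched.write.mode("overwrite").partitionBy("trade_date").parquet("s3://example-bucket/enriched/")
```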

Client: S&P Global, New York, NY. Jun 2020 - Apr 2021
Role: Data Engineer
Project Scope: S&P Global maintains large-scale financial, market intelligence, and ratings data consumed by internal analytics teams, external clients, and regulatory bodies. The project focused on building and maintaining scalable data pipelines to ingest, process, and transform high-volume market data from multiple vendors and internal systems into centralized data platforms. The platform supported real-time analytics, historical reporting, risk modeling, and downstream BI applications while ensuring data accuracy, governance, and regulatory compliance.
Responsibilities:
Built and maintained end-to-end data pipelines handling structured and semi-structured financial data sourced from vendor feeds, REST APIs, FTP servers, and internal transactional systems
Worked closely with product owners, quantitative analysts, and downstream consumers to understand data requirements related to pricing, ratings, and market intelligence datasets
Created robust ETL workflows using Python and SQL to cleanse, standardize, and enrich raw market data before loading into enterprise data warehouses
Managed ingestion of high-frequency data into cloud-based storage and processing platforms, ensuring low latency and high availability for analytical use cases
Optimized complex SQL queries and data models to improve performance of reporting and analytics workloads used by internal research and client-facing teams
Supported migration of on-prem data pipelines to AWS, including refactoring jobs to leverage S3, Redshift, and Glue for scalable processing
Ensured data quality through validation checks, reconciliation processes, and exception handling to meet regulatory and audit requirements (an illustrative validation sketch appears after the Environment line for this engagement)
Partnered with DevOps teams to automate deployments and scheduling of batch jobs using CI/CD pipelines and workflow orchestration tools
Monitored data pipelines in production, analyzed failures, and resolved performance bottlenecks to meet strict SLA commitments
Participated in code reviews and followed enterprise data engineering standards, security policies, and best practices
Worked with governance teams to apply metadata management, data lineage, and access controls across sensitive financial datasets
Supported UAT and production releases by coordinating with QA and business stakeholders, ensuring accurate data delivery across environments
Environment: Python, SQL, PySpark, AWS (S3, Redshift, Glue, EC2), Apache Airflow, Informatica, Snowflake, Oracle, PostgreSQL, Unix/Linux, CSV, JSON, Parquet, XML, Git, Jenkins, AWS CodePipeline, Agile/Scrum.
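A minimal pandas sketch of the validation and reconciliation checks described above; the file names, columns, and tolerance are hypothetical and stand in for an actual vendor feed and its control totals.

```python
# Minimal data-quality sketch (illustrative only): completeness and
# reconciliation checks against a vendor control file before loading.
import pandas as pd

feed = pd.read_csv("vendor_prices_20210301.csv")      # hypothetical vendor feed
control = pd.read_csv("vendor_control_20210301.csv")  # hypothetical control totals

issues = []

# Completeness: every instrument row must carry a price.
if feed["price"].isna().any():
    issues.append(f"{int(feed['price'].isna().sum())} rows missing price")

# Reconciliation: row count and summed notional must match the control file.
if len(feed) != int(control.loc[0, "row_count"]):
    issues.append("row count mismatch vs control file")
if abs(feed["notional"].sum() - control.loc[0, "notional_total"]) > 0.01:
    issues.append("notional total outside tolerance")

# Failures are raised so the exception can be routed to alerting/rerun instead
# of silently loading bad data into the warehouse.
if issues:
    raise ValueError("; ".join(issues))
print(f"{len(feed)} rows validated and ready to load")
```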

Client: Oracle, Bangalore, India. May 2016 - Jul 2019
Role: Jr. Data Engineer
Project Scope: Enterprise-scale data engineering initiative supporting Oracle's internal analytics and reporting platforms. The project focused on building and maintaining reliable data pipelines to process high-volume transactional, customer, and product data from multiple source systems. The platform enabled business intelligence, operational reporting, and downstream analytics by delivering cleansed, transformed, and performance-optimized datasets to data warehouses and reporting tools. Emphasis was placed on data accuracy, scalability, automation, and adherence to enterprise data governance standards.
Responsibilities:
Built end-to-end data pipelines to ingest data from Oracle databases, flat files, and external systems into centralized data warehouse environments
Created complex SQL queries, joins, and transformations to handle large datasets and ensure data consistency across multiple business domains
Developed ETL workflows using Informatica to extract, transform, and load structured and semi-structured data based on business requirements
Handled data cleansing, deduplication, validation rules, and reconciliation checks to improve overall data quality
Tuned SQL queries, indexes, and ETL jobs to improve performance for high-volume batch processing
Supported incremental and full data loads, including change data capture (CDC) logic for near real-time reporting needs (an illustrative incremental-load sketch appears after the Environment line for this engagement)
Coordinated with business analysts and reporting teams to translate functional requirements into scalable data models
Maintained dimensional and fact tables following star and snowflake schema designs for analytics use cases
Performed root cause analysis for data issues, load failures, and performance bottlenecks, ensuring timely resolution
Automated recurring data workflows using shell scripts and scheduling tools to reduce manual intervention
Participated in code reviews, data validation cycles, and production deployments following change management processes
Ensured compliance with data security, access controls, and internal governance standards
Provided ongoing production support, monitoring daily loads, and resolving data anomalies within SLA timelines
Environment: Oracle Database, Informatica PowerCenter, SQL, PL/SQL, Unix, Linux, Shell Scripting, Data Warehousing, ETL, Star Schema, Snowflake Schema, Jenkins (basic), Control-M, Git, Agile/Scrum.
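An illustrative sketch of the incremental-load (CDC-style) pattern referenced above: read the last high-water mark, pull only rows changed since then, and advance the mark. sqlite3 stands in for the Oracle source purely to keep the example self-contained, and all table and column names are hypothetical.

```python
# Minimal incremental-load sketch (illustrative only): high-water-mark CDC.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE stg_orders (order_id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (table_name TEXT, last_loaded_at TEXT);
    INSERT INTO etl_watermark VALUES ('src_orders', '2016-01-01T00:00:00');
    INSERT INTO src_orders VALUES (1, 120.0, '2016-03-10T08:15:00'),
                                  (2,  75.5, '2015-12-30T17:40:00');
""")

# 1. Read the current high-water mark for this source table.
(last_loaded,) = conn.execute(
    "SELECT last_loaded_at FROM etl_watermark WHERE table_name = 'src_orders'"
).fetchone()

# 2. Pull only rows changed after the mark (the incremental delta).
delta = conn.execute(
    "SELECT order_id, amount, updated_at FROM src_orders WHERE updated_at > ?",
    (last_loaded,),
).fetchall()

# 3. Load the delta into staging and advance the watermark for the next run.
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", delta)
new_mark = max((row[2] for row in delta), default=last_loaded)
conn.execute(
    "UPDATE etl_watermark SET last_loaded_at = ? WHERE table_name = 'src_orders'",
    (new_mark,),
)
conn.commit()
print(f"loaded {len(delta)} changed rows; watermark advanced to {new_mark}")
```
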
Client: Thoughtworks, Pune, India. Jan 2014 - Apr 2016
Role: Software Developer
Project Scope: The project supported multiple client engagements focused on data-driven application development and analytics enablement across retail and enterprise domains. The engagement involved building and maintaining data pipelines, supporting reporting systems, improving data reliability, and assisting senior engineers with scalable data processing solutions. The work emphasized clean data ingestion, transformation, validation, and close collaboration with application and business teams in an Agile delivery environment.
Responsibilities:
Supported end-to-end data pipelines that ingested data from relational databases, flat files, and external systems for reporting and analytics use cases.
Handled data extraction and transformation tasks using Python and SQL to prepare datasets for downstream consumption.
Maintained batch data processing jobs and ensured timely availability of clean, validated data for business users.
Assisted in writing optimized SQL queries for reporting, reconciliation, and ad-hoc analysis.
Performed data quality checks, validation rules, and basic anomaly detection to improve data reliability.
Collaborated with application developers to align data models with application and reporting requirements.
Supported integration of data pipelines with reporting tools and internal dashboards.
Monitored scheduled jobs and addressed failures by analyzing logs and correcting data or configuration issues.
Participated in Agile ceremonies, providing updates on data pipeline status and resolving sprint-level issues.
Assisted senior data engineers with performance tuning of queries and batch jobs.
Managed version control for data scripts and SQL artifacts using Git.
Supported deployment activities by validating data pipelines across development and test environments.
Documented data flows, table definitions, and transformation logic for knowledge sharing and onboarding.
Responded to data-related production issues and supported root cause analysis under guidance.
Followed coding standards, data governance practices, and client-specific compliance requirements.
Environment: Python, SQL, MySQL, PostgreSQL, CSV/Flat Files, Linux, Git, Shell Scripting, ETL Concepts, Data Validation, Batch Processing, Agile / Scrum, Jira, Confluence.