Sai Hemanth Paturi - Data Scientist
[email protected]
Location: USA
Relocation: Yes
Visa: Green Card
Sai Hemanth Paturi
Data Scientist
+1 (513) 987-0449 | LinkedIn: www.linkedin.com/in/sai-hemanth-p/ | [email protected]

PROFESSIONAL SUMMARY:
I am a Senior Data Scientist with 11+ years of experience building end-to-end machine learning systems, cloud-native pipelines, and serverless architectures across Banking, Healthcare, and Aviation domains.
I have strong hands-on expertise in AWS Lambda, Node.js, Python, API Gateway, SNS/SQS, S3, RDS, DynamoDB, CloudWatch, IAM, KMS, VPC, enabling scalable and event-driven ML workflows.
I have proven experience designing ML models, distributed pipelines, and microservices aligned with modern cloud engineering practices.
I am skilled in developing and deploying ML/DL models using Python, Scikit-Learn, Spark, TensorFlow, integrating them into serverless inference layers for real-time predictions.
I am experienced with CI/CD (Jenkins, AWS CodePipeline), CloudFormation IaC, automated testing, logging, and monitoring for production-grade ML systems.
I have a strong foundation in data engineering, OOAD, system design, distributed architectural patterns, and Agile/Scrum delivery.
I am adept at blending ML expertise with serverless/cloud development to deliver intelligent, cost-optimized, and high-performing data-driven applications.
Enterprise MLOps architecture: specialized in building end-to-end model lifecycles using Kubeflow and Vertex AI for scalable, cloud-native orchestration.
Automated Machine Learning (AutoML): expert in streamlining model discovery and feature engineering with H2O.ai and AutoFlow, significantly reducing time-to-production for predictive analytics and credit risk frameworks.
Observability & distributed tracing: hands-on architect of observability stacks using OpenTelemetry (OTel), MLflow, and Jaeger to profile latency and monitor complex Vertex AI and Kubeflow pipelines.
I have implemented model optimization techniques such as cross validation, hyperparameter tuning, class balancing and threshold calibration to maximize AUC, precision, recall and stability across diverse production environments.
I own data quality frameworks that monitor drift, detect anomalies, validate schemas, track lineage and enforce completeness rules, protecting downstream models and executive reporting from hidden data issues.
I have designed and maintained production ready model services using Docker containers, AWS Lambda functions, EMR clusters and orchestrated workflows, delivering low latency, resilient and cost efficient scoring capabilities.
Collaborated closely with data engineers, analysts, product owners, and business leaders in Agile/Scrum environments, translating requirements into technical roadmaps and delivering incremental value through well-planned sprints.
Created dashboards and analytical stories in Power BI and Tableau, turning complex model outputs, risk scores and operational metrics into intuitive visuals for senior leadership and frontline teams.
Applied strong software engineering practices including modular Python, reusable components, code reviews, unit testing, and version control to ensure maintainable, robust, and audit-friendly data science codebases.
Led experimentation programs using A/B testing, champion/challenger approaches, and controlled experiments to quantify business uplift, measure policy impact, and guide data-driven decision making.
Documented models, assumptions, data sources and performance results through structured model cards, data dictionaries and technical summaries that support audits, risk reviews and cross team collaboration.
Mentored junior data scientists on statistical thinking, feature engineering, model selection and coding standards, helping teams deliver consistent, high quality work in demanding enterprise environments worldwide.
I stay current with advances in machine learning, responsible AI, and large language models, incorporating appropriate techniques into production solutions while maintaining focus on reliability, governance, and business value.
TECHNICAL SKILLS:
Programming & Analytics Python (Pandas, NumPy, Scikit-Learn, Statsmodels), R, SQL (MySQL, SQL Server, Oracle), HiveQL, Spark SQL, Jupyter, Colab, VS Code, GitHub

Data Engineering & Pipelines Data Wrangling, Cleaning, Transformation (ETL), PySpark, Pandas pipelines, Batch Processing, Data Modeling, Schema Design (Star/Snowflake), Feature Stores, Data Quality Checks (Drift, Missingness, Outliers)

Big Data & Distributed Computing Apache Spark, Hive, HDFS, MapReduce, Kafka, Large-scale dataset processing and distributed ML workflows

Databases & Storage SQL Server, MySQL, Oracle, Snowflake, Redshift (exposure), AWS S3, HDFS

Visualization & BI Tools Power BI, Matplotlib, Seaborn, Dashboards, Exploratory Data Analysis (EDA), Storytelling

LLM Ecosystem Transformers (GPT-4/3.5, BERT), Hugging Face, Vector Databases (Pinecone, FAISS, Milvus), RAG Workflow Concepts

Cloud & DevOps AWS (EC2, S3, Lambda, IAM), Docker, Git, CI/CD Basics (GitHub Actions)

Software Engineering Practices OOP, Modular Python, Clean Code, Unit Testing (PyTest), Version Control, API Integration

MLOps & Orchestration Kubeflow (Pipelines, Katib), Vertex AI (Training, Prediction), H2O.ai (AutoML, Driverless AI), AutoFlow, MLflow, Airflow

Testing & Engineering TDD/BDD, Python Unit Testing (PyTest), Mocking, End-to-End Validation (Playwright, Selenium), OOP, Modular Python, Clean Code

PROFESSIONAL EXPERIENCE:
Client: Citi Bank, New York, NY    August 2023 - Present
Data Scientist
Responsibilities:
I have developed a credit risk prediction system using Python, Scikit-Learn, SQL, and Power BI, improving risk detection and delivering scoring workflows that accelerated underwriting decisions across business teams.
Built an end-to-end data science pipeline covering ingestion, feature engineering, model training, evaluation, and reporting, ensuring data flow from sources to model insights consumed by analysts.
Collaborated in Agile and Scrum teams with product owners, analysts and engineers to refine requirements, manage deliverables and prioritize impact models aligned with regulatory, compliance and risk initiatives.
Developed ingestion workflows connecting SQL databases, APIs, S3 repositories and third party credit data sources, automating pipeline runs to cut manual effort and keep datasets current for modeling.
Processed, cleaned, and transformed structured datasets using Pandas, NumPy, and statistical techniques, applying normalization, outlier handling, encoding, and feature refinement to produce model-ready data for robust risk models.
Stored datasets in SQL Server and Hive environments with optimized schemas, enabling efficient data access that supported workflows, model development and regulatory reporting for underwriting and management teams.
Implemented data quality checks for drift, missing data, correlations and anomalies, keeping model inputs stable, limiting performance degradation and maintaining reliable analytics as customer conditions and behaviors evolved.
Leveraged H2O.ai Driverless AI capabilities to generate high-value features through automated engineering, including time-series structures and interaction terms, strengthening the signal in mission-critical risk models.
Implemented production-ready CI/CD workflows using GitHub Actions and Dockerized components, enabling automated deployments of modular Scikit-Learn pipelines to AWS Lambda and serverless inference layers.
Engineered sophisticated ingestion workflows using PySpark and SQL Server, automating data acquisition from APIs and S3 repositories while applying AutoFlow logic for real-time drift detection and schema validation.
Applied rigorous statistical analysis, including ANOVA and Hypothesis Testing, and conducted data quality audits for Critical Data Elements (CDEs) to ensure regulatory compliance and model interpretability in high-governance environments.
Collaborated in Agile/Scrum environments to translate complex regulatory requirements into technical roadmaps, utilizing Vertex AI Model Registry for versioning and lineage tracking.
Designed and analyzed split tests to measure policy impact and credit decision flows, interpreting significance and confidence intervals to guide business strategies and model refinement.
Environment: Python (Scikit-Learn, Pandas, NumPy, NLTK), SQL (PL/SQL, T-SQL), Scala, Kubeflow, Vertex AI, H2O.ai, AutoFlow, MLflow, OpenTelemetry (OTel), Jaeger, Databricks (dbx), Azure, AWS (EMR, Lambda, S3, DynamoDB), Hadoop (HDFS, MapReduce, Hive), Apache Spark (PySpark, MLlib), Kafka, Oracle 12c, SQL Server, Netezza, Teradata, MongoDB, Cassandra, MariaDB, Erwin r9.6, SAP CRM, Tableau, SSRS, JSON, XML.

Client: UHG OPTUM, Minnetonka, MN    May 2022 - July 2023
Machine Learning Engineer
Responsibilities:
I have delivered a risk scoring machine learning solution using Python, Scikit-Learn, SQL, and Spark, improving patient identification by 28% and automating scoring workflows for clinical teams across organizations.
Designed scalable machine learning architecture using distributed Spark pipelines and centralized model repositories, ensuring seamless data flow from ingestion, feature store, model training, to real time scoring services.
Collaborated in Agile and Scrum with Product Owners, data engineers, and clinicians, leading sprint planning, backlog refinement, delivering machine learning features aligned with clinical outcomes and regulatory goals.
Built ingestion pipelines connecting APIs, S3, SQL Server, Oracle, Netezza, Kafka streams, automating refresh cycles and ensuring timely availability of clinical, claims, provider datasets for analytics and modeling.
Performed large scale data cleaning and transformation using PySpark, Pandas, statistical methods, addressing missing values, outliers, categorical encoding, normalization to maintain consistent training datasets across diverse patient cohorts.
Persisted curated datasets in Hive, Teradata, and operational data stores, designing optimized schemas and partitions to support fast access for downstream machine learning workflows and cross functional analytics.
Implemented feature quality and integrity checks, including drift detection, bias assessment, correlation tracking, anomaly alerts, ensuring robust feature pipelines and reproducible outputs across multiple patient populations and regions.
Selected machine learning algorithms including Logistic Regression, Random Forest, Gradient Boosting, and XGBoost, matching techniques to clinical risk scoring use cases requiring interpretability, precision, recall, and strict regulatory compliance.
Applied advanced machine learning techniques for risk scoring, forecasting, feature engineering, and natural language processing on unstructured clinical text, generating features from claims, results, diagnoses, and provider notes.
Performed performance optimization through hyperparameter tuning, threshold calibration, and cross validation, improving model precision, recall, area under curve, and stability for high risk patient populations across multiple regions.
Developed modular training and inference pipelines using Spark MLlib and Scikit Learn, creating reusable components that standardize preprocessing, feature extraction, training, validation, scoring and support machine learning workflows.
Containerized machine learning models using Docker and integrated scalable scoring workloads on AWS Lambda and EMR clusters, reducing inference latency and improving cost efficiency for healthcare analytics workloads.
Orchestrated machine learning workflows with Airflow and implemented continuous integration and delivery using GitHub Actions, ensuring consistent deployments, automated testing, version control, and reliable lifecycle management for models.
Monitored model performance using CloudWatch, EvidentlyAI, and custom drift dashboards, implementing alerts for data drift, accuracy degradation, latency spikes, supporting timely retraining, recalibration, and compliance driven validation cycles.
Created comprehensive documentation including model cards, schema definitions, drift reports, deployment procedures, and training materials, leading handoff sessions with analytics, compliance, and engineering teams to ensure operational readiness.
Environment: Python, R, SQL, Oracle 10g, DB2, Hive, MongoDB, HDFS, Hadoop, MapReduce, ODS, OLTP, OLAP, Metadata Repositories, MS Excel, Mainframes, MS Visio, Rational Rose, Distributed File Systems, Data Warehouse Tools, and Enterprise Data Modeling Environments.

Client: State of VA, Richmond, Virginia    April 2019 - May 2022
Data Engineer

Responsibilities:

At the State of Virginia, I worked on full-scale data engineering and ML pipelines using Python, Pandas, NumPy, Spark, and Scikit-learn, supporting statewide analytics modernization while improving data reliability, operational efficiency, and enterprise-wide intelligence capabilities across mission-critical environments.
Designed large-scale distributed architectures leveraging Hadoop, Spark, Kafka, HBase, Hive, and MongoDB to process high-volume datasets with low latency, ensuring scalable ingestion, resilient compute orchestration, and fault-tolerant data processing across multi-cluster cloud and on-prem hybrid ecosystems.
Collaborated through Agile/Scrum methodologies with structured 2-week sprint cadences, partnering closely with Product Owners, SMEs, Data Architects, and cross-functional engineering teams to align backlog priorities, refine requirements, and deliver consistent high-quality outcomes across iterative releases.
Built sophisticated ingestion workflows integrating JSON, XML, REST APIs, Kafka Streams, AWS S3, SQL Server, Oracle, DB2, and NoSQL sources enabling continuous data acquisition, multi-system synchronization, and real-time ingestion throughput for enterprise analytical consumption layers.
Engineered transformation pipelines using PySpark, SparkSQL, MLLib, DBT-style modular patterns, and reusable data validation frameworks to cleanse, normalize, enrich, and prepare structured/unstructured datasets with strict quality checks, ensuring accuracy, integrity, and analytical readiness.
Managed extensive multi-tier data storage ecosystems across AWS (S3, EMR, DynamoDB), Oracle, DB2, Hive, OLAP cubes, and operational data stores supporting analytics workloads, machine learning pipelines, dashboard reporting, and cross-domain data accessibility with optimized performance.
Applied comprehensive Data Quality Validation for Critical Data Elements (CDEs) including anomaly detection, schema consistency checks, referential integrity audits, and lineage verification ensuring regulatory compliance, downstream accuracy, and trustworthy enterprise data assets.
Developed and optimized distributed ML solutions using Spark MLLib, Scala-based algorithms, Hadoop tools, and Python libraries resulting in a documented 45% improvement in user lifetime metrics, increased engagement, and enhanced analytical model precision across targeted business initiatives.
Implemented production-grade classification models using Logistic Regression, Decision Trees, KNN, and Naïve Bayes while scaling Python utilities, SciPy computations, and NumPy-backed vector operations with performance tuning, modular architecture, and repeatable ML workflow enhancements.
Built and deployed cloud-ready CI/CD data workflows using Dockerized components, AWS EMR, Lambda, GitHub Actions, and Jenkins pipelines while enabling automated deployments, real-time monitoring via CloudWatch, anomaly dashboards, SQL/PL-SQL metadata artifacts, and complete Confluence documentation.

Environment: Python, R, PySpark, Spark SQL, Spark MLlib, Databricks, AWS (EC2, S3, EMR, Lambda), Hive, Hadoop, HDFS, Snowflake, Teradata, Oracle 10g, DB2, SQL, Pandas, NumPy, Scikit-learn, XGBoost, Hugging Face Transformers, spaCy, NLTK, Causal Impact (R), ggplot2, dplyr, Kafka, Spark Streaming, Tableau, Power BI, JSON, XML, ODS/OLTP/OLAP Systems, Feature Engineering Tools, Healthcare Data Models (Claims, EHR, Provider Files), UML/Visio.

Client: American Airlines, Fort Worth, TX    December 2017 - March 2019
Sr. Python Developer
Responsibilities:

I have developed Python-based operational automation tools and data processing frameworks improving system efficiency by 32%; built reusable modules supporting flight operations, scheduling analytics, and real-time operational reporting.
Designed microservices-driven architecture using Python, REST APIs, and event-driven communication; ensured scalable, fault-tolerant integration across scheduling, crew management, and maintenance systems.
Worked within Agile/Scrum methodology, collaborating with Product Owners, QA engineers, and cross-functional teams; contributed to sprint planning, backlog refinement, and iterative delivery of high-value automation features.
Developed ingestion pipelines to pull data from APIs, SQL Server, Oracle, and Kafka-based airline operational feeds; automated extraction workflows for flight schedules, crew datasets, aircraft health logs, and historical operations data.
Processed and transformed large datasets using Python (Pandas, NumPy) and distributed processing patterns; implemented cleansing, parsing, aggregations, and validations required for analytics and downstream services.
Stored and managed structured/unstructured data across SQL Server, Oracle, and HDFS environments; created optimized schemas, indexes, and partition strategies to support high-performance data access.
Built data models and business logic layers using Python OOP principles, reusable components, and modular design; ensured maintainable codebases enabling rapid enhancements, debugging, and integration with enterprise systems.
Implemented predictive features and rule-based engines leveraging scoring logic, clustering, and regression utilities; supported decision-making for operations planning, crew optimization, and flight delay risk reduction.
Optimized Python performance through vectorization, caching, concurrency, and memory-efficient techniques; achieved significant reductions in runtime for high-volume processing tasks.
Integrated Python apps with Spark, Kafka, and Hadoop components for scalable airline data workflows; enabled near-real-time ingestion and processing of operational event streams.
Built REST APIs using Flask and Django to expose business logic and ML-enabled insights to airline applications; ensured secure, scalable API consumption across internal platforms.
Containerized Python services using Docker and deployed them on on-prem and cloud-based environments; ensured portability, standardization, and consistent runtime performance across environments.
Implemented CI/CD pipelines using Git, Jenkins, and automated unit test suites; enforced continuous delivery standards, reducing deployment time and improving release reliability.
Monitored application performance using log analytics, alerting, and Python instrumentation; resolved production issues proactively and supported 24/7 airline operations stability.
Created documentation including API specs, flow diagrams, code standards, and deployment guides; provided training and handoff sessions to engineering teams for long-term maintainability.
Environment: Python 2.7, MySQL, Microsoft SQL Server, Cassandra, LDAP, Git, Bitbucket, Linux, Windows, JSON, XML, HTML, CSS, JavaScript, jQuery, AngularJS, REST APIs, Bootstrap, Rally, Agile/Scrum, PyCharm, PyUnit, PyTest, MOCK, Beautiful Soup, Matplotlib, Apache Directory Studio, DataStax DevCenter, Ansible, Jenkins, and CI/CD automation tools.

Client: Microsoft, India    June 2014 - May 2017
Python Developer
Responsibilities:

I have built Python-based automation and diagnostic tools that improved engineering workflow efficiency by 27%; developed reusable modules enabling faster system monitoring, reporting, and performance insights for internal teams.
Designed distributed data-processing architecture using Python, REST APIs, and service-oriented components; ensured scalable integration with Microsoft internal telemetry platforms and performance logging systems.
Collaborated in Agile/Scrum with Program Managers, developers, and QA teams; participated in sprint planning, refinement, and delivery of automation and backend enhancements.
Built ingestion pipelines retrieving data from APIs, SQL Server, SharePoint repositories, and telemetry event streams; automated extraction and refresh schedules to maintain real-time availability of diagnostic data.
Performed extensive data cleansing, parsing, normalization, and schema validation using Pandas, regex, and JSON/XML handlers; ensured high-quality datasets for dashboards, analytics, and system health monitoring tools.
Managed structured and unstructured datasets across SQL Server, MySQL, and internal file systems; designed optimized queries, schemas, and indexing strategies to support high-performance retrieval.
Developed backend logic using Python OOP, modular design, and reusable class structures; increased maintainability, enabled rapid enhancements, and reduced code duplication across services.
Implemented analytics features providing trend analysis, anomaly detection, and performance summaries; supported engineering leadership with actionable insights derived from telemetry and log data.
Optimized Python processing pipelines using vectorization, concurrency, and caching strategies; significantly reduced latency and improved throughput for high-volume diagnostic workflows.
Built RESTful APIs with Flask and Django to expose telemetry metrics and automation results; enabled seamless integration with dashboards, monitoring systems, and internal engineering applications.
Containerized Python services using Docker for standardized deployments across dev, QA, and production; improved consistency, reproducibility, and operational reliability across environments.
Developed CI/CD pipelines using Jenkins and Git to automate testing, builds, and deployments; ensured faster release cycles and reduced manual intervention in code promotions.
Created comprehensive unit tests using PyTest, PyUnit, mocking frameworks, and coverage tools; improved code reliability, prevented regressions, and supported long-term maintainability.
Monitored application performance using logging frameworks, dashboards, and threshold-based alerts; identified bottlenecks proactively and implemented fixes to ensure stability of mission-critical systems.
Produced detailed documentation including API specs, architectural diagrams, design notes, and troubleshooting guides; facilitated knowledge transfer sessions and onboarded new engineers to internal tooling.
Environment: Python, Anaconda, Spyder (IDE), Windows 7, Teradata, Requests, urllib, urllib2, Beautiful Soup, Tableau, Python libraries such as NumPy and SQLAlchemy, MySQL.


Education: Bachelor of Technology in Computer Science and Engineering
GITAM University, India    2010 - 2014