| Sujitha C - AI/ML Engineer |
| [email protected] |
| Location: McLean, Virginia, USA |
| Relocation: Yes |
| Visa: |
| Resume file: Sujitha_001_1776778974757.docx |
|
Sujitha Cherukuthota
Senior AI / Machine Learning Engineer | Data Engineering | Cloud (AWS, Azure) | AI Systems
+1 (757) 936-9318 | [email protected]

Summary:
Senior AI/ML Engineer with a strong foundation in data engineering, experienced in working with large-scale datasets and building practical solutions in real-world healthcare, financial services, and retail environments.
Hands-on experience using Python, SQL, and PySpark for data processing, feature preparation, and machine learning workflows, contributing to systems used for analytics, automation, and AI-driven applications.
Worked on LLM-based and NLP-driven use cases, including document processing and retrieval-based systems, focusing on integrating model outputs into existing workflows rather than building isolated experimental models.
Experience with AWS and Azure services such as S3, Redshift, Glue, Data Factory, and container-based environments to manage data pipelines and support deployment of scalable data and AI solutions.
Strong understanding of the end-to-end data lifecycle, including ingestion, transformation, storage, validation, and consumption, ensuring reliable data flow for both reporting systems and AI-enabled applications.
Comfortable handling real-world, imperfect datasets, performing data cleaning, validation, and transformation to ensure quality inputs for analytics, reporting, and downstream AI or machine learning processes.
Able to bridge data engineering and AI systems, contributing to both pipeline development and intelligent features while maintaining a balanced approach without over-engineering or over-claiming model ownership.
Proven collaboration in Agile, cross-functional teams, translating business requirements into working solutions and adapting quickly to changing priorities while keeping data and AI workflows stable.
Technical Skills:
Languages & Core Technologies: Python, SQL
Data Engineering & Big Data: PySpark, Pandas, ETL Pipelines, Data Ingestion, Data Transformation, Data Modeling, Data Warehousing, Batch Processing, Distributed Data Processing, Spark SQL, Hive
Machine Learning & Applied AI: Supervised Learning, Regression, Classification, Time Series Analysis, Anomaly Detection, Model Evaluation, Model Validation, Feature Engineering, NLP (Natural Language Processing)
Generative AI & LLM Systems: Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Prompt Engineering, Context Handling, Embeddings, Semantic Search, Vector Retrieval, LLM Integration into Production Systems
AI/ML Frameworks & Libraries: Scikit-learn, PyTorch, LangChain, LlamaIndex
Cloud Platforms & Data Services: AWS (S3, Redshift, Glue, EMR, DynamoDB, EKS, Bedrock), Azure (Blob Storage, Data Factory, Synapse Analytics, AKS, Azure OpenAI), GCP (BigQuery, Dataflow, GKE, Vertex AI, Cloud Storage)
Databases & Storage Systems: Amazon Redshift, DynamoDB, Azure Synapse, BigQuery, MySQL, Oracle
APIs & Backend Systems: FastAPI, Flask, REST APIs, Microservices Architecture
DevOps, MLOps & Deployment: Docker, Kubernetes (EKS, AKS, GKE), CI/CD Pipelines, Jenkins, GitHub Actions, Azure DevOps, Model Deployment, Pipeline Automation
Vector Databases & Search: Pinecone, FAISS
Monitoring, Performance & Reliability: Data Quality Checks, Pipeline Monitoring, Performance Optimization, Query Optimization, Debugging Production Pipelines
Engineering Practices & Architecture: Modular Design, Reusable Components, Scalable Systems Design, Data Lifecycle Management, Agile Scrum

Experience:

HCA Healthcare
Senior AI / Machine Learning Engineer
Richmond, VA | Feb 2024 - Present
Developed a Healthcare Knowledge Assistant (RAG chatbot) using Python, AWS Bedrock, and vector search, enabling hospital staff to query internal documents and significantly reduce manual searching across multiple operational systems.
Designed a microservices architecture integrating ingestion, NLP processing, embedding generation, vector retrieval, and LLM response services, enabling modular deployment and seamless integration with enterprise healthcare applications.
Worked in an Agile Scrum model with 2-week sprints, collaborating with product owners and operations teams to define chatbot use cases, validate outputs, and continuously improve workflows based on real usage feedback.
Built data ingestion pipelines using Python and REST APIs to collect clinical documents such as discharge summaries, referral notes, and policy files, storing them in Amazon S3 for centralized processing and knowledge base creation.
Implemented data preprocessing pipelines using PySpark, AWS Glue, and Pandas to extract text from PDFs, clean data, normalize formats, and split documents into chunks optimized for ML models and RAG retrieval workflows.
Designed a data storage layer using Amazon S3, Redshift, and DynamoDB, organizing raw documents, processed text, and metadata to support efficient querying and integration with chatbot and backend services.
Developed embedding pipelines using transformer models and stored vectors in Pinecone or FAISS, enabling semantic similarity search and efficient retrieval of relevant document chunks for accurate chatbot responses (see the sketch below).
Built a machine learning-based document classification model using NLP techniques to categorize documents into types such as discharge summaries and referral notes, improving retrieval relevance and reducing manual sorting effort.
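A minimal sketch of the embedding-and-retrieval pattern referenced above, using a sentence-transformers model and a FAISS index; the model name, sample chunks, and helper function are illustrative assumptions, not artifacts of the actual system.

```python
# Minimal sketch: embed document chunks and run semantic similarity search with FAISS.
# Assumptions: the embedding model name and the sample chunks are illustrative only.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical document chunks produced by the preprocessing/chunking pipeline.
chunks = [
    "Discharge summary: patient discharged with follow-up in two weeks.",
    "Referral note: cardiology consult requested for further evaluation.",
    "Policy: visitor hours are 9 AM to 8 PM on inpatient floors.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Encode chunks and normalize so inner product behaves like cosine similarity.
embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
index.add(embeddings)

def search(query: str, k: int = 2):
    """Return the top-k most similar chunks for a query."""
    q = model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]

if __name__ == "__main__":
    print(search("when is the follow-up appointment?"))
```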
Implemented a RAG pipeline by converting user queries into embeddings, retrieving relevant document chunks using vector similarity search, and generating grounded responses with AWS Bedrock Claude models.
Designed prompt engineering strategies and guardrails to ensure responses were generated strictly from retrieved context, reducing hallucinations and improving the reliability of chatbot outputs for healthcare staff.
Used the LangChain and LlamaIndex frameworks to orchestrate embedding pipelines, retrieval logic, prompt construction, and response generation, enabling seamless integration between vector databases and LLM services.
Developed REST APIs using FastAPI to expose chatbot functionality, enabling integration with web-based applications and internal systems for real-time document search and question-answering workflows.
Conducted evaluation and validation of ML and RAG outputs using test datasets and sample queries, ensuring classification accuracy and response relevance before deploying the system to production.
Containerized services using Docker and managed images in AWS ECR, ensuring consistent runtime environments for document processing, embedding pipelines, and chatbot services across development, testing, and production stages.
Deployed services on AWS EKS with CI/CD pipelines using GitHub Actions and Jenkins, enabling automated build, testing, and deployment while supporting scalable execution of RAG pipelines under increasing workloads.

Sallie Mae Bank
AI / Machine Learning Engineer
Newark, DE | May 2022 - Jan 2024
Engineered an Azure-based financial document intelligence platform using Python, Scikit-learn, and PyTorch, enabling automated classification and routing of loan documents across high-volume servicing workflows.
Built an AI-driven financial query and reconciliation assistant using Python, Azure OpenAI, and vector search, enabling analysts to investigate mismatches in loan and transaction data without manually reviewing multiple reporting systems.
Designed a distributed service architecture separating ingestion, reconciliation logic, embedding generation, and query services, allowing independent scaling of pipelines and smoother integration with finance and reporting applications.
Operated within an Agile delivery model with bi-weekly sprints, working closely with risk analysts and finance teams to translate reconciliation challenges into technical solutions and refine outputs based on real audit scenarios.
Developed data ingestion workflows to pull loan data, transaction records, and audit logs from SQL systems and APIs into Azure Blob Storage, ensuring consistent availability of structured and semi-structured financial datasets.
Built data transformation pipelines using PySpark and Azure Data Factory to join datasets, resolve inconsistencies, and standardize formats, enabling reliable comparison of records across different financial systems (see the PySpark sketch below).
Structured a storage layer using Azure Blob Storage, Synapse Analytics, and relational databases, organizing reconciled datasets and historical records to support both batch analysis and real-time query requirements.
Created embedding pipelines for financial records and logs, storing vector representations in a vector database to enable similarity-based lookup of related transactions and historical reconciliation cases.
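A minimal PySpark sketch of the join-and-standardize reconciliation step referenced above; the storage paths, column names, and mismatch tolerance are illustrative assumptions rather than details of the actual pipelines.

```python
# Minimal PySpark sketch: join loan and transaction extracts, standardize formats,
# and flag amount mismatches for reconciliation review.
# Assumptions: paths, column names, and the 0.01 tolerance are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconciliation-prep").getOrCreate()

loans = spark.read.parquet("abfss://[email protected]/loans/")        # hypothetical path
transactions = spark.read.parquet("abfss://[email protected]/txns/")  # hypothetical path

# Standardize join keys and numeric formats before comparison.
loans = (
    loans.withColumn("loan_id", F.upper(F.trim("loan_id")))
         .withColumn("expected_amount", F.col("expected_amount").cast("decimal(18,2)"))
)
transactions = (
    transactions.withColumn("loan_id", F.upper(F.trim("loan_id")))
                .withColumn("posted_amount", F.col("posted_amount").cast("decimal(18,2)"))
)

# Join on loan_id and flag records where posted and expected amounts diverge.
reconciled = (
    loans.join(transactions, on="loan_id", how="left")
         .withColumn(
             "is_mismatch",
             F.when(F.abs(F.col("posted_amount") - F.col("expected_amount")) > 0.01, F.lit(True))
              .otherwise(F.lit(False)),
         )
)

# Persist only the flagged records for analyst review (hypothetical output path).
reconciled.filter("is_mismatch").write.mode("overwrite") \
          .parquet("abfss://[email protected]/reconciliation_flags/")
```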
Developed a machine learning-based anomaly detection model to identify mismatches between expected and actual transaction values, helping analysts focus only on flagged records instead of scanning entire datasets.
Built a RAG-based query layer where analysts could ask natural-language questions, retrieve relevant transaction records and logs, and receive contextual explanations generated with Azure OpenAI models.
Engineered context filtering logic to ensure only relevant financial records were passed to the LLM, improving the accuracy of responses and avoiding irrelevant or misleading outputs during reconciliation workflows.
Used LangChain pipelines to combine retrieval, structured data lookups, and response generation, enabling a hybrid approach where both vector search and tabular queries contributed to final answers.
Designed Python-based services using modular, reusable components, allowing easy extension of reconciliation logic and reducing the effort required to onboard new financial datasets into the system.
Performed validation using historical reconciliation cases, comparing system outputs with analyst-reviewed results to ensure anomaly detection accuracy and response correctness before wider adoption.
Containerized services using Docker and maintained images in Azure Container Registry, ensuring consistent execution environments for ingestion, anomaly detection, and query services across development and production setups.
Deployed applications on Azure Kubernetes Service (AKS) with CI/CD pipelines using Azure DevOps and GitHub Actions, enabling controlled releases and scalable execution of reconciliation and query workloads.

State of California
Data Scientist / Machine Learning Engineer
San Francisco, CA | Feb 2020 - Apr 2022
Developed a GCP-based public program analytics platform using Python, Scikit-learn, and PyTorch, enabling analysis of large administrative datasets and supporting data-driven planning across government systems.
Built a public data analytics platform using Python and GCP to analyze datasets related to program usage and regional trends, enabling analysts to generate reports without manually combining data from multiple sources.
Designed a batch-oriented data processing architecture integrating ingestion, transformation, analytics, and reporting layers, ensuring reliable handling of large public datasets across different departments and reporting systems.
Worked in an Agile-based delivery model with sprint cycles, collaborating with policy analysts and reporting teams to define data requirements and deliver incremental improvements aligned with government reporting needs.
Built data ingestion pipelines using Python and APIs to collect datasets from government portals, internal systems, and flat files, storing them in Google Cloud Storage for centralized processing and reuse.
Implemented data transformation workflows using PySpark and Dataflow to clean missing values, standardize formats, and merge datasets, ensuring consistency across multiple sources used for reporting and analytics.
Designed a data storage layer using BigQuery and Cloud Storage, organizing structured datasets and historical records to support reporting queries and downstream machine learning models.
Created basic text indexing for reports and documents to support keyword and semantic search, enabling analysts to quickly locate relevant sections within large policy and reporting documents (see the sketch below).
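A small, illustrative sketch of the kind of lightweight text indexing mentioned above, using scikit-learn's TF-IDF vectorizer and cosine similarity to rank report sections against a query; the sample sections and query text are assumptions for illustration only.

```python
# Minimal sketch: rank report sections against a keyword query with TF-IDF + cosine similarity.
# Assumptions: the sample sections and query text are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical report sections that an analyst might search.
sections = [
    "Program participation increased 12% in the northern counties during Q3.",
    "Eligibility rules were updated to include households below 200% of the poverty line.",
    "Regional demand for childcare subsidies grew fastest in coastal districts.",
]

vectorizer = TfidfVectorizer(stop_words="english")
section_matrix = vectorizer.fit_transform(sections)

def search_sections(query: str, top_k: int = 2):
    """Return the top-k sections ranked by cosine similarity to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, section_matrix)[0]
    ranked = sorted(zip(sections, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    for text, score in search_sections("which regions saw participation growth?"):
        print(f"{score:.3f}  {text}")
```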
Built a machine learning model for trend analysis using regression techniques to identify patterns in program participation and regional demand, supporting data-driven decision-making for planning and resource allocation.
Developed data aggregation logic to summarize large datasets into meaningful metrics such as usage trends and growth patterns, reducing the manual effort required to generate recurring reports.
Implemented a lightweight query assistant using retrieval logic to help analysts search reports and datasets with natural language queries, improving accessibility of data without relying on complex interfaces.
Used Python-based modular pipelines to structure data processing and model execution, improving maintainability and enabling reuse of components across different datasets and reporting workflows.
Developed REST APIs using Flask to expose analytics outputs and datasets, enabling integration with dashboards and internal tools used by reporting and policy teams.
Performed validation using historical data and analyst feedback, ensuring model outputs and aggregated results aligned with expected reporting trends before deployment.
Containerized applications using Docker and stored images in Google Container Registry, ensuring consistent runtime environments for data pipelines, analytics jobs, and API services across development and production.
Deployed services on Google Kubernetes Engine (GKE) with CI/CD pipelines using Jenkins and GitHub Actions, enabling reliable deployment and scalable execution of batch processing and analytics workloads.

Walmart Global Tech
Data Engineer
Bentonville, AR | Oct 2016 - Dec 2019
Worked on large-scale retail data pipelines using Python, PySpark, and AWS services to process sales, inventory, and product data, supporting reporting dashboards and operational decision-making across multiple store locations.
Contributed to distributed data processing workflows handling ingestion, transformation, and aggregation of transactional datasets from multiple systems, enabling reliable processing of high-volume retail data generated by daily store operations.
Collaborated in an Agile environment with sprint-based delivery cycles, working closely with analysts and business teams to understand reporting requirements and refine data processing logic based on changing business needs.
Maintained batch ingestion jobs to extract sales transactions, product catalog data, and inventory records from relational databases and APIs into Amazon S3, ensuring consistent availability of raw data for downstream processing pipelines.
Developed data transformation pipelines using PySpark and Hive to clean missing values, standardize product attributes, and aggregate sales metrics, improving the consistency and usability of datasets across reporting and analytics systems.
Managed data storage using Amazon S3 and Redshift, organizing raw and processed datasets with proper partitioning and schema design to support efficient querying and downstream analytical workflows.
Created data preparation workflows to generate structured datasets such as store-level sales summaries, product trends, and category-level metrics, reducing the manual effort required for recurring reporting and analysis tasks (see the sketch below).
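A hedged sketch of the store-level aggregation described above, expressed in PySpark; the S3 paths and column names are made-up assumptions rather than details of the actual Walmart pipelines.

```python
# Minimal PySpark sketch: aggregate transaction-level sales into store-level daily summaries.
# Assumptions: S3 paths and column names are illustrative, not actual pipeline details.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("store-sales-summary").getOrCreate()

sales = spark.read.parquet("s3://example-bucket/raw/sales/")  # hypothetical input

store_daily = (
    sales.withColumn("sale_date", F.to_date("sale_ts"))
         .groupBy("store_id", "sale_date", "category")
         .agg(
             F.sum("sale_amount").alias("total_sales"),
             F.countDistinct("transaction_id").alias("transaction_count"),
             F.avg("sale_amount").alias("avg_ticket"),
         )
)

# Partition by date so downstream reporting queries scan only the days they need.
store_daily.write.mode("overwrite").partitionBy("sale_date") \
           .parquet("s3://example-bucket/curated/store_daily_sales/")  # hypothetical output
```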
Supported preparation of datasets for demand forecasting use cases by organizing historical sales data, creating derived features, and ensuring consistent input structures for downstream modeling and analytics workflows.
Applied data validation and quality checks to verify the completeness and accuracy of processed datasets, identifying inconsistencies such as missing values and duplicate records before data was used for reporting.
Designed data aggregation logic to combine multiple datasets into unified reporting tables, enabling consistent analysis of sales performance across different regions, stores, and time periods.
Developed Python scripts and reusable utilities to automate repetitive data processing tasks, improving efficiency and reducing the manual intervention required to run and maintain data pipelines.
Assisted in performance tuning of data pipelines by optimizing queries and improving resource usage in Spark jobs, reducing execution time for large batch processing workloads.
Participated in testing and validation of data workflows, ensuring outputs matched source data and met business expectations before datasets were consumed by reporting and analytics teams.
Executed batch data pipelines using AWS EMR and scheduled workflows, ensuring timely processing of large datasets and consistent availability of updated data for reporting and analytics use cases.
Supported deployment and monitoring using Jenkins and logging tools, helping maintain stable pipeline execution, track job failures, and resolve data processing issues in production environments.

Citibank
Python Developer
Hyderabad, India | Aug 2015 - Sep 2016
Worked on data processing and reporting scripts using Python and SQL to support financial operations, helping teams analyze transaction data and generate reports required for internal business and compliance use cases.
Assisted in ETL workflows involving data extraction from relational databases, transformation of financial records, and loading of processed datasets into reporting systems used by business and operations teams.
Collaborated with analysts and senior developers in an Agile environment to understand reporting requirements and implement data processing logic aligned with financial data structures and workflows.
Developed data extraction scripts using SQL queries to pull transaction records, account details, and customer data from relational databases for use in reporting and downstream processing tasks.
Implemented data cleaning and transformation logic in Python using Pandas to handle missing values, normalize data formats, and prepare structured datasets for reporting and validation workflows.
Maintained data storage in relational databases such as MySQL and Oracle, ensuring proper organization of processed data and supporting efficient retrieval for reporting and analysis tasks.
Created data preparation routines to generate summary tables such as transaction aggregates and account-level metrics, reducing the manual effort required for recurring financial reporting activities.
Wrote Python scripts for data processing and automation, handling tasks such as data validation, transformation, and report generation to improve the efficiency of daily financial operations workflows.
Applied data validation checks to identify inconsistencies in transaction records, such as missing values and duplicate entries, ensuring the accuracy of datasets before they were used for reporting (see the sketch below).
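A small, illustrative sketch of the Pandas-based validation described above, checking transaction records for missing values and duplicates before they feed reporting; the column names and sample data are assumptions, not details of the original scripts.

```python
# Minimal Pandas sketch: validate transaction records for missing values and duplicates
# before they are used in reporting. Column names and sample data are illustrative assumptions.
import pandas as pd

def validate_transactions(df: pd.DataFrame) -> dict:
    """Return simple data-quality counts and a cleaned copy of the frame."""
    report = {
        "rows": len(df),
        "missing_amount": int(df["amount"].isna().sum()),
        "missing_account": int(df["account_id"].isna().sum()),
        "duplicate_txn_ids": int(df.duplicated(subset=["transaction_id"]).sum()),
    }
    cleaned = (
        df.dropna(subset=["amount", "account_id"])      # drop rows missing key fields
          .drop_duplicates(subset=["transaction_id"])   # keep first occurrence per txn id
          .assign(amount=lambda d: d["amount"].astype(float))
    )
    report["rows_after_cleaning"] = len(cleaned)
    return {"report": report, "cleaned": cleaned}

if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "transaction_id": ["T1", "T1", "T2", "T3"],
            "account_id": ["A10", "A10", None, "A12"],
            "amount": [100.0, 100.0, 55.5, None],
        }
    )
    result = validate_transactions(sample)
    print(result["report"])
```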
Designed SQL queries and joins to combine data from multiple tables, enabling the creation of consolidated datasets for reporting and supporting analysis across different financial data sources.
Assisted in automating reporting processes by generating structured output files and reports, reducing manual work for analysts and improving turnaround time for business reporting tasks.
Followed modular coding practices to organize scripts and reusable functions, improving maintainability and enabling easier updates to data processing workflows over time.

Education:
SIET
Bachelor of Technology (B.Tech) in Computer Science
India | Aug 2011 - June 2015