
Location: Margate, Florida, USA
Relocation: Yes, open to relocating anywhere
Visa: GC
Name: Sravya Nagaram
Sr. Data Engineer
Phone: +1 870 568 6886
Email: [email protected]
Professional Summary:
11+ years of IT Experience in Diverse Domains: Extensive background in end-to-end data analytics solutions encompassing Big Data, Hadoop, Informatica, Data Modeling, and System Analysis.
Azure DevOps Expertise: Proficiently build reusable YAML pipelines in Azure DevOps, create CI/CD pipelines using cloud-native architectures on Azure Cloud, and implement Git flow branching strategy.
GCP Proficiency and Cloud Native Technologies: Mastery over GCP services including BigQuery, Google Cloud Storage (GCS) buckets, Google Cloud Functions, and Dataflow.
AWS and Hadoop Expertise: Proficient in AWS cloud services like EC2, S3, Glue, Athena, DynamoDB, and Redshift. Hands-on experience with the Hadoop ecosystem - HDFS, MapReduce, Pig, Hive, Sqoop, Flume, and Spark.
Legacy Data Migration: Led successful migration projects from Teradata to AWS Redshift and on-premises to AWS Cloud.
AWS Cloud-Based Pipelines: Utilized AWS services like EMR, Lambda, and Redshift to develop cloud-based pipelines and Spark applications.
DevOps and Scripting Proficiency: Skilled in PowerShell scripting, Bash, YAML, JSON, GIT, Rest API, and Azure Resource Management (ARM) templates. Implement CI/CD standards, integrate security scanning tools, and manage pipelines effectively.
Windows Scripting and Cloud Containerization: Proficient in scripting and debugging within Windows environments. Familiarity with container orchestration, Kubernetes, Docker, and AKS.
Efficient Data Integration: Expertise in designing and deploying SSIS packages for data extraction, transformation, and loading into Azure SQL Database and Data Lake Storage. Configure SSIS Integration Runtime for Azure execution and optimize package performance.
Data Visualization and Analysis: Create data visualizations using Python, Scala, and Tableau. Develop Spark scripts with custom RDDs in Scala for data transformation and actions. Conduct statistical analysis on healthcare data using Python and various tools.
Big Data Ecosystem: Extensive experience with Amazon EC2 for computing, query processing, and storage. Proficiently set up Pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines for ETL tasks.
Azure Data Services: ETL expertise using Azure Data Factory, T-SQL, Spark SQL, and U-SQL Azure Data Lake Analytics. Ingest data to Azure Services and process within Azure Databricks.
Hadoop Proficiency: Strong support experience across major Hadoop distributions - Cloudera, Amazon EMR, Azure HDInsight, Hortonworks. Proficient with Hadoop tools - HDFS, MapReduce, Yarn, Spark, Kafka, Hive, Impala, HBase, Sqoop, Airflow, and more.
Azure Cloud and Big Data Tools: Working knowledge of Azure components - HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB. Hands-on experience with Spark using Scala and PySpark.
Database Migration: Expertise in migrating SQL databases to Azure Data Lake, Azure SQL Database, Data Bricks, and Azure SQL Data Warehouse. Proficient in access control and migration using Azure Data Factory.
Cloud Computing and Big Data Tools: Proficient in Azure Cloud and Big Data tools - Hadoop, HDFS, MapReduce, Hive, HBase, Spark, Azure Cloud, Amazon EC2, DynamoDB, S3, Kafka, Flume, Avro, Sqoop, PySpark.
Real-time Data Solutions: Build real-time data pipelines and analytics using Azure components like Data Factory, HDInsight, Azure ML Studio, Stream Analytics, Azure Blob Storage, and Microsoft SQL DB.
Database Expertise: Work with SQL Server and MySQL databases. Skilled in working with Parquet files, parsing, and validating JSON formats. Hands-on experience in setting up workflows with Apache Airflow and Oozie.
API Development and Integration: Develop highly scalable and resilient RESTful APIs, ETL solutions, and third-party platform integrations as part of an Enterprise Site platform.
IDE and Version Control: Proficient use of IDEs like PyCharm, IntelliJ, and version control systems SVN and Git.


TECHNICAL SKILLS:
Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, Delta Lake, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
Hadoop Distributions: Cloudera CDH, Apache, AWS, Hortonworks HDP
Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions
Spark Components: RDD, Spark SQL (DataFrames and Datasets), Spark Streaming
Cloud Infrastructure: AWS, Azure, GCP
Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL (HBase, MongoDB)
Scripting & Query Languages: Shell scripting, SQL
Version Control: CVS, SVN, ClearCase, Git
Build Tools: Maven, SBT
Containerization Tools: Kubernetes, Docker, Docker Swarm
Development Tools & IDEs: Eclipse, Visual Studio, NetBeans, JUnit, Azure Databricks, CI/CD, Linux, Unix, Google Cloud Shell
Reporting Tools: Power BI, SAS, Tableau


PROFESSIONAL EXPERIENCE:

Client: OSF HealthCare - Peoria, IL Feb 2024 - Present
Role: Sr. Data Engineer
Responsibilities:
Contributed to the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Proficient in Agile practices, participated in sprint planning, scrum calls, and retrospectives. Managed projects through JIRA and version control with GitHub.
Implemented Alteryx workflows to automate data extraction, transformation, and loading (ETL) processes for real-time data streaming.
Orchestrated containerized deployment of data mesh components using Kubernetes or similar container orchestration platforms to ensure scalability, resilience, and resource efficiency in real-time environments.
Harnessed Kafka and RabbitMQ to capture data streams within Docker virtualized test and dev environments.
Implemented Data Mesh architecture to decentralize data ownership, enabling real-time data access and analytics for cross-functional teams.
Oversaw platform engineering initiatives, managed Data Mesh/virtualization systems like Starburst and Dremio, maximized performance, and ensured smooth data access throughout the enterprise.
Developed and maintained a variety of data platforms, fostering efficiency and scalability in data processing and storage using distributed file and table formats like Iceberg, Delta, and Apache Parquet.
Leveraged Alteryx Server for scheduling and orchestrating real-time data pipelines, ensuring timely and reliable data delivery.
Spearheaded the implementation of an enterprise-wide data cataloging solution, leveraging tools like Alation and Collibra to enhance the discoverability and understanding of diverse data assets across the organization.
Managed metadata repositories to catalog and govern data assets, facilitating efficient data discovery and lineage tracking in real-time data workflows.
Established data quality monitoring and anomaly detection mechanisms within the data mesh infrastructure to ensure high data integrity and reliability in real-time data streams.
Implemented a robust alerting mechanism within the data catalog, utilizing Apache Kafka and Elasticsearch, to notify stakeholders about changes in cataloged healthcare datasets, ensuring timely awareness and updates.
Orchestrated the integration of IBM InfoSphere Information Governance Catalog into the data cataloging ecosystem, streamlining the cataloging of sensitive patient data and reinforcing data security measures.
Leveraged expertise in hardware systems performance to optimize CPU, RAM, storage, network, Linux, JVM, and distributed systems performance, enhancing overall system reliability and throughput.
Conducted comprehensive statistical analysis using SQL, Python, Scala, R Programming, and Excel.
Engineered custom ETL solutions for data transformation using PySpark and Shell Scripting.
Crafted Azure Databricks (Spark) notebooks for data movement between storage accounts (see the sketch at the end of this section).
Integrated diverse data sources into Power BI for unified analytics, including streaming data sources for real-time updates.
Implemented DevOps practices using tools like GIT, Gitlab, Splunk, and Ansible to streamline development and deployment processes.
Created custom Python scripts to perform data manipulation, cleansing, and enrichment, ensuring data accuracy and consistency for analytical purposes.
Implemented and configured Alation as the primary data catalog solution, ensuring seamless integration with Apache Flink and Apache Kafka for real-time data streaming cataloging.
Demonstrated proficiency in incident management, change management, and problem management processes, ensuring timely resolution of service-impacting issues.
Ensured optimal performance and scalability in cloud implementation strategies.
Managed and delivered data for analytics and Business Intelligence needs using Azure Synapse.
Implemented CI/CD pipelines for automated testing and deployment of data products, enabling rapid iteration and delivery of real-time data solutions across the Data Mesh ecosystem.
Led Agile delivery and integrated SAFe and DevOps frameworks.
Architected, developed, and maintained CI/CD pipelines within Azure DevOps.
Utilized PowerShell scripting, Bash, YAML, JSON, GIT, Rest API, and Azure Resource Management (ARM) templates for CI/CD.
Automated Azure Databricks jobs and constructed SSIS packages for smooth data transfer.
Collaborated closely with development teams to diagnose issues and debug code.
Mentored junior engineers on CI/CD best practices and cloud-native architectures.
Mentored junior team members and provided technical guidance on best practices for developing, deploying, and maintaining real-time data solutions using Apache Spark, Power BI, and Azure Kubernetes Service.
Environment: Hadoop, Scala, Spark, Hive, Sqoop, Alteryx, HBase, Flume, Ambari, MS SQL, MySQL, SSIS, Snowflake, MongoDB, Git, Python, Azure (Storage Explorer, ADF, AKS, Blob Storage), RabbitMQ, Docker.
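A minimal sketch of the kind of Databricks (PySpark) notebook used for moving curated data between storage accounts, as referenced above. The storage account, container, path, and column names are hypothetical placeholders, and authentication is assumed to be configured separately (e.g. in the cluster's Spark config or a secret scope); this is an illustration of the pattern, not the client's actual code.

```python
# PySpark sketch of a Databricks notebook copying data between two Azure
# storage accounts. All account/container/path/column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks

source_path = "abfss://raw@sourceacct.dfs.core.windows.net/healthcare/claims/"
target_path = "abfss://curated@targetacct.dfs.core.windows.net/healthcare/claims/"

df = spark.read.parquet(source_path)

# Light cleanup before landing the data in the curated account.
df_clean = df.dropDuplicates(["claim_id"]).na.drop(subset=["member_id"])

(df_clean.write
    .mode("overwrite")
    .partitionBy("claim_date")
    .parquet(target_path))
```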

Client: Citi Group - New York, NY July 2022 - Dec 2023
Role: Sr. Data Engineer
Responsibilities:
Led efforts to develop a process to improve data quality by 60% for a data triangulation project and assigned quality scores by evaluating data characteristics.
Developed a SQL database and inserted over 2 million records of medical supplies data, leading to more robust and reliable data storage.
Responsible for meeting with SMEs and stakeholders to identify problems in the existing system, resolve issues, and improve the process to ensure a stable and accurate system.
Developed and optimized Hive and Spark jobs processing terabytes of data, achieving a 30% reduction in query response time.
Defined data requirements, modeled and designed the new data structures, source to target mapping for AWS Redshift/Tableau based projects.
Provided support for SharePoint Portal Server and Windows SharePoint Services based intranet sites, including portal area content, permissions and security, and Team Site management.
Deployed a PyTorch sentiment analysis model and created a gateway for accessing it from a website; used tone and bag-of-words features and deployed the model using Amazon SageMaker.
Used multiple machine learning algorithms, including random forest, boosted trees, SVM, SGD, neural networks, and deep learning with TensorFlow.
Engineered end-to-end data pipelines, utilizing AWS Redshift as the primary data warehousing solution.
Developed various machine learning models such as Logistic regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python.
Designed and implemented a serverless data processing pipeline using AWS Lambda, S3, and DynamoDB to handle real-time data streams, resulting in a 30% reduction in processing time (see the sketch at the end of this section).
Created personalized dashboards for 12 countries in Microsoft Power BI to monitor stock levels leading to reduced stock-outs by 70%.
Implemented a pipeline of automated Python scripts to optimize and reduce the data processing time by 30%.
Collaborated with cross-functional teams, including supply chain, finance, and operations, to understand data requirements and provide data-driven insights, contributing to data-informed decision-making and process improvements.
Designed and implemented a scalable data storage solution using DynamoDB for a high-traffic application, improving read/write performance by 40%.
Set up and configured TLS/SSL on the Hadoop cluster across levels 1, 2, and 3 using keystores and truststores.
Created and maintained various Azure DevOps related tools for the team such as provisioning scripts, RLM deployment tools, and development and staging environments.
Designed and implemented a robust data storage strategy using Azure Cosmos DB, optimizing performance by 30% and reducing latency by 20%.
Built ETL jobs using the PySpark API with Jupyter notebooks on an on-premises cluster for specific transformation needs, with HDFS as the data storage system.
Created high-level test scenarios, test cases, and test scripts for Data Hub requirements as described in the BRD (Business Requirements Document) and FRD (Functional Requirements Document).
Implemented data governance and security measures in Azure, including data encryption, access controls, and auditing, ensuring compliance with data protection regulations and industry best practices.
Conducted exploratory data analysis (EDA) on large datasets using Azure Notebooks and Azure Machine Learning, identifying data quality issues and uncovering insights to drive business strategy.
Facilitated Joint Requirement Planning (JRP) sessions with SMEs to understand requirements spanning loan origination to loan processing.
Optimized SQL queries and utilized BigQuery's partitioning and clustering strategies to enhance query efficiency, reducing processing time by 70%.
Worked on Building, publishing customized interactive reports and dashboards, report scheduling using Tableau server.
Worked on an end-to-end machine learning workflow: wrote Python code for gathering data from Snowflake on AWS, data preprocessing, feature extraction, feature engineering, modeling, model evaluation, and deployment. Wrote Python code for exploratory data analysis using scikit-learn and related packages: NumPy, Pandas, Matplotlib, Seaborn, statsmodels, and pandas-profiling.
Built and trained a deep learning network using TensorFlow on the data and reduced wafer scrap by 15% by predicting the likelihood of wafer damage; a combination of z-plot features, image features (pigmentation), and probe features was used.
Conducted root cause analysis of supply chain disruptions and developed data-driven recommendations for process improvements, resulting in a 20% reduction in supply chain delays and disruptions, improving overall supply chain performance.
Implemented monitoring and logging using CloudWatch to track Lambda function performance metrics, enabling proactive identification and resolution of bottlenecks.
Assigned as admin/primary support liaison for major applications (SharePoint/CRM).
Conducted data profiling and data validation checks on supply chain commodity data to identify and resolve data quality issues, resulting in a 25% reduction in data errors and improved data accuracy.
Conducted trend analysis and identified patterns in supply chain commodity data using advanced analytics techniques, such as time series analysis and clustering, resulting in data-driven insights that informed strategic decision-making and improved supply chain performance.
Environment: Python, MS SQL, MS Office, Hive, Azure, BigQuery, Power BI, DynamoDB, Cosmos DB, Lambda, Redshift, Jupyter Notebooks, MS Excel, Scikit-learn, Pandas, NumPy, Spyder, Windows, RStudio.
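A hedged sketch of the serverless S3-to-DynamoDB pattern referenced above: a Lambda handler triggered by S3 object-created events that parses incoming records and writes them to DynamoDB. The bucket contents, table name, and field names are hypothetical; the real pipeline's schema and error handling are not shown.

```python
# Sketch of a Lambda handler for a serverless S3 -> DynamoDB pipeline.
# Table name, record fields, and file layout are illustrative placeholders.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("supply_chain_events")  # hypothetical table name
s3 = boto3.client("s3")

def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Each S3 object is assumed to hold one JSON document per line.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for line in body.splitlines():
            item = json.loads(line)
            table.put_item(Item={
                "event_id": item["event_id"],   # partition key (assumed)
                "event_ts": item["event_ts"],
                "payload": json.dumps(item),
            })
    return {"status": "ok"}
```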

Client: Northern Trust - Chicago, IL Nov 2020 - June 2022
Role: Sr. Data Engineer
Responsibilities:
Contributed to the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Proficient in Agile practices, participated in sprint planning, scrum calls, and retrospectives. Managed projects through JIRA and version control with GitHub.
Implemented scalable data storage and compute infrastructure for domain-specific datasets, utilizing cloud-native technologies like AWS S3, Azure Data Lake Storage, or Google Cloud Storage in alignment with Data Mesh principles.
Created and managed Amazon EC2 instances, diagnosing common issues and maintaining EC2 instances and other AWS services.
Designed and developed data products within the data mesh ecosystem, promoting data democratization and enabling self-serve access to real-time data for business stakeholders.
Leveraged Data Mesh principles to break down monolithic data pipelines into decentralized, domain-specific microservices, resulting in improved agility and autonomy in data management.
Orchestrated the integration of Alation with Tableau for seamless data visualization, providing financial analysts with a user-friendly interface to explore and understand cataloged data.
Collaborated with business analysts to define and implement cataloging workflows, leveraging Alation for metadata management, resulting in improved data governance and transparency.
Integrated data governance frameworks into the data mesh architecture, enforcing data privacy, security, and compliance policies across decentralized data domains.
Utilized virtualization technologies to create isolated testing environments for complex data processing workflows, facilitating seamless integration and deployment of data solutions.
Established automated cataloging processes using Informatica Data Catalog, integrating with Oracle databases and Amazon S3, to streamline the onboarding of new financial datasets.
Reverse engineered the reports and identified the Data Elements (in the source systems), Dimensions, Facts and Measures required for new enhancements of reports.
Created interactive and visually compelling dashboards in Power BI, integrating real-time data streams from Spark for dynamic insights and actionable intelligence.
Implemented virtualization techniques for data provisioning and sandboxing, enabling rapid experimentation and development of data pipelines in a real-time analytics environment.
Managed metadata within Collibra on AWS, integrating it with AWS Glue to capture and document metadata generated during ETL processes, providing insights into data lineage, quality, and usage.
Optimized Spark jobs for performance and scalability in processing large-scale data sets.
Engineered RESTful APIs using Python with Flask and Django frameworks, orchestrating integration across diverse data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files (see the sketch at the end of this section).
Designed and created tables and/or views in the physical data model by analyzing different access and join criteria.
Analyzed and profiled the source data flowing into these tables to inform design decisions.
Designed and implemented data discovery and metadata management solutions to enable cataloging and discovery of data assets within the data mesh ecosystem.
Designed and implemented real-time data replication solutions using Alteryx on AWS, utilizing services like AWS DMS and Glue to synchronize data across distributed databases and data warehouses.
Leveraged Alation REST APIs to automate the extraction of metadata from SQL Server and Oracle databases, streamlining the cataloging process and ensuring consistency in financial data descriptions.
Led the migration of on-premises databases to AWS RDS, minimizing downtime and disruptions while optimizing database performance in the cloud environment.
Implemented machine learning algorithms and analytics using Spark MLlib for real-time data insights.
Developed and maintained data processing pipelines using Python, automating large and complex datasets' extraction, transformation, and loading (ETL).
Architected and implemented real-time data processing pipelines using Apache Spark.
Created custom Python scripts to perform data manipulation, cleansing, and enrichment, ensuring data accuracy and consistency for analytical purposes.
Implemented modular and reusable Python codebase, promoting code efficiency, maintainability, and collaboration across the data engineering team.
Designed and optimized complex SQL queries for data extraction, aggregation, and reporting, ensuring efficient data retrieval from various database systems.
Collaborated with cross-functional teams to gather requirements and design end-to-end solutions that leverage the capabilities of Apache Spark for real-time data analytics and machine learning.
Built REST APIs using Django REST Framework (DRF) and Flask-RESTful, ensuring seamless integration with external systems.
Provided support for cloud instances on AWS, adeptly managing Linux and Windows instances, Elastic IP, Security Groups, and Virtual Private Cloud.
Orchestrated the deployment and management of microservices-based data infrastructure using containerization technologies such as Docker and Kubernetes, ensuring scalability and resilience.
Proficiently profiled structured, unstructured, and semi-structured data across diverse sources, implementing data quality metrics and pattern identification through SQL queries and Python scripts.
Automated backups of ephemeral data-stores to S3 buckets and EBS, generating nightly AMIs for mission-critical production server backups with AWS CLI.
Installed and configured automated tools like Puppet and participated in deployment processes on multiple platforms using Chef and Puppet.
Implemented CI/CD pipelines and automated testing frameworks to ensure the reliability and scalability of data mesh infrastructure and data products.
Contributed to OpenShift PaaS product architecture, focusing on creating OpenShift namespaces for seamless migration of on-premises applications to the cloud.
Utilized Docker for server virtualization in testing and development environments, automating configuration tasks through Docker containers.
Worked with Amazon AWS/EC2 and Google's Kubernetes-based Docker cluster management environment.
Orchestrated Jenkins job creation, load distribution on Jenkins servers, and parallel build execution through Jenkins node configuration.
Extensively managed Jenkins CI/CD pipelines, overseeing end-to-end automation for artifact build, test, and delivery, and effectively troubleshooting issues during the build process.
Managed Jenkins artifacts within the Nexus repository, incorporating versioning with timestamps, and deployed artifacts to AWS servers with Ansible and Jenkins.
Established continuous integration systems with Ant, Jenkins, and Puppet, driving comprehensive automation and swift, error-free deployments.
Conducted performance tuning exercises for Sybase databases on AWS infrastructure, optimizing resource utilization and query performance to meet stringent real-time processing SLAs and business objectives.
Managed Nexus Repository Manager, overseeing repository management and artifact release processes.
Proficient in on-premises and AWS-based network, application, and server monitoring tools such as Nagios, Splunk, AppDynamics, and CloudWatch.
Configured and customized JIRA as a defect-tracking system, implementing workflows and plugins to optimize bug and issue tracking.
Mentored junior engineers on Spark best practices and techniques for real-time data processing.
Environment: AWS, Ansible, Ant, Informatica PowerCenter 6.0, Alteryx, Jenkins, Apache Spark, Bamboo, Splunk, Confluence, Bitbucket, Git, JIRA, Python, SSH, Shell Scripting, Docker, Power BI, JSON, Java/J2EE, Kubernetes, Nagios, Red Hat Enterprise Linux, Terraform, Kibana, Fargate.
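A minimal Flask sketch of the kind of REST endpoint referenced above for exposing curated data to downstream consumers. The route, table, query, and SQLite connection are illustrative stand-ins only; the actual service integrated with an enterprise RDBMS through its own driver.

```python
# Flask sketch of a read-only REST endpoint over a relational store.
# SQLite is used here purely as a placeholder for the project's RDBMS.
from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)

def get_connection():
    # Placeholder connection; swap in the real database driver/DSN.
    return sqlite3.connect("analytics.db")

@app.route("/api/v1/accounts/<account_id>/positions", methods=["GET"])
def get_positions(account_id):
    conn = get_connection()
    try:
        rows = conn.execute(
            "SELECT symbol, quantity, market_value FROM positions WHERE account_id = ?",
            (account_id,),
        ).fetchall()
    finally:
        conn.close()
    return jsonify({
        "account_id": account_id,
        "positions": [
            {"symbol": r[0], "quantity": r[1], "market_value": r[2]} for r in rows
        ],
    })

if __name__ == "__main__":
    app.run(port=5000)
```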

Client: EPAM - Dallas, TX July 2018 - Oct 2020
Role: AWS Data Engineer
Responsibilities:
Prepared design blueprints and application flow documentation, gathering requirements from business partners.
Worked on data pre-processing and cleaning to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; directed machine learning use cases using Spark ML and MLlib.
Maintained the data in the data lake (ETL) coming from the Teradata database, writing an average of 80 GB daily; overall, the data warehouse held 5 PB of data and used a 135-node cluster for processing.
Designed the procedures for moving data from all source systems into the data warehousing system; the data was standardized to store the various business units in tables.
Responsible for creating Hive tables to load data from MySQL using Sqoop, and writing Java snippets to perform cleaning, pre-processing, and data validation.
Experienced in creating Hive schemas and external tables and managing views. Performed join operations in Spark using Hive and wrote HQL statements per user requirements.
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and working with Spark-Shell. Developed Spark code using Java and Spark SQL for faster testing and data processing.
Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in Parquet format.
Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python.
Used Spark SQL to process the massive volume of structured data; implemented Spark DataFrame transformations and the steps to migrate MapReduce algorithms.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Used the DataFrame API to pre-process massive volumes of structured data in various file formats, including text files, CSV, sequence files, XML, JSON, and Parquet, and then turned the data into named columns.
Led the design and implementation of scalable data warehouses, focusing on data modeling, schema design, and efficient data storage.
Developed and maintained star and snowflake schema designs to optimize data retrieval and analysis.
Ensured necessary system security using best-in-class AWS cloud security solutions; experienced in deploying Java projects using Maven/Ant and Jenkins.
Skilled in establishing data lineage using Secoda to track data flow and transformations from source to destination.
Facilitated collaboration among data teams by leveraging Secoda's documentation and commenting features.
Created and maintained comprehensive documentation for data assets, ETL processes, and data models within Secoda.
Integrated Secoda with various data platforms, including Snowflake, Databricks, and Azure, to enhance data accessibility and usability.
Utilized Secoda's API and connectors to streamline data ingestion and metadata synchronization.
Implemented data governance practices in Secoda to ensure data quality, compliance, and security.
Applied DevOps and CI/CD pipeline knowledge (mainly TeamCity and Selenium); implemented continuous integration/delivery (CI/CD) pipelines in AWS when necessary.
Experienced with batch processing of data sources using Apache Spark and developing predictive analytics with the Apache Spark Java APIs; expert in implementing advanced procedures like text analytics and processing using Spark's in-memory computing capabilities in Java.
Worked extensively on the Spark Core and Spark SQL modules; used broadcast variables and accumulators extensively for better performance (see the sketch at the end of this section).
Worked on Apache Spark 2.0, utilizing the Spark SQL and Streaming components to support intraday and real-time data processing.
Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
Analyzed system performance and reliability, identifying areas for improvement and implementing solutions to enhance efficiency.
Collaborated with stakeholders to understand business needs and translate them into system requirements, ensuring alignment between technology and business objectives.
Evaluated multiple alternate solutions, weighing their advantages and disadvantages to make informed decisions.
Provided leadership and guidance to team members, effectively managing multiple priorities and ensuring project deadlines were met.
Played a key role in customer implementations, ensuring seamless integration of platform capabilities and delivering high-quality solutions.
Maintained strong attention to detail and accuracy in all aspects of work, ensuring compliance with regulatory requirements and industry standards.
Communicated effectively with cross-functional teams and senior management through verbal and written channels, presenting complex technical concepts in a clear and concise manner.
Developed complex SQL queries for data analysis, reporting, and data manipulation, leading to a 30% reduction in manual data processing time.
Environment: Hadoop, HDFS, Hive, Java 1.7, Spark 1.6, SQL, HBase, UNIX Shell Scripting, MapReduce, Putty, WinSCP, IntelliJ, Teradata, Linux.
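A short PySpark sketch of the broadcast-variable and accumulator pattern referenced above (the project itself used the Spark Java APIs; this Python version is for illustration only). The lookup table, dataset paths, and column names are hypothetical.

```python
# PySpark sketch: broadcast a small lookup table and count bad rows with an
# accumulator. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-accumulator-sketch").getOrCreate()
sc = spark.sparkContext

# Small lookup table broadcast to every executor instead of being shuffled.
country_codes = {"US": "United States", "IN": "India", "DE": "Germany"}
bc_codes = sc.broadcast(country_codes)

# Accumulator to count rows with an unknown country code.
unknown_count = sc.accumulator(0)

def resolve_country(code):
    name = bc_codes.value.get(code)
    if name is None:
        unknown_count.add(1)
        return "UNKNOWN"
    return name

resolve_udf = F.udf(resolve_country)

orders = spark.read.parquet("hdfs:///data/orders/")  # placeholder path
enriched = orders.withColumn("country_name", resolve_udf(F.col("country_code")))
enriched.write.mode("overwrite").parquet("hdfs:///data/orders_enriched/")

# Accumulator value is reliable only after the write action above has run.
print("rows with unknown country code:", unknown_count.value)
```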

Client: Yamaha Motor Corporation - Cypress, CA Oct 2016 - June 2018
Role: Sr. Data Engineer
Responsibilities:
Gathered business requirements, devised data cleansing and migration strategies, authored functional and technical specifications, created source-to-target mappings, designed data profiling and validation processes in Informatica, and developed ETL jobs in Informatica.
Worked on a Hadoop cluster that ranged from 4-8 nodes during the pre-production stage and was sometimes extended up to 24 nodes during production.
Built APIs that allow customer service representatives to access the data and answer queries.
Designed changes to transform current Hadoop jobs to HBase.
Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
Managed cluster maintenance, monitoring, commissioning, and decommissioning of data nodes. Conducted troubleshooting, oversaw data backups, and managed log files.
Enhanced Hive functionality by developing custom UDFs and UDAFs.
Worked closely with the ETL SQL Server Integration Services (SSIS) Developers to explain the Data Transformation.
Led the development of a new BDW, improving query/report performance, reducing report development time, and establishing a self-service reporting model in Cognos for business users.
Implemented Bucketing and Partitioning using Hive to assist the users with data analysis.
Used Oozie scripts for deployment of the application and perforce as the secure versioning software.
Implemented Partitioning, Dynamic Partitions, and Buckets in Hive (see the sketch at the end of this section).
Developed database management systems for easy access, storage, and retrieval of data.
Performed DB activities such as indexing, performance tuning, and backup and restore.
Proficient in writing Hadoop jobs for data analysis using HiveQL, Pig Latin, and custom MapReduce programs in Java.
Implemented performance optimizations such as using distributed cache for small datasets, partitioning, bucketing in Hive, and map-side joins.
Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries. Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
Created Session Beans and Controller Servlets for handling HTTP requests from Talend.
Performed Data Visualization and Designed Dashboards with Tableau and generated complex reports including charts, summaries, and graphs to interpret the findings to the team and stakeholders.
Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
Utilized Waterfall methodology for team and project management.
Utilized Git for version control with the Data Engineer team and collaborated with Data Scientists. Created Tableau dashboards with various visualizations like stack bars, bar graphs, scattered plots, geographical maps, and Gantt charts using Tableau Desktop and Tableau Server.
Expert in creating Hive UDFs using Java to analyze the data efficiently.
Managed the loading of data from BDW Oracle database and Teradata into HDFS using Sqoop.
Implemented AJAX, JSON, and Javascript to create interactive web screens.
Performed statistical analysis using SQL, Python, R Programming, and Excel.
Worked extensively with Excel VBA macros and Microsoft Access forms.
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux.
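A sketch of the partitioning and bucketing pattern referenced above. On the project this was expressed as HiveQL DDL (PARTITIONED BY / CLUSTERED BY ... INTO n BUCKETS with dynamic partition inserts); the version below shows the equivalent idea through the PySpark writer API, with hypothetical database, table, and column names.

```python
# PySpark sketch: write a partitioned, bucketed table registered in the Hive
# metastore. Database, table, and column names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("sales.orders_staging")  # assumed staging table

(orders.write
    .mode("overwrite")
    .partitionBy("order_date")          # one partition per day
    .bucketBy(16, "customer_id")        # 16 buckets on the join key
    .sortBy("customer_id")
    .format("parquet")
    .saveAsTable("sales.orders_bucketed"))
```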

Client: Nitor Infotech - Pune, Maharashtra Jul 2014 - April 2016
Role: Data Engineer
Responsibilities:
Contributed to the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Proficient in agile methodologies, actively participating in sprint planning, scrum calls, and retrospective meetings. Managed project tracking with JIRA and version control via GitHub.
Designed and executed intricate data pipelines via AWS Glue for efficient transformation and loading of extensive data from diverse sources (see the sketch at the end of this section).
Automated ETL processes through maintenance and optimization of Glue jobs and crawlers, ensuring seamless data processing and analysis.
Designed and developed efficient stored procedures in AWS Redshift to automate complex data transformations, aggregations, and calculations.
Conducted performance tuning and optimization of existing AWS Redshift stored procedures, enhancing system responsiveness and reducing query latency.
Utilized SQL scripting within AWS Redshift's stored procedures to optimize query performance and minimize data movement.
Engineered Spark, Hive, Pig, Python, Impala, and HBase data pipelines for seamless customer data ingestion and processing.
Developed and sustained web applications using Django and Flask, adhering to Model-View-Controller (MVC) architecture for scalability and maintainability.
Orchestrated Amazon EC2 instances creation, troubleshooting, and monitoring, alongside other AWS services for multi-tier application deployment.
Provided Linux and Windows cloud instances support on AWS, configuring Elastic IP, Security Groups, and Virtual Private Cloud.
Implemented and managed Puppet for automated deployments and contributed to Chef and Puppet-based deployment strategies.
Created OpenShift namespaces for on-premises applications transitioning to the cloud in an OpenShift PaaS environment.
Virtualized servers using Docker for testing and development environments, streamlining configuration through Docker containers.
Managed Docker clusters, integrating them with Amazon AWS/EC2 and Google's Kubernetes.
Developed Jenkins CI/CD pipeline jobs for end-to-end automation, overseeing artifact management in Nexus repository.
Configured JIRA as a defect tracking system, implementing workflows and customizations to enhance bug/issue tracking.
Environment: AWS, Python, SSH, Shell Scripting, JSON, Docker, Kubernetes, Red Hat Enterprise Linux, Terraform.
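A hedged sketch of an AWS Glue ETL job of the kind referenced above: read a crawled Data Catalog table, apply a simple column mapping, and write Parquet to S3. The catalog database, table, mappings, and S3 path are placeholders, and the standard Glue PySpark job environment is assumed.

```python
# AWS Glue ETL job sketch; database/table/path names are illustrative only.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from a crawled Data Catalog table (placeholder names).
source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="customer_orders"
)

# Simple column mapping/rename as the transformation step.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "long", "order_id", "long"),
        ("cust_id", "long", "customer_id", "long"),
        ("order_ts", "string", "order_timestamp", "timestamp"),
    ],
)

# Write the result to S3 in Parquet for downstream analytics.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```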

Client: Sagarsoft - Hyderabad, Telangana Dec 2012 - Jun 2014
Role: Data Engineer
Responsibilities:

Played a pivotal role in capturing comprehensive business, system, and design requirements. Conducted gap analysis, and illustrated findings through use case diagrams and flow charts.
Architected a dynamic, cross-device, cross-browser, and mobile-friendly web dashboard utilizing AngularJS, empowering the management of multiple chatbots across diverse environments.
Orchestrated the development of Bot Framework conversation flows utilizing NODE-RED, Node.js, and the MS Bot Framework. Crafted the user interface for the web dashboard utilizing HTML, CSS, Bootstrap, and AngularJS.
Designed, constructed, and managed SSIS packages, enabling seamless data integration and transformation within Azure. Skillfully configured and optimized SSIS Integration Runtime for efficient package execution on the Azure platform.
Pioneered the creation of custom nodes on NODE-RED dashboard, facilitating streamlined conversation building through Node.js over the MS Bot framework.
Actively contributed to the implementation of user authentication mechanisms within the application, leveraging Stormpath and Passports for robust security measures.
Employed a diverse array of Validation Controls for client-side validation. Crafted custom validation controls using Angular validation controls and Angular Material Design, enhancing data integrity.
Engineered Spark applications using PySpark and Spark-SQL for robust data extraction, transformation, and aggregation. Analyzed and transformed data from multiple file formats, unveiling valuable insights into customer usage patterns.
Successfully established a robust CI/CD pipeline leveraging Jenkins and Airflow for containerization via Docker and Kubernetes.
Orchestrated ETL operations using SSIS, NIFI, Python scripts, and Spark Applications. Constructed data flow pipelines, expertly transforming data from legacy tables to Hive, HBase tables, and S3 buckets. This data was handed off to business stakeholders and Data scientists for advanced analytics.
Implemented data quality checks using Spark Streaming, seamlessly categorizing data with bad and passable flags to ensure data integrity and reliability (see the sketch at the end of this section).
Environment: Python, Informatica v9, MS SQL Server, SSIS, T-SQL, SSRS, SQL Server Management Studio, Oracle, Excel.
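A sketch of the data-quality flagging idea referenced above, shown here with PySpark Structured Streaming for illustration (the project used Spark Streaming). The Kafka broker, topic, schema, and thresholds are hypothetical placeholders.

```python
# PySpark Structured Streaming sketch: tag each record 'bad' or 'passable'
# based on simple quality rules. All endpoints, paths, and columns are
# placeholders; the Kafka connector package is assumed to be on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("dq-flagging-sketch").getOrCreate()

schema = StructType([
    StructField("record_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "transactions")                # placeholder topic
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
             .select("r.*"))

# Tag each record based on simple, illustrative quality rules.
flagged = parsed.withColumn(
    "dq_flag",
    F.when(F.col("customer_id").isNull() | (F.col("amount") <= 0), F.lit("bad"))
     .otherwise(F.lit("passable")),
)

query = (flagged.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/flagged/")           # placeholder
         .option("checkpointLocation", "s3a://example-bucket/chk/")  # placeholder
         .start())

query.awaitTermination()
```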