Gayathri V - Data Engineer
[email protected]
Location: Dayton, Ohio, USA
Relocation: Yes
Visa:
GAYATHRI V
PROFESSIONAL SUMMARY
Highly skilled and results-driven Data Engineer with over 8 years of experience designing, building, and optimizing scalable data pipelines, big data architectures, and cloud-based solutions. Proficient in Python, Spark, Scala, and SQL, with expertise in tools such as Kubernetes, Airflow, Databricks, and Snowflake. Adept at managing large-scale data processing systems, leveraging AWS, Azure, and Google Cloud platforms for efficient data storage, transformation, and analysis. Demonstrated ability to implement advanced machine learning models, perform feature engineering, and create insightful visualizations using Tableau and Matplotlib. Strong problem-solving skills and a proven track record of improving data infrastructure, ensuring data quality, and delivering actionable insights to drive business decisions.
- Experience validating data from source to target (ETL process), checking the completeness and accuracy of the data with SQL-based testing techniques.
- Experienced with Big Data technologies such as Hive, Sqoop, HDFS, and Spark; good exposure to cloud technologies such as AWS S3, EC2, and Redshift.
- Strong programming experience with Scala, Python, and SQL.
- Experience developing a data pipeline that uses Kafka to land data in HDFS (see the illustrative sketch below).
- Good knowledge of loading structured data from Oracle and MySQL databases into HDFS using Sqoop.
- Experience working with NoSQL databases such as HBase, Cassandra, and MongoDB.
- Experience developing custom UDFs in Python to extend Hive and Pig Latin functionality.
- Extensively experienced Big Data/Hadoop developer with varying levels of expertise across Hadoop ecosystem projects, including Spark Streaming, HDFS, MapReduce, Hive, HBase, Storm, Kafka, Flume, Sqoop, ZooKeeper, and Oozie.
- Experience in ETL processes covering data sourcing, mapping, transformation, conversion, and loading.
- Utilized AWS services such as S3, EMR, Redshift, Athena, and the Glue metastore to build data lakes in the cloud and create data engineering pipelines.
- Experience as an Azure Cloud Data Engineer with Microsoft Azure technologies including Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, and Azure HDInsight big data technologies such as Hadoop, Apache Spark, and Azure Databricks.
- Solid experience working with CSV, text, Avro, Parquet, ORC, and JSON data formats.
- Accustomed to all phases of the SDLC under Waterfall, Iterative, and Agile methodologies.
- Performed data analysis using SQL, PL/SQL, Python, Spark, Databricks, Teradata SQL Assistant, SQL Server Management Studio, and SAS.
- Efficient with SQL, Python, and R for data wrangling, data exploration, and data analysis.
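Illustrative example - the Kafka-to-HDFS pipeline mentioned above is commonly built with Spark Structured Streaming; a minimal PySpark sketch of that pattern, where the broker address, topic name, and HDFS paths are placeholder assumptions rather than details from this resume:

    # Minimal sketch, assuming placeholder broker/topic/paths; requires the
    # spark-sql-kafka package to be supplied at spark-submit time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

    # Read raw events from a Kafka topic as a streaming DataFrame.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")    # placeholder broker
        .option("subscribe", "events_topic")                   # placeholder topic
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )

    # Continuously land the stream in HDFS as Parquet files.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/raw/events")             # placeholder landing path
        .option("checkpointLocation", "hdfs:///checkpoints/events")
        .start()
    )
    query.awaitTermination()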
TECHNICAL SKILLS
- Cloud Management: AWS (EC2, EMR, S3, Redshift, Lambda, Snowball, Athena, Glue, DynamoDB, RDS, Aurora, IAM, Firehose), Azure (Databricks, Data Explorer, Data Lake Storage Gen2, Data Factory, Airflow, Blob Storage, File Storage, SQL DB, Synapse Analytics, App Service, Kubernetes Service), Google Cloud Platform (GCP)
- Big Data Technologies: Hadoop Distributed File System (HDFS), MapReduce, Apache Spark, Spark Streaming, Kafka, Hive, Pig, Impala, HBase, Sqoop, Flume, Oozie, ZooKeeper, YARN, Snowflake, Cassandra
- Databases: MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, DynamoDB, Redis, Cassandra, Azure SQL DB, Azure Synapse, Teradata
- Programming Languages: SQL, PL/SQL, HiveQL, Python, PySpark, Scala, R, Shell Scripting, Java, Regular Expressions
- Version Control: Git, GitHub
- Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu
- Containerization Tools: Kubernetes, Docker, Docker Swarm
- APIs and Web Frameworks: Django, APIs (third-party integrations)
- Development Environments and IDEs: Eclipse, Visual Studio
- Methodologies: Agile Methodology, JIRA, Waterfall Methodology

EDUCATION
- University of Central Missouri, USA - Master's in Computer Science
- Vellore Institute of Technology, India - Bachelor's in Computer Science and Engineering

WORK EXPERIENCE

CLIENT 1: STANLEY BLACK AND DECKER, MISSION, TEXAS (AUG 2023 - PRESENT)
ROLE: DATA ENGINEER
RESPONSIBILITIES:
- Supported the eBusiness team in enabling easy access to data via web scraping and data mining, and helped design a content-based recommendation system to predict product recommendations.
- Collaborated with external partners to collect product data using Python.
- Worked on Big Data architecture projects for over 2 years, focusing on the design and implementation of scalable, efficient data processing pipelines.
- Demonstrated proficiency in writing code in Python, Spark, and Scala, enabling the development of custom data solutions.
- Applied a deep understanding of data stores, including Postgres, Redis, Kafka, and Elasticsearch, to design and optimize data storage and retrieval processes.
- Extensive experience with public cloud platforms such as AWS, Azure, and Google Cloud and their Kubernetes offerings, ensuring seamless integration and deployment of data solutions.
- Actively involved in the implementation and management of data solutions on cloud platforms, specifically Airflow, Databricks, and Snowflake, to enhance data processing capabilities.
- Performed ETL extractions to produce cosmetics sales attribute data using SQL queries in MySQL, including GROUP BY queries, joins, and subqueries.
- Built a simple recommendation system in Python in a Jupyter Notebook based on previous cosmetics attributes.
- Used Jenkins pipelines to drive all microservices builds out to the Docker registry and then deployed them to Kubernetes.
- Conducted structured thinking on the problem and generated assumptions.
- Performed Exploratory Data Analysis (EDA), feature pivoting, and visualization on the attribute data to identify trends and validate assumptions using Python Seaborn and Matplotlib.
- Performed univariate and multivariate analysis to assess the previous assumptions using Python.
- Pre-processed raw data with Python Pandas as the ETL tool, performing data cleaning including missing-data treatment and the removal of redundant values, inconsistent information, and outliers.
- Transformed categorical values to numerical values for easier model application using Python Pandas with aggregate and lambda functions (see the illustrative sketch below).
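Illustrative example - the cleaning and encoding steps described in the two bullets above typically look something like the following Pandas sketch; the input file and column names are placeholder assumptions:

    # Minimal sketch, assuming a placeholder CSV and column names.
    import pandas as pd

    df = pd.read_csv("cosmetics_attributes.csv")            # placeholder source file

    # Missing-data treatment: fill numeric gaps with the median, drop rows missing the label.
    df["price"] = df["price"].fillna(df["price"].median())
    df = df.dropna(subset=["recommended"])

    # Outlier removal: keep ratings within three standard deviations of the mean.
    mean, std = df["rating"].mean(), df["rating"].std()
    df = df[(df["rating"] - mean).abs() <= 3 * std]

    # Categorical-to-numerical conversion via a lambda, plus one-hot encoding.
    df["recommended"] = df["recommended"].apply(lambda v: 1 if v == "yes" else 0)
    df = pd.get_dummies(df, columns=["brand", "category"], drop_first=True)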
- Implemented data governance and security practices on Azure to ensure data privacy, compliance, and data integrity within Kubernetes clusters.
- Managed Docker orchestration and containerization with Kubernetes, using it to deploy, scale, and manage Docker containers.
- Strong proficiency in Kubernetes and Databricks cluster environments; led the setup and installation of Kubernetes clusters, Databricks clusters, and cloud architecture clusters, ensuring seamless and scalable data infrastructure.
- Automated builds using Maven and scheduled automated nightly builds using Jenkins.
- Built a Jenkins pipeline to drive all microservices builds out to the Docker registry and then deployed them to Kubernetes.
- Visualized cleaned sales data using Python Matplotlib and Tableau.
- Developed Tableau visualizations and dashboards with multiple panels and parameters in Tableau Desktop to show histograms, scatter plots, box plots, and correlation tables.
- Updated the MySQL database with cleaned brand data and developed stored procedures, tables, and views for easier data processing.
- Hands-on use of EC2, EMR clusters, Redshift, Databricks, S3 buckets, and IaaS/PaaS/SaaS offerings.
- Explored feature distributions and performed feature selection and feature engineering on the cleaned data, creating dummy variables for easier model implementation in Python.
- Chose relevant classification methods and performed hyperparameter tuning to determine optimal values for models including Logistic Regression, Decision Tree, and Random Forest.
- Implemented multiple classification methods to predict the recommendation level of beauty products using Python, obtaining an accuracy of 72.7%.
- Calculated metric scores (precision, recall, F1 score, etc.) to evaluate model performance using Python scikit-learn.
- Ran A/B tests to optimize the recommendation system, tracking measures such as click-through rate (CTR) and conversion rate (CR).
- Analyzed results from the A/B test, generated assumptions, and conducted t-tests to validate them; explored the reasons behind the results and proposed further improvements.
- Created an EC2 instance to deploy the images to the AWS cloud.

Environment: Python, Spark, Scala, MySQL, Postgres, Redis, Kafka, Elasticsearch, AWS (EC2, EMR, Redshift, S3), Azure, Google Cloud Platform (GCP), Kubernetes, Docker, Jenkins, Maven, Apache Airflow, Databricks, Snowflake, Tableau, Matplotlib, Seaborn, Scikit-learn, Pandas, Git, Linux/Unix

CLIENT 2: DEUTSCHE BANK, CARY, NORTH CAROLINA (JUN 2021 - AUG 2023)
ROLE: BIG DATA ENGINEER
RESPONSIBILITIES:
- Implemented partitioning, dynamic partitions, and bucketing in Hive (see the illustrative sketch below).
- Developed database management systems for easy access, storage, and retrieval of data; performed DB activities such as indexing, performance tuning, and backup and restore.
- Wrote Hadoop jobs for analyzing data using HiveQL queries, Pig Latin (data flow language), and custom MapReduce programs in Java.
- Exported data into Snowflake by creating staging tables to load files of different formats from Amazon S3.
- Expert in creating Hive UDFs in Java to analyze data efficiently.
- Responsible for loading data from the BDW Oracle database and Teradata into HDFS using Sqoop.
- Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
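Illustrative example - the Hive partitioning and bucketing noted above is usually expressed as HiveQL DDL; a minimal sketch submitted through PySpark, where the database, table, and column names are placeholder assumptions:

    # Minimal sketch, assuming placeholder database/table/column names.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partitioning-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Allow dynamic partition inserts.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Partitioned, bucketed Hive table stored as ORC.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS finance.transactions (
            txn_id     STRING,
            account_id STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        CLUSTERED BY (account_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition load from a staging table.
    spark.sql("""
        INSERT OVERWRITE TABLE finance.transactions PARTITION (txn_date)
        SELECT txn_id, account_id, amount, txn_date
        FROM finance.transactions_staging
    """)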
- Connected Tableau to AWS Redshift to extract live data for real-time analysis.
- Wrote documentation for each report, including purpose, data source, column mapping, transformations, and user group.
- Utilized the Waterfall methodology for team and project management.
- Used Git for version control with Data Engineer and Data Scientist colleagues.
- Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. with the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend.
- Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Gathered business requirements, developed the strategy for data cleansing and data migration, wrote functional and technical specifications, created source-to-target mappings, and designed data profiling, data validation, and ETL jobs in Informatica.
- Imported, cleaned, filtered, and analyzed data using tools such as SQL, Hive, and Pig.
- Used Python and SAS to extract, transform, and load source data from transaction systems and to generate reports, insights, and key conclusions.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries.
- Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Built APIs that allow customer service representatives to access the data and answer queries.
- Extended the functionality of Hive with custom UDFs and UDAFs.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.
- Implemented bucketing and partitioning in Hive to assist users with data analysis.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly with quick filters for on-demand information.

Environment: Python, SQL, Hive, Pig, Hadoop, HDFS, MapReduce, Java, Snowflake, AWS (S3, Redshift), Oracle, Teradata, Sqoop, Oozie, Tableau, Git, Talend, Informatica, MongoDB, SAS, NoSQL, Cognos, Waterfall methodology, Linux/Unix

CLIENT 3: EARLY WARNING SERVICES, LLC, SCOTTSDALE, ARIZONA (FEB 2019 - MAY 2021)
ROLE: DATA ENGINEER
RESPONSIBILITIES:
- Developed Talend Big Data jobs to load heavy volumes of data into the S3 data lake and then into Snowflake (see the illustrative load sketch below).
- Developed Snowpipes for continuous ingestion of data using an event handler from an AWS S3 bucket.
- Developed SnowSQL scripts to deploy new objects and push changes into Snowflake.
- Developed a Python script to integrate DDL changes between the on-premises Talend warehouse and Snowflake.
- Worked with the AWS stack: S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.
- Designed and implemented new Hive tables, views, and schemas, storing data optimally.
- Ran Sqoop jobs to land data on HDFS and performed validations.
- Configured Oozie scheduler jobs to run the extract jobs and queries in an automated way.
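Illustrative example - the S3-to-Snowflake loads described above ultimately come down to COPY INTO statements against a stage; a minimal Python-driven sketch using the Snowflake connector, where the account, credentials, stage, and table names are placeholder assumptions:

    # Minimal sketch, assuming placeholder credentials and an existing external stage
    # (@raw_s3_stage) that points at the S3 data lake.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # placeholder
        user="etl_user",           # placeholder
        password="***",            # placeholder; use a secrets manager in practice
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    try:
        cur = conn.cursor()
        # Bulk-load staged CSV files into a staging table.
        cur.execute("""
            COPY INTO RAW.SALES_STAGING
            FROM @raw_s3_stage/sales/
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            ON_ERROR = 'ABORT_STATEMENT'
        """)
    finally:
        conn.close()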
- Queried data while optimizing queries and improving query performance.
- Designed and created SQL Server tables, views, stored procedures, and functions.
- Performed ETL operations using Apache Spark, along with ad-hoc queries and related optimization techniques.
- Configured CI/CD for CaaS deployments (Kubernetes).
- Involved in migrating master data from Hadoop to AWS.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Worked on Azure Databricks, Spark, HDInsight, U-SQL, T-SQL, Spark SQL, Azure SQL Data Warehouse (ADW), and Hive to load and transform data.
- Designed, implemented, and maintained secure data pipelines on Azure using services such as Azure Data Factory, Azure Databricks, and Azure Data Lake Storage, with a strong focus on Azure Firewall for network security.
- Used Jenkins pipelines to drive all microservices builds out to the Docker registry and then deployed them to Kubernetes; created and managed Pods with Kubernetes.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Worked on Big Data infrastructure for batch and real-time processing using Apache Spark.
- Developed Apache Spark applications in Scala for data processing from various streaming sources.
- Processed web server logs by developing multi-hop Flume agents with the Avro sink, loaded the data into Cassandra for further analysis, and extracted files from Cassandra through Flume.
- Responsible for the design and development of Spark SQL scripts based on functional specifications.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and Cassandra.
- Converted Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables from Spark for faster data processing.
- Developed helper classes that abstract the Cassandra cluster connection and act as a core toolkit.
- Created a Data Lake by extracting customer data from various sources into HDFS, including data from Excel, databases, and server log data.
- Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
- Extracted files from Cassandra through Sqoop, placed them in HDFS, and processed them with Hive.
- Wrote MapReduce (Hadoop) programs to convert text files into Avro and load them into Hive tables.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Extended Hive/Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs).
- Loaded data from REST endpoints into Kafka producers and transferred the data to Kafka brokers (see the illustrative sketch below).
- Used Apache Kafka features such as distribution, partitioning, and the replicated commit log service for messaging, and partitioned data streams using Kafka.
- Designed and configured the cloud architecture cluster and Kafka cluster to accommodate heavy throughput.
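Illustrative example - feeding Kafka from a REST endpoint, as described above, often looks like the following keyed-producer sketch using the kafka-python client; the endpoint URL, brokers, topic, and key field are placeholder assumptions:

    # Minimal sketch, assuming placeholder endpoint, brokers, topic, and payload shape.
    import json
    import time

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092", "broker2:9092"],    # placeholder brokers
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    while True:
        resp = requests.get("https://api.example.com/events")  # placeholder REST endpoint
        resp.raise_for_status()
        for event in resp.json():
            # Keying by an entity id keeps related events in the same partition.
            producer.send("events_topic", key=str(event["id"]), value=event)
        producer.flush()
        time.sleep(30)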
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Used Apache Oozie for scheduling and managing multiple Hive jobs.
- Knowledge of HCatalog for Hadoop-based storage management.
- Migrated an existing on-premises application to Amazon Web Services (AWS), using services such as EC2 and S3 for small data set processing and storage; experienced in maintaining the Hadoop cluster on AWS EMR.
- Developed solutions to pre-process large sets of structured and semi-structured data in different file formats such as Text, Avro, Sequence, XML, JSON, and Parquet.
- Generated various kinds of reports using Pentaho and Tableau based on client specifications.
- Gained exposure to new tools such as Jenkins, Chef, and RabbitMQ.
- Worked with the Scrum team to deliver agreed user stories on time in every sprint.

Environment: Talend, Snowflake, AWS (S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, Lambda), Python, SQL, Hive, Sqoop, Oozie, Apache Spark, Scala, Hadoop, HDFS, MapReduce, Azure (Data Factory, Databricks, Data Lake Storage, HDInsight, ADW), Jenkins, Kubernetes, Docker, Cassandra, Kafka, Flume, Tableau, Git, RabbitMQ, U-SQL, T-SQL, Spark SQL, Avro, JSON, XML, Linux/Unix

CLIENT 4: HUMANA, LOUISVILLE, KENTUCKY (AUG 2017 - FEB 2019)
ROLE: DATA ENGINEER
RESPONSIBILITIES:
- Architected and deployed an enterprise-grade Data Lake, facilitating advanced analytics, processing, storage, and reporting on high-velocity, large-scale data sets.
- Ensured the integrity and quality of reference data within source systems through careful cleaning and transformation, working closely with stakeholders and the solution architect.
- Designed and implemented tabular models on Azure Analysis Services, directly aligned with business intelligence and reporting needs.
- Utilized Control-M's monitoring and alerting features to proactively identify and address job failures or performance issues.
- Directed data ingestion operations across Azure services for a data volume exceeding 200 TB, optimizing processing workflows within Azure Databricks and achieving a 25% improvement in data processing time.
- Designed and executed complex data pipelines and transformations using ADF and PySpark within Databricks to meet intricate data flow requirements.
- Managed and optimized data storage and loading procedures for Azure Blob and Azure Data Lake Storage, ensuring seamless integration with Azure Synapse Analytics (SQL DW).
- Conducted comprehensive data lineage and impact analysis using automated tools, ensuring transparency and trust in data transformations and usage.
- Built resilient Python, Spark, and Bash scripts to streamline the transformation and loading of data across complex hybrid environments.
- Maximized the capabilities of Apache Spark, employing the SQL and Streaming modules to support intraday and real-time data processing operations.
- Optimized data processing efficiency using Spark SQL for Scala and Python, enabling seamless RDD-to-DataFrame conversions and significantly accelerating analysis and system performance; streamlined data manipulation, markedly improving processing speed and reliability.
- Implemented automated data validation and cleansing using Apache Spark and Python, significantly enhancing data quality and integrity (see the illustrative sketch below).
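Illustrative example - the automated validation and cleansing described in the bullet above might be sketched in PySpark as follows; the paths, column names, and rules are placeholder assumptions:

    # Minimal sketch, assuming placeholder mount paths, columns, and validation rules.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("validation-sketch").getOrCreate()

    raw = spark.read.parquet("/mnt/datalake/raw/claims")        # placeholder input path

    # Cleansing: trim identifiers, normalise types, drop duplicate claims.
    clean = (
        raw.withColumn("member_id", F.trim(F.col("member_id")))
           .withColumn("claim_amount", F.col("claim_amount").cast("double"))
           .dropDuplicates(["claim_id"])
    )

    # Validation rules: route failing rows to a quarantine area for review.
    rules = (
        F.col("member_id").isNotNull()
        & (F.col("claim_amount") >= 0)
        & F.col("claim_date").isNotNull()
    )
    clean.filter(rules).write.mode("overwrite").parquet("/mnt/datalake/curated/claims")
    clean.filter(~rules).write.mode("overwrite").parquet("/mnt/datalake/quarantine/claims")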
- Reduced errors and inconsistencies, ensuring accurate and reliable analytics outcomes.
- Applied performance optimization strategies, achieving a 60% improvement in query execution time through distributed caching, data partitioning, and Hive bucketing techniques.
- Led the creation of a scalable ETL framework, streamlining future migrations from RDBMS systems to Data Lakes by leveraging Spark data sources and Hive objects.
- Leveraged NoSQL databases such as MongoDB and Cassandra for high-velocity transactions and unstructured data, optimizing scalability and performance and enabling flexible data models for efficient management and analysis of diverse data.
- Led performance tuning for big data applications, employing optimization techniques to enhance processing speed and reduce latency, enabling real-time analysis and quicker decision-making and significantly improving system efficiency across the organization.
- Orchestrated extensive database imports and exports using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS), streamlining data integration pipelines and enabling seamless data movement and transformation across systems.

Environment: Azure (Azure Databricks, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Blob Storage, Azure Analysis Services), Apache Spark (SQL and Streaming modules), PySpark, Python, Bash, Hive, Control-M, SQL Server Integration Services (SSIS), Data Transformation Services (DTS), MongoDB, Cassandra, Scala, RDBMS, NoSQL databases

CLIENT 5: AMGEN, NEW ALBANY, OHIO (JUN 2016 - AUG 2017)
ROLE: SOFTWARE DEVELOPER
RESPONSIBILITIES:
- Developed SQL scripts for automation purposes.
- Performed data conversions and data loads from various databases and file structures.
- Monitored, tuned, and analyzed database performance and allocated server resources to achieve optimum database performance.
- Spearheaded the development of a customer-facing web application using Django, resulting in a 20% increase in user engagement.
- Integrated third-party APIs to enhance application functionality and user experience.
- Wrote Python scripts to parse JSON documents and load the data into the database (see the illustrative sketch below).
- Performed data validation and data quality checks for various data sources.
- Troubleshot, fixed, and deployed many Python bug fixes for the two main applications that were the primary source of data for both customers and the internal customer service team.
- Worked in an Agile methodology and used JIRA to maintain the project's user stories.
- Involved in requirements gathering, design, development, and testing.

Environment: SQL, Python, Django, JSON, APIs (third-party integrations), JIRA, Agile Methodology, databases (various RDBMS), server resource management tools.
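Illustrative example - parsing JSON documents and loading them into a relational table, as mentioned above, can be sketched as follows; the input file, table, and columns are placeholder assumptions, and SQLite stands in for the actual production database:

    # Minimal sketch, assuming a placeholder JSON file and table layout; sqlite3 is a
    # stand-in for the real relational database.
    import json
    import sqlite3

    with open("orders.json") as fh:                 # placeholder input file
        records = json.load(fh)

    conn = sqlite3.connect("app.db")                # placeholder database
    conn.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id TEXT PRIMARY KEY,
            customer TEXT,
            total    REAL
        )
    """)
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, customer, total) VALUES (?, ?, ?)",
        [(r["order_id"], r["customer"], r["total"]) for r in records],
    )
    conn.commit()
    conn.close()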