Python Data Engineer
[email protected]
Location: Alba, Texas, USA
Relocation: Yes
Visa: EAD
Bharat Kumar
Email: [email protected] | Phone: +1 947 228 6768 Ext 12

SUMMARY
- Over 9 years of extensive experience designing, developing, and optimizing large-scale data infrastructure systems; proficient in Python programming and Hadoop ecosystem tools.
- Expertise in Big Data technologies including HDFS, HBase, MapReduce, Spark, Storm, Scala, Kafka, Hive, Pig, Flume, Oozie, PostgreSQL, Sqoop, Talend, Impala, and Zookeeper.
- Hands-on experience with data analytics services such as Athena, Glue Data Catalog, and QuickSight, as well as Big Data ecosystems using Hadoop, Impala, Airflow, Snowflake, Teradata, and Oozie.
- Strong knowledge of Agile, Scrum, and Waterfall methodologies, with hands-on experience in data ingestion tools like Kafka and Flume and workflow management tools like Oozie.
- Extensive experience in real-time data analytics using Spark Streaming, Kafka, and Flume, and in developing ETL applications on large volumes of data using MapReduce, Spark-Scala, PySpark, Spark-SQL, and Pig (a brief Structured Streaming sketch follows this summary).
- Proficient in text analytics, developing statistical machine learning solutions, and generating data visualizations using R and Python, with expertise in R packages and libraries.
- Experience in web-based UI development using jQuery UI, jQuery, Ext JS, CSS, HTML, XHTML, and JavaScript, as well as in designing and developing web services (SOAP and RESTful).
- Strong experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, in data processing, and in visualization tools like Tableau and Power BI.
- Hands-on experience with AWS components such as EMR and EC2, and with AWS concepts such as VPC, EBS, Redshift, and CFT.
- Strong experience in developing web services (SOAP, REST, and RESTful) using Python.
- Proficient in writing SQL queries, stored procedures, functions, packages, tables, views, and triggers in relational databases such as Oracle, DB2, MySQL, Sybase, PostgreSQL, and MS SQL Server.
- Experienced in using Docker and Ansible to fully automate the deployment and execution of benchmark suites on clusters of machines.
- Good experience in Linux Bash scripting and adherence to PEP 8 guidelines in Python.
- Extensive knowledge of developing Spark SQL jobs using DataFrames; executed complex HiveQL queries for data extraction from Hive tables and developed Hive UDFs.
- Experience building applications on different operating systems, including Linux (Ubuntu, CentOS, Debian) and macOS.
- Excellent interpersonal and communication skills, efficient time management and organization, and the ability to handle multiple tasks effectively while working well in a team environment.
- Experience in machine learning algorithms and predictive modeling, working with Spark SQL and Scala queries, and various databases including MySQL, SQL, DB2, Oracle, MongoDB, Cassandra, DynamoDB, Redshift, and HBase.
- Experience tracking and logging software application builds using Azure DevOps, and creating pipeline jobs and scheduling triggers using Azure Data Factory.
- Proficient in Google Cloud Platform (GCP) big data products such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (Airflow as a service).
- Hands-on experience with Sqoop for data migration, managing Hadoop clusters using Cloudera Manager, and running queries using Impala, with experience developing Hadoop architecture on Windows and Linux platforms.
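As a brief illustration of the real-time analytics work referenced above, the following is a minimal PySpark Structured Streaming sketch that reads JSON events from a Kafka topic and maintains per-type counts. The broker, topic, and schema are hypothetical placeholders, not details from any specific engagement, and the job assumes the spark-sql-kafka connector package is available.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical broker and topic names used purely for illustration.
    KAFKA_BROKERS = "broker1:9092"
    TOPIC = "clickstream-events"

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Read a stream of JSON events from Kafka; the value column arrives as bytes.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_BROKERS)
        .option("subscribe", TOPIC)
        .load()
        .selectExpr("CAST(value AS STRING) AS raw")
    )

    # Parse an assumed schema and count events per event_type in 1-minute windows.
    schema = "event_type STRING, ts TIMESTAMP"
    counts = (
        events.select(F.from_json("raw", schema).alias("e"))
        .select("e.*")
        .withWatermark("ts", "10 minutes")
        .groupBy(F.window("ts", "1 minute"), "event_type")
        .count()
    )

    query = counts.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()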
Technical Skills:
Programming Languages: Python, Java, JavaScript (Node.js), Perl
Database Management: MySQL, PostgreSQL, MongoDB, SQL Server, SQLite
Data Processing: Apache Hadoop, Apache Spark (PySpark), Hive, Pig, Sqoop
ETL Tools: Apache Airflow, Kafka, AWS S3, Azure Data Factory, Google BigQuery
Web Frameworks: Django, Flask, MEAN Stack (MongoDB, Angular, Node.js), React.js
Version Control: Git, CVS, SVN
Containerization: Docker, Kubernetes
Cloud Platforms: AWS, Azure, Google Cloud Platform
Data Security: Encryption, Data Masking, Secure Data Transmission Methods
API Development: RESTful APIs, GraphQL, WebSocket
Big Data Technologies: Apache Hadoop, Apache Spark, Kafka
DevOps Tools: Ansible, Jenkins, Docker, Kubernetes
Data Visualization: Matplotlib, Seaborn
Web Development: HTML5, CSS3, JavaScript, Django, React.js
Collaboration Tools: GitHub
Database Servers: MySQL, PostgreSQL, MongoDB, Oracle 10g
Other Tools: Apache Kafka, Apache Mesos, Databricks, AWS Lambda, AWS Amplify, DynamoDB, Cognito, Lambda, API Gateway

Professional Experience

Python/Data Engineer | Dell Technologies, TX | Jan 2023 - Present
Responsibilities:
- New data platform: initiated the build and was the sole developer for the first 8 months of a new data platform using AWS, Snowflake, and Astronomer Airflow (Kubernetes Executor) (a minimal DAG sketch follows this role).
- Spearheaded the design and implementation of scalable data pipelines using Apache Spark, Apache Kafka, and Apache Airflow, resulting in a 30% increase in data processing efficiency.
- Led a team in building a real-time data streaming platform utilizing Apache Flink, enabling timely insights for business decision-making.
- Developed robust ETL processes leveraging the Apache Hadoop ecosystem (HDFS, MapReduce, Hive) to handle large volumes of data efficiently.
- Designed and optimized data models using techniques such as partitioning, indexing, and denormalization to improve query performance by 40%.
- Implemented CI/CD pipelines with Jenkins and GitLab for automated testing and deployment of data applications, reducing deployment time by 50%.
- Collaborated with cross-functional teams to gather requirements and deliver customized data solutions tailored to business needs.
- Conducted performance tuning and troubleshooting of SQL queries and database systems (e.g., PostgreSQL, MySQL, Oracle) to enhance overall system performance.
- Implemented data governance policies and procedures to ensure data quality, integrity, and security across all stages of the data lifecycle.
- Led the development of a data platform from scratch, including requirement gathering and analysis.
- Designed tables in Hive and MySQL, processed large datasets of different forms (structured, semi-structured, and unstructured), and imported/exported databases to HDFS using Sqoop.
- Developed REST APIs using Python with the Flask and Django frameworks, integrating data sources such as Java/JDBC, RDBMS, shell scripts, spreadsheets, and text files.
- Worked with Hadoop architecture and daemons (NameNode, DataNode, Job Tracker, Task Tracker, Resource Manager).
- Utilized AWS Data Pipeline for ETL processes from homogeneous or heterogeneous data sources and built graphs for business decision-making using the Python matplotlib library.
- Scheduled jobs using crontab, Rundeck, and Control-M, and built Cassandra queries for CRUD operations.
- Developed frontend and backend modules using Python on the Django web framework, and created the UI using JavaScript, Bootstrap, Cassandra with MySQL, and HTML5/CSS.
- Imported/exported data using Sqoop, developed Spark and Spark-SQL/Streaming code for data processing, analyzed SQL scripts, and implemented solutions using PySpark.
- Used JSON and XML SerDes for serialization/deserialization to load data into Hive tables, and handled structured data using Spark SQL.
- Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
- Added support for Amazon S3 and RDS to host static/media files and the database in the Amazon cloud.
- Worked in a Linux environment, used Jenkins for continuous integration/deployment, and deployed projects using the Git version control system.
- Managed data imported from different sources, performed transformations using Hive, Pig, and MapReduce, and used the Oozie workflow engine for job scheduling.
- Utilized Docker and Nginx to achieve continuous delivery in a highly scalable environment.
- Stored data in MongoDB in JSON format and developed/tested dashboard features using Python, Bootstrap, CSS, and JavaScript.
- Mentored junior team members in data engineering best practices, fostering professional growth and knowledge sharing within the team.
- Proficient in Hadoop ecosystem components: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Kafka, and Spark.
- Hands-on experience with big data processing and storage frameworks, including Apache Hadoop, Apache Spark, and Apache Flink.
- Strong expertise in designing, developing, and maintaining scalable and distributed data processing systems.
- Skilled in programming languages commonly used in the big data field, such as Java, Scala, and Python.
- Experience in data ingestion, extraction, and transformation using various tools and technologies.
- Familiarity with NoSQL databases like MongoDB and Cassandra for efficient data storage and retrieval.
- Designed and implemented data warehouse solutions using tools like Apache Hive for efficient querying and analysis.
- Utilized Apache HBase for real-time, scalable, and distributed database storage.
- Implemented data integration pipelines using Apache Sqoop and Apache Flume for seamless data transfer between Hadoop and external systems.
- Developed ETL processes to cleanse and transform raw data into a usable format for analysis.
- Implemented real-time data processing pipelines using Apache Kafka, ensuring efficient handling of streaming data.
- Proficient in leveraging Apache Spark Streaming for processing and analyzing real-time data.
- Proficient in designing, implementing, and managing large-scale Hadoop clusters comprising hundreds of nodes, ensuring high availability and fault tolerance.
- Proficient in designing, implementing, and maintaining large-scale data pipelines using tools such as Apache Spark, Apache Hadoop, and Apache Flink.
- Skilled in utilizing cloud-based data platforms such as Amazon Web Services (AWS) and Microsoft Azure for scalable and cost-effective data processing.
- Experienced in architecting data solutions leveraging both relational (SQL) and NoSQL databases (MongoDB, Cassandra) for use cases including real-time analytics, machine learning, and business intelligence.
- Expertise in data modeling and ETL (Extract, Transform, Load) processes to ensure data integrity, quality, and consistency across diverse datasets.
- Proficient in programming languages such as Python, Scala, and Java for data manipulation, analysis, and automation tasks.
- Proven track record of collaborating with cross-functional teams, including data scientists, analysts, and software engineers, to deliver end-to-end data solutions that meet business requirements.
- Experienced in deploying and managing containerized applications using Docker and orchestration tools like Kubernetes for scalable and resilient data infrastructure.
- Familiarity with stream processing frameworks like Apache Kafka for real-time data ingestion, processing, and event-driven architectures.
- Strong communication skills, with the ability to effectively communicate technical concepts to both technical and non-technical stakeholders, facilitating collaboration and decision-making.
Environment: Pandas, NumPy, SQL, NoSQL, MySQL, PostgreSQL, MongoDB, Hadoop, Spark, Hive, Pig, Amazon Redshift, Google BigQuery, Snowflake, ETL, Apache Airflow, Luigi, APIs, Docker, AWS, Azure, Google Cloud Platform.
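To illustrate the Airflow-based pipeline work described in this role, the following is a minimal sketch of a daily DAG that stages an extract to S3 and triggers a Snowflake load. The DAG id, bucket, stage, table, and connection names are hypothetical placeholders rather than actual project details, and the sketch assumes Apache Airflow 2.x with the Amazon and Snowflake provider packages installed.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook
    # Assumes the Snowflake provider package is installed.
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


    def stage_to_s3(**context):
        # Upload a locally produced extract to S3; paths and names are illustrative.
        S3Hook(aws_conn_id="aws_default").load_file(
            filename="/tmp/daily_extract.csv",
            key=f"staging/{context['ds']}/daily_extract.csv",
            bucket_name="example-data-platform-bucket",
            replace=True,
        )


    with DAG(
        dag_id="daily_platform_load_sketch",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        stage = PythonOperator(task_id="stage_to_s3", python_callable=stage_to_s3)

        load = SnowflakeOperator(
            task_id="load_to_snowflake",
            snowflake_conn_id="snowflake_default",
            # Hypothetical external stage and target table.
            sql="COPY INTO analytics.daily_extract FROM @staging_stage/{{ ds }}/",
        )

        stage >> load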
Python/Data Engineer | GE HealthCare, IL | Sep 2021 - Dec 2022
Responsibilities:
- Used custom-developed PySpark scripts to pre-process and transform data and map it to tables inside the CIF (Corporate Information Factory) data warehouse (a minimal PySpark sketch follows this role).
- Developed shell scripts for Sqoop jobs to load periodic incremental imports of structured data from various RDBMS to S3, and used Kafka to ingest real-time website traffic data into HDFS.
- Worked with the team and lead developers, interfaced with business analysts, and coordinated with management to understand the end-user experience.
- Used Informatica Designer to create complex mappings using different transformations to move data to a data warehouse.
- Developed mappings in Informatica to load data from various sources into the data warehouse using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
- Used AWS EMR to move large (big) data into other platforms such as AWS data stores, Amazon S3, and Amazon DynamoDB.
- Developed AWS Lambdas using Python and Step Functions to orchestrate data pipelines.
- Worked in an AWS environment for the development and deployment of custom Hadoop applications.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Developed data pipelines in Python for medical image pre-processing, training, and testing.
- Built an artificial intelligence platform on Amazon SageMaker for training, testing, and developing AI models.
- Utilized Python libraries such as Pandas, OpenCV, NumPy, Seaborn, TensorFlow, Keras, Matplotlib, scikit-learn, and NLTK for developing data pipelines and various machine learning algorithms.
- Designed and engineered REST APIs and packages for feature extraction and complex prediction/forecasting algorithms on time-series data.
- Developed a Python application for Google Analytics aggregation and reporting, using Django configuration for managing URLs and application parameters.
- Implemented pre-processing pipelines for DICOM and non-DICOM images and presented analytical insights on medical and image data.
- Implemented and managed workflow scheduling using Apache Oozie to orchestrate complex data workflows.
- Experience in designing and implementing data pipeline automation for improved efficiency.
- Conducted performance tuning of Hadoop and Spark clusters for optimal resource utilization and enhanced processing speed.
- Identified and resolved bottlenecks in data processing pipelines to improve overall system performance.
- Implemented monitoring solutions for Hadoop clusters using tools like Ganglia and Nagios.
- Used AWS Lambda to drive real-time monitoring dashboards from system logs and performed data cleansing and normalization.
- Conducted preliminary data analysis using descriptive statistics, handled anomalies, and developed visualizations such as scatter plots, pie charts, bar charts, box plots, and histograms.
- Developed REST APIs for sending and receiving data in JSON format and configured EC2 instances, IAM users/roles, and S3 data pipes using the Boto API.
- Implemented Agile methodology for building internal applications and developed machine learning algorithms for classification, regression, and deep learning.
- Conducted statistical analysis on healthcare data using Python and various tools, with experience in cloud versioning technologies like GitHub.
- Collaborated closely with data scientists to understand data requirements for experiments, with deep experience in DevOps technologies such as Jenkins, Docker, and Kubernetes.
- Competent in data visualization with Matplotlib and Seaborn, creating compelling graphical representations.
- Proficient in machine learning using TensorFlow and scikit-learn, developing and deploying predictive models.
- Familiarity with natural language processing (NLP) using NLTK and spaCy for text analysis and information extraction.
- Optimized mapping performance through various tests on sources, targets, and transformations.
- Scheduled sessions to extract, transform, and load data into the warehouse database per business requirements using a scheduling tool.
- Created independent data marts from the existing data warehouse per application requirements and updated them on a bi-weekly basis.
- Used the Presto distributed query engine over Hive tables for its high performance and low cost.
- Automated and validated data pipelines using Apache Airflow.
Environment: ETL, SQL, MySQL, PostgreSQL, MongoDB, Apache Hadoop, Spark, Git, Matplotlib, Apache Kafka, AWS, Azure, Google Cloud, DevOps, Kubernetes, CI/CD, RESTful APIs, SCRUM, AML, JSON, Jenkins, Marathon, Chronos, GitHub.
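As a sketch of the PySpark pre-processing and transformation work referenced in this role, the example below reads raw CSV data, standardizes a few columns, and writes partitioned Parquet back to S3. The bucket paths, column names, and schema are hypothetical placeholders, not details of the actual warehouse mappings.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

    # Hypothetical input/output locations used only for illustration.
    SOURCE = "s3://example-raw-bucket/website_traffic/"
    TARGET = "s3://example-curated-bucket/warehouse/traffic/"

    raw = spark.read.option("header", True).csv(SOURCE)

    cleaned = (
        raw.dropDuplicates(["event_id"])                      # assumed unique key
        .withColumn("event_ts", F.to_timestamp("event_ts"))   # normalize timestamps
        .withColumn("country", F.upper(F.trim("country")))    # standardize codes
        .filter(F.col("event_ts").isNotNull())
    )

    # Partition by event date so downstream warehouse loads can prune efficiently.
    (
        cleaned.withColumn("event_date", F.to_date("event_ts"))
        .write.mode("overwrite")
        .partitionBy("event_date")
        .parquet(TARGET)
    )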
Python/Data Engineer | Blue Cross Blue Shield, Rhode Island | Jun 2019 - Aug 2021
Responsibilities:
- Constructed GDPR and data quality frameworks, promoting data privacy and reliability.
- Developed an advanced data lineage/data catalog framework, providing strategic insights into data dependencies.
- Proficient in designing, building, and maintaining scalable data pipelines using tools such as Apache Spark, Apache Flink, and Apache Beam.
- Extensive experience with cloud-based data technologies including AWS (Amazon Web Services), Google Cloud Platform (GCP), and Azure.
- Skilled in data warehousing solutions like Amazon Redshift, Google BigQuery, and Snowflake for efficient data storage and retrieval.
- Skilled in utilizing Hadoop ecosystem components such as HDFS, MapReduce, YARN, Hive, Pig, HBase, Spark, Kafka, and Oozie to process and analyze massive datasets efficiently.
- Experienced in optimizing Hadoop cluster performance by tuning configurations, implementing compression techniques, and employing data partitioning strategies.
- Demonstrated expertise in developing and deploying scalable ETL (Extract, Transform, Load) processes using tools like Apache NiFi, Talend, or Informatica to ingest, cleanse, and transform diverse data sources into Hadoop.
- Proficient in programming languages such as Java, Scala, Python, and R for developing custom MapReduce applications, Spark jobs, and data analysis scripts.
- Familiarity with NoSQL databases like Cassandra, MongoDB, and Couchbase for storing and retrieving semi-structured and unstructured data within the Hadoop ecosystem.
- Expertise in streaming data processing frameworks such as Kafka, Apache Storm, and AWS Kinesis for real-time analytics and event-driven architectures.
- Strong knowledge of distributed computing principles and parallel processing techniques to optimize data processing workflows.
- Hands-on experience with containerization technologies like Docker and orchestration tools like Kubernetes for deploying and managing data applications at scale.
- Proficient in SQL and NoSQL databases like PostgreSQL, MySQL, MongoDB, and Cassandra for managing structured and unstructured data.
- Familiarity with data governance, security, and compliance standards, ensuring data integrity and regulatory adherence.
- Proven track record of collaborating with cross-functional teams including data scientists, analysts, and software engineers to deliver end-to-end data solutions.
- Excellent problem-solving skills and the ability to troubleshoot complex data issues in production environments.
- Strong communication skills, with experience presenting technical concepts to both technical and non-technical stakeholders.
- Pioneered a Redshift performance and data usage tracking tool, optimizing data management and enhancing query performance.
- Created a proprietary Python SQL profiling library, offering superior speed and extended dialect support, outperforming sqlparse and sql-metadata.
- Implemented a custom dependency graph visualization based on matplotlib, fostering greater data transparency and comprehension (a simplified sketch follows this role).
- Conducted a thorough evaluation of data governance tools and orchestrated the design of an enterprise warehouse data model (Medallion Architecture).
- Managed the automation and development of ELT and ETL pipelines, ensuring streamlined and reliable data integration processes.
Environment: ETL, SQL, MySQL, PostgreSQL, MongoDB, Apache Hadoop, Spark, Git, Matplotlib, Apache Kafka, AWS, Azure, Google Cloud, DevOps, Kubernetes, CI/CD, RESTful APIs, SCRUM, AML, JSON, Jenkins, Marathon, Chronos, GitHub.
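The proprietary lineage and profiling tools above are not reproduced here; the following is only a simplified sketch of how a table-level dependency graph could be rendered with matplotlib, using networkx (an assumed additional library) and a small hard-coded edge list in place of parsed SQL.

    import matplotlib.pyplot as plt
    import networkx as nx

    # Hypothetical table-level dependencies; in practice these would come
    # from parsing SQL rather than a literal list.
    edges = [
        ("raw.orders", "staging.orders"),
        ("raw.customers", "staging.customers"),
        ("staging.orders", "mart.daily_sales"),
        ("staging.customers", "mart.daily_sales"),
    ]

    graph = nx.DiGraph(edges)

    # Simple layout and rendering; enough to visualize upstream/downstream tables.
    pos = nx.spring_layout(graph, seed=42)
    nx.draw_networkx(graph, pos, node_color="lightsteelblue", node_size=2200,
                     font_size=8, arrows=True)
    plt.axis("off")
    plt.tight_layout()
    plt.savefig("dependency_graph.png", dpi=150)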
Python/Data Engineer | Mercer, TX | Mar 2016 - May 2019
Responsibilities:
- Developed entire frontend and backend modules using Python on the Django web framework and utilized Django for application development.
- Designed and developed the UI of websites using HTML, AJAX, CSS, and JavaScript, and used CSS Bootstrap for web application development.
- Involved in identifying and resolving issues through effective troubleshooting techniques.
- Implemented update strategies to effectively migrate data from source to target and managed mappings from development to test environments.
- Designed ETL processes using Informatica to load data from flat files and Excel files into the target Oracle data warehouse database.
- Interacted with the business community and database administrators to identify business requirements and data realities.
- Created various transformations in Informatica mappings, such as Source Qualifier, Normalizer, Lookup, Stored Procedure, Sequence Generator, Router, Filter, Aggregator, Joiner, Expression, and Update Strategy.
- Improved workflow performance by optimizing filter placement and table selection during joins, utilized connected and unconnected lookups, and managed tasks and workflows in the Workflow Manager.
- Performed maintenance tasks including managing space, removing bad and cache files, and monitoring services, and set up permissions for groups and users in development environments.
- Designed and developed web services using XML and jQuery and improved performance through modularized approaches and built-in methods.
- Experienced in Agile methodologies and the Scrum process; maintained program libraries, user manuals, and technical documentation.
- Wrote unit test cases for testing tools and was involved in the entire project lifecycle, including design, development, deployment, testing, implementation, and support.
- Built various graphs for business decision-making using the Python matplotlib library, developed applications in a UNIX environment, and used NumPy for numerical analysis of insurance premiums (a small sketch follows this role).
- Handled day-to-day issues, fine-tuned applications for enhanced performance, and implemented code in Python to retrieve and manipulate data.
Environment: Python 2.7, Robot Framework 3.1, ZeroMQ, Twisted, Linux, Git (v2.x), Stash, Jira 7.x, Confluence 6.x, Ubuntu, PyCharm 4.0, PyLint, PEP-8 (1.3), Django 1.6, JavaScript (v5.1), Angular 2, Angular 4, Angular 6, Angular CLI, MySQL, TypeScript, Azure, Snowflake, NetApp Servers.
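As a small illustration of the NumPy/matplotlib analysis mentioned above, the sketch below computes summary statistics for a batch of premium amounts and plots their distribution. The figures are synthetic placeholders rather than client data, and the code uses modern Python 3/NumPy syntax rather than the Python 2.7 stack listed in the environment.

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic premium amounts standing in for real policy data.
    rng = np.random.default_rng(7)
    premiums = rng.normal(loc=1200.0, scale=250.0, size=5000).clip(min=200.0)

    # Basic descriptive statistics used to sanity-check the batch.
    print(f"mean:   {premiums.mean():.2f}")
    print(f"median: {np.median(premiums):.2f}")
    print(f"p95:    {np.percentile(premiums, 95):.2f}")

    # Histogram of the premium distribution for business review.
    plt.hist(premiums, bins=40, color="steelblue", edgecolor="white")
    plt.title("Distribution of annual premiums (synthetic data)")
    plt.xlabel("Premium (USD)")
    plt.ylabel("Policies")
    plt.tight_layout()
    plt.savefig("premium_distribution.png", dpi=150)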
Python/Data Engineer | MetaMinds, India | Jul 2013 - Aug 2015
Responsibilities:
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a minimal producer sketch follows this role).
- Proficient in the Python programming language, with expertise in developing scalable and efficient applications.
- Experienced in utilizing Python frameworks such as Django and Flask to build web applications from scratch, ensuring high performance and security.
- Skilled in object-oriented programming (OOP) principles, leveraging them to design robust and maintainable codebases.
- Profound understanding of data structures and algorithms, implementing them to optimize application performance and solve complex problems.
- Strong knowledge of relational databases, including MySQL and PostgreSQL; proficient in designing schemas and writing efficient queries.
- Experienced in utilizing NoSQL databases like MongoDB for storing and managing unstructured data efficiently.
- Familiarity with front-end technologies such as HTML, CSS, and JavaScript, enabling seamless integration with backend services.
- Proficient in version control systems such as Git, adept at collaborating with teams and managing code repositories effectively.
- Experienced in utilizing continuous integration and continuous deployment (CI/CD) pipelines to automate software testing and deployment, ensuring rapid delivery of high-quality software.
- Skilled in using Docker for containerization, enabling the development and deployment of applications in isolated environments.
- Knowledgeable in cloud platforms such as AWS and Azure, experienced in deploying and managing applications on cloud infrastructure.
- Strong troubleshooting and debugging skills, proficient in identifying and resolving issues to ensure the smooth functioning of applications.
- Proven ability to work in agile development environments, collaborating with cross-functional teams to deliver software solutions that meet business requirements.
- Knowledgeable in containerization technologies like Docker, facilitating easy deployment and management of applications.
- Experienced in test-driven development (TDD) and writing unit tests using frameworks like pytest to ensure code reliability and maintainability.
- Skilled in working with RESTful APIs and integrating third-party services to enhance application functionality.
- Familiarity with CI/CD (Continuous Integration/Continuous Deployment) pipelines for automating the software delivery process.
- Excellent problem-solving skills and the ability to troubleshoot issues in a timely manner.
- Strong communication skills, with the ability to collaborate effectively with cross-functional teams and stakeholders.
Environment: Python 2.7, Django 1.7, Luigi, Windows, Linux, MySQL, SQL, Cassandra, AWS RDS, AWS S3, AWS EC2, Kafka, JSON, React, RESTful API, MVC architecture, GitLab, Agile, Enterprise Scheduler, Bitvise SSH Client, Scrum, JIRA, Git.
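As a minimal sketch of the Kafka producer work described in this role, the example below polls a REST endpoint and publishes each record to a Kafka topic using kafka-python. The endpoint URL, broker address, and topic name are hypothetical placeholders, and the code uses current Python 3 syntax rather than the Python 2.7 stack listed in the environment.

    import json
    import time

    import requests
    from kafka import KafkaProducer

    # Hypothetical endpoint, broker, and topic used only for illustration.
    API_URL = "https://api.example.com/v1/events"
    BOOTSTRAP = "localhost:9092"
    TOPIC = "external-events"

    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP,
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    while True:
        # Pull the latest batch from the external REST API.
        response = requests.get(API_URL, timeout=10)
        response.raise_for_status()

        # Publish each record to the Kafka topic.
        for record in response.json():
            producer.send(TOPIC, value=record)

        producer.flush()
        time.sleep(30)  # simple fixed polling interval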