Praveen - Hadoop/Big Data Lead Developer |
[email protected] |
Location: Atlanta, Georgia, USA |
Relocation: Yes |
Visa: H1B |
Praveen Ganghishetti
Lead Big Data Developer | +1 (678) 707-2060 | [email protected] | Atlanta, GA

Summary:
- 16+ years of IT experience in analysis, design, development, and implementation in the areas of Big Data and data warehousing technologies using Hadoop, Spark, Python, Java, Informatica, Oracle SQL, PL/SQL, and Teradata.
- Good work exposure on the Cloudera and Hortonworks distribution platforms.
- Solid hands-on development in Hadoop technologies including Spark (Python and Java), Hive, Impala, Sqoop, and AWS EMR.
- Extensive experience with the Spark distributed framework, including Resilient Distributed Datasets (RDDs) and DataFrames (Spark SQL), using Python, Java 8, and Scala.
- Good experience using the Pandas and NumPy libraries in Python.
- Used Java 8 to create Spark RDDs and Spark Datasets.
- Used Spark SQL to read from and write to external RDBMSs such as MySQL and Oracle (see the sketch after this summary).
- Good exposure to Big Data Spark development using Scala.
- Stored data in compressed formats such as ORC, Parquet, and Avro in HDFS using Spark.
- Pushed documents into Elasticsearch using PySpark.
- Good knowledge of the Kafka Producer, Broker, and Consumer APIs.
- Experienced in loading data into Hive partitions and creating buckets in Hive.
- Loaded log data into HDFS using Kafka.
- Worked on importing and exporting data using Sqoop between HDFS and relational database systems.
- Experienced with monitoring tools, checking cluster status using Cloudera Manager.
- Worked on both batch and streaming data processing using Spark.
- Extensively used basic Python constructs such as map, reduce, filter, and lambda functions.
- Good knowledge of legacy Spark Streaming (DStreams) and Spark Structured Streaming.
- Implemented Spark on Amazon EMR for processing Big Data in the AWS ecosystem.
- Good experience with Google Cloud Platform (GCP): BigQuery, Dataproc, Databricks on GCP, and Cloud Composer.
- Created data ingestion pipelines using AWS Lambda functions.
- Worked on scheduling Lambda functions using Amazon EventBridge.
- Created crawlers on S3 data to build Glue Catalog tables.
- Created and ran Glue jobs to transform data from one schema to another.
- Used Python Boto3 to manage Glue Catalogs and S3 buckets.
- Good working knowledge of Global Secondary Indexes, Local Secondary Indexes, and Streams in Amazon DynamoDB.
- Built a transactional layer on top of Amazon S3 using Apache Hudi in the AWS Lakehouse.
- Strong knowledge of OLAP systems, Kimball and Inmon methodologies, and dimensional data modeling using star and snowflake schemas.
- Experience in extracting, transforming, and loading (ETL) data from various sources into data marts and data warehouses using Informatica PowerCenter components (Repository Manager, Designer, Workflow Manager, and Workflow Monitor).
- Extensively worked on Teradata Primary, Secondary, Partitioned Primary, and Join Indexes.
- In-depth expertise in the Teradata cost-based query optimizer; identified potential bottlenecks with queries from the aspects of query writing, skewed redistributions, join strategies, optimizer statistics, and physical design considerations (UPI, NUPI, USI, NUSI, PPI, and JI).
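As an illustration of the Spark SQL usage called out in the summary, here is a minimal PySpark sketch that reads a table from an external RDBMS over JDBC and stores it in HDFS as compressed Parquet. The connection URL, credentials, table, column, and path names are illustrative placeholders, not details from any of the projects below.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read an RDBMS table over JDBC and persist it as
# compressed Parquet in HDFS. Connection details, table, columns,
# and paths are illustrative placeholders.
spark = (SparkSession.builder
         .appName("jdbc-to-parquet-sketch")
         .enableHiveSupport()
         .getOrCreate())

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")   # placeholder URL
          .option("dbtable", "orders")                        # placeholder table
          .option("user", "etl_user")
          .option("password", "****")
          .option("driver", "com.mysql.cj.jdbc.Driver")
          .load())

# Keep only current-year rows and write them partitioned by order date.
current = orders.filter(orders["order_year"] == 2024)
(current.write
        .mode("overwrite")
        .option("compression", "snappy")
        .partitionBy("order_date")
        .parquet("hdfs:///data/raw/orders"))
```

Swapping the JDBC URL and driver class would point the same read at Oracle instead of MySQL.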
Education:
- Master of Technology in Information Technology, Hyderabad Central University (HCU), India, 2011
- Bachelor of Technology in Computer Science, JNTU, India, 2007

Technical skills:
Languages: Python, Java, SQL, PL/SQL, T-SQL, Unix Shell Scripting
Big Data: Hadoop, MapReduce, Hive, Sqoop, Spark Core and Spark SQL (Python, Java 8, Scala), Amazon EMR, Spark Streaming, Kibana, SOLR, Elasticsearch
RDBMS: Oracle 8i/9i/10g, Teradata V2R5/V2R6/12, MS SQL Server 2008, DB2, MySQL
Tools: Apache Elasticsearch, Apache SOLR, LireSOLR, SVN, Rally, DbVisualizer, Teradata SQL Assistant, TOAD, SQL*Plus, SQL Developer, MS Visio, JIRA, HP Quality Center, Citrix, Mercury Quality Center, Erwin
SCM Tools: Harvest, Kintana, Informatica Version Control, PVCS
Data Warehousing: Informatica PowerCenter 9.6 (Designer, Repository Manager, Workflow Manager, Workflow Monitor), Informatica DT Studio
Distributions: Hortonworks Data Platform, Cloudera

Professional Experience:

Bank of America, Atlanta, GA                                      July 2024 - Till date
Hadoop Developer
Responsibilities:
- Work with business analysts to understand requirements and work on high-level and detailed design to address real-time issues in production.
- Work with the Information Architecture team to propose technical solutions to business problems.
- Identify gaps in technology and propose viable solutions.
- Take accountability for the technical deliveries from offshore.
- Understand Hadoop, Spark, Python and other ecosystem components such as Impala, Hive, Oozie, and Pig, as well as Autosys and UNIX shell scripting.
- Work with the development teams and QA during the post-code-development phase.
- Identify improvement areas within the application and work with the respective teams to implement them.
- Ensure adherence to defined process and quality standards, best practices, and high quality levels in all deliverables.
- Adhere to the team's governing principles and policies.
- Apply strong working knowledge of ETL, database technologies, big data, and data processing.

BCBSA, Atlanta, GA                                                Jun 2022 - July 2024
Lead Big Data Developer
Responsibilities:
- Designed scalable and resilient data architectures on AWS to handle large volumes of data efficiently.
- Ensured that data architectures are scalable and highly available to handle growing data volumes and meet SLAs (Service Level Agreements) for availability and performance.
- Planned and implemented data ingestion pipelines to collect data from various sources into AWS services such as Amazon S3, Amazon Kinesis, and AWS Glue.
- Created custom analytic and data mining algorithms to help extract knowledge and meaning from vast stores of data.
- Designed and implemented data processing solutions using AWS services such as Amazon EMR (Elastic MapReduce), AWS Glue, and AWS Lambda to transform and analyze data at scale.
- Integrated data from different sources and formats, including structured and unstructured data, ensuring data consistency and integrity.
- Optimized data processing and analysis workflows for performance and cost-efficiency, including tuning AWS service configurations and selecting appropriate instance types.
- Documented data architectures, processes, and best practices, and shared knowledge with team members through documentation, presentations, and training sessions to foster a culture of learning within the organization.
- Integrated data from various sources, including databases, data warehouses, streaming sources, and external APIs, to provide a unified view of data for analysis and processing.
- Built a transactional layer on top of Amazon S3 using Apache Hudi in the AWS Lakehouse.
- Used Redshift Spectrum to provide a unified SQL interface, so the same query can reference and combine datasets hosted in the data lake as well as in data warehouse storage.
- In the AWS Lakehouse architecture, natively integrated the data warehouse and data lake at the storage and common catalog layers to present a unified Lakehouse interface to the processing and consumption layers.
- Selected and implemented appropriate data storage solutions, including data lakes, data warehouses, NoSQL databases, and distributed file systems, based on the specific requirements of the project.
- Built ELT data pipelines and data warehouses using Dataproc on GCP and Google BigQuery.
- Created AWS Lambda functions triggered by Amazon DynamoDB Streams to capture events for building real-time dashboards.
- Optimized data processing and storage systems for performance, scalability, and cost-effectiveness, including tuning parameters, optimizing queries, and selecting appropriate hardware or cloud resources.
Environment: Spark 2.3, Python with Spark, Amazon S3, AWS Glue, AWS Athena, Boto3, MySQL, Oracle 10g, Informatica PowerCenter, Elasticsearch, Apache SOLR, LireSOLR, DbVisualizer, SharePoint 2010, Rally

CSRA, Falls Church, VA                                            Jan 2019 - Jun 2022
Lead Big Data Developer
- Designed, developed, and implemented Big Data analytic solutions on a Hadoop-based platform.
- Refined a data processing pipeline focused on unstructured and semi-structured data.
- Created custom analytic and data mining algorithms to help extract knowledge and meaning from vast stores of data.
- Supported quick-turn, rapid implementations as well as larger-scale, longer-duration analytic capability implementations.
- Created Spark SQL and Spark RDDs using Java 8 as part of the Hortonworks HDP 3.1 migration.
- Configured data flows from different sources (relational databases, XML, JSON) and orchestrated them using NiFi.
- Developed Spark frameworks using PySpark and Java to build the raw and analytical layers in Big Data.
- Developed utilities in Python and core Java.
- Wrote data extraction, processing, and transformation scripts using Hive and Spark wherever needed.
- Used Jenkins for continuous integration and Git for version control.
- Wrote shell scripts and job management scripts to invoke and manage the data ingestion steps.
- Designed Hive tables for better performance given very high data volumes, applying partitions wherever needed.
- Designed and developed Spark programs that process high volumes of data at high processing speeds.
- Worked on AWS services such as S3, EMR, Lambda, Glue jobs, and Athena as part of the Open Data initiative (see the sketch after this section).
- Created Redshift tables with various distribution styles such as ALL, AUTO, KEY, and EVEN.
- Created a Redshift external schema to a Postgres database and granted access on the Glue Data Catalog to the Redshift cluster.
Environment: Java 8, Spark 2.3, Python with Spark, Hadoop, Hive, HDP 2.6, Amazon EMR, Amazon S3, AWS Lambda, AWS Glue, AWS Athena, Boto3, MySQL, Oracle 10g, Elasticsearch, Apache SOLR, LireSOLR, DbVisualizer, SharePoint 2010, Rally, SVN, Maven, Git, Jenkins
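As a small illustration of the Glue and Athena work listed in the sections above, the sketch below uses Boto3 to register an S3 prefix in the Glue Data Catalog via a crawler and then run an ad hoc Athena query against the resulting table. The crawler name, role ARN, database, bucket paths, and query are assumed placeholders, not artifacts from these projects.

```python
import boto3

# Minimal sketch of the Glue/Athena pattern: crawl an S3 prefix into the
# Glue Data Catalog, then query the resulting table with Athena.
# All names, ARNs, and paths below are illustrative placeholders.
glue = boto3.client("glue")
athena = boto3.client("athena")

glue.create_crawler(
    Name="claims_raw_crawler",                       # placeholder crawler name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # placeholder role ARN
    DatabaseName="claims_db",                        # placeholder Glue database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/claims/raw/"}]},
)
glue.start_crawler(Name="claims_raw_crawler")

# Once the crawler has populated the catalog, Athena can query the table in place.
athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM claims_raw GROUP BY status",
    QueryExecutionContext={"Database": "claims_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```

Athena writes the result set to the configured S3 output location, from where it can be collected or fetched later with get_query_results.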
Wellmark, Des Moines, IA                                          Nov 2018 - Jan 2019
Data Lead
- Involved in the full life cycle of the project, from design, analysis, and logical and physical architecture modeling through development, implementation, and testing.
- Responsible for the design and creation of Hive tables, partitioning, bucketing, loading data, and writing Hive queries.
- Implemented and migrated existing Hive scripts to Spark SQL for better performance.
- Developed Spark streaming jobs in Java 8 to receive real-time data from Kafka, process it, and store it in HDFS.
- Acted as a single point of contact between business and technical teams for data and reports.
- Involved in designing a system to process 1B records a day and store 7 years of such data, involving Hive, Sqoop, Spark, Kafka, a Java UI, Presto, and Oracle; the system handled a massive size and number of tar files and binary files.
- Gathered and created business requirement documents and was involved in framing the logic for extraction, transformation, and loading (ETL) processes.
- Created technical designs and documented system and technical process changes per business requirements.
- Created, designed, developed, and updated system code changes using Informatica PowerCenter (9.6, 10.2) or SQL stored procedures and provided technical expertise on solutions.
- Developed code and unit test cases, performed unit testing of jobs in development, supported jobs, and resolved any defects during UAT (User Acceptance Testing).
- Promoted code and provided warranty support in the higher pre-production and production environments.
- Prepared Korn shell (Unix/Linux) scripts and integrated them with the scheduler to automate the Informatica jobs.
- Ensured the integration of software packages, programs, and reusable solutions on multiple platforms.
- Coordinated back-out plans for test and production environments.
- Performed performance tuning of Teradata queries and helped with data modeling in the EDW (Enterprise Data Warehouse) stream.
- Followed all steps of the system life cycle and project SDLC phases in all technical work.
- Determined the best logical and physical architecture for the project and maintained the architectural integrity of the software.
- Performed root cause analysis and resolved testing defects; assisted with finding new test data and creating extracts for test data as needed.
Environment: Hadoop, Sqoop, Hive, Cloudera Distribution platform, Spark, Python, Informatica PowerCenter 9.6, Informatica Power Exchange, Mainframes, DB2, Oracle 10g, SQL Developer, SharePoint 2010, Autosys, Jira

Transamerica, Cedar Rapids, IA                                    Feb 2018 - Nov 2018
Big Data Hadoop Technical Lead
- Used Spark to land data in Hadoop systems using Python and Scala.
- Developed UDFs and UDAFs to prepare the data fed to Java MapReduce programs.
- Developed Java code for APIs and executed Hive and Pig scripts as part of the Java code.
- Worked on continuous integration of the Big Data build process.
- Moved data to Hadoop to streamline Transamerica's business by automating many previously manual processes.
- Pulled data from mainframe policy administration systems and landed the data in Hadoop.
- Worked on importing and exporting data into and out of HDFS and Hive using Sqoop.
- Worked on creating Hive tables and wrote Hive queries for data analysis to meet business requirements.
- Extensively used the Spark stack to develop a preprocessing job using the RDD, Dataset, and DataFrame APIs to transform data for upstream consumption (see the sketch after this section).
- Analyzed the requirement specifications provided by the client and translated them into technical impacts on the system.
- Created high-level design, detailed-level design, design specifications, test plans, and test scripts.
- Involved in code development and code review during development and integration testing.
Environment: Hadoop, Sqoop, Hive, Cloudera Distribution platform, Spark, Python, Informatica PowerCenter 9.6, Informatica Power Exchange, Mainframes, DB2, Oracle 10g, SQL Developer, SharePoint 2010, Autosys, Jira
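A minimal PySpark sketch of the kind of preprocessing-to-Hive flow described in the two sections above; the original Wellmark streaming jobs were written in Java 8, and the input path, table, and column names here are assumed placeholders.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch of a Spark preprocessing job: read raw records,
# standardize a few fields, and load the result into a partitioned
# table registered in the Hive metastore. Paths, table, and column
# names are illustrative placeholders.
spark = (SparkSession.builder
         .appName("policy-preprocess-sketch")
         .enableHiveSupport()
         .getOrCreate())

raw = spark.read.json("hdfs:///data/landing/policies/")  # placeholder input path

cleaned = (raw
           .withColumn("policy_id", F.trim(F.col("policy_id")))
           .withColumn("load_date", F.to_date(F.col("load_ts")))
           .dropDuplicates(["policy_id", "load_date"]))

# Partitioning by load_date keeps daily reprocessing and downstream
# Hive queries confined to a single partition.
(cleaned.write
        .mode("append")
        .partitionBy("load_date")
        .saveAsTable("analytics.policies_clean"))  # placeholder Hive table
```

The same DataFrame could also be exposed through the RDD API for row-level transformations before the write.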
American Family Insurance, Madison, WI                            Jul 2014 - Feb 2018
Big Data/Informatica Technical Lead
- Utilized in-depth functional and technical experience in data warehousing and other leading-edge products and technologies, in conjunction with industry and business skills, to deliver solutions to customers.
- Wrote Sqoop scripts to import data into Hive/HDFS from RDBMS sources.
- Developed Hive queries to analyze the data in HDFS to identify issues and behavioral patterns.
- Used Spark and the Python Pandas and NumPy modules for data analysis, data scraping, and parsing (see the sketch after this section).
- Implemented concurrent execution of workflows and session partitioning techniques as part of performance tuning.
- Worked on Informatica DT Studio to parse input XML and JSON files.
- Created an Informatica web service provider using the XML Generator, XML Parser, and SQL transformations.
- Applied the pushdown optimization technique to tune mappings and sessions when working with bulk loads or huge volumes of data.
- Used the Debugger in mappings and identified bugs in existing mappings by analyzing the data flow and evaluating transformations.
- Cleaned up mappings to avoid lengthy log files by turning off verbose logging and removing warning messages.
- Designed and developed stored procedures using PL/SQL and tuned SQL queries for better performance.
- Implemented slowly changing dimensions (SCD Type 1 and 2) in various mappings.
- Created and used reusable transformations in Informatica PowerCenter.
- Worked on AutoSys to automate execution of Informatica jobs.
- Responsible for writing and documenting unit test cases with different testing scenarios to verify the business rules implemented in ETL mappings.
Environment: Hadoop, HDFS, Hortonworks distribution platform, Pig, Hive, Python, Spark, Spark SQL, Informatica PowerCenter 9.1, Informatica DT Studio, Oracle 10g, Greenplum, SQL Developer, WinSQL, SharePoint 2010, Autosys, Jira, Harvest, XML Spy Editor
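A brief sketch of the Pandas/NumPy style of analysis mentioned above; the extract file and its columns are assumed placeholders rather than details from this engagement.

```python
import numpy as np
import pandas as pd

# Minimal sketch of a Pandas/NumPy analysis: load an extract, derive a
# metric, and summarize by a business key. File and column names are
# illustrative placeholders.
claims = pd.read_csv("claims_extract.csv", parse_dates=["service_date"])

# Guard against divide-by-zero when computing the paid-to-billed ratio.
claims["paid_ratio"] = np.where(
    claims["billed_amount"] > 0,
    claims["paid_amount"] / claims["billed_amount"],
    np.nan,
)

summary = (claims
           .groupby("provider_id")
           .agg(total_paid=("paid_amount", "sum"),
                avg_paid_ratio=("paid_ratio", "mean"),
                claim_count=("claim_id", "count"))
           .sort_values("total_paid", ascending=False))

print(summary.head(10))
```

When an extract no longer fits in memory, the same aggregation can be expressed in PySpark instead.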
Manulife Insurance, Boston, MA                                    Dec 2013 - July 2014
Sr. Informatica Developer
- Worked closely with project managers, business analysts, and DBAs to achieve business and functional requirements.
- Worked within the Software Development Life Cycle (SDLC) using Agile Scrum methodologies.
- Used Informatica PowerCenter 9.1 for extraction, transformation, and loading (ETL) of data in the data warehouse.
- Built efficient Informatica ETL packages for processing fact and dimension tables with complex transformations using Type 1 and Type 2 changes.
- Designed and developed complex mappings using varied transformation logic such as unconnected and connected Lookups, Router, Filter, Expression, Aggregator, Joiner, Update Strategy, and more.
- Worked on the Informatica PowerCenter tools: Source Analyzer, Target Designer, Mapping Designer, and Transformation Developer.
- Assisted in the design and maintenance of the metadata environment.
- Created workflows and sessions to load data from SQL Server, Oracle, flat file, and XML file sources residing on servers at various locations.
- Responsible for creating business solutions for incremental and full loads.
- Created different parameter files and changed session parameters, mapping parameters, and variables at run time.
- Created high-level design documents for extracting data from complex relational database tables, data conversion, transformation, and loading into specific formats.
- Architected and designed the ETL solution, including designing mappings and workflows, deciding load strategies, implementing appropriate error handling and error notification processes, scheduling, and designing reusable ETL pieces through parameterization.
- Developed mappings using parameters, session parameters, and mapping variables/parameters, and created parameter files and workflow runs based on changing variable values.
- Created Unix shell scripts to automate pre-session and post-session processes.
- Involved in creating new table structures and modifying existing tables in discussion with the data modeler.
- Updated tables per the requirements and wrote SQL queries to check data consistency in the tables.
- Created shortcuts for reusable transformations, source/target definitions, and Mapplets in the shared folder.
- Performed unit, integration, performance, and functional testing of the mappings.
- Involved in high-level and low-level design, analyzing source-to-target mappings (STMs), test cases, and code migration reports.
Environment: Informatica PowerCenter 9.6, MS SQL Server 2008, SQL Developer, Toad, T-SQL, Windows Server

Volkswagen Group of America, Detroit, MI                          Jul 2013 - Dec 2013
Sr. Informatica Developer
- Identified all OMD/CRM tables in PRD1 shared with other systems.
- Identified all CRM/OMD tables that provide data to other systems.
- Identified all interfaces with CRM/OMD data.
- Identified all tables referenced by interfaces.
- Cataloged all dependencies on PRD1 CRM/OMD data.
- Formulated Oracle SQL queries against the Informatica metadata repository database to identify Informatica jobs that source data from or through the OMD/CRM systems to other systems.
- Involved in loading analysis tables into the SDR-SAMBA data repository.
- Involved in the design of the SDR-SAMBA data repository.
- Analyzed views, materialized views, and stored procedures.
- Analyzed DB audit and trace output to check table access, patterns, and frequencies.
- Involved in basic profiling of the DB schema to identify table data usage and volume.
- Analyzed Informatica jobs using Informatica metadata to identify the tables involved.
- Analyzed Perl and Unix scripts to identify the tables used.
- Created the interface dependencies and hierarchies.
Environment: Informatica PowerCenter, Toad, SQL Developer, Oracle Metadata, Erwin, Visio, Windows Server

Cisco Systems, San Jose, CA                                       May 2011 - Jul 2013
Informatica Developer
- Responsible for the design, development, and testing of processes necessary to extract data from operational databases and transform and load it into the data warehouse using Informatica PowerCenter.
- Responsible for modeling, design, development, and integration testing for the BIDS platform.
- Made sure that all Dev and Stage environments are sanitized within the given SLA.
- Developed the ETLs based on the BRDs and ETL specification documents provided by the client using Informatica.
- Created complex mappings in PowerCenter Designer using Expression, Filter, Sequence Generator, Update Strategy, Joiner, and Stored Procedure transformations.
- Worked extensively on Informatica tools such as Designer, Workflow Monitor, and Workflow Manager.
- Worked on all transformations, including Lookup, Aggregator, Expression, Router, Filter, Update Strategy, Stored Procedure, and Sequence Generator.
- Created connected and unconnected Lookup transformations to look up data from the source into the ETL target tables.
- Wrote SQL, PL/SQL, and stored procedures to implement business rules and transformations.
- Used the Update Strategy transformation to effectively migrate data from source to target.
- Created test cases and completed unit, integration, and system tests for the data warehouse.
- Involved in debugging and validating the mappings and in code reviews to rectify issues.
- Developed, implemented, and enforced ETL best-practice standards.
- Created scheduled sessions and batch processes (on demand, run on time, run only once) using Informatica Server Manager.
- Redesigned multiple existing PowerCenter mappings to implement change requests (CRs) representing updated business logic.
- Developed CDC and SCD Type 1, 2, and 3 mappings to meet the business requirements.
- Created reusable transformations to increase reusability during the development life cycle.
- Responsible for creating ETL technical specification documents based on the BRD.
Environment: Informatica PowerCenter 9.1, Teradata V2R5, Oracle 10g, TOAD, SQL Developer, HP Quality Center, Dollar Universe, Kintana, PVCS

Abercrombie & Fitch, Columbus, OH                                 Jun 2007 - Mar 2010
Oracle SQL, PL/SQL, Forms 10g Developer
- Involved in developing technical documents from functional specs.
- Performed data quality analysis to validate the input data based on the cleansing rules.
- Extensively used PL/SQL collections such as nested tables and VARRAYs.
- Ensured quality and on-time delivery.
- Actively participated in gathering business requirements and system specifications from system users.
- Analyzed the current data management procedures in practice and suggested ways to automate the process or improve the existing system.
- Improved performance and tuning of SQL queries and fixed slow-running queries in production using utilities such as Explain Plan and tkprof.
Environment: Oracle 11g/10g, SQL, PL/SQL, SQL*Loader, Oracle Designer, Oracle Forms 9i, Mercury Quality Center