Manisha Yadav — Data Engineer

About Me

Turning data chaos into clarity

I'm a Data Engineer based in New York City with 5+ years of experience designing scalable data infrastructure across financial services and healthcare. I specialize in PySpark, SQL, and Python for large-scale ETL/ELT workflows, with deep hands-on experience across AWS and Azure.

My work spans Medallion Architecture lakehouses, real-time streaming pipelines with Kafka and Kinesis, and ML-integrated data systems for fraud detection and risk analytics. I've worked at Cedar Gate Technologies, Fuse Machines, and Citigroup, with each role building on the last.

Currently seeking my next Data Engineering role in NYC where I can build reliable, high-impact data systems at scale.

5+

Years of experience

3

Cloud platforms (AWS, Azure, GCP)

3

Industries: Fintech, Healthcare, AI

NYC

Based in New York City

Experience

Where I've worked

A track record of building high-impact data systems across fintech, AI, and healthcare.

Data Engineer

Citigroup · New York, NY

May 2024 — Present

Designed and maintained scalable ETL/ELT pipelines using PySpark, SQL, and AWS (S3, Glue, EMR, Lambda) to process high-volume financial transaction data.
Built batch and near-real-time workflows using AWS Glue, EMR, and Kinesis supporting fraud detection and risk analytics use cases.
Optimized PySpark jobs through partitioning, caching, and parallel processing, improving performance for multi-terabyte datasets.
Developed data lake and warehousing solutions on Amazon Redshift with star and snowflake schemas for regulatory reporting.
Collaborated with data scientists to build feature engineering pipelines for ML models used in customer risk scoring and fraud analytics.
Created dashboards in Power BI and Tableau for stakeholder KPI monitoring and operational reporting.

PySpark · AWS · Kinesis · Redshift · Power BI

Data Engineer

Fuse Machines · New York, NY

Sep 2022 — Dec 2023

Designed and implemented a scalable Lakehouse architecture on Azure using ADLS Gen2 and Databricks, leveraging the Medallion (Bronze/Silver/Gold) framework.
Developed ETL/ELT pipelines using Azure Data Factory and PySpark in Databricks for large-scale structured and semi-structured datasets.
Implemented Delta Lake capabilities including ACID transactions, schema evolution, and time travel for reliable pipelines.
Migrated legacy data warehouse systems to Snowflake, improving scalability and reducing maintenance overhead.
Implemented RBAC and Azure IAM policies to enforce secure, compliant data access.

Azure · Databricks · Delta Lake · Snowflake · ADF

Data Engineer

Cedar Gate Technologies · Kathmandu, Nepal

Jan 2020 — Aug 2022

Designed ETL pipelines using Python and AWS (Glue, Lambda) to ingest and process healthcare datasets including claims, clinical, and provider data.
Built a centralized data warehouse on Amazon Redshift for payer-provider analytics, risk stratification, and value-based care insights.
Implemented a data lake architecture on S3 with Athena and Redshift Spectrum for efficient querying of structured and semi-structured data.
Orchestrated data workflows using Apache Airflow with scheduling, monitoring, and error handling.
Supported CI/CD using Git, Jenkins, and Terraform for automated deployment of pipelines and infrastructure.

AWS · Python · Airflow · Redshift · Terraform

Technical Skills

Tools & technologies

The stack I reach for when building data systems — from ingestion to serving.

Languages & Query

Python · SQL · Scala · Unix/Shell Scripting

Python

95%

SQL

92%

Scala

70%

Cloud Platforms

AWS (S3, Glue, EMR, Lambda, Redshift, Kinesis) · Azure (ADF, ADLS Gen2) · GCP (BigQuery, Looker Studio)

AWS

90%

Azure

82%

GCP

65%

Big Data & Processing

PySpark · Apache Spark · Hadoop · Hive · HDFS · Databricks · Delta Lake · dbt

PySpark

93%

Databricks

85%

Kafka

80%

Streaming & Messaging

Apache Kafka · AWS Kinesis · Spark Structured Streaming · Azure Event Hubs

Kafka

80%

Kinesis

85%

Databases & Warehousing

Amazon Redshift · Snowflake · BigQuery · Oracle · MySQL · MongoDB · Azure SQL

Redshift

88%

Snowflake

85%

BigQuery

72%

Orchestration & DevOps

Apache Airflow · Git · Docker · Jenkins · Terraform · CI/CD

Airflow

88%

Docker

75%

Terraform

72%

Visualization & Analytics

Power BI · Tableau · Looker Studio · Pandas · NumPy · Alteryx

Power BI

85%

Tableau

82%

ML & Healthcare Standards

Scikit-learn · TensorFlow · PyTorch · FHIR · HL7 · X12 837 · HIPAA

Scikit-learn

75%

TensorFlow

65%

Projects

Things I've built

Personal and professional data engineering projects showcasing end-to-end pipeline design.

🏅

COVID-19 Data Pipeline — Medallion Architecture

End-to-end data pipeline implementing Bronze → Silver → Gold Medallion Architecture with dimension and fact tables for analytical reporting on COVID-19 trends.

PySpark · Python · Medallion · Delta Lake · SQL

GitHub →
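
To illustrate the Bronze → Silver → Gold flow, here is a minimal sketch in plain Python rather than PySpark, so it runs without a cluster. The record fields (`country`, `date`, `new_cases`) and the function names `to_silver` and `to_gold` are hypothetical and chosen for illustration; they are not the project's actual schema or code.

```python
from collections import defaultdict

# Hypothetical raw records as they might land in the Bronze layer (unvalidated).
bronze = [
    {"country": "US", "date": "2021-01-01", "new_cases": "100"},
    {"country": "US", "date": "2021-01-02", "new_cases": "150"},
    {"country": "NP", "date": "2021-01-01", "new_cases": "20"},
    {"country": "NP", "date": "2021-01-01", "new_cases": "20"},   # duplicate row
    {"country": "US", "date": "2021-01-03", "new_cases": "n/a"},  # malformed value
]


def to_silver(rows):
    """Clean Bronze rows: cast types, drop malformed rows, de-duplicate."""
    seen = set()
    silver = []
    for row in rows:
        try:
            cases = int(row["new_cases"])
        except (KeyError, ValueError):
            continue  # skip rows that fail validation
        key = (row["country"], row["date"])
        if key in seen:
            continue  # keep only the first row per (country, date)
        seen.add(key)
        silver.append(
            {"country": row["country"], "date": row["date"], "new_cases": cases}
        )
    return silver


def to_gold(rows):
    """Aggregate Silver rows into a reporting table: total cases per country."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["new_cases"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a Spark-based version, each layer would typically be persisted as its own table, so the Silver and Gold steps can be re-run independently of ingestion.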

🏦

Financial Transaction Pipeline — Citigroup

Scalable ETL/ELT pipelines using PySpark and AWS to process high-volume financial transactions with ML inference for near real-time fraud detection and risk scoring.

PySpark · AWS Glue · Redshift · Kinesis · EMR

GitHub →

🏥

Healthcare Data Lake — Cedar Gate Technologies

HIPAA-compliant data lake on Amazon S3 for payer-provider analytics supporting risk stratification and value-based care with Athena and Redshift Spectrum querying.

AWS S3 · Redshift · Athena · Airflow · Python

GitHub →

☁️

Azure Lakehouse — Fuse Machines

Enterprise Lakehouse on Azure using ADLS Gen2 and Databricks with the Medallion framework. Migrated 10+ TB from legacy Oracle to Snowflake with ACID-compliant Delta Lake operations.

Azure · Databricks · Delta Lake · Snowflake · ADF

GitHub →

⚡

Real-Time Streaming Analytics Platform

Event-driven streaming pipeline using Apache Kafka and AWS Kinesis for sub-second-latency data processing, with windowed aggregations for live fraud signal detection.

Kafka · Kinesis · Spark Streaming · AWS · Python

GitHub →
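
As a rough illustration of the windowed-aggregation idea, the sketch below counts events per account in fixed (tumbling) windows and flags bursts. It is plain Python over an in-memory list, not Kafka, Kinesis, or Spark code. The event shape `(epoch_seconds, account_id)`, the 10-second window, the threshold, and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, account) in fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, account in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, account)] += 1
    return dict(counts)


def flag_bursts(counts, threshold):
    """Return the (window_start, account) keys whose count reaches the threshold."""
    return sorted(key for key, n in counts.items() if n >= threshold)


# Hypothetical events: (epoch_seconds, account_id)
events = [(0, "a"), (1, "a"), (2, "a"), (3, "b"), (11, "a"), (12, "b")]

counts = tumbling_window_counts(events, window_seconds=10)
flagged = flag_bursts(counts, threshold=3)
print(flagged)
```

A streaming engine applies the same grouping incrementally and also has to handle late or out-of-order events, which this sketch ignores.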

Education

Academic background

🎓

Master's in Artificial Intelligence

The Katz School of Science and Health at Yeshiva University

Manhattan, NY

🎓

Bachelor of Science in Computer Science

Leeds Beckett University

Leeds, UK

Contact

Let's work together

I'm actively looking for my next Data Engineer role in NYC. If you're building a world-class data platform and need someone who cares about quality, reliability, and craft — let's talk.

Email: manishayv07@gmail.com

LinkedIn: linkedin.com/in/manisha-yadav

Phone: +1 (347) 580-8709

ResearchGate