Hooman Ramezani

I'm Hooman, a machine learning engineer passionate about applications of AI. Currently I work as an ML Solutions Architect at Nebius, primarily focused on working with clients to ensure best performance when migrating large-scale ML workloads onto next-generation GPUs. This includes profiling and optimizing training runs, deploying production inference endpoints, and working hands-on with the latest NVIDIA Blackwell hardware.

I have completed my BASc in Systems Design Engineering at the University of Waterloo, and my MASc at the University of Toronto. My thesis focused on applying transformers to clinical workloads, training LLM and ViT models on multimodal data for lung cancer treatment.

My expertise spans GPU performance optimization across large-scale ML workloads, including CUDA-level profiling and RDMA/InfiniBand networking as well as setting up distributed training with TorchTitan. Recently I’ve spent a lot of time on inference, tuning engines like vLLM, SGLang, and TRT-LLM for best performance on specific client workloads.

In my free time you can find me playing guitar. See more here.

Email  /  CV  /  LinkedIn  /  GitHub

Research and Projects

I am interested in large-scale ML systems, GPU performance optimization, and Health AI. My research experience spans my current work at Nebius on training and inference optimization, previous work with the UW VIP Lab, and my master's at UofT.

Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan
Hooman Ramezani, PyTorch and Nebius Teams, 2026
Blog / TorchTitan / TorchAO / DeepEP

Achieved 41% faster pre-training of DeepSeek-V3 671B on 256 NVIDIA B200 GPUs by combining MXFP8 (Microscaling FP8) quantized training via TorchAO with DeepEP expert-parallel communication optimization. MXFP8 leverages Blackwell's native tensor core support for up to 2x peak TFLOPS over BF16, while DeepEP replaces standard all-to-all collectives with purpose-built NVLink/RDMA kernels for MoE workloads. Throughput improved from 651 to 918 tokens/sec with no degradation in loss convergence.
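
The 41% figure follows directly from the two throughput numbers quoted above. A quick check in Python (the helper name `speedup_percent` is illustrative only and does not come from the blog post or any library):

```python
def speedup_percent(baseline: float, optimized: float) -> float:
    """Relative throughput gain of `optimized` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (optimized / baseline - 1.0) * 100.0


# Throughput figures quoted above, in tokens/sec.
gain = speedup_percent(651.0, 918.0)
print(f"{gain:.1f}% faster")  # prints "41.0% faster"
```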

LN-Transformer: Lung Nodule Transformer for Sparse CT Segmentation
Hooman Ramezani, Dionne Aleman, Daniel Letourneau, arXiv, 2025
Paper

Published a novel two-stage transformer for lung nodule segmentation, featured in CVPRW. It is the strongest model on the benchmark dataset (Dice 91.4%, F1 94.2%) and includes Meta SAM and DETR architectures.

Lung-DETR: Deformable Detection Transformer for Sparse Lung Nodule Anomaly Detection
Hooman Ramezani, Dionne Aleman, Daniel Letourneau, arXiv, 2024
arXiv

A novel architecture, based on the Deformable Detection Transformer, for detecting lung tumors. It is specifically designed to mitigate extreme class imbalance and find tumors among predominantly healthy tissue.

Enhancing DL Interpretability: IBA for Transformer Attribution
Hooman Ramezani, University of Toronto, 2024
Paper / Presentation

Information Bottleneck Attribution (IBA) leverages principles from information theory to identify critical information in neural networks for decision-making attribution. In this work IBA is successfully applied to CNN and Transformer models, enabling a detailed analysis of model decision-making.

Parkinson's Freezing of Gait Detection
Hooman Ramezani, Medical Time Series Deep Learning, 2023
Paper / GitHub

A deep learning network for time-series analysis designed to identify gait freezing in patients with Parkinson's disease, utilizing biometric signals for the prevention of falls.

Rat-Brain-Inspired Reinforcement Learning for Optimal Pathfinding in Mazes
Hooman Ramezani, Computational Neuroscience, 2023
Paper / GitHub

A deep reinforcement learning model inspired by the basal ganglia of rat brains, designed to master maze navigation using Q-learning. It showcases the intricacies of decision-making and learning as the model identifies optimal paths through mazes.

Grasp-Proposition-Net: Robotic Vision For Grasping Everyday Objects
Hooman Ramezani, UW VIP Lab, 2022
GitHub

Developed a 3D computer vision model with the UW VIP Lab and Festo for a robotic arm, designed to determine optimal grasp points using LiDAR camera data.

Drone-Aided Surface Defect Detection
Hooman Ramezani, Vision Model with Temporal Context, 2021
GitHub

A highly accurate embedded model for classifying surface defects via drones, utilizing a convolutional-RNN architecture and synthetic data generation. The model is optimized for on-device execution in real-world applications.