Hooman Ramezani

I'm Hooman, a machine learning engineer passionate about applications of AI. I currently work as an ML Solutions Architect at Nebius, where I help clients get the best performance when migrating large-scale ML workloads onto next-generation GPUs. This includes profiling and optimizing training runs, deploying production inference endpoints, and working hands-on with the latest NVIDIA Blackwell hardware.

I completed my BASc in Systems Design Engineering at the University of Waterloo and my MASc at the University of Toronto. My thesis focused on applying transformers to clinical workloads, training LLM and ViT models on multimodal data for lung cancer treatment.

My expertise spans GPU performance optimization across large-scale ML workloads, including CUDA-level profiling, RDMA/InfiniBand networking, and setting up distributed training with TorchTitan. Recently I’ve spent a lot of time on inference, tuning engines like vLLM, SGLang, and TRT-LLM for the best performance on specific client workloads.

In my free time you can find me playing guitar. See more here.

Email / CV / LinkedIn / GitHub

Research and Projects

I am interested in large-scale ML systems, GPU performance optimization, and Health AI. My research experience spans my current work at Nebius on training and inference optimization, previous work with the UW VIP Lab, and my master's at UofT.

	Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan Hooman Ramezani, PyTorch and Nebius Teams, 2026 Blog / TorchTitan / TorchAO / DeepEP Achieved 41% faster pre-training of DeepSeek-V3 671B on 256 NVIDIA B200 GPUs by combining MXFP8 (Microscaling FP8) quantized training via TorchAO with DeepEP expert-parallel communication optimization. MXFP8 leverages Blackwell's native tensor core support for up to 2x peak TFLOPS over BF16, while DeepEP replaces standard all-to-all collectives with purpose-built NVLink/RDMA kernels for MoE workloads. Throughput improved from 651 to 918 tokens/sec with no degradation in loss convergence.
	LN-Transformer: Lung Nodule Transformer for Sparse CT Segmentation Hooman Ramezani, Charlotte Vedrines, Dionne Aleman, Daniel Létourneau, CVPR, 2025 Paper / CVF A novel two-stage transformer for lung nodule segmentation, published at CVPR 2025. It combines the Meta SAM and DETR architectures and is the strongest model on the benchmark dataset, with Dice 91.4% and F1 94.2%.
	Lung-DETR: Deformable Detection Transformer for Sparse Lung Nodule Anomaly Detection Hooman Ramezani, Dionne Aleman, Daniel Létourneau, arXiv, 2024 arXiv A novel deformable detection transformer architecture for detecting lung tumors, specifically designed to mitigate extreme class imbalance and find tumors among predominantly healthy tissue.
	Enhancing DL Interpretability: IBA for Transformer Attribution Hooman Ramezani, University of Toronto, 2024 Paper / Presentation Information Bottleneck Attribution (IBA) leverages principles from information theory to identify the information in a neural network that is critical to its decisions. In this work, IBA is successfully applied to CNN and Transformer models, enabling a detailed analysis of model decision-making.
	Parkinson's Freezing of Gait Detection Hooman Ramezani, Medical Time Series Deep Learning, 2023 Paper / GitHub A deep learning network for time-series analysis designed to identify gait freezing in patients with Parkinson's disease, using biometric signals to help prevent falls.
	Rat-Brain-Inspired Reinforcement Learning for Optimal Pathfinding in Mazes Hooman Ramezani, Computational Neuroscience, 2023 Paper / GitHub A deep reinforcement learning model inspired by the basal ganglia of rat brains, designed to master maze navigation using Q-learning. It showcases the intricacies of decision-making and learning as the model identifies optimal paths through mazes.
	Grasp-Proposition-Net: Robotic Vision For Grasping Everyday Objects Hooman Ramezani, UW VIP Lab, 2022 GitHub Developed a 3D computer vision model for a robotic arm with the VIP Lab and Festo, designed to determine optimal grasp points using LiDAR camera data.
	Drone-Aided Surface Defect Detection Hooman Ramezani, Vision Model with Temporal Context, 2021 GitHub A highly accurate embedded model for classifying surface defects via drones, using a convolutional-RNN architecture and synthetic data generation. The model is optimized for on-device execution in real-world applications.