H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch [P]

Summary

Hi everyone, I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch. Instead of relying on high-level training frameworks, I implemented the core components myself: attention, MoE routing, normalization, and the training loop.

Features: 249M-parameter Transformer; Grouped Query Attention (GQA); Sparse Mixture-of-Experts (8 experts, Top-2 routing) with 3 auxiliary routing losses; SwiGLU, RoPE, RMSNorm; Sliding-window attention; Mixed-precision trai
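For readers who have not seen sparse MoE routing before, here is a minimal sketch of what "8 experts, Top-2 routing" can look like in PyTorch. It is not taken from the H64LM code: the class names `Top2Router` and `SparseMoE`, the layer sizes, the plain SiLU expert MLP (standing in for SwiGLU), and the choice to renormalize the two gate weights are all assumptions made for illustration, and the auxiliary routing losses are left out because the post does not say which ones are used.

```python
import torch
import torch.nn as nn


class Top2Router(nn.Module):
    """Hypothetical top-k gate (k=2 by default); not H64LM's actual router."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (n_tokens, d_model) -> probs: (n_tokens, n_experts)
        probs = torch.softmax(self.gate(x), dim=-1)
        # Keep the k most probable experts for each token.
        top_p, top_idx = torch.topk(probs, self.k, dim=-1)
        # Assumed choice: renormalize so each token's k weights sum to 1.
        weights = top_p / top_p.sum(dim=-1, keepdim=True)
        return weights, top_idx


class SparseMoE(nn.Module):
    """Hypothetical sparse MoE layer: each token is processed by k experts only."""

    def __init__(self, d_model: int = 16, d_hidden: int = 32,
                 n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = Top2Router(d_model, n_experts, k)
        # Simplified experts: a plain SiLU MLP rather than SwiGLU.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.SiLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights, top_idx = self.router(x)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Rows (tokens) and slots (0..k-1) where expert e was selected.
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no token was routed to this expert
            gated = weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
            # Add each expert's weighted output back to its tokens.
            out = out.index_add(0, token_ids, gated)
        return out


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = SparseMoE()
    tokens = torch.randn(10, 16)   # 10 tokens, d_model = 16
    print(layer(tokens).shape)     # same shape as the input
```

The loop over experts is the easy-to-read version; it only runs each expert on the tokens routed to it, which is where the "sparse" compute saving comes from. Real implementations usually batch this differently and add load-balancing terms so the router does not collapse onto a few experts.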
