SPIN Unprocessed July 3, 2026 ai_technology community
H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch [P]
Source: reddit.com
Summary
Hi everyone, I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch. Instead of relying on high-level training frameworks, I implemented the core components myself: attention, MoE routing, normalization, and the training loop.

Features
- 249M-parameter Transformer
- Grouped Query Attention (GQA)
- Sparse Mixture-of-Experts (8 experts, Top-2 routing) with 3 auxiliary routing losses
- SwiGLU, RoPE, RMSNorm
- Sliding-window attention
- Mixed-precision trai
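
To make the "8 experts, Top-2 routing" item concrete, here is a minimal sketch of how top-2 gating is commonly done for a single token. It is not taken from H64LM: it uses plain Python instead of PyTorch so it runs without dependencies, and every name in it (`top2_route`, `moe_forward`, `load_balance_loss`) is hypothetical. The post does not say which three auxiliary losses it uses; the load-balancing term below is one common choice, shown only as an assumed example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Pick the two highest-probability experts for one token.

    Returns (routes, probs): routes is [(expert_index, gate), ...] with the
    two gates renormalized to sum to 1; probs is the full router softmax.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2], probs


def moe_forward(x, router_logits, experts):
    """Gate-weighted sum of the two selected experts' outputs.

    Only the selected experts are called, which is what makes the layer sparse.
    """
    routes, _ = top2_route(router_logits)
    out = [0.0] * len(x)
    for idx, gate in routes:
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


def load_balance_loss(batch_logits):
    """Assumed auxiliary loss: num_experts * sum_e(f_e * P_e).

    f_e is the fraction of top-2 assignments that went to expert e, and
    P_e is the mean router probability for expert e over the batch.
    """
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in batch_logits:
        routes, probs = top2_route(logits)
        for idx, _ in routes:
            counts[idx] += 1
        prob_sums = [s + p for s, p in zip(prob_sums, probs)]
    total_assignments = sum(counts)
    num_tokens = len(batch_logits)
    return num_experts * sum(
        (counts[e] / total_assignments) * (prob_sums[e] / num_tokens)
        for e in range(num_experts)
    )
```

In a PyTorch implementation the same selection step would typically be batched with `torch.topk` over the router softmax rather than looping over tokens.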