Jon
b7e6dad1f2f4b4e4d7949131feeeeb07aad87312a2a9b2fa7255a922cdd5d282
Pleb spreading the word. Humbly stacking.
Replying to Jameson Lopp

I'm pretty good at JavaScript; I've been working with it for years. I've limited myself to only using functions (no classes) and constants (no let, no var). Building a frontend or some services works well at the end of the day.

The thing I hate most is setting up a project and all the tooling around it.

Out of ignorance, since I don't know any other language that well: why do people hate JS?
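A minimal sketch of the functions-and-const style described above (all names are illustrative): factory functions and pure updates replace classes and mutation, so `let` and `var` never come up.

```javascript
// Pure functions and const bindings only: no classes, no let/var.
// State is passed in and returned, never mutated in place.
const createTodo = (title) => ({ title, done: false });

const complete = (todo) => ({ ...todo, done: true });

const countDone = (todos) => todos.filter((t) => t.done).length;

const todos = [createTodo("write post"), createTodo("stack sats")];
const updated = todos.map((t) =>
  t.title === "write post" ? complete(t) : t
);

console.log(countDone(updated)); // 1
```

Because `complete` returns a fresh object, the original `todos` array stays untouched, which is the main practical win of this style.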

Llama 4 is out. 👀

Llama 4 Maverick (402B) and Scout (109B) - natively multimodal, multilingual, and scaled to a 10 MILLION token context! BEATS DeepSeek v3 🔥

Llama 4 Maverick:

> 17B active parameters, 128 experts, 400B total parameters

> Beats GPT-4o & Gemini 2.0 Flash, competitive with DeepSeek v3 at half the active parameters

> 1417 ELO on LMArena (chat performance)

> Optimized for image understanding, reasoning, and multilingual tasks

Llama 4 Scout:

> 17B active parameters, 16 experts, 109B total parameters

> Best-in-class multimodal model for its size, fits on a single H100 GPU (with Int4 quantization)

> 10M token context window

> Outperforms Gemma 3, Gemini 2.0 Flash-Lite, Mistral 3.1 on benchmarks
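The "fits on a single H100 with Int4 quantization" claim comes down to storing each weight as a 4-bit integer plus a shared scale. A toy sketch of symmetric int4 quantization (the grouping and rounding scheme here is illustrative, not Meta's actual kernel):

```javascript
// Toy symmetric int4 quantization: map floats to integers in [-8, 7]
// using one scale per group, then dequantize by multiplying back.
const quantizeInt4 = (weights) => {
  const maxAbs = Math.max(...weights.map(Math.abs), 1e-12);
  const scale = maxAbs / 7; // 7 = largest positive 4-bit value
  const q = weights.map((w) =>
    Math.max(-8, Math.min(7, Math.round(w / scale)))
  );
  return { q, scale };
};

const dequantizeInt4 = ({ q, scale }) => q.map((v) => v * scale);

const w = [0.12, -0.5, 0.33, 0.07];
const packed = quantizeInt4(w);
const restored = dequantizeInt4(packed);
// restored approximates w at a quarter of fp16's storage per weight
```

The rounding error is bounded by half the scale, which is why per-group scales (rather than one global scale) keep quality acceptable at 4 bits.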

Architecture & Innovations

Mixture-of-Experts (MoE):

> First natively multimodal Llama models with MoE

> Llama 4 Maverick: 128 experts, shared expert + routed experts for better efficiency
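The shared-plus-routed design means every token passes through one always-on shared expert, while a router sends it to its top routed expert. A toy sketch with scalar "experts" (the router, expert functions, and top-1 choice are all illustrative):

```javascript
// Toy MoE layer: a shared expert always runs; a softmax router
// picks the top-1 routed expert per token and weights its output.
const softmax = (xs) => {
  const m = Math.max(...xs);
  const es = xs.map((x) => Math.exp(x - m));
  const sum = es.reduce((a, b) => a + b, 0);
  return es.map((e) => e / sum);
};

const argmax = (xs) => xs.indexOf(Math.max(...xs));

const moeLayer = ({ sharedExpert, routedExperts, routerLogits }) => (x) => {
  const probs = softmax(routerLogits(x));
  const top = argmax(probs); // top-1 routing: only one routed expert fires
  return sharedExpert(x) + probs[top] * routedExperts[top](x);
};

// Illustrative experts and router.
const layer = moeLayer({
  sharedExpert: (x) => 0.5 * x,
  routedExperts: [(x) => 2 * x, (x) => -x],
  routerLogits: (x) => [x, -x], // positive inputs prefer expert 0
});

const out = layer(1.0); // shared path plus mostly expert 0
```

Only the shared expert and one routed expert run per token, which is how 400B total parameters collapse to 17B active at inference.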

Native Multimodality & Early Fusion:

> Jointly pre-trained on text, images, video (30T+ tokens, 2x Llama 3)

> MetaCLIP-based vision encoder, optimized for LLM integration

> Supports multi-image inputs (up to 8 tested, 48 pre-trained)

Long Context & iRoPE Architecture:

> 10M token support (Llama 4 Scout)

> Interleaved attention layers (no positional embeddings)

> Temperature-scaled attention for better length generalization
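A rough sketch of the temperature-scaling idea: attention scores get multiplied by a factor that grows with context length, keeping the softmax sharp enough that distant tokens still receive weight. The specific schedule below is a made-up illustration, not Meta's published formula:

```javascript
// Toy dot-product attention with a length-dependent temperature:
// scores are scaled up as context grows (illustrative schedule only),
// which counteracts softmax flattening over very long contexts.
const attend = (query, keys, values, scaleFn) => {
  const scores = keys.map((k) => {
    const dot = k.reduce((s, kj, j) => s + kj * query[j], 0);
    return dot * scaleFn(keys.length); // temperature scaling
  });
  const m = Math.max(...scores);
  const es = scores.map((s) => Math.exp(s - m));
  const z = es.reduce((a, b) => a + b, 0);
  const w = es.map((e) => e / z);
  // Weighted sum of value vectors.
  return values[0].map((_, j) =>
    values.reduce((s, v, i) => s + w[i] * v[j], 0)
  );
};

// One illustrative temperature schedule: grow slowly with length.
const lengthScale = (n) => 1 + 0.1 * Math.log(n);
```

With a matching query the first key dominates the output; as `n` grows, `lengthScale` nudges the score gap wider so that dominance survives longer contexts.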

Training Efficiency:

> FP8 precision (390 TFLOPs/GPU on 32K GPUs for Behemoth)

> MetaP technique: Auto-tuning hyperparameters (learning rates, initialization)

Revamped Pipeline:

> Lightweight Supervised Fine-Tuning (SFT) → Online RL → Lightweight DPO

> Hard-prompt filtering (50%+ easy data removed) for better reasoning/coding

> Continuous Online RL: Adaptive filtering for medium/hard prompts