- multimodal, diffusion
- Scaling Diffusion Transformers Efficiently via μP
- establish μP as a principled and efficient scaling strategy for diffusion Transformers
- with an appendix on the theoretical background of μP (a minimal sketch of the μP scaling rules follows below)
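  A minimal sketch of the standard μP recipe (Yang & Hu) under Adam, assuming the usual rules (init std ∝ 1/√fan_in; hidden-layer learning rates divided by the width multiplier). `BASE_WIDTH`, `WIDTH`, and the MLP shape are illustrative assumptions; the paper's exact parameterization for diffusion Transformers may differ.

  ```python
  # Hedged sketch of muP-style scaling, not the paper's exact recipe:
  # init std ~ 1/sqrt(fan_in), and hidden-layer Adam learning rates shrink
  # by the width multiplier, so hyperparameters tuned once at a small base
  # width transfer to larger widths.
  import math
  import torch
  import torch.nn as nn

  BASE_WIDTH, WIDTH = 256, 1024      # assumed base/target widths
  MULT = WIDTH / BASE_WIDTH          # width multiplier

  class MuPLinear(nn.Linear):
      def reset_parameters(self):
          # muP init: std scales as 1/sqrt(fan_in)
          nn.init.normal_(self.weight, std=1.0 / math.sqrt(self.in_features))
          if self.bias is not None:
              nn.init.zeros_(self.bias)

  mlp = nn.Sequential(MuPLinear(WIDTH, 4 * WIDTH), nn.GELU(),
                      MuPLinear(4 * WIDTH, WIDTH))

  base_lr = 1e-3                     # tuned once at BASE_WIDTH
  opt = torch.optim.Adam(mlp.parameters(), lr=base_lr / MULT)
  ```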
- Neurosymbolic Diffusion Models
- the first method to integrate masked diffusion models as the neural network extractor in neurosymbolic predictors
- with a very long appendix on math background
- Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
- propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks
- focus on Dream 7B: introducing Dream 7B, the most powerful open diffusion large language model to date
- consistently outperforms existing diffusion language models by a large margin
- matches or exceeds top-tier autoregressive (AR) language models of similar size on general, math, and coding abilities
- demonstrates strong planning ability and inference flexibility that naturally benefit from diffusion modeling
- virtually all leading LLMs rely on the same sequential left-to-right architecture
- Discrete diffusion models (DMs), which dynamically refine the full sequence in parallel starting from a fully noised state, have gained attention as a promising alternative for sequence generation since their introduction to the text domain (a minimal sampling sketch follows below)
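  To make the parallel refinement concrete, here is a minimal MaskGIT-style confidence-based unmasking loop, assuming a bidirectional token predictor `model` and a `mask_id` token; an illustration of the general scheme, not Dream 7B's actual decoder.

  ```python
  # Start from a fully noised (all-[MASK]) sequence and iteratively commit
  # the highest-confidence predictions in parallel until nothing is masked.
  import torch

  @torch.no_grad()
  def diffusion_sample(model, seq_len, mask_id, num_steps=8):
      x = torch.full((1, seq_len), mask_id, dtype=torch.long)  # fully noised
      for step in range(num_steps):
          logits = model(x)                          # (1, seq_len, vocab)
          conf, pred = logits.softmax(-1).max(-1)    # per-token confidence
          masked = x == mask_id
          # commit a fraction of the remaining masked positions this step
          k = max(1, int(masked.sum().item()) // (num_steps - step))
          conf = conf.masked_fill(~masked, float("-inf"))
          idx = conf.topk(k, dim=-1).indices
          x.scatter_(1, idx, pred.gather(1, idx))
      return x
  ```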
- MMaDA: Multimodal Large Diffusion Language Models
- unified diffusion architecture
- superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation
- rich and impressive examples
- with an appendix on Preliminaries of Discrete Diffusion, PPO, and GRPO
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding
- Large Vision-Language Diffusion Model with Masking
- follows a similar design to common AR VLMs like LLaVA
- GRIT: Teaching MLLMs to Think with Images
- generate visually grounded reasoning chains by interleaving natural language with explicit bounding box coordinates referencing relevant image regions (illustrated below)
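  A hypothetical example of such an interleaved chain; the bracketed (x1, y1, x2, y2) pixel-coordinate format is an assumption, not necessarily the paper's exact token scheme.

  ```python
  # Hypothetical GRIT-style grounded reasoning chain: prose interleaved with
  # bounding boxes (x1, y1, x2, y2) over image regions. Format is assumed.
  chain = (
      "The player [112, 80, 260, 430] is reaching toward the ball "
      "[300, 210, 340, 250], which sits outside the goal area "
      "[20, 180, 480, 470], so this is likely a save attempt."
  )
  ```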
- Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
- trained using a novel two-phase paradigm: Autoregressive-then-Diffusion
- dKV-Cache: The Cache for Diffusion Language Models
- diffusion language models have long been constrained by slow inference
- motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process
- propose a delayed and conditioned caching strategy for key and value states (see the sketch below)
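  A rough sketch of how a delayed cache could work, under the assumption that a token's key/value states are worth caching only once it has been decoded for at least one step (so its representation has settled); this illustrates the idea, not the paper's exact algorithm.

  ```python
  # Assumption-heavy sketch of a delayed KV cache for a diffusion LM:
  # decoded tokens' representations change little across later steps, so
  # cache their K/V; masked tokens are always recomputed.
  class DelayedKVCache:
      def __init__(self):
          self.k, self.v, self.cached = {}, {}, set()

      def update(self, pos, k, v, decoded_step, current_step):
          # delay: only cache once the token has been decoded >= 1 step ago
          if decoded_step is not None and current_step - decoded_step >= 1:
              self.k[pos], self.v[pos] = k, v
              self.cached.add(pos)

      def get(self, pos):
          # returns cached (k, v) or None, meaning "recompute this position"
          return (self.k[pos], self.v[pos]) if pos in self.cached else None
  ```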
- Understanding Generative AI Capabilities in Everyday Image Editing Tasks
- analyzing 83k requests and their associated 305k edits from the past 12 years on the `/r/PhotoshopRequest` Reddit community
- new dataset: PSR
- Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
- lots of examples of game creation
- efficiency
- Scaling Law for Quantization-Aware Training
- a comprehensive scaling law for 4-bit QAT of LLMs, integrating model size, training dataset size, and quantization granularity
- previous methods do not account for quantization granularity G
- weight and activation quantization errors tend to contribute almost equally to the total error (a plausible functional form is sketched below)
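  One plausible functional form consistent with these notes, hedged: a standard Chinchilla-style loss plus a quantization-error term that shrinks with model size N, grows with training tokens D, and grows with coarser granularity G. The exponents and constants below are placeholders, not the paper's fit.

  ```latex
  % Hedged sketch of a 4-bit QAT scaling law; all constants are placeholders.
  % L: loss, N: model size, D: training tokens, G: quantization granularity.
  L(N, D, G) =
      \underbrace{\frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E}_{\text{standard scaling law}}
    + \underbrace{k \, \frac{D^{\beta'} G^{\gamma}}{N^{\alpha'}}}_{\delta_{\mathrm{QAT}}(N, D, G)}
  ```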
- Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
- push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework
- perturbs the continuous quantization scale for gradient estimation and uses directional derivative clipping to stabilize training (sketched after this entry)
- Zeroth-order optimization (ZO) methods are often used in cases where gradients and higher-order derivatives of the objective cannot be directly computed or are unreliable
- successfully fine-tune Stable Diffusion 3.5 Large quantized by BitsAndBytes on stylized images using a single Nvidia RTX 4090 24GB GPU
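  A minimal sketch of a two-point (SPSA-style) zeroth-order update that perturbs the continuous quantization scale `s` rather than the discrete weights, with a plain clamp standing in for the paper's directional derivative clipping; `loss_fn`, `eps`, `lr`, and `clip` are illustrative assumptions.

  ```python
  # Two-point zeroth-order gradient estimate: no backprop through the
  # quantized network is needed, only two forward passes per update.
  import torch

  def zo_step(loss_fn, s, eps=1e-3, lr=1e-4, clip=1.0):
      u = torch.randn_like(s)                      # random direction
      d = (loss_fn(s + eps * u) - loss_fn(s - eps * u)) / (2 * eps)
      d = d.clamp(-clip, clip)                     # stabilize the estimate
      return s - lr * d * u                        # update along u only
  ```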
- A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
- trains a set of low-rank projection matrices that jointly enable soft pruning (compressing teacher weights) and activation cloning (aligning student activations, including FFN signals, with the teacher's); see the sketch after this entry
- remarkable distillation efficiency, achieving superior performance with more than 1000× fewer training tokens
- LRC w/o FFN produces a substantial performance degradation that persists throughout training, further confirming the critical importance of FFN activations
- LRC’s projection-based alignment is not only sufficient for effective knowledge transfer but also more efficient and stable
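  A minimal sketch of the two coupled objectives, assuming illustrative hidden sizes and an MSE alignment loss; using the same projection `P` for both weight compression and activation alignment is my reading of the summary, not a verified implementation detail.

  ```python
  # Hedged sketch of Low-Rank Clone-style distillation: a trainable
  # projection P maps teacher hidden states to the student's space; the same
  # matrix compresses (square) teacher weights into student initializations.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  d_teacher, d_student = 4096, 1024        # assumed hidden sizes
  P = nn.Linear(d_teacher, d_student, bias=False)

  def activation_clone_loss(h_teacher, h_student):
      # h_*: (batch, seq, hidden) activations from matching layers,
      # including FFN signals; align student with projected teacher
      return F.mse_loss(h_student, P(h_teacher))

  def soft_prune_init(w_teacher):
      # map both dims of a (d_t, d_t) teacher weight down to (d_s, d_s)
      return P.weight @ w_teacher @ P.weight.T
  ```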
- agents, reasoning, RL
- NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
- Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
- Training-Free Reasoning and Reflection in MLLMs
- Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
- RLVR-World: Training World Models with Reinforcement Learning
- SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
- Risk-Averse Reinforcement Learning with Itakura-Saito Loss
- safety
- Phare: A Safety Probe for Large Language Models
- Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
- Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
- application
- Steering Large Language Models for Machine Translation Personalization
- This Time is Different: An Observability Perspective on Time Series Foundation Models
- Prior Prompt Engineering for Reinforcement Fine-Tuning
- Using Large Language Models for Commit Message Generation: A Preliminary Study
- The Distracting Effect: Understanding Irrelevant Passages in RAG
- more
- Distilling LLM Agent into Small Models with Retrieval and Code Tools
- CLEVER: A Curated Benchmark for Formally Verified Code Generation
- DiSA: Diffusion Step Annealing in Autoregressive Image Generation
- Capability-Based Scaling Laws for LLM Red-Teaming
- FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
- GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
- watch The 3D Gaussian Splatting Adventure: Past, Present, Future
- DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
- Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation
- Large Language Models Often Know When They Are Being Evaluated
- Tiny-diffusion: A minimal implementation of probabilistic diffusion models
- AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
- Time Series Forecasting with Graph Transformers
- The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
- Compiling LLMs into a MegaKernel: A path to low-latency inference
- Magenta RealTime: An Open-Weights Live Music Model
- Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models
- Let Your Video Listen to Your Music!
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
- Bridging Cinematic Principles and Generative AI for Automated Film Generation
- Show HN: PILF, The ultimate solution to catastrophic oblivion on AI models
- Qwen VLo: From “Understanding” the World to “Depicting” It
- WorldVLA: Towards Autoregressive Action World Model (on HN)
- Small language models are the future of agentic AI (on HN)
- Overclocking LLM Reasoning: Monitoring and Controlling LLM Thinking Path Lengths (on HN)
- Reinforcement Learning from Human Feedback (RLHF) in Notebooks
- LLMs should not replace therapists (on HN)
- Mercury: Ultra-fast language models based on diffusion (on HN)
- Biomni: A General-Purpose Biomedical AI Agent (on HN)
- Distributed AI Agents for Cognitive Underwater Robot Autonomy
- GEPA: Reflective prompt evolution can outperform reinforcement learning (on HN)
- Hijacking multi-agent systems in your PajaMAS
- Core Safety Values for Provably Corrigible Agents
- Flow Matching Policy Gradients
- Fine-tuned small LLMs can beat large ones with programmatic data curation (on HN)
- caveat: the chosen task is not considered challenging
- Persona vectors: Monitoring and controlling character traits in language models (on HN)
- Qwen-Image: Crafting with native text rendering (on HN)
- Exploring Autonomous Agents: A Closer Look at Why They Fail When...
- Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (on HN)
- Context Rot: How Increasing Input Tokens Impacts LLM Performance (on HN) (on lobste.rs)
- All AI models might be the same (on HN)
- LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
- Subliminal learning: Models transmit behaviors via hidden signals in data (on HN)
- Simon Willison | Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
- Flow Matching Meets Biology and Life Science: A Survey
- ByteDance-Seed/Seed-Prover on GitHub (SeedProver directory)
- Transformers Without Normalization (on HN)