some HF papers worth skimming

May 25, 2025 · 7 min read

tags: draft · lm · mm · agent

  • multimodal, diffusion
    • Scaling Diffusion Transformers Efficiently via μP
      • establish μP as a principled and efficient scaling strategy for diffusion Transformers
      • with appendix on Theoretical Background of μP
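For context, a minimal sketch of what μP prescribes for one hidden linear layer under Adam, assuming the standard recipe from the Tensor Programs line of work and glossing over the different rules for vector-like parameters (biases, embeddings); the widths, names, and base learning rate are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

# muP sketch for one hidden layer under Adam (standard recipe, assumed
# here, not necessarily the paper's exact setup):
#   - hidden weights: init std ~ 1/sqrt(fan_in)
#   - hidden-weight Adam LR shrinks linearly with the width multiple,
#     so hyperparameters tuned at base_width transfer to the wide model.
base_width, width = 256, 4096
lr_base = 1e-3

hidden = nn.Linear(width, width)
nn.init.normal_(hidden.weight, std=width ** -0.5)  # variance ~ 1/fan_in
nn.init.zeros_(hidden.bias)

opt = torch.optim.Adam(
    [{"params": hidden.parameters(), "lr": lr_base * base_width / width}]
)
```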
    • Neurosymbolic Diffusion Models
      • the first method to integrate masked diffusion models as the neural network extractor in neurosymbolic predictors
      • with a very long appendix on math background
    • Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
      • propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks
      • focus on Dream 7B: introducing Dream 7B, the most powerful open diffusion large language model to date
        • consistently outperforms existing diffusion language models by a large margin
        • matches or exceeds top-tier autoregressive (AR) language models of similar size on general, math, and coding abilities
        • demonstrates strong planning ability and inference flexibility that naturally benefit from diffusion modeling
        • virtually all leading LLMs rely on the same sequential left-to-right architecture
        • discrete diffusion models (DMs), which dynamically refine the full sequence in parallel starting from a fully noised state, have gained attention as a promising alternative for sequence generation since their introduction to the text domain (toy decoding sketch below)
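To make the "parallel refinement from a fully noised state" concrete, here is a toy sketch of confidence-based iterative unmasking, assuming a model that returns per-position logits; the schedule and remasking rules in Dream 7B and related models are more refined.

```python
import torch

@torch.no_grad()
def masked_diffusion_decode(model, seq_len, mask_id, steps=8):
    """Toy parallel decoder: start fully masked, then at each step commit
    the most confident predictions and keep the rest masked (a MaskGIT-style
    schedule; real diffusion LMs use refined variants)."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(x)                          # (1, seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)  # confidence + argmax token
        masked = x.eq(mask_id)
        # unmask a growing fraction of positions, most confident first
        k = int(seq_len * step / steps) - int((~masked).sum())
        if k > 0:
            conf = probs.masked_fill(~masked, float("-inf"))
            idx = conf.topk(k, dim=-1).indices[0]
            x[0, idx] = preds[0, idx]
    return x
```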
    • MMaDA: Multimodal Large Diffusion Language Models
      • unified diffusion architecture
      • superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation
      • rich and impressive examples
      • with appendix on Preliminaries of Discrete Diffusion, PPO and GRPO
    • LaViDa: A Large Diffusion Language Model for Multimodal Understanding
      • Large Vision-Language Diffusion Model with Masking
      • follows a similar design to common AR VLMs like LLaVA
    • GRIT: Teaching MLLMs to Think with Images
      • generate visually grounded reasoning chains by interleaving natural language with explicit bounding box coordinates referencing relevant image regions
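Illustratively (the exact serialization is my guess, not the paper's), such a grounded chain interleaves text with box coordinates:

```python
# Hypothetical serialization of a visually grounded reasoning chain:
# free-form text interleaved with [x1, y1, x2, y2] pixel boxes that
# reference the relevant image regions.
chain = (
    "The sign <box>[412, 88, 590, 170]</box> reads 'EXIT', and the arrow "
    "<box>[455, 180, 540, 260]</box> below it points left, so the exit "
    "is to the left."
)
```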
    • Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
      • trained using a novel two-phase paradigm: autoregressive-then-diffusion
    • dKV-Cache: The Cache for Diffusion Language Models
      • diffusion language models have long been constrained by slow inference
      • motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process
      • propose a delayed and conditioned caching strategy for key and value states
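A toy sketch of how such a cache might look; the freezing rule, `delay` policy, and data structures here are my simplifications, not the paper's exact algorithm.

```python
class DelayedKVCache:
    """Toy sketch: a position's key/value states are only frozen into the
    cache once its token has stayed committed (unmasked) for `delay`
    consecutive denoising steps; volatile positions are recomputed every
    step. Simplified from the paper's idea, not its exact algorithm."""

    def __init__(self, delay=1):
        self.delay = delay
        self.age = {}   # position -> steps since the token was committed
        self.kv = {}    # position -> frozen (key, value) pair

    def update(self, committed, fresh_kv):
        """committed: list[bool] per position; fresh_kv: pos -> (k, v)."""
        for pos, done in enumerate(committed):
            if not done:                      # re-masked or still noisy
                self.age.pop(pos, None)
                self.kv.pop(pos, None)
                continue
            self.age[pos] = self.age.get(pos, 0) + 1
            if self.age[pos] > self.delay:    # stable long enough: freeze it
                self.kv.setdefault(pos, fresh_kv[pos])

    def get(self, pos, fresh_kv):
        """Serve cached K/V when frozen, otherwise the recomputed value."""
        return self.kv.get(pos, fresh_kv[pos])
```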
    • Understanding Generative AI Capabilities in Everyday Image Editing Tasks
      • analyzing 83k requests and their associated 305k edits from the past 12 years on the /r/PhotoshopRequest Reddit community
      • new dataset: PSR
    • Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
      • lots of examples of game creation
  • efficiency
    • Scaling Law for Quantization-Aware Training
      • a comprehensive scaling law for 4-bit QAT of LLMs, integrating model size, training dataset size, and quantization granularity
      • previous methods do not account for quantization granularity G
      • weight and activation quantization errors tend to contribute almost equally to the total error
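As a generic illustration of fitting such a law (the multiplicative power-law form below is an assumption of mine for illustration; the paper's actual functional form and exponents differ), one can regress log-error on log N, log D, and log G:

```python
import numpy as np

# Illustration only: fit a generic multiplicative power law
#     err(N, D, G) ~ c * N**(-a) * D**b * G**g
# in log space, just to show granularity G entering as a third axis.
def fit_qat_scaling_law(N, D, G, err):
    X = np.column_stack([np.ones_like(N), np.log(N), np.log(D), np.log(G)])
    coef, *_ = np.linalg.lstsq(X, np.log(err), rcond=None)
    log_c, neg_a, b, g = coef
    return np.exp(log_c), -neg_a, b, g

# toy, made-up measurements
N = np.array([1e8, 3e8, 1e9, 3e9])      # parameters
D = np.array([5e9, 1e10, 2e10, 4e10])   # training tokens
G = np.array([256, 128, 64, 32])        # quantization group size
err = np.array([0.14, 0.11, 0.09, 0.08])
print(fit_qat_scaling_law(N, D, G, err))
```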
    • Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
      • push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework
      • perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training
      • Zeroth-order optimization (ZO) methods are often used in cases where gradients and higher-order derivatives of the objective cannot be directly computed or are unreliable
      • successfully fine-tune Stable Diffusion 3.5 Large quantized by BitsAndBytes on stylized images using a single Nvidia RTX 4090 24GB GPU
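The core estimator is easy to sketch, assuming a two-point SPSA-style finite difference on the quantizer's continuous scale; the hyperparameters and the exact clipping rule here are illustrative, not the paper's.

```python
import torch

def zo_step_on_scale(loss_fn, scale, eps=1e-3, lr=1e-4, clip=1.0):
    """One zeroth-order (two-point, SPSA-style) update on a quantizer's
    continuous scale parameter: perturb along a random direction, estimate
    the directional derivative by finite differences, clip it, and step.
    No gradients or optimizer state over the weights are needed."""
    u = torch.randn_like(scale)                  # random probe direction
    with torch.no_grad():
        d = (loss_fn(scale + eps * u) - loss_fn(scale - eps * u)) / (2 * eps)
        d = d.clamp(-clip, clip)                 # directional-derivative clipping
        scale -= lr * d * u
    return scale
```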
    • A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
      • trains a set of low-rank projection matrices that jointly enable soft pruning (compressing teacher weights) and activation cloning (aligning student activations, including FFN signals, with the teacher's)
      • remarkable distillation efficiency, achieving superior performance with more than 1000× fewer training tokens
      • LRC w/o FFN produces a substantial performance degradation that persists throughout training, further confirming the critical importance of FFN activations
      • LRC’s projection-based alignment is not only sufficient for effective knowledge transfer but also more efficient and stable
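A minimal sketch of the projection-based alignment; the dimensions and the unweighted sum of losses are both assumptions of mine.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the Low-Rank Clone recipe: a trainable projection maps the
# teacher's hidden size down to the student's, acting both as a soft
# pruning of teacher weights and as the target for activation cloning.
d_teacher, d_student = 4096, 1024
proj = nn.Linear(d_teacher, d_student, bias=False)

def clone_loss(student_h, teacher_h, student_ffn, teacher_ffn):
    """Align student activations, including FFN signals, with projected
    teacher activations; per the ablation above, dropping the FFN term
    degrades performance substantially."""
    return (F.mse_loss(student_h, proj(teacher_h))
            + F.mse_loss(student_ffn, proj(teacher_ffn)))
```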
  • agents, reasoning, RL
    • NovelSeek: When Agent Becomes the Scientist — Building Closed-Loop System from Hypothesis to Verification
    • Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
    • Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
    • AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
    • Training-Free Reasoning and Reflection in MLLMs
    • Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
    • RLVR-World: Training World Models with Reinforcement Learning
    • SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
    • Risk-Averse Reinforcement Learning with Itakura-Saito Loss
  • safety
    • Phare: A Safety Probe for Large Language Models
    • Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
    • Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
  • application
    • Steering Large Language Models for Machine Translation Personalization
    • This Time is Different: An Observability Perspective on Time Series Foundation Models
    • Prior Prompt Engineering for Reinforcement Fine-Tuning
    • Using Large Language Models for Commit Message Generation: A Preliminary Study
    • The Distracting Effect: Understanding Irrelevant Passages in RAG
  • more
    • Distilling LLM Agent into Small Models with Retrieval and Code Tools
    • CLEVER: A Curated Benchmark for Formally Verified Code Generation
    • DiSA: Diffusion Step Annealing in Autoregressive Image Generation
    • Capability-Based Scaling Laws for LLM Red-Teaming
    • FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
    • GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
    • watch The 3D Gaussian Splatting Adventure: Past, Present, Future
    • DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
    • GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
    • Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation
    • Large Language Models Often Know When They Are Being Evaluated
    • Tiny-diffusion: A minimal implementation of probabilistic diffusion models
    • AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
    • Time Series Forecasting with Graph Transformers
    • The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
    • Compiling LLMs into a MegaKernel: A path to low-latency inference
    • Magenta RealTime: An Open-Weights Live Music Model
    • Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models
    • Let Your Video Listen to Your Music!
    • Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
    • Bridging Cinematic Principles and Generative AI for Automated Film Generation
    • Show HN: PILF, The ultimate solution to catastrophic oblivion on AI models
    • Qwen VLo: From “Understanding” the World to “Depicting” It
    • WorldVLA: Towards Autoregressive Action World Model (on HN)
    • Small language models are the future of agentic AI (on HN)
    • Overclocking LLM Reasoning: Monitoring and Controlling LLM Thinking Path Lengths (on HN)
    • Reinforcement Learning from Human Feedback (RLHF) in Notebooks
    • LLMs should not replace therapists (on HN)
    • Mercury: Ultra-fast language models based on diffusion (on HN)
    • Biomni: A General-Purpose Biomedical AI Agent (on HN)
    • Distributed AI Agents for Cognitive Underwater Robot Autonomy
    • GEPA: Reflective prompt evolution can outperform reinforcement learning (on HN)
    • Hijacking multi-agent systems in your PajaMAS
    • Core Safety Values for Provably Corrigible Agents
    • Flow Matching Policy Gradients
    • Fine-tuned small LLMs can beat large ones with programmatic data curation (on HN)
      • caveat: the chosen task is arguably not challenging
    • Persona vectors: Monitoring and controlling character traits in language models (on HN)
    • Qwen-Image: Crafting with native text rendering (on HN)
    • Exploring Autonomous Agents: A Closer Look at Why They Fail When…
    • Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models
    • Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (on HN)
    • Context Rot: How Increasing Input Tokens Impacts LLM Performance (on HN) (on lobste.rs)
    • All AI models might be the same (on HN)
    • LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
    • Subliminal learning: Models transmit behaviors via hidden signals in data (on HN)
      • Simon Willison | Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
    • Flow Matching Meets Biology and Life Science: A Survey
    • Seed-Prover (ByteDance-Seed/Seed-Prover on GitHub)
    • Transformers Without Normalization (on HN)
    • Embedding-Aware Quantum-Classical SVMs for Scalable Quantum Machine Learning
