some HF papers worth skimming

May 25, 2025 · 7 min read

tags: draft · lm · mm · agent

  • multimodal, diffusion
    • Scaling Diffusion Transformers Efficiently via μP
      • establish μP as a principled and efficient scaling strategy for diffusion Transformers
      • with appendix on Theoretical Background of μP
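For context, a minimal sketch of what μP prescribes for one hidden linear layer under Adam, assuming the standard recipe from the Tensor Programs line of work and glossing over the different rules for vector-like parameters (biases, embeddings); the widths, names, and base learning rate are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

# muP sketch for one hidden layer under Adam (standard recipe, assumed
# here, not necessarily the paper's exact setup):
#   - hidden weights: init std ~ 1/sqrt(fan_in)
#   - hidden-weight Adam LR shrinks linearly with the width multiple,
#     so hyperparameters tuned at base_width transfer to the wide model.
base_width, width = 256, 4096
lr_base = 1e-3

hidden = nn.Linear(width, width)
nn.init.normal_(hidden.weight, std=width ** -0.5)  # variance ~ 1/fan_in
nn.init.zeros_(hidden.bias)

opt = torch.optim.Adam(
    [{"params": hidden.parameters(), "lr": lr_base * base_width / width}]
)
```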
    • Neurosymbolic Diffusion Models
      • the first method to integrate masked diffusion models as the neural network extractor in neurosymbolic predictors
      • with a very long appendix on math background
    • Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
      • propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks
      • focus on Dream 7B: introducing Dream 7B, the most powerful open diffusion large language model to date
        • consistently outperforms existing diffusion language models by a large margin
        • matches or exceeds top-tier autoregressive (AR) language models of similar size on general, math, and coding abilities
        • demonstrates strong planning ability and inference flexibility that naturally benefit from diffusion modeling
        • virtually all leading LLMs rely on the same sequential left-to-right architecture
        • discrete diffusion models (DMs), which dynamically refine the full sequence in parallel starting from a fully noised state, have gained attention as a promising alternative for sequence generation since their introduction to the text domain (toy decoding sketch below)
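To make the "parallel refinement from a fully noised state" concrete, here is a toy sketch of confidence-based iterative unmasking, assuming a model that returns per-position logits; the schedule and remasking rules in Dream 7B and related models are more refined.

```python
import torch

@torch.no_grad()
def masked_diffusion_decode(model, seq_len, mask_id, steps=8):
    """Toy parallel decoder: start fully masked, then at each step commit
    the most confident predictions and keep the rest masked (a MaskGIT-style
    schedule; real diffusion LMs use refined variants)."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(x)                          # (1, seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)  # confidence + argmax token
        masked = x.eq(mask_id)
        # unmask a growing fraction of positions, most confident first
        k = int(seq_len * step / steps) - int((~masked).sum())
        if k > 0:
            conf = probs.masked_fill(~masked, float("-inf"))
            idx = conf.topk(k, dim=-1).indices[0]
            x[0, idx] = preds[0, idx]
    return x
```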
    • MMaDA: Multimodal Large Diffusion Language Models
      • unified diffusion architecture
      • superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation
      • rich and impressive examples
      • with appendix on Preliminaries of Discrete Diffusion, PPO and GRPO
    • LaViDa: A Large Diffusion Language Model for Multimodal Understanding
      • Large Vision-Language Diffusion Model with Masking
      • follows a similar design to common AR VLMs like LLaVA
    • GRIT: Teaching MLLMs to Think with Images
      • generate visually grounded reasoning chains by interleaving natural language with explicit bounding box coordinates referencing relevant image regions
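Illustratively (the exact serialization is my guess, not the paper's), such a grounded chain interleaves text with box coordinates:

```python
# Hypothetical serialization of a visually grounded reasoning chain:
# free-form text interleaved with [x1, y1, x2, y2] pixel boxes that
# reference the relevant image regions.
chain = (
    "The sign <box>[412, 88, 590, 170]</box> reads 'EXIT', and the arrow "
    "<box>[455, 180, 540, 260]</box> below it points left, so the exit "
    "is to the left."
)
```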
    • Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
      • trained using a novel two-phase paradigm: autoregressive-then-diffusion
    • dKV-Cache: The Cache for Diffusion Language Models
      • diffusion language models have long been constrained by slow inference
      • motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process
      • propose a delayed and conditioned caching strategy for key and value states
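A toy sketch of how such a cache might look; the freezing rule, `delay` policy, and data structures here are my simplifications, not the paper's exact algorithm.

```python
class DelayedKVCache:
    """Toy sketch: a position's key/value states are only frozen into the
    cache once its token has stayed committed (unmasked) for `delay`
    consecutive denoising steps; volatile positions are recomputed every
    step. Simplified from the paper's idea, not its exact algorithm."""

    def __init__(self, delay=1):
        self.delay = delay
        self.age = {}   # position -> steps since the token was committed
        self.kv = {}    # position -> frozen (key, value) pair

    def update(self, committed, fresh_kv):
        """committed: list[bool] per position; fresh_kv: pos -> (k, v)."""
        for pos, done in enumerate(committed):
            if not done:                      # re-masked or still noisy
                self.age.pop(pos, None)
                self.kv.pop(pos, None)
                continue
            self.age[pos] = self.age.get(pos, 0) + 1
            if self.age[pos] > self.delay:    # stable long enough: freeze it
                self.kv.setdefault(pos, fresh_kv[pos])

    def get(self, pos, fresh_kv):
        """Serve cached K/V when frozen, otherwise the recomputed value."""
        return self.kv.get(pos, fresh_kv[pos])
```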
    • Understanding Generative AI Capabilities in Everyday Image Editing Tasks
      • analyzing 83k requests and their associated 305k edits from the past 12 years on the /r/PhotoshopRequest Reddit community
      • new dataset: PSR
    • Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
      • lots of examples of game creation
  • efficiency
    • Scaling Law for Quantization-Aware Training
      • a comprehensive scaling law for 4-bit QAT of LLMs, integrating model size, training dataset size, and quantization granularity
      • previous methods do not account for quantization granularity G
      • weight and activation quantization errors tend to contribute almost equally to the total error
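As a generic illustration of fitting such a law (the multiplicative power-law form below is an assumption of mine for illustration; the paper's actual functional form and exponents differ), one can regress log-error on log N, log D, and log G:

```python
import numpy as np

# Illustration only: fit a generic multiplicative power law
#     err(N, D, G) ~ c * N**(-a) * D**b * G**g
# in log space, just to show granularity G entering as a third axis.
def fit_qat_scaling_law(N, D, G, err):
    X = np.column_stack([np.ones_like(N), np.log(N), np.log(D), np.log(G)])
    coef, *_ = np.linalg.lstsq(X, np.log(err), rcond=None)
    log_c, neg_a, b, g = coef
    return np.exp(log_c), -neg_a, b, g

# toy, made-up measurements
N = np.array([1e8, 3e8, 1e9, 3e9])      # parameters
D = np.array([5e9, 1e10, 2e10, 4e10])   # training tokens
G = np.array([256, 128, 64, 32])        # quantization group size
err = np.array([0.14, 0.11, 0.09, 0.08])
print(fit_qat_scaling_law(N, D, G, err))
```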
    • Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
      • push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework
      • perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training
      • Zeroth-order optimization (ZO) methods are often used in cases where gradients and higher-order derivatives of the objective cannot be directly computed or are unreliable
      • successfully fine-tune Stable Diffusion 3.5 Large quantized by BitsAndBytes on stylized images using a single Nvidia RTX 4090 24GB GPU
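The core estimator is easy to sketch, assuming a two-point SPSA-style finite difference on the quantizer's continuous scale; the hyperparameters and the exact clipping rule here are illustrative, not the paper's.

```python
import torch

def zo_step_on_scale(loss_fn, scale, eps=1e-3, lr=1e-4, clip=1.0):
    """One zeroth-order (two-point, SPSA-style) update on a quantizer's
    continuous scale parameter: perturb along a random direction, estimate
    the directional derivative by finite differences, clip it, and step.
    No gradients or optimizer state over the weights are needed."""
    u = torch.randn_like(scale)                  # random probe direction
    with torch.no_grad():
        d = (loss_fn(scale + eps * u) - loss_fn(scale - eps * u)) / (2 * eps)
        d = d.clamp(-clip, clip)                 # directional-derivative clipping
        scale -= lr * d * u
    return scale
```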
    • A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
      • trains a set of low-rank projection matrices that jointly enable soft pruning (compressing teacher weights) and activation cloning (aligning student activations, including FFN signals, with the teacher's)
      • remarkable distillation efficiency, achieving superior performance with more than 1000× fewer training tokens
      • LRC w/o FFN produces a substantial performance degradation that persists throughout training, further confirming the critical importance of FFN activations
      • LRC’s projection-based alignment is not only sufficient for effective knowledge transfer but also more efficient and stable
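A minimal sketch of the projection-based alignment; the dimensions and the unweighted sum of losses are both assumptions of mine.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the Low-Rank Clone recipe: a trainable projection maps the
# teacher's hidden size down to the student's, acting both as a soft
# pruning of teacher weights and as the target for activation cloning.
d_teacher, d_student = 4096, 1024
proj = nn.Linear(d_teacher, d_student, bias=False)

def clone_loss(student_h, teacher_h, student_ffn, teacher_ffn):
    """Align student activations, including FFN signals, with projected
    teacher activations; per the ablation above, dropping the FFN term
    degrades performance substantially."""
    return (F.mse_loss(student_h, proj(teacher_h))
            + F.mse_loss(student_ffn, proj(teacher_ffn)))
```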
  • agents, reasoning, RL
    • NovelSeek: When Agent Becomes the Scientist — Building Closed-Loop System from Hypothesis to Verification
    • Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
    • Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
    • AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
    • Training-Free Reasoning and Reflection in MLLMs
    • Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
    • RLVR-World: Training World Models with Reinforcement Learning
    • SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
    • Risk-Averse Reinforcement Learning with Itakura-Saito Loss
  • safety
    • Phare: A Safety Probe for Large Language Models
    • Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
    • Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
  • application
    • Steering Large Language Models for Machine Translation Personalization
    • This Time is Different: An Observability Perspective on Time Series Foundation Models
    • Prior Prompt Engineering for Reinforcement Fine-Tuning
    • Using Large Language Models for Commit Message Generation: A Preliminary Study
    • The Distracting Effect: Understanding Irrelevant Passages in RAG
  • more
    • Distilling LLM Agent into Small Models with Retrieval and Code Tools
    • CLEVER: A Curated Benchmark for Formally Verified Code Generation
    • DiSA: Diffusion Step Annealing in Autoregressive Image Generation
    • Capability-Based Scaling Laws for LLM Red-Teaming
    • FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
    • GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
    • watch The 3D Gaussian Splatting Adventure: Past, Present, Future
    • DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
    • GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
    • Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation
    • Large Language Models Often Know When They Are Being Evaluated
    • Tiny-diffusion: A minimal implementation of probabilistic diffusion models
    • AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
    • Time Series Forecasting with Graph Transformers
    • The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
    • Compiling LLMs into a MegaKernel: A path to low-latency inference
    • Magenta RealTime: An Open-Weights Live Music Model
    • Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models
    • Let Your Video Listen to Your Music!
    • Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
    • Bridging Cinematic Principles and Generative AI for Automated Film Generation
    • Show HN: PILF, The ultimate solution to catastrophic oblivion on AI models
    • Qwen VLo: From “Understanding” the World to “Depicting” It
    • WorldVLA: Towards Autoregressive Action World Model (on HN)
    • Small language models are the future of agentic AI (on HN)
    • Overclocking LLM Reasoning: Monitoring and Controlling LLM Thinking Path Lengths (on HN)
    • Reinforcement Learning from Human Feedback (RLHF) in Notebooks
    • LLMs should not replace therapists (on HN)
    • Mercury: Ultra-fast language models based on diffusion (on HN)
    • Biomni: A General-Purpose Biomedical AI Agent (on HN)
    • Distributed AI Agents for Cognitive Underwater Robot Autonomy
    • GEPA: Reflective prompt evolution can outperform reinforcement learning (on HN)
    • Hijacking multi-agent systems in your PajaMAS
    • Core Safety Values for Provably Corrigible Agents
    • Flow Matching Policy Gradients
    • Fine-tuned small LLMs can beat large ones with programmatic data curation (on HN)
      • caveat: the chosen task is arguably not challenging
    • Persona vectors: Monitoring and controlling character traits in language models (on HN)
    • Qwen-Image: Crafting with native text rendering (on HN)
    • Exploring Autonomous Agents: A Closer Look at Why They Fail When…
    • Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models
    • Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (on HN)
    • Context Rot: How Increasing Input Tokens Impacts LLM Performance (on HN) (on lobste.rs)
    • All AI models might be the same (on HN)
    • LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
    • Subliminal learning: Models transmit behaviors via hidden signals in data (on HN)
      • Simon Willison | Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
    • Flow Matching Meets Biology and Life Science: A Survey
    • Seed-Prover (ByteDance-Seed/Seed-Prover on GitHub)
    • Transformers Without Normalization (on HN)
    • Embedding-Aware Quantum-Classical SVMs for Scalable Quantum Machine Learning
