
some HF papers worth skimming [uts-016E]

- multimodal, diffusion
    - Scaling Diffusion Transformers Efficiently via μP
        - establish μP as a principled and efficient scaling strategy for diffusion Transformers
        - with appendix on Theoretical Background of μP
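        - the gist of μP, as a toy sketch (my own illustration, not the paper's exact parameterization; the `embed`/`lm_head` name filter is a placeholder heuristic): scale per-layer learning rates with width so that hyperparameters tuned on a narrow proxy transfer to the full-width model

          ```python
          import torch

          def mup_param_groups(model, width, base_width=256, base_lr=3e-4):
              """Toy muP-style grouping: matrix-shaped hidden weights get their Adam
              learning rate scaled by base_width / width; embeddings, output head,
              and vector params (biases, norms) keep the base learning rate."""
              mult = base_width / width
              hidden, other = [], []
              for name, p in model.named_parameters():
                  is_hidden = p.ndim >= 2 and "embed" not in name and "lm_head" not in name
                  (hidden if is_hidden else other).append(p)
              return [
                  {"params": hidden, "lr": base_lr * mult},
                  {"params": other, "lr": base_lr},
              ]

          # usage (hypothetical): optimizer = torch.optim.AdamW(mup_param_groups(model, width=1024))
          ```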
    - Neurosymbolic Diffusion Models
        - the first method to integrate masked diffusion models as the neural network extractor in neurosymbolic predictors
        - with a very long appendix on math background
    - Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
        - propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs especially on reasoning tasks
        - focus on Dream 7B: Introducing Dream 7B, the most powerful open diffusion large language model to date
            - consistently outperforms existing diffusion language models by a large margin
            - matches or exceeds top-tier autoregressive (AR) language models of similar size on general, math, and coding abilities
            - demonstrates strong planning ability and inference flexibility that naturally benefits from the diffusion modeling
            - virtually all leading LLMs rely on the same sequential left-to-right architecture
            - discrete diffusion models (DMs) have gained attention as a promising alternative for sequence generation since their introduction to the text domain; they iteratively refine the full sequence in parallel, starting from a fully noised state (sketched below)
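            - a minimal sketch of that parallel refinement loop (my own illustration of the general masked-diffusion idea, not Dream 7B's actual sampler; `model` and `mask_id` are placeholders): start from an all-mask sequence and, over a fixed number of steps, commit the positions the model is most confident about while re-predicting the rest

              ```python
              import torch
              import torch.nn.functional as F

              @torch.no_grad()
              def masked_diffusion_sample(model, length, mask_id, steps=16, device="cpu"):
                  # Start from a fully "noised" sequence: every position is [MASK].
                  x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
                  for step in range(steps):
                      still_masked = x.eq(mask_id)
                      if not still_masked.any():
                          break
                      logits = model(x)                      # (1, length, vocab)
                      probs = F.softmax(logits, dim=-1)
                      conf, pred = probs.max(dim=-1)         # per-position confidence / argmax token
                      # Commit roughly an equal share of the remaining masked positions per step.
                      n_unmask = max(1, still_masked.sum().item() // (steps - step))
                      conf = conf.masked_fill(~still_masked, -1.0)
                      idx = conf.topk(n_unmask, dim=-1).indices
                      x.scatter_(1, idx, pred.gather(1, idx))
                  return x
              ```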
    - MMaDA: Multimodal Large Diffusion Language Models
        - unified diffusion architecture
        - superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation
        - rich and impressive examples
        - with appendix on Preliminaries of Discrete Diffusion, PPO and GRPO
    - LaViDa: A Large Diffusion Language Model for Multimodal Understanding
        - Large Vision-Language Diffusion Model with Masking
        - follows a similar design to common AR VLMs like LLaVA
    - GRIT: Teaching MLLMs to Think with Images
        - generate visually grounded reasoning chains by interleaving natural language with explicit bounding box coordinates referencing relevant image regions
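        - a purely illustrative example of what such an interleaved chain looks like (the tag format and coordinates are made up, not GRIT's actual schema):

          ```python
          # Reasoning text interleaved with bounding boxes that point at the image
          # regions being referred to (format and numbers invented for illustration).
          grounded_chain = (
              "The sign on the left <box>[12, 48, 140, 96]</box> reads 'EXIT', and the "
              "person closest to it <box>[150, 60, 210, 220]</box> is walking toward it, "
              "so the answer is: they are leaving the building."
          )
          ```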
    - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
        - trained using a novel two-phase paradigm: Autoregressive-then-Diffusion
    - dKV-Cache: The Cache for Diffusion Language Models
        - diffusion language models have long been constrained by slow inference
        - motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process
        - propose a delayed and conditioned caching strategy for key and value states
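        - a toy sketch of the caching idea as I read it (not the authors' implementation; class and method names are mine): in a diffusion LM every position is re-encoded at every denoising step, but once a position has been unmasked its representation stabilizes, so its key/value states are cached with a small delay and reused afterwards

          ```python
          class DelayedKVCache:
              """Cache K/V for a position only once it was decoded at least `delay`
              steps ago (delayed), and only for decoded positions (conditioned)."""

              def __init__(self):
                  self.k, self.v = {}, {}      # position -> cached key / value tensors
                  self.decoded_at = {}         # position -> denoising step at which it was unmasked

              def mark_decoded(self, pos, step):
                  self.decoded_at.setdefault(pos, step)

              def maybe_cache(self, pos, step, k, v, delay=1):
                  if pos in self.decoded_at and step - self.decoded_at[pos] >= delay:
                      self.k.setdefault(pos, k.detach())
                      self.v.setdefault(pos, v.detach())

              def get(self, pos):
                  # Returns (None, None) if this position still has to be recomputed.
                  return self.k.get(pos), self.v.get(pos)
          ```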
    - Understanding Generative AI Capabilities in Everyday Image Editing Tasks
        - analyzing 83k requests and their associated 305k edits from the past 12 years on the `/r/PhotoshopRequest` Reddit community
        - new dataset: PSR
    - Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
        - lots of examples of game creation
- efficiency
    - Scaling Law for Quantization-Aware Training
        - a comprehensive scaling law for 4-bit QAT of LLMs, integrating model size, training dataset size, and quantization granularity
        - previous methods do not account for quantization granularity G
        - weight and activation quantization errors tend to contribute almost equally to the total error
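        - the kind of functional form this implies (an illustrative shape of my own, assuming the usual trends of error shrinking with model size $N$ and growing with training tokens $D$ and coarser group size $G$; $k, \alpha, \beta, \gamma$ are unspecified positive constants, not the paper's fitted values):

          ```latex
          % Illustrative shape only; the fitted exponents and the exact way G enters
          % are the paper's contribution and are not reproduced here.
          \delta_{\mathrm{4bit}}(N, D, G) \;\approx\; k \cdot \frac{D^{\beta}\, G^{\gamma}}{N^{\alpha}}
          ```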
    - Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
        - push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework
        - perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training
        - Zeroth-order optimization (ZO) methods are often used in cases where gradients and higher-order derivatives of the objective cannot be directly computed or are unreliable
        - successfully fine-tune Stable Diffusion 3.5 Large quantized by BitsAndBytes on stylized images using a single Nvidia RTX 4090 24GB GPU
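        - a minimal sketch of the recipe as described above (a generic two-point zeroth-order estimator, not the paper's exact algorithm; `loss_fn` and `scales` are placeholders): perturb only the continuous quantization scales, estimate the directional derivative from two forward passes, clip it, and step

          ```python
          import torch

          def zo_step_on_scales(loss_fn, scales, lr=1e-4, eps=1e-3, clip=10.0):
              """One zeroth-order update on the per-group quantization scales.
              The quantized weights themselves are never touched and no backward
              pass is needed, so memory stays near inference level."""
              u = torch.randn_like(scales)                            # random perturbation direction
              loss_plus = float(loss_fn(scales + eps * u))            # forward pass at +eps
              loss_minus = float(loss_fn(scales - eps * u))           # forward pass at -eps
              dd = (loss_plus - loss_minus) / (2 * eps)               # directional derivative estimate
              dd = max(-clip, min(clip, dd))                          # clipping to stabilize training
              with torch.no_grad():
                  scales -= lr * dd * u                               # step along the estimated gradient
              return (loss_plus + loss_minus) / 2
          ```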
    - A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
        - trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher
        - remarkable distillation efficiency, achieving superior performance with more than 1000× fewer training tokens
        - LRC w/o FFN produces a substantial performance degradation that persists throughout training, further confirming the critical importance of FFN activations
        - LRC’s projection-based alignment is not only sufficient for effective knowledge transfer but also more efficient and stable
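        - my rough reading of the per-layer objective, as a sketch (the projection shapes and the plain MSE terms are my assumptions, not the released code): one low-rank projection compresses teacher weights toward the student ("soft pruning"), another maps student activations, FFN signals included, into the teacher space for alignment

          ```python
          import torch.nn.functional as F

          def lrc_layer_loss(W_t, W_s, h_t, h_s, P, Q):
              """Toy Low-Rank Clone loss for one layer.
              Assumed shapes: W_t (d_t, d_t) teacher weight, W_s (d_s, d_s) student weight,
              h_t (batch, d_t) teacher activations (incl. FFN), h_s (batch, d_s) student
              activations, P (d_t, d_s) and Q (d_s, d_t) trainable low-rank projections."""
              weight_clone = F.mse_loss(P.T @ W_t @ P, W_s)   # compress teacher weights onto the student
              activation_clone = F.mse_loss(h_s @ Q, h_t)     # align student activations with the teacher's
              return weight_clone + activation_clone
          ```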
- agents, reasoning, RL
    - NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
    - Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
    - Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
    - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
    - Training-Free Reasoning and Reflection in MLLMs
    - Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
    - RLVR-World: Training World Models with Reinforcement Learning
    - SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
    - Risk-Averse Reinforcement Learning with Itakura-Saito Loss
- safety
    - Phare: A Safety Probe for Large Language Models
    - Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
    - Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
- application
    - Steering Large Language Models for Machine Translation Personalization
    - This Time is Different: An Observability Perspective on Time Series Foundation Models
    - Prior Prompt Engineering for Reinforcement Fine-Tuning
    - Using Large Language Models for Commit Message Generation: A Preliminary Study
    - The Distracting Effect: Understanding Irrelevant Passages in RAG
- more
    - Distilling LLM Agent into Small Models with Retrieval and Code Tools
    - CLEVER: A Curated Benchmark for Formally Verified Code Generation
    - DiSA: Diffusion Step Annealing in Autoregressive Image Generation
    - Capability-Based Scaling Laws for LLM Red-Teaming
    - FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information
    - GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
    - watch The 3D Gaussian Splatting Adventure: Past, Present, Future
    - DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
    - GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
    - Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation
    - Large Language Models Often Know When They Are Being Evaluated
    - Tiny-diffusion: A minimal implementation of probabilistic diffusion models
    - AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
    - Time Series Forecasting with Graph Transformers
    - The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games
    - Compiling LLMs into a MegaKernel: A path to low-latency inference
    - Magenta RealTime: An Open-Weights Live Music Model
    - Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models
    - Let Your Video Listen to Your Music!
    - Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
    - Bridging Cinematic Principles and Generative AI for Automated Film Generation
    - Show HN: PILF, The ultimate solution to catastrophic oblivion on AI models
    - Qwen VLo: From “Understanding” the World to “Depicting” It
    - WorldVLA: Towards Autoregressive Action World Model (on HN)
    - Small language models are the future of agentic AI (on HN)
    - Overclocking LLM Reasoning: Monitoring and Controlling LLM Thinking Path Lengths (on HN)
    - Reinforcement Learning from Human Feedback (RLHF) in Notebooks
    - LLMs should not replace therapists (on HN)
    - Mercury: Ultra-fast language models based on diffusion (on HN)
    - Biomni: A General-Purpose Biomedical AI Agent (on HN)
    - Distributed AI Agents for Cognitive Underwater Robot Autonomy
    - GEPA: Reflective prompt evolution can outperform reinforcement learning (on HN)
    - Hijacking multi-agent systems in your PajaMAS
    - Core Safety Values for Provably Corrigible Agents
    - Flow Matching Policy Gradients
    - Fine-tuned small LLMs can beat large ones with programmatic data curation (on HN)
        - the chosen task is considered not particularly challenging
    - Persona vectors: Monitoring and controlling character traits in language models (on HN)
    - Qwen-Image: Crafting with native text rendering (on HN)
    - Exploring Autonomous Agents: A Closer Look at Why They Fail When...
    - Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models
    - Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (on HN)
    - Context Rot: How Increasing Input Tokens Impacts LLM Performance (on HN) (on lobste.rs)
    - All AI models might be the same (on HN)
    - LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra
    - Subliminal learning: Models transmit behaviors via hidden signals in data (on HN)
        - Simon Willison | Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
    - Flow Matching Meets Biology and Life Science: A Survey
    - Seed-Prover/SeedProver at main · ByteDance-Seed/Seed-Prover
    - Transformers Without Normalization (on HN)