My recent learning has become too scattered, and I have long wanted to write notes on LMs (language models), so I hope concentrating on this topic will keep me focused.
I've settled on the prefix lm
instead of llm
as I find phrases like "small LLMs" contradictory, and I hope LM will also cover multimodal variants, such as VLMs (visual language models).
My interest doesn't stop there: I haven't properly studied diffusion models yet, and I wish to explore their integration with LMs, or at least how diffusion models relate to LMs.
As my interests in the LM field are quite diverse, I will organize the notes into several series rather than one. I've reserved lm-0002 through lm-0005 for root notes covering these topics. The areas that come to mind include:
- the math behind LMs, starting from ML
- various optimization methods for LMs, including GPU programming
- post-training techniques
- application of GA in LMs
- alternative LM architectures, including RWKV and approaches related to diffusion models
Individual notes will be written as I read related papers and books, and may be referenced across these root notes.
I have written scattered notes about LMs before; I list them here for easier reference.
This is also an experiment to see if modern VLMs could better convert screenshots to Forester math formula markup.
In Transformers: from self-attention to performance optimizations, I focused on visualizing transformer architectures using a subscript-free tensor notation ("Named Tensor Notation" [chiang2021named]). However, mathematical concepts can often be expressed succinctly once you develop the ability to visualize them mentally. In this series of notes, I'll focus more on mathematical foundations than on visualizations or introductory explanations, having progressed beyond that stage after reading hundreds of LM papers.
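As a quick reminder of what that notation looks like (paraphrased from memory; see [chiang2021named] for the exact conventions, and note that the axis names here are only the paper's running example), attention can be written without subscripted indices as

$$\operatorname{Attention}(Q, K, V) = \underset{\mathsf{seq}}{\operatorname{softmax}}\!\left(\frac{Q \underset{\mathsf{key}}{\odot} K}{\sqrt{|\mathsf{key}|}}\right) \underset{\mathsf{seq}}{\odot} V,$$

where $\underset{\mathsf{key}}{\odot}$ contracts the $\mathsf{key}$ axis and the $\mathsf{seq}$ annotation marks the axis that softmax normalizes over.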