Math notes (early drafts) [uts-000V]
Notes on language models [lm-0001]
This is a placeholder root note; for now, see opening thoughts.
drafts for Notes on language models
definition. L-GATr architecture [spinner2024lorentz, p. 5] [lm-0007]
\[
\begin {align*}
\bar {x}&=\operatorname {LayerNorm}(x),\\
\operatorname {AttentionBlock}(x)&=\operatorname {Linear} \circ \operatorname {Attention}(\operatorname {Linear}(\bar {x}), \operatorname {Linear}(\bar {x}), \operatorname {Linear}(\bar {x})) + x,\\
\operatorname {MLPBlock}(x)&=\operatorname {Linear} \circ \operatorname {GatedGELU} \circ \operatorname {Linear} \circ \operatorname {GP}(\operatorname {Linear}(\bar {x}), \operatorname {Linear}(\bar {x})) + x,\\
\operatorname {Block}(x)&=\operatorname {MLPBlock} \circ \operatorname {AttentionBlock}(x),\\
\operatorname {L-GATr}(x)&=\operatorname {Linear} \circ \operatorname {Block} \circ \operatorname {Block} \circ \cdots \circ \operatorname {Block} \circ \operatorname {Linear}(x).
\end {align*}
\]
opening thoughts [lm-0006]
My recent learning has become too scattered, and I have long wished to write notes on LMs (language models), so I hope that concentrating on this topic will keep me from drifting.
I've settled on the prefix lm instead of llm, as I think phrases like "small LLMs" are redundant, and I hope LM will include multimodal variants, such as VLMs (visual language models).
The interest doesn't stop there - I haven't properly studied diffusion models yet, and I wish to explore their integration with LMs, or at least, the perspective of diffusion models in relation to LMs.
As my interests in the LM field are quite diverse, I will organize the notes into several series rather than one. I've reserved lm-0002 through lm-0005 for root notes covering these topics. The areas that come to mind include:
- math behind LMs since ML
- various optimization methods of LMs, including GPU programming
- post-training techniques
- application of GA in LMs
- alternative architectures of LMs, including RWKV and architectures related to diffusion models
Individual notes will be written as I read related papers and books, and may be referenced across these root notes.
I previously wrote scattered notes about LMs; I list them here for easier reference.
- Transformers: from self-attention to performance optimizations
- LLM Daily Picks
- ML
- Learning diary › Year 2025 › May, 2025 › 2025-05-06
- Learning diary › Year 2025 › May, 2025 › 2025-05-02
- Learning diary › Year 2025 › April, 2025 › 2025-04-21
This is also an experiment to see if modern VLMs could better convert screenshots to Forester math formula markup.
In Transformers: from self-attention to performance optimizations, I was focused on visualizing transformer architectures using subscript-free tensor notations ("Named Tensor Notation" [chiang2021named]). However, mathematical concepts can often be expressed succinctly once you develop the ability to visualize them mentally. In this series of notes, I'll focus more on the mathematical foundations rather than visualizations or introductory explanations, having progressed beyond that stage after reading hundreds of LM papers.
notes on group algebras [spin-0010]
rationale [spin-0011]
In a sense, group algebras are the source of all you need to know about representation theory.
The primary reference is [james2001representations] for understanding FG-modules, group algebras, presentations of groups, Clifford theory (the standard method of constructing representations and characters of semi-direct products, see [woit2017quantum] and "3.6 Clifford theory" in [lux2010representations]), Schur indices, etc. We also need to check [lux2010representations] for its introduction to GAP, and we should pay close attention to the progress of GAP-LEAN. [sims1994computation] might also be interesting in a similar manner to [lux2010representations], but with emphasis on presentations of groups.
See also group algebra on nlab, particularly that "A group algebra is in particular a Hopf algebra and a \(G\)-graded algebra."
The related Zulip thread is here, and I have preliminary explorations and experiments in Lean here.
This interest originates from reading Robert A. Wilson's work [wilson2024discrete]. The ultimate goal is to understand the group algebra of the binary tetrahedral group (\(Q_8 \rtimes Z_3\)), then the three-dimensional complex reflection group (\(G_{27} \rtimes Q_8 \rtimes Z_3\)), a.k.a. the triple cover of the Hessian group, which can be interpreted as a finite analogue of the complete gauge group \(U(1) \times SU(2) \times SU(3)\).
A further necessity arises from reading [hamilton2023supergeometric] and [hamilton2023unification].
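As a concrete handle on the binary tetrahedral group mentioned above, here is a minimal sketch that realizes it inside the unit quaternions and checks the \(Q_8 \rtimes Z_3\) shape numerically. The choice of generators (\(i\) and \(\omega = (-1+i+j+k)/2\)) and the helper names `qmul`, `conj` and `generate` are assumptions made for this illustration, not something taken from the references.

```python
from fractions import Fraction

def qmul(p, q):
    """Hamilton product of quaternions stored as (a, b, c, d) = a + b i + c j + d k."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def conj(q):
    """Quaternion conjugate; for unit quaternions this is the inverse."""
    a, b, c, d = q
    return (a, -b, -c, -d)

def generate(gens):
    """Close a set of generators under multiplication (enough for a finite group)."""
    elems = set(gens)
    frontier = list(gens)
    while frontier:
        g = frontier.pop()
        for h in list(elems):
            for prod in (qmul(g, h), qmul(h, g)):
                if prod not in elems:
                    elems.add(prod)
                    frontier.append(prod)
    return elems

half = Fraction(1, 2)
i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
omega = (-half, half, half, half)  # assumed generator of the Z_3 factor

G = generate([i, omega])   # candidate binary tetrahedral group
Q8 = generate([i, j])      # quaternion group

print(len(G), len(Q8))  # 24 8
# Q8 is normal in G: conjugation by any g in G maps Q8 to itself.
print(all(qmul(qmul(g, q), conj(g)) in Q8 for g in G for q in Q8))  # True
```

Exact `Fraction` arithmetic keeps the set membership tests reliable; floating-point quaternions would need a tolerance.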
Notes on Algebraic Geometry [ag-0001]
- August 13, 2024
- Utensil Song
What is Algebraic Geometry? [ag-0005]
Algebraic geometry deals with solution sets of systems of polynomial equations [borisov2024adventures, sec. 1] from a geometric point of view, with many algebraic concepts imitating the notions of analysis and topology [kriz2021introduction, p. vii], but has the advantage of being able to deal with singularities [mehrle2017algebraic, lect. 1, sec. 1] and other pathological objects.
Algebraic geometry has developed in waves, each with its own language and point of view, see [hartshorne1977graduate, p. xiv] for some discussion. The older language is closer to the geometric intuition, while the newer language develops powerful techniques to solve problems in great generality, but the study of the latter is considered tedious, or even depressing, by many, when not accompanied by tangible applications [grothendieck1964elements, p. 12].
remark. the grand plan [ag-0002]
We will use [kriz2021introduction] as a holistic guide to the organization of materials, which has done a great job of having the minimal prerequisites, being self-contained, and covering most of the topics that concern us, including the geometric motivation.
For a similar purpose, we use the formalization papers [bordg2022simple] and [buzzard2022schemes] to guide the path, at least the part towards schemes, and their counterparts in the Mathlib of Lean 4.
For prerequisites in basic algebra and commutative algebra, we will use [knapp2006basic] and [knapp2007advanced], and notes by Andreas Gathmann [gathmann2023plane][gathmann2013commutative] and David Mehrle [mehrle2015commutative]. For the geometric motivation and intuition, we will use [cox1997ideals] and [borisov2024adventures].
For upstream treatment of algebraic geometry, we will use [grothendieck1964elements] (particularly the English translation available at ryankeleti/ega) and [fantechi2006fundamental]. For modern notes, we will use [vakil2024rising], [gathmann2022algebraic] and [mehrle2017algebraic], with an eye on the classic textbook [hartshorne1977graduate].
We also need to tap into the language of Stacks in a modern setting, as treated in [khan2023lectures], with preliminaries on \(\infty \)-categories and derived categories.
See the plan for notes on algebraic geometry for an early discussion of the plan.
convention. rings and fields [gathmann2013commutative, 0.1] [ag-0008]
A ring, usually denoted \(R\), is always assumed to be a commutative ring (i.e. \(a+b=b+a\) and \(a b=b a\) for all \( a, b \in R\)) with 1 (i.e. the multiplicative identity element, also called the multiplicative unit).
\(1 \neq 0\) is not required, where \(0\) is the additive neutral element. If \(1=0\), then \(R\) must be the zero ring (also called the trivial ring), which consists of one element and is denoted \(\{0\}\).
Subrings must have the same unit, and ring homomorphisms are always required to map \(1\) to \(1\).
A field, usually denoted \(K\), is a commutative ring with \(1\), where every nonzero element has a multiplicative inverse (thus division can be defined).
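The conventions above can be tried out on the rings \(\mathbb{Z}/n\mathbb{Z}\). The sketch below is only an illustration; the helper names `units` and `every_nonzero_is_invertible` are hypothetical. It shows that \(n=1\) gives the zero ring, where \(1=0\), and that the invertibility condition on nonzero elements holds for \(n=5\) but fails for \(n=6\).

```python
def units(n):
    """Residues a in Z/nZ for which some b satisfies a*b = 1 in Z/nZ."""
    one = 1 % n  # in Z/1Z this is 0: the zero ring has 1 = 0
    return [a for a in range(n) if any((a * b) % n == one for b in range(n))]

def every_nonzero_is_invertible(n):
    """Check the invertibility condition from the field convention on Z/nZ."""
    return set(range(1, n)).issubset(units(n))

print(1 % 1 == 0)                      # True: in Z/1Z, 1 and 0 coincide
print(units(6))                        # [1, 5]
print(every_nonzero_is_invertible(5))  # True
print(every_nonzero_is_invertible(6))  # False: 2, 3, 4 have no inverse
```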
definition. formal variable, formal expression [ag-0009]
A formal variable is an arbitrary symbol that is used to represent some mathematical object, and assumes nothing about the value or nature of the object.
A formal expression is a mathematical expression in formal variables that assumes nothing about those variables except that they support the operations used in the expression.
A formal expression can be evaluated by replacing the formal variables with actual mathematical objects that have the operations defined in the expression.
When there is no ambiguity, we may omit the word "formal" and simply say variables or an expression.
definition. monomial [cox1997ideals, 1.1.1] [ag-0003]
A monomial in \(n\) formal variables \(x_1, \ldots , x_n\), denoted \(x^\alpha \), is a formal expression of the form \[ x_1^{\alpha _1} x_2^{\alpha _2} \cdots x_n^{\alpha _n} \] where \(n \in \mathbb N\), \(\alpha =\left (\alpha _1, \ldots , \alpha _n\right )\) is an \(n\)-tuple of nonnegative integers.
The total degree of the monomial is denoted \(|\alpha |=\alpha _1+\cdots +\alpha _n\).
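One way to make this definition concrete is to store a monomial \(x^\alpha\) as its exponent tuple \(\alpha\). The following is a minimal sketch under that assumption; the function names are hypothetical.

```python
def total_degree(alpha):
    """|alpha| = alpha_1 + ... + alpha_n for an exponent tuple alpha."""
    if any(a < 0 for a in alpha):
        raise ValueError("exponents must be nonnegative integers")
    return sum(alpha)

def evaluate_monomial(alpha, point):
    """Evaluate x^alpha by substituting point = (c_1, ..., c_n) for the variables."""
    if len(alpha) != len(point):
        raise ValueError("need one value per variable")
    result = 1
    for a, c in zip(alpha, point):
        result *= c ** a
    return result

alpha = (2, 0, 3)  # the monomial x_1^2 x_3^3 in three variables
print(total_degree(alpha))                  # 5
print(evaluate_monomial(alpha, (2, 5, 1)))  # 4
```

The tuple form mirrors the multi-index notation: evaluation is exactly the replacement of formal variables by actual objects described in the definition of a formal expression.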
definition. polynomial [cox1997ideals, 1.1.2, 1.1.3] [ag-0004]
A polynomial \(f\) over a ring \(R\) in \(n\) variables is a finite linear combination (with coefficients \(a_\alpha \) in \(R\)) of monomials, i.e. a formal expression of the form \[ f=\sum _\alpha a_\alpha x^\alpha , \quad a_\alpha \in R, \] where the sum is over finitely many \(n\)-tuples \(\alpha \).
The set of all polynomials in \(x_1, \ldots , x_n\) with coefficients in \(R\) is denoted \(R\left [x_1, \ldots , x_n\right ]\).
\(a_\alpha x^\alpha \) is called a term of \(f\) if \(a_\alpha \neq 0\).
The total degree of \(f \neq 0\), denoted \(\operatorname {deg}(f)\), is the maximum \(|\alpha |\) such that the coefficient \(a_\alpha \) is nonzero. The total degree of the zero polynomial is undefined.
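As an illustration of the definition, a polynomial with integer coefficients can be stored as a dictionary from exponent tuples \(\alpha\) to coefficients \(a_\alpha\). This representation and the helper names are assumptions of the sketch, which uses \(10x^2-x^3-y^2\) as its example.

```python
def terms(f):
    """The terms of f: pairs (alpha, a_alpha) with a_alpha != 0."""
    return {alpha: coeff for alpha, coeff in f.items() if coeff != 0}

def total_degree(f):
    """Maximum |alpha| over the terms of f; undefined for the zero polynomial."""
    support = terms(f)
    if not support:
        raise ValueError("the total degree of the zero polynomial is undefined")
    return max(sum(alpha) for alpha in support)

def evaluate(f, point):
    """Substitute point = (c_1, ..., c_n) for the variables of f."""
    total = 0
    for alpha, coeff in f.items():
        term = coeff
        for a, c in zip(alpha, point):
            term *= c ** a
        total += term
    return total

# f = 10 x^2 - x^3 - y^2 in Z[x, y]
f = {(2, 0): 10, (3, 0): -1, (0, 2): -1}
print(total_degree(f))      # 3
print(evaluate(f, (1, 3)))  # 0, so (1, 3) is a zero of f
print(evaluate(f, (2, 1)))  # 31
```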
remark. polynomial ring [cox1997ideals, 1.1.3] [ag-0007]
Under addition and multiplication, \(R\left [x_1, \ldots , x_n\right ]\) satisfies all axioms of a commutative ring, and for this reason we will refer to \(R\left [x_1, \ldots , x_n\right ]\) as a polynomial ring.
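The ring operations can be sketched on the same kind of dictionary representation (exponent tuple to integer coefficient). This is an illustrative assumption rather than a canonical encoding. It spot-checks commutativity and distributivity on small examples, which is of course not a proof of the axioms.

```python
def add(f, g):
    """Sum of two polynomials; coefficients of equal monomials are added."""
    h = dict(f)
    for alpha, coeff in g.items():
        h[alpha] = h.get(alpha, 0) + coeff
    return {alpha: coeff for alpha, coeff in h.items() if coeff != 0}

def mul(f, g):
    """Product of two polynomials; x^alpha * x^beta = x^(alpha + beta)."""
    h = {}
    for alpha, c in f.items():
        for beta, d in g.items():
            gamma = tuple(a + b for a, b in zip(alpha, beta))
            h[gamma] = h.get(gamma, 0) + c * d
    return {alpha: coeff for alpha, coeff in h.items() if coeff != 0}

f = {(1, 0): 1, (0, 1): 1}    # x + y
g = {(1, 0): 1, (0, 1): -1}   # x - y
h = {(0, 0): 3, (1, 1): 2}    # 3 + 2xy

print(mul(f, g))  # {(2, 0): 1, (0, 2): -1}, i.e. x^2 - y^2
print(mul(f, g) == mul(g, f))                              # True
print(mul(f, add(g, h)) == add(mul(f, g), mul(f, h)))      # True
```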
definition. affine space [gathmann2013commutative, 0.3] [ag-000A]
The (\(n\)-dimensional) affine space over a field \(K\), denoted \(\mathbb {A}_K^n\), is \[ \left \{\left (c_1, \ldots , c_n\right ): c_i \in K \text { for } i=1, \ldots , n\right \} \] which is just \(K^n\) as a set, without its additional structures as a \(K\)-vector space and a ring.
We'll often use the term affine \(n\)-space to indicate the dimension. In particular, an affine 1-space is called an affine line, and an affine 2-space is called an affine plane.
definition. affine variety [gathmann2013commutative, 0.3] [ag-000D]
Let \(S \subset K\left [x_1, \ldots , x_n\right ]\) be a set of polynomials. The zero locus (or zero set) of \(S\) is \[ V(S):=\left \{x \in \mathbb {A}_K^n: f(x)=0 \text { for all } f \in S\right \} \subset \mathbb {A}_K^n \]
An affine algebraic variety over \(K\) is a subset of \(\mathbb {A}_K^n\) of this form. It's usually simply called an affine variety over \(K\), or an affine \(K\)-variety.
If \(S=\left \{f_1, \ldots , f_k\right \}\) is a finite set, \(V(S)\) can be written as \(V\left (f_1, \ldots , f_k\right )\).
By definition, it is the set of all solutions of the system of polynomial equations \(f_1\left (x_1, \ldots , x_n\right )=\cdots =f_k\left (x_1, \ldots , x_n\right )=0\) [cox1997ideals, 1.2.1], denoted \(\operatorname {Sol}(S;K)\) [dolgachev2013introduction, p. 1].
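Over a finite field the zero locus can be listed by brute force, which gives a small sanity check of the definition. The sketch below assumes \(K=\mathbb{F}_p\) with \(p\) prime and polynomials stored as dictionaries from exponent tuples to integer coefficients. The names `evaluate_mod` and `zero_locus` are hypothetical.

```python
from itertools import product

def evaluate_mod(f, point, p):
    """Evaluate f at a point of F_p^n, reducing modulo the prime p."""
    total = 0
    for alpha, coeff in f.items():
        term = coeff
        for a, c in zip(alpha, point):
            term *= pow(c, a, p)
        total += term
    return total % p

def zero_locus(S, n, p):
    """V(S) inside affine n-space over F_p, found by testing every point."""
    return [
        pt for pt in product(range(p), repeat=n)
        if all(evaluate_mod(f, pt, p) == 0 for f in S)
    ]

circle = {(2, 0): 1, (0, 2): 1, (0, 0): -1}  # x^2 + y^2 - 1
line = {(1, 0): 1, (0, 1): -1, (0, 0): -1}   # x - y - 1

print(zero_locus([circle], 2, 5))        # [(0, 1), (0, 4), (1, 0), (4, 0)]
print(zero_locus([circle, line], 2, 5))  # [(0, 4), (1, 0)]
print(len(zero_locus([], 2, 5)))         # 25: V of the empty set is the whole plane
```

Adding a polynomial to \(S\) can only shrink \(V(S)\), as the second call illustrates.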
remark. affine vs. projective [michalek2021invitation, ch. 2] [ag-000E]
The prefix "affine" of affine variety is used to distinguish it from a projective variety. Affine varieties arise from arbitrary polynomials, while projective varieties arise from systems of homogeneous polynomials, i.e. linear combinations of monomials of fixed degree.
Since affine varieties are the general case, they are sometimes simply called varieties.
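The homogeneity condition in this remark is easy to test on the dictionary representation of polynomials (exponent tuple to coefficient, an assumption of this sketch). The example also illustrates why homogeneous polynomials suit projective space: if \(f\) is homogeneous of degree \(d\), then \(f(\lambda c)=\lambda^d f(c)\), so its zero set is closed under scaling.

```python
def is_homogeneous(f):
    """True if all terms of f share one total degree (the zero polynomial counts)."""
    degrees = {sum(alpha) for alpha, coeff in f.items() if coeff != 0}
    return len(degrees) in (0, 1)

def evaluate(f, point):
    """Substitute point = (c_1, ..., c_n) for the variables of f."""
    total = 0
    for alpha, coeff in f.items():
        term = coeff
        for a, c in zip(alpha, point):
            term *= c ** a
        total += term
    return total

cone = {(2, 0, 0): 1, (0, 2, 0): 1, (0, 0, 2): -1}     # x^2 + y^2 - z^2
surface = {(2, 0, 0): 1, (0, 2, 2): -1, (0, 0, 3): 1}  # x^2 - y^2 z^2 + z^3

print(is_homogeneous(cone))     # True
print(is_homogeneous(surface))  # False: its terms have degrees 2, 4 and 3
# Scaling a point by 2 scales the value of the degree-2 form by 2^2.
print(evaluate(cone, (1, 2, 3)), evaluate(cone, (2, 4, 6)))  # -4 -16
```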
Figure out its relation to affine space in Geometry, which preserves parallelism and ratio of lengths for parallel line segments, but not distances and measures of angles.
example. varieties [ag-000F]
\(\mathbf {V}\left (10 x^2-x^3-y^2\right )\) from [cox1997ideals, p. 24]:
\(\mathbf {V}\left (x^2-y^2 z^2+z^3\right )\) from [cox1997ideals, p. 7, p. 16]:
figure. [uts-000J]
#define AS_LIB 1

int get_shape() {
    return int(iTime) % 52;
}

#include "/forest/shader/implicit.glsl"

void mainImage( out vec4 fragColor, in vec2 fragCoord ) {
    vec2 uv = 2.*(fragCoord-iResolution.xy/2.)/iResolution.y; // contains [-1,1]^2
    vec3 col = vec3(0.);

    // Camera rays
    vec3 camPos = vec3(4.,0.,0.);
    vec3 camDir = - normalize(camPos);
    vec3 rayPos, rayDir;
    float zoom = 1.3; // 1.8*cos(iTime);
    // if (checkKey(KEY_E)) zoom = 0.5;
    float fov = 0.4*zoom;
    float fov_ortho = 1.5*zoom;

    #if perspective
        // perspective cam
        rayPos = camPos;
        rayDir = normalize(camDir + fov*vec3(0., uv.x, uv.y));
    #else
        // orthographic cam
        rayPos = camPos + fov_ortho*vec3(0., uv.x, uv.y);
        rayDir = camDir;
    #endif

    // for perspective background in orthographic mode
    vec3 cubemapDir = normalize(camDir + fov*vec3(0., uv.x, uv.y));

    // Mouse-controlled rotation
    vec2 mouse = initMouse + vec2(0.015625*sin(iTime*PI), 0.0); // initMouse; // iMouse.xy == vec2(0.,0.) ? initMouse : (iMouse.xy/iResolution.xy - 0.5);
    float yaw = clamp(- mouse.x * 2.*PI * 1., -PI,PI);
    float pitch = clamp( mouse.y * PI * 1.2, -PI*0.5, PI*0.5);

    // pitch and yaw rotations (column-wise matrices)
    mat3 rot = mat3(cos(yaw), sin(yaw), 0., -sin(yaw), cos(yaw), 0., 0., 0., 1.);
    rot = rot * mat3(cos(pitch), 0., -sin(pitch), 0., 1., 0., sin(pitch), 0., cos(pitch));

    // apply
    camPos = rot*camPos;
    camDir = rot*camDir;
    rayPos = rot*rayPos;
    rayDir = rot*rayDir;
    cubemapDir = rot*cubemapDir;
    //cubemapDir = vec3(cubemapDir.x, cubemapDir.z, cubemapDir.y);

    vec3 hitPoint = raycast(rayPos, rayDir);

    if (hitPoint == BINGO) { fragColor = vec4(BINGO,1.0); return; }
    //if (hitPoint == NOHIT) { fragColor = vec4(NOHIT,1.0); return; }
    //if (hitPoint == NOBOUNDHIT) { fragColor = vec4(NOBOUNDHIT,1.0); return; }
    //if (hitPoint == ESCAPEDBOUNDS) { fragColor = vec4(ESCAPEDBOUNDS,1.0); return; }
    //if (hitPoint == MAXDISTREACHED) { fragColor = vec4(MAXDISTREACHED,1.0); return; }
    //if (hitPoint == MAXITERREACHED) { fragColor = vec4(MAXITERREACHED,1.0); return; }

    if (hitPoint == NOBOUNDHIT || hitPoint == NOHIT || hitPoint == ESCAPEDBOUNDS || hitPoint == MAXITERREACHED) {
        //fragColor = vec4(vec3(0.2),1.0); return;
        // make background transparent
        fragColor = vec4(0.0,0.0,0.0,0.0); return;
        col = with_background(cubemapDir);
        #if showBoundingCube
            // darken bounding cube
            if (hitPoint != NOBOUNDHIT) { col *= vec3(0.7); }
        #endif
        fragColor = vec4(col,1.0);
        return;
    }

    vec3 grad = gradf(hitPoint+1.1*EPS*(-rayDir));
    float s = -sign(dot(grad,rayDir));
    col = with_color_mode(grad, s, hitPoint, camPos);
    col = clamp(col, 0., 1.);
    col = with_surface_pattern(col, hitPoint);
    col = with_shading(col, grad, s, rayDir);
    col = clamp(col, 0., 1.);
    fragColor = vec4(col,1.0);
}
appendix [ag-000B]
For draft notes, see drafts for Notes on Algebraic Geometry.
Notes on Hopf Algebras [hopf-0001]
- May 6, 2024
- Utensil Song
I would like to have some notes on Hopf algebras, particularly their relation to group algebras of finite groups and Clifford algebras.
The following papers interest me (each marked with the extra keywords it hits):
- [ablamowicz2016clifford]: finite groups
- [fauser2004grade]: grade, knot
- [fauser2002treatise]: grade, knot, graphical calculi
- [rodriguez1996clifford]: action
- [trindade2019clifford]: physics
- [bulacu2011clifford]: category
definition. Peano space [fauser2002treatise] [hopf-0002]
Let \(V\) be a linear space of finite dimension \(n\). Let lower case \(x_i\) denote elements of \(V\), which we will also call letters. We define a bracket as an alternating multilinear scalar-valued function \[ \begin {aligned} {[., \ldots , .]} & : V \times \ldots \times V \rightarrow \mathbb {k} \quad (n\text {-factors}) \\ {\left [x_1, \ldots , x_n\right ]} & =\operatorname {sign}(p)\left [x_{p(1)}, \ldots , x_{p(n)}\right ] \\ {\left [x_1, \ldots , \alpha x_r+\beta y_r, \ldots , x_n\right ] } & =\alpha \left [x_1, \ldots , x_r, \ldots , x_n\right ]+\beta \left [x_1, \ldots , y_r, \ldots , x_n\right ] \end {aligned} \]
The sign is due to the permutation \(p\) on the arguments of the bracket. The pair \(\mathcal {P}=(V,[., \ldots , .])\) is called a Peano space.
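A familiar instance of such a bracket is the determinant on \(\mathbb{k}^n\), with the \(n\) arguments taken as rows. The sketch below uses this choice as an illustrative assumption (the definition itself does not single out a bracket) and spot-checks the two defining properties on integer vectors. The helper names are hypothetical.

```python
from itertools import permutations

def sign(p):
    """Sign of a permutation given as a tuple, via its inversion count."""
    inversions = sum(
        1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j]
    )
    return -1 if inversions % 2 else 1

def bracket(*vectors):
    """Determinant of the n vectors of length n (Leibniz formula)."""
    n = len(vectors)
    if any(len(v) != n for v in vectors):
        raise ValueError("need n vectors of dimension n")
    total = 0
    for p in permutations(range(n)):
        term = sign(p)
        for row in range(n):
            term *= vectors[row][p[row]]
        total += term
    return total

x1, x2, x3 = (1, 2, 0), (0, 1, 3), (2, 0, 1)
y = (1, 1, 1)

print(bracket(x1, x2, x3))  # 13
print(bracket(x2, x1, x3))  # -13: swapping two letters flips the sign
# Linearity in the first slot: [2*x1 + 3*y, x2, x3] = 2[x1, x2, x3] + 3[y, x2, x3]
combo = tuple(2 * a + 3 * b for a, b in zip(x1, y))
print(bracket(combo, x2, x3) == 2 * bracket(x1, x2, x3) + 3 * bracket(y, x2, x3))  # True
```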
remark. Peano space [fauser2002treatise] [hopf-0003]
Of course, this structure is much weaker than, e.g., a normed space or an inner product space. It does not allow one to introduce the concepts of length, distance, or angle. Therefore it is clear that a geometry based on this structure cannot be metric. However, the bracket can be regarded as a volume form. Volume measurements are used, e.g., in the analysis of chaotic systems and strange attractors.
definition. standard Peano space [fauser2002treatise] [hopf-0004]
A standard Peano space is a Peano space over the linear space \(V\) of dimension \(n\) whose bracket has the additional property that for every nonzero vector \(x \in V\) there exist vectors \(x_2, \ldots , x_n\) such that \[ \left [x, x_2, \ldots , x_n\right ] \neq 0 . \]
In such a space the length of the bracket, i.e. the number of entries, equals the dimension of the space, and conversely. We will be concerned here with standard Peano spaces only.
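For the determinant bracket on \(\mathbb{k}^n\) (an illustrative choice, not part of the definition), the extra property can be verified constructively: a nonzero \(x\) has some coordinate \(x_j \neq 0\), and completing \(x\) with the standard basis vectors \(e_i\), \(i \neq j\), gives a bracket equal to \(\pm x_j\). The zero vector has to be excluded, since multilinearity forces \([0, x_2, \ldots, x_n]=0\). The helper names in this sketch are hypothetical.

```python
from itertools import permutations

def bracket(*vectors):
    """Determinant of n vectors of length n (Leibniz formula)."""
    n = len(vectors)
    total = 0
    for p in permutations(range(n)):
        inversions = sum(
            1 for i in range(n) for j in range(i + 1, n) if p[i] > p[j]
        )
        term = -1 if inversions % 2 else 1
        for row in range(n):
            term *= vectors[row][p[row]]
        total += term
    return total

def completion(x):
    """Vectors x_2, ..., x_n with [x, x_2, ..., x_n] != 0, for nonzero x."""
    n = len(x)
    j = next((i for i, c in enumerate(x) if c != 0), None)
    if j is None:
        raise ValueError("the zero vector cannot be completed: [0, ...] = 0")
    # Standard basis vectors e_i for every index i other than j.
    return [tuple(1 if r == i else 0 for r in range(n)) for i in range(n) if i != j]

x = (0, 3, 0)
print(completion(x))               # [(1, 0, 0), (0, 0, 1)]
print(bracket(x, *completion(x)))  # -3, which is nonzero
```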