↑ Contents · Ch 2 →
Chapters: Ch 1 · Ch 2 · Ch 3 · Ch 4 · Ch 5
Large Language Models — Foundations
Cameron R. Wolfe: Training Specialized LLMs (Substack) Cameron Wolfe’s Substack is one of the most technically rigorous newsletters on LLM training — covering pre-training dynamics, alignment techniques (RLHF, DPO), and practical guidance on fine-tuning specialized models for narrow domains. Each issue goes deeper than typical ML blogs, with derivations and ablation studies from recent papers. https://cameronrwolfe.substack.com/
Attention in Transformers, Visually Explained — 3Blue1Brown Chapter 6 (YouTube) Grant Sanderson’s canonical visual explanation of the attention mechanism in transformers — arguably the clearest pedagogical treatment of how queries, keys, and values interact to produce context-aware token representations. The video builds up multi-head self-attention from first principles using geometric intuitions and animated diagrams. https://www.youtube.com/watch?v=eMlx5fFNoYc
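The query–key–value interaction the video builds up visually can be sketched in a few lines of plain Python — a toy single head with no learned projections, just to make the weighted-sum mechanics concrete:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.
    queries/keys/values: lists of d-dimensional vectors (lists of floats)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # each output token is a weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

With identical keys the weights are uniform and each output is simply the mean of the values — a degenerate case, but it shows exactly how "context-aware representations" are mixtures of value vectors.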
Deep Learning Foundations by Soheil Feizi: Large Language Models Part II (YouTube) A lecture from Feizi’s UMD graduate course covering the theoretical foundations of large language models, including scaling laws, emergent capabilities, and the statistical properties of autoregressive generation. Part II focuses on capabilities and limitations that have been characterized empirically but lack complete theoretical explanations. https://youtu.be/mxERaO8FXHc
A Primer on Inner Workings of Transformer-based Language Models (arXiv 2405.00208) A survey paper synthesizing mechanistic interpretability findings about how transformers represent and process information internally — covering attention head specialization, factual recall circuits, and the role of the residual stream. Aimed at researchers who want to move beyond treating LLMs as black boxes. https://arxiv.org/abs/2405.00208
What Is ChatGPT Doing and Why Does It Work? — Stephen Wolfram (Wolfram Media) Wolfram’s characteristically detailed essay explaining transformer language models through his own computational lens — starting from next-token prediction, building up through embeddings and attention, and ending with a Wolfram-flavored theory of why models trained on human text happen to produce coherent language. Unusually long and thorough for a popular-science treatment. https://www.wolfram-media.com/products/what-is-chatgpt-doing-and-why-does-it-work/
Petar Velickovic: Transformers and GNNs — Softmax Issue Velickovic (DeepMind, GraphNets) discusses the structural similarity between graph neural networks and transformer attention — both aggregate neighborhood information via weighted sums — and raises the softmax saturation problem: when attention distributions become too peaked, gradient flow degrades. This thread connects two major neural architecture families through a shared mathematical lens.
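The saturation problem can be demonstrated numerically: the diagonal of the softmax Jacobian is p_i(1 − p_i), which collapses toward zero as the attention distribution becomes peaked. A minimal sketch (the score values are illustrative, not from the thread):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def grad_capacity(scores):
    """Sum of p_i * (1 - p_i), the diagonal of the softmax Jacobian:
    a rough measure of how much gradient can flow through the softmax."""
    p = softmax(scores)
    return sum(pi * (1 - pi) for pi in p)

mild = grad_capacity([1.0, 0.5, 0.2])    # diffuse attention
peaked = grad_capacity([10.0, 0.5, 0.2])  # near one-hot attention
```

Here `peaked` is orders of magnitude smaller than `mild`: once one score dominates, the softmax output barely responds to changes in its inputs, and gradient flow through that attention head degrades.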
Model Parallelism Chapter — ML Engineering (Stas Bekman tweet) A tweet pointing to Stas Bekman’s comprehensive chapter on model parallelism strategies — tensor parallelism, pipeline parallelism, sequence parallelism, and ZeRO — from his ML Engineering handbook. Essential reading for anyone training or serving large models across multiple GPUs.
Machine Learning Street Talk: Control Theory of LLM Prompting — Aman Bhargava A podcast episode and associated paper by Aman Bhargava framing LLM prompting through the lens of control theory: the user’s prompt is a control input, the model’s internal state is the system to be steered, and prompt engineering becomes a feedback control problem. An intellectually satisfying framework that connects prompting to dynamical systems theory.
RAG — Retrieval-Augmented Generation
DSPy — Declarative Language Model Calls into Self-Improving Pipelines (arXiv 2310.03714) DSPy (Declarative Self-improving Python) by Omar Khattab et al. replaces manual prompt engineering with a declarative programming model: the developer specifies the signature of each LLM call (inputs → outputs), and DSPy automatically optimizes the actual prompts using compiled few-shot demonstrations. A genuine paradigm shift from artisanal prompt crafting to systematic pipeline optimization. https://arxiv.org/abs/2310.03714
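The declarative idea can be caricatured in a few lines of plain Python — this is not DSPy's actual API, and the signature string and field names are invented for illustration; it only shows how a declared signature plus optimizer-selected demonstrations compiles into a concrete prompt:

```python
def compile_prompt(signature, demos, inputs):
    """Toy 'compiler': signature like 'question -> answer',
    demos: few-shot examples (dicts with the signature's fields),
    inputs: the live input to fill in."""
    in_field, out_field = [s.strip() for s in signature.split("->")]
    lines = []
    for d in demos:  # demonstrations an optimizer would have selected
        lines.append(f"{in_field}: {d[in_field]}")
        lines.append(f"{out_field}: {d[out_field]}")
    lines.append(f"{in_field}: {inputs[in_field]}")
    lines.append(f"{out_field}:")
    return "\n".join(lines)

prompt = compile_prompt(
    "question -> answer",
    demos=[{"question": "2+2?", "answer": "4"}],
    inputs={"question": "3+5?"},
)
```

The developer never writes the prompt text itself; swapping in better demonstrations (DSPy's compilation step) changes the prompt without touching the pipeline code.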
Bill Chambers: Gentle Intro to DSPy A blog post providing an accessible entry point to DSPy for practitioners who want to experiment with programmatic prompt optimization without wading through the full academic paper. Chambers grounds the concepts in practical examples, making DSPy’s compilation loop concrete.
Santiago (@svpino) RAG Tutorials (multiple tweets) A thread series by ML educator Santiago Valdarrama covering retrieval-augmented generation from basics through advanced patterns — including chunking strategies, embedding model selection, hybrid search, re-ranking, and evaluation with RAGAS. Santiago’s visual explanations are unusually clear. https://twitter.com/svpino
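The simplest of the chunking strategies covered — fixed-size chunks with overlap — can be sketched as follows (the size and overlap values are illustrative defaults, not Santiago's recommendations):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks; consecutive chunks
    share `overlap` characters so retrieval doesn't lose context at
    chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

More sophisticated strategies (sentence-aware or semantic chunking) keep the same interface but choose boundaries by content rather than character count.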
LlamaIndex RAG Guide Tweet A tweet linking to LlamaIndex’s comprehensive RAG guide, covering the framework’s abstractions for document loading, chunking, indexing, retrieval, and synthesis — and how to compose them into production-grade pipelines with observability and evaluation hooks.
Lance Martin RAG From Scratch Tutorial (freeCodeCamp) A full-length tutorial by LangChain’s Lance Martin walking through building a RAG system from scratch — without using any framework abstractions — so the student understands every component: embedding generation, vector storage, cosine similarity retrieval, and prompt construction. The from-scratch approach builds intuition that framework tutorials often obscure.
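The retrieval and prompt-construction steps such a from-scratch build walks through reduce to something like this sketch, where toy hand-made vectors stand in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, docs, k=2):
    """docs: list of (text, embedding) pairs.
    Returns the k texts most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Stuff the retrieved passages into the generation prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [("cats purr", [1.0, 0.0]),
        ("dogs bark", [0.0, 1.0]),
        ("birds sing", [0.9, 0.1])]
top = retrieve([1.0, 0.0], docs, k=2)
prompt = build_prompt("Which animals purr?", top)
```

In a real system the embeddings come from a model and the vector scan is replaced by an index, but the pipeline shape — embed, rank by similarity, stuff context into the prompt — is exactly this.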
Jerry Liu: Firecrawl for LLM Apps A tweet by LlamaIndex founder Jerry Liu on integrating Firecrawl — a web scraping API optimized for LLM data ingestion — into document pipelines. Firecrawl converts arbitrary web pages to clean Markdown, dramatically simplifying the data collection step in building RAG applications over web content.
Akshay Tweet: RAG With Llama-3 Locally A tutorial tweet demonstrating end-to-end local RAG using Meta’s Llama-3 model via Ollama, a local vector store, and LangChain — enabling fully private document Q&A without any API calls to cloud providers. The setup is representative of the “local-first AI” stack gaining popularity for sensitive enterprise use cases.
Santiago: RAG Hallucinations 35%, @eyelevelai Release A tweet noting that RAG pipelines still hallucinate roughly 35% of the time on standard benchmarks — a sobering counterweight to RAG’s reputation as a hallucination cure — alongside a mention of eyelevelai’s release of tools for grounded retrieval evaluation. The statistic underscores that retrieval quality and citation grounding remain active research problems.
Listening With LLMs (paul.mou.dev) A blog post describing an application of LLMs to audio content — likely podcast or lecture transcription and semantic question-answering — exploring the challenges of long-context retrieval and temporal alignment between audio timestamps and retrieved text passages. https://paul.mou.dev/