
Fine-Tuning & Indic LLMs

Ramsri Goutham: Training Indic LLMs — DoRA vs LoRA A practical comparison of LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation) for fine-tuning base language models on Indian language data. DoRA decomposes weight updates into magnitude and direction components, often improving convergence on low-resource language adaptation tasks where LoRA’s rank constraints may limit expressivity.
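The difference between the two update rules fits in a few lines of numpy. This is an illustrative sketch of the math (shapes, rank, and initialization chosen for the example), not the PEFT library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

W0 = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable low-rank factor
B = np.zeros((d_out, r))                  # zero-init so the update starts at 0

def col_norm(W):
    return np.linalg.norm(W, axis=0, keepdims=True)  # per-column L2 norm

# LoRA: purely additive low-rank update
W_lora = W0 + B @ A

# DoRA: magnitude (m) and direction are adapted separately
m = col_norm(W0)           # trainable magnitude vector, initialized from W0
V = W0 + B @ A             # low-rank directional update
W_dora = m * V / col_norm(V)

# With B zero-initialized, both methods start exactly at the pretrained weight
assert np.allclose(W_lora, W0)
assert np.allclose(W_dora, W0)
```

Training then updates (A, B) in both cases, plus the magnitude vector m in DoRA, which is what gives DoRA its extra expressivity at the same rank.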


Harman Singh: IndicGenBench — Multilingual Benchmark (29 Indic Languages) IndicGenBench provides the first comprehensive generation benchmark spanning 29 Indian languages across machine translation, summarization, and question answering — filling a critical gap given that existing benchmarks heavily over-index on English. The benchmark enables systematic comparison of models such as Llama and mT5 on genuinely diverse South Asian language tasks.


IndicGenBench Paper (arXiv 2404.16816) The academic paper introducing IndicGenBench, providing methodological details on dataset construction, evaluation metrics adapted for morphologically rich languages, and baseline results from a diverse set of models across all 29 languages. The paper documents both benchmark-creation challenges (data collection, quality filtering) and model performance patterns across language families. https://arxiv.org/abs/2404.16816


GitHub: google-research-datasets/indic-gen-bench The official GitHub repository for IndicGenBench, containing dataset splits, evaluation scripts, and submission guidelines. A practical starting point for anyone benchmarking multilingual generation quality on Indian languages. https://github.com/google-research-datasets/indic-gen-bench


Tamil-Llama (GitHub: abhinand5/tamil-llama, Based on Llama 2) A fine-tuned variant of Meta’s Llama 2 adapted for Tamil language generation — one of the earlier attempts to create a Tamil-centric LLM by continual pre-training on Tamil text corpora followed by instruction tuning. The model demonstrated that language-specific adaptation of open base models is a viable path to improved Indic language capabilities. https://github.com/abhinand5/tamil-llama


Tamil-Llama Paper (arXiv 2311.05845) The technical report documenting the Tamil-Llama training process: data collection from Tamil Wikipedia, news corpora, and literary sources; tokenizer extension to better cover Tamil morphology; and evaluation across translation, summarization, and instruction following. The paper serves as a template for similar efforts in other Indian languages. https://arxiv.org/abs/2311.05845
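Tokenizer extension pairs the new vocabulary entries with new rows in the embedding matrix. A toy numpy sketch of the embedding-resize half — the vocabulary, tokens, and mean-initialization strategy here are illustrative, not the paper's exact procedure (which extends a SentencePiece tokenizer):

```python
import numpy as np

# Toy "pretrained" embedding matrix: vocabulary of 5, embedding dim 4
old_vocab = ["<s>", "</s>", "the", "cat", "sat"]
emb = np.random.default_rng(1).standard_normal((len(old_vocab), 4))

# New Tamil tokens added to the vocabulary (hypothetical examples)
new_tokens = ["தமிழ்", "மொழி"]

# A common initialization: new rows start at the mean of existing embeddings,
# then get refined during continual pre-training on Tamil text
mean_row = emb.mean(axis=0, keepdims=True)
emb_extended = np.vstack([emb, np.repeat(mean_row, len(new_tokens), axis=0)])

vocab = old_vocab + new_tokens
assert emb_extended.shape == (len(vocab), 4)
```

The payoff of the extension is fewer tokens per Tamil sentence, which directly lowers training and inference cost for the target language.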


Kannada Llama (Tensoic) Tensoic’s Kannada-specific language model, representing one of the first dedicated LLM efforts for Kannada — a language with rich morphology and significant digital text scarcity relative to its 50+ million speakers. The existence of this model marks an important milestone in making generative AI accessible to Kannada speakers. https://tensoic.com/


Hamel Husain (@HamelHusain): Apple Talk Is the Best Ad for Fine-Tuning Hamel Husain observes that Apple’s on-device intelligence presentation — which demonstrated adapter-based hot-swapping of specialized LLM modules — is effectively the strongest public endorsement of fine-tuning with LoRA/adapter methods. The tweet highlights how Apple’s engineering choices validate the research community’s direction on parameter-efficient adaptation. https://x.com/HamelHusain/status/1800546715277357263


Deep Learning Foundations LLMs Part II — Fine-Tuning, LoRA, Quantization, QLoRA, Prefix Tuning, RAG A lecture covering the full spectrum of LLM adaptation techniques: full fine-tuning (impractical at scale), LoRA (trainable low-rank update matrices added to frozen weights), QLoRA (fine-tuning LoRA adapters on top of a 4-bit quantized base model), prefix tuning (learnable virtual tokens prepended to the input), and RAG (retrieval augmentation as an alternative to parameter updates). A solid conceptual map of the adaptation landscape.
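Of these, prefix tuning is the easiest to picture. A minimal numpy sketch of the input-level variant (often called prompt tuning; full prefix tuning injects learned prefixes into each attention layer's keys and values — all dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prefix, seq_len = 8, 3, 5

# Frozen input embeddings for the real tokens (the base model is untouched)
token_embs = rng.standard_normal((seq_len, d_model))

# A small matrix of learnable "virtual token" embeddings, prepended to every
# input; these n_prefix * d_model values are the ONLY trained parameters
prefix = rng.standard_normal((n_prefix, d_model)) * 0.02

model_input = np.vstack([prefix, token_embs])
assert model_input.shape == (n_prefix + seq_len, d_model)
```

The appeal is the parameter count: a handful of virtual tokens versus billions of weights, at the cost of less expressive adaptation than LoRA or full fine-tuning.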


Novel Architectures

KAN — Kolmogorov-Arnold Networks Paper (arXiv 2404.19756) The KAN paper proposes replacing the MLP recipe — learnable linear weights on edges followed by fixed activation functions at nodes — with learnable univariate functions (parametrized as B-splines) on each edge, with nodes simply summing their inputs. The design is inspired by the Kolmogorov-Arnold representation theorem, which states that any continuous multivariate function can be written as a finite composition of continuous univariate functions and addition. KANs are more interpretable and potentially more parameter-efficient for scientific function-fitting tasks. https://arxiv.org/abs/2404.19756


Carlos Perez: KAN Networks (Kolmogorov-Arnold Networks) A tweet thread by AI educator Carlos Perez explaining KANs intuitively — why replacing fixed ReLU/GELU activations with learnable spline functions on edges (rather than at nodes) gives the network greater expressivity for representing smooth mathematical relationships, and what this means for scientific machine learning applications.


API Demos for Kolmogorov-Arnold Networks (pykan) The official documentation and demo notebooks for pykan — the Python library implementing KANs — with worked examples of fitting mathematical functions, symbolic regression, and comparisons with MLPs on benchmark tasks. The demos make the interpretability advantage of KANs concrete: trained KANs can sometimes be read off as explicit mathematical formulas. https://kindxiaoming.github.io/pykan/


Novel Architecture Makes Neural Networks More Understandable | Quanta Magazine Quanta’s accessible coverage of KAN networks, explaining the Kolmogorov-Arnold representation theorem and why building it into network architecture enables scientific discovery — specifically, the ability to extract closed-form mathematical relationships from trained models rather than treating them as black boxes. https://www.quantamagazine.org/novel-architecture-makes-neural-networks-more-understandable-20240911/


Understanding the Difference Between KAN and Multi-Layer Perceptrons (Medium) A Medium post comparing KAN and MLP architectures side-by-side: where MLP applies fixed nonlinearities at nodes after learnable linear transformations, KAN learns the nonlinear function on each edge — flipping where the expressivity lives. The post makes the Kolmogorov-Arnold theorem concrete by walking through a simple function decomposition example. https://medium.com/@kingstonkishanthan/understanding-the-difference-between-kolmogorov-arnold-networks-kan-and-multi-layer-perceptrons-ad3c2238097c
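That edge-vs-node flip can be sketched directly. The toy KAN-style layer below substitutes Gaussian radial basis functions for the paper's B-splines to keep the code short; the basis choice and all shapes are illustrative:

```python
import numpy as np

def kan_layer(x, coeffs, centers, width=1.0):
    """One KAN-style layer: each edge (i -> j) applies its own learnable
    univariate function, here a small sum of Gaussian radial basis functions
    (the paper uses B-splines; RBFs keep the sketch short).

    x:      (d_in,) input vector
    coeffs: (d_out, d_in, n_basis) learnable per-edge basis coefficients
    centers: (n_basis,) shared grid of basis-function centers
    """
    # basis[0, i, k] = k-th basis function evaluated at x[i]
    basis = np.exp(-((x[None, :, None] - centers) / width) ** 2)  # (1, d_in, n_basis)
    # phi[j, i] = the learned univariate function on edge i -> j, at x[i]
    phi = (coeffs * basis).sum(axis=-1)                           # (d_out, d_in)
    return phi.sum(axis=1)  # nodes just sum incoming edges -- no activation

rng = np.random.default_rng(0)
centers = np.linspace(-2, 2, 5)                  # shared basis grid
coeffs = rng.standard_normal((3, 2, 5)) * 0.1    # 2 inputs -> 3 outputs
y = kan_layer(np.array([0.5, -1.0]), coeffs, centers)
assert y.shape == (3,)
```

Contrast with an MLP layer, where the edges carry plain scalar weights and the nonlinearity is a single fixed function applied at each node.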


ThunderKittens — Flash Attention, 100 Lines, 30% Faster ThunderKittens is a GPU programming framework that allows writing FlashAttention kernels in ~100 lines of readable code while achieving performance 30% above the reference CUDA implementation. It demonstrates that attention kernel efficiency can be improved substantially without the engineering complexity of hand-optimized CUDA — relevant as attention remains the primary computational bottleneck in transformer inference.


Tim Clicks (@timClicks): NVIDIA’s Grip Has Vanished — Matrix Multiplication Can Be Avoided A tweet responding to research suggesting that matrix-matrix multiplication (MatMul) — the core operation driving NVIDIA’s GPU dominance — can be avoided in certain neural network architectures. The claim references work on ternary weight networks or alternative computations that could theoretically run on less specialized hardware, though the practical implications remain actively debated. https://x.com/timClicks/status/1799926065642852725
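The core of the ternary-weight argument is easy to verify: once weights are restricted to {-1, 0, +1}, a matrix-vector product reduces to signed additions, with no multiplications at all. A small numpy check (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.choice([-1, 0, 1], size=(3, 4))   # ternary weight matrix

# Each output coordinate is just a signed sum of selected inputs:
# add the inputs where the weight is +1, subtract where it is -1, skip zeros.
y_add_only = np.array([
    x[W[j] == 1].sum() - x[W[j] == -1].sum()
    for j in range(W.shape[0])
])

# Identical to the conventional matmul, but multiplication-free
assert np.allclose(y_add_only, W @ x)
```

Whether this translates into an end of GPU dominance depends on accuracy at scale and on hardware that actually exploits addition-only arithmetic — which is exactly the part still under debate.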


Carlos Perez: Multi-Token Prediction A tweet covering Meta’s research on training LLMs to predict multiple future tokens simultaneously (rather than just the next token) — a training objective that improves both training efficiency and model capability on code and reasoning tasks by forcing the model to plan ahead rather than greedily optimize the immediate next step.
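The objective is simple to state in code: at each position the model is trained against the next n tokens rather than just one. A toy sketch of the target construction (token values and horizon are illustrative; the Meta paper attaches one output head per future position):

```python
import numpy as np

tokens = np.array([10, 11, 12, 13, 14, 15])
n_future = 3   # predict the next 3 tokens at each position

# For each position t, the targets are tokens t+1 .. t+n_future.
# Standard next-token prediction is the n_future == 1 special case.
T = len(tokens) - n_future
targets = np.stack([tokens[t + 1 : t + 1 + n_future] for t in range(T)])

assert targets.shape == (T, n_future)
assert list(targets[0]) == [11, 12, 13]
```

The extra heads are typically dropped at inference (or reused for speculative decoding), so the planning pressure is paid for only at training time.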


ScrapeGraphAI (Elvis/omarsar0 tweet) A tweet about ScrapeGraphAI — a library that uses LLMs to intelligently scrape web content by reasoning about page structure rather than relying on brittle CSS selectors. The approach generalizes across website layouts and copes with dynamic JavaScript-rendered pages that defeat traditional scrapers.


Vincent Abbott (@vtabbott_): Neural Circuit Diagrams (NCDs) + Optimization A tweet about Neural Circuit Diagrams — a graphical language for representing and reasoning about deep learning computations — and their application to algorithm optimization. NCDs provide a formal, compositional notation that makes the structure of attention mechanisms, normalization layers, and residual connections explicit, potentially enabling automated optimization passes. https://x.com/vtabbott_/status/1830812531293856239


LLM Evaluation & Benchmarks

Jim Fan: 3 Types of LLM Evaluations (tweet) NVIDIA researcher Jim Fan taxonomizes LLM evaluations into three types: static benchmarks (fixed question sets), interactive evaluations (model-in-the-loop), and deployment metrics (real user feedback). The taxonomy is useful for thinking about what standard benchmarks like MMLU or HumanEval actually measure versus what matters in production.


Gergely Orosz: SWE-Agent (Princeton, 4x Better Than LLMs) A tweet by Gergely Orosz about SWE-agent — Princeton’s autonomous software engineering agent that achieves 4× the bug-fix rate of prompting an LLM directly on the SWE-bench benchmark. The agent uses a structured action space (file editing, terminal commands, test running) and an agent-computer interface designed specifically for software development workflows.


Dwarkesh Patel: ARC-AGI — Buck and Ryan Beat SOTA in 6 Days, 85% Human Accuracy A tweet reporting that two researchers using hybrid LLM + program-synthesis approaches beat the previous state of the art on the ARC-AGI challenge in just 6 days — on a benchmark where humans score roughly 85% and which was designed specifically to resist pattern-matching solutions. ARC-AGI (François Chollet’s Abstraction and Reasoning Corpus) had been widely cited as a test of general fluid intelligence that LLMs consistently failed. https://x.com/dwarkesh_sp/status/1802771055016378554


Machine Learning Street Talk: ARC Challenge Winners — Jack Cole, Mohamed Osman, Michael Hodel A podcast episode interviewing the top ARC-AGI competition winners, covering their approaches: test-time compute scaling via model ensembles, program synthesis, and geometric augmentations. The strategies that worked were fundamentally different from pure LLM prompting, suggesting ARC-AGI successfully identified a capability gap in standard neural approaches. https://x.com/MLStreetTalk/status/1803189533980144113


Foyle: Logging Implicit Human Feedback Foyle is an open-source tool for logging implicit human feedback signals from developer workflows — capturing not just explicit ratings but behavioral signals like edit distance between AI suggestions and accepted code, time-to-accept, and correction patterns. This data enables continual learning loops that improve AI coding tools without requiring explicit annotation campaigns. https://foyle.io/
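As a concrete illustration of an edit-distance-style implicit signal — an assumed toy metric for illustration, not Foyle's actual implementation — one could score how heavily a suggestion was edited before being accepted:

```python
from difflib import SequenceMatcher

def acceptance_score(suggested: str, accepted: str) -> float:
    """Toy implicit-feedback signal (hypothetical, not Foyle's metric):
    similarity between what the AI suggested and what the developer kept.
    1.0 = accepted verbatim; lower = heavily edited before accepting."""
    return SequenceMatcher(None, suggested, accepted).ratio()

# Accepted verbatim -> strong positive signal, no annotation required
assert acceptance_score("print('hello')", "print('hello')") == 1.0
# Largely rewritten -> weak signal, a candidate for retraining data review
assert acceptance_score("print('hello')", "log('hi')") < 1.0
```

Signals like this accumulate as a by-product of normal work, which is what makes the continual-learning loop cheap compared with explicit labeling.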


Musings on Building a Generative AI Product (LinkedIn) A LinkedIn post reflecting on the practical lessons learned building a production generative AI product — covering the gap between benchmark performance and user satisfaction, the challenges of evaluation, prompt brittleness, latency vs. quality trade-offs, and organizational dynamics when introducing AI into existing workflows.