Phases 7–1 — Foundation, Initial Cataloguing, Early Pipeline
2026-03 (initial)
Phase 7 — Book 15 Hybrid Extraction & Processing (2026-03-12)
Book 15 — Inglish Kannada Padanerake — first dictionary-format book processed:
- Source: 53-page sample PDF (English-KannadaPadanerakeSample.pdf) — pre-print prelims covering letter A (A→Az)
- Challenge: Old WX Nudi font encoding. English headwords extracted via raw pdfminer ASCII; Kannada equivalents decoded with convert_text() from wx_decode.py.
- Result: 3,454-line book.md with 84,475 Kannada Unicode characters.
- kn-eke.md — Eke romanisation of preface, conventions, pattern tables, and ~100-entry A–Az transliteration sample
- en.md — English analysis with 10 word-formation pattern sections: N+N compounding, verb-derived nouns (-ta/-ike/-uge/-me/-vu), agent nouns (-ga/-uga/-gAra), ಅರಿಮೆ domain names, ಮನೆ institutions, ಗಾಳಿ air-cluster, ಮಾಂಜುಗೆ therapy compounds, productive prefixes, and more
- claude-prompt.md — Rich AI primer: 6-step word-generation decision tree, 11 domain-specific head-noun cluster tables, 100 curated entries by domain, common mistakes table (Sanskrit vs native), philosophical frame
Phase 6 — Bulk Processing of OCR Books (2026-03-11)
Created en.md + kn-eke.md + claude-prompt.md for all 5 remaining unprocessed OCR books using parallel background agents:
| Book | en.md | kn-eke.md | claude-prompt.md | Agents |
|---|---|---|---|---|
| 03 — Padagala Olarachane | ✅ (prev.) | ✅ (prev.) | ✅ (prev.) | — |
| 07 — Kannadada Sollarime | ✅ 41KB | ✅ 9KB | ✅ 18KB | aed65cd |
| 17 — Nudi Nadedu Banda Dari | ✅ 35KB | ✅ 12KB | ✅ 20KB | ae26172 |
| 25 — Vakyagala Olarachane | ✅ 30KB | ✅ 19KB | ✅ 25KB | ae33f5d |
| 27 — Baasheya Bagge | ✅ (prev.) | ✅ (prev.) | ✅ (prev.) | — |
| 28 — Kannadakke Beku | ✅ 25KB | ✅ 14KB | ✅ 20KB | aaccf5f |
| 29 — Kannada Vyakarana Yaake Beku | ✅ 28KB | ✅ 12KB | ✅ 19KB | a0622eb |
Format notes for claude-prompt.md:
- Books 07, 17, 25 (technical linguistics): modelled on Book 03 template
- Books 27, 28, 29 (advocacy/popular): modelled on Book 08 template
Phase 5 — WX Decode Tool
Books 07, 17, 25, 28, 29 had OCR output in WX Kannada encoding (old Nudi/KGP font). Built wx_decode.py — a lookup-table decoder (based on aravindavk/ascii2unicode) that converts WX-encoded bytes to Unicode Kannada. Decoded ~1.9M Kannada Unicode chars across 6 files.
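A lookup-table decoder of this kind can be sketched as below. This is a minimal illustration, not the code in wx_decode.py: the table entries are invented placeholders rather than the real WX/Nudi mapping, and the greedy longest-match strategy is an assumption about how multi-character sequences might be resolved.

```python
# Toy lookup table. These keys and values are invented for illustration and
# are NOT the real WX/Nudi mapping used by wx_decode.py.
TOY_TABLE = {
    "k": "ಕ",
    "kA": "ಕಾ",
    "n": "ನ",
    "d": "ಡ",
}


def decode(text: str, table: dict[str, str] = TOY_TABLE) -> str:
    """Decode text by greedy longest-match lookup; unmapped characters pass through."""
    max_len = max(map(len, table))
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "kA" wins over "k".
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No table entry starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying longer keys first matters whenever one encoded sequence is a prefix of another; a real table would also need rules for reordering vowel signs, which this sketch leaves out.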
Phase 4 — Sarvam Vision OCR Pipeline
Problem: 8 PDFs in Google Drive use old WX Kannada font encoding. pdfminer and pdftotext extract garbled bytes. Clean Kannada Unicode text is not recoverable by text extraction alone — OCR is needed.
Solution: Sarvam Vision API — a 3B multimodal VLM with 84.3% accuracy on olmOCR-Bench, trained on Indian languages including Kannada (kn-IN). Accepts PDF pages, outputs Markdown.
Implementation: ettuge/src/main/python/sarvam-ocr/
| File | Purpose |
|---|---|
| requirements.txt | sarvamai, pypdf |
| ocr.py | Single-PDF OCR: splits into 10-page chunks (API limit), OCRs each chunk, concatenates |
| ocr_dnsbhat.py | Batch processor: all 8 WX-encoded books, prepends Kannada header to output |
| README.md | Setup, usage, API workflow documentation |
Key technical detail: The Sarvam Document Intelligence API has a 10-page limit per job. ocr.py uses pypdf to split each PDF into ≤10-page chunks, submits each as a separate job (create → upload → start → poll → download ZIP → extract .md), then concatenates.
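The chunking step might look roughly like the sketch below. It is not the actual ocr.py: the function names and the output file naming are hypothetical, and the Sarvam job submission is omitted entirely. Only the pypdf calls PdfReader, PdfWriter, add_page and write are used.

```python
from pathlib import Path


def chunk_ranges(num_pages: int, chunk_size: int = 10) -> list[tuple[int, int]]:
    """Return half-open (start, end) page-index ranges of at most chunk_size pages."""
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_pdf(pdf_path: str, out_dir: str, chunk_size: int = 10) -> list[Path]:
    """Write each chunk to its own PDF and return the chunk paths (requires pypdf)."""
    # Imported lazily so chunk_ranges() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, end) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <stem>-chunk001.pdf, <stem>-chunk002.pdf, ...
        path = target / f"{Path(pdf_path).stem}-chunk{number:03d}.pdf"
        with open(path, "wb") as handle:
            writer.write(handle)
        paths.append(path)
    return paths
```

Under this scheme a 239-page PDF yields 24 chunks and a 301-page PDF yields 31, which agrees with the chunk counts reported in the OCR run status.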
OCR run status (all ✅ complete as of 2026-03-10):
| Book | Pages | Chunks | book.md |
|---|---|---|---|
| 03 — Padagala Olarachane | 239 | 24 | ✅ |
| 07-vol1 — Sollarime vol.1 | 327 | 33 | ✅ |
| 07-vol2 — Sollarime vol.2 | 301 | 31 | ✅ |
| 17 — Nudi Nadedu Banda Dari | 405 | 41 | ✅ |
| 25 — Vakyagala Olarachane | 289 | 29 | ✅ |
| 27 — Baasheya Bagge | 208 | 21 | ✅ |
| 28 — Kannadakke Beku | 253 | 26 | ✅ |
| 29 — Kannada Vyakarana Yaake Beku | 260 | 26 | ✅ |
Phase 3 — Structured Files (kn / en / kn-eke / claude-prompt)
Book 08 — Kannadakke Mahaprana Yake Beda — first fully structured book:
- kn.md — 5-chapter structured Kannada text with TOC, <a id> anchors
- kn-eke.md — Eke romanisation (first kn-eke produced; served as format template)
- en.md — English chapter-by-chapter analysis
- claude-prompt.md — AI context primer with key terms table
Book 14 — Nijakku Halegannada Vyakarana Entahadu:
- kn.md — 12-chapter structured Kannada text
- kn-eke.md — Eke romanisation (created by background agent aa87faf)
- en.md — English summary with blog series integration
- claude-prompt.md — AI primer
- blog.md — 7 Shabdamani blog posts
Book 18 — Kannada Nudiya Bagege Chintane:
- en.md — English summaries of all 13 blog posts (29KB) — created by agent a1f957e
- kn-eke.md — Eke romanisation of all 13 posts (11KB)
- claude-prompt.md — AI primer with key terms table (7KB)
Book 20 — An Outline Grammar of Havyaka:
- en.md — English chapter summaries (phonology, morphology, verb paradigms, case system)
- claude-prompt.md — 10 key grammatical features, 25 key terms
Book 02 — Kannadalle Hosapadagalannu Kattuva Bage — most complex structured book:
- kn.md — 547 lines, 17 chapters, 37 sections, all with <a id> anchors
- kn-eke.md — 830 lines, 43KB — Eke romanisation of all 37 sections, complete with dual cross-links to kn.md and en.md
- en.md — English summary with cross-references
- claude-prompt.md — word-formation methodology primer
- blog.md — 15 blog posts (6,469 lines)
Phase 2 — Text Collection
YouTube transcripts extracted for books 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 12, 13:
- Best quality: 02 (519 lines), 03 (605 lines), 05 (539 lines)
- Corrupted: 06, 09, 10, 11, 13 (garbled or truncated)
Internet Archive (archive.org) extractions:
- Book 08: PDF + DjVu → 4,243-line book.md + 1,965-line djvu.md — ✅ full clean text
- Book 14: PDF + DjVu → 11,791-line book.md + 7,033-line djvu.md — ✅ full clean text
- Book 20: DjVu → full text of 1971 Havyaka grammar — ✅
Blog collection via Wayback Machine CDX + fetch:
- Book 02: 15 blog posts (parts 4–18 of ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಸಾಟಿ… series)
- Book 14: 7 blog posts (ಶಬ್ದಮಣಿದರ್ಪಣದಲ್ಲಿ ತಳಮಟ್ಟದ ತಪ್ಪುಗಳು series, May–June 2017)
- Book 18: 13 blog posts (parts 1–3, 10, 14, 18, 20, 23, 27–29, 33, 35 of ನುಡಿಯರಿಮೆಯ ಇಣುಕುನೋಟ series)
Google Drive PDF downloads (10 new PDFs, 2026-03-10):
- Books 03, 07(vol1+2), 17, 25 — confirmed in Drive
- Books 27, 28, 29 — newly discovered books, not previously in catalogue
- Book 15 — sample/prelims PDF (53 pages)
- Problem identified: all PDFs use old WX Kannada font encoding (non-Unicode, Ghostscript/PageMaker era) → pdfminer extracts garbled text like ¨sÁµÉAiÀÄ
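One way to flag this kind of extraction automatically is to measure how much of the extracted text falls in the Kannada Unicode block (U+0C80 to U+0CFF). The sketch below is a hypothetical heuristic, not part of the pipeline described here; the function names and the 0.5 threshold are assumptions, and it only makes sense for documents that are expected to be mostly Kannada.

```python
def kannada_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Kannada Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    kannada = sum(1 for c in chars if "\u0c80" <= c <= "\u0cff")
    return kannada / len(chars)


def looks_font_encoded(text: str, threshold: float = 0.5) -> bool:
    """Guess that the text came from a legacy font encoding rather than Unicode.

    Empty text also returns True, which suits a page with no extractable text.
    The threshold is an arbitrary illustrative value.
    """
    return kannada_ratio(text) < threshold
```

Applied to the garbled sample above the ratio is zero, whereas clean Unicode Kannada scores close to one; mixed English and Kannada pages would need a more careful rule.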
Phase 1 — Discovery & Cataloguing
- Crawled dnshankarabhat.net via CDX Wayback API → found 91 unique archived pages
- Identified 29 works across Kannada popular books, English academic monographs, journal articles, and blog series
- Created BOOKS.md — comprehensive annotated catalogue of all 29 works, organised into thematic sections (Grammar, Dialect Studies, History, Vocabulary, Old Kannada, English Academic, Sound Change, Syntax)
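The discovery crawl can be approximated with a query against the Wayback Machine CDX server. The sketch below is an assumption about how such a query might be built, not the script actually used: the helper names are hypothetical, the choice of parameters (collapse on urlkey, status 200 only) is illustrative, and the network call itself is left to the caller.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain: str) -> str:
    """Build a CDX query listing one capture per unique URL under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains and all paths
        "output": "json",
        "fl": "timestamp,original",  # only the fields needed to rebuild a snapshot URL
        "collapse": "urlkey",        # one row per unique URL
        "filter": "statuscode:200",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(payload: str) -> list[dict[str, str]]:
    """Turn the JSON response (first row is the header) into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record: dict[str, str]) -> str:
    """Return the Wayback Machine URL for one parsed capture record."""
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
```

Fetching build_cdx_url("dnshankarabhat.net") with any HTTP client and passing the body to parse_cdx_json would give the list of archived pages to review by hand.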