Ch 5 — Phases 7–1 — Foundation, Initial Cataloguing, Early Pipeline



2026-03 (initial)

Phase 7 — Book 15 Hybrid Extraction & Processing (2026-03-12)

Book 15 — Inglish Kannada Padanerake — first dictionary-format book processed:

Source: 53-page sample PDF (English-KannadaPadanerakeSample.pdf) — pre-print prelims covering letter A (A→Az)
Challenge: Old WX Nudi font encoding. English headwords extracted via raw pdfminer ASCII; Kannada equivalents decoded with wx_decode.py convert_text(). Result: 3,454-line book.md with 84,475 Kannada Unicode characters.
kn-eke.md — Eke romanisation of preface, conventions, pattern tables, and ~100-entry A–Az transliteration sample
en.md — English analysis with 10 word-formation pattern sections: N+N compounding, verb-derived nouns (-ta/-ike/-uge/-me/-vu), agent nouns (-ga/-uga/-gAra), ಅರಿಮೆ domain names, ಮನೆ institutions, ಗಾಳಿ air-cluster, ಮಾಂಜುಗೆ therapy compounds, productive prefixes, and more
claude-prompt.md — Rich AI primer: 6-step word-generation decision tree, 11 domain-specific head-noun cluster tables, 100 curated entries by domain, common mistakes table (Sanskrit vs native), philosophical frame

Phase 6 — Bulk Processing of OCR Books (2026-03-11)

Created en.md + kn-eke.md + claude-prompt.md for all 5 remaining unprocessed OCR books using parallel background agents:

Book	en.md	kn-eke.md	claude-prompt.md	Agents
03 — Padagala Olarachane	✅ (prev.)	✅ (prev.)	✅ (prev.)	—
07 — Kannadada Sollarime	✅ 41KB	✅ 9KB	✅ 18KB	aed65cd
17 — Nudi Nadedu Banda Dari	✅ 35KB	✅ 12KB	✅ 20KB	ae26172
25 — Vakyagala Olarachane	✅ 30KB	✅ 19KB	✅ 25KB	ae33f5d
27 — Baasheya Bagge	✅ (prev.)	✅ (prev.)	✅ (prev.)	—
28 — Kannadakke Beku	✅ 25KB	✅ 14KB	✅ 20KB	aaccf5f
29 — Kannada Vyakarana Yaake Beku	✅ 28KB	✅ 12KB	✅ 19KB	a0622eb

Format notes for claude-prompt.md:

Books 07, 17, 25 (technical linguistics): modelled on Book 03 template
Books 27, 28, 29 (advocacy/popular): modelled on Book 08 template

Phase 5 — WX Decode Tool

The OCR output for Books 07, 17, 25, 28, and 29 was in WX Kannada encoding (old Nudi/KGP font). Built wx_decode.py, a lookup-table decoder (based on aravindavk/ascii2unicode) that converts WX-encoded bytes to Unicode Kannada. Decoded ~1.9M Kannada Unicode chars across 6 files (Book 07 has two volumes).
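
A lookup-table decoder of this kind can be sketched as a greedy longest-match substitution. This is an illustration only, not the code in wx_decode.py: the function name `decode_with_table` and the three-entry `TOY_TABLE` are hypothetical, and the toy tokens are not the real WX/Nudi mapping. A real decoder may also need to reorder vowel signs, which is omitted here.

```python
def decode_with_table(text: str, table: dict[str, str]) -> str:
    """Replace legacy-font tokens with Unicode using greedy longest match."""
    max_len = max(map(len, table), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so multi-character tokens win
        # over their single-character prefixes.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in table:
                out.append(table[piece])
                i += n
                break
        else:
            # No table entry matched: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Toy table for illustration only; NOT the real WX/Nudi mapping.
TOY_TABLE = {"k": "ಕ್", "ka": "ಕ", "kA": "ಕಾ"}

print(decode_with_table("kAka 15", TOY_TABLE))  # prints "ಕಾಕ 15"
```

Matching the longest token first matters because legacy font encodings often use a base glyph followed by modifier bytes; a character-by-character lookup would decode the prefix and leave the modifier behind.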

Phase 4 — Sarvam Vision OCR Pipeline

Problem: 8 PDFs in Google Drive use old WX Kannada font encoding. pdfminer and pdftotext extract garbled bytes. Clean Kannada Unicode text is not recoverable by text extraction alone — OCR is needed.

Solution: Sarvam Vision API — a 3B multimodal VLM with 84.3% accuracy on olmOCR-Bench, trained on Indian languages including Kannada (kn-IN). Accepts PDF pages, outputs Markdown.

Implementation: ettuge/src/main/python/sarvam-ocr/

File	Purpose
`requirements.txt`	`sarvamai`, `pypdf`
`ocr.py`	Single-PDF OCR: splits into 10-page chunks (API limit), OCRs each chunk, concatenates
`ocr_dnsbhat.py`	Batch processor: all 8 WX-encoded books, prepends Kannada header to output
`README.md`	Setup, usage, API workflow documentation

Key technical detail: The Sarvam Document Intelligence API has a 10-page limit per job. ocr.py uses pypdf to split each PDF into ≤10-page chunks, submits each as a separate job (create → upload → start → poll → download ZIP → extract .md), then concatenates.
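
The splitting step can be sketched as follows. This is a minimal sketch rather than the actual ocr.py: the function names and the `chunk_NNN.pdf` naming are assumptions, and the Sarvam job submission is deliberately left out. Only the pypdf calls (`PdfReader`, `PdfWriter.add_page`, `PdfWriter.write`) are real library API.

```python
from pathlib import Path

CHUNK_PAGES = 10  # per-job page limit described above


def chunk_ranges(n_pages, size=CHUNK_PAGES):
    """Return (start, stop) page-index pairs, stop exclusive, covering n_pages."""
    return [(start, min(start + size, n_pages)) for start in range(0, n_pages, size)]


def split_pdf(src, out_dir, size=CHUNK_PAGES):
    """Write src as a series of PDFs of at most `size` pages; return their paths."""
    # Imported here so chunk_ranges() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for k, (start, stop) in enumerate(chunk_ranges(len(reader.pages), size), start=1):
        writer = PdfWriter()
        for i in range(start, stop):
            writer.add_page(reader.pages[i])
        path = out_dir / f"chunk_{k:03d}.pdf"  # hypothetical naming scheme
        with path.open("wb") as fh:
            writer.write(fh)
        paths.append(path)
    return paths


print(len(chunk_ranges(239)))  # prints 24
```

The chunk count is the page count divided by 10, rounded up, which agrees with the 24 chunks recorded for the 239-page Book 03. Zero-padded chunk names keep the per-chunk Markdown in page order when the outputs are sorted and concatenated.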

OCR run status (all ✅ complete as of 2026-03-10):

Book	Pages	Chunks	book.md
03 — Padagala Olarachane	239	24	✅
07-vol1 — Sollarime vol.1	327	33	✅
07-vol2 — Sollarime vol.2	301	31	✅
17 — Nudi Nadedu Banda Dari	405	41	✅
25 — Vakyagala Olarachane	289	29	✅
27 — Baasheya Bagge	208	21	✅
28 — Kannadakke Beku	253	26	✅
29 — Kannada Vyakarana Yaake Beku	260	26	✅

Phase 3 — Structured Files (kn / en / kn-eke / claude-prompt)

Book 08 — Kannadakke Mahaprana Yake Beda — first fully structured book:

kn.md — 5-chapter structured Kannada text with TOC, <a id> anchors
kn-eke.md — Eke romanisation (first kn-eke produced; served as format template)
en.md — English chapter-by-chapter analysis
claude-prompt.md — AI context primer with key terms table

Book 14 — Nijakku Halegannada Vyakarana Entahadu:

kn.md — 12-chapter structured Kannada text
kn-eke.md — Eke romanisation (created by background agent aa87faf)
en.md — English summary with blog series integration
claude-prompt.md — AI primer
blog.md — 7 Shabdamani blog posts

Book 18 — Kannada Nudiya Bagege Chintane:

en.md — English summaries of all 13 blog posts (29KB) — created by agent a1f957e
kn-eke.md — Eke romanisation of all 13 posts (11KB)
claude-prompt.md — AI primer with key terms table (7KB)

Book 20 — An Outline Grammar of Havyaka:

en.md — English chapter summaries (phonology, morphology, verb paradigms, case system)
claude-prompt.md — 10 key grammatical features, 25 key terms

Book 02 — Kannadalle Hosapadagalannu Kattuva Bage — most complex structured book:

kn.md — 547 lines, 17 chapters, 37 sections, all with <a id> anchors
kn-eke.md — 830 lines, 43KB — Eke romanisation of all 37 sections, complete with dual cross-links to kn.md and en.md
en.md — English summary with cross-references
claude-prompt.md — word-formation methodology primer
blog.md — 15 blog posts (6,469 lines)

Phase 2 — Text Collection

YouTube transcripts extracted for books 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 12, 13:

Best quality: 02 (519 lines), 03 (605 lines), 05 (539 lines)
Corrupted: 06, 09, 10, 11, 13 (garbled or truncated)

Internet Archive (archive.org) extractions:

Book 08: PDF + DjVu → 4,243-line book.md + 1,965-line djvu.md — ✅ full clean text
Book 14: PDF + DjVu → 11,791-line book.md + 7,033-line djvu.md — ✅ full clean text
Book 20: DjVu → full text of 1971 Havyaka grammar — ✅

Blog collection via Wayback Machine CDX + fetch:

Book 02: 15 blog posts (parts 4–18 of ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಸಾಟಿ… series)
Book 14: 7 blog posts (ಶಬ್ದಮಣಿದರ್ಪಣದಲ್ಲಿ ತಳಮಟ್ಟದ ತಪ್ಪುಗಳು series, May–June 2017)
Book 18: 13 blog posts (parts 1–3, 10, 14, 18, 20, 23, 27–29, 33, 35 of ನುಡಿಯರಿಮೆಯ ಇಣುಕುನೋಟ series)

Google Drive PDF downloads (10 new PDFs, 2026-03-10):

Books 03, 07(vol1+2), 17, 25 — confirmed in Drive
Books 27, 28, 29 — newly discovered books, not previously in catalogue
Book 15 — sample/prelims PDF (53 pages)
Problem identified: all PDFs use old WX Kannada font encoding (non-Unicode, Ghostscript/PageMaker era) → pdfminer extracts garbled text like ¨sÁµÉAiÀÄ
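
One way to detect this problem automatically is to measure how much of the extracted text falls in the Unicode Kannada block (U+0C80 to U+0CFF). The sketch below is a heuristic, not part of the project's tooling: the function names and the 0.2 threshold are assumptions chosen for illustration.

```python
def kannada_char_count(text):
    """Count code points in the Unicode Kannada block (U+0C80 to U+0CFF)."""
    return sum(1 for ch in text if "\u0c80" <= ch <= "\u0cff")


def kannada_ratio(text):
    """Fraction of non-whitespace characters that are Kannada code points."""
    visible = [ch for ch in text if not ch.isspace()]
    return kannada_char_count(text) / len(visible) if visible else 0.0


def looks_legacy_encoded(text, threshold=0.2):
    """Heuristic: text expected to be Kannada but with few Kannada code points."""
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return kannada_ratio(text) < threshold


print(looks_legacy_encoded("¨sÁµÉAiÀÄ"))  # prints True
print(looks_legacy_encoded("ಕನ್ನಡ"))       # prints False
```

The same counter gives a quick sanity check after decoding or OCR, since a successful run should yield a high proportion of Kannada code points. Mixed English-Kannada material such as a dictionary would need a lower threshold.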

Phase 1 — Discovery & Cataloguing

Crawled dnshankarabhat.net via CDX Wayback API → found 91 unique archived pages
Identified 29 works across Kannada popular books, English academic monographs, journal articles, and blog series
Created BOOKS.md — comprehensive annotated catalogue of all 29 works, organised into thematic sections (Grammar, Dialect Studies, History, Vocabulary, Old Kannada, English Academic, Sound Change, Syntax)
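
The CDX crawl in the first step can be sketched as below. This is an illustration under assumptions, not the script actually used: the helper names are hypothetical, and the choice of `matchType=domain`, the `statuscode:200` filter, and `collapse=urlkey` is one plausible query among several. The endpoint and parameter names are those of the public Wayback CDX server, which returns a header row first when `output=json` is set.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain):
    """Build a CDX query listing archived captures under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains and all paths
        "output": "json",            # rows as JSON arrays, header row first
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # at most one capture per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def unique_pages(rows):
    """Map each original URL to its latest capture timestamp."""
    if not rows:
        return {}
    header, *records = rows
    ts_i, url_i = header.index("timestamp"), header.index("original")
    latest = {}
    for rec in records:
        url, ts = rec[url_i], rec[ts_i]
        # Timestamps are fixed-width YYYYMMDDhhmmss, so string order is time order.
        if url not in latest or ts > latest[url]:
            latest[url] = ts
    return latest


def snapshot_url(timestamp, original):
    """URL of an archived capture on the Wayback Machine."""
    return f"https://web.archive.org/web/{timestamp}/{original}"
```

To run it, fetch `build_cdx_url("dnshankarabhat.net")` with any HTTP client, parse the body as JSON, and pass the result to `unique_pages`; each entry can then be turned into a fetchable address with `snapshot_url`.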