
Phases 7–1 — Foundation, Initial Cataloguing, Early Pipeline

2026-03 (initial)


Phase 7 — Book 15 Hybrid Extraction & Processing (2026-03-12)

Book 15, Inglish Kannada Padanerake — first dictionary-format book processed:

  • Source: 53-page sample PDF (English-KannadaPadanerakeSample.pdf) — pre-print prelims covering letter A (A→Az)
  • Challenge: Old WX Nudi font encoding. English headwords extracted via raw pdfminer ASCII; Kannada equivalents decoded with wx_decode.py convert_text(). Result: 3,454-line book.md with 84,475 Kannada Unicode characters.
  • kn-eke.md — Eke romanisation of preface, conventions, pattern tables, and ~100-entry A–Az transliteration sample
  • en.md — English analysis with 10 word-formation pattern sections: N+N compounding, verb-derived nouns (-ta/-ike/-uge/-me/-vu), agent nouns (-ga/-uga/-gAra), ಅರಿಮೆ domain names, ಮನೆ institutions, ಗಾಳಿ air-cluster, ಮಾಂಜುಗೆ therapy compounds, productive prefixes, and more
  • claude-prompt.md — Rich AI primer: 6-step word-generation decision tree, 11 domain-specific head-noun cluster tables, 100 curated entries by domain, common mistakes table (Sanskrit vs native), philosophical frame

Phase 6 — Bulk Processing of OCR Books (2026-03-11)

Created en.md + kn-eke.md + claude-prompt.md for all 5 remaining unprocessed OCR books using parallel background agents:

| Book | en.md | kn-eke.md | claude-prompt.md | Agents |
|---|---|---|---|---|
| 03 — Padagala Olarachane | ✅ (prev.) | ✅ (prev.) | ✅ (prev.) | |
| 07 — Kannadada Sollarime | ✅ 41KB | ✅ 9KB | ✅ 18KB | aed65cd |
| 17 — Nudi Nadedu Banda Dari | ✅ 35KB | ✅ 12KB | ✅ 20KB | ae26172 |
| 25 — Vakyagala Olarachane | ✅ 30KB | ✅ 19KB | ✅ 25KB | ae33f5d |
| 27 — Baasheya Bagge | ✅ (prev.) | ✅ (prev.) | ✅ (prev.) | |
| 28 — Kannadakke Beku | ✅ 25KB | ✅ 14KB | ✅ 20KB | aaccf5f |
| 29 — Kannada Vyakarana Yaake Beku | ✅ 28KB | ✅ 12KB | ✅ 19KB | a0622eb |

Format notes for claude-prompt.md:

  • Books 07, 17, 25 (technical linguistics): modelled on Book 03 template
  • Books 27, 28, 29 (advocacy/popular): modelled on Book 08 template

Phase 5 — WX Decode Tool

Books 07, 17, 25, 28, 29 had OCR output in WX Kannada encoding (old Nudi/KGP font). Built wx_decode.py — a lookup-table decoder (based on aravindavk/ascii2unicode) that converts WX-encoded bytes to Unicode Kannada. Decoded ~1.9M Kannada Unicode chars across 6 files.
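
The lookup-table approach can be sketched as a greedy longest-match replacement. This is only an illustration: the table below is a toy with invented ASCII tokens, not the Nudi/KGP mapping used by wx_decode.py or aravindavk/ascii2unicode, and `build_decoder` is a hypothetical name. Whatever else the real decoder does beyond table lookup is not described here.

```python
# Toy table: these token -> Unicode pairs are invented for illustration only.
# They are NOT the real Nudi/KGP mapping used by wx_decode.py.
TOY_TABLE = {
    "k": "ಕ",
    "kh": "ಖ",
    "n": "ನ",
}


def build_decoder(table):
    """Return a decode(text) function that replaces the longest known token first."""
    max_len = max(len(token) for token in table)

    def decode(text):
        out = []
        i = 0
        while i < len(text):
            # Try the longest candidate first so "kh" wins over "k".
            for size in range(min(max_len, len(text) - i), 0, -1):
                token = text[i:i + size]
                if token in table:
                    out.append(table[token])
                    i += size
                    break
            else:
                # Unknown character: pass it through unchanged.
                out.append(text[i])
                i += 1
        return "".join(out)

    return decode


decode = build_decoder(TOY_TABLE)
print(decode("khk n"))
```

Trying the longest token first matters whenever one table key is a prefix of another; a shortest-first scan would split such sequences into the wrong letters.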


Phase 4 — Sarvam Vision OCR Pipeline

Problem: 8 PDFs in Google Drive use old WX Kannada font encoding. pdfminer and pdftotext extract garbled bytes. Clean Kannada Unicode text is not recoverable by text extraction alone — OCR is needed.

Solution: Sarvam Vision API — a 3B multimodal VLM with 84.3% accuracy on olmOCR-Bench, trained on Indian languages including Kannada (kn-IN). Accepts PDF pages, outputs Markdown.

Implementation: ettuge/src/main/python/sarvam-ocr/

| File | Purpose |
|---|---|
| requirements.txt | sarvamai, pypdf |
| ocr.py | Single-PDF OCR: splits into 10-page chunks (API limit), OCRs each chunk, concatenates |
| ocr_dnsbhat.py | Batch processor: all 8 WX-encoded books, prepends Kannada header to output |
| README.md | Setup, usage, API workflow documentation |

Key technical detail: The Sarvam Document Intelligence API has a 10-page limit per job. ocr.py uses pypdf to split each PDF into ≤10-page chunks, submits each as a separate job (create → upload → start → poll → download ZIP → extract .md), then concatenates.
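
The chunking step can be sketched as follows. This is a minimal illustration, not the code in ocr.py: `chunk_ranges` and `split_pdf` are hypothetical names, and the Sarvam job calls (create, upload, start, poll, download) are deliberately left out because their exact SDK signatures are not given here. Only the pypdf calls `PdfReader`, `PdfWriter.add_page` and `PdfWriter.write` are used.

```python
import io

PAGE_LIMIT = 10  # per-job page limit stated above


def chunk_ranges(page_count, limit=PAGE_LIMIT):
    """Return half-open (start, stop) page index ranges of at most `limit` pages."""
    return [
        (start, min(start + limit, page_count))
        for start in range(0, page_count, limit)
    ]


def split_pdf(path, limit=PAGE_LIMIT):
    """Yield each chunk of the PDF at `path` as in-memory PDF bytes."""
    # Third-party dependency, imported lazily so chunk_ranges works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    for start, stop in chunk_ranges(len(reader.pages), limit):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)
        yield buffer.getvalue()


# A 239-page book needs 24 jobs: 23 full chunks plus one 9-page chunk.
print(len(chunk_ranges(239)), chunk_ranges(239)[-1])
```

Each yielded chunk would then go through one OCR job, and the resulting Markdown files would be concatenated in chunk order.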

OCR run status (all ✅ complete as of 2026-03-10):

| Book | Pages | Chunks | book.md |
|---|---|---|---|
| 03 — Padagala Olarachane | 239 | 24 | ✅ |
| 07-vol1 — Sollarime vol.1 | 327 | 33 | ✅ |
| 07-vol2 — Sollarime vol.2 | 301 | 31 | ✅ |
| 17 — Nudi Nadedu Banda Dari | 405 | 41 | ✅ |
| 25 — Vakyagala Olarachane | 289 | 29 | ✅ |
| 27 — Baasheya Bagge | 208 | 21 | ✅ |
| 28 — Kannadakke Beku | 253 | 26 | ✅ |
| 29 — Kannada Vyakarana Yaake Beku | 260 | 26 | ✅ |


Phase 3 — Structured Files (kn / en / kn-eke / claude-prompt)

Book 08, Kannadakke Mahaprana Yake Beda — first fully structured book:

  • kn.md — 5-chapter structured Kannada text with TOC, <a id> anchors
  • kn-eke.md — Eke romanisation (first kn-eke produced; served as format template)
  • en.md — English chapter-by-chapter analysis
  • claude-prompt.md — AI context primer with key terms table

Book 14, Nijakku Halegannada Vyakarana Entahadu:

  • kn.md — 12-chapter structured Kannada text
  • kn-eke.md — Eke romanisation (created by background agent aa87faf)
  • en.md — English summary with blog series integration
  • claude-prompt.md — AI primer
  • blog.md — 7 Shabdamani blog posts

Book 18, Kannada Nudiya Bagege Chintane:

  • en.md — English summaries of all 13 blog posts (29KB) — created by agent a1f957e
  • kn-eke.md — Eke romanisation of all 13 posts (11KB)
  • claude-prompt.md — AI primer with key terms table (7KB)

Book 20, An Outline Grammar of Havyaka:

  • en.md — English chapter summaries (phonology, morphology, verb paradigms, case system)
  • claude-prompt.md — 10 key grammatical features, 25 key terms

Book 02, Kannadalle Hosapadagalannu Kattuva Bage — most complex structured book:

  • kn.md — 547 lines, 17 chapters, 37 sections, all with <a id> anchors
  • kn-eke.md — 830 lines, 43KB — Eke romanisation of all 37 sections, complete with dual cross-links to kn.md and en.md
  • en.md — English summary with cross-references
  • claude-prompt.md — word-formation methodology primer
  • blog.md — 15 blog posts (6,469 lines)

Phase 2 — Text Collection

YouTube transcripts extracted for books 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 12, 13:

  • Best quality: 02 (519 lines), 03 (605 lines), 05 (539 lines)
  • Corrupted: 06, 09, 10, 11, 13 (garbled or truncated)

Internet Archive (archive.org) extractions:

  • Book 08: PDF + DjVu → 4,243-line book.md + 1,965-line djvu.md — ✅ full clean text
  • Book 14: PDF + DjVu → 11,791-line book.md + 7,033-line djvu.md — ✅ full clean text
  • Book 20: DjVu → full text of 1971 Havyaka grammar — ✅

Blog collection via Wayback Machine CDX + fetch:

  • Book 02: 15 blog posts (parts 4–18 of ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಸಾಟಿ… series)
  • Book 14: 7 blog posts (ಶಬ್ದಮಣಿದರ್ಪಣದಲ್ಲಿ ತಳಮಟ್ಟದ ತಪ್ಪುಗಳು series, May–June 2017)
  • Book 18: 13 blog posts (parts 1–3, 10, 14, 18, 20, 23, 27–29, 33, 35 of ನುಡಿಯರಿಮೆಯ ಇಣುಕುನೋಟ series)

Google Drive PDF downloads (10 new PDFs, 2026-03-10):

  • Books 03, 07(vol1+2), 17, 25 — confirmed in Drive
  • Books 27, 28, 29 — newly discovered books, not previously in catalogue
  • Book 15 — sample/prelims PDF (53 pages)
  • Problem identified: all PDFs use old WX Kannada font encoding (non-Unicode, Ghostscript/PageMaker era) → pdfminer extracts garbled text like ¨sÁµÉAiÀÄ
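
A quick way to flag such files before choosing between text extraction and OCR is to check which Unicode ranges the extracted text falls in. The heuristic below is a sketch under stated assumptions: the function names and the 0.5 threshold are hypothetical, not part of the project. It relies only on the fact that real Kannada text lives in the Unicode block U+0C80 to U+0CFF, while the garbled sample above consists of ASCII and Latin-1 Supplement characters.

```python
def kannada_ratio(text):
    """Share of non-whitespace characters that fall in the Kannada Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    kannada = sum(1 for c in chars if "\u0c80" <= c <= "\u0cff")
    return kannada / len(chars)


def looks_legacy_encoded(text, threshold=0.5):
    """Heuristic: Latin-1 Supplement characters present, but little real Kannada."""
    has_latin1 = any("\u00a0" <= c <= "\u00ff" for c in text)
    return has_latin1 and kannada_ratio(text) < threshold


print(looks_legacy_encoded("¨sÁµÉAiÀÄ"))  # sample of garbled pdfminer output
print(looks_legacy_encoded("ಕನ್ನಡ"))       # genuine Unicode Kannada
```

A check like this gives false positives on French or German text, so it is only useful when the corpus is known to be Kannada.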

Phase 1 — Discovery & Cataloguing

  • Crawled dnshankarabhat.net via CDX Wayback API → found 91 unique archived pages
  • Identified 29 works across Kannada popular books, English academic monographs, journal articles, and blog series
  • Created BOOKS.md — comprehensive annotated catalogue of all 29 works, organised into thematic sections (Grammar, Dialect Studies, History, Vocabulary, Old Kannada, English Academic, Sound Change, Syntax)
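
The CDX crawl can be sketched as a query URL plus a small parser for the JSON response. The helper names are hypothetical and the parameter choices (`matchType=domain`, `collapse=urlkey`, a `statuscode:200` filter) are assumptions about a reasonable query, not a record of the one actually used. The code builds and parses only; fetching is left to the caller.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain):
    """Build a CDX query listing one successful capture per unique URL of a domain."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
        "collapse": "urlkey",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json):
    """Turn a CDX JSON response (header row first) into Wayback snapshot URLs."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows
    ts_i = header.index("timestamp")
    orig_i = header.index("original")
    first_seen = {}
    for record in records:
        original = record[orig_i]
        # Keep the first capture listed for each original URL.
        first_seen.setdefault(
            original, f"https://web.archive.org/web/{record[ts_i]}/{original}"
        )
    return list(first_seen.values())


print(cdx_query_url("dnshankarabhat.net"))
```

Deduplicating by original URL on the client side guards against responses where `collapse` still leaves more than one capture per page.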