Ch 4 — Phases 14–8 — OCR Cleanup: Books 07–29, English Summaries


2026-03 (earlier)

Phase 14 — Bulk OCR Cleanup: Books 03, 07, 17, 25, 27 (2026-03-17)

Character-level and structural cleanup of the five remaining uncleaned OCR books, producing a -kn.md file for each and regenerating all six -kn-eke.md files (book 07 has two volumes, hence six files for five books). All six now have 0 residual Kannada characters in their Eke output.

Two OCR error classes, two fix scripts:

Book(s)	OCR source	Error type	Fix script
03, 27	Sarvam Vision OCR	Structural artifacts only (page numbers, `---` separators, running headers)	`fix_books_sarvam.py`
07, 17, 25	Sarvam OCR + WX-decode	Character-level garbling + structural artifacts	`fix_books_wx.py`

WX character-level errors fixed (books 07, 17, 25):

Error class	Pattern	Scope
Arka-ottu reversal	`ಣ್ರ→ರ್ಣ`, `ಥ್ರ→ರ್ಥ`, `ಮ್ರ→ರ್ಮ`, `ಯ್ರ→ರ್ಯ`, `ಧ್ರ→ರ್ಧ`	All 3 books
Ç-fix (garbled aa-mathrā U+00C7)	`\u0CC6\u00C7\u0CBF` → `\u0CCB` (oo-sign); `\u0CC6\u00C7` → `\u0CCA` (o-sign)	All 3 books
Ya-garble	`0iÉ` (U+0030+U+0069+U+00C9) → `ಯ`; 159 occurrences	Book 17 only
Word-specific	`ನಿದ್ರಿಷ್ಟ→ನಿರ್ದಿಷ್ಟ`, `ನಿದ್ರೇಶ→ನಿರ್ದೇಶ`, `¦ÃpPÉ→ಪೀಠಿಕೆ`, `ದೀಘ್ರ→ದೀರ್ಘ`	Book 25 only
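
The Ç-fix row has an ordering dependency worth spelling out: the two-character pattern is a prefix of the three-character one, so the longer sequence has to be replaced first. A minimal sketch, assuming plain string replacement; the names `C_CEDILLA_FIXES` and `fix_c_cedilla` are illustrative and not taken from `fix_books_wx.py`:

```python
# Replacement pairs copied from the Ç-fix row above.
# The three-character sequence comes first so that it is consumed
# before the two-character prefix it contains.
C_CEDILLA_FIXES = [
    ("\u0CC6\u00C7\u0CBF", "\u0CCB"),  # garbled sequence -> oo-sign
    ("\u0CC6\u00C7", "\u0CCA"),        # garbled sequence -> o-sign
]


def fix_c_cedilla(text: str) -> str:
    """Apply the Ç-fix replacements in longest-first order."""
    for garbled, fixed in C_CEDILLA_FIXES:
        text = text.replace(garbled, fixed)
    return text
```

If the order were reversed, the short rule would consume the first two characters of every long sequence and leave a stray U+0CBF behind.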

Key decisions on safe vs. unsafe replacements:

ದ್ರ→ರ್ದ blanket fix is UNSAFE — legitimate in ಕೇಂದ್ರ, ದ್ರಾವಿಡ, ಚಂದ್ರ etc.; targeted word-level fixes only
ನ್ರ in ಏನ್ರಿ is legitimate colloquial Kannada — not an arka-ottu error
ಧ್ರ in books 03/27 (Sarvam OCR) is legitimate (ಆಂಧ್ರ, ಉತ್ತರಧ್ರುವ) — Sarvam OCR was correct

Multi-pass dependency fix: In ಸಾಮಥ್ಯ್ರ, the ಯ್ರ→ರ್ಯ replacement produces a new ಥ್ರ sequence (ಸಾಮಥ್ರ್ಯ) after the ಥ್ರ→ರ್ಥ rule has already run, so a single pass leaves the word half-fixed. apply_char_fixes() therefore repeats the full rule list, up to max_passes=3, until the text stops changing; the second pass turns ಸಾಮಥ್ರ್ಯ into the correct ಸಾಮರ್ಥ್ಯ.
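
The fixed-point loop can be sketched as follows. This is a reconstruction under assumptions, not the code in `fix_books_wx.py`: the rule list and its order (ಥ್ರ before ಯ್ರ) are inferred from the table and the description above, and only the name `apply_char_fixes` and the `max_passes=3` default come from the text.

```python
# Assumed rule order: the ಥ್ರ rule runs before the ಯ್ರ rule, which is what
# makes a second pass necessary for ಸಾಮಥ್ಯ್ರ. These blanket rules are meant
# for the WX-decoded books only (ಧ್ರ is legitimate in books 03/27).
ARKA_OTTU_FIXES = [
    ("ಣ್ರ", "ರ್ಣ"),
    ("ಥ್ರ", "ರ್ಥ"),
    ("ಮ್ರ", "ರ್ಮ"),
    ("ಯ್ರ", "ರ್ಯ"),
    ("ಧ್ರ", "ರ್ಧ"),
]


def apply_char_fixes(text, fixes=ARKA_OTTU_FIXES, max_passes=3):
    """Re-run the whole rule list until the text is stable or max_passes is hit."""
    for _ in range(max_passes):
        before = text
        for wrong, right in fixes:
            text = text.replace(wrong, right)
        if text == before:  # nothing changed in this pass: fixed point reached
            break
    return text
```

With `max_passes=1` the word stops at the intermediate form ಸಾಮಥ್ರ್ಯ; the default of three passes reaches ಸಾಮರ್ಥ್ಯ on the second pass and exits on the third, which changes nothing.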

Critical structural insight: Char fixes must be applied to the entire file (including TOC, acknowledgements, index sections before the first <a id="adhyAya-N"> anchor), not only the body. An earlier version that split header/body first and fixed only the body left hundreds of errors in front matter and prefatory sections.

Results — book.md → kn.md line counts:

Book	book.md	kn.md	Lines removed	Notes
03 — Padagala Olarachane	12,319	11,437	882	Sarvam OCR; 653 structural lines deleted
07 vol1 — Sollarime	24,861	20,475	4,386	WX; Ç-fix
07 vol2 — Sollarime	15,324	13,928	1,396	WX; Ç-fix
17 — Nudi Nadedu Banda Dari	22,312	16,883	5,429	WX; ya-fix + Ç-fix + 1,539 zero-Kannada lines removed
25 — Vakyagala Olarachane	14,485	11,676	2,809	WX; word-level fixes + zero-Kannada lines removed
27 — Baasheya Bagge	9,138	8,245	893	Sarvam OCR; 545 structural lines deleted

kn-eke.md generation (generic gen_kn_eke.py):

A single generic transliterator replaced the book-28-specific kn_to_eke.py. Key fix over earlier stubs: <td> and <th> table cell content is now transliterated (matched by regex >[^<]*< inside HTML lines) rather than passed through verbatim. This eliminated 7,201 residual Kannada chars in book 03’s previous stub (book 03 is table-heavy). All 6 books now output 0 residual Kannada characters.
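
A rough sketch of the table-cell handling and the residual check, assuming the transliterator is available as a plain function from string to string. `transliterate_html_line`, `residual_kannada`, and the callback parameter are hypothetical names; only the `>[^<]*<` regex comes from the description above, and the real `gen_kn_eke.py` may be organised differently.

```python
import re

# Text between a closing ">" and the next "<", as in the regex quoted above.
TEXT_BETWEEN_TAGS = re.compile(r">([^<]*)<")

# Any code point in the Unicode Kannada block.
KANNADA_CHAR = re.compile(r"[\u0C80-\u0CFF]")


def transliterate_html_line(line, transliterate):
    """Transliterate only the text between tags; the tags themselves are untouched."""
    return TEXT_BETWEEN_TAGS.sub(
        lambda m: ">" + transliterate(m.group(1)) + "<", line
    )


def residual_kannada(text):
    """Count Kannada-block characters left in the output (the target is 0)."""
    return len(KANNADA_CHAR.findall(text))
```

Running `residual_kannada` over each generated file is one way to confirm the "0 residual Kannada characters" claim after a regeneration.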

Commit: d8e037a “Phase 14: OCR cleanup + kn.md + kn-eke.md for books 03, 07, 17, 25, 27” — 12 files (6 new kn.md + 6 kn-eke.md regenerated)

Phase 13 — Book 28 OCR Deep-Clean + kn.md + Anchors (2026-03-17)

Full multi-pass OCR cleanup of book 28 (ಕನ್ನಡಕ್ಕೆ ಬೇಕು ಕನ್ನಡದ್ದೇ ವ್ಯಾಕರಣ), creation of a clean kn.md (both books 28 and 29), anchor scaffolding, and cross-link updates. Also documented the full OCR-cleaning methodology in a new kannada-ocr-cleaner skill.

OCR character-level fixes — book 28 (three-pass pipeline)

The raw -book.md OCR text used the legacy Nudi/KGP font encoding (WX-decoded to Unicode by wx_decode.py, but with residual character-level errors). Three targeted fix scripts ran in sequence:

Script	What it fixed	Instances
`fix_arka_ottu.py`	arka-ottu reversals: `ಣ್ರ→ರ್ಣ`, `ಥ್ರ→ರ್ಥ`, `ಮ್ರ→ರ್ಮ`, `ಯ್ರ→ರ್ಯ`, `ದೀಘ್ರ→ದೀರ್ಘ`; word-specific `ತ್ರ/ಧ್ರ` cases; `ಙõï/ÂÐ→ಙ್`; `ಬಹÅ→ಬಹು`	~500+
`fix_residual_ocr.py`	Residual long-e `ಯುೀ→ಯೇ` (59×); archaic diphthong `0iÀiï→ಯ್` (2×); proper name `ತಾರಾಪೆÇರೆವಾಲಾ→ತಾರಾಪೋರ್ವಾಲಾ`; bibliography English garbling (`ಖಿhe → The`, `ಆeಟhi → Delhi`, etc.)	~75
`fix_page_fragments.py`	Structural: orphaned fragments before section headings (12 cases); standalone running chapter headers (8 cases — 4 with fragments, 4 without)	96 lines deleted

OCR structural artifacts (new class discovered)

The most interesting class of errors was purely structural, not character-level: page-break artifacts where the last words of a print page ended up as isolated blank-line-separated lines just before the next subsection heading. Example before fix:

ಹೇಳಿ ಕೊಡುವುದು ಹೇಗೆ ತಪ್ಪಾಗುತ್ತದೆಯೋ ಹಾಗೆಯೇ ಇದೂ ಕೂಡ.

ವ್ಯಾಕರಣವೆಂಬುದು

ನುಡಿಯಿಂದ

ನುದಿಗೆ

1.2  ವ್ಯಾಕರಣದ ಉದ್ದೇಶ

After fix: ಹೇಳಿ ಕೊಡುವುದು ಹೇಗೆ ತಪ್ಪಾಗುತ್ತದೆಯೋ ಹಾಗೆಯೇ ಇದೂ ಕೂಡ. ವ್ಯಾಕರಣವೆಂಬುದು ನುಡಿಯಿಂದ ನುದಿಗೆ

Running chapter headers (chapter titles printed at the top of each print page, OCR’d into the body) were deleted — the 12 chapter names (ಮುನ್ನೋಟ, ಸೇರಿಕೆಯ ನಿಯಮಗಳು, …, ಮುಕ್ತಾಯ) were built into a RUNNING_HEADERS set and matched exactly. The key detection rule: a line is an orphaned fragment only if preceded by a blank line — this guards against misidentifying normal wrapped paragraph lines (which are not preceded by blanks).
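
The two structural rules can be sketched together. This is an illustration, not `fix_page_fragments.py` itself: the heading regex, the two-word length threshold, and the helper names are assumptions, and `RUNNING_HEADERS` below holds only the two chapter names quoted above rather than all 12.

```python
import re

HEADING = re.compile(r"^\d+\.\d+\s+\S")     # assumed heading shape, e.g. "1.2  Title"
RUNNING_HEADERS = {"ಮುನ್ನೋಟ", "ಮುಕ್ತಾಯ"}      # subset only; the real set has 12 names


def _last_text_index(lines):
    """Index of the last non-blank line, or -1 if there is none."""
    i = len(lines) - 1
    while i >= 0 and not lines[i].strip():
        i -= 1
    return i


def _pop_fragments(out, max_words):
    """Remove trailing blank-separated short lines from out; return them in order."""
    fragments = []
    while True:
        j = _last_text_index(out)
        is_fragment = (
            j >= 1
            and not out[j - 1].strip()               # must be preceded by a blank line
            and len(out[j].split()) <= max_words     # assumed length threshold
            and not HEADING.match(out[j])
        )
        if not is_fragment:
            return fragments[::-1]
        fragments.append(out[j].strip())
        del out[j:]                                  # drop fragment and blanks after it


def clean_page_breaks(lines, max_words=2):
    """Drop running headers and fold orphaned fragments into the previous paragraph."""
    out = []
    for line in lines:
        if line.strip() in RUNNING_HEADERS:          # exact match only
            continue
        if HEADING.match(line):
            fragments = _pop_fragments(out, max_words)
            k = _last_text_index(out)
            if fragments and k >= 0:
                out[k] = out[k].rstrip() + " " + " ".join(fragments)
            elif fragments:                          # nothing to merge into: keep the text
                out.extend([" ".join(fragments), ""])
        out.append(line)
    return out
```

The blank-line test is the guard described above; the word-count threshold is an extra assumption added here so that a genuine one-line paragraph before a heading is less likely to be swallowed.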

Total effect: 9,613 → 9,517 lines (96 lines removed, 16 modified).

kn.md creation (books 28 + 29)

New kn.md files created from the cleaned OCR for both books, with:

Jekyll front matter (nav_order, title, parent, redirect_from)
<a id="adhyAya-N"></a> anchors before each ## N. ChapterTitle heading
12 chapter anchors for book 28 (adhyAya-1 through adhyAya-12)
11 chapter anchors for book 29 (adhyAya-1 through adhyAya-11)
Cross-navigation links: [English →](en#chapter-N--...) | [Eke →](kn-eke#anchor) before each chapter heading
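
Anchor scaffolding of this kind can be done with one substitution. A minimal sketch, assuming chapter headings have exactly the `## N. ChapterTitle` shape listed above; `add_chapter_anchors` is an illustrative name, and the function is not idempotent, so it should run once on a file that has no anchors yet.

```python
import re

# Chapter headings of the form "## 12. Title" at the start of a line.
CHAPTER_HEADING = re.compile(r"^## (\d+)\. ", re.MULTILINE)


def add_chapter_anchors(markdown):
    """Insert <a id="adhyAya-N"></a> on its own line before each chapter heading."""
    return CHAPTER_HEADING.sub(
        lambda m: f'<a id="adhyAya-{m.group(1)}"></a>\n\n{m.group(0)}',
        markdown,
    )
```

Counting occurrences of `<a id="adhyAya-` in the result gives a quick check against the expected 12 (book 28) or 11 (book 29).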

en.md anchors + cross-links (books 28 + 29)

Both -en.md files updated with:

13 <a id="chapter-N--..."> anchors in book 28 (chapters 1–12 + key-terms-glossary)
12 <a id="chapter-N--..."> anchors in book 29 (chapters 1–11 + key-terms-glossary)
All [ಕನ್ನಡ →] links updated from bare -book targets to -kn#adhyAya-N fragment URLs

kannada-ocr-cleaner skill created

New Claude skill at .claude/skills/kannada-ocr-cleaner/SKILL.md documenting all four classes of error and the methodology:

Class 1: Vowel-sign/consonant garbling (ಯ, ಯೇ, ಯ್, ಙ್, ಬಹು patterns)
Class 2: Arka-ottu reversal (global-safe + word-specific replacements)
Class 3: English text garbled through legacy font (bibliography, titles)
Class 4: OCR page-break structural artifacts (orphaned fragments + running headers) — added in this phase, with the full three-pass fix script pattern

Phase 12 — mahAprana (Aspirate) Eke Correction (2026-03-16)

Corrected a systematic error in the Eke romanisation rule for aspirated consonants. All kn-eke.md files had been generated with the wrong rule “drop aspirates” (ಭ→b, ಧ→d, ಖ→k, ಥ→t, ಫ→p). The correct rule is “preserve aspirates with h marker” (ಭ→bh, ಧ→dh, ಖ→kh, ಥ→th, ಫ→ph).

The guiding principle: Eke romanises what is written in the source. If the source uses ಭ (aspirated labial), the romanisation must write bh, not b — to faithfully represent what DNS Bhat wrote. (Separately, the word-coining philosophy of ellara kannaDa avoids mahapranas in new coinages, since they don’t occur in native Dravidian speech — but that’s a word-formation rule, not a transcription rule.)

Specific fixes applied (Python re.sub + replace):

Wrong Eke	Correct Eke	Source consonant	Instances
`bAShe`	`bhAShe`	ಭ (ಭಾಷೆ)	~18 files
`bAga`	`bhAga`	ಭ (ಭಾಗ)	multiple
`sankara baT`	`sankara bhaT`	ಭ (ಭಟ್)	all files
`adyAya`	`adhyAya`	ಧ (ಅಧ್ಯಾಯ)	multiple
`sambanda`	`sambandha`	ಧ	multiple
`\badika`	`adhika`	ಧ	multiple
`\bmukya`	`mukhya`	ಖ (ಮುಖ್ಯ)	multiple
`leKana`	`lEkhana`	ಖ (ಲೇಖನ)	multiple
`lEkakaru`	`lEkhakaru`	ಖ (ಲೇಖಕರು)	multiple
`\barta`	`artha`	ಥ (ಅರ್ಥ)	multiple
`\bpATa`	`pATha`	ಠ (ಪಾಠ)	multiple
`bAgya`	`bhAgya`	ಭ (ಭಾಗ್ಯ)	1
`dIrga`	`dIrgha`	ಘ (ದೀರ್ಘ)	2
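
The `\b` prefixes in the table are regex word boundaries. A short sketch of how such a pattern list might be applied with `re.sub`, using a subset of the rows above; `ASPIRATE_FIXES` and `fix_aspirates` are illustrative names, and the commit may have structured this differently.

```python
import re

# (pattern, replacement) pairs taken from the table above (subset).
# \b anchors short stems to the start of a word so they cannot match
# in the middle of a longer, unrelated word.
ASPIRATE_FIXES = [
    (r"bAShe", "bhAShe"),
    (r"adyAya", "adhyAya"),
    (r"\badika", "adhika"),
    (r"\bmukya", "mukhya"),
    (r"\barta", "artha"),
]


def fix_aspirates(text):
    """Apply each aspirate correction in turn."""
    for pattern, replacement in ASPIRATE_FIXES:
        text = re.sub(pattern, replacement, text)
    return text
```

As an illustration of why the boundary matters, a word such as `vartamAna` contains `arta` but is left alone, because the match is only allowed at the start of a word. The corrected forms no longer match any pattern, so running the function twice is harmless.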

Commit: 907ac31 “eke: fix all mahAprANa (aspirate) romanization” — 22 files, 398 insertions/deletions

Skill files and PROJECT-RECAP also updated in this phase to reflect the corrected rule.

Phase 11 — GitHub Pages / Jekyll Deployment (2026-03-14–15)

Set up a public-facing website at https://vwulf.github.io/ettuge/ using Jekyll with the just-the-docs theme, served via GitHub Pages. The entire pipeline is in .github/workflows/pages.yml — no pre-generated files are checked in; everything is built fresh from src/ on every push to master.

Infrastructure built:

Component	Details
`docs/_config.yml`	`theme: just-the-docs`, `color_scheme: light`, `search_enabled: true`, `nav_fold: true`, `defaults: layout: default`
`docs/Gemfile`	`gem "just-the-docs"` pinned for GitHub Actions
`.github/workflows/pages.yml`	Build + deploy pipeline: copy src/ → docs/dnsbhat/, generate nav, add front matter, Jekyll build, deploy
`.claude/launch.json`	Local preview server entry: `bundle exec jekyll serve --livereload --baseurl ""`

Bugs encountered and fixed:

Bug	Root cause	Fix
404 on all pages	`configure-pages@v5` step had no `id:` → `base_path` was empty → Jekyll built without baseurl → all links used `/dnsbhat/...` instead of `/ettuge/dnsbhat/...`	Added `id: setup-pages` to the step
Content files (en/kn/eke) returned 404	`-en.md`, `-kn.md`, `-kn-eke.md` lacked Jekyll front matter → Jekyll treated them as static assets, never built HTML	Workflow step prepends `---\nnav_exclude: true\n---` to every `.md` missing front matter
Sidebar flooded with transcripts, prompts, metadata	Empty `---\n---` front matter made all files sidebar-visible	Changed injected front matter to `nav_exclude: true`
Accidentally committed `docs/dnsbhat/` files	`.gitignore` pattern `docs/dnsbhat/*/` only excluded subdirs, not root-level loose files	`git rm --cached`; added a `docs/dnsbhat/*.md` ignore pattern to `.gitignore`, with `!docs/dnsbhat/index.md` as the exception
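
The front-matter injection from the second and third rows can be sketched as below. The real step lives in the workflow and may be written differently; `ensure_front_matter` and `inject_front_matter` are hypothetical names, and the check for existing front matter (a first line of `---`) is an assumption.

```python
from pathlib import Path

# Front matter from the fix above: present, so Jekyll builds the page,
# but excluded from the sidebar.
FRONT_MATTER = "---\nnav_exclude: true\n---\n"


def ensure_front_matter(markdown):
    """Prepend the default front matter unless the file already starts with one."""
    if markdown.startswith("---\n"):
        return markdown
    return FRONT_MATTER + markdown


def inject_front_matter(root):
    """Rewrite every .md file under root that lacks front matter; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.md"):
        original = path.read_text(encoding="utf-8")
        updated = ensure_front_matter(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Files that already carry their own front matter (for example the kn.md files with `nav_order`, `title`, and `parent`) pass through unchanged.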

Format-first sidebar redesign:

The original sidebar showed books by number, and each book’s entry pointed to the raw README metadata — noisy and not useful for readers. Replaced with a format-oriented three-section sidebar generated by a Python script in the workflow:

English Summaries        (nav_order: 10)
  └── Book 02 — Hosapadagalannu Kattuva Bage
  └── Book 03 — Padagala Olarachane
  └── ...  (en.md preferred; blog.md fallback; YouTube transcript last resort)

Kannada Text             (nav_order: 20)
  └── Book 02, 03, 07, 08, 14, ...  (kn.md or book.md)

Eke Transliteration      (nav_order: 30)
  └── Book 02, 03, 04, ...  (kn-eke.md)

The Python script (inline python3 - <<'PYEOF' in the workflow):

Creates three section index pages (english-summaries.md, kannada-text.md, eke-transliteration.md) with has_children: true
For each book directory, finds the best available file per format (en > blog > fallback)
Rewrites each content file’s front matter with parent: "English Summaries" (or Kannada Text / Eke Transliteration) and nav_order: {book_num} — placing it in the correct sidebar section
Rewrites the book’s main landing page body as a clean summary: Kannada title, English title, description blockquote, status, and a links table — no more raw README metadata or YouTube TOC at top level
Books with a dedicated en.md get nav_exclude: true on their landing page (the en.md IS the sidebar entry); transcript-only books get parent: "English Summaries" on their landing page (it IS the entry)
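
The "best available file per format" step amounts to a suffix priority list. A minimal sketch under assumptions: the `-en.md`, `-kn.md`, `-book.md`, and `-kn-eke.md` suffixes appear elsewhere in this log, while `-blog.md` and the function name `pick_by_priority` are guesses, and the transcript fallback is left to the caller because its filename pattern is not stated here.

```python
# Suffixes in preference order for each sidebar section.
ENGLISH = ("-en.md", "-blog.md")   # "-blog.md" is an assumed suffix
KANNADA = ("-kn.md", "-book.md")
EKE = ("-kn-eke.md",)


def pick_by_priority(filenames, suffixes, fallback=None):
    """Return the first file matching the highest-priority suffix, else fallback."""
    for suffix in suffixes:
        for name in sorted(filenames):
            if name.endswith(suffix):
                return name
    return fallback
```

Because `-kn-eke.md` does not end with `-kn.md`, the Kannada and Eke lookups do not collide.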

Result: All 29 books appear in the English Summaries section; 16 appear in Kannada Text; 14 appear in Eke Transliteration. Book landing pages are clean reader-facing pages, not developer metadata dumps.

Phase 10 — Book 08 varSa OCR Correction (2026-03-14)

Targeted fix for a systematic OCR misread in book 08 (Kannadakke Mahaprana Yake Beda). The word ವರ್ಷ (“year”, Eke: varSa) was rendered in four wrong patterns by the OCR engine — corrected in both the Eke romanisation file and the structured Kannada file.

Wrong form	Correct	Count
`varragaL*` / `varradoLage`	`varSagaL*` / `varSadoLage`	6
`varmagaL*`	`varSagaL*`	3
`varyagaL*`	`varSagaL*`	2
`varka`	`varSa`	2

Files fixed: 08-kannaDakke-mahAprANa-yAke-bEDa-kn-eke.md (13 instances) and 08-kannaDakke-mahAprANa-yAke-bEDa-kn.md (matching Kannada script corrections: ವರ್ರ, ವರ್ಮ, ವರ್ಯ, ವರ್ಕ → ವರ್ಷ). No other books contained these patterns.
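
For the Eke file, the four wrong forms can be expressed as three narrow patterns. This is a sketch rather than the script that was actually run; `fix_varsa` is an illustrative name, and the lookaheads are an assumption meant to restrict each rule to the suffixed forms listed in the table.

```python
import re

# Each rule only fires on the forms listed in the table above.
VARSA_FIXES = [
    (re.compile(r"\bvarra(?=gaL|doLage)"), "varSa"),
    (re.compile(r"\bvar[my]a(?=gaL)"), "varSa"),
    (re.compile(r"\bvarka\b"), "varSa"),
]


def fix_varsa(text):
    """Replace the misread forms of varSa and return (new_text, replacements_made)."""
    total = 0
    for pattern, replacement in VARSA_FIXES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total
```

Returning the replacement count makes it easy to compare a run against the 13 instances recorded for book 08.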

Phase 9 — Eke Romanisation Bug-Fix Passes (2026-03-13)

Systematic correction of four romanisation errors propagated across all processed kn-eke.md files. Errors originated from LLM generation using HK-adjacent conventions instead of Eke rules.

Pass	Error pattern	Correct Eke	Files	Instances
1 — Anusvara (M)	`M` used for anusvāra (HK style)	Assimilated nasal+C: `mb, mp, nk, ng, nc, nt, nd…` — never standalone M	14 files	~1,511
2 — N as anusvara	`N` (retroflex ಣ) used before stop consonants	`n` before stop consonants; N reserved exclusively for ಣ	12 files	~69
3 — R as ರ	`R` (exclusively ಱ) incorrectly used for common ರ	Lowercase `r` for all ರ; R kept exclusively for ಱ	9 files	~230
4 — Vocalic ṛ as ri/ru	`kRi/sri/kri/kru` used for ೃ/ಋ (vocalic ṛ)	Eke `x` (short ṛ = ಋ/ೃ); Eke `X` (long ṝ = ೠ/ೄ)	all files	~400+
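
Pass 1 can be illustrated with a deliberately narrow sketch. It covers only the nasal+consonant pairs named in the table (`mb, mp, nk, ng, nc, nt, nd`) and leaves every other `M` untouched for manual review; the function name and the decision to skip unlisted contexts are assumptions, not a description of the pass that was actually run.

```python
import re


def assimilate_anusvara(text):
    """Replace HK-style M with the assimilated nasal before the listed consonants."""
    text = re.sub(r"M(?=[pb])", "m", text)      # labials: mp, mb
    text = re.sub(r"M(?=[kgctd])", "n", text)   # listed non-labial stops: nk, ng, nc, nt, nd
    return text
```

An `M` before any other letter (for example before `s`) is not rewritten by this sketch, since the table's ellipsis leaves those cases unspecified.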

Pass 4 specific word families fixed:

Wrong	Correct	Kannada	Count
`samskrita / samskruta / samskrta`	`samskxta`	ಸಂಸ್ಕೃತ	~400
`sriSTi`	`sxSTi`	ಸೃಷ್ಟಿ	~30
`driSTi`	`dxSTi`	ದೃಷ್ಟಿ	~10
`driSya`	`dxSya`	ದೃಶ್ಯ	3
`srijana`	`sxjana`	ಸೃಜನ	1
`mrita`	`mxta`	ಮೃತ	1
`tritIya`	`txtIya`	ತೃತೀಯ	2

Caution applied: Words where original Kannada genuinely has ಕ್ರ + ಉ/ಇ (consonant r + vowel) were not changed. Examples: krutaka (ಕ್ರುತಕ), krudanta, krullingagaLa, kriyA (ಕ್ರಿಯಾ) — all verified to use consonant r, not ೃ sign.
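
One way to honour that caution is to replace only known word families instead of rewriting every `ri`/`ru` after a consonant. A sketch using rows from the Pass 4 table; `VOCALIC_R_FIXES` and `fix_vocalic_r` are illustrative names.

```python
import re

# Word-family patterns from the Pass 4 table. Nothing outside this list is
# touched, so kriyA, krutaka, and similar consonant-r words are left alone.
VOCALIC_R_FIXES = [
    (r"samskr[iu]?ta", "samskxta"),
    (r"sriSTi", "sxSTi"),
    (r"driSTi", "dxSTi"),
    (r"driSya", "dxSya"),
    (r"\bmrita", "mxta"),
    (r"tritIya", "txtIya"),
]


def fix_vocalic_r(text):
    """Rewrite listed word families to use Eke x for vocalic r."""
    for pattern, replacement in VOCALIC_R_FIXES:
        text = re.sub(pattern, replacement, text)
    return text
```

A blanket rule such as "consonant + `ri` becomes consonant + `x`" would have turned `kriyA` into `kxyA`, which is exactly the error the verification step was guarding against.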

Skill files updated — both dns-bhat-book-summarizer/SKILL.md and dns-bhat-transcript-summarizer/SKILL.md vowel/consonant tables corrected with all four rules. ellara-kannada-word-coiner/SKILL.md and references/eke-romanization.md also updated.

Phase 8 — Cross-link Audit & Fixes; Book 15 kn-eke Restructure; Transcript Book Processing (2026-03-13)

Cross-link audit — systematic review of all processed books revealed inconsistent cross-linking in en.md files:

Issue	Books affected	Fix applied
`[ಕನ್ನಡ →]` links present but no `[Eke →]` links	02, 08, 14	Appended `| [Eke →](kn-eke.md#anchor)` to every existing section link (43 + 33 + 61 links)
No cross-links at all (bulk-processed books)	07, 17, 25, 27, 28, 29	Inserted `[ಕನ್ನಡ →](book.md) | [Eke →](kn-eke.md#anchor)` after each chapter heading (60 total links across 6 files)
Broken links to non-existent `kn.md`	03	Retargeted 9 links to `book.md` (bare) + added `[Eke →]` with correct kn-eke.md anchors; fixed footer reference

All en.md files now follow the Book 14 template: every chapter/section heading has a [ಕನ್ನಡ →] | [Eke →] cross-link pair on the line immediately below.

Book 15 kn-eke.md restructured — original file mixed analytical pattern sections (N+N compounds, suffix tables, etc.) derived from en.md work with source-text romanisation. Restructured to be a proper romanisation of the source text:

Removed: analytical kaTTaNe pattern tables (those belong in en.md)
Retained: Eke romanisation of actual munnuDi (preface), irusarikegaLu (conventions), and all A–Az dictionary entries with usage examples
Title updated to ingliS-kannaDa padanerake — Eke mUla to make the source-text nature explicit

Transcript books 04, 05, 09 fully processed — first systematic processing of YouTube-transcript-only books. Each had only a raw .md transcript and a website stub; now all three have a complete set of structured files:

Book	Source Quality	en.md	kn-eke.md	claude-prompt.md
04 — Mathu Matthu Barahada Gondala	~25/44 parts (~57%)	✅ 7 themes	✅ key passages	✅ AI primer
05 — Mathina Olaguttu	~27/37 parts (~73%)	✅ 8 themes	✅ key passages	✅ AI primer
09 — Havyaka Kannada	~72/88 slots (~82%)	✅ 5 themes	✅ key passages	✅ AI primer

Methodology for transcript books: Unlike OCR books, transcript books have no continuous chapter text — only partial lecture recordings with gaps. The en.md files use a “Thematic Structure (Replacing Table of Contents)” format with coverage notes per section indicating which parts are readable vs. garbled. The kn-eke.md files extract and romanise the best available passages (rather than attempting whole-book coverage). The claude-prompt.md files follow the standard template adapted for transcript-quality sources, including explicit source limitation notes.