
Phases 14–8 — OCR Cleanup: Books 07–29, English Summaries

2026-03 (earlier)


Phase 14 — Bulk OCR Cleanup: Books 03, 07, 17, 25, 27 (2026-03-17)

Character-level and structural cleanup of the five remaining uncleaned OCR books, producing -kn.md files for each and regenerating all six -kn-eke.md files (book 07 has two volumes, hence six files for five books). All books now have 0 residual Kannada characters in their Eke output.

Two OCR error classes, two fix scripts:

| Book(s) | OCR source | Error type | Fix script |
|---|---|---|---|
| 03, 27 | Sarvam Vision OCR | Structural artifacts only (page numbers, --- separators, running headers) | fix_books_sarvam.py |
| 07, 17, 25 | Sarvam OCR + WX-decode | Character-level garbling + structural artifacts | fix_books_wx.py |

WX character-level errors fixed (books 07, 17, 25):

| Error class | Pattern | Scope |
|---|---|---|
| Arka-ottu reversal | ಣ್ರ→ರ್ಣ, ಥ್ರ→ರ್ಥ, ಮ್ರ→ರ್ಮ, ಯ್ರ→ರ್ಯ, ಧ್ರ→ರ್ಧ | All 3 books |
| Ç-fix (garbled aa-mathrā U+00C7) | \u0CC6\u00C7\u0CBF\u0CCB (oo-sign); \u0CC6\u00C7\u0CCA (o-sign) | All 3 books |
| Ya-garble | 0iÉ (U+0030+U+0069+U+00C9) → ; 159 occurrences | Book 17 only |
| Word-specific | ನಿದ್ರಿಷ್ಟ→ನಿರ್ದಿಷ್ಟ, ನಿದ್ರೇಶ→ನಿರ್ದೇಶ, ¦ÃpPÉ→ಪೀಠಿಕೆ, ದೀಘ್ರ→ದೀರ್ಘ | Book 25 only |

Key decisions on safe vs. unsafe replacements:

  • ದ್ರ→ರ್ದ blanket fix is UNSAFE — legitimate in ಕೇಂದ್ರ, ದ್ರಾವಿಡ, ಚಂದ್ರ etc.; targeted word-level fixes only
  • ನ್ರ in ಏನ್ರಿ is legitimate colloquial Kannada — not an arka-ottu error
  • ಧ್ರ in books 03/27 (Sarvam OCR) is legitimate (ಆಂಧ್ರ, ಉತ್ತರಧ್ರುವ) — Sarvam OCR was correct

Multi-pass dependency fix: in ಸಾಮಥ್ಯ್ರ, the ಯ್ರ→ರ್ಯ rule creates a new ಥ್ರ after ಥ್ರ→ರ್ಥ has already run, so a single pass leaves that ಥ್ರ unfixed. Solved by having apply_char_fixes() iterate up to max_passes=3, stopping early once the text is stable.
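
A minimal sketch of the multi-pass idea follows. Only the function name apply_char_fixes and the max_passes=3 limit come from the notes above. The CHAR_FIXES list, its ordering, and the function body are assumptions made for illustration, not the code in fix_books_wx.py.

```python
# Assumed rule order: the tha-rule runs before the ya-rule, as in the scenario above.
CHAR_FIXES = [
    ("ಣ್ರ", "ರ್ಣ"),
    ("ಥ್ರ", "ರ್ಥ"),
    ("ಮ್ರ", "ರ್ಮ"),
    ("ಯ್ರ", "ರ್ಯ"),
    ("ಧ್ರ", "ರ್ಧ"),
]


def apply_char_fixes(text, fixes=CHAR_FIXES, max_passes=3):
    """Apply every replacement in order, repeating until nothing changes."""
    for _ in range(max_passes):
        fixed = text
        for wrong, right in fixes:
            fixed = fixed.replace(wrong, right)
        if fixed == text:
            break  # stable: a further pass would change nothing
        text = fixed
    return text
```

With max_passes=1 the chained case stops halfway (ಸಾಮಥ್ಯ್ರ becomes ಸಾಮಥ್ರ್ಯ). The second pass picks up the newly created ಥ್ರ and yields ಸಾಮರ್ಥ್ಯ.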

Critical structural insight: Char fixes must be applied to the entire file (including TOC, acknowledgements, index sections before the first <a id="adhyAya-N"> anchor), not only the body. An earlier version that split header/body first and fixed only the body left hundreds of errors in front matter and prefatory sections.

Results — book.md → kn.md line counts:

| Book | book.md | kn.md | Lines removed | Notes |
|---|---|---|---|---|
| 03 — Padagala Olarachane | 12,319 | 11,437 | 882 | Sarvam OCR; 653 structural lines deleted |
| 07 vol1 — Sollarime | 24,861 | 20,475 | 4,386 | WX; Ç-fix |
| 07 vol2 — Sollarime | 15,324 | 13,928 | 1,396 | WX; Ç-fix |
| 17 — Nudi Nadedu Banda Dari | 22,312 | 16,883 | 5,429 | WX; ya-fix + Ç-fix + 1,539 zero-Kannada lines removed |
| 25 — Vakyagala Olarachane | 14,485 | 11,676 | 2,809 | WX; word-level fixes + zero-Kannada lines removed |
| 27 — Baasheya Bagge | 9,138 | 8,245 | 893 | Sarvam OCR; 545 structural lines deleted |

kn-eke.md generation (generic gen_kn_eke.py):

A single generic transliterator replaced the book-28-specific kn_to_eke.py. Key fix over earlier stubs: <td> and <th> table cell content is now transliterated (matched by regex >[^<]*< inside HTML lines) rather than passed through verbatim. This eliminated 7,201 residual Kannada chars in book 03’s previous stub (book 03 is table-heavy). All 6 books now output 0 residual Kannada characters.
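
The table-cell handling can be sketched as below. The regex is the one quoted above. The function names, the two-letter TOY_MAP, and the residual counter are assumptions for illustration and do not reflect the real Eke tables in gen_kn_eke.py.

```python
import re

# Text sitting between a '>' and the next '<', i.e. the content of a cell.
CELL_TEXT = re.compile(r">([^<]*)<")

# Toy mapping for demonstration only; not the real Eke rules.
TOY_MAP = {"ಕ": "ka", "ನ": "na"}


def toy_transliterate(text):
    return "".join(TOY_MAP.get(ch, ch) for ch in text)


def transliterate_html_line(line, transliterate=toy_transliterate):
    """Transliterate cell text while leaving the tags themselves untouched."""
    return CELL_TEXT.sub(lambda m: ">" + transliterate(m.group(1)) + "<", line)


def count_kannada(text):
    """Count characters in the Kannada Unicode block (U+0C80 to U+0CFF)."""
    return sum(1 for ch in text if "\u0c80" <= ch <= "\u0cff")
```

A counter like count_kannada is one way to check the "0 residual Kannada characters" claim on a generated file.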

Commit: d8e037a “Phase 14: OCR cleanup + kn.md + kn-eke.md for books 03, 07, 17, 25, 27” — 12 files (6 new kn.md + 6 kn-eke.md regenerated)


Phase 13 — Book 28 OCR Deep-Clean + kn.md + Anchors (2026-03-17)

Full multi-pass OCR cleanup of book 28 (ಕನ್ನಡಕ್ಕೆ ಬೇಕು ಕನ್ನಡದ್ದೇ ವ್ಯಾಕರಣ), creation of a clean kn.md (both books 28 and 29), anchor scaffolding, and cross-link updates. Also documented the full OCR-cleaning methodology in a new kannada-ocr-cleaner skill.

OCR character-level fixes — book 28 (three-pass pipeline)

The raw -book.md OCR text used the legacy Nudi/KGP font encoding (WX-decoded to Unicode by wx_decode.py, but with residual character-level errors). Three targeted fix scripts ran in sequence:

| Script | What it fixed | Instances |
|---|---|---|
| fix_arka_ottu.py | arka-ottu reversals: ಣ್ರ→ರ್ಣ, ಥ್ರ→ರ್ಥ, ಮ್ರ→ರ್ಮ, ಯ್ರ→ರ್ಯ, ದೀಘ್ರ→ದೀರ್ಘ; word-specific ತ್ರ/ಧ್ರ cases; ಙõï/ÂÐ→ಙ್; ಬಹÅ→ಬಹು | ~500+ |
| fix_residual_ocr.py | Residual long-e ಯುೀ→ಯೇ (59×); archaic diphthong 0iÀiï→ಯ್ (2×); proper name ತಾರಾಪೆÇರೆವಾಲಾ→ತಾರಾಪೋರ್ವಾಲಾ; bibliography English garbling (ಖಿhe → The, ಆeಟhi → Delhi, etc.) | ~75 |
| fix_page_fragments.py | Structural: orphaned fragments before section headings (12 cases); standalone running chapter headers (8 cases: 4 with fragments, 4 without) | 96 lines deleted |

OCR structural artifacts (new class discovered)

The most interesting class of errors was purely structural, not character-level: page-break artifacts where the last words of a print page ended up as isolated blank-line-separated lines just before the next subsection heading. Example before fix:

ಹೇಳಿ ಕೊಡುವುದು ಹೇಗೆ ತಪ್ಪಾಗುತ್ತದೆಯೋ ಹಾಗೆಯೇ ಇದೂ ಕೂಡ.

ವ್ಯಾಕರಣವೆಂಬುದು

ನುಡಿಯಿಂದ

ನುದಿಗೆ

1.2  ವ್ಯಾಕರಣದ ಉದ್ದೇಶ

After fix: ಹೇಳಿ ಕೊಡುವುದು ಹೇಗೆ ತಪ್ಪಾಗುತ್ತದೆಯೋ ಹಾಗೆಯೇ ಇದೂ ಕೂಡ. ವ್ಯಾಕರಣವೆಂಬುದು ನುಡಿಯಿಂದ ನುದಿಗೆ

Running chapter headers (chapter titles printed at the top of each print page, OCR’d into the body) were deleted — the 12 chapter names (ಮುನ್ನೋಟ, ಸೇರಿಕೆಯ ನಿಯಮಗಳು, …, ಮುಕ್ತಾಯ) were built into a RUNNING_HEADERS set and matched exactly. The key detection rule: a line is an orphaned fragment only if preceded by a blank line — this guards against misidentifying normal wrapped paragraph lines (which are not preceded by blanks).
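
The detection rule can be sketched roughly as follows. The RUNNING_HEADERS name and the "preceded by a blank line" guard come from the description above. The heading regex, the max_words threshold, and both function names are assumptions, and the set here holds only two of the twelve chapter names.

```python
import re

# Assumed heading shape: a dotted section number followed by a title, e.g. "1.2  Title".
HEADING = re.compile(r"^\d+(\.\d+)*\s+\S")

# Subset of the running headers, matched exactly.
RUNNING_HEADERS = {"ಮುನ್ನೋಟ", "ಮುಕ್ತಾಯ"}


def drop_running_headers(lines):
    return [ln for ln in lines if ln.strip() not in RUNNING_HEADERS]


def merge_orphans(lines, max_words=2):
    """Fold short blank-separated fragments above a heading into the paragraph before them."""
    out = []
    for line in lines:
        if HEADING.match(line) and out and out[-1] == "":
            k = len(out) - 1  # index of the blank line directly above the heading
            frags = []
            # A fragment must be non-blank, short, and preceded by a blank line.
            while (k >= 2 and out[k - 1] != "" and out[k - 2] == ""
                   and len(out[k - 1].split()) <= max_words):
                frags.append(out[k - 1])
                k -= 2
            if frags and k >= 1 and out[k - 1] != "":
                out[k - 1] += " " + " ".join(reversed(frags))
                del out[k + 1:]
        out.append(line)
    return out
```

A wrapped paragraph line is never merged by this sketch, because the line before it is not blank.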

Total effect: 9,613 → 9,517 lines (96 lines removed, 16 modified).

kn.md creation (books 28 + 29)

New kn.md files created from the cleaned OCR for both books, with:

  • Jekyll front matter (nav_order, title, parent, redirect_from)
  • <a id="adhyAya-N"></a> anchors before each ## N. ChapterTitle heading
  • 12 chapter anchors for book 28 (adhyAya-1 through adhyAya-12)
  • 11 chapter anchors for book 29 (adhyAya-1 through adhyAya-11)
  • Cross-navigation links: [English →](en#chapter-N--...) | [Eke →](kn-eke#anchor) before each chapter heading
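
Anchor insertion of the kind listed above might look like this. The anchor format and the "## N. ChapterTitle" heading shape come from the list. The function name and the skip-if-present check are assumptions.

```python
import re

# Matches headings such as "## 3. ChapterTitle" and captures the chapter number.
CHAPTER_HEADING = re.compile(r"^## (\d+)\. ")


def add_chapter_anchors(lines):
    """Insert an adhyAya-N anchor line before each numbered chapter heading."""
    out = []
    for line in lines:
        m = CHAPTER_HEADING.match(line)
        if m:
            anchor = f'<a id="adhyAya-{m.group(1)}"></a>'
            if not out or out[-1] != anchor:  # do not duplicate on a re-run
                out.append(anchor)
        out.append(line)
    return out
```

Because of the duplicate check, running the function twice over the same file gives the same result as running it once.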

en.md anchors + cross-links (books 28 + 29)

Both -en.md files updated with:

  • 13 <a id="chapter-N--..."> anchors in book 28 (chapters 1–12 + key-terms-glossary)
  • 12 <a id="chapter-N--..."> anchors in book 29 (chapters 1–11 + key-terms-glossary)
  • All [ಕನ್ನಡ →] links updated from bare -book targets to -kn#adhyAya-N fragment URLs

kannada-ocr-cleaner skill created

New Claude skill at .claude/skills/kannada-ocr-cleaner/SKILL.md documenting all four classes of error and the methodology:

  • Class 1: Vowel-sign/consonant garbling (ಯ, ಯೇ, ಯ್, ಙ್, ಬಹು patterns)
  • Class 2: Arka-ottu reversal (global-safe + word-specific replacements)
  • Class 3: English text garbled through legacy font (bibliography, titles)
  • Class 4: OCR page-break structural artifacts (orphaned fragments + running headers) — added in this phase, with the full three-pass fix script pattern

Phase 12 — mahAprANa (Aspirate) Eke Correction (2026-03-16)

Corrected a systematic error in the Eke romanisation rule for aspirated consonants. All kn-eke.md files had been generated with the wrong rule “drop aspirates” (ಭ→b, ಧ→d, ಖ→k, ಥ→t, ಫ→p). The correct rule is “preserve aspirates with h marker” (ಭ→bh, ಧ→dh, ಖ→kh, ಥ→th, ಫ→ph).

The guiding principle: Eke romanises what is written in the source. If the source uses ಭ (aspirated labial), the romanisation must write bh, not b — to faithfully represent what DNS Bhat wrote. (Separately, the word-coining philosophy of ellara kannaDa avoids mahapranas in new coinages, since they don’t occur in native Dravidian speech — but that’s a word-formation rule, not a transcription rule.)

Specific fixes applied (Python re.sub + replace):

| Wrong Eke | Correct Eke | Source consonant | Instances |
|---|---|---|---|
| bAShe | bhAShe | ಭ (ಭಾಷೆ) | ~18 files |
| bAga | bhAga | ಭ (ಭಾಗ) | multiple |
| sankara baT | sankara bhaT | ಭ (ಭಟ್) | all files |
| adyAya | adhyAya | ಧ (ಅಧ್ಯಾಯ) | multiple |
| sambanda | sambandha | ಧ (ಸಂಬಂಧ) | multiple |
| \badika | adhika | ಧ (ಅಧಿಕ) | multiple |
| \bmukya | mukhya | ಖ (ಮುಖ್ಯ) | multiple |
| leKana | lEkhana | ಖ (ಲೇಖನ) | multiple |
| lEkakaru | lEkhakaru | ಖ (ಲೇಖಕರು) | multiple |
| \barta | artha | ಥ (ಅರ್ಥ) | multiple |
| \bpATa | pATha | ಠ (ಪಾಠ) | multiple |
| bAgya | bhAgya | ಭ (ಭಾಗ್ಯ) | 1 |
| dIrga | dIrgha | ಘ (ದೀರ್ಘ) | 2 |
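
The \b-prefixed entries above suggest word-boundary regexes. A minimal sketch of that approach is below, using only those four rows. The list name and function name are assumptions, and the real fix used a mix of re.sub and plain replace calls that is not reproduced here.

```python
import re

# Patterns copied from the table rows that carry a leading \b (word start).
ASPIRATE_FIXES = [
    (r"\badika", "adhika"),
    (r"\bmukya", "mukhya"),
    (r"\barta", "artha"),
    (r"\bpATa", "pATha"),
]


def fix_aspirates(text):
    """Restore the h marker in the listed word families."""
    for pattern, repl in ASPIRATE_FIXES:
        text = re.sub(pattern, repl, text)
    return text
```

The leading \b keeps a pattern from firing in the middle of a longer word, while still catching inflected forms that begin with the stem. Already-corrected text is left alone, since "artha" does not contain "arta".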

Commit: 907ac31 “eke: fix all mahAprANa (aspirate) romanization” — 22 files, 398 insertions/deletions

Skill files and PROJECT-RECAP also updated in this phase to reflect the corrected rule.


Phase 11 — GitHub Pages / Jekyll Deployment (2026-03-14–15)

Set up a public-facing website at https://vwulf.github.io/ettuge/ using Jekyll with the just-the-docs theme, served via GitHub Pages. The entire pipeline is in .github/workflows/pages.yml — no pre-generated files are checked in; everything is built fresh from src/ on every push to master.

Infrastructure built:

| Component | Details |
|---|---|
| docs/_config.yml | theme: just-the-docs, color_scheme: light, search_enabled: true, nav_fold: true, defaults: layout: default |
| docs/Gemfile | gem "just-the-docs" pinned for GitHub Actions |
| .github/workflows/pages.yml | Build + deploy pipeline: copy src/ → docs/dnsbhat/, generate nav, add front matter, Jekyll build, deploy |
| .claude/launch.json | Local preview server entry: bundle exec jekyll serve --livereload --baseurl "" |

Bugs encountered and fixed:

| Bug | Root cause | Fix |
|---|---|---|
| 404 on all pages | configure-pages@v5 step had no id: → base_path was empty → Jekyll built without baseurl → all links used /dnsbhat/... instead of /ettuge/dnsbhat/... | Added id: setup-pages to the step |
| Content files (en/kn/eke) returned 404 | -en.md, -kn.md, -kn-eke.md lacked Jekyll front matter → Jekyll treated them as static assets, never built HTML | Workflow step prepends ---\nnav_exclude: true\n--- to every .md missing front matter |
| Sidebar flooded with transcripts, prompts, metadata | Empty ---\n--- front matter made all files sidebar-visible | Changed injected front matter to nav_exclude: true |
| Accidentally committed docs/dnsbhat/ files | .gitignore pattern docs/dnsbhat/*/ only excluded subdirs, not root-level loose files | git rm --cached; added docs/dnsbhat/*.md to .gitignore, with an exception for the index page (!docs/dnsbhat/index.md) |
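
The front-matter injection described in the table could be written along these lines. The injected text is the one quoted above. The function names and the directory walk are assumptions, not the actual workflow step.

```python
from pathlib import Path

# Front matter that lets Jekyll build the page while hiding it from the sidebar.
INJECTED = "---\nnav_exclude: true\n---\n"


def ensure_front_matter(text):
    """Prepend the injected block unless the file already opens with front matter."""
    if text.startswith("---\n"):
        return text
    return INJECTED + text


def inject_all(root):
    """Rewrite every .md under root that lacks front matter; return the number changed."""
    changed = 0
    for path in Path(root).rglob("*.md"):
        original = path.read_text(encoding="utf-8")
        updated = ensure_front_matter(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Files that already carry front matter are returned unchanged, so the step is safe to run on every build.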

Format-first sidebar redesign:

The original sidebar showed books by number, and each book’s entry pointed to the raw README metadata — noisy and not useful for readers. Replaced with a format-oriented three-section sidebar generated by a Python script in the workflow:

English Summaries        (nav_order: 10)
  └── Book 02 — Hosapadagalannu Kattuva Bage
  └── Book 03 — Padagala Olarachane
  └── ...  (en.md preferred; blog.md fallback; YouTube transcript last resort)

Kannada Text             (nav_order: 20)
  └── Book 02, 03, 07, 08, 14, ...  (kn.md or book.md)

Eke Transliteration      (nav_order: 30)
  └── Book 02, 03, 04, ...  (kn-eke.md)

The Python script (inline python3 - <<'PYEOF' in the workflow):

  1. Creates three section index pages (english-summaries.md, kannada-text.md, eke-transliteration.md) with has_children: true
  2. For each book directory, finds the best available file per format (en > blog > fallback)
  3. Rewrites each content file’s front matter with parent: "English Summaries" (or Kannada Text / Eke Transliteration) and nav_order: {book_num} — placing it in the correct sidebar section
  4. Rewrites the book’s main landing page body as a clean summary: Kannada title, English title, description blockquote, status, and a links table — no more raw README metadata or YouTube TOC at top level
  5. Books with a dedicated en.md get nav_exclude: true on their landing page (the en.md IS the sidebar entry); transcript-only books get parent: "English Summaries" on their landing page (it IS the entry)
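
Steps 2 and 3 can be illustrated with a small sketch. The section names, the preference orders, and the parent / nav_order keys come from the text above. The dictionary layout, the kind labels ("en", "blog", "transcript", and so on), and the function names are assumptions rather than the inline workflow script.

```python
# Preference order per sidebar section, best format first.
PREFERENCE = {
    "English Summaries": ["en", "blog", "transcript"],
    "Kannada Text": ["kn", "book"],
    "Eke Transliteration": ["kn-eke"],
}


def pick_best(available, section):
    """Return the path of the best available file for a section, or None."""
    for kind in PREFERENCE[section]:
        if kind in available:
            return available[kind]
    return None


def section_front_matter(section, book_num):
    """Front matter that places a content file under its sidebar section."""
    return f'---\nparent: "{section}"\nnav_order: {book_num}\n---\n'
```

For example, a book with only a blog post and a transcript would be represented in English Summaries by the blog post, and would not appear under Eke Transliteration at all.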

Result: All 29 books appear in the English Summaries section; 16 appear in Kannada Text; 14 appear in Eke Transliteration. Book landing pages are clean reader-facing pages, not developer metadata dumps.


Phase 10 — Book 08 varSa OCR Correction (2026-03-14)

Targeted fix for a systematic OCR misread in book 08 (Kannadakke Mahaprana Yake Beda). The word ವರ್ಷ (“year”, Eke: varSa) was rendered in four wrong patterns by the OCR engine — corrected in both the Eke romanisation file and the structured Kannada file.

| Wrong form | Correct | Count |
|---|---|---|
| varragaL* / varradoLage | varSagaL* / varSadoLage | 6 |
| varmagaL* | varSagaL* | 3 |
| varyagaL* | varSagaL* | 2 |
| varka | varSa | 2 |

Files fixed: 08-kannaDakke-mahAprANa-yAke-bEDa-kn-eke.md (13 instances) and 08-kannaDakke-mahAprANa-yAke-bEDa-kn.md (matching Kannada script corrections: ವರ್ರ, ವರ್ಮ, ವರ್ಯ, ವರ್ಕ → ವರ್ಷ). No other books contained these patterns.
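
For the Eke file, the four wrong patterns could be covered by a few regexes, as in the sketch below. The wrong and correct forms come from the table above. The exact regexes, the list name, and the function name are assumptions.

```python
import re

VARSA_FIXES = [
    # varragaL*, varmagaL*, varyagaL* -> varSagaL*
    (re.compile(r"\bvar[rmy](?=agaL)"), "varS"),
    # varradoLage -> varSadoLage
    (re.compile(r"\bvarr(?=adoLage\b)"), "varS"),
    # standalone varka -> varSa
    (re.compile(r"\bvarka\b"), "varSa"),
]


def fix_varsa(text):
    """Rewrite the misread forms of varSa listed in the table."""
    for pattern, repl in VARSA_FIXES:
        text = pattern.sub(repl, text)
    return text
```

The lookaheads restrict each rule to the suffixes that were actually observed, so unrelated words beginning with "varm" or "vary" are not touched.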


Phase 9 — Eke Romanisation Bug-Fix Passes (2026-03-13)

Systematic correction of four romanisation errors propagated across all processed kn-eke.md files. Errors originated from LLM generation using HK-adjacent conventions instead of Eke rules.

| Pass | Error pattern | Correct Eke | Files | Instances |
|---|---|---|---|---|
| 1 — Anusvara (M) | M used for anusvāra (HK style) | Assimilated nasal+C: mb, mp, nk, ng, nc, nt, nd…never standalone M | 14 files | ~1,511 |
| 2 — N as anusvara | N (retroflex ಣ) used before stop consonants | n before stop consonants; N reserved exclusively for ಣ | 12 files | ~69 |
| 3 — R as ರ | R (exclusively ಱ) incorrectly used for common ರ | Lowercase r for all ರ; R kept exclusively for ಱ | 9 files | ~230 |
| 4 — Vocalic ṛ as ri/ru | kRi/sri/kri/kru used for ೃ/ಋ (vocalic ṛ) | Eke x (short ṛ = ಋ/ೃ); Eke X (long ṝ = ೠ/ೄ) | all files | ~400+ |
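
Pass 1 can be sketched for the clusters named in the table only. The mapping (m before p/b, n before k/g/c/t/d) is read off the listed examples. How M should be handled before other consonants is not stated here, so this sketch deliberately leaves those cases alone. The function name is an assumption.

```python
import re


def assimilate_anusvara(text):
    """Replace HK-style M with the assimilated nasal in the clusters listed above."""
    text = re.sub(r"M(?=[pb])", "m", text)      # mp, mb
    text = re.sub(r"M(?=[kgctd])", "n", text)   # nk, ng, nc, nt, nd
    return text
```

Any M that survives this function marks a context the table does not cover and would need a manual decision.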

Pass 4 specific word families fixed:

| Wrong | Correct | Kannada | Count |
|---|---|---|---|
| samskrita / samskruta / samskrta | samskxta | ಸಂಸ್ಕೃತ | ~400 |
| sriSTi | sxSTi | ಸೃಷ್ಟಿ | ~30 |
| driSTi | dxSTi | ದೃಷ್ಟಿ | ~10 |
| driSya | dxSya | ದೃಶ್ಯ | 3 |
| srijana | sxjana | ಸೃಜನ | 1 |
| mrita | mxta | ಮೃತ | 1 |
| tritIya | txtIya | ತೃತೀಯ | 2 |

Caution applied: Words where original Kannada genuinely has ಕ್ರ + ಉ/ಇ (consonant r + vowel) were not changed. Examples: krutaka (ಕ್ರುತಕ), krudanta, krullingagaLa, kriyA (ಕ್ರಿಯಾ) — all verified to use consonant r, not ೃ sign.
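
One way to respect this caution is to replace only known word families rather than every "ri"/"ru" sequence. The sketch below uses four of the families from the table above. The list name, function name, and regexes are assumptions.

```python
import re

# Word-family patterns only; there is no blanket "ri" -> "x" rule.
VOCALIC_R_FIXES = [
    (re.compile(r"samskr[iu]?ta"), "samskxta"),  # samskrita / samskruta / samskrta
    (re.compile(r"sriSTi"), "sxSTi"),
    (re.compile(r"driSTi"), "dxSTi"),
    (re.compile(r"driSya"), "dxSya"),
]


def fix_vocalic_r(text):
    """Rewrite listed word families to Eke x, leaving consonant-r words untouched."""
    for pattern, repl in VOCALIC_R_FIXES:
        text = pattern.sub(repl, text)
    return text
```

Words such as kriyA and krutaka pass through unchanged because they match none of the listed families.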

Skill files updated — both dns-bhat-book-summarizer/SKILL.md and dns-bhat-transcript-summarizer/SKILL.md vowel/consonant tables corrected with all four rules. ellara-kannada-word-coiner/SKILL.md and references/eke-romanization.md also updated.


Cross-link audit — systematic review of all processed books revealed inconsistent cross-linking in en.md files:

| Issue | Books affected | Fix applied |
|---|---|---|
| [ಕನ್ನಡ →] links present but no [Eke →] links | 02, 08, 14 | Appended \| [Eke →](kn-eke.md#anchor) to every existing section link (43 + 33 + 61 links) |
| No cross-links at all (bulk-processed books) | 07, 17, 25, 27, 28, 29 | Inserted [ಕನ್ನಡ →](book.md) \| [Eke →](kn-eke.md#anchor) after each chapter heading (60 total links across 6 files) |
| Broken links to non-existent kn.md | 03 | Retargeted 9 links to book.md (bare) + added [Eke →] with correct kn-eke.md anchors; fixed footer reference |

All en.md files now follow the Book 14 template: every chapter/section heading has a [ಕನ್ನಡ →] | [Eke →] cross-link pair on the line immediately below.

Book 15 kn-eke.md restructured — original file mixed analytical pattern sections (N+N compounds, suffix tables, etc.) derived from en.md work with source-text romanisation. Restructured to be a proper romanisation of the source text:

  • Removed: analytical kaTTaNe pattern tables (those belong in en.md)
  • Retained: Eke romanisation of actual munnuDi (preface), irusarikegaLu (conventions), and all A–Az dictionary entries with usage examples
  • Title updated to ingliS-kannaDa padanerake — Eke mUla to make the source-text nature explicit

Transcript books 04, 05, 09 fully processed — first systematic processing of YouTube-transcript-only books. Each had only a raw .md transcript and a website stub; now all three have a complete set of structured files:

| Book | Source Quality | en.md | kn-eke.md | claude-prompt.md |
|---|---|---|---|---|
| 04 — Mathu Matthu Barahada Gondala | ~25/44 parts (~57%) | ✅ 7 themes | ✅ key passages | ✅ AI primer |
| 05 — Mathina Olaguttu | ~27/37 parts (~73%) | ✅ 8 themes | ✅ key passages | ✅ AI primer |
| 09 — Havyaka Kannada | ~72/88 slots (~82%) | ✅ 5 themes | ✅ key passages | ✅ AI primer |

Methodology for transcript books: Unlike OCR books, transcript books have no continuous chapter text — only partial lecture recordings with gaps. The en.md files use a “Thematic Structure (Replacing Table of Contents)” format with coverage notes per section indicating which parts are readable vs. garbled. The kn-eke.md files extract and romanise the best available passages (rather than attempting whole-book coverage). The claude-prompt.md files follow the standard template adapted for transcript-quality sources, including explicit source limitation notes.