Phases 14–8 — OCR Cleanup: Books 07–29, English Summaries
2026-03 (earlier)
Phase 14 — Bulk OCR Cleanup: Books 03, 07, 17, 25, 27 (2026-03-17)
Character-level and structural cleanup of the five remaining uncleaned OCR books (six files, since book 07 has two volumes), producing a -kn.md file for each and regenerating all six -kn-eke.md files. All books now have 0 residual Kannada characters in their Eke output.
Two OCR error classes, two fix scripts:
| Book(s) | OCR source | Error type | Fix script |
|---|---|---|---|
| 03, 27 | Sarvam Vision OCR | Structural artifacts only (page numbers, --- separators, running headers) | fix_books_sarvam.py |
| 07, 17, 25 | Sarvam OCR + WX-decode | Character-level garbling + structural artifacts | fix_books_wx.py |
WX character-level errors fixed (books 07, 17, 25):
| Error class | Pattern | Scope |
|---|---|---|
| Arka-ottu reversal | ಣ್ರ→ರ್ಣ, ಥ್ರ→ರ್ಥ, ಮ್ರ→ರ್ಮ, ಯ್ರ→ರ್ಯ, ಧ್ರ→ರ್ಧ | All 3 books |
| Ç-fix (garbled aa-mathrā U+00C7) | \u0CC6\u00C7\u0CBF → \u0CCB (oo-sign); \u0CC6\u00C7 → \u0CCA (o-sign) | All 3 books |
| Ya-garble | 0iÉ (U+0030+U+0069+U+00C9) → ಯ; 159 occurrences | Book 17 only |
| Word-specific | ನಿದ್ರಿಷ್ಟ→ನಿರ್ದಿಷ್ಟ, ನಿದ್ರೇಶ→ನಿರ್ದೇಶ, ¦ÃpPÉ→ಪೀಠಿಕೆ, ದೀಘ್ರ→ದೀರ್ಘ | Book 25 only |
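The two Ç-fix patterns overlap: the shorter one is a prefix of the longer one, so replacement order matters. A minimal sketch of that ordering, assuming plain ordered string replacement (the internals of fix_books_wx.py are not shown here; `C_FIXES`, `YA_GARBLE` and `fix_wx_garble` are illustrative names):

```python
# Ordered replacements: the three-code-point pattern must run before the
# two-code-point pattern, because the shorter one is a prefix of the longer.
C_FIXES = [
    ("\u0cc6\u00c7\u0cbf", "\u0ccb"),  # e-sign + Ç + i-sign -> oo-sign
    ("\u0cc6\u00c7", "\u0cca"),        # e-sign + Ç          -> o-sign
]
YA_GARBLE = ("0i\u00c9", "\u0caf")     # "0iÉ" -> ಯ (needed for book 17 only)


def fix_wx_garble(text: str, fix_ya: bool = False) -> str:
    """Apply the Ç-fix, and optionally the ya-garble fix, to a string."""
    for wrong, right in C_FIXES:
        text = text.replace(wrong, right)
    if fix_ya:
        text = text.replace(*YA_GARBLE)
    return text
```

If the two-code-point rule ran first, it would consume the prefix of every three-code-point garble and leave a stray i-sign behind.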
Key decisions on safe vs. unsafe replacements:
- ದ್ರ→ರ್ದ blanket fix is UNSAFE: ದ್ರ is legitimate in ಕೇಂದ್ರ, ದ್ರಾವಿಡ, ಚಂದ್ರ etc.; targeted word-level fixes only
- ನ್ರ in ಏನ್ರಿ is legitimate colloquial Kannada, not an arka-ottu error
- ಧ್ರ in books 03/27 (Sarvam OCR) is legitimate (ಆಂಧ್ರ, ಉತ್ತರಧ್ರುವ); Sarvam OCR was correct
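The difference between a targeted and a blanket replacement can be sketched as follows. This is an illustration only: `WORD_FIXES` holds three of the book 25 word-level fixes listed above, and `apply_word_fixes` and `blanket_da_ra` are hypothetical names, not functions from the fix scripts.

```python
# Targeted fixes: whole garbled words mapped to their corrected forms.
WORD_FIXES = {
    "ನಿದ್ರಿಷ್ಟ": "ನಿರ್ದಿಷ್ಟ",
    "ನಿದ್ರೇಶ": "ನಿರ್ದೇಶ",
    "ದೀಘ್ರ": "ದೀರ್ಘ",
}

DA_RA = "\u0ca6\u0ccd\u0cb0"  # ದ + virama + ರ
RA_DA = "\u0cb0\u0ccd\u0ca6"  # ರ + virama + ದ


def apply_word_fixes(text: str) -> str:
    """Safe: only touches the known-garbled words."""
    for wrong, right in WORD_FIXES.items():
        text = text.replace(wrong, right)
    return text


def blanket_da_ra(text: str) -> str:
    """UNSAFE, shown for contrast: rewrites every ದ್ರ, including legitimate ones."""
    return text.replace(DA_RA, RA_DA)
```

Run on ಕೇಂದ್ರ, `apply_word_fixes` leaves the word alone while `blanket_da_ra` corrupts it, which is why only the word-level table was used.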
Multi-pass dependency fix: The ಸಾಮಥ್ಯ್ರ problem: ಯ್ರ→ರ್ಯ creates ಥ್ರ after ಥ್ರ→ರ್ಥ has already run. Solved by apply_char_fixes() iterating up to max_passes=3 until text is stable. A single pass was insufficient for this class of chained reversal.
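The chained reversal can be reproduced with a small fixed-point loop. This is a sketch under assumptions: only the name `apply_char_fixes` and `max_passes=3` come from the text above; the table construction and the rule order (the order listed for books 07, 17, 25) are illustrative.

```python
VIRAMA = "\u0ccd"  # ್
RA = "\u0cb0"      # ರ

# Arka-ottu reversal: consonant + virama + ra  ->  ra + virama + consonant.
REVERSED_CONSONANTS = ["\u0ca3", "\u0ca5", "\u0cae", "\u0caf", "\u0ca7"]  # ಣ ಥ ಮ ಯ ಧ
ARKA_OTTU_FIXES = [(c + VIRAMA + RA, RA + VIRAMA + c) for c in REVERSED_CONSONANTS]


def apply_char_fixes(text: str, max_passes: int = 3) -> str:
    """Re-run the replacement table until the text stops changing."""
    for _ in range(max_passes):
        fixed = text
        for wrong, right in ARKA_OTTU_FIXES:
            fixed = fixed.replace(wrong, right)
        if fixed == text:  # stable: nothing left to fix
            break
        text = fixed
    return text
```

For ಸಾಮಥ್ಯ್ರ, pass 1 applies ಯ್ರ→ರ್ಯ and produces ಸಾಮಥ್ರ್ಯ; pass 2 then sees the newly created ಥ್ರ and yields ಸಾಮರ್ಥ್ಯ; pass 3 confirms the text is stable.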
Critical structural insight: Char fixes must be applied to the entire file (including TOC, acknowledgements, index sections before the first <a id="adhyAya-N"> anchor), not only the body. An earlier version that split header/body first and fixed only the body left hundreds of errors in front matter and prefatory sections.
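The ordering constraint (fix first, split second) can be sketched like this. `fix_then_split` and `FIRST_ANCHOR` are hypothetical names, and the anchor regex is an assumption based on the `adhyAya-N` ids mentioned above.

```python
import re
from typing import Callable, Tuple

# Assumed shape of the first chapter anchor.
FIRST_ANCHOR = re.compile(r'<a id="adhyAya-\d+">')


def fix_then_split(raw: str, fix: Callable[[str], str]) -> Tuple[str, str]:
    """Apply character fixes to the whole file, then split header from body."""
    fixed = fix(raw)  # front matter, TOC and index are fixed too
    match = FIRST_ANCHOR.search(fixed)
    cut = match.start() if match else len(fixed)  # no anchor: everything is header
    return fixed[:cut], fixed[cut:]
```

Splitting before fixing would hand only the second half to `fix`, which is the failure mode described above.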
Results — book.md → kn.md line counts:
| Book | book.md | kn.md | Lines removed | Notes |
|---|---|---|---|---|
| 03 — Padagala Olarachane | 12,319 | 11,437 | 882 | Sarvam OCR; 653 structural lines deleted |
| 07 vol1 — Sollarime | 24,861 | 20,475 | 4,386 | WX; Ç-fix |
| 07 vol2 — Sollarime | 15,324 | 13,928 | 1,396 | WX; Ç-fix |
| 17 — Nudi Nadedu Banda Dari | 22,312 | 16,883 | 5,429 | WX; ya-fix + Ç-fix + 1,539 zero-Kannada lines removed |
| 25 — Vakyagala Olarachane | 14,485 | 11,676 | 2,809 | WX; word-level fixes + zero-Kannada lines removed |
| 27 — Baasheya Bagge | 9,138 | 8,245 | 893 | Sarvam OCR; 545 structural lines deleted |
kn-eke.md generation (generic gen_kn_eke.py):
A single generic transliterator replaced the book-28-specific kn_to_eke.py. Key fix over earlier stubs: <td> and <th> table cell content is now transliterated (matched by regex >[^<]*< inside HTML lines) rather than passed through verbatim. This eliminated 7,201 residual Kannada chars in book 03’s previous stub (book 03 is table-heavy). All 6 books now output 0 residual Kannada characters.
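A minimal sketch of the table-cell handling and the residual-character check, assuming the transliterator is passed in as a function. The real Eke transliterator in gen_kn_eke.py is not reproduced here; `transliterate_html_line` and `count_residual_kannada` are illustrative names.

```python
import re
from typing import Callable

KANNADA = re.compile(r"[\u0c80-\u0cff]")  # Kannada Unicode block
CELL_TEXT = re.compile(r">([^<]*)<")      # text between a tag's '>' and the next '<'


def count_residual_kannada(text: str) -> int:
    """Number of Kannada-block characters still present in the output."""
    return len(KANNADA.findall(text))


def transliterate_html_line(line: str, transliterate: Callable[[str], str]) -> str:
    """Transliterate only the text between tags; tag names and attributes stay as-is."""
    return CELL_TEXT.sub(lambda m: ">" + transliterate(m.group(1)) + "<", line)
```

Because the pattern only matches text between `>` and `<`, attribute values inside a tag are never passed to the transliterator.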
Commit: d8e037a “Phase 14: OCR cleanup + kn.md + kn-eke.md for books 03, 07, 17, 25, 27” — 12 files (6 new kn.md + 6 kn-eke.md regenerated)
Phase 13 — Book 28 OCR Deep-Clean + kn.md + Anchors (2026-03-17)
Full multi-pass OCR cleanup of book 28 (ಕನ್ನಡಕ್ಕೆ ಬೇಕು ಕನ್ನಡದ್ದೇ ವ್ಯಾಕರಣ), creation of a clean kn.md (both books 28 and 29), anchor scaffolding, and cross-link updates. Also documented the full OCR-cleaning methodology in a new kannada-ocr-cleaner skill.
OCR character-level fixes — book 28 (three-pass pipeline)
The raw -book.md OCR text used the legacy Nudi/KGP font encoding (WX-decoded to Unicode by wx_decode.py, but with residual character-level errors). Three targeted fix scripts ran in sequence:
| Script | What it fixed | Instances |
|---|---|---|
| `fix_arka_ottu.py` | arka-ottu reversals: ಣ್ರ→ರ್ಣ, ಥ್ರ→ರ್ಥ, ಮ್ರ→ರ್ಮ, ಯ್ರ→ರ್ಯ, ದೀಘ್ರ→ದೀರ್ಘ; word-specific ತ್ರ/ಧ್ರ cases; ಙõï/ÂÐ→ಙ್; ಬಹÅ→ಬಹು | ~500+ |
| `fix_residual_ocr.py` | Residual long-e ಯುೀ→ಯೇ (59×); archaic diphthong 0iÀiï→ಯ್ (2×); proper name ತಾರಾಪೆÇರೆವಾಲಾ→ತಾರಾಪೋರ್ವಾಲಾ; bibliography English garbling (ಖಿhe → The, ಆeಟhi → Delhi, etc.) | ~75 |
| `fix_page_fragments.py` | Structural: orphaned fragments before section headings (12 cases); standalone running chapter headers (8 cases: 4 with fragments, 4 without) | 96 lines deleted |
OCR structural artifacts (new class discovered)
The most interesting class of errors was purely structural, not character-level: page-break artifacts where the last words of a print page ended up as isolated blank-line-separated lines just before the next subsection heading. Example before fix:
ಹೇಳಿ ಕೊಡುವುದು ಹೇಗೆ ತಪ್ಪಾಗುತ್ತದೆಯೋ ಹಾಗೆಯೇ ಇದೂ ಕೂಡ.
ವ್ಯಾಕರಣವೆಂಬುದು
ನುಡಿಯಿಂದ
ನುದಿಗೆ
1.2 ವ್ಯಾಕರಣದ ಉದ್ದೇಶ
After fix: ಹೇಳಿ ಕೊಡುವುದು ಹೇಗೆ ತಪ್ಪಾಗುತ್ತದೆಯೋ ಹಾಗೆಯೇ ಇದೂ ಕೂಡ. ವ್ಯಾಕರಣವೆಂಬುದು ನುಡಿಯಿಂದ ನುದಿಗೆ
Running chapter headers (chapter titles printed at the top of each print page, OCR’d into the body) were deleted — the 12 chapter names (ಮುನ್ನೋಟ, ಸೇರಿಕೆಯ ನಿಯಮಗಳು, …, ಮುಕ್ತಾಯ) were built into a RUNNING_HEADERS set and matched exactly. The key detection rule: a line is an orphaned fragment only if preceded by a blank line — this guards against misidentifying normal wrapped paragraph lines (which are not preceded by blanks).
Total effect: 9,613 → 9,517 lines (96 lines removed, 16 modified).
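The detection rule can be sketched as below. This is a simplified illustration, not the contents of fix_page_fragments.py: the heading regex, the `max_words` threshold and the helper names are assumptions, and `RUNNING_HEADERS` holds only the two chapter names spelled out above.

```python
import re

# Assumed subsection-heading shape ("1.2 Title").
HEADING = re.compile(r"^\d+\.\d+\s")
# Two of the 12 chapter names; the rest are elided in the text above.
RUNNING_HEADERS = {"ಮುನ್ನೋಟ", "ಮುಕ್ತಾಯ"}


def _merge_orphans(out, max_words):
    """Glue trailing blank-separated short lines onto the paragraph before them."""
    fragments = []
    # A fragment is a short line with a blank line before AND after it.
    while (len(out) >= 3 and out[-1] == "" and out[-3] == ""
           and out[-2] != "" and not HEADING.match(out[-2])
           and len(out[-2].split()) <= max_words):
        fragments.append(out[-2])
        del out[-2:]  # drop the fragment and the blank after it
    if not fragments:
        return
    fragments.reverse()
    if len(out) >= 2 and out[-2] != "":
        out[-2] += " " + " ".join(fragments)
    else:  # no paragraph to attach to: put the fragments back
        for frag in fragments:
            out.extend([frag, ""])


def fix_page_fragments(lines, max_words=2):
    out = []
    for raw in lines:
        line = raw.rstrip()
        if line.strip() in RUNNING_HEADERS:
            continue  # exact-match running header: delete
        if HEADING.match(line):
            _merge_orphans(out, max_words)
        out.append(line)
    return out
```

A wrapped paragraph line is never merged, because the line before it is not blank; that is the guard described above.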
kn.md creation (books 28 + 29)
New kn.md files created from the cleaned OCR for both books, with:
- Jekyll front matter (nav_order, title, parent, redirect_from)
- `<a id="adhyAya-N"></a>` anchors before each `## N. ChapterTitle` heading
- 12 chapter anchors for book 28 (`adhyAya-1` through `adhyAya-12`)
- 11 chapter anchors for book 29 (`adhyAya-1` through `adhyAya-11`)
- Cross-navigation links: `[English →](en#chapter-N--...) | [Eke →](kn-eke#anchor)` before each chapter heading
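Anchor scaffolding of this kind can be sketched in a few lines. The function name and the heading regex are assumptions; only the `adhyAya-N` id format and the `## N. ChapterTitle` heading shape come from the list above.

```python
import re

CHAPTER_HEADING = re.compile(r"^## (\d+)\. ")


def add_chapter_anchors(lines):
    """Insert an adhyAya-N anchor line before each '## N. Title' heading."""
    out = []
    for line in lines:
        match = CHAPTER_HEADING.match(line)
        if match:
            anchor = f'<a id="adhyAya-{match.group(1)}"></a>'
            if not out or out[-1] != anchor:  # skip if the anchor is already there
                out.append(anchor)
        out.append(line)
    return out
```

The duplicate check makes the step safe to re-run on a file that already has its anchors directly above the headings.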
en.md anchors + cross-links (books 28 + 29)
Both -en.md files updated with:
- 13 `<a id="chapter-N--...">` anchors in book 28 (chapters 1–12 + key-terms-glossary)
- 12 `<a id="chapter-N--...">` anchors in book 29 (chapters 1–11 + key-terms-glossary)
- All `[ಕನ್ನಡ →]` links updated from bare `-book` targets to `-kn#adhyAya-N` fragment URLs
kannada-ocr-cleaner skill created
New Claude skill at .claude/skills/kannada-ocr-cleaner/SKILL.md documenting all four classes of error and the methodology:
- Class 1: Vowel-sign/consonant garbling (ಯ, ಯೇ, ಯ್, ಙ್, ಬಹು patterns)
- Class 2: Arka-ottu reversal (global-safe + word-specific replacements)
- Class 3: English text garbled through legacy font (bibliography, titles)
- Class 4: OCR page-break structural artifacts (orphaned fragments + running headers) — added in this phase, with the full three-pass fix script pattern
Phase 12 — mahAprana (Aspirate) Eke Correction (2026-03-16)
Corrected a systematic error in the Eke romanisation rule for aspirated consonants. All kn-eke.md files had been generated with the wrong rule “drop aspirates” (ಭ→b, ಧ→d, ಖ→k, ಥ→t, ಫ→p). The correct rule is “preserve aspirates with h marker” (ಭ→bh, ಧ→dh, ಖ→kh, ಥ→th, ಫ→ph).
The guiding principle: Eke romanises what is written in the source. If the source uses ಭ (aspirated labial), the romanisation must write bh, not b — to faithfully represent what DNS Bhat wrote. (Separately, the word-coining philosophy of ellara kannaDa avoids mahapranas in new coinages, since they don’t occur in native Dravidian speech — but that’s a word-formation rule, not a transcription rule.)
Specific fixes applied (Python re.sub + replace):
| Wrong Eke | Correct Eke | Source consonant | Instances |
|---|---|---|---|
| `bAShe` | `bhAShe` | ಭ (ಭಾಷೆ) | ~18 files |
| `bAga` | `bhAga` | ಭ (ಭಾಗ) | multiple |
| `sankara baT` | `sankara bhaT` | ಭ (ಭಟ್) | all files |
| `adyAya` | `adhyAya` | ಧ (ಅಧ್ಯಾಯ) | multiple |
| `sambanda` | `sambandha` | ಧ | multiple |
| `\badika` | `adhika` | ಧ | multiple |
| `\bmukya` | `mukhya` | ಖ (ಮುಖ್ಯ) | multiple |
| `leKana` | `lEkhana` | ಖ (ಲೇಖನ) | multiple |
| `lEkakaru` | `lEkhakaru` | ಖ (ಲೇಖಕರು) | multiple |
| `\barta` | `artha` | ಥ (ಅರ್ಥ) | multiple |
| `\bpATa` | `pATha` | ಠ (ಪಾಠ) | multiple |
| `bAgya` | `bhAgya` | ಭ (ಭಾಗ್ಯ) | 1 |
| `dIrga` | `dIrgha` | ಘ (ದೀರ್ಘ) | 2 |
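A few rows of the table expressed as `re.sub` calls, as a sketch only. The pattern list is a subset, and `ASPIRATE_FIXES` and `fix_aspirates` are illustrative names rather than the actual script. The `\b` prefix restricts a pattern to the start of a word, so that for example `barta` is not rewritten by the `arta` rule.

```python
import re

# (pattern, replacement) pairs taken from the table above; subset only.
ASPIRATE_FIXES = [
    (r"bAShe", "bhAShe"),
    (r"adyAya", "adhyAya"),
    (r"\badika", "adhika"),
    (r"\bmukya", "mukhya"),
    (r"\barta", "artha"),
    (r"\bpATa", "pATha"),
]


def fix_aspirates(text: str) -> str:
    """Restore the h marker in known word families."""
    for pattern, replacement in ASPIRATE_FIXES:
        text = re.sub(pattern, replacement, text)
    return text
```

Each corrected form no longer matches its own pattern, so a second run over already-fixed text changes nothing.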
Commit: 907ac31 “eke: fix all mahAprANa (aspirate) romanization” — 22 files, 398 insertions/deletions
Skill files and PROJECT-RECAP also updated in this phase to reflect the corrected rule.
Phase 11 — GitHub Pages / Jekyll Deployment (2026-03-14–15)
Set up a public-facing website at https://vwulf.github.io/ettuge/ using Jekyll with the just-the-docs theme, served via GitHub Pages. The entire pipeline is in .github/workflows/pages.yml — no pre-generated files are checked in; everything is built fresh from src/ on every push to master.
Infrastructure built:
| Component | Details |
|---|---|
| `docs/_config.yml` | theme: just-the-docs, color_scheme: light, search_enabled: true, nav_fold: true, defaults: layout: default |
| `docs/Gemfile` | `gem "just-the-docs"` pinned for GitHub Actions |
| `.github/workflows/pages.yml` | Build + deploy pipeline: copy src/ → docs/dnsbhat/, generate nav, add front matter, Jekyll build, deploy |
| `.claude/launch.json` | Local preview server entry: `bundle exec jekyll serve --livereload --baseurl ""` |
Bugs encountered and fixed:
| Bug | Root cause | Fix |
|---|---|---|
| 404 on all pages | configure-pages@v5 step had no id: → base_path was empty → Jekyll built without baseurl → all links used /dnsbhat/... instead of /ettuge/dnsbhat/... | Added id: setup-pages to the step |
| Content files (en/kn/eke) returned 404 | -en.md, -kn.md, -kn-eke.md lacked Jekyll front matter → Jekyll treated them as static assets, never built HTML | Workflow step prepends ---\nnav_exclude: true\n--- to every .md missing front matter |
| Sidebar flooded with transcripts, prompts, metadata | Empty ---\n--- front matter made all files sidebar-visible | Changed injected front matter to nav_exclude: true |
| Accidentally committed `docs/dnsbhat/` files | .gitignore pattern `docs/dnsbhat/*/` only excluded subdirs, not root-level loose files | `git rm --cached`; added a `docs/dnsbhat/*.md` ignore pattern to .gitignore, with an exception for the index page (`!docs/dnsbhat/index.md`) |
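The front-matter injection from the second and third rows can be sketched as a small function. The workflow performs this step inline; `ensure_front_matter` is a hypothetical name, and treating a leading `---` line as "already has front matter" is an assumption.

```python
FRONT_MATTER = "---\nnav_exclude: true\n---\n"


def ensure_front_matter(text: str) -> str:
    """Prepend minimal front matter so Jekyll renders the file but hides it from the sidebar."""
    if text.startswith("---\n"):  # assumed test for existing front matter
        return text
    return FRONT_MATTER + text
```

Injecting `nav_exclude: true` rather than an empty block is what keeps transcripts, prompts and metadata out of the sidebar.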
Format-first sidebar redesign:
The original sidebar showed books by number, and each book’s entry pointed to the raw README metadata — noisy and not useful for readers. Replaced with a format-oriented three-section sidebar generated by a Python script in the workflow:
English Summaries (nav_order: 10)
└── Book 02 — Hosapadagalannu Kattuva Bage
└── Book 03 — Padagala Olarachane
└── ... (en.md preferred; blog.md fallback; YouTube transcript last resort)
Kannada Text (nav_order: 20)
└── Book 02, 03, 07, 08, 14, ... (kn.md or book.md)
Eke Transliteration (nav_order: 30)
└── Book 02, 03, 04, ... (kn-eke.md)
The Python script (inline python3 - <<'PYEOF' in the workflow):
- Creates three section index pages (`english-summaries.md`, `kannada-text.md`, `eke-transliteration.md`) with `has_children: true`
- For each book directory, finds the best available file per format (en > blog > fallback)
- Rewrites each content file’s front matter with `parent: "English Summaries"` (or Kannada Text / Eke Transliteration) and `nav_order: {book_num}`, placing it in the correct sidebar section
- Rewrites the book’s main landing page body as a clean summary: Kannada title, English title, description blockquote, status, and a links table, with no raw README metadata or YouTube TOC at top level
- Books with a dedicated `en.md` get `nav_exclude: true` on their landing page (the en.md IS the sidebar entry); transcript-only books get `parent: "English Summaries"` on their landing page (it IS the entry)
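The file-selection and front-matter steps above can be sketched as two helpers. This is not the inline workflow script: the function names and the `-en.md` / `-blog.md` suffixes are assumptions made for illustration.

```python
from typing import Iterable, Optional, Sequence


def pick_english_source(filenames: Iterable[str],
                        preference: Sequence[str] = ("-en.md", "-blog.md")) -> Optional[str]:
    """Return the first file matching the preference order, or None for the fallback case."""
    names = sorted(filenames)
    for suffix in preference:
        for name in names:
            if name.endswith(suffix):
                return name
    return None  # caller falls back to the transcript landing page


def content_front_matter(section: str, book_num: int) -> str:
    """Front matter that places a content file under one sidebar section."""
    return f'---\nparent: "{section}"\nnav_order: {book_num}\n---\n'
```

For a directory holding both a blog file and an en file, the en file wins; a transcript-only directory returns `None`.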
Result: All 29 books appear in the English Summaries section; 16 appear in Kannada Text; 14 appear in Eke Transliteration. Book landing pages are clean reader-facing pages, not developer metadata dumps.
Phase 10 — Book 08 varSa OCR Correction (2026-03-14)
Targeted fix for a systematic OCR misread in book 08 (Kannadakke Mahaprana Yake Beda). The word ವರ್ಷ (“year”, Eke: varSa) was rendered in four wrong patterns by the OCR engine — corrected in both the Eke romanisation file and the structured Kannada file.
| Wrong form | Correct | Count |
|---|---|---|
| `varragaL*` / `varradoLage` | `varSagaL*` / `varSadoLage` | 6 |
| `varmagaL*` | `varSagaL*` | 3 |
| `varyagaL*` | `varSagaL*` | 2 |
| `varka` | `varSa` | 2 |
Files fixed: 08-kannaDakke-mahAprANa-yAke-bEDa-kn-eke.md (13 instances) and 08-kannaDakke-mahAprANa-yAke-bEDa-kn.md (matching Kannada script corrections: ವರ್ರ, ವರ್ಮ, ವರ್ಯ, ವರ್ಕ → ವರ್ಷ). No other books contained these patterns.
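A sketch of the Eke-side fix that also tallies replacements, which is how a per-file instance count could be checked. The patterns are assumptions reconstructed from the table (in particular the word boundaries around `varka`), and `fix_varsa` is a hypothetical name.

```python
import re

# Lookaheads keep the fix tied to the suffixes seen in the table,
# so an unrelated word such as the surname "varma" is left alone.
VARSA_FIXES = [
    (re.compile(r"varr(?=agaL|adoLage)"), "varS"),
    (re.compile(r"var[my](?=agaL)"), "varS"),
    (re.compile(r"\bvarka\b"), "varSa"),  # word boundaries are an assumption
]


def fix_varsa(text: str):
    """Return (fixed_text, number_of_replacements)."""
    total = 0
    for pattern, replacement in VARSA_FIXES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total
```

`re.subn` returns the replacement count alongside the new text, so the total can be compared against the expected number of instances.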
Phase 9 — Eke Romanisation Bug-Fix Passes (2026-03-13)
Systematic correction of four romanisation errors propagated across all processed kn-eke.md files. Errors originated from LLM generation using HK-adjacent conventions instead of Eke rules.
| Pass | Error pattern | Correct Eke | Files | Instances |
|---|---|---|---|---|
| 1 — Anusvara (M) | M used for anusvāra (HK style) | Assimilated nasal+C: mb, mp, nk, ng, nc, nt, nd… — never standalone M | 14 files | ~1,511 |
| 2 — N as anusvara | N (retroflex ಣ) used before stop consonants | n before stop consonants; N reserved exclusively for ಣ | 12 files | ~69 |
| 3 — R as ರ | R (exclusively ಱ) incorrectly used for common ರ | Lowercase r for all ರ; R kept exclusively for ಱ | 9 files | ~230 |
| 4 — Vocalic ṛ as ri/ru | kRi/sri/kri/kru used for ೃ/ಋ (vocalic ṛ) | Eke x (short ṛ = ಋ/ೃ); Eke X (long ṝ = ೠ/ೄ) | all files | ~400+ |
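Pass 1 can be sketched as two lookahead substitutions. This covers only the clusters named in the table (mb, mp, nk, ng, nc, nt, nd); an `M` in any other context is left untouched here, and `assimilate_anusvara` is an illustrative name, not the script that was run.

```python
import re


def assimilate_anusvara(text: str) -> str:
    """Replace HK-style M with the nasal that matches the following stop."""
    text = re.sub(r"M(?=[pb])", "m", text)     # mp, mb
    text = re.sub(r"M(?=[kgctd])", "n", text)  # nk, ng, nc, nt, nd
    return text
```

The lookahead inspects the next consonant without consuming it, so only the `M` itself is rewritten.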
Pass 4 specific word families fixed:
| Wrong | Correct | Kannada | Count |
|---|---|---|---|
| `samskrita` / `samskruta` / `samskrta` | `samskxta` | ಸಂಸ್ಕೃತ | ~400 |
| `sriSTi` | `sxSTi` | ಸೃಷ್ಟಿ | ~30 |
| `driSTi` | `dxSTi` | ದೃಷ್ಟಿ | ~10 |
| `driSya` | `dxSya` | ದೃಶ್ಯ | 3 |
| `srijana` | `sxjana` | ಸೃಜನ | 1 |
| `mrita` | `mxta` | ಮೃತ | 1 |
| `tritIya` | `txtIya` | ತೃತೀಯ | 2 |
Caution applied: Words where original Kannada genuinely has ಕ್ರ + ಉ/ಇ (consonant r + vowel) were not changed. Examples: krutaka (ಕ್ರುತಕ), krudanta, krullingagaLa, kriyA (ಕ್ರಿಯಾ) — all verified to use consonant r, not ೃ sign.
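Because `kri`/`kru` can be either vocalic ṛ or a genuine consonant r plus vowel, the fix has to work from a word-family list rather than a blanket rule. A sketch using the families from the table above (`VOCALIC_R_FIXES` and `fix_vocalic_r` are illustrative names):

```python
import re

# Word-family patterns only; no generic "ri -> x" rule, so kriyA, krutaka etc. survive.
VOCALIC_R_FIXES = [
    (re.compile(r"samskr[iu]?ta"), "samskxta"),  # samskrita / samskruta / samskrta
    (re.compile(r"sriSTi"), "sxSTi"),
    (re.compile(r"driSTi"), "dxSTi"),
    (re.compile(r"driSya"), "dxSya"),
    (re.compile(r"srijana"), "sxjana"),
    (re.compile(r"mrita"), "mxta"),
    (re.compile(r"tritIya"), "txtIya"),
]


def fix_vocalic_r(text: str) -> str:
    for pattern, replacement in VOCALIC_R_FIXES:
        text = pattern.sub(replacement, text)
    return text
```

Words outside the list, such as `kriyA` and `krutaka`, pass through unchanged, matching the caution described above.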
Skill files updated — both dns-bhat-book-summarizer/SKILL.md and dns-bhat-transcript-summarizer/SKILL.md vowel/consonant tables corrected with all four rules. ellara-kannada-word-coiner/SKILL.md and references/eke-romanization.md also updated.
Phase 8 — Cross-link Audit & Fixes; Book 15 kn-eke Restructure; Transcript Book Processing (2026-03-13)
Cross-link audit — systematic review of all processed books revealed inconsistent cross-linking in en.md files:
| Issue | Books affected | Fix applied |
|---|---|---|
| `[ಕನ್ನಡ →]` links present but no `[Eke →]` links | 02, 08, 14 | Appended `\| [Eke →](kn-eke.md#anchor)` to every existing section link (43 + 33 + 61 links) |
| No cross-links at all (bulk-processed books) | 07, 17, 25, 27, 28, 29 | Inserted `[ಕನ್ನಡ →](book.md) \| [Eke →](kn-eke.md#anchor)` after each chapter heading (60 total links across 6 files) |
| Broken links to non-existent `kn.md` | 03 | Retargeted 9 links to `book.md` (bare) + added `[Eke →]` with correct `kn-eke.md` anchors; fixed footer reference |
All en.md files now follow the Book 14 template: every chapter/section heading has a [ಕನ್ನಡ →] | [Eke →] cross-link pair on the line immediately below.
Book 15 kn-eke.md restructured — original file mixed analytical pattern sections (N+N compounds, suffix tables, etc.) derived from en.md work with source-text romanisation. Restructured to be a proper romanisation of the source text:
- Removed: analytical kaTTaNe pattern tables (those belong in `en.md`)
- Retained: Eke romanisation of actual munnuDi (preface), irusarikegaLu (conventions), and all A–Az dictionary entries with usage examples
- Title updated to `ingliS-kannaDa padanerake — Eke mUla` to make the source-text nature explicit
Transcript books 04, 05, 09 fully processed — first systematic processing of YouTube-transcript-only books. Each had only a raw .md transcript and a website stub; now all three have a complete set of structured files:
| Book | Source Quality | en.md | kn-eke.md | claude-prompt.md |
|---|---|---|---|---|
| 04 — Mathu Matthu Barahada Gondala | ~25/44 parts (~57%) | ✅ 7 themes | ✅ key passages | ✅ AI primer |
| 05 — Mathina Olaguttu | ~27/37 parts (~73%) | ✅ 8 themes | ✅ key passages | ✅ AI primer |
| 09 — Havyaka Kannada | ~72/88 slots (~82%) | ✅ 5 themes | ✅ key passages | ✅ AI primer |
Methodology for transcript books: Unlike OCR books, transcript books have no continuous chapter text — only partial lecture recordings with gaps. The en.md files use a “Thematic Structure (Replacing Table of Contents)” format with coverage notes per section indicating which parts are readable vs. garbled. The kn-eke.md files extract and romanise the best available passages (rather than attempting whole-book coverage). The claude-prompt.md files follow the standard template adapted for transcript-quality sources, including explicit source limitation notes.