Phases 27–20 — Book 31 Decode, PDFs, Taxonomy Migration, Transcript Fixes
2026-03-20 to 2026-03-23
Phase 27 — Book 31 Full CID Decode: Baraha Dictionary (2026-03-23)
Book 31 — ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಕನ್ನಡದ್ದೇ ಪದಗಳು (487pp, A–Z dictionary) — all 12,121 English headwords fully decoded. The initial Phase 24 full.md had 5,411 garbled + 495 partial entries because the source PDF used an embedded Baraha font with CID-encoded glyphs.
The CID decode problem: pdfplumber could not map glyph IDs to Unicode — it emitted (cid:N) tokens instead of characters. Standard Nudi/WX pipelines didn’t apply here because the font’s codepoint mapping was Baraha cp1252, not WX.
Four key discoveries:
| Discovery | Detail |
|---|---|
| Three-range CID offset rule | CIDs 1–96: offset +31 (ASCII); CIDs 97–114: offset +57 (0xA0–0xAB Baraha high zone); CIDs 115+: offset +61 (0xB2–0xFF Baraha extended). Boundary at 114 (not 113 as initially assumed). |
| CID 114 → « = ವಿ | With offset +57: byte 171 = 0xAB = «, which maps to ವಿ in the wx_decode MAPPING. Critical for locative/instrumental forms. Proved by abet → ಕಳವಿನಲ್ಲಿ. |
| ± (0xB1) as ಲ, not ಶ | In this PDF font, CID 116 decodes to ± but is used as the ಲ base (3,196 of 3,207 occurrences). Only 12 are genuine ಶ (patterns ±À = ಶ, ±É = ಶೆ). Fix: `re.sub(r'±(?![ÀÉ])', '®', text)`. |
| ³ (0xB3) = ಲ್ಲಿ trigger | CID 118 produces ³ (118 + 61 = 179 = 0xB3); ³è sequences encode the locative suffix. Fix: replace ³ with ° before wx_decode; the VATTAKSHARA rearrangement then produces ಲ + ್ + ಲ + ಿ = ಲ್ಲಿ. Affects 580 locative forms. |
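The three-range offset rule in the table can be sketched as a small token decoder. This is a minimal illustration, not the contents of the batch script: the names `cid_to_char` and `decode_cid_tokens` are hypothetical, and decoding the resulting byte as cp1252 with `errors="replace"` is an assumption based on the "Baraha cp1252" description above.

```python
import re

# Matches the (cid:N) tokens that pdfplumber emits for unmapped glyphs.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_to_char(cid: int) -> str:
    """Map one CID to a character using the three-range offset rule."""
    if cid < 1:
        return f"(cid:{cid})"      # not covered by the rule: keep the token
    if cid <= 96:
        byte = cid + 31            # ASCII range
    elif cid <= 114:
        byte = cid + 57            # Baraha high zone; boundary is 114, not 113
    else:
        byte = cid + 61            # Baraha extended zone
    if byte > 255:
        return f"(cid:{cid})"      # outside a single byte: keep the token
    # Assumption: a few cp1252 bytes are undefined, so replace rather than raise.
    return bytes([byte]).decode("cp1252", errors="replace")


def decode_cid_tokens(text: str) -> str:
    """Replace every (cid:N) token in text with its decoded character."""
    return CID_TOKEN.sub(lambda m: cid_to_char(int(m.group(1))), text)
```

With this sketch, `cid_to_char(114)` yields `«` and `cid_to_char(116)` yields `±`, matching the table rows; the output still has to pass through the preprocess substitutions and wx_decode before it is readable Kannada.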
Additional preprocess fixes:
- `¯À` → `®` (standalone ಲ; the `¯` + `À` pair was not in MAPPING)
- `¯` before non-combo chars → `®` (handles all unmapped `¯` uses)
- `\xad` (soft hyphen) → `©` (ಬಿ single-glyph form)
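The substitutions described so far can be collected into one preprocess step that runs before wx_decode. The sketch below is illustrative only: the function name `preprocess` and the order of the steps are assumptions, and the rule for `¯` before non-combo characters is left out because the set of combo characters is not listed here.

```python
import re


def preprocess(text: str) -> str:
    """Apply the Book 31 byte-level fixes before wx_decode (sketch)."""
    # Standalone ಲ: the ¯ + À pair has no MAPPING entry of its own.
    text = text.replace("¯À", "®")
    # ± is the ಲ base unless followed by À or É (the genuine ಶ patterns).
    text = re.sub(r"±(?![ÀÉ])", "®", text)
    # ³ becomes ° so the later rearrangement can build ಲ್ಲಿ.
    text = text.replace("³", "°")
    # Soft hyphen stands for the single-glyph ಬಿ form.
    text = text.replace("\xad", "©")
    return text
```

The negative lookahead keeps `±À` and `±É` intact while rewriting every other `±`, including one at the end of the string.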
Result: 12,121 entries decoded; 2 remain garbled (the OCR-artifact headwords `wAr)` and `XYZ`, which are spurious, not real dictionary entries). All genuine dictionary entries are fully readable Kannada.
Batch script: /tmp/decode_book31_batch.py — full decode pipeline with three-range CID rule, preprocess substitutions, wx_decode.convert_chunk(), postprocess cleanup, entry parser, and full.md generator.
Commit: 4751e65 — feat(book31): fully decode Baraha CID dictionary — Phase 27
Phase 26 — Sollarime Vol.3 + Vol.4 from PDF (2026-03-23)
Book 07 — ಕನ್ನಡ ಬರಹದ ಸೊಲ್ಲರಿಮೆ vols 3 and 4 (syntax volumes) extracted from PDF and processed.
Source: Google Drive PDFs — vol3 (241pp), vol4 (274pp). Both use Baraha/WX encoding; decoded via wx_decode.py pipeline.
Vol3 — ವಾಕ್ಯರಚನೆ ಭಾಗ ೧ (Chapters 7–8):
- Chapter 7: ಎಸಕಪದದ ಪಾಂಗುಗಳು (Verbal Arguments) — verb valency, transitivity, argument frames
- Chapter 8: ಪಾಂಗಿಟ್ಟಳದಲ್ಲಿ ಮಾರ್ಪಾಡುಗಳು (Argument Frame Alternations) — passivisation, causativisation
Vol4 — ವಾಕ್ಯರಚನೆ ಭಾಗ ೨ (Chapters 9–10):
- Chapter 9: ಆಡುಪದಗಳು (Personal Pronouns) — person/number/gender in Kannada pronoun system
- Chapter 10: ತೋರುಪದಗಳು (Demonstratives) — proximal/distal deixis, anaphora
Files produced per volume:
- `book/volN/kn/raw.md`: extracted OCR source
- `book/volN/kn/full.md`: structured Kannada with TOC and `<a id="adhyAya-N">` anchors
- `book/volN/eke/full.md`: Eke romanisation
- `book/volN/en/summary.md`: English chapter summary
Multi-volume index book/kn/full.md updated: vol3 and vol4 entries added; vol1/vol2 entries corrected; vols 5–7 marked as pending.
Commits: (see git log)
Phase 25 — YouTube Summaries for Books 01–13 + Book 33 Split (2026-03-22)
Motivation: Two tasks: (1) provide English overviews for the 6 YouTube-only books that lacked any English entry; (2) correctly separate the two distinct sollarime works that had been conflated in Book 07’s folder.
25a — YouTube English summaries for books 01, 06, 10, 11, 12, 13
Six books had YouTube transcripts (or stubs) but no English summary. Created youtube/en/summary.md + updated claude-prompt.md for each:
| Book | Title | Quality | Notes |
|---|---|---|---|
| 01 | idu kannaDaddE vyAkaraNa | ✅ Excellent | Malati Bhat reading; 3 parts; full 19-chapter TOC read aloud in Part 1 |
| 06 | kalikenuDi mattu nuDikalike | ⚠️ Garbled | Live-lecture ASR; keyword signals only |
| 10 | kannaDa nuDiya hinnaDavaLi | ⚠️ Garbled | Live-lecture ASR; content drawn from website description |
| 11 | kannaDa barahada padasamasye | ⚠️ Garbled | 25-part lecture; may be earlier monograph later condensed into Book 30 |
| 12 | kannaDa bhASheya kalpita caritre | ✅ Good | Malati Bhat reading; 2 parts; comparative method section well-preserved |
| 13 | dArege doDDavaru | 🎓 Symposium | Third-person references to Bhat; panels praising his work — felicitation proceedings, not Bhat reading his own book |
All 6 books now have youtube/en/summary.md with explicit quality assessment and part-level availability table. claude-prompt.md updated to include the new English file.
Commits: (see git log)
25b — Book 33 split from Book 07 (ಕನ್ನಡ ಸೊಲ್ಲರಿಮೆ vs ಕನ್ನಡ ಬರಹದ ಸೊಲ್ಲರಿಮೆ)
Discovery: The youtube/kn/full.md inside Book 07’s folder was titled KANNADA SOLLARIME (no “baraha”) — a different, shorter work than Book 07’s KANNADA BARAHADA SOLLARIME (7-volume grammar of written Kannada). The Book 07 README had carried a ⚠️ YouTube note since Phase 22 flagging the mismatch.
Action:
- `git mv` Book 07’s `youtube/kn/full.md` → new folder `33-kannaDa-sollarime/youtube/kn/full.md`
- `nav_order` updated 107→133; `redirect_from: /dnsbhat/07-kannadada-sollarime/07-kannadada-sollarime` preserved for URL continuity
- New Book 33 folder: `README.md`, `claude-prompt.md`, `youtube/en/summary.md` created
- Book 07 `README.md` updated: YouTube row removed, clean cross-reference link to Book 33 added
- `dnsbhat/README.md`: Book 33 entry added (Section M: YouTube-Only Book Split from Book 07), collection stats updated 32→33
Identity note: ಕನ್ನಡ ಸೊಲ್ಲರಿಮೆ (no “baraha”) may be an earlier or more general grammar covering both spoken and written Kannada — the omission of “baraha” (writing) distinguishes it from Book 07 (ಕನ್ನಡ ಬರಹದ ಸೊಲ್ಲರಿಮೆ, 7-volume grammar of written Kannada). The precise relationship cannot be determined without a PDF.
Stats after Phase 25: All 33 DNS Bhat books catalogued; all 33 have claude-prompt.md.
Commits: (see git log)
Phase 24 — Books 30–32 Added from PDF Extraction (2026-03-22)
Motivation: Three additional DNS Bhat works identified from Google Drive PDFs not previously catalogued.
Book 30 — ಕನ್ನಡ ಬರಹವನ್ನು ಸರಿಪಡಿಸೋಣ (382 pages, Nudi encoding)
Full 4-file set produced via wx_decode.py (Nudi→Unicode lookup, same pipeline as Books 07/17/25/28/29):
- `book/kn/raw.md`: decoded Kannada source (Nudi WX → Unicode)
- `book/kn/full.md`: structured Kannada with 10-chapter TOC and `<a id="adhyAya-N">` anchors
- `book/en/summary.md`: English chapter-by-chapter summary with cross-links
- `book/eke/full.md`: Eke romanisation matching full.md structure
- `claude-prompt.md`: AI context primer
Subject: Practical guide to correcting Kannada script usage — aspirate simplification, anusvara normalisation, vowel-length marking. Arguments for hosa baraha (simplified writing) applied to print.
Book 31 — ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಕನ್ನಡದ್ದೇ ಪದಗಳು (487 pages, Nudi encoding, A–Z dictionary)
Partial processing — Nudi WX decode successful but English headword reconstruction partially garbled:
- `book/kn/raw.md`: decoded Kannada source
- `book/en/summary.md`: English summary (headwords + overview; garbling noted)
- `claude-prompt.md`
Subject: Comprehensive dictionary of native Kannada equivalents for English words (A–Z), structured as a lexical resource. Companion to Book 15 (ingliS-kannaDa padanerake).
Book 32 — The Prominence of Tense, Aspect and Mood (214 pages, clean English, John Benjamins)
Clean English PDF — pdfplumber extraction; no Nudi decoding required:
- `book/en/summary.md`: chapter summary
- `claude-prompt.md`
Subject: Cross-linguistic typological monograph (English) on the TAM hierarchy across world languages. Bhat’s contribution to general linguistic theory; Kannada-informed but not Kannada-specific.
Stats after Phase 24: 32 books catalogued (sections A–L); all 32 have claude-prompt.md.
Commit: (see git log)
Phase 23 — Blog Sidebar Fix + Stubs Category + YouTube Transcript Cross-References (2026-03-21)
Motivation: Three separate improvements triggered by an audit of which YouTube transcripts could be enriched using existing high-quality sources (book/kn/full.md or web/kn/raw.md).
23a — Blog sidebar fallback fix
CANDIDATES[('web', 'kn')] in pages.yml was ['web/kn/full.md'] only. Books 14 and 18, which have web/kn/raw.md but no full.md, were missing from the Blog Kannada sidebar. Added 'web/kn/raw.md' as a fallback candidate.
Commit: 7678364
23b — Stubs sidebar category
16 of 29 books have youtube/kn/full.md files that are placeholder stubs (no ## Part or ### Part headers — just a description or link list). These were cluttering the YouTube sidebar.
Added a Stubs top-level sidebar section (nav_order: 40) with automatic reclassification: the is_youtube_stub() helper in pages.yml detects stub files and moves them from ('youtube', 'kn') to ('stub', 'kn') in Pass 1, dropping their associated YouTube en/eke entries. A dedicated Stubs section renders them separately.
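The stub test reduces to checking for Part headers. The following is a sketch under assumptions: the real `is_youtube_stub()` may take a path rather than text, and the exact header pattern it checks is not shown in this log.

```python
import re

# A real transcript has "## Part N" or "### Part N" headers; a stub has neither.
PART_HEADER = re.compile(r"^#{2,3} Part\b", re.MULTILINE)


def is_youtube_stub(markdown_text: str) -> bool:
    """Return True when the file has no level-2 or level-3 Part header."""
    return PART_HEADER.search(markdown_text) is None
```

A file containing only a title and a link list is classified as a stub, while any file with at least one `## Part` or `### Part` line stays in the YouTube section.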
Commit: 33d0247
23c — YouTube transcript cross-references (Books 02 and 03)
Observation: Books 02 and 03 have both a YouTube transcript file and a high-quality Kannada source (web blog for Book 02, scanned book for Book 03). Keyword-overlap analysis (TF-IDF-style, 5+ char Kannada words, overlap score ratio) identified which YouTube Parts correspond to which source sections.
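The matching step can be approximated with a plain keyword-set overlap. This sketch is a simplification: it counts shared Kannada words of 5+ code points without any IDF weighting, and it assumes the "ratio" compares the best section score with the runner-up (the 1.4 threshold is the one quoted in the results for this phase). The names `keywords` and `best_section` are hypothetical.

```python
import re
from typing import List, Optional, Set

# Runs of 5 or more code points from the Kannada Unicode block.
KANNADA_WORD = re.compile(r"[\u0C80-\u0CFF]{5,}")


def keywords(text: str) -> Set[str]:
    """Collect the distinct long Kannada words in text."""
    return set(KANNADA_WORD.findall(text))


def best_section(part_text: str, sections: List[str],
                 min_ratio: float = 1.4) -> Optional[int]:
    """Return the index of the clearly best-matching section, or None."""
    part_words = keywords(part_text)
    scores = [len(part_words & keywords(s)) for s in sections]
    if not scores or max(scores) == 0:
        return None                      # nothing in common
    ranked = sorted(scores, reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0
    if runner_up and best / runner_up < min_ratio:
        return None                      # ambiguous: keep the ASR text
    return scores.index(best)
```

Returning `None` for ambiguous cases mirrors the decision to leave a Part as raw ASR rather than link it to the wrong section.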
Approach chosen: Rather than copying full source text inline (which duplicates content and inflates the YouTube file), each matched Part gets a link + 60-word excerpt pointing to the canonical source section:
*ಈ ಭಾಗ ಪುಸ್ತಕದ [೧.೫: ಪದದೊಟ್ಟುಗಳು ಮತ್ತು ಪದರಪದೊಟ್ಟುಗಳು](../../book/kn/full#sec-1-5) ಅನ್ನು ಆಧರಿಸಿದೆ.*
> ನುಡಿಗಳಲ್ಲಿ ಸಾಮಾನ್ಯವಾಗಿ ಎರಡು ಬಗೆಯ ಒಟ್ಟುಗಳು ಬಳಕೆಯಲ್ಲಿರುತ್ತವೆ:…
Links use the #sec-N-M / #sub-N-M-K anchors added in Phase 19.
Results:
- Book 02: 10 Parts (24, 26–30, 39–42) linked to `web/kn/raw.md` blog sections (ಭಾಗ 4–10)
- Book 03: 33 of 55 Parts linked to `book/kn/full.md` sections using precise `sec-N-M` anchors; 22 Parts retained ASR (ambiguous overlap, ratio < 1.4×)
File sizes after: Book 02 youtube/kn/full.md: 1,342 lines; Book 03: 1,424 lines (vs 4K–6K if full text had been copied inline).
Scripts: /tmp/link_substitute_02.py, /tmp/link_substitute_03.py
Commits: cb466d7 (reverted), d5d82cf (reverted), 4678d9f (final link+excerpt approach)
Phase 22 — YouTube Transcript Restructuring: Paragraph Breaks + Chapter TOC (2026-03-21)
Scope: Books 01–13 (YouTube-only lecture series — no scanned book text).
Motivation: All youtube/kn/full.md files for books 01–13 contained raw ASR transcripts as flat single-paragraph blobs under ## Part N headers, with no visual separation between parts, no anchor links, and no chapter grouping. The files were unreadable as reference material.
Changes (script: /tmp/structure_v2.py):
- `## Part N` headers demoted to `### Part N` (to nest under chapter `##` headings)
- `<a id="part-N">` anchor inserted before every Part
- ASR transcript blobs split into ~80-word paragraphs for readability
- Lines with < 40% Kannada characters replaced with `> *[ಈ ಭಾಗದ ಭಾಷಾಂತರ ಲಭ್ಯವಿಲ್ಲ — transcript not available for this part]*`
- `[Watch on YouTube](URL)` link inserted into each Part from the Table of Contents
- ಪರಿವಿಡಿ TOC added at the top of each file linking to `#part-N` anchors
- Book 03 additionally restructured with 9-chapter grouping (`## ಅಧ್ಯಾಯ N` headings with `<a id="adhyAya-N">` anchors) per the book’s chapter structure
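Two of the changes above, paragraph splitting and the Kannada-character threshold, can be sketched as follows. This is not the code of `/tmp/structure_v2.py`: the function names are hypothetical, and measuring the ratio over non-whitespace characters is an assumption about how the 40% figure is computed.

```python
from typing import List


def kannada_ratio(line: str) -> float:
    """Fraction of non-whitespace characters in the Kannada Unicode block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    kannada = sum(1 for c in chars if "\u0C80" <= c <= "\u0CFF")
    return kannada / len(chars)


def split_paragraphs(blob: str, words_per_paragraph: int = 80) -> List[str]:
    """Split a flat transcript blob into paragraphs of about N words each."""
    words = blob.split()
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]
```

A line would be replaced by the placeholder when `kannada_ratio(line) < 0.4`; a 200-word blob splits into paragraphs of 80, 80 and 40 words.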
Books processed:
- Flat (no chapters): 01, 06, 07, 08, 10, 11, 12, 13
- Chapter-mapped: 03 (9 chapters, Parts 1–55)
- Earlier partial restructuring (separate commits): 02, 04, 05, 09
Commits: 304fc93, ad1b267, 0c9f7b9
Phase 21 — GitHub Pages Nested Sidebar + Bug Fixes (2026-03-20)
Motivation: After the taxonomy migration, the GitHub Pages site was broken: 404s throughout, no books showing in sidebar (the old pages.yml Python script used flat paths that no longer existed), and .md-suffix links causing 404s.
Sidebar redesign: Replaced the flat 3-section sidebar (English Summaries / Kannada Text / Eke Transliteration) with a source-first 3-level hierarchy:
- ▶ Books
  - ▶ English: book/en/summary.md entries
  - ▶ Kannada Text: book/kn/full.md entries
  - ▶ Eke Transliteration: book/eke/full.md entries
- ▶ Blog
  - ▶ Kannada Text: web/kn/full.md entries
  - …
- ▶ YouTube
  - ▶ English: youtube/en/summary.md entries
  - ▶ Kannada Text: youtube/kn/full.md entries
  - …
Each content file gets parent: "Kannada Text" + grand_parent: "Books" (etc.) using just-the-docs 3-level nesting. Sections only created when content exists.
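As an illustration of the front matter involved, the sketch below builds the YAML block for one content file. The helper name `front_matter` is hypothetical; `title`, `parent` and `grand_parent` are the just-the-docs navigation keys named above, and quoting values with `json.dumps` is a choice made here so that titles containing quotes or colons stay valid YAML.

```python
import json


def front_matter(title: str, parent: str, grand_parent: str) -> str:
    """Build a just-the-docs front matter block for a third-level page."""
    def quoted(value: str) -> str:
        # A JSON string is also a valid YAML double-quoted scalar.
        return json.dumps(value, ensure_ascii=False)

    lines = [
        "---",
        f"title: {quoted(title)}",
        f"parent: {quoted(parent)}",
        f"grand_parent: {quoted(grand_parent)}",
        "---",
        "",
    ]
    return "\n".join(lines)
```

For example, `front_matter("Book 28", "Kannada Text", "Books")` produces a block whose `parent` and `grand_parent` values place the page under Books → Kannada Text in the sidebar.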
pages.yml Python script completely rewritten in 6-pass architecture:
- Gather book metadata from `youtube/kn/full.md` + find content files via CANDIDATES dict
- Create source section pages (`book.md`, `web.md`, `youtube.md`)
- Create language sub-section pages (`book-en.md`, `book-kn.md`, etc.)
- Write `parent`/`grand_parent` front matter to all content files
- Write book landing pages as `index.md` (so the `/NN-slug/` directory URL resolves)
- Mark all other `.md` files `nav_exclude: true`
Bug fixes:
- `docs/_config.yml`: removed `jekyll-relative-links` plugin (frozen Gemfile.lock); removed orphaned `collections:` + `just_the_docs: collections:` blocks (were rendering a floating “DNS Bhat” label in sidebar)
- `docs/Gemfile`: removed `jekyll-relative-links` gem
- Books 14, 27, 28, 29 `youtube/kn/full.md`: fixed broken cross-links `../en/summary` → `../../book/en/summary` and `../eke/full` → `../../book/eke/full` (youtube/ has no en/eke content for these books)
- Book landing pages now written as `index.md` so directory URLs like `/dnsbhat/28-kannaDakke-bEku/` resolve correctly
Phase 20 — 4-Level Taxonomy Migration (2026-03-20)
Scope: All 29 book directories in src/main/md/kannada/dnsbhat/.
Motivation: The old flat layout (NN-slug-kn.md, NN-slug-en.md, etc.) mixed all content types in one directory. The new 4-level taxonomy separates by source (book/web/youtube), language (kn/eke/en), and type (full/summary/raw).
New structure:
NN-slug/
├── README.md # Book index landing page
├── claude-prompt.md # AI context primer
├── book/
│ ├── kn/full.md # Scanned book — structured Kannada (was NN-slug-kn.md)
│ ├── kn/raw.md # Raw OCR archive (was NN-slug-book.md)
│ ├── eke/full.md # Eke romanisation (was NN-slug-kn-eke.md)
│ └── en/summary.md # English chapter summary (was NN-slug-en.md)
├── web/
│ ├── kn/raw.md # DNS Bhat blog posts (was NN-slug-blog.md)
│ └── en/summary.md # Blog-based English summary
└── youtube/
    ├── kn/full.md # Assembled transcript Kannada (was NN-slug.md; nav metadata carrier)
    ├── eke/full.md # Eke romanisation of transcript
    └── en/summary.md # YouTube-based English summary
Migration: 129 git mv operations via script. All cross-links in kn.md and kn-eke.md files updated to new relative paths (depth 3: ../eke/full, ../../README; depth 4: ../../../README).
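The old-to-new mapping implied by the "was ..." annotations in the tree can be written as a lookup table. This is a sketch of the mapping only, not the migration script itself; `SUFFIX_TO_PATH` and `new_path` are hypothetical names, and only the old filenames documented in the tree are covered.

```python
from typing import Optional

# Old flat-file suffix (after the NN-slug prefix) -> path inside NN-slug/.
SUFFIX_TO_PATH = {
    "-kn.md": "book/kn/full.md",
    "-book.md": "book/kn/raw.md",
    "-kn-eke.md": "book/eke/full.md",
    "-en.md": "book/en/summary.md",
    "-blog.md": "web/kn/raw.md",
    ".md": "youtube/kn/full.md",   # bare NN-slug.md carried the nav metadata
}


def new_path(slug: str, old_name: str) -> Optional[str]:
    """Return the new relative path for an old flat file, or None if unknown."""
    if not old_name.startswith(slug):
        return None
    target = SUFFIX_TO_PATH.get(old_name[len(slug):])
    return f"{slug}/{target}" if target else None
```

Each non-`None` result corresponds to one `git mv old_name new_path` operation; looking up the exact remainder after the slug avoids confusing `-kn.md` with `-kn-eke.md`.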
sync_docs.py rewrite: ettuge-sync/scripts/sync_docs.py rewritten to recursively walk src/ subdirectories and create matching structure in docs/; orphaned flat docs/ files deleted.