Ch 2 — Phases 27–20 — Book 31 Decode, PDFs, Taxonomy Migration, Transcript Fixes

Phases 27–20 — Book 31 Decode, PDFs, Taxonomy Migration, Transcript Fixes

2026-03-20 to 2026-03-23

Phase 27 — Book 31 Full CID Decode: Baraha Dictionary (2026-03-23)

Book 31 — ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಕನ್ನಡದ್ದೇ ಪದಗಳು (487pp, A–Z dictionary) — all 12,121 English headwords fully decoded. The initial Phase 24 full.md had 5,411 garbled + 495 partial entries because the source PDF used an embedded Baraha font with CID-encoded glyphs.

The CID decode problem: pdfplumber could not map glyph IDs to Unicode — it emitted (cid:N) tokens instead of characters. Standard Nudi/WX pipelines didn’t apply here because the font’s codepoint mapping was Baraha cp1252, not WX.

Four key discoveries:

Discovery	Detail
Three-range CID offset rule	CIDs 1–96: offset +31 (ASCII); CIDs 97–114: offset +57 (0xA0–0xAB Baraha high zone); CIDs 115+: offset +61 (0xB2–0xFF Baraha extended). Boundary at 114 (not 113 as initially assumed).
CID 114 → `«` = ವಿ	With offset +57: byte 171 = 0xAB = `«` → maps to ವಿ in wx_decode MAPPING. Critical for locative/instrumental forms. Proved by `abet` → ಕಳವಿನಲ್ಲಿ.
`±` (0xB1) as ಲ, not ಶ	In this PDF font, CID 116 decodes to `±` but is used as the ಲ base in 3,196 of 3,207 occurrences. Only the remaining dozen or so are genuine ಶ (patterns `±À` = ಶ, `±É` = ಶೆ). Fix: `re.sub(r'±(?![ÀÉ])', '®', text)`.
`³` (0xB3) = ಲ್ಲಿ trigger	CID 118 produces `³` (118 + 61 = 179 = 0xB3); `³è` sequences encode the locative suffix. Fix: replace `³` with `°` before wx_decode, so that the VATTAKSHARA rearrangement produces ಲ + ್ + ಲ + ಿ = ಲ್ಲಿ. Affects 580 locative forms.
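
The three-range rule can be sketched as a small token decoder. This is a minimal illustration and not the contents of /tmp/decode_book31_batch.py: the function names are hypothetical, the upper bound of 194 is simply the point where CID + 61 reaches 0xFF, and `errors="replace"` is an assumption for bytes that cp1252 leaves undefined.

```python
import re
from typing import Optional

# pdfplumber emits undecodable glyphs as literal "(cid:N)" tokens.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_to_byte(cid: int) -> Optional[int]:
    """Map a CID to a Baraha cp1252 byte using the three-range offset rule."""
    if 1 <= cid <= 96:
        return cid + 31   # ASCII range
    if 97 <= cid <= 114:
        return cid + 57   # boundary is 114, so CID 114 -> 0xAB
    if 115 <= cid <= 194:
        return cid + 61   # 194 + 61 = 255 (assumed upper bound)
    return None           # outside the rule: leave the token untouched


def decode_cid_tokens(text: str) -> str:
    """Replace every (cid:N) token with its Baraha-encoded character."""
    def replace(match: "re.Match[str]") -> str:
        byte = cid_to_byte(int(match.group(1)))
        if byte is None:
            return match.group(0)
        return bytes([byte]).decode("cp1252", errors="replace")

    return CID_TOKEN.sub(replace, text)


print(decode_cid_tokens("(cid:34)(cid:114)(cid:116)"))  # A«±
```

The result is still Baraha-encoded Latin text; it only becomes Kannada after the preprocess substitutions and wx_decode.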

Additional preprocess fixes:

¯À → ® (standalone ಲ; the ¯ + À pair was not in MAPPING)
¯ before non-combo chars → ® (handles all unmapped ¯ uses)
\xad (soft hyphen) → © (ಬಿ single-glyph form)
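
Taken together, the substitutions from the table and the list above could be applied in a single preprocess step before wx_decode. The sketch below is an assumption about ordering and structure, not the actual batch script. In particular, `MACRON_COMBO_CHARS` is a hypothetical placeholder: the real set of characters that pair with `¯` depends on the wx_decode MAPPING table, which this log does not list.

```python
import re

# Hypothetical placeholder: characters that form a mapped pair with '¯'.
# The real set comes from the wx_decode MAPPING table (not shown in this log).
MACRON_COMBO_CHARS = "É"


def preprocess(text: str, combo_chars: str = MACRON_COMBO_CHARS) -> str:
    """Apply the Book 31 substitutions to Baraha-encoded text."""
    # '±' is the ಲ base unless followed by À or É (the genuine ಶ patterns).
    text = re.sub(r"±(?![ÀÉ])", "®", text)
    # '³' becomes '°' so the later rearrangement can build ಲ್ಲಿ.
    text = text.replace("³", "°")
    # The pair rule must run before the single-character rule,
    # otherwise the 'À' would be left behind.
    text = text.replace("¯À", "®")
    if combo_chars:
        text = re.sub("¯(?![" + re.escape(combo_chars) + "])", "®", text)
    else:
        text = text.replace("¯", "®")
    # Soft hyphen stands for the single-glyph ಬಿ form.
    text = text.replace("\xad", "©")
    return text
```

Running the pair rule `¯À` before the general `¯` rule matters: in the opposite order the general rule would consume the `¯` and strand the `À`.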

Result: 12,121 entries decoded; 2 remain garbled (the OCR-artifact headwords `wAr)` and `XYZ`, which are spurious rather than real dictionary entries). All genuine dictionary entries are fully readable Kannada.

Batch script: /tmp/decode_book31_batch.py — full decode pipeline with three-range CID rule, preprocess substitutions, wx_decode.convert_chunk(), postprocess cleanup, entry parser, and full.md generator.

Commit: 4751e65 — feat(book31): fully decode Baraha CID dictionary — Phase 27

Phase 26 — Sollarime Vol.3 + Vol.4 from PDF (2026-03-23)

Book 07 — ಕನ್ನಡ ಬರಹದ ಸೊಲ್ಲರಿಮೆ vols 3 and 4 (syntax volumes) extracted from PDF and processed.

Source: Google Drive PDFs — vol3 (241pp), vol4 (274pp). Both use Baraha/WX encoding; decoded via wx_decode.py pipeline.

Vol3 — ವಾಕ್ಯರಚನೆ ಭಾಗ ೧ (Chapters 7–8):

Chapter 7: ಎಸಕಪದದ ಪಾಂಗುಗಳು (Verbal Arguments) — verb valency, transitivity, argument frames
Chapter 8: ಪಾಂಗಿಟ್ಟಳದಲ್ಲಿ ಮಾರ್ಪಾಡುಗಳು (Argument Frame Alternations) — passivisation, causativisation

Vol4 — ವಾಕ್ಯರಚನೆ ಭಾಗ ೨ (Chapters 9–10):

Chapter 9: ಆಡುಪದಗಳು (Personal Pronouns) — person/number/gender in Kannada pronoun system
Chapter 10: ತೋರುಪದಗಳು (Demonstratives) — proximal/distal deixis, anaphora

Files produced per volume:

book/volN/kn/raw.md — extracted OCR source
book/volN/kn/full.md — structured Kannada with TOC and <a id="adhyAya-N"> anchors
book/volN/eke/full.md — Eke romanisation
book/volN/en/summary.md — English chapter summary

Multi-volume index book/kn/full.md updated: vol3 and vol4 entries added; vol1/vol2 entries corrected; vols 5–7 marked as pending.

Commits: (see git log)

Phase 25 — YouTube Summaries for Books 01–13 + Book 33 Split (2026-03-22)

Motivation: two tasks. First, provide English overviews for the 6 YouTube-only books that lacked any English entry. Second, correctly separate the two distinct sollarime works that had been conflated in Book 07’s folder.

25a — YouTube English summaries for books 01, 06, 10, 11, 12, 13

Six books had YouTube transcripts (or stubs) but no English summary. Created youtube/en/summary.md + updated claude-prompt.md for each:

Book	Title	Quality	Notes
01	idu kannaDaddE vyAkaraNa	✅ Excellent	Malati Bhat reading; 3 parts; full 19-chapter TOC read aloud in Part 1
06	kalikenuDi mattu nuDikalike	⚠️ Garbled	Live-lecture ASR; keyword signals only
10	kannaDa nuDiya hinnaDavaLi	⚠️ Garbled	Live-lecture ASR; content drawn from website description
11	kannaDa barahada padasamasye	⚠️ Garbled	25-part lecture; may be earlier monograph later condensed into Book 30
12	kannaDa bhASheya kalpita caritre	✅ Good	Malati Bhat reading; 2 parts; comparative method section well-preserved
13	dArege doDDavaru	🎓 Symposium	Third-person references to Bhat; panels praising his work — felicitation proceedings, not Bhat reading his own book

All 6 books now have youtube/en/summary.md with explicit quality assessment and part-level availability table. claude-prompt.md updated to include the new English file.

Commits: (see git log)

25b — Book 33 split from Book 07 (ಕನ್ನಡ ಸೊಲ್ಲರಿಮೆ vs ಕನ್ನಡ ಬರಹದ ಸೊಲ್ಲರಿಮೆ)

Discovery: The youtube/kn/full.md inside Book 07’s folder was titled KANNADA SOLLARIME (no “baraha”) — a different, shorter work than Book 07’s KANNADA BARAHADA SOLLARIME (7-volume grammar of written Kannada). The Book 07 README had carried a ⚠️ YouTube note since Phase 22 flagging the mismatch.

Action:

git mv Book 07’s youtube/kn/full.md → new folder 33-kannaDa-sollarime/youtube/kn/full.md
nav_order updated 107→133; redirect_from: /dnsbhat/07-kannadada-sollarime/07-kannadada-sollarime preserved for URL continuity
New Book 33 folder: README.md, claude-prompt.md, youtube/en/summary.md created
Book 07 README.md updated: YouTube row removed, clean cross-reference link to Book 33 added
dnsbhat/README.md: Book 33 entry added (Section M — YouTube-Only Book Split from Book 07), collection stats updated 32→33

Identity note: ಕನ್ನಡ ಸೊಲ್ಲರಿಮೆ (no “baraha”) may be an earlier or more general grammar covering both spoken and written Kannada — the omission of “baraha” (writing) distinguishes it from Book 07 (ಕನ್ನಡ ಬರಹದ ಸೊಲ್ಲರಿಮೆ, 7-volume grammar of written Kannada). The precise relationship cannot be determined without a PDF.

Stats after Phase 25: All 33 DNS Bhat books catalogued; all 33 have claude-prompt.md.

Commits: (see git log)

Phase 24 — Books 30–32 Added from PDF Extraction (2026-03-22)

Motivation: Three additional DNS Bhat works identified from Google Drive PDFs not previously catalogued.

Book 30 — ಕನ್ನಡ ಬರಹವನ್ನು ಸರಿಪಡಿಸೋಣ (382 pages, Nudi encoding)

Full 4-file set produced via wx_decode.py (Nudi→Unicode lookup, same pipeline as Books 07/17/25/28/29):

book/kn/raw.md — decoded Kannada source (Nudi WX → Unicode)
book/kn/full.md — structured Kannada with 10-chapter TOC and <a id="adhyAya-N"> anchors
book/en/summary.md — English chapter-by-chapter summary with cross-links
book/eke/full.md — Eke romanisation matching full.md structure
claude-prompt.md — AI context primer

Subject: Practical guide to correcting Kannada script usage — aspirate simplification, anusvara normalisation, vowel-length marking. Arguments for hosa baraha (simplified writing) applied to print.

Book 31 — ಇಂಗ್ಲಿಶ್ ಪದಗಳಿಗೆ ಕನ್ನಡದ್ದೇ ಪದಗಳು (487 pages, Nudi encoding, A–Z dictionary)

Partial processing — Nudi WX decode successful but English headword reconstruction partially garbled:

book/kn/raw.md — decoded Kannada source
book/en/summary.md — English summary (headwords + overview; garbling noted)
claude-prompt.md

Subject: Comprehensive dictionary of native Kannada equivalents for English words (A–Z), structured as a lexical resource. Companion to Book 15 (ingliS-kannaDa padanerake).

Book 32 — The Prominence of Tense, Aspect and Mood (214 pages, clean English, John Benjamins)

Clean English PDF — pdfplumber extraction; no Nudi decoding required:

book/en/summary.md — chapter summary
claude-prompt.md

Subject: Cross-linguistic typological monograph (English) on the TAM hierarchy across world languages. Bhat’s contribution to general linguistic theory; Kannada-informed but not Kannada-specific.

Stats after Phase 24: 32 books catalogued (sections A–L); all 32 have claude-prompt.md.

Commit: (see git log)

Phase 23 — Blog Sidebar Fix + Stubs Category + YouTube Transcript Cross-References (2026-03-21)

Motivation: Three separate improvements triggered by an audit of which YouTube transcripts could be enriched using existing high-quality sources (book/kn/full.md or web/kn/raw.md).

23a — Blog sidebar fallback fix

CANDIDATES[('web', 'kn')] in pages.yml was ['web/kn/full.md'] only. Books 14 and 18, which have web/kn/raw.md but no full.md, were missing from the Blog Kannada sidebar. Added 'web/kn/raw.md' as a fallback candidate.
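
A minimal sketch of how such a fallback list might be resolved, assuming candidates are tried in order and the first existing file wins. Only the `CANDIDATES` key and its two paths come from the log; `pick_content_file` is a hypothetical helper name and the real lookup in pages.yml may differ.

```python
from pathlib import Path
from typing import Optional

# Ordered by preference: full.md first, raw.md as the fallback.
CANDIDATES = {
    ("web", "kn"): ["web/kn/full.md", "web/kn/raw.md"],
}


def pick_content_file(book_dir, key) -> Optional[str]:
    """Return the first candidate path that exists under book_dir, or None."""
    for rel_path in CANDIDATES.get(key, []):
        if (Path(book_dir) / rel_path).is_file():
            return rel_path
    return None
```

With this ordering, a book that has only web/kn/raw.md still gets a Blog sidebar entry, while books with a full.md keep using it.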

Commit: 7678364

23b — Stubs sidebar category

16 of 29 books have youtube/kn/full.md files that are placeholder stubs (no ## Part or ### Part headers — just a description or link list). These were cluttering the YouTube sidebar.

Added a Stubs top-level sidebar section (nav_order: 40) with automatic reclassification: the is_youtube_stub() helper in pages.yml detects stub files and moves them from ('youtube', 'kn') to ('stub', 'kn') in Pass 1, dropping their associated YouTube en/eke entries. A dedicated Stubs section renders them separately.
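
The detection and reclassification could look roughly like the sketch below. The name `is_youtube_stub` is from the log, but its body here is a guess based on the stated criterion (no `## Part` or `### Part` headers). `reclassify_stub` and the shape of the `found` dictionary are hypothetical.

```python
import re

# A real transcript has at least one "## Part" or "### Part" heading.
PART_HEADER = re.compile(r"^#{2,3} Part\b", re.MULTILINE)


def is_youtube_stub(markdown_text: str) -> bool:
    """True when the file has no Part headings, i.e. it is a placeholder."""
    return PART_HEADER.search(markdown_text) is None


def reclassify_stub(found: dict, read_text) -> dict:
    """Move a stub from ('youtube', 'kn') to ('stub', 'kn').

    found maps (source, lang) -> path for one book (assumed structure);
    read_text is any callable returning the text of a path.
    """
    path = found.get(("youtube", "kn"))
    if path is None or not is_youtube_stub(read_text(path)):
        return dict(found)
    # Drop all youtube entries (kn, en, eke) and re-file the kn stub.
    result = {key: value for key, value in found.items() if key[0] != "youtube"}
    result[("stub", "kn")] = path
    return result
```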

Commit: 33d0247

23c — YouTube transcript cross-references (Books 02 and 03)

Observation: Books 02 and 03 have both a YouTube transcript file and a high-quality Kannada source (web blog for Book 02, scanned book for Book 03). Keyword-overlap analysis (TF-IDF-style, 5+ char Kannada words, overlap score ratio) identified which YouTube Parts correspond to which source sections.
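
The matching step can be illustrated with a simplified scorer. This sketch uses plain set overlap rather than any TF-IDF weighting, and it assumes the 1.4× ratio reported in the results compares the best-scoring section with the runner-up. All function names are hypothetical, and the actual analysis script is not reproduced here.

```python
import re

# Kannada words of 5 or more code points (Unicode block U+0C80 to U+0CFF).
KANNADA_WORD = re.compile(r"[\u0C80-\u0CFF]{5,}")


def keywords(text: str) -> set:
    return set(KANNADA_WORD.findall(text))


def overlap_score(part_text: str, section_text: str) -> float:
    """Fraction of the Part's keywords that also occur in the section."""
    part_words = keywords(part_text)
    if not part_words:
        return 0.0
    return len(part_words & keywords(section_text)) / len(part_words)


def best_section(part_text: str, sections: dict, min_ratio: float = 1.4):
    """Return the section id of a clear winner, or None if ambiguous.

    sections maps section id -> section text.
    """
    scored = sorted(
        ((overlap_score(part_text, body), section_id)
         for section_id, body in sections.items()),
        reverse=True,
    )
    if not scored or scored[0][0] == 0.0:
        return None
    # Require the best score to beat the runner-up by min_ratio.
    if len(scored) > 1 and scored[0][0] < min_ratio * scored[1][0]:
        return None
    return scored[0][1]
```

Parts for which `best_section` returns None would keep their ASR text unchanged.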

Approach chosen: Rather than copying full source text inline (which duplicates content and inflates the YouTube file), each matched Part gets a link + 60-word excerpt pointing to the canonical source section:

*ಈ ಭಾಗ ಪುಸ್ತಕದ [೧.೫: ಪದದೊಟ್ಟುಗಳು ಮತ್ತು ಪದರಪದೊಟ್ಟುಗಳು](../../book/kn/full#sec-1-5) ಅನ್ನು ಆಧರಿಸಿದೆ.*

> ನುಡಿಗಳಲ್ಲಿ ಸಾಮಾನ್ಯವಾಗಿ ಎರಡು ಬಗೆಯ ಒಟ್ಟುಗಳು ಬಳಕೆಯಲ್ಲಿರುತ್ತವೆ:…

Links use the #sec-N-M / #sub-N-M-K anchors added in Phase 19.
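
A link + excerpt note in the format shown above could be generated along these lines. This is a sketch: `make_crossref` is a hypothetical helper, the Kannada sentence is copied from the Book 03 example, and the default target path is an assumption that would differ for Book 02's blog source.

```python
def make_crossref(section_title: str, anchor: str, section_text: str,
                  max_words: int = 60,
                  target: str = "../../book/kn/full") -> str:
    """Build the italic link line plus a blockquote excerpt."""
    words = section_text.split()
    excerpt = " ".join(words[:max_words])
    if len(words) > max_words:
        excerpt += "…"  # mark the truncation
    link_line = (
        f"*ಈ ಭಾಗ ಪುಸ್ತಕದ [{section_title}]({target}#{anchor}) ಅನ್ನು ಆಧರಿಸಿದೆ.*"
    )
    return f"{link_line}\n\n> {excerpt}"
```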

Results:

Book 02: 10 Parts (24, 26–30, 39–42) linked to web/kn/raw.md blog sections (ಭಾಗ 4–10)
Book 03: 33 of 55 Parts linked to book/kn/full.md sections using precise sec-N-M anchors; 22 Parts retained ASR (ambiguous overlap, ratio < 1.4×)

File sizes after: Book 02 youtube/kn/full.md: 1,342 lines; Book 03: 1,424 lines (vs 4K–6K if full text had been copied inline).

Scripts: /tmp/link_substitute_02.py, /tmp/link_substitute_03.py

Commits: cb466d7 (reverted), d5d82cf (reverted), 4678d9f (final link+excerpt approach)

Phase 22 — YouTube Transcript Restructuring: Paragraph Breaks + Chapter TOC (2026-03-21)

Scope: Books 01–13 (YouTube-only lecture series — no scanned book text).

Motivation: All youtube/kn/full.md files for books 01–13 contained raw ASR transcripts as flat single-paragraph blobs under ## Part N headers, with no visual separation between parts, no anchor links, and no chapter grouping. The files were unreadable as reference material.

Changes (script: /tmp/structure_v2.py):

## Part N headers demoted to ### Part N (to nest under chapter ## headings)
<a id="part-N"> anchor inserted before every Part
ASR transcript blobs split into ~80-word paragraphs for readability
Lines with < 40% Kannada characters replaced with > *[ಈ ಭಾಗದ ಭಾಷಾಂತರ ಲಭ್ಯವಿಲ್ಲ — transcript not available for this part]*
[Watch on YouTube](URL) link inserted into each Part from the Table of Contents
ಪರಿವಿಡಿ TOC added at the top of each file linking to #part-N anchors
Book 03 additionally restructured with 9-chapter grouping (## ಅಧ್ಯಾಯ N headings with <a id="adhyAya-N"> anchors) per the book’s chapter structure
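
Two of the changes above, the ~80-word paragraph split and the 40% Kannada threshold, can be sketched as follows. This is not /tmp/structure_v2.py: the function names are hypothetical, and measuring the ratio over non-whitespace characters is an assumption, since the log does not say how it was computed.

```python
def split_into_paragraphs(blob: str, words_per_paragraph: int = 80) -> list:
    """Split a flat ASR blob into paragraphs of a fixed word count."""
    words = blob.split()
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]


def kannada_ratio(line: str) -> float:
    """Share of non-whitespace characters in the Kannada Unicode block."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    kannada = sum(1 for ch in chars if "\u0c80" <= ch <= "\u0cff")
    return kannada / len(chars)


def replace_non_kannada(lines: list, placeholder: str,
                        threshold: float = 0.40) -> list:
    """Swap lines below the threshold for the placeholder; keep blank lines."""
    return [
        line if not line.strip() or kannada_ratio(line) >= threshold
        else placeholder
        for line in lines
    ]
```

A fixed word count is the simplest way to break up a transcript that has no punctuation to split on; it does not try to respect sentence boundaries.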

Books processed:

Flat (no chapters): 01, 06, 07, 08, 10, 11, 12, 13
Chapter-mapped: 03 (9 chapters, Parts 1–55)
Earlier partial restructuring (separate commits): 02, 04, 05, 09

Commits: 304fc93, ad1b267, 0c9f7b9

Phase 21 — GitHub Pages Nested Sidebar + Bug Fixes (2026-03-20)

Motivation: After the taxonomy migration, the GitHub Pages site was broken: 404s throughout, no books showing in sidebar (the old pages.yml Python script used flat paths that no longer existed), and .md-suffix links causing 404s.

Sidebar redesign: Replaced the flat 3-section sidebar (English Summaries / Kannada Text / Eke Transliteration) with a source-first 3-level hierarchy:

▶ Books
    ▶ English        — book/en/summary.md entries
    ▶ Kannada Text   — book/kn/full.md entries
    ▶ Eke Transliteration — book/eke/full.md entries
▶ Blog
    ▶ Kannada Text   — web/kn/full.md entries
    …
▶ YouTube
    ▶ English        — youtube/en/summary.md entries
    ▶ Kannada Text   — youtube/kn/full.md entries
    …

Each content file gets parent: "Kannada Text" + grand_parent: "Books" (etc.) using just-the-docs 3-level nesting. Sections only created when content exists.
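
As an illustration, the front matter for one content file might be generated like this. The `parent` and `grand_parent` keys are the just-the-docs navigation fields named above. The helper name, the `title` and `nav_order` values, and the use of JSON quoting for YAML strings are assumptions.

```python
import json


def nav_front_matter(title: str, parent: str, grand_parent: str,
                     nav_order: int) -> str:
    """Return a YAML front matter block for 3-level just-the-docs nesting."""
    def quote(value: str) -> str:
        # A JSON string is also a valid YAML double-quoted scalar.
        return json.dumps(value, ensure_ascii=False)

    lines = [
        "---",
        f"title: {quote(title)}",
        f"parent: {quote(parent)}",
        f"grand_parent: {quote(grand_parent)}",
        f"nav_order: {nav_order}",
        "---",
        "",
    ]
    return "\n".join(lines)
```

For example, `nav_front_matter("Book 28", "Kannada Text", "Books", 128)` yields a block with `parent: "Kannada Text"` and `grand_parent: "Books"`.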

pages.yml Python script completely rewritten in 6-pass architecture:

Gather book metadata from youtube/kn/full.md + find content files via CANDIDATES dict
Create source section pages (book.md, web.md, youtube.md)
Create language sub-section pages (book-en.md, book-kn.md, etc.)
Write parent/grand_parent front matter to all content files
Write book landing pages as index.md (so /NN-slug/ directory URL resolves)
Mark all other .md files nav_exclude: true

Bug fixes:

docs/_config.yml: removed jekyll-relative-links plugin (frozen Gemfile.lock); removed orphaned collections: + just_the_docs: collections: blocks (were rendering a floating “DNS Bhat” label in sidebar)
docs/Gemfile: removed jekyll-relative-links gem
Books 14, 27, 28, 29 youtube/kn/full.md: fixed broken cross-links ../en/summary → ../../book/en/summary and ../eke/full → ../../book/eke/full (youtube/ has no en/eke content for these books)
Book landing pages now written as index.md so directory URLs like /dnsbhat/28-kannaDakke-bEku/ resolve correctly
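
The cross-link fix for Books 14, 27, 28 and 29 amounts to a targeted rewrite of two relative link targets. A minimal sketch, assuming the links are ordinary markdown `[text](target)` links; `fix_youtube_cross_links` is a hypothetical name.

```python
import re

# Match only link targets that start with ../en/summary or ../eke/full,
# so already-corrected ../../book/... links are left alone.
BROKEN_LINK = re.compile(r"\]\(\.\./(en/summary|eke/full)(?=[)#])")


def fix_youtube_cross_links(markdown_text: str) -> str:
    """Point youtube/kn/full.md cross-links at the book/ copies."""
    return BROKEN_LINK.sub(r"](../../book/\1", markdown_text)
```

Anchoring the pattern on `](` makes the rewrite idempotent, so running it twice leaves the file unchanged.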

Phase 20 — 4-Level Taxonomy Migration (2026-03-20)

Scope: All 29 book directories in src/main/md/kannada/dnsbhat/.

Motivation: The old flat layout (NN-slug-kn.md, NN-slug-en.md, etc.) mixed all content types in one directory. The new 4-level taxonomy separates by source (book/web/youtube), language (kn/eke/en), and type (full/summary/raw).

New structure:

NN-slug/
├── README.md              # Book index landing page
├── claude-prompt.md       # AI context primer
├── book/
│   ├── kn/full.md         # Scanned book — structured Kannada (was NN-slug-kn.md)
│   ├── kn/raw.md          # Raw OCR archive (was NN-slug-book.md)
│   ├── eke/full.md        # Eke romanisation (was NN-slug-kn-eke.md)
│   └── en/summary.md      # English chapter summary (was NN-slug-en.md)
├── web/
│   ├── kn/raw.md          # DNS Bhat blog posts (was NN-slug-blog.md)
│   └── en/summary.md      # Blog-based English summary
└── youtube/
    ├── kn/full.md         # Assembled transcript Kannada (was NN-slug.md — nav metadata carrier)
    ├── eke/full.md        # Eke romanisation of transcript
    └── en/summary.md      # YouTube-based English summary

Migration: 129 git mv operations via script. All cross-links in kn.md and kn-eke.md files updated to new relative paths (depth 3: ../eke/full, ../../README; depth 4: ../../../README).
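
The old-to-new mapping in the tree above can be expressed as a suffix table. The sketch below only computes target paths and does not run git. The suffixes and targets are taken from the "was ..." comments in the tree, while `new_path` and its error handling are hypothetical.

```python
# Old flat suffix -> new path inside the NN-slug/ directory.
SUFFIX_TO_TARGET = {
    "-kn.md": "book/kn/full.md",
    "-book.md": "book/kn/raw.md",
    "-kn-eke.md": "book/eke/full.md",
    "-en.md": "book/en/summary.md",
    "-blog.md": "web/kn/raw.md",
    ".md": "youtube/kn/full.md",
}


def new_path(old_name: str, slug: str) -> str:
    """Map an old flat filename to its location in the 4-level taxonomy."""
    if not old_name.startswith(slug):
        raise ValueError(f"{old_name!r} does not belong to {slug!r}")
    # Match the exact remainder after the slug, which avoids any
    # ambiguity between "-kn.md", "-kn-eke.md" and the bare ".md".
    suffix = old_name[len(slug):]
    try:
        return f"{slug}/{SUFFIX_TO_TARGET[suffix]}"
    except KeyError:
        raise ValueError(f"unknown suffix {suffix!r}") from None
```

Each result pairs with its source name to form one `git mv` operation.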

sync_docs.py rewrite: ettuge-sync/scripts/sync_docs.py rewritten to recursively walk src/ subdirectories and create matching structure in docs/; orphaned flat docs/ files deleted.
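
The behaviour described for the rewritten sync script (recursive mirror plus orphan cleanup) can be sketched as below. This is not the actual ettuge-sync/scripts/sync_docs.py: `sync_tree` is a hypothetical function, and treating every .md file in docs/ that has no counterpart in src/ as an orphan is an assumption. A real run would need to spare generated pages such as section indexes.

```python
import shutil
from pathlib import Path


def sync_tree(src, dst) -> list:
    """Mirror all .md files from src into dst and delete orphaned .md files.

    Returns the relative paths of the files that were removed.
    """
    src, dst = Path(src), Path(dst)
    wanted = set()
    for source_file in src.rglob("*.md"):
        rel = source_file.relative_to(src)
        wanted.add(rel)
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_file, target)

    # Anything left in dst that src does not have is an orphan.
    orphans = [p for p in dst.rglob("*.md") if p.relative_to(dst) not in wanted]
    removed = sorted(p.relative_to(dst).as_posix() for p in orphans)
    for orphan in orphans:
        orphan.unlink()
    return removed
```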
