Ch 3 — Spark / Big Data | ettuge

Spark / Big Data

Apache Spark (general reference); incubator-nlpcraft project (Apache NLPCraft NLP framework)

Apache Spark is one of the most widely used distributed data processing frameworks for large-scale ETL, ML pipelines, and analytics. It is referenced here in the context of the NLPCraft project, where Spark is noted for distributed NLP workload processing across the incubator-nlpcraft codebase. [→ data-engineering; machine-learning-ai]

Kannada Data Pipeline (alar.ink)

Nushell command to parse the linkchecker output in alar.txt into JSON:

open alar.txt | parse -r '(((URL *\x60(?P<Url>[^\x27]*)\x27)\n)|((Name *\x60(?P<Name>.*)\x27)\n)|(Parent URL *(?P<Purl>(?P<phttp>[^,]*), line (?P<pline>[^,]*), col (?P<pcol>[^\n]*))\n)|((Real URL *(?P<Rurl>.*))\n)){4}' | select Url Name Rurl phttp pline pcol | to json | save -f "alar-url.txt"

This Nushell pipeline parses the text output of a link checker run against the alar.ink Kannada dictionary. A single regex extracts the URL, name, and real URL of each link and splits the parent URL into its address, line, and column; the selected fields are then converted to JSON and saved. It is an example of Nushell’s data-oriented shell scripting for text processing tasks. [→ data-engineering; kannada-language-linguistics; scala-functional-programming]
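The same extraction can be sketched in Python as a rough cross-check of what the regex captures. This is a minimal sketch under stated assumptions, not the author's method: it assumes every record lists its four lines in a fixed order (URL, Name, Parent URL, Real URL), whereas the Nushell pattern accepts the four line types in any order. The sample text, the example.org URLs, and the `parse_records` helper are hypothetical.

```python
import json
import re

# Hypothetical sample in the four-line shape the Nushell regex expects.
# The URLs and names are invented for illustration only.
SAMPLE = (
    "URL        `/word/1'\n"
    "Name       `first entry'\n"
    "Parent URL https://example.org/index.html, line 12, col 5\n"
    "Real URL   https://example.org/word/1\n"
    "URL        `/word/2'\n"
    "Name       `second entry'\n"
    "Parent URL https://example.org/index.html, line 13, col 5\n"
    "Real URL   https://example.org/word/2\n"
)

# One record = four consecutive lines. Assumption: fixed line order.
RECORD = re.compile(
    r"^URL +`(?P<Url>[^']*)'\n"
    r"Name +`(?P<Name>.*)'\n"
    r"Parent URL +(?P<phttp>[^,]*), line (?P<pline>[^,]*), col (?P<pcol>[^\n]*)\n"
    r"Real URL +(?P<Rurl>.*)\n",
    re.MULTILINE,
)


def parse_records(text):
    """Return one dict per record, keyed by the regex group names."""
    return [match.groupdict() for match in RECORD.finditer(text)]


records = parse_records(SAMPLE)
print(json.dumps(records, indent=2))
```

Writing `records` to a file with `json.dump` would mirror the final `to json | save` step of the pipeline.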

Nushell Kannada character class variables (for alar.ink 5-syllable word extraction):

$env.vowelfull = $env.vowel + $env.vowelend
$env.consnooth = $env.cons + $env.consend
$env.consotth1 = $env.cons + $env.hal + $env.cons + $env.consend
$env.consotth2 = $env.cons + $env.hal + $env.cons + $env.hal + $env.cons + $env.consend
$env.block = $"\(\(($env.vowelfull)\)|\(($env.consnooth)\)|\(($env.consotth1)\)|\(($env.consotth2)\)\)"
egrep $"\^($env.block){5}[್]?\$" alar-212-sorted-uniq-2.txt | save -f alar-5.txt
# Result: 21416 words of 5 syllable blocks (optionally ending in a halant)

These Nushell environment variables encode Kannada phonological character classes (vowels, consonants, and halant-joined consonant clusters) as regex components. Because the pattern is anchored at both ends and repeats the block exactly five times, with an optional trailing halant, the pipeline extracts five-syllable words from the alar.ink dictionary. It yields 21,416 results, which are used for the Wordalla Kannada word game project. [→ data-engineering; kannada-language-linguistics]
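The way the block pattern is composed from smaller classes can be sketched in Python. The source does not show the contents of `vowel`, `vowelend`, `cons`, `consend`, or `hal`, so every character class below is an assumption based on the Unicode Kannada block, and the `has_n_blocks` helper is hypothetical. Treat it as an illustration of the composition technique rather than a reproduction of the original variables.

```python
import re

# Assumed character classes (the source does not show the real ones).
# The ranges include a few unassigned code points, which is harmless here.
VOWEL = "[\u0C85-\u0C94]"   # independent vowels
CONS = "[\u0C95-\u0CB9]"    # consonants
HAL = "\u0CCD"              # halant (virama)
SIGN = "[\u0CBE-\u0CCC]"    # dependent vowel signs
MOD = "[\u0C82\u0C83]"      # anusvara, visarga

# Assumed endings: optional modifier after a vowel, optional sign and
# modifier after a consonant or consonant cluster.
VOWEL_END = f"{MOD}?"
CONS_END = f"{SIGN}?{MOD}?"

# Same composition as the Nushell variables above.
VOWEL_FULL = VOWEL + VOWEL_END
CONS_NO_OTTU = CONS + CONS_END
CONS_OTTU1 = CONS + HAL + CONS + CONS_END
CONS_OTTU2 = CONS + HAL + CONS + HAL + CONS + CONS_END
BLOCK = f"(?:{VOWEL_FULL}|{CONS_NO_OTTU}|{CONS_OTTU1}|{CONS_OTTU2})"


def has_n_blocks(word, n=5):
    """True if the whole word is exactly n blocks plus an optional final halant."""
    pattern = f"{BLOCK}{{{n}}}{HAL}?"
    return re.fullmatch(pattern, word) is not None


# Words are spelled with escapes so the code points are unambiguous.
MAHABHARATA = "\u0CAE\u0CB9\u0CBE\u0CAD\u0CBE\u0CB0\u0CA4"  # five blocks
KANNADA = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"                  # three blocks

print(has_n_blocks(MAHABHARATA, 5))
print(has_n_blocks(KANNADA, 5))
```

`re.fullmatch` plays the role of the `^` and `$` anchors in the `egrep` call, so a word with more or fewer blocks than requested is rejected.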

Benchmarking

IndicGenBench: multilingual benchmark, 29 Indic languages (arXiv:2404.16816) https://arxiv.org/abs/2404.16816

IndicGenBench is a benchmark for evaluating LLM performance on generation tasks across 29 Indic languages, covering translation, summarisation, and question answering. It is a useful resource for assessing progress in Indic NLP and identifying gaps in language coverage. [→ machine-learning-ai; kannada-language-linguistics]

GitHub: google-research-datasets/indic-gen-bench https://github.com/google-research-datasets/indic-gen-bench

This is the Google Research Datasets repository for IndicGenBench, the 29-language Indic generation benchmark. It contains the benchmark data, evaluation scripts, and model outputs. [→ machine-learning-ai; kannada-language-linguistics]

Educational

Spintronics — Build Mechanical Circuits (upperstory.com) https://upperstory.com/

Spintronics is a physical game/kit that lets you build working mechanical analogues of electronic circuits, with chains and spinning mechanical components standing in for wires and electronic parts. It offers a tactile way to learn circuit fundamentals and the relationship between mechanical and electrical systems. [→ algorithms-data-structures; mathematics-science]

DEXA scan info (dexafit.com pre-test protocols)

These are notes on DEXA (Dual-Energy X-ray Absorptiometry) scan protocols from DexaFit, a commercial DEXA scanning service. DEXA is widely regarded as a gold standard for body composition measurement; a scan reports bone density, lean mass, and fat mass, including an estimate of visceral fat. [→ health-fitness]