LERAGE: A Computational Framework for Mining Clinical Insights from Misophonia Podcast Transcripts

Executive Summary

Misophonia research faces a challenge: while clinical studies provide controlled data, they often miss the rich, naturalistic experiences shared by sufferers in conversational settings. Podcast interviews represent an untapped reservoir of phenomenological data, containing detailed first-person accounts of triggers, coping mechanisms, disease progression, and lived experiences across diverse populations.

LERAGE (Lived Experience Retrieval Augmented Generation Engine) is a computational framework for systematically analyzing misophonia podcast transcripts. It combines Maximal Marginal Relevance (MMR) diversity optimization, hybrid search strategies, weighted bias control, and statistical analysis to turn unstructured conversational data into research insights while maintaining methodological standards.

The system includes: (1) MMR diversity optimization for qualitative health research, (2) weighted bias control that reduces researcher influence without information loss, (3) hybrid search architecture combining semantic, temporal, and causal retrieval strategies, and (4) automated generation of statistics with Wilson confidence intervals and effect sizes.

1. Introduction

1.1 The Challenge of Misophonia Research

Misophonia, characterized by intense emotional and physiological responses to specific sounds, presents challenges for researchers:

Limited Clinical Access: Many sufferers do not seek clinical help, creating selection bias in traditional studies
Phenomenological Complexity: The subjective nature of triggers and responses requires rich, detailed accounts
Developmental Trajectories: Understanding progression from childhood to adulthood requires longitudinal perspectives
Coping Strategy Diversity: Effective interventions vary significantly between individuals
Scale Limitations: Manual analysis of qualitative data becomes impractical beyond small sample sizes

1.2 The Computational Opportunity

Podcast interviews with misophonia sufferers offer several advantages over traditional research methods:

Naturalistic Disclosure: Conversational settings encourage detailed personal narratives
Diverse Populations: Podcasts reach sufferers who may not participate in clinical studies
Temporal Depth: Long-form interviews capture developmental histories and progression patterns
Rich Context: Discussion of triggers, emotions, social impacts, and coping strategies in natural context
Scale Potential: Computational analysis enables systematic processing of hundreds of interviews

Analyzing large volumes of conversational data introduces challenges: researcher bias in interpretation, information overload, lack of systematic diversity controls, and the need for statistical analysis. LERAGE addresses these challenges through computational methods designed for qualitative health research.

1.3 Core Components

LERAGE implements four computational components:

MMR Diversity Optimization: Mathematical framework ensuring source diversity in qualitative analysis
Weighted Bias Control: Reduction of researcher influence through score weighting rather than binary exclusion
Hybrid Search Architecture: Multi-strategy retrieval combining semantic, temporal, and causal search approaches
Automated Research Statistics: Generation of statistical analysis with appropriate tests and confidence intervals

2. Theoretical Framework

2.1 Enhanced Retrieval-Augmented Generation (RAG) with Bias Control

LERAGE extends the traditional RAG paradigm with enhancements designed for research integrity:

Multi-Strategy Information Retrieval: Hybrid search combining semantic similarity, temporal relevance, and causal relationship detection
Mathematical Diversity Optimization: MMR-based selection ensuring broad source representation
Weighted Bias Mitigation: Systematic reduction of researcher influence through configurable score adjustments
Research-Grade Synthesis: LLM-powered analysis with confidence scoring and source attribution. Synthesis can be routed to any LLM, including local models, for privacy and control.

2.2 Maximal Marginal Relevance (MMR) Framework

MMR balances relevance and diversity through the mathematical formula:

MMR = λ × Sim(query, doc) - (1-λ) × max(Sim(doc, selected_doc))

Where:

λ controls the trade-off between relevance and diversity: λ = 1.0 ranks purely by relevance, λ = 0.0 purely by diversity
Sim represents cosine similarity in the embedding space
The algorithm iteratively selects documents maximizing this combined score

LERAGE implements MMR for qualitative research by calculating relevance scores, selecting the highest relevance document first, then iteratively selecting diverse documents based on combined MMR scores that balance query relevance against similarity to already-selected content.
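
The selection loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not LERAGE's actual code: the function names `cosine_sim` and `mmr_select` are hypothetical, and the embeddings are assumed to be plain non-zero vectors.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr_select(query_vec, doc_vecs, k, lam=0.7):
    """Return indices of up to k documents chosen by MMR (hypothetical helper)."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    if k <= 0 or len(doc_vecs) == 0:
        return []
    relevance = cosine_sim(doc_vecs, query_vec[None, :])[:, 0]
    doc_sim = cosine_sim(doc_vecs, doc_vecs)
    # Step 1: the most relevant document is always selected first.
    selected = [int(np.argmax(relevance))]
    candidates = set(range(len(doc_vecs))) - set(selected)
    # Step 2: repeatedly add the candidate with the best combined score.
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in sorted(candidates):
            redundancy = doc_sim[i, selected].max()
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 1.0 the second pick is simply the next most relevant document; lowering λ makes a near-duplicate of the first pick lose to a less relevant but more distinct document.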

2.3 Hybrid Search Architecture

LERAGE employs three complementary search strategies, each optimized for different research questions:

2.3.1 Semantic Search

Vector similarity using BGE-large-en-v1.5 embeddings
Identifies conceptually related content across interviews
Weight: 0.3-0.8 depending on query characteristics

2.3.2 Temporal Search

Targets age-specific and developmental content
Activated by temporal keywords: "childhood", "onset", "progression"
Weight: 0.1-0.5 based on query temporal relevance

2.3.3 Causal Search

Focuses on cause-effect relationships and trigger mechanisms
Triggered by causal language: "cause", "trigger", "lead to"
Weight: 0.1-0.5 based on causal query indicators

The hybrid approach dynamically weights these strategies based on query analysis, adjusting weights for temporal queries, causal queries, or maintaining balance for general questions.
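
As a rough sketch of query-adaptive weighting, the keyword lists below come from Sections 2.3.2 and 2.3.3, while the specific weight values, the function names `strategy_weights` and `combine_scores`, and the plain substring matching are assumptions chosen only to stay inside the documented ranges.

```python
TEMPORAL_KEYWORDS = ("childhood", "onset", "progression")
CAUSAL_KEYWORDS = ("cause", "trigger", "lead to")

def strategy_weights(query: str) -> dict:
    """Pick per-strategy weights from keyword hits (illustrative values)."""
    q = query.lower()
    # Substring matching is crude: "because" would also match "cause".
    temporal = any(k in q for k in TEMPORAL_KEYWORDS)
    causal = any(k in q for k in CAUSAL_KEYWORDS)
    if temporal and causal:
        return {"semantic": 0.3, "temporal": 0.35, "causal": 0.35}
    if temporal:
        return {"semantic": 0.4, "temporal": 0.5, "causal": 0.1}
    if causal:
        return {"semantic": 0.4, "temporal": 0.1, "causal": 0.5}
    return {"semantic": 0.8, "temporal": 0.1, "causal": 0.1}

def combine_scores(per_strategy: dict, weights: dict) -> dict:
    """Merge {strategy: {chunk_id: score}} into one weighted score per chunk."""
    combined = {}
    for strategy, scores in per_strategy.items():
        for chunk_id, score in scores.items():
            combined[chunk_id] = combined.get(chunk_id, 0.0) + weights[strategy] * score
    return combined
```

Each weight set sums to 1.0, so combined scores stay on the same scale as the per-strategy scores.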

2.4 Weighted Bias Control Framework

LERAGE implements weighted bias control that preserves information while reducing influence. The system detects speculation levels, applies contributor downweighting when appropriate, and adjusts relevance scores based on configured weights (typically 0.5 for speculative content from non-sufferers).

This approach enables researchers to:

Reduce (but not eliminate) the influence of researcher speculation
Maintain transparency through weight reporting
Configure bias control strength (0.1-1.0)
Preserve potentially valuable insights while controlling influence
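
The difference from binary exclusion can be illustrated with a short sketch. The `ScoredChunk` fields and the `apply_bias_weights` function are hypothetical names, and the rule "downweight only speculative content from non-sufferers" is taken from the example given above rather than from the actual implementation.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ScoredChunk:
    chunk_id: str
    score: float
    speaker_is_sufferer: bool
    is_speculative: bool
    applied_weight: float = 1.0  # recorded so the adjustment can be reported

def apply_bias_weights(chunks, downweight=0.5):
    """Scale scores instead of dropping chunks, then re-rank."""
    if not 0.1 <= downweight <= 1.0:
        raise ValueError("downweight must be within 0.1-1.0")
    adjusted = []
    for chunk in chunks:
        penalize = chunk.is_speculative and not chunk.speaker_is_sufferer
        weight = downweight if penalize else 1.0
        adjusted.append(replace(chunk, score=chunk.score * weight, applied_weight=weight))
    return sorted(adjusted, key=lambda c: c.score, reverse=True)
```

Every input chunk is still present in the output; only its rank changes, and `applied_weight` keeps the adjustment visible for reporting.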

2.5 Research Quality Framework

2.5.1 Statistical Rigor

LERAGE implements publication-quality statistical methods:

Wilson Score Intervals: Better coverage than the normal (Wald) approximation for proportions, especially with small samples or proportions near 0 or 1
Chi-Square Tests: Association analysis with appropriate effect sizes
Cramér's V: Standardized effect size measurement
Odds Ratios: Quantification of association strength

2.5.2 Confidence Scoring

A multi-factor confidence assessment combines a base confidence (from average scores), a volume bonus (from the number of chunks), and a diversity bonus (from the number of unique episodes) into an overall confidence metric.
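
The text does not give the exact formula, so the following is only one plausible shape for such a score. The function name `overall_confidence`, the additive combination, the bonus caps of 0.1, and the saturation point of 10 are all assumptions made for illustration.

```python
def overall_confidence(scores, episode_ids,
                       volume_cap=0.1, diversity_cap=0.1, saturation=10):
    """Base (mean score) plus capped volume and diversity bonuses.

    All constants are illustrative assumptions, not LERAGE's values.
    """
    if not scores:
        return 0.0
    base = sum(scores) / len(scores)
    # Bonuses grow linearly and stop growing after `saturation` items.
    volume_bonus = volume_cap * min(len(scores), saturation) / saturation
    unique_episodes = len(set(episode_ids))
    diversity_bonus = diversity_cap * min(unique_episodes, saturation) / saturation
    return min(1.0, base + volume_bonus + diversity_bonus)
```

Capping the bonuses keeps a large pile of weak matches from outscoring a small set of strong ones.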

3. Implementation Architecture

3.1 Data Processing Pipeline

3.1.1 HTML Transcript Parsing

LERAGE processes HTML transcripts with structured speaker attribution, maintaining temporal context while normalizing speaker roles for consistent analysis. The system extracts speaker turns with timestamps, normalizes Host/Guest roles, preserves original attribution, and generates deterministic episode IDs.
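
One common way to obtain deterministic IDs is to hash stable inputs. The sketch below assumes the ID is derived from the source file name and the transcript text; the actual inputs, the `ep_` prefix, and the function names `episode_id` and `chunk_id` are hypothetical.

```python
import hashlib

def episode_id(source_name: str, transcript_text: str) -> str:
    """Derive a stable ID from content, so re-ingesting yields the same ID."""
    digest = hashlib.sha256()
    digest.update(source_name.strip().lower().encode("utf-8"))
    digest.update(b"\x00")  # separator so name and text cannot run together
    digest.update(transcript_text.encode("utf-8"))
    return "ep_" + digest.hexdigest()[:16]

def chunk_id(ep_id: str, strategy: str, index: int) -> str:
    """Compose a chunk ID from its episode, chunking strategy, and position."""
    return f"{ep_id}:{strategy}:{index:05d}"
```

Because nothing in the ID depends on time, process state, or random values, the same transcript always maps to the same identifiers.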

3.1.2 Multi-Strategy Chunking

Three parallel chunking approaches ensure comprehensive coverage:

Conversation Chunking: Preserves speaker turns and dialogue flow
Topic Chunking: Groups thematically related content across temporal boundaries
Sliding Window Chunking: Overlapping 600-token windows with 150-token overlap

Each chunk receives a deterministic ID, enabling consistent processing across system restarts.
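
The sliding-window strategy is simple to sketch. The function below assumes the transcript has already been tokenized into a list (the tokenizer itself is not specified here), and the name `sliding_windows` is hypothetical.

```python
def sliding_windows(tokens, size=600, overlap=150):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # 450 tokens with the documented defaults
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return windows
```

With the defaults, each window starts 450 tokens after the previous one, so the final 150 tokens of one window reappear at the start of the next.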

3.2 Hybrid Retrieval System

The system analyzes each query to set the strategy weights, runs weighted searches across the semantic, temporal, and causal dimensions as needed, then merges and reranks the results. This lets retrieval adapt to the kind of research question being asked.

3.3 Bias Control Implementation

The bias control system detects speculation through pattern matching of phrases like "I think", "I believe", "it seems", calculating speculation density per chunk. It identifies primary sources through experiential language patterns like "I experience", "when I hear", "my triggers", scoring content based on first-person accounts versus third-party speculation.
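
A minimal sketch of this pattern matching follows. The phrase lists are the examples quoted above; the density definition (hits per word), the scoring rule in `primary_source_score`, and all function names are assumptions for illustration.

```python
import re

SPECULATION_PHRASES = ("i think", "i believe", "it seems")
EXPERIENCE_PHRASES = ("i experience", "when i hear", "my triggers")

def count_phrases(text: str, phrases) -> int:
    """Count whole-phrase, case-insensitive occurrences."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", lowered)) for p in phrases
    )

def speculation_density(text: str) -> float:
    """Speculation phrases per word in the chunk (0.0 for empty text)."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return count_phrases(text, SPECULATION_PHRASES) / n_words

def primary_source_score(text: str) -> float:
    """Share of experiential hits among all hits; 0.5 when there are none."""
    experiential = count_phrases(text, EXPERIENCE_PHRASES)
    speculative = count_phrases(text, SPECULATION_PHRASES)
    total = experiential + speculative
    return experiential / total if total else 0.5
```

A production detector would need a much longer phrase list and some handling of negation, but the structure stays the same.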

3.4 Research Analytics Service

The analytics service calculates prevalence with Wilson confidence intervals, providing accurate bounds for proportions. It implements chi-square tests for associations, Cramér's V for effect sizes, and odds ratios for relationship strength, all following standard statistical methodologies.
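
The association measures named above can be computed with SciPy as in the sketch below. The wrapper name `association_stats` and its return format are hypothetical, and the continuity correction is turned off here so that Cramér's V matches the formula in Appendix B.2.

```python
import numpy as np
from scipy.stats import chi2_contingency

def association_stats(table):
    """Chi-square test, Cramér's V, and (for 2x2 tables) the odds ratio."""
    table = np.asarray(table, dtype=float)
    chi2, p_value, dof, _expected = chi2_contingency(table, correction=False)
    n = table.sum()
    cramers_v = float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
    result = {"chi2": float(chi2), "p_value": float(p_value),
              "dof": int(dof), "cramers_v": cramers_v}
    if table.shape == (2, 2):
        a, b, c, d = table.ravel()
        # A zero in b or c makes the ratio undefined; report infinity.
        result["odds_ratio"] = float(a * d / (b * c)) if b * c > 0 else float("inf")
    return result
```

For a 2x2 table with rows [30, 10] and [10, 30], this gives χ² = 20, V = 0.5, and an odds ratio of 9.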

4. System Architecture

4.1 Bias Control Pipeline

5. Research Applications and Capabilities

5.1 Query Processing Capabilities

LERAGE supports multiple research query types:

Trigger Analysis: "What are the most common misophonia triggers?"
Coping Strategies: "How do people cope with misophonia symptoms?"
Causal Investigation: "What causes misophonia to develop?"
Temporal Analysis: "How does misophonia change from childhood to adulthood?"
Hypothesis Testing: "Misophonia symptoms worsen with stress"

Each query type activates appropriate search strategies and bias controls.

5.2 Statistical Analysis Features

The system provides prevalence calculation with confidence intervals, association testing with effect sizes, and generation of publication-ready tables through the command-line interface.

5.3 Bias Control Options

Researchers can exclude specific contributors completely, reduce contributor influence by weighting, filter for primary sources only (excluding speculation), and control diversity versus relevance balance through the MMR lambda parameter.

6. Methodological Approach

6.1 MMR for Qualitative Research

This work demonstrates the application of Maximal Marginal Relevance to qualitative health research, providing:

Mathematical Framework: Objective balance between relevance and source diversity
Configurable Trade-offs: Researchers adjust λ parameter for their specific needs
Reproducible Selection: Deterministic results given identical inputs
Source Distribution Metrics: Shannon entropy and diversity statistics
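
As an illustration of the source distribution metric, Shannon entropy over the episodes behind a result set could be computed as below. The function names `source_entropy` and `normalized_entropy` are hypothetical, and the normalized variant is an added convenience rather than something described above.

```python
import math
from collections import Counter

def source_entropy(episode_ids) -> float:
    """Shannon entropy (in bits) of how results are spread across episodes."""
    counts = Counter(episode_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over episodes; 0.0 when one episode supplies all.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def normalized_entropy(episode_ids) -> float:
    """Entropy divided by its maximum, giving a 0-1 evenness score."""
    k = len(set(episode_ids))
    if k <= 1:
        return 0.0
    return source_entropy(episode_ids) / math.log2(k)
```

Higher values mean the selected chunks draw more evenly on many episodes, which is what lowering the MMR λ is meant to encourage.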

6.2 Weighted Bias Control

Unlike binary exclusion approaches, weighted bias control offers:

Information Preservation: Speculation downweighted, not eliminated
Configurable Influence: Researchers set appropriate weight factors
Transparency: All adjustments logged and reported
Nuanced Control: Different weights for different types of bias

6.3 Hybrid Search for Conversational Data

The combination of semantic, temporal, and causal search strategies provides:

Query-Adaptive Weighting: Automatic strategy selection based on query characteristics
Complementary Coverage: Each strategy captures different aspects of the data
Unified Ranking: Combined scoring across all search modalities
Research Question Optimization: Tailored retrieval for different research needs

7. Ethical Considerations

7.1 Privacy Protection

Aggregate analysis only with source anonymization
No extraction or storage of personal identifiers
Episode-level attribution without individual identification

7.2 Bias Transparency

All bias adjustments explicitly documented in outputs
Contributor exclusions and weight factors reported
Speculation detection confidence scores provided
Clear methodology for all bias control decisions

8. Technical Specifications

8.1 Core Technologies

Embedding Model: BAAI/bge-large-en-v1.5 (1.3GB)
Vector Database: ChromaDB with cosine similarity indexing
Language Model: Routable to any LLM, including local models, for privacy and control
Statistical Computing: SciPy, NumPy for statistical analysis
Web Framework: FastAPI for RESTful API services

8.2 Deployment Architecture

The system uses Docker Compose for service orchestration, with ChromaDB for vector storage and a FastAPI-based API service. Configuration is managed through environment variables including API keys and model specifications.

9. Command-Line Interface

9.1 Core Commands

# Data Management
lerage data stats                              # System statistics
lerage data ingest /path --exclude-researchers # Ingest with filtering
lerage data ingestion-status                   # Check processing status

# Analysis Commands
lerage analyze search "question"               # Basic search
lerage analyze search "question" --exclude-author name --mmr-lambda 0.7
lerage analyze prevalence feature1 feature2    # Prevalence analysis
lerage analyze association feature1 feature2   # Association testing
lerage analyze triggers --sufferers-only       # Common trigger analysis
lerage analyze coping --sufferers-only         # Coping strategy patterns
lerage analyze onset --sufferers-only          # Age of onset analysis
lerage analyze patterns                        # Comprehensive pattern extraction

# Content Commands
lerage content find-highlights <slug>          # AI clip discovery
lerage content extract-text-assets <slug>      # Quote generation
lerage content process-clips <slug>            # Video production
lerage content generate-schedule               # Social media calendar
lerage content seo-audit                       # SEO optimization

9.2 Bias Control Parameters

--exclude-author NAME       # Complete contributor exclusion
--downweight-author FACTOR  # Weight multiplier (0.1-1.0)
--mmr-lambda VALUE          # Diversity control (0.0-1.0)
--primary-sources           # Exclude speculation
--min-sources N             # Minimum unique episodes

10. Conclusion

LERAGE demonstrates how computational techniques can extend traditional qualitative research methods. The system's components (MMR diversity optimization, weighted bias control, hybrid search strategies, and automated research statistics) address challenges in scaling qualitative analysis while maintaining scientific rigor.

Key contributions include:

MMR Application to Health Research: Mathematical framework for systematic source diversity in qualitative studies
Weighted Bias Control: Approach preserving information while reducing researcher influence
Hybrid Search Architecture: Multi-strategy retrieval optimized for different research questions
Research-Grade Automation: Statistical analysis with appropriate confidence intervals and effect sizes

The system has successfully processed extensive podcast content, enabling researchers to extract systematic insights from conversational data at scale. By making experiential data accessible for rigorous analysis, LERAGE opens possibilities for understanding misophonia and potentially other conditions where qualitative data exists in unstructured formats.

Appendix A: Configuration Reference

A.1 Environment Variables

# Core API Settings
OPENAI_API_KEY=your_api_key_here
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
LLM_MODEL=gpt-4

# Chunking Parameters
CHUNK_SIZE=600
CHUNK_OVERLAP=150

# Bias Control Defaults
DEFAULT_DOWNWEIGHT_FACTOR=0.5
DEFAULT_MMR_LAMBDA=0.7
DEFAULT_TOP_K=40

# Statistical Settings
CONFIDENCE_LEVEL=0.95
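
A service might read these variables at startup roughly as follows. This is a sketch only: the `Settings` class and `load_settings` function are hypothetical, the fallback defaults mirror the values listed above, and the API key is deliberately left out of the settings object so it is not logged by accident.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    embedding_model: str
    llm_model: str
    chunk_size: int
    chunk_overlap: int
    downweight_factor: float
    mmr_lambda: float
    top_k: int
    confidence_level: float

def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    s = Settings(
        embedding_model=env.get("EMBEDDING_MODEL", "BAAI/bge-large-en-v1.5"),
        llm_model=env.get("LLM_MODEL", "gpt-4"),
        chunk_size=int(env.get("CHUNK_SIZE", "600")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "150")),
        downweight_factor=float(env.get("DEFAULT_DOWNWEIGHT_FACTOR", "0.5")),
        mmr_lambda=float(env.get("DEFAULT_MMR_LAMBDA", "0.7")),
        top_k=int(env.get("DEFAULT_TOP_K", "40")),
        confidence_level=float(env.get("CONFIDENCE_LEVEL", "0.95")),
    )
    if not 0 <= s.chunk_overlap < s.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    if not 0.1 <= s.downweight_factor <= 1.0:
        raise ValueError("DEFAULT_DOWNWEIGHT_FACTOR must be within 0.1-1.0")
    if not 0.0 <= s.mmr_lambda <= 1.0:
        raise ValueError("DEFAULT_MMR_LAMBDA must be within 0.0-1.0")
    if not 0.0 < s.confidence_level < 1.0:
        raise ValueError("CONFIDENCE_LEVEL must be between 0 and 1")
    return s
```

Validating at load time surfaces a bad value, such as an overlap larger than the chunk size, before any transcripts are processed.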

A.2 Bias Control Schema

The bias configuration includes options for excluding contributors, setting downweight factors, adjusting MMR lambda, filtering for primary sources only, and excluding speculation.

Appendix B: Statistical Formulas

B.1 Wilson Score Confidence Interval

CI = (p̂ + z²/(2n) ± z√(p̂(1-p̂)/n + z²/(4n²))) / (1 + z²/n)

Where p̂ is the sample proportion, z is the standard normal critical value for the chosen confidence level (about 1.96 for 95%), and n is the sample size.
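
The formula translates directly into code. The sketch below uses only the Python standard library; the function name `wilson_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a proportion, as (lower, upper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denominator = 1 + z**2 / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (centre - margin) / denominator, (centre + margin) / denominator
```

For 10 successes out of 20 at 95% confidence this gives roughly (0.299, 0.701). Unlike the normal approximation, the bounds stay inside [0, 1] even when the observed proportion is 0 or 1.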

B.2 Cramér's V Effect Size

V = √(χ²/(n × (min(rows,cols) - 1)))

Standardized measure of association strength between categorical variables.

B.3 MMR Selection Score

MMR(Di) = λ × Sim(Di, Q) - (1-λ) × max[Sim(Di, Dj)]

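Where Q is the query and the maximum is taken over all already-selected documents Dj. The first term rewards relevance to the query; the second penalizes similarity to content that has already been selected.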

LERAGE: Advancing misophonia research through computationally enhanced, bias-controlled analysis of lived experiences.