Fetch RSS feeds from 14 AI/ML news sources and write a structured JSON snapshot to disk.
- RSS snapshot exists
- 14 feeds fetched
- No HTTP errors
Feed Sources
The pipeline ingests from curated AI/ML publications, including Hacker News, arXiv, MIT Technology Review, and selected newsletters. On startup, each source is checked for a well-formed RSS feed and for rate-limit headers.
Error Handling
Transient HTTP 429 responses trigger exponential backoff (max 3 retries). Persistent failures are logged and the source is marked unavailable without failing the overall task.
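The retry policy above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `fetch_with_backoff`, the 1 s base delay, the doubling schedule, and the choice to treat any non-429 HTTP error as persistent are all assumptions. The `opener` and `sleep` parameters exist only so the sketch can be exercised without network access.

```python
import time
import urllib.error
import urllib.request

MAX_RETRIES = 3  # from the text: at most 3 retries


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep, base_delay=1.0):
    """Return the feed body, or None if the source should be marked unavailable."""
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus up to 3 retries
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == MAX_RETRIES:
                # Persistent failure: log it and give up on this source only,
                # so the overall task keeps running.
                print(f"source unavailable: {url} (HTTP {err.code})")
                return None
            # Transient 429: wait 1 s, 2 s, 4 s (assumed schedule), then retry.
            sleep(base_delay * 2 ** attempt)
    return None
```

A caller would loop over the feed URLs and record a `None` result as "unavailable" in the snapshot rather than raising.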
Apply topic modeling to the ingested articles using lightweight embeddings and k-means clustering. Assign a cluster label to each article and store the resulting partition.
- All articles assigned to a cluster
- ≥ 8 clusters non-empty
- Silhouette score > 0.3
- Output files written
The clustering step uses sentence-transformers for embeddings and scikit-learn's MiniBatchKMeans for efficiency. Articles are deduplicated before clustering using SimHash to catch near-duplicates across feeds.
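To make the deduplication step concrete, here is a small standard-library sketch of SimHash-based near-duplicate filtering. It is an illustration under stated assumptions, not the pipeline's implementation: the word-level tokenization, the 64-bit fingerprint, the BLAKE2b token hash, and the Hamming-distance cutoff of 3 are all hypothetical choices, and the pairwise comparison is quadratic, which is acceptable only for small batches.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set when the positive votes outnumber the negative ones.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(articles, max_distance=3):
    """Keep the first article of each near-duplicate group (assumed policy)."""
    kept, fingerprints = [], []
    for article in articles:
        fingerprint = simhash(article)
        if all(hamming(fingerprint, seen) > max_distance for seen in fingerprints):
            kept.append(article)
            fingerprints.append(fingerprint)
    return kept
```

Only the articles returned by `deduplicate` would then be embedded and clustered, so a story syndicated across several feeds does not inflate its cluster.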
Spawn one script-generation task per cluster. Each child task receives its cluster's articles and writes a narrative script in Markdown.
Narrative summary of the AI Research cluster — 22 articles covering LLM advances, alignment research, and safety.
- File written > 500 chars
- Contains 4+ article citations
Narrative summary of the Industry News cluster — 31 articles covering product launches, funding rounds, and market moves.
- File written > 500 chars
- Contains 6+ article citations
Narrative summary of the Open Source cluster — 17 articles covering library releases, community milestones, and tooling.
Run linting, quality scoring, and completeness checks against all generated scripts. Fails if any script scores below the quality threshold.
- All scripts pass lint
- Quality score > 0.7
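The gate described by these checks could look roughly like the sketch below. The linting and scoring themselves are not specified in this document, so the function takes their results as input; the `validate_scripts` name, the result layout, and the file names in the usage note are hypothetical.

```python
QUALITY_THRESHOLD = 0.7  # from the acceptance criterion "Quality score > 0.7"


def validate_scripts(results, threshold=QUALITY_THRESHOLD):
    """Return the sorted names of scripts that fail validation.

    `results` maps a script name to a (lint_passed, quality_score) pair.
    A script passes only if lint succeeded and its score is strictly
    greater than the threshold.
    """
    return sorted(
        name
        for name, (lint_passed, score) in results.items()
        if not lint_passed or not score > threshold
    )
```

For example, `validate_scripts({"ai_research.md": (True, 0.82), "open_source.md": (False, 0.9)})` reports only `open_source.md`, and the task would fail whenever the returned list is non-empty.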
Dispatch the curated digest to 847 subscribers via the configured email provider. Uses batched API calls to respect rate limits.
- SMTP connection successful
- All 847 emails accepted by provider
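As a rough sketch of the batched dispatch, the code below splits the recipient list into fixed-size batches and counts how many addresses the provider accepted. The batch size of 100 and the `send_batch` callable (standing in for the email provider's API and returning the number of accepted recipients) are assumptions; the document does not name the provider or its limits.

```python
def batched(recipients, batch_size):
    """Yield consecutive slices of at most `batch_size` recipients."""
    for start in range(0, len(recipients), batch_size):
        yield recipients[start:start + batch_size]


def dispatch(recipients, send_batch, batch_size=100):
    """Send the digest batch by batch and return the total accepted count.

    `send_batch` is a hypothetical callable wrapping the provider API;
    it receives one batch and returns how many recipients were accepted.
    """
    accepted = 0
    for batch in batched(recipients, batch_size):
        accepted += send_batch(batch)
    return accepted
```

With 847 subscribers and a batch size of 100, this produces nine calls: eight full batches and a final batch of 47. The "all emails accepted" check then reduces to comparing the return value with `len(recipients)`.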
Failure Analysis
The SMTP relay rejected the batch submission with the error "550 5.7.1 Message rejected due to content policy". The digest HTML likely contains a pattern flagged by the relay's content filter. Retry with sanitized HTML, or skip the SMTP check when running in dry-run mode.