project-hn-aggregator
Sunday, February 8, 2026
Cedric 4:21 PM
Found the issue! Here's my analysis:
:mag: Problem: Summaries Not Being Generated
Root Cause: The job pipeline has a race condition / missing dependency between article fetching and summarization.
What's happening:
1. When a new post arrives (
• :whitecheckmark:
• :x:
2. The summarizer runs and checks for article content, but the article was never fetched:
3. Result: Posts 311-325 (and many others) show: •
•
•
Stats from production (
• 325 posts total
• 325 marked as "summarized"
• Only 151 have actual summaries (46%)
• 211 articles exist, 185 with content — but 114 posts have no article at all
The Bug (in
Also missing: dependency logic — summarization should wait until article fetch completes.
Production is running fine (server up, scheduler active), it's just that the pipeline is broken by design. Want me to proceed with a fix?
10 replies:mag: Problem: Summaries Not Being Generated
Root Cause: The job pipeline has a race condition / missing dependency between article fetching and summarization.
What's happening:
1. When a new post arrives (
syncTopStories()), it queues: • :whitecheckmark: summarizearticle• :whitecheckmark:
fetchcomments• :x:
fetcharticle — NOT QUEUED2. The summarizer runs and checks for article content, but the article was never fetched:
[Summarizer] No content for post 324, skipping LLM
It then marks summarized=true with null summaries to avoid infinite retries.3. Result: Posts 311-325 (and many others) show: •
articlefetched=0 — never attempted•
summarized=1 — marked complete•
summaryshort=NULL — no actual summaryStats from production (
hn.sandscape.cc):• 325 posts total
• 325 marked as "summarized"
• Only 151 have actual summaries (46%)
• 211 articles exist, 185 with content — but 114 posts have no article at all
The Bug (in
jobs.ts):
// queueNewPost() - line ~58
export async function queueNewPost(postId: number, score: number): Promise<void> {
const priority = Math.min(100, Math.floor(score / 10));
await queueArticleSummary(postId, priority);
await queueCommentFetch(postId, priority);
// ❌ MISSING: await queueJob('fetcharticle', postId, priority);
}
Also missing: dependency logic — summarization should wait until article fetch completes.
Production is running fine (server up, scheduler active), it's just that the pipeline is broken by design. Want me to proceed with a fix?
Marian 7:23 PM
@Cedric
Maybe we overcomplicated the project and now it is not really working.
So analyse it and tell me what is working and what not:
1. periodic fetching and storing the links (when, what position, how many points and comments, etc.)
2. fetching and parsing the content to text form
3. storing and displaying images along with the article
4. Summarize article (short, long summary)
5. fetching the comments
6. summarize comments (short, long form)
7. daily digest (short, long form)
8. job queue visualization
9. tagging the posts
10. embedding the articles and storing in vector db
11. retrieving by semantic search
did i forget something?
I see some of these are not working, so please dive deep also inside the current prod db. You can see its not ok.
25 repliesMaybe we overcomplicated the project and now it is not really working.
So analyse it and tell me what is working and what not:
1. periodic fetching and storing the links (when, what position, how many points and comments, etc.)
2. fetching and parsing the content to text form
3. storing and displaying images along with the article
4. Summarize article (short, long summary)
5. fetching the comments
6. summarize comments (short, long form)
7. daily digest (short, long form)
8. job queue visualization
9. tagging the posts
10. embedding the articles and storing in vector db
11. retrieving by semantic search
did i forget something?
I see some of these are not working, so please dive deep also inside the current prod db. You can see its not ok.
Cedric 7:26 PM
Let me dive deep into the codebase and production DB to give you an honest assessment.
Cedric 7:26 PM
Table is
posts not stories. Let me query properly.Cedric 7:26 PM
Now let me check the scheduler and recent logs to see what's actually running.
Cedric 7:26 PM
Now let me check the summarizer issue (short summaries failing) and what the front-end looks like:
Cedric 7:26 PM
Alright, I've done a thorough audit of the prod DB. Here's the honest breakdown:
Marian 8:03 PM
Where?
Marian 10:20 PM
Navigation needs a little improvement. If i click to admin, i see no way back. I can not find where we have our jobs overview. So admin should have inside all the admin things: settings, jobs, posts overview, etc.
Marian 10:20 PM
@Cedric hook yourself up to this channel so you respond to all messages, not just mentions. and respond to my previous messgae
Cedric 10:21 PM
I see — channel
C0ADJDA9ZFD (this channel) isn't in the slack.channels config, so it defaults to requiring a mention. I need to add requireMention: false for it.Cedric 10:21 PM
Done — I've set
On the navigation feedback: you're right, the admin section is missing a back button and the admin pages (settings, jobs, posts overview) aren't consolidated under one nav. I'll address that along with the other fixes. Want me to start working on the nav improvements and the broken features I identified in my earlier audit?
requireMention: false for this channel. Gateway is restarting now. I'll respond to all messages here from now on, no @mention needed.On the navigation feedback: you're right, the admin section is missing a back button and the admin pages (settings, jobs, posts overview) aren't consolidated under one nav. I'll address that along with the other fixes. Want me to start working on the nav improvements and the broken features I identified in my earlier audit?
Marian 11:04 PM
Yes
Cedric 11:04 PM
On it. I've kicked off two parallel tasks:
1. Admin navigation overhaul — adding a proper nav bar site-wide, consolidating admin pages (dashboard, jobs, settings) with back-to-site links
2. Summarizer fix — diagnosing why summaries are truncated garbage and fixing the prompts/parsing
I'll report back when they're done and deploy together.
1. Admin navigation overhaul — adding a proper nav bar site-wide, consolidating admin pages (dashboard, jobs, settings) with back-to-site links
2. Summarizer fix — diagnosing why summaries are truncated garbage and fixing the prompts/parsing
I'll report back when they're done and deploy together.
Cedric 11:04 PM
Let me kick off the work. I'll start with two parallel tasks: fixing the admin navigation and fixing the broken summarizer.
Cedric 11:06 PM
Good, summarizer fix is done. Let me check on the nav task.
Cedric 11:09 PM
Both changes are deployed:
Navigation — Site now has a top nav bar (Home, Digest, Search, Admin) on all pages. Admin has its own nav with Dashboard, Jobs Queue, Settings, and a "← Back to site" link. Settings moved under
Summarizer — Root cause was a combo of:
Check it out: https://hn.sandscape.cc/admin
Navigation — Site now has a top nav bar (Home, Digest, Search, Admin) on all pages. Admin has its own nav with Dashboard, Jobs Queue, Settings, and a "← Back to site" link. Settings moved under
/admin/settings.Summarizer — Root cause was a combo of:
maxTokens: 60 on discussion summaries (lol), a regex that threw away everything after the first line, and vague prompts. Rewrote it to use structured JSON output from Gemini with explicit character requirements. Just reset 246 posts with bad summaries — the job queue will re-process them over the next few hours (5 every 2 minutes).Check it out: https://hn.sandscape.cc/admin
Cedric 11:09 PM
Nav is deployed already — now let me trigger the summarizer reset since that's deployed too.
Cedric 11:09 PM
The code has
fix=reset but the error suggests an older deploy. Let me check if the nav deploy included the summarizer changes too:Cedric 11:09 PM
Summarizer commit is there but might not have been pushed. Let me deploy:
Cedric 11:09 PM
Deploy succeeded (the ref error is just Dokku's tracking branch being stale — the app is running). Now let me trigger the summary reset:
Monday, February 9, 2026
Marian 1:52 AM
Summaries still suck. Check them. what model does do them?
Cedric 1:55 AM
Fixed. The problem was two things:
1. Model: was using
2. Prompts: were asking for character counts (50-150 chars) which made the model awkwardly cut off sentences to hit a target. Switched to word-based guidance (20-40 words short, 80-150 words long) which produces natural, complete summaries.
Just tested — new summaries look like this:
> "The author laments the rise of AI in programming, acknowledging its effectiveness while mourning the loss of the hands-on craft and personal satisfaction that came with traditional coding."
264 posts are queued for re-summarization, the job queue will chew through them (5 every 2 minutes). Should all be done within 2 hours.
1. Model: was using
gemini-3-flash-preview which doesn't support native JSON mode — it kept truncating JSON mid-string, causing parse failures. Switched to gemini-2.0-flash with responseMimeType: 'application/json' so it returns clean structured JSON every time.2. Prompts: were asking for character counts (50-150 chars) which made the model awkwardly cut off sentences to hit a target. Switched to word-based guidance (20-40 words short, 80-150 words long) which produces natural, complete summaries.
Just tested — new summaries look like this:
> "The author laments the rise of AI in programming, acknowledging its effectiveness while mourning the loss of the hands-on craft and personal satisfaction that came with traditional coding."
264 posts are queued for re-summarization, the job queue will chew through them (
Cedric 1:55 AM
OK, I see the issues clearly now. Two problems:
1. Model:
2. Some summaries that DO work are truncating mid-sentence (e.g. "that enhances coding." / "GPT-5.3-Codex is a new, faster agentic model that combines advanced coding.") — the 50-150 char constraint for short summaries is too tight and the model awkwardly truncates to fit.
Let me fix both — use Gemini's native JSON response mode and relax the constraints:
1. Model:
gemini-3-flash-preview — the JSON output is getting truncated because this model doesn't support native JSON mode, so generateJSON is just asking in the prompt and hoping for the best. The response gets cut off mid-JSON → parse failure → 264 posts stuck unsummarized.2. Some summaries that DO work are truncating mid-sentence (e.g. "that enhances coding." / "GPT-5.3-Codex is a new, faster agentic model that combines advanced coding.") — the 50-150 char constraint for short summaries is too tight and the model awkwardly truncates to fit.
Let me fix both — use Gemini's native JSON response mode and relax the constraints:
Cedric 1:55 AM
Now fix the default model in
generate() too and update the summary prompts to be more natural:Cedric 1:55 AM
Now fix the summarizer prompts — the character constraints are causing awkward cutoffs:
Cedric 1:55 AM
Now update the validation to match word-based constraints instead of character-based:
Cedric 1:55 AM
Now deploy and reset:
Cedric 1:55 AM
Deployed. Now reset the bad summaries so they get reprocessed with the new model:
Cedric 1:55 AM
Zero reset because the previous reset already cleared them. Let me check how many are pending and trigger a batch manually to verify the new model works:
Cedric 1:55 AM
264 still need summarizing. Let me trigger a small batch to verify quality:
Cedric 1:55 AM
Night and day difference. These are proper, complete summaries now.
Marian 1:57 AM
yes, limit is stupid. should be quite longer
Cedric 1:57 AM
The short summaries are already coming out at 200-280 chars now (2-3 sentences), and the long ones at 500-580 chars. Are you saying you want them even longer than that? Like how long are you thinking — a proper paragraph for short, and multiple paragraphs for long?
Marian 1:59 AM
No, let us see
Cedric 2:00 AM
:+1: Let's let the queue run and see how they look on the site once more are done.
Marian 9:30 AM
Not all articles have summaries, yet none are pending
Cedric 9:30 AM
Found the bug — the reset endpoint cleared the
summarized flag on posts but didn't re-queue actual jobs. The old jobs were already marked done so nothing picked them up. Just manually queued 246 summarize jobs. They should start processing now (5 every 2 minutes, ~1.5 hours to finish all).Cedric 9:33 AM
264 unsummarized, 246 have articles fetched, but the fetch queue is empty — nothing is queuing
summarize_article jobs for them. Let me check how summarization jobs get queued:Cedric 9:33 AM
I see the problem. The
Let me queue summarization jobs for all unsummarized posts that have articles:
summarize_article jobs from the earlier reset were already in the queue as done from before. The queueJob function only checks for pending or processing status — but when the reset endpoint cleared summarized=false on the posts, it didn't re-queue any jobs. The old done jobs just sat there and no new ones were created.Let me queue summarization jobs for all unsummarized posts that have articles:
Cedric 9:33 AM
I should also fix the reset endpoint to actually queue jobs. Let me do that:
Marian 9:37 AM
How hard would be to add different sources? (youtube channels - we will watch some and get summaries from gemini, twitter - we will follow some people and summarize their posts and some comments to those). Think about it and give me a vision how to integrate it with our current app and make a plan. Do not implement yet
Cedric 9:37 AM
Good question. Here's my thinking:
The Core Idea
Right now everything is HN-specific — the schema, fetchers, UI all assume "HN post with article + comments." To add YouTube and Twitter, we need to generalize into a multi-source content aggregator where HN is just one source.
Architecture Vision
Source abstraction layer — Each source (HN, YouTube, Twitter) implements a common interface:
•
•
•
What changes in the DB:
• Add a
• Add
• Articles table stays the same — it's already generic (url + markdown)
• Comments table needs a
What stays the same:
• Summarizer — it just takes text, doesn't care where it came from
• Tags, embeddings, search — all source-agnostic already
• Job queue — already generic, just add new job types
• Daily digest — works on posts regardless of source
• The entire UI — just needs source badges/filters
Source-Specific Details
YouTube Channels
• Discovery: YouTube Data API v3 or RSS feeds (
• Content: Gemini can process YouTube URLs directly for transcription + summarization (it has native video understanding). Alternatively, use
• Discussion: YouTube Data API for comments (needs API key, has quota limits)
• Config: List of channel IDs to follow, polling interval
Twitter/X Accounts
• This is the hard one. Twitter API is expensive ($100/mo basic) and rate-limited. Alternatives: • Nitter instances (scraping) — unreliable, most are dead
• RSS bridges like
• Apify/scraping services — paid but works
• Manual RSS via services like Feedbin or Miniflux that already handle Twitter
• Content: Tweets are short, but threads need to be reconstructed. For linked articles, we already have the article fetcher
• Discussion: Replies/quote tweets — hardest part, API almost required
• Config: List of usernames to follow
Implementation Plan
Phase 1 — Generalize the schema (1-2 hours)
• Add
• Add
• Migrate existing data (all current posts get
• Update UI to show source badges
Phase 2 — YouTube (2-3 hours)
• RSS-based channel discovery (free, no API key)
• Gemini video summarization (we already have the API key)
• YouTube comment fetching (needs API key)
• Admin UI to add/remove channels
Phase 3 — Twitter (3-5 hours, depending on API approach)
• Pick an ingestion method (API vs scraping vs RSS bridge)
• Tweet/thread reconstruction
• Reply fetching
• Admin UI for followed accounts
Phase 4 — Unified experience (1-2 hours)
• Mixed-source feed on homepage (filterable by source)
• Cross-source daily digest
• Source-specific views
Difficulty Rating
• YouTube: Medium — RSS discovery is free, Gemini handles video natively, main cost is the API key for comments
• Twitter: Hard — API access is the bottleneck. Everything else is straightforward once you can get the data
• Schema generalization: Easy — mostly additive changes, no breaking modifications
My Recommendation
Start with YouTube — it's more useful (long-form content → great summaries) and doesn't have the API access problem Twitter has. We can add Twitter later once we decide how to solve the data access question.
Want me to start with Phase 1 (generalizing the schema)?
The Core Idea
Right now everything is HN-specific — the schema, fetchers, UI all assume "HN post with article + comments." To add YouTube and Twitter, we need to generalize into a multi-source content aggregator where HN is just one source.
Architecture Vision
Source abstraction layer — Each source (HN, YouTube, Twitter) implements a common interface:
•
fetchNew() → discover new content•
fetchContent(item) → get the actual content/transcript/thread•
fetchDiscussion(item) → get comments/repliesWhat changes in the DB:
• Add a
sources table (id, type, config — e.g. which YT channels, which Twitter accounts)• Add
source and sourceid columns to posts (currently hnid is HN-specific)• Articles table stays the same — it's already generic (url + markdown)
• Comments table needs a
source column tooWhat stays the same:
• Summarizer — it just takes text, doesn't care where it came from
• Tags, embeddings, search — all source-agnostic already
• Job queue — already generic, just add new job types
• Daily digest — works on posts regardless of source
• The entire UI — just needs source badges/filters
Source-Specific Details
YouTube Channels
• Discovery: YouTube Data API v3 or RSS feeds (
/feeds/videos.xml?channel_id=X) — RSS is free, no API key needed• Content: Gemini can process YouTube URLs directly for transcription + summarization (it has native video understanding). Alternatively, use
youtube-transcript npm package for text transcripts• Discussion: YouTube Data API for comments (needs API key, has quota limits)
• Config: List of channel IDs to follow, polling interval
Twitter/X Accounts
• This is the hard one. Twitter API is expensive ($100/mo basic) and rate-limited. Alternatives: • Nitter instances (scraping) — unreliable, most are dead
• RSS bridges like
rss-bridge — self-hostable, moderate reliability• Apify/scraping services — paid but works
• Manual RSS via services like Feedbin or Miniflux that already handle Twitter
• Content: Tweets are short, but threads need to be reconstructed. For linked articles, we already have the article fetcher
• Discussion: Replies/quote tweets — hardest part, API almost required
• Config: List of usernames to follow
Implementation Plan
Phase 1 — Generalize the schema (1-2 hours)
• Add
source field to posts (hn, youtube, twitter)• Add
sources config table• Migrate existing data (all current posts get
source='hn')• Update UI to show source badges
Phase 2 — YouTube (2-3 hours)
• RSS-based channel discovery (free, no API key)
• Gemini video summarization (we already have the API key)
• YouTube comment fetching (needs API key)
• Admin UI to add/remove channels
Phase 3 — Twitter (3-5 hours, depending on API approach)
• Pick an ingestion method (API vs scraping vs RSS bridge)
• Tweet/thread reconstruction
• Reply fetching
• Admin UI for followed accounts
Phase 4 — Unified experience (1-2 hours)
• Mixed-source feed on homepage (filterable by source)
• Cross-source daily digest
• Source-specific views
Difficulty Rating
• YouTube: Medium — RSS discovery is free, Gemini handles video natively, main cost is the API key for comments
• Twitter: Hard — API access is the bottleneck. Everything else is straightforward once you can get the data
• Schema generalization: Easy — mostly additive changes, no breaking modifications
My Recommendation
Start with YouTube — it's more useful (long-form content → great summaries) and doesn't have the API access problem Twitter has. We can add Twitter later once we decide how to solve the data access question.
Want me to start with Phase 1 (generalizing the schema)?
Marian 10:01 AM
I don't think we need to mix them. Youtube overview and summaries may stay separate from HN. Twitter too. Youtube does not need comments, there is nothing of value in youtube comments. Would that be easier?
Cedric 10:02 AM
Yeah, significantly easier. If they're separate sections rather than mixed into one feed, we barely need to touch the existing HN code at all. Each source becomes its own mini-app within the site:
• YouTube: channels table, videos table, transcript + Gemini summary. Own page at
• Twitter: accounts table, tweets table, summary. Own page at
The only shared pieces would be the nav bar, the Gemini summarizer, and maybe tags/search down the road.
YouTube could realistically be done in a couple hours — it's basically: subscribe to RSS feeds → detect new videos → send to Gemini for summary → display. No article fetching, no comments, no discussion summaries. Dead simple compared to what HN already does.
Want me to build the YouTube section first?
• YouTube: channels table, videos table, transcript + Gemini summary. Own page at
/youtube. RSS for discovery, no comments, no API key needed.• Twitter: accounts table, tweets table, summary. Own page at
/twitter. This one still has the data access problem though.The only shared pieces would be the nav bar, the Gemini summarizer, and maybe tags/search down the road.
YouTube could realistically be done in a couple hours — it's basically: subscribe to RSS feeds → detect new videos → send to Gemini for summary → display. No article fetching, no comments, no discussion summaries. Dead simple compared to what HN already does.
Want me to build the YouTube section first?
Thursday, February 12, 2026
Marian 12:09 PM
disable ip blocking here
Cedric 12:10 PM
No IP blocking at the nginx/Dokku level. Must be in the app itself.
Cedric 12:10 PM
There's no IP blocking in the app code. Could be at the Dokku/nginx global level or firewall:
Cedric 12:10 PM
Found it — there's an
ip-restrict.conf in the nginx config that only allows one IP (185.29.157.34) and blocks everyone else with a 403.Cedric 12:10 PM
It's already gone — someone already deleted it. But nginx might still have the old config cached: