project-antrum-2
Thursday, February 19, 2026
Cedric 1:13 AM
Zero detail requests in scrape_logs! The album detail fetching goes through Bandcamp.fetch() which does its own Req.get — it doesn't go through Http.post which writes the scrape logs. So the detail workers ARE checking out proxies but they're logged differently... or not at all.

The real issue: Bandcamp.fetch() handles proxy checkin itself but doesn't write scrape_logs. But more importantly — 0 detail requests means the workers aren't even making it to the HTTP request. They're stuck on proxy checkout because discovery is hogging all the proxies.

Let me check how many discovery workers are running:
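For the record, the check is roughly this, run from a remote IEx shell (a sketch; it assumes the standard Oban.Job schema and the app's Repo):

```elixir
import Ecto.Query

# Count jobs currently in the "executing" state, grouped by queue.
Repo.all(
  from j in Oban.Job,
    where: j.state == "executing",
    group_by: j.queue,
    select: {j.queue, count(j.id)}
)
# e.g. [{"detail", 12}, {"discovery", 59}]
```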
Cedric 1:13 AM
59 discovery workers executing + 12 detail workers = 71 workers all competing for proxy checkout from a single GenServer. The GenServer is the bottleneck — it's serializing 71+ concurrent checkout/checkin calls.
But also — 59 discovery workers is way too many. That's eating all the proxy slots. The queue limit should be capping this. Let me check the actual Oban queue limits:
Cedric 1:13 AM
Concurrency is 10 but 59 are executing — the queue scaling is broken again. The scheduler sets queue limit to 80% of pool size (~480), so Oban allows 480 concurrent discovery jobs. Way too high.
The fix: detail workers need their own dedicated proxy slots, not competing with discovery. Simplest approach: reduce the discovery queue limit drastically and give detail workers priority. Let me cap the discovery Oban queue properly:
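Roughly what the runtime cap looks like (a sketch; the queue atoms come from this thread, the limits are the numbers discussed above, not the actual config):

```elixir
# Oban.scale_queue/1 changes a running queue's concurrency limit without a redeploy.
Oban.scale_queue(queue: :discovery, limit: 10)
Oban.scale_queue(queue: :detail, limit: 12)
```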
Cedric 1:13 AM
Also fix the same in set_discovery_concurrency:
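A hypothetical reconstruction of that helper, assuming it lives in the scheduler (the name comes from this thread; the body is a sketch):

```elixir
# Cap the discovery queue at per-source concurrency x number of sources (10 x 6 = 60),
# instead of 80% of the proxy pool (~480) as before. Everything except Oban is an assumption.
defp set_discovery_concurrency(source_count, per_source_limit \\ 10) do
  limit = min(source_count * per_source_limit, 60)
  Oban.scale_queue(queue: :discovery, limit: limit)
end
```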
Cedric 1:13 AM
Now let me verify the fix is working — wait 30s for it to stabilize:
Cedric 1:13 AM
89 discovery executing?! It got worse. The old workers from before the deploy are still running (Oban keeps executing in the old process until drain timeout). But also — with concurrency=10 and 6 sources, queue limit should be 60 max. Something's off.
Actually, the issue is that the deploy didn't drain old jobs. The 89 executing workers are from the pre-deploy container that still has 60s to shut down. Let me wait for old container shutdown and check again:
Cedric 1:13 AM
The sleep && ssh pattern seems to hang. Let me try without sleep:
Cedric 1:13 AM
Still 87 discovery workers. The queue limit was set on startup but these jobs were already in "executing" state from the previous deployment. They won't go away until they finish. But Oban should respect the concurrency limit going forward.
29 detailed in last 60s is much better though — about 1 every 2 seconds with 9 workers. Let me reply:
Cedric 1:14 AM
The active tasks aren't showing because they finish too fast between UI updates. The broadcast fires but the task completes before the next LiveView render. I need to keep tasks visible for a minimum time so they actually appear.
Cedric 1:14 AM
Now update the task display to show done tasks differently:
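The markup change is roughly this (a HEEx sketch; the assign and class names are assumptions, only the dim-when-done behavior comes from the thread):

```heex
<div
  :for={{_release_id, task} <- @enrichment_tasks}
  class={["flex items-center gap-2", task.state == :done && "opacity-50"]}
>
  <span><%= task.artist %> - <%= task.title %></span>
  <span :if={task.state == :done}>✓</span>
</div>
```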
Cedric 1:14 AM
Deployed. Active tasks should now stay visible for 3 seconds after completion (fading to 50% opacity when done). Check the dashboard — you should see multiple tasks listed under Album Details now.
Marian 1:18 AM
sometimes a task appears, but more often no active tasks are visible, just the recent tasks list updating with new items
Cedric 1:20 AM
The PubSub broadcasts are racing — tasks complete before the LiveView even processes the start message. Let me switch to just querying the DB for what's actually in progress. The enriching status is already set by claim_release_for_enrichment.
Cedric 1:20 AM
Now remove the PubSub-based task tracking and replace it with a DB query in assign_data:
Cedric 1:20 AM
Now add a DB query for active detail tasks in assign_data:
Cedric 1:20 AM
Now add the fetch_active_detail_tasks function and update the render:
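A sketch of what that could look like (the function name and the "enriching" status come from this thread; the Release schema, its fields, and Repo are assumptions):

```elixir
# Hypothetical body of the DB-backed query behind assign_data.
defp fetch_active_detail_tasks do
  import Ecto.Query

  from(r in Release,
    where: r.status == "enriching",
    order_by: [desc: r.updated_at],
    limit: 20,
    select: %{release_id: r.id, artist: r.artist, title: r.title, started_at: r.updated_at}
  )
  |> Repo.all()
end
```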
Cedric 1:20 AM
Now update the render to use @active_detail_tasks instead of @enrichment_tasks:
Cedric 1:20 AM
Also reduce the refresh interval to 3s for snappier updates:
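For reference, the usual LiveView polling shape (only the 3-second value comes from the thread; the message name is an assumption, and assign_data/1 is the dashboard helper referenced above):

```elixir
# Inside the dashboard LiveView module.
@refresh_interval 3_000

def handle_info(:refresh, socket) do
  Process.send_after(self(), :refresh, @refresh_interval)
  {:noreply, assign_data(socket)}
end
```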
Cedric 1:20 AM
Let me verify there are releases in "enriching" state:
Cedric 1:20 AM
11 releases in "enriching" state — those should show up on the dashboard now. Check it.
Marian 1:23 AM
do we still have the 10s cooldown for the album details fetch? it seems there are 12 active tasks just lingering. We don't need the cooldown if the IPs are correctly assigned/cooled down
Cedric 1:25 AM
Some have been "enriching" for 10+ minutes. They're stuck — the worker claimed them, then crashed or the proxy request timed out, and reset_release never ran. The after block in the try/rescue isn't catching all failure modes (like when checkout_proxy blocks forever).

The fix: don't rely on worker cleanup. Add a timeout — any release in 'enriching' for more than 60s gets reset back to 'discovered'.
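A minimal sketch of that reset, assuming a Release schema with a string status column (only the 60-second cutoff and the status names come from the thread):

```elixir
import Ecto.Query

# Any release stuck in "enriching" for more than 60s goes back to "discovered".
from(r in Release,
  where: r.status == "enriching" and r.updated_at < ago(60, "second")
)
|> Repo.update_all(set: [status: "discovered"])
```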
Cedric 1:25 AM
Now let me add automatic cleanup to the scheduler so this doesn't happen again, and also fix the worker to not get stuck:
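Hypothetical wiring inside the existing scheduler GenServer (the scheduler itself is real per this thread; the message name, interval, and the Releases.reclaim_stuck_enriching/0 helper wrapping the query above are assumptions):

```elixir
# Periodically reset releases stuck in "enriching"; re-arms itself every 30s.
# Assumes `require Logger` at the top of the scheduler module.
def handle_info(:reclaim_stuck_enriching, state) do
  {count, _} = Releases.reclaim_stuck_enriching()
  if count > 0, do: Logger.warning("Reset #{count} stuck 'enriching' releases to 'discovered'")
  Process.send_after(self(), :reclaim_stuck_enriching, :timer.seconds(30))
  {:noreply, state}
end
```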
Cedric 1:25 AM
Now let me also fix the worker's checkout_with_wait timeout. The worker blocks in Http.checkout_proxy() which can loop up to 120 times with 200ms+ sleeps — that's potentially 24+ seconds of blocking. With the Oban job timeout, the process might get killed without running the after block:
Cedric 1:25 AM
The problem: if checkout_proxy blocks for 20s and then the request takes 10s, the release has been "enriching" for 30s but the worker is still alive. The after block only runs if the function completes. If the BEAM kills the process (e.g. Oban timeout), after never runs.

The scheduler reclaim handles this now. Let me compile and deploy:
Marian 1:49 AM
active tasks don't work - at least the display. They appear in recent, but don't show up as active
image.png
Cedric 1:51 AM
The releases are completing so fast they never show up as "enriching" in the DB when the 3s poll happens. The workers claim → fetch → complete in under 3 seconds. Let me check:
Cedric 1:51 AM
0 in "enriching" at any given moment, 13 done in 30s. The workers are fast enough that they're never caught in the "enriching" state by the poll. So the parallelism IS working now (~26/min) — the display just can't keep up.
The only reliable approach: track recently-completed tasks in-memory via PubSub (like before), but keep them visible for longer. Let me combine both approaches — PubSub for real-time task tracking, but keep completed items in a rolling list:
Cedric 1:51 AM
Now schedule the pruning on each refresh:
Cedric 1:51 AM
Now update the render to use @detail_tasks (PubSub-based) combined with @active_detail_tasks (DB-based), merging both:
Cedric 1:51 AM
Remove the DB-based active_detail_tasks since we're using PubSub now:
Cedric 2:08 AM
Sorry, was mid-deploy when you pinged. It's live now — refresh the dashboard. Active tasks should show up properly.
The bug was embarrassingly simple: one character difference between map update (%{map | key: val}, crashes if key missing) and merge (Map.merge, adds new keys). Every time a task finished, the LiveView crashed and wiped the task list.
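For reference, the one-character difference in plain Elixir (not the app's code):

```elixir
tasks = %{}

Map.merge(tasks, %{42 => %{state: :done}})  # => %{42 => %{state: :done}}; missing keys are simply added
Map.put(tasks, 42, %{state: :done})         # also safe when the key does not exist yet

%{tasks | 42 => %{state: :done}}            # ** (KeyError) key 42 not found; update syntax requires an existing key
```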
Marian 2:10 AM
Still, after last deploy, still no active tasks visible.
The number of scraped albums is at 12555 and not moving, although new recent ones appear in the RECENT list. So either you scrape the ones we already have, or you do not update the counter. And I don't see the list of active tasks!!!!!!!!!
image.png
Cedric 2:14 AM
297 in 5 minutes — ~60/min! The counter IS going up but the dashboard isn't refreshing that number. And 0 in "enriching" state confirms tasks are fast.
Two issues:
1. Counter appears stuck because assign_data calculates total = discovered + enriched but the "enriching" ones are excluded. Actually wait — there are 0 enriching. So total should be correct. Let me check the actual dashboard code for the counter:
Cedric 2:14 AM
The counter should update on every 3s refresh. But 12555 matches the DB. If the recent list shows new items, the LiveView IS refreshing. So maybe the counter display is wrong?
Wait — 12555 enriched in DB, and the screenshot shows "12555 / 335115 albums detailed". Actually that IS updating — the total went from 310954 → 335115 (new discoveries). The enriched count might just be slow to visually change. But Marian says it's "not moving".
The real problem Marian cares about: active tasks not visible. Let me check if the PubSub broadcasts are actually being sent. Let me look at the worker:
Cedric 2:14 AM
The code looks correct. But... the :enrichment_task broadcast with :fetching arrives, gets put in detail_tasks, then almost immediately the :done broadcast arrives and marks it done. Then on the next prune (every 3s), tasks older than 5s get removed.

The issue is that the :done message sets finished_at and the :fetching message sets started_at. But the :done broadcast for release_id fires BEFORE the LiveView even processes the :fetching message (both are in the mailbox). The :done handler checks Map.get(tasks, rid) — if :fetching hasn't been processed yet, the task isn't in the map, so it hits the nil branch and does nothing. Then :fetching arrives and adds it, but there's no :done to ever remove it... wait, that would make tasks linger forever. But the prune would clean them after 5s.

Actually no — the prune only removes tasks where state == :done. If :done was dropped (because the task wasn't in the map yet), the task stays as :fetching forever until the next prune, but the prune only removes :done tasks. So tasks with :fetching state would accumulate.

But Marian says NO active tasks visible. So the :fetching broadcast is either not arriving, or the LiveView mailbox processes :done before :fetching somehow.

Actually — PubSub delivers to all subscribers including the same process. Both broadcasts go into the process mailbox in order. :fetching is sent first, :done second. They should be processed in order. Unless the worker's do_enrich completes synchronously before the broadcast_task message is even delivered... which can't happen since broadcast is synchronous.

Let me just look at the render more carefully:
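Roughly the two clauses being described (reconstructed from the reasoning above, not copied from the file; now_ms/0 stands in for whatever timestamp helper the module actually uses):

```elixir
# :fetching inserts (or refreshes) the task entry with a start timestamp.
def handle_info({:enrichment_task, %{state: :fetching} = task}, socket) do
  tasks =
    Map.put(socket.assigns.detail_tasks, task.release_id, Map.put(task, :started_at, now_ms()))

  {:noreply, assign(socket, detail_tasks: tasks)}
end

# :done only updates an entry that already exists; if :fetching has not been
# processed yet, Map.get/2 returns nil and the message is effectively dropped.
def handle_info({:enrichment_task, %{state: :done, release_id: rid}}, socket) do
  case Map.get(socket.assigns.detail_tasks, rid) do
    nil ->
      {:noreply, socket}

    task ->
      updated = Map.merge(task, %{state: :done, finished_at: now_ms()})
      {:noreply, assign(socket, detail_tasks: Map.put(socket.assigns.detail_tasks, rid, updated))}
  end
end

defp now_ms, do: System.system_time(:millisecond)
```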
Cedric 2:14 AM
The code looks fine. I think the problem might be that the recently_enriched section (which Marian IS seeing) looks identical to what the active tasks section would look like with ✓ marks. Let me check what the "RECENT" section renders — maybe the active tasks ARE showing but blending into the recent section visually.

Actually wait — looking at the screenshot again: it shows the "RECENT" header and items with ✓. There's no "ACTIVE TASKS" header. sorted_tasks is empty, so :if={sorted_tasks != []} hides the whole section.

The PubSub must not be reaching the LiveView. Maybe the LiveView is not subscribed? Let me verify:
Cedric 2:14 AM
Subscribed to "scraping:progress". The worker broadcasts on "scraping:progress".

Oh wait — is the {:enrichment_complete, ...} handler catching the message before {:enrichment_task, ...}? Let me check the order of handle_info clauses:
Cedric 2:14 AM
Line 72: def handle_info(_, socket) — the catch-all! But that shouldn't match {:enrichment_task, ...} since there are specific clauses above it. Unless... the {:enrichment_task, %{state: :done, release_id: rid}} clause on line 41 matches ALL enrichment_task messages because Elixir pattern matching is structural. Wait no — :fetching wouldn't match %{state: :done}.

Hmm. Let me think about this differently. Maybe the worker's broadcast_task is simply not being called. If Http.checkout_proxy() blocks for a long time, the broadcast never fires until the proxy is obtained. With 600 proxies and the workers competing, maybe checkout returns instantly...

I bet the problem is that broadcast_task_done only sends %{release_id: release_id, state: :done} — no artist or title. So when the :done handler fires, Map.get(tasks, rid) returns nil (the task was already pruned or was never there), and it does nothing. But the :fetching handler DID fire and DID add the task. The prune should clean it.

Actually, I wonder if the prune is too aggressive. Let me check:

tasks = socket.assigns.detail_tasks
|> Enum.reject(fn {rid, t} -> t.state == :done and now - (t[:finished_at] || 0) > 5000 end)

This rejects :done tasks older than 5s. But :fetching tasks are never removed (no :done ever arrives for them if the race condition happens). So they'd accumulate. But Marian sees ZERO active tasks.

Let me just add logging and test it locally... actually no, let me just simplify radically. Instead of this complex PubSub dance, just keep a simple rolling list of recently-seen tasks:
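One way to read "rolling list" (a sketch of the simplification; the helper name and the cap are assumptions, since the actual helper isn't shown in this log): every :enrichment_task broadcast is pushed onto a capped, newest-first list, with no per-task state machine and no prune timer. Merging with any existing entry also keeps artist/title around when the bare :done event arrives.

```elixir
# Any enrichment_task event replaces an older entry for the same release and lands
# at the head of the list; the list never grows past `max` entries.
defp push_detail_task(tasks, event, max \\ 15) do
  {existing, rest} = Enum.split_with(tasks, &(&1.release_id == event.release_id))
  merged = Map.merge(List.first(existing) || %{}, event)
  Enum.take([merged | rest], max)
end

def handle_info({:enrichment_task, task}, socket) do
  {:noreply, update(socket, :detail_tasks, &push_detail_task(&1, task))}
end
```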
Cedric 2:14 AM
Now remove the prune_detail_tasks handler and the send in refresh, and add the helper:
Cedric 2:14 AM
Add the helper function near the other helpers:
Cedric 2:14 AM
Also need to remove the old :prune_detail_tasks handler reference. Let me check if it still exists:
Cedric 2:14 AM
Good, removed. Now also fix the counter — the enriched count should include releases being enriched right now too, so total stays consistent. Actually, the counter is fine (12555 enriched in DB matches UI). Marian said it's "not moving" but 297 were added in 5 minutes. He might just not be watching long enough, or the page isn't re-rendering.

Let me also add the total_releases to include ALL statuses:
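For the counter itself, the all-statuses total is a one-liner (a sketch; Release and Repo are assumptions based on the thread):

```elixir
import Ecto.Query

total_releases = Repo.aggregate(Release, :count, :id)
enriched_count = Repo.aggregate(from(r in Release, where: r.status == "enriched"), :count, :id)
```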
Marian 2:22 AM
Still 12555 completed. See? No progress. The number of total albums discovered is increasing, because our tag scrapers work. But the album detail does not!!!!!!!
image.png
Cedric 2:24 AM
Wait — 12555 enriched but 631 updated in the last 10 minutes? That means releases ARE being marked enriched... but the count hasn't changed. Unless they're being enriched AND then something sets them back to discovered?
No wait — 12555 total enriched, 631 updated in last 10 minutes. The count IS 12555. Let me check more carefully:
Cedric 2:24 AM
Last enriched release was at 02:22:47 — that's an hour ago! The workers are running but nothing has been marked enriched since then. The recent items in the UI (Skies Underground, Aztec Records etc.) are from the recently_enriched query which orders by updated_at DESC — those are old ones.

But 11 tasks ARE "executing" in Oban. Let me check what they're actually doing:
Cedric 2:24 AM
0 enriching, 11 executing jobs with no errors. The workers are running but not producing results. They must be stuck in Http.checkout_proxy() — blocking on the proxy pool.

The discovery workers are hogging all the proxies:
Cedric 2:24 AM
74 discovery workers executing. They're STILL too many. My queue scaling fix didn't take effect because the old workers persisted. Let me force the queue limit NOW and also check why it's still 74: