project-antrum-2

Thursday, February 19, 2026

Cedric · 8:12 AM

So detailexec comes from Oban's executing count (line 455), while the visual list comes from @detailtasks which is populated via PubSub broadcasts. The mismatch: 11 Oban jobs show as "executing" but most are sitting in checkout_proxy() waiting for a proxy — they haven't broadcast yet because they haven't gotten a proxy.

The fix is simple: use the actual Oban count for the header, but show it honestly. Or better yet — just use the DB count of releases in "enriching" state, which matches actual work:

Cedric · 8:12 AM

The issue is clear. The "11 tasks running" is from Oban executing count — workers sitting in checkout_proxy(). The actual task list only shows tasks that have broadcast (i.e., got a proxy). Let me change the count to show the number of releases actually being enriched (from DB), which is the honest number:

Cedric · 8:12 AM

Now verify that @enriching is already assigned (from the status counts):

Cedric · 8:12 AM

@enriching isn't explicitly assigned — it's part of status_counts. Let me check:

Cedric · 8:12 AM

@enriching is not assigned. I need to add it:

Marian · 8:41 AM

show the proxy waiting tasks too in the list. does album detail jobs have priority now?

2 replies

Cedric · 8:45 AM

Currently the enrichment worker only broadcasts after getting a proxy. I need to broadcast a "waiting for proxy" state before checkout:

Cedric · 8:45 AM

Now add the broadcasttaskwaiting function:

Cedric · 8:45 AM

Now update the dashboard to handle the :waiting_proxy state and display it:

Cedric · 8:45 AM

Good, the generic handler at line 57 already handles any state including :waiting_proxy. Now update the task display template to show waiting state:

Cedric · 8:45 AM

Also update the header count to include both actively fetching and waiting:

Marian · 11:38 AM

somethings not right. detail fetches are way too slow. analyse and explain

Cedric · 11:40 AM

Looked through the code. Here's what's happening:

The proxy pool has no real priority — it's first-come-first-served.

The "high/low priority" in Http.checkoutproxy only controls retry behavior (how many times and how fast a worker retries if no proxy is free). But at the GenServer level, ProxyPool.checkout() is a plain GenServer.call — whoever calls first gets the proxy. There's no queue priority.

The numbers make it obvious:

• Discovery: 10 concurrency × 9 active sources = ~~90 workers~~ competing for proxies
• Detail fetching: 10 workers
• Pool capacity: 822 proxies / 5s cooldown ≈ 164 checkouts/sec
Discovery is taking ~90% of the proxy bandwidth. Detail workers spend most of their time in checkoutwith_wait loops waiting for a free proxy.

On top of that, the enrichment worker chain adds latency: claim release → wait for proxy → fetch → checkin → insert next Oban job → Oban picks it up → repeat. Each hop through Oban adds scheduling overhead.

Three things to fix (this is task 3 from your list):

1. Give enrichment workers actual priority at the pool level (priority queue in the GenServer, not just retry tuning)
2. Reduce discovery concurrency when there's a detail backlog — or pause discovery entirely while details are catching up
3. Optionally: reserve a portion of proxies for detail fetching so discovery can never starve it
Want me to go ahead and implement this?

Cedric · 11:40 AM

Now I have the full picture. Here's the analysis:

Marian · 8:27 PM

look at current state. its all messed up

Cedric · 8:29 PM

Found it. The dashboard says "1006 active proxies" but that's just the DB count. The actual running pool has only ~~30 working proxies (10 webshare +~~ 20 free + 1 direct). The other 970 fail health checks every refresh cycle.

With 30 proxies at 5s cooldown = max 6 requests/sec capacity. But you've got 36 discovery tasks + 10 detail tasks = 46 workers all fighting for those 6 slots. Everything starves — detail fetching hasn't completed a single album in 10+ hours.

Why only 10 webshare proxies? The API call fetches pagesize=100 but only returns 10. Either the plan limit or the account only has 10.

Why the dashboard lies: It shows proxies WHERE active = true from the DB, but the pool auto-deactivates bad ones and the health check filters them. DB and pool are out of sync.

Immediate fixes I'd do:

1. Drop discovery concurrency to 2 (or pause discovery entirely) — detail fetching needs those proxies more
2. Fix the dashboard to show actual pool size_ not DB active count
3. Check the Webshare account — are there really only 10 proxies on the plan?
4. Longer term: the priority system I described earlier
Want me to pause discovery for now and push a fix for the misleading proxy count?

Cedric · 8:29 PM

Let me check the live dashboard.

Cedric · 8:29 PM

I can see it. Detail fetching is completely stalled (all "10h ago"), and the errors section is flooded with "no proxies available" for both discovery and detail tasks — despite having 1006 active proxies. This looks like a proxy pool leak again.

Let me check the live state:

Cedric · 8:29 PM

The multi-statement eval isn't working over SSH. Let me try a different approach:

Cedric · 8:29 PM

The run command starts a fresh instance. I need to connect to the running app. Let me check logs instead:

Cedric · 8:29 PM

There it is. The pool refresh is only finding 33 healthy proxies (10 webshare + 22 free + 1 direct), not 1006. The dashboard says 1006 active but that's the DB count — the actual running pool has only 33 proxies because the rest failed health checks.

With 33 proxies and 5s cooldown, capacity is ~6.6 checkouts/sec. But 36 discovery tasks + 10 enrichment tasks = 46 workers all competing. The pool is completely overwhelmed.

Marian · 10:10 PM

Ok, there was a mention in the docs, that we can use some amazon gateway for low cost. tell me about it

Cedric · 10:10 PM

That's AWS API Gateway used as a rotating proxy. It's a well-known technique:

How it works:
You create an HTTP API Gateway endpoint that proxies requests to Bandcamp. Each request goes through AWS infrastructure and gets a different outbound IP from Amazon's massive IP pool. No single IP gets rate-limited because you're effectively rotating across thousands of IPs automatically.

Cost:

• HTTP API Gateway: $1.00 per million requests
• REST API Gateway: $3.50 per million requests
• No per-proxy fees, no subscriptions
• At our scale (400k albums to detail-fetch), that's under $1 total
Why it's better than what we have:

• Our current pool is 30 working proxies. This gives you effectively unlimited IPs.
• No cooldown needed — every request is a fresh IP
• No health checks, no proxy management, no pool starvation
• AWS IPs are "clean" — rarely blocked by Bandcamp
• Latency is low (AWS → Bandcamp direct)
Setup:

1. Create an HTTP API in API Gateway (any AWS region)
2. Add a route like GET /{proxy+} that integrates with https://bandcamp.com/{proxy}
3. Deploy it — you get a URL like https://abc123.execute-api.us-east-1.amazonaws.com
4. Point our Http module at that URL instead of going through proxies
5. Optional: deploy across multiple regions for even more IP diversity
Downsides:

• AWS could theoretically flag high-volume scraping through their gateway (unlikely at our volume)
• Adds 50-100ms latency per request vs direct
• Need an AWS account
This would basically replace the entire proxy pool system for Bandcamp scraping. Want me to set it up?

Marian · 11:46 PM

yep. but keep the current setup and make this new way opt-in

Cedric · 11:46 PM

Got it. I'll add API Gateway as an opt-in proxy method — toggled via a setting/env var, current proxy pool stays the default. Will need an AWS account set up with API Gateway though. Do you have AWS credentials I should use, or should I walk you through creating the gateway manually?

Marian · 11:53 PM

walk me

Cedric · 11:53 PM

Here's the setup, step by step:

1. Go to API Gateway in AWS Console
https://console.aws.amazon.com/apigateway
(Create an AWS account first if you don't have one — free tier covers this)

2. Create an HTTP API

• Click "Create API"
• Choose "HTTP API" (not REST API — HTTP is cheaper at $1/million)
• Name it something like bandcamp-proxy
3. Add a route

• Route: GET /{proxy+}
• This catches all paths
4. Add an integration

• Integration type: "HTTP URI"
• URL: https://bandcamp.com/{proxy}
• Method: GET
5. Deploy

• It auto-creates a $default stage
• You'll get an invoke URL like:
https://abc123def.execute-api.eu-central-1.amazonaws.com
6. Test it
Open in browser:
https://abc123def.execute-api.eu-central-1.amazonaws.com/tag/synthwave
Should return the Bandcamp tag page.

7. Give me the URL
Once it works, send me the invoke URL. I'll add it as AWSGATEWAYURL env var on Dokku and wire it into the app as an opt-in proxy method.

That's it — should take about 5 minutes. Pick eu-central-1 (Frankfurt) for lowest latency to the Dokku server.

Marian · 11:55 PM

ok. explain rest vs http gateway. what is the primary purpose of this gateway?

Cedric · 11:55 PM

Primary purpose of API Gateway:
It's meant for building and exposing APIs — you put it in front of Lambda functions, backend services, etc. so clients hit a managed endpoint that handles auth, rate limiting, throttling, caching, CORS. It's the front door for your API.

We're abusing it as a proxy — it forwards our request to Bandcamp, and since AWS rotates the outbound IP per request, we get free IP rotation as a side effect.

REST API vs HTTP API:

REST API (v1, older):

• Full-featured: request/response transforms, API keys, usage plans, WAF, caching
• $3.50 per million requests
• Heavier, more latency
HTTP API (v2, newer):

• Stripped down: just routing + integrations, minimal features
• $1.00 per million requests (71% cheaper)
• Lower latency (~30% faster)
• No built-in caching or API key management
For us HTTP API is the obvious choice — we don't need any of the REST API features. We're just forwarding GET requests to Bandcamp. The cheaper and faster option does exactly what we need.

Marian · 11:57 PM

double check that this is possible

Cedric · 11:58 PM

Double-checked. The HTTP proxy integration with greedy path variables ({proxy+}) is confirmed in AWS docs — that part works exactly as I described.

However, I need to be honest about one thing I was too confident on: IP rotation is not guaranteed.

API Gateway makes outbound requests from a pool of IPs within the region, but:

• The IPs are from known AWS ranges (publicly listed at ip-ranges.amazonaws.com)
• There's no guarantee every request gets a different IP — it's a shared NAT pool, not a rotation service
• Bandcamp could block AWS IP ranges entirely if they wanted to (some sites do)
What this actually gives us:

• A small pool of rotating AWS IPs (better than our 30 free proxies, but not "thousands")
• Clean, fast, reliable connections (no dead proxies)
• No pool management overhead
• Very cheap
What it doesn't give us:

• Guaranteed unique IP per request
• Protection if Bandcamp blocks AWS ranges
It's still worth doing as an option alongside the existing proxy pool — just want to be upfront that it's not the magic bullet I made it sound like initially. The real fix for throughput is still the priority system + getting more working proxies (e.g. upgrading the Webshare plan).

Want to proceed anyway?

Cedric · 11:58 PM

Confirmed — HTTP proxy integration with greedy path variables works exactly as described. But let me verify one potential issue: whether API Gateway actually rotates outbound IPs or uses a fixed NAT.

Friday, February 20, 2026

Marian · 12:01 AM

how many ips?

Cedric · 12:01 AM

Honestly — I don't know the exact number and AWS doesn't publish it. API Gateway uses a shared NAT pool per region, not a dedicated set of IPs per account. The outbound IPs come from the region's API Gateway service range.

What I can tell you: per region, the API Gateway IP ranges listed in ip-ranges.amazonaws.com typically cover a few /16 to /20 blocks. But "listed IP range" != "IPs actually used for outbound NAT." AWS could route through a handful or hundreds — it's opaque.

The honest answer is: deploy it, make 50 requests, log the outbound IP (Bandcamp shows it in some headers, or we add a test endpoint), and count the unique IPs. That's the only way to know for real.

If the rotation turns out to be poor, we can deploy the same gateway across multiple regions (us-east-1, eu-west-1, ap-northeast-1, etc.) and round-robin between them. That would multiply whatever IP pool we get.

Cedric · 12:01 AM

AWS doesn't publish the exact number. Let me check what people have actually observed:

Marian · 12:34 AM

proxies page shows 1000 active ones and you are talking about 30. Please explain all the numbers regarding proxies

Cedric · 12:35 AM

There are two separate things that both say "proxies" but track different numbers:

1. The database (proxies table) — what the /admin/proxies page shows
Every proxy we've ever seen gets upserted here. There are ~~2600 rows total,~~ 1000 marked active = true. This is a historical record. A proxy stays active in the DB unless it hits 3 consecutive errors during actual scraping, at which point the pool deactivates it. But most of these proxies were never re-tested after their initial insert.

2. The in-memory pool (ProxyPool GenServer) — what actually serves requests
Every 10 minutes, the pool refreshes: it fetches fresh proxy lists from Webshare (API) and free sources (proxyscrape, geonode), then health-checks every single one by hitting bandcamp.com/tag/electronic. Only proxies that respond 200 AND are active = true in the DB make it into the pool.

The refresh logs show: 10 webshare + ~~20 free + 1 direct =~~ 30 healthy proxies. That's the real number.

Why the huge gap (1000 vs 30):

• Free proxy sources are garbage — they list hundreds of IPs but 90%+ are dead, slow, or blocked by Bandcamp
• The DB never cleans them out. They were marked active when first inserted and nothing retests old ones
• Webshare only returns 10 proxies (plan limit or account config)
• The dashboard and proxies page query the DB, not the live pool — so they show the inflated number
What each number means:

• "2609 total" = every proxy ever seen across all refreshes
• "1006 active" = DB rows with active = true (misleading — most are dead)
• "33 total healthy" (from logs) = actually working right now
• "10 webshare" = paid proxies, reliable
• "~20 free" = scraped from public lists, unreliable, changes every refresh
• "1 direct" = the server's own IP, no proxy
The proxies page needs to show the live pool count, not just the DB count. And the DB needs cleanup — those 970 "active" proxies that fail every health check should be deactivated.

Marian · 12:39 AM

seems you drifted a lot. Our proxy DB should be our source of truth. We may add more proxies on some "refreshes/refetches", if new found. We should try every one active from this table. If fails 3 times, lets consider it dead for good -> deactivate.
I don't see a reason why we would have 1000 untested proxies in our db and yet you would tell me we have 20 working ips

Cedric · 12:40 AM

You're right, that's a clear design flaw. Right now on every 10-minute refresh, the pool fetches fresh lists from Webshare/free sources, health-checks only those, and loads only the ones that pass. It completely ignores the 1000 active proxies already in the DB.

It should work like this:

1. On refresh, fetch new proxies from sources → upsert into DB (adds any new ones)
2. Load ALL active = true proxies from DB
3. Health-check all of them
4. Ones that pass → into the live pool
5. Track consecutive health-check failures per proxy. 3 strikes → active = false in DB, permanently out
That way the DB is the source of truth, we actually use what we have, and dead ones get cleaned out over time.

I'll fix this now.

Marian · 12:42 AM

why health check? Isn't health check to give it a real download job?

Cedric · 12:50 AM

You're right. The health check is pointless overhead — and it's actually the thing causing all the startup delays right now (trying to test 1176 proxies one by one on every refresh).

Better approach: just load all active proxies from DB into the pool. When a proxy fails during real work, the existing 3-strikes rule deactivates it. No separate health check needed. Dead proxies get filtered out naturally by actual usage.

On refresh: fetch new proxies from sources → upsert into DB → reload all active from DB into pool. That's it. Fast startup, no hanging.

Fixing now.

Cedric · 12:53 AM

Deployed. No more health checks — all active proxies from DB loaded directly into the pool.

Result: 957 proxies in pool (up from 30). Detail fetching is working again — 15 active tasks, 3 albums enriched in the first minute. Bad proxies are getting deactivated naturally by the 3-strikes rule during actual usage (already down from 1176 to 957 as dead ones get filtered out).

Dashboard now shows "957 active / 2811 total" which reflects reality.

Cedric · 12:53 AM

Let me wait for the current deploy to finish first:

Cedric · 12:53 AM

Good. Now let me strip out the health check entirely and just load from DB:

Cedric · 12:53 AM

Now remove the unused healthcheck function and the @healthcheck_timeout:

Cedric · 12:53 AM

961 proxies loaded instantly. No more waiting for health checks. Let me wait for the deploy to finish:

Cedric · 12:53 AM

Now let me verify things are working — check the logs for actual scraping activity:

Cedric · 12:53 AM

It's working. Proxies are being used, bad ones are getting deactivated after 3 failures, and detail fetching is actually completing (see "Enriched release #29588"). Let me check for more recent activity:

Cedric · 12:53 AM

Detail fetching is back in action. Let me check the dashboard:

Cedric · 12:53 AM

Big improvement:

• 957 active proxies in the pool (up from 30)
• Album Details: 15 fetching + 5 waiting — actively working again
• Enriched count: 14,579 (up from 14,576 — 3 new in the last minute)
• Recent: Shows fresh results (1s ago, 4s ago, 13s ago)
• Discovery and detail fetching both running
• Bad proxies getting deactivated naturally (down from 1176 to 957 already — ~200 dead ones filtered out by real usage)

Newer Page 2 of 21 Older