Load Testing Throughput
The plan for today was to start with payments, but now that I've thought about it a bit more, that doesn't make sense yet. There's much to do before I'd want anyone to use this, and the landing page has to come before payments do.
Yesterday I resolved the urgent frontend issues. Today I'll go through the backend tech debt and prioritize it to see what needs to be done before the MVP.
- clean up ingest contract
  - yes
- rename view to event
  - yes, contract change
- rename userId to distinctId
  - yes, it's confusing, and a contract change
- extract domains on ingest
  - yes
- load testing and optimization
  - yes. I need some actual load to see how the queries will perform. I've purposefully left out creating indexes to see exactly what needs to be optimized and why.
- save duration info during replay creation
  - no, but I will afterwards. Skipping it causes single-batch sessions not to display, but those are probably too short for any valuable insight anyway.
- support cities in locations
  - no. Really low priority; won't do unless requested.
- integration testing of all queries
  - no, but high priority afterwards. I need to go through all of my queries and make sure they work as expected. They seem to work, but this is a core feature and I want it to be rock solid.
Tech Debt
Okay, that's actually not too bad. Nearly all of it is important, but there isn't much, and I'm fairly confident I can do it all today. I'll start with the low-hanging fruit to get the juices flowing.
- clean up ingest contract
  - done
- rename view to event
  - done
- rename userId to distinctId
  - done
- extract domains on ingest
  - done
Load Test and Optimization
Alright, now for the load test and optimization. I've done something like this before; it shouldn't be hard.
I'll use k6 to hammer the local instance's ingest endpoint first and see if anything blows up. I'll measure throughput and make fixes and adjustments as necessary. Once everything looks fine locally, I'll do the same on prod.
I'll make the tests realistic:
- 5000 concurrent users
- gradual ramp-up/ramp-down
- 5 seconds between requests (my actual batch frequency)
- initial request 15 kB to 60 kB compressed (actual prod data for session replays)
- subsequent batches ~3 kB
This holds for my websites, but it could differ in production with more complex sites.
First run locally:
http_req_duration: avg=2.43ms min=485.62µs med=1.04ms max=196.76ms p(90)=5.32ms p(95)=8.21ms
http_req_failed..: 0.04% 99 out of 212357
http_reqs........: 212357 698.135728/s
data_sent........: 638 MB 2.1 MB/s
More than enough and no obvious issues. Let's try in prod.
Tried with 5000: the api container froze up. docker stats showed empty output, and docker logs for the api was unresponsive.
While the test was running, postgres and the api were at 180% and 250% CPU respectively; then I started getting 502s, and now docker stats just shows dashes.
I'll try again with 1000.
http_req_duration..............: avg=68.87ms min=31.24ms med=51.8ms max=449.09ms p(90)=128.95ms p(95)=154.29ms
http_req_failed................: 0.00% 0 out of 39825
http_reqs......................: 39825 130.715561/s
data_sent......................: 106 MB 348 kB/s
Nothing failed, but it still locks up when I try to restart, even though it's using no memory and no CPU. Hmmm.
So it looks like this was the old container. It also has a weird name. Could it be that one of the deployments happened while the other was under load, and that messed things up? No idea.
Killed it for now, trying 5000 again.
No more locking up, but it looks like we hit the breaking point.
{"time":"2025-11-16T22:55:05.968428548Z","level":"ERROR","msg":"internal error","error":"post view\ninsert view: pq: sorry, too many clients already"}
{"time":"2025-11-16T22:55:06.15594035Z","level":"ERROR","msg":"failed to insert view","error":"insert view: pq: remaining connection slots are reserved for roles with the SUPERUSER attribute"}
I can see the api shooting up to 230% CPU again, with postgres hovering at 160%. The queries are also getting slower and slower: they take about 2 seconds now, and we're at ~170k events.

The breaking point is at about 3.3k concurrent users, which at my 5-second batch interval works out to roughly 660 req/s.
I had this exact same issue at my day job two days ago, also under high load. It has to do with how connection pooling works in Go: database/sql opens an unlimited number of connections by default, so under load it exhausts Postgres' connection slots (max_connections defaults to 100, with a few reserved for superusers; hence the "remaining connection slots are reserved" error).
Let's try capping the pool at 90 connections.
I tried to deploy the pooling fix, but the container hung again. Checking docker logs, I noticed that logs from the load test were still coming in while the deploy script was trying to read them. Probably related. Anyway, I can't deal with that now; I'll just restart again.
Okay, the load on postgres is much lower now. Not sure how? Were all the extra connection attempts themselves hurting performance?

Looking at top across all cores at full load with 5000 users, we're barely tickling the machine.

Seems like there's a lot of room for improvement here. Is the postgres connection limit the bottleneck?
I'll tackle that next time, too tired to think now.