We are back

It’s a lot to explain and I’ve explained it too many times, but TL;DR:
It was much worse post-migration due to limitations of the Discourse software.
After reverting, performance got much better because I re-implemented the old infra, but with some differences.
Performance is hit or miss, as it was prior to the migration; the site has bursts of bad performance like before.
Going forward we are still optimising, just not as intensively as before. We might do some experimentation on scaling better, since Discourse was not designed for the level of traffic we get.

3 Likes

Started donating (again). Thank you Vlad!

Is there any chance to make the setup open source on GitHub or anything similar so users here can look into it and look for improvements?
I think it would make problems/improvements more trackable too.

3 Likes

Sorry, I must have missed your previous posts on this topic. If I understand you correctly, you’re saying that you are working on the topic as time allows, but without a specific schedule. Fair enough since all of that is done on a voluntary basis.
However, I have to disagree with you. Since the migration, the site’s performance during Central European Time evening hours has been significantly worse than before. But maybe I’m the only one experiencing this or it’s purely subjective. I assume the tests you certainly conducted with the new platform beforehand didn’t reveal this. Anyway, I appreciate the voluntary and unpaid work behind it. Thank you very much for that! Eroscripts has a fantastic community! I’m keeping my fingers crossed that it stays that way!

Thank you for all your hard work, Vlad!

rails db:migrate runs migrations that are not already registered in the schema_migrations table (keyed by timestamp). These migrations run on the same database connection Rails uses when starting the application server, so if the app runs but the migrations do not, there’s likely bad data or a broken configuration (env vars differing, and so on). If all migrations have already been executed, db:migrate will intentionally do nothing, and I think this should be the case here.
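To make that timestamp bookkeeping concrete, here’s a minimal sketch of how db:migrate decides what is pending. It uses sqlite3 purely as a stand-in for Postgres, and the version strings are hypothetical, not from the eroscripts database:

```python
import sqlite3

# Rails records each applied migration's timestamp in schema_migrations;
# db:migrate runs only the migration files whose timestamp is absent.
conn = sqlite3.connect(":memory:")  # stand-in for the Discourse Postgres DB
conn.execute("CREATE TABLE schema_migrations (version TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO schema_migrations VALUES (?)",
                 [("20240101000000",), ("20240215000000",)])

# Hypothetical migration files shipped with the application code
migration_files = ["20240101000000", "20240215000000", "20240301000000"]

applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
pending = [v for v in migration_files if v not in applied]
print(pending)  # ['20240301000000'] -- only the unregistered migration runs
```

If `pending` is empty, db:migrate is a no-op, which is why a migration unexpectedly being due usually points at code/schema drift rather than the command itself.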

Maybe the external databases you tried to migrate didn’t include the eroscripts Discourse backup (and were empty)? If so, a migration exception caused by bad data would not have been triggered. Are you sure that the database content inside the dockerized pg was identical to the one used in your tests outside?

I wonder why migrations are due to run in the first place — this usually only happens when the application code has been updated and developers have changed the database layout, etc. Are you trying to update Discourse to a newer version while moving the system? If so, I’d recommend updating the sources first on a known-good database/system, and then restoring the up-to-date DB backup to the new system.

I agree that 16GB RAM and 8 CPU cores should be enough to handle <10k unique users per day with Discourse, and I’m surprised this is causing the system to go down. I’ve deployed installations of similar size on weaker hardware and didn’t encounter this.
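As a rough sanity check on that claim, a back-of-envelope calculation (all figures assumed, purely illustrative, not measured from this site) turns the daily-user number into a request rate:

```python
# Back-of-envelope capacity estimate; assumed figures, not measurements.
unique_users_per_day = 10_000
requests_per_user = 100   # generous guess: page views plus API/asset calls
seconds_per_day = 86_400

avg_rps = unique_users_per_day * requests_per_user / seconds_per_day
peak_rps = avg_rps * 5    # assume the peak hour runs ~5x the daily average
print(round(avg_rps, 1), round(peak_rps, 1))  # -> 11.6 57.9
```

Even ~60 req/s at peak is well within what 8 cores should serve, which supports the point that raw capacity alone shouldn’t be taking the system down.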

btw: db:migrate does not optimize the database per se (though developers might include an optimization as part of a migration, usually as a bugfix for missing or wrong indices, not randomly). So I think it’s somewhat unrelated that you couldn’t successfully run the command on external DBs. Or are you talking about a migration you created yourself, adding indices, etc.?

Anyway, it sounds like you’ve identified PG as the overall bottleneck (and not the application server being overloaded).
You could confirm this by checking that requests don’t pile up (depending on your app server: Passenger, Puma). Keep in mind, though, that requests can also accumulate when I/O is blocked (e.g. by PG not responding fast enough).
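A toy queueing sketch (all rates assumed) of why a slow PG shows up as piled-up requests at the app server even when the app itself isn’t CPU-bound:

```python
# Toy model with assumed rates: workers blocked on DB I/O reduce the
# effective service rate, so the request queue grows at the app server.
arrival_rate = 20.0   # requests/sec hitting the app server (assumed)
workers = 8           # e.g. Puma threads, each blocked while waiting on PG

def queue_depth_after(seconds, db_latency):
    service_rate = workers / db_latency          # completions/sec, all workers
    backlog_growth = max(0.0, arrival_rate - service_rate)
    return backlog_growth * seconds

print(queue_depth_after(60, db_latency=0.1))  # healthy DB: 0.0 queued
print(queue_depth_after(60, db_latency=1.0))  # slow DB: 720.0 queued
```

Same arrival rate in both cases; only the DB latency changes, which is why a request backlog alone doesn’t tell you whether the app server or the DB is at fault.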
Maybe it’s worth taking the time to double-check the assumption that the db is slow with an optimally configured managed one, like the one DigitalOcean provides — just to make sure you’re not heading in the wrong direction. You’d probably only need it for a few hours, so it shouldn’t be too expensive.

If everything runs smoothly on a hosted DB, then maybe:

a) the current DB container is not optimally configured (e.g., effective_cache_size, work_mem),
b) the DB is very fragmented (VACUUM FULL) — though most of that would be fixed upon restoring from a dump, or
c) DigitalOcean’s VMs are just not very fast (which they aren’t) and have slow I/O (which they do :wink:).
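For (a), common sizing rules of thumb for a dedicated 16 GB box look roughly like this. These are generic heuristics (pgtune-style), not official Discourse settings, and the connection count is assumed; verify against your actual workload:

```python
# Rough PostgreSQL sizing heuristics for a dedicated box; assumed values,
# double-check against pgtune or the PG docs for your real workload.
ram_gb = 16
max_connections = 100  # assumed; check your actual setting

shared_buffers_gb = ram_gb * 0.25        # ~25% of RAM is the usual advice
effective_cache_size_gb = ram_gb * 0.75  # ~50-75% of RAM (planner hint only)
work_mem_mb = ram_gb * 1024 * 0.25 / max_connections  # per-sort/hash budget

print(shared_buffers_gb, effective_cache_size_gb, round(work_mem_mb))
```

Note that work_mem is allocated per sort/hash operation, not per connection, so the divisor here is deliberately conservative; setting it too high can OOM the box under load.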

Since you’re already on DO’s max tier at ~$100, it’s probably worth exploring non-virtualized compute (Hetzner or OVH, for example) after the problem is resolved, because then you know you’re not running on a totally overcommitted hypervisor. But that’s step 2, I think.

You got this! :crossed_fingers::flexed_biceps:

4 Likes

This is stellar info; I’m going to read it later, as I’m a bit busy.
Tomorrow I’m going to be doing things in my lab. Would you be able to join me in VC on the Discord?
btw we get more than 10k unique users daily. There’s a screenshot in the discord from cloudflare showing how much traffic we get.

3 Likes

I didn’t realize the site had become so popular! Decided to donate based on that discovery.

1 Like