Hello all,
The mod team has been investigating and working on a lot of things in the background.
We are trying to solve a number of problems I’m sure many of you are aware of. I will summarise most of this, but I will include technical details at the bottom for those who are more knowledgeable. We don’t have a timeline for these changes yet, but we will let everyone know when they are coming.
Problem 1 - Site Performance
We are pretty much always at high resource utilization. CPU usage is always above 50% and often hits 100% at peak times. Memory is always above 80% and nearly always above 90%.
Upgrading resources would easily resolve this, but for context, we are a very popular community. We average 180K+ views daily while the server uses:
16GB Memory and 8 CPU Cores (Virtualized)
This is a massive server. Most applications at this scale have better infrastructure. But we survive on donations alone. No advertising. And it will stay that way.
The only ways we can fix this are to throw more money at it or to find cost savings.
I have found some cost-saving measures that should improve performance and give us better infrastructure. This will require moving the server to a new hosting provider. Part of that migration will likely take the server down for some time; how long is currently unknown. Before we do it, we will give notice, and we will do our best to minimize downtime. There will also be a second migration, as we will be doing the move in a couple of steps.
Problem 2 - Minimal infrastructure
We don’t have all the bells and whistles that commercial applications have. Security infrastructure, monitoring infrastructure, backup infrastructure. We lack a lot of things.
What we do have is basic backups. But as many of you might know, we did lose a couple of days of data when the major outage happened.
Part of the migration is going to include some better infrastructure that I am building. With it, in the event of shit hitting the fan, we will not lose ANY data.
Another part of the migration will include monitoring of the server. We will have more visibility into what is going on with the server, so we can take action before things go wrong.
I also wanted to take a minute to praise @hugecat and @defucilis for what they’ve accomplished so far. Although I’ve been here since near the beginning, I wasn’t at all involved with setting things up. Even though it left some work to be done, they made this work despite not being very technical and definitely not being SysAdmins.
Technical bits
Currently
We have a very large virtual machine running Ubuntu Jammy (22.04). Discourse is running in a Docker container with PostgreSQL as the database.
The backups run occasionally, creating a database dump and uploading it to S3 storage.
All uploads are also pushed to that same S3 storage, so despite all the resources we are using, a lot of storage is already being pushed elsewhere.
This is the extent of our infrastructure now.
Where we want to go
#1 - NixOS
We plan to change the operating system to NixOS. This gives us a lot of superpowers that traditional systems don’t offer. We no longer need system-wide backups. This is because NixOS is declarative: the whole system is built from a configuration, so the system itself is effectively stateless and can be rebuilt from that configuration at any time. You can always revert to an older version of the system, and upgrade safely. If an update fails to build, no problem, nothing changes on the running system.
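For those who have not used NixOS, here is a toy model of the idea in Python. It is not how NixOS is implemented, and the names (`Generations`, `switch`, `rollback`, the config keys) are made up for illustration. It only shows the behaviour described above: a new system version is activated only if it builds, and older versions stay available to roll back to.

```python
class Generations:
    """Toy model: every successful change creates a new generation."""

    def __init__(self, initial_config):
        self._history = [dict(initial_config)]
        self._current = 0

    @property
    def active(self):
        return self._history[self._current]

    def switch(self, new_config, build):
        """Activate new_config only if build(new_config) succeeds."""
        candidate = dict(new_config)
        if not build(candidate):
            return False  # failed build: the running system is untouched
        self._history.append(candidate)
        self._current = len(self._history) - 1
        return True

    def rollback(self):
        """Go back to the previous generation, if there is one."""
        if self._current > 0:
            self._current -= 1
        return self.active


# Hypothetical configs; a "build" here is just a check that a key exists.
system = Generations({"monitoring": False})
ok = system.switch({"monitoring": True}, build=lambda cfg: "monitoring" in cfg)
broken = system.switch({"oops": True}, build=lambda cfg: "monitoring" in cfg)
after_rollback = system.rollback()
```

After this runs, `ok` is true, `broken` is false (and the active system was left alone), and `after_rollback` is the original configuration.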
#2 - Backup
We are moving the backups away from Discourse’s internal backups. They did their job, but we can do better. PostgreSQL and many other relational databases have something called a transaction log or write-ahead log (WAL). It is a running record of every change made to the database. You create a base backup (a physical copy of the database files, rather than a plain SQL dump) and archive all transactions from the moment the base backup was made. This allows you to restore the base backup and replay transactions up to a specific second, restoring to the very second before shit hit the fan. Meaning effectively no data loss.
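To make the replay idea concrete, here is a minimal sketch in Python. It is a simulation, not PostgreSQL: `WalRecord`, `restore`, and the keys and timestamps are all hypothetical. In real PostgreSQL the same roles are played by a base backup (for example from `pg_basebackup`), the archived WAL files, and a recovery target such as `recovery_target_time`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WalRecord:
    timestamp: int          # seconds, on some arbitrary clock
    key: str
    value: Optional[str]    # None means "this key was deleted"


def restore(base_backup: Dict[str, str],
            wal: List[WalRecord],
            target_time: int) -> Dict[str, str]:
    """Start from the base backup and replay changes up to target_time."""
    state = dict(base_backup)
    for record in sorted(wal, key=lambda r: r.timestamp):
        if record.timestamp > target_time:
            break  # stop just before the point we do not want to replay
        if record.value is None:
            state.pop(record.key, None)
        else:
            state[record.key] = record.value
    return state


base = {"topic:1": "Welcome"}
wal = [
    WalRecord(100, "topic:2", "Second topic"),
    WalRecord(200, "topic:1", "Welcome (edited)"),
    WalRecord(300, "topic:2", None),  # the "incident": a bad delete
]

# Restore to one second before the incident.
restored = restore(base, wal, target_time=299)
```

Here `restored` contains the edited first topic and the second topic, because replay stopped before the delete at time 300. With only occasional dumps, everything after the last dump would be gone.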
#3 - Monitoring
I personally have a lot of experience with Netdata. I’ve been using it in my lab with great success, and I have a few stories about how it saved my ass. Before I hear from the technical crowd about “heavy agent services”: Netdata can be stripped of most of its features, after which it uses very few resources. As an example, the heaviest Netdata instance in my lab is using 59.7MB of memory and 2.87% of CPU.
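The point of monitoring is to act before things go wrong, which usually means alerting on sustained pressure rather than a single spike. Below is a small Python sketch of that check. It is not Netdata’s configuration or API (Netdata has its own alerting); the function name, thresholds, and sample readings are assumptions made up for illustration.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if at least min_consecutive readings in a row exceed threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Made-up readings, one per minute, in percent.
memory_percent = [82, 91, 93, 95, 94, 88]
cpu_percent = [55, 100, 60, 100, 58, 62]

memory_alert = sustained_breach(memory_percent, threshold=90, min_consecutive=3)
cpu_alert = sustained_breach(cpu_percent, threshold=95, min_consecutive=3)
```

With these readings `memory_alert` is true (three readings in a row above 90%), while `cpu_alert` is false because the CPU only spikes briefly.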
#4 - Security
Right now we have no security infrastructure, and honestly, most of the time that’s fine as long as you follow good digital hygiene. But we could add security infrastructure to the site. I have a few things that I generally add to all my servers: namely an IDS/IPS (Suricata), a WAF (Cloudflare or something similar), and a number of simple security tools like fail2ban. For those who are ready to recommend Wazuh: no. The ELK stack is very resource intensive, and the most important component in a SIEM is the IDS/IPS. A SIEM is overkill for what we are doing. Maybe one day, but not today.
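For anyone unfamiliar with what a tool like fail2ban does, the core idea is simple: count failures per address within a time window and ban the address once it crosses a limit. Here is a minimal Python sketch of that idea. It is not fail2ban’s code or configuration; the class name, limits, and IP addresses (taken from the documentation ranges) are assumptions for illustration.

```python
from collections import defaultdict, deque


class FailureTracker:
    """Ban an address after too many failures inside a sliding window."""

    def __init__(self, max_failures=5, window_seconds=600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)
        self.banned = set()

    def record_failure(self, ip, timestamp):
        """Record one failed attempt; return True if ip is now banned."""
        recent = self._failures[ip]
        recent.append(timestamp)
        # Forget failures that fell out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self.banned.add(ip)
        return ip in self.banned


tracker = FailureTracker(max_failures=3, window_seconds=60)

# Three quick failures from one address: banned.
for t in (0, 10, 20):
    tracker.record_failure("203.0.113.7", t)

# Three failures spread far apart from another address: not banned.
for t in (0, 100, 200):
    tracker.record_failure("198.51.100.9", t)
```

After this runs, `tracker.banned` contains only `203.0.113.7`. A real setup would read the failures from log files and apply the ban at the firewall.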
#5 - High Availability (Not Immediately, and requires more money)
Right now, if the server goes down, everything goes down. I don’t like that. With enough money, I can build better infrastructure where the site stays up even if one server fails. This is called High Availability, but it requires a lot more hardware.
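In its simplest form, High Availability means traffic is sent to whichever server is currently healthy. The Python sketch below shows only that selection step. The server names and the health table are hypothetical, and a real setup would also need replicated data and a load balancer doing the health checks.

```python
def pick_backend(backends, is_healthy):
    """Return the first healthy backend in order, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Hypothetical health state: the first server has failed.
health = {"forum-a": False, "forum-b": True}

chosen = pick_backend(["forum-a", "forum-b"], lambda name: health[name])
```

Here `chosen` is `forum-b`, so the site stays up. With a single server, the same check would return nothing, which is the situation we are in today.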
Finally
Although I am confident in my technical skills and I know I can do this, this is a community. I want to hear your feedback on what you think we can do better, and any ideas for what you want to see in this forum. Criticism is also welcome. We are not perfect, though we may try.