UPDATE: Emails should be working again! If you were trying to create/recover an account, you may need to try requesting another confirmation email.
Hi Everyone,
We are back online now and the site is open to be used again.
This post is to be transparent and detail what happened and how we fixed it.
In a later post I will go over how we want to move forward.
There were a few issues we had to fix, and I will go into the technical details of each.
-
What was originally broken: we still do not know. I presume the issue was with the host the forum was running on. We resolved this by moving to a new server.
-
What we knew was broken: a few permission issues when building the image. The site could not communicate with postgres and redis because the owner of the folders/files didn’t exist. They had mismatched IDs from a previous docker image.
-
We needed to update the host OS to get the correct version of Ruby. I changed the package repos from focal to jammy and ran the update. This ultimately did not fix the dependency issues, but it did fix a failing build of the image.
-
The Unicorn webserver was throwing an error about a missing secret_key_base. We don’t handle any sensitive tokens in the webserver, so I have no idea why this was being thrown. A friendly reddit user helped me fix this issue.
-
When we were considering moving to a new server we had an issue with the implementation. The DNS records were set with a 24 hour TTL, which meant resolvers could keep pointing at the old server for up to 24 hours after the change. We ultimately waited this out, and the TTL is now much more reasonable.
-
The discourse forum version. Discourse depends on a lot of things: the ruby version on the host, the version of discourse itself, and anything the plugins depend on. We worked through the dependency hell with a combination of moving from the master branch to the main branch and changing the host OS from Ubuntu Focal to Ubuntu Jammy.
-
Restoring from backup. Some data might have been lost. We restored from the backup with date and time 2024-01-05-060527.
-
Website assets. The site uses a CDN provider to serve assets that don’t change much. While fixing issues we purged the cache so that it would re-populate from the restored backup version. This took a while, but after some time the site no longer had any cache misses or error 500s. This is now considered resolved, which is why we opened the site.
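To illustrate the permission issue from the list above (files owned by IDs that no longer exist), here is a minimal Python sketch of how such files could be found. This is not what was actually run during the fix; the function names are made up for this example, and it assumes you run it in the same environment (host or container) whose users are supposed to own the files.

```python
import grp
import os
import pwd


def uid_exists(uid):
    """Return True if the local user database has an entry for this UID."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def gid_exists(gid):
    """Return True if the local group database has an entry for this GID."""
    try:
        grp.getgrgid(gid)
        return True
    except KeyError:
        return False


def find_orphaned_paths(root, uid_known=uid_exists, gid_known=gid_exists):
    """Return (path, uid, gid) for root and everything under it whose
    owner UID or group GID has no matching user or group.

    uid_known and gid_known are hypothetical hooks added for this sketch
    so the lookup can be swapped out (for example in tests).
    """
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidates.append(os.path.join(dirpath, name))

    orphaned = []
    for path in candidates:
        info = os.lstat(path)  # lstat: inspect symlinks, do not follow them
        if not uid_known(info.st_uid) or not gid_known(info.st_gid):
            orphaned.append((path, info.st_uid, info.st_gid))
    return orphaned
```

Calling `find_orphaned_paths()` on the data folder that postgres and redis use would list anything still owned by an ID left over from an old image, which could then be re-owned to the correct user.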
This was the reddit thread that I posted my updates on:
https://www.reddit.com/r/theHandy/comments/18zroyc/comment/kgs7itz/?context=3
I wanted to specifically thank pascaruchan for giving me some assistance with the ruby on rails issue I was not familiar with.
There are still things to do to make the site more resilient to failures, and I will be discussing with the team what we can do.
Finally, please remember, we are all volunteers. The site makes no real money; the only thing keeping it running is the patreon. Please consider donating so we can make the site better.
This thread is open for questions if anyone has any.