Gitlab.gnome.org down?

xry111 · September 20, 2023, 7:02am

When I access gitlab.gnome.org, it shows an "Application is not available" page. I haven’t seen a maintenance announcement here.

jensgeorg · September 20, 2023, 8:27am

There is https://status.gnome.org/ for the current status, which says “yes, it’s down”.

pwithnall · September 20, 2023, 3:26pm

Is there an estimate of when it might be back up again? Seems like a significant outage, which I’m guessing might need significant work to fix?

ebassi · September 20, 2023, 4:23pm

It’s slowly coming back up.

danny-levinson · September 20, 2023, 4:49pm

Thank you! Your efforts are appreciated.

Sid · September 21, 2023, 6:36pm

All systems seem to be operational now.

Any update on why we faced this outage?

averi · September 21, 2023, 7:21pm

We haven’t had the time to dig deeper into this, as we mainly focused on recovering the cluster, but we’re confident in saying that what happened yesterday may have the same root cause as the outage we had during the OpenShift 4.10->4.11 upgrade: the underlying Ceph storage misbehaved and took down the hosted VMs and pods that use network-attached storage. When the pods/VMs migrate to a different host, the network mounts attempt to migrate as well, but they cannot because a watcher or lock is still around (either a plain RBD lock or a watcher left behind by a crashed Ceph client). This causes the pods to fail to start and throw hundreds of errors related to attempts to force the lock. What complicates things is that when network-attached storage is impacted, the CPU accumulates IOWait, which in turn has consequences for other hosted services.
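To make the failure mode above concrete, here is a toy Python model. It is only a sketch: it does not use Ceph's or OpenShift's real APIs, and the names `Volume`, `VolumeBusy`, `start_pod` and `break_stale` are hypothetical, invented for illustration. It assumes a volume can be mapped on a new host only once the previous host's lock and watcher are gone.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


class VolumeBusy(Exception):
    """Raised when another client still holds a lock or watcher on the volume."""


@dataclass
class Volume:
    # Hypothetical stand-in for a network-attached block volume.
    name: str
    lock_owner: Optional[str] = None
    watchers: Set[str] = field(default_factory=set)

    def map_on(self, host: str) -> None:
        # Mapping fails while a different host still owns the lock or a watcher.
        foreign = self.watchers - {host}
        if self.lock_owner not in (None, host) or foreign:
            holder = self.lock_owner or ", ".join(sorted(foreign))
            raise VolumeBusy(f"{self.name}: still held by {holder}")
        self.lock_owner = host
        self.watchers.add(host)

    def unmap_from(self, host: str) -> None:
        # Clean unmount: the host releases its own watcher and lock.
        self.watchers.discard(host)
        if self.lock_owner == host:
            self.lock_owner = None

    def break_stale(self, dead_host: str) -> None:
        # Operator action: drop whatever the dead host left behind.
        self.unmap_from(dead_host)


def start_pod(volume: Volume, host: str, retries: int = 3) -> Tuple[bool, List[str]]:
    """Try to map the volume on `host`, collecting one error per failed attempt."""
    errors: List[str] = []
    for _ in range(retries):
        try:
            volume.map_on(host)
            return True, errors
        except VolumeBusy as exc:
            errors.append(str(exc))
    return False, errors


vol = Volume("pvc-demo")
vol.map_on("node-a")                  # node-a crashes here and never unmaps
print(start_pod(vol, "node-b"))       # every retry fails on the stale lock
vol.break_stale("node-a")             # stale lock and watcher are cleared
print(start_pod(vol, "node-b"))       # the pod can now start on node-b
```

In this model the pod on the new host keeps failing, and logging an error on each retry, until something removes the stale lock and watcher left by the crashed host.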

When we designed the GNOME OCP cluster we also had to consider budget, and currently we’re running on a hyperconverged setup: the storage control plane runs on top of the same nodes that serve the actual network shares, and those same nodes also host a set of VMs, service pods and the OCP control plane. We also have a separate Ceph cluster, but it runs on top of HDDs and that would limit us in terms of storage performance.

We’ll keep an eye on this and eventually open a support case for Ceph and/or re-architect the storage pieces to some extent. I’m devoting my free time to this these days, and it’s always challenging to find more of it in between the multitude of other tasks.

Sid · September 21, 2023, 8:04pm

Thanks for looking into this, Andrea.

Providing the best solutions under budget constraints is not always possible, and that challenge will always remain in the open source world. Should we possibly address this at the budget level, rather than trying to solve it technically?

Also, how many admins do we have to address an outage when there is one? Are they distributed across the globe? It wouldn’t be fair for the devs / users / translators to keep waiting for hours, and it wouldn’t be fair to put the load on admin(s) to work at odd hours addressing outages.

It was a bit disappointing that https://www.gnome.org/ should go down a few hours before the GNOME 45 release. I hope the GNOME BoD has better plans to address this issue going forward.

Cheers!

averi · September 21, 2023, 8:34pm

Sid, very good questions. This specific problem started occurring recently, specifically when we transitioned to OCP 4.11 (and the newer OpenShift Data Foundation release that OCP 4.11 ships with), so it may well be a bug that is worth opening a support case for. From a budget perspective, having 3 additional nodes we could use as workers would definitely help, so that we could finally split the control plane pieces from the actual workloads.

These days it’s mainly me and Bart, with myself contributing when I can in my free time and Bart being particularly busy with his job as well. The expertise required to manage the GNOME Infrastructure is non-trivial, and hiring someone with it would cost a massive amount of money the Foundation doesn’t have. In terms of timezones, Bart is in EMEA and I’m in NA, so we get pretty good coverage. That doesn’t mean a page will get someone to have a look ASAP: there are no SLAs and literally no obligations to resolve an outage right away.

I agree with you that the timing of the event was extremely unfortunate, but there’s not much the BoD can do at this point other than wait for us to understand what exactly happened and apply a permanent fix (whether that is an architecture change, a budget request, or a support case).

Sid · September 21, 2023, 8:42pm

Thanks for the detailed update.

Sounds good.