[Planned global outage] 31st of January 2023, 8 AM - 12 PM EST

Hi,

A full network switch stack maintenance and upgrade will take place between 8 AM and 12 PM EST on the 31st of January 2023. While the outage window has been set to 4 hours, we expect the procedure itself to take 2 hours; the remaining 2 hours are reserved for troubleshooting any issues that may arise.

During the first two hours (8 - 10 AM EST), all GNOME services, including GitLab and Discourse, will be inaccessible.

GitLab login isn’t working when I try to log in with social providers. Is it because of this? The error I get is:

Could not authenticate you from GoogleOauth2 because "Actioncontroller::invalidauthenticitytoken". 

Right after the maintenance, the cluster started showing brief packet loss at the L2 level. OCP cluster services, especially etcd, were failing to connect to each other because of these brief interruptions, causing a full control plane outage. During the troubleshooting session we found the firmware/driver of one of the NICs misbehaving with:

    Jan 31 22:22:08 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0 eno12399: tx_timeout: VSI_seid: 399, Q 52, NTC: 0x1d6, HWB: 0x1d6, NTU: 0x1d6, TAIL: 0x1d6, INT: 0x1
    Jan 31 22:22:08 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0 eno12399: tx_timeout recovery level 1, txqueue 52
    Jan 31 22:22:09 master1.openshift4.gnome.org kernel: bond1: (slave eno12399): link status definitely down, disabling slave
    Jan 31 22:22:09 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0 eno12399: port 4789 already offloaded
    Jan 31 22:22:09 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0 eno12399: port 4789 already offloaded
    Jan 31 22:22:09 master1.openshift4.gnome.org kernel: bond1: (slave eno12399): link status definitely up, 10000 Mbps full duplex
    Jan 31 22:22:09 master1.openshift4.gnome.org NetworkManager[3988]: <info> [1675203729.4323] device (eno12399): carrier: link connected
    Jan 31 22:22:41 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0: eno12399 is entering allmulti mode.

Interfaces on the internal bridge were also bouncing every few minutes. The culprit was a set of misbehaving SFPs on the 10G interfaces that we use for intra-cluster traffic. These SFPs and the associated fiber cabling have been replaced.
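
For anyone chasing a similar problem, here is a minimal sketch of how link flaps and transmit timeouts like the ones in the log excerpt could be counted per interface. It is not the tooling we used during the incident: the `summarize` helper and the regular expressions are hypothetical, and they assume the syslog message layout shown in the excerpt.

```python
import re
from collections import Counter

# Bonding driver messages, e.g.
#   kernel: bond1: (slave eno12399): link status definitely down, disabling slave
LINK_EVENT = re.compile(
    r"kernel: (?P<bond>\S+): \(slave (?P<nic>\S+)\): "
    r"link status definitely (?P<state>up|down)"
)
# i40e transmit timeouts, e.g.
#   kernel: i40e 0000:31:00.0 eno12399: tx_timeout: VSI_seid: 399, Q 52, ...
# The trailing colon excludes the follow-up "tx_timeout recovery level" lines.
TX_TIMEOUT = re.compile(r"kernel: i40e \S+ (?P<nic>\S+): tx_timeout:")


def summarize(lines):
    """Return (link-down count, tx_timeout count) per interface name."""
    downs, timeouts = Counter(), Counter()
    for line in lines:
        m = LINK_EVENT.search(line)
        if m:
            if m.group("state") == "down":
                downs[m.group("nic")] += 1
            continue
        m = TX_TIMEOUT.search(line)
        if m:
            timeouts[m.group("nic")] += 1
    return downs, timeouts


# Sample input taken from the log excerpt in this post.
sample = [
    "Jan 31 22:22:08 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0 eno12399: "
    "tx_timeout: VSI_seid: 399, Q 52, NTC: 0x1d6, HWB: 0x1d6, NTU: 0x1d6, TAIL: 0x1d6, INT: 0x1",
    "Jan 31 22:22:08 master1.openshift4.gnome.org kernel: i40e 0000:31:00.0 eno12399: "
    "tx_timeout recovery level 1, txqueue 52",
    "Jan 31 22:22:09 master1.openshift4.gnome.org kernel: bond1: (slave eno12399): "
    "link status definitely down, disabling slave",
    "Jan 31 22:22:09 master1.openshift4.gnome.org kernel: bond1: (slave eno12399): "
    "link status definitely up, 10000 Mbps full duplex",
]

downs, timeouts = summarize(sample)
print(dict(downs), dict(timeouts))
```

Feeding it a longer stretch of the system log (one message per line) gives a quick per-interface view of how often a bond member dropped, which can help narrow a flapping link down to a specific port, SFP, or cable.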