It has been nearly two years since we migrated our infrastructure to PTM Data Centre 2, and in that time we had experienced almost no service disruption. On 2 August 2023, we suffered our first major service blackout since the migration. Although we had been able to resolve most previous issues within a reasonable timeframe, this new problem left us completely puzzled.

What happened exactly?

One of the core components of the virtualisation servers (Proxmox servers) used to host important Virtual Machines (VMs) for services such as Identity Services, VPN services, web servers and more, malfunctioned. This component, known as bridge-utils (a tool for managing Linux network bridges), is crucial for handling network interface bridges within Proxmox, enabling communication and connectivity between all the services.
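For context, each Proxmox node defines its Linux bridges in /etc/network/interfaces. The stanza below is a minimal sketch of what such a bridge definition typically looks like; the interface name and addresses are illustrative, not our actual configuration:

```
auto vmbr0
iface vmbr0 inet static
        address 192.168.10.2/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
```

Every VM's virtual NIC, as well as the node's own storage traffic, attaches to a bridge like this, which is why a bridge failure takes everything down at once.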

Without a working bridge component, none of the bridges could forward network traffic in or out, directly impacting the Proxmox servers' ability to connect to their internal Ceph storage, which stores all the VM data. As a result, none of the services could be brought online, leading to a massive blackout of all DICC services until the bridge issue was resolved.
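To illustrate the failure mode, these are the kinds of checks an administrator would typically run on an affected node (the bridge name is illustrative, and this assumes the iproute2 and Ceph client tools are installed):

```bash
# List bridge ports and their state;
# a healthy port should report "state forwarding"
bridge link show

# Confirm the bridge interface itself is up
ip -br link show vmbr0

# Check whether the node can still reach the Ceph cluster;
# this hangs or errors out when the bridge is not forwarding traffic
ceph -s
```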

What caused the delay in resolution?

Initially, we never imagined that the bridge component, which had been running flawlessly for years, would suddenly break down on its own without any system changes. Since the establishment of DICC in 2017, we have been running Proxmox servers and have never encountered such an issue. We tried various other solutions, including reconfiguring our network switches, reconfiguring the Proxmox server network, replacing the network interface card with a brand-new card, and replacing the cables connecting the interface card to the network switches, but none of them proved effective.

So we decided to explore the Proxmox community forum and came across an old post describing an issue very similar to ours. Although the post did not offer a direct solution, one of the comments suggested reinstalling the bridge component on the Proxmox servers, which, surprisingly, resolved the issue on our side. However, we had to perform some reconfiguration, as parts of the previous configuration no longer worked as they should.
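We did not keep an exact transcript of the fix, but on a Debian-based Proxmox node the reinstall suggested in that comment would look roughly like the following sketch (assuming ifupdown2, the network stack on recent Proxmox releases):

```bash
# Reinstall the bridge management package
apt-get install --reinstall bridge-utils

# Reload the network configuration so the bridges are torn down and rebuilt
ifreload -a
```

After a step like this, bridge options that relied on older defaults may need to be set explicitly again, which matches the reconfiguration work described above.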

Could this problem happen again?

Unfortunately, given our current budget and infrastructure, there is little we can do to prevent such incidents. Many data centres keep spare sets of hardware for exactly these situations, allowing affected services to be migrated quickly to unaffected hardware and minimising downtime. Due to budget constraints, however, we cannot afford to maintain additional standby hardware at all times to handle these unexpected moments.

As long as our Proxmox servers remain stable, our services can maintain close to 100% uptime. This bridge failure caused a blackout of our services for approximately 15 hours: it took us almost 8 hours to identify the root cause, and we spent another 6 to 7 hours patching the systems to ensure there were no other issues. The cluster was back online at 12:00 am the next day.

Which services were impacted?

Essentially all the services hosted by DICC were affected, including but not limited to the following:

  • Single Sign On (SSO) service
  • VPN services
  • JIRA Service Desk
  • Confluence Documentation
  • HPC Related Services
  • UM Research Data Repository
  • DICC Drive service
  • Monitoring services

If you have data stored in any of our services, please check and validate it, and do let us know if you find any issues.

Categories: HPC, Incident, News