Prorenata Journal (journal.prorenata.se) is not responding
Incident Report for Prorenata Journal
Postmortem

Prorenata Journal was fully unavailable for the majority of the incident.

The incident was traced to a problem in the network system of our provider, Elastx. Elastx has implemented a fix and provided a full incident report, which we publish here:

Elastx Incident report

Reason (known and possible)

Deployment of a fix for a known floating IP MAC collision issue accidentally upgraded and
restarted a crucial network agent on the hypervisors. This removed all network rules on
affected hypervisors, effectively dropping connections to any VM hosted there.
A restart of the related services was performed to re-populate the agent with rule sets, which
solved the issue on all but one hypervisor. That hypervisor was later found to host a virtual
instance with an invalid network configuration, which caused the restart of its services to
fail. Once the faulty instance was removed, all network traffic resumed normally.

Actions taken:

Short term:

  • Restarted the network agents on all affected hypervisors, which rebuilt the network rules
  • Shut down the instance that caused the network service to fail its rule rebuild on the
    last hypervisor

Long term:

  • Make sure our provisioning workflow does not upgrade packages unintentionally when
    deploying code changes.
  • Pin the network agent package so that it is never upgraded unintentionally, since
    upgrading it causes disruption.
  • Disable the network feature that caused the last hypervisor to fail its rule rebuild
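
The first two long-term actions amount to a pre-deployment check: compare the package changes a deployment would make against an explicit allow list and a set of pinned packages, and stop if anything else would change. The following is a minimal sketch of that idea, not Elastx's actual tooling. The package names, version strings, and the `PackageChange` and `unintended_upgrades` names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageChange:
    """One entry in a (hypothetical) deployment plan."""
    name: str
    installed: str   # version currently on the host
    candidate: str   # version the deployment would install


def unintended_upgrades(changes, pinned, allowed):
    """Return the version changes that should block a deployment.

    A change is blocked if the package is pinned (it must never change as a
    side effect of a code deployment), or if it is not explicitly allowed.
    Pinning takes precedence over the allow list.
    """
    blocked = []
    for change in changes:
        if change.installed == change.candidate:
            continue  # version unchanged, nothing to check
        if change.name in pinned or change.name not in allowed:
            blocked.append(change)
    return blocked


# Hypothetical plan: the intended fix plus an accidental agent upgrade.
plan = [
    PackageChange("floating-ip-fix", "1.0", "1.1"),
    PackageChange("network-agent", "2.3", "2.4"),
    PackageChange("libexample", "0.9", "0.9"),
]

blocked = unintended_upgrades(
    plan,
    pinned={"network-agent"},
    allowed={"floating-ip-fix"},
)

for change in blocked:
    print(f"blocked: {change.name} {change.installed} -> {change.candidate}")
```

In this sketch the deployment would be halted because the plan touches the pinned agent package, while the intended fix alone would pass. A real workflow would build the plan from the package manager's dry-run output rather than from a hand-written list.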

Timeline

  • 2024-04-03 16:10, Fix rolled out in the test environment without any issues and
    resolved the floating IP issue instantly at rollout. Decision taken to roll out the fix in
    production the next day without announcing a maintenance window, as the fix should only
    impact machines already affected by the floating IP issue.
  • 2024-04-04 09:02-12:00, Fix for the floating IP issue is rolled out to the API services
    and about 30% of hypervisors. No issues or alerts are reported after this initial
    rollout. The rollout was done one availability zone at a time.
  • 2024-04-04 14:15, Fix rollout continues on Sto1 hypervisors
  • 2024-04-04 15:15, We receive multiple reports of network issues that deviate from the
    expected impact of the floating IP fix.
  • 2024-04-04 15:24, Rollout of the fix is paused and engineers investigate the reported
    issues. The Sto1 site was completed; sto2 and sto3 still have some unpatched nodes.
  • 2024-04-04 16:04, Incident team is deployed to help with investigation.
  • 2024-04-04 16:19, The cause of the issues is located, and a solution is implemented
    and rolled out to affected hypervisors.
  • 2024-04-04 17:56, Rollout of the fix is completed. Things look better for most
    hypervisors, but we are still aware of issues on one hypervisor.
  • 2024-04-04 18:05, The last hypervisor struggling with connectivity is fixed.
Posted Apr 15, 2024 - 15:11 CEST

Resolved
The disruptions that affected users of journal.prorenata.se have now been resolved.
Posted Apr 04, 2024 - 16:47 CEST
Monitoring
A fix has been implemented and we are now monitoring the results.
Posted Apr 04, 2024 - 16:37 CEST
Update
We have narrowed down the probable cause of the outage and will deploy a fix shortly.
Posted Apr 04, 2024 - 16:26 CEST
Update
We are continuing to work on a fix for this issue.
Posted Apr 04, 2024 - 16:06 CEST
Identified
Our server provider has identified the problem and is working on a fix.
Posted Apr 04, 2024 - 16:04 CEST
Investigating
We are experiencing problems with our server provider. We are troubleshooting and will return with more information.
Posted Apr 04, 2024 - 15:27 CEST
This incident affected: Prorenata Journal and Prorenata Säkert Videomöte.