Service Status

Cluster 1 Hosting Update: Monday 25 November

The actions that we put in place on Friday appear to have maintained our services over the weekend.

This morning (Monday), one of the three sub-Clusters is seeing reasonably high CPU usage, currently peaking at just over 60%: this is 30% higher than the next busiest sub-Cluster, and 50% higher than the lowest.

There are only 5 installations on this busy sub-Cluster (A), and so we are now better able to focus on understanding where the root cause lies.

Until the busy period ends — usually around 11am — we will continue to monitor the Cluster, and take remedial action if required (by simply switching one or two of the sites to one of the less busy sub-Clusters).

Any further actions that we take will be agreed by the team, and actioned only after the morning traffic spike has levelled off.

We will continue to post updates here, as appropriate.


Cluster 1 Hosting Issues: Saturday update

By 21:00 on Friday evening, we had enacted all of the measures that we outlined in the previous update, i.e.:

  • we doubled the size of Cluster 1, increasing it to 6x VMs;
  • using our Web Application Firewalls (WAF), we "soft" split Cluster 1 into 3x sub-Clusters of 2x VMs each;
  • all of the sites previously pointed at Cluster 1 were redistributed to one of the 3x sub-Clusters.

As of now, this appears to have been effective, with no further issues observed on Saturday morning. We will check again on Sunday morning.

The real challenge will come with the massive increase in traffic that we see on a Monday, but we will be prepared to counter any issues at that point.

We continue to search for possible Root Causes, but have yet to establish anything definitive. We will continue to investigate, and communicate with our customers as appropriate.


Cluster 1 Server Issues, 22 November 2024: Actions

As some of you will know, we have had ongoing issues with customer sites located on Cluster 1 (one of 5 Clusters) in our Managed Cloud Services hosting environment:

  • Cluster 1 consists of 3x load-balanced "virtual machine" (VM) application servers;
  • on Thursday 21 November, at a few minutes past 09:00, the CPU (instruction processing chip) on all three VMs spiked to 100%;
  • when a VM's CPU hits 100%, the load balancer stops sending traffic to that VM;
  • as traffic stopped to each VM, the CPU usage dropped to almost nothing — but the load increased on the other two VMs;
  • this is what leads to the massive up/down spikes that you can see in the image attached to this post (a worked illustration of the redistribution follows this list).
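
To make the redistribution effect concrete, here is a minimal back-of-the-envelope sketch. The numbers are hypothetical (the 70% baseline is purely illustrative, not a measured figure), but it shows why draining a saturated VM concentrates the same traffic onto fewer VMs, which then saturate in turn:

    # Illustration only: hypothetical numbers showing how draining a saturated VM
    # concentrates the same traffic onto fewer VMs, pushing them towards 100% in turn.
    requests_per_hour = 1_000_000   # approximate hourly volume noted in the Root Cause section below
    cpu_at_three_vms = 70           # hypothetical steady-state CPU % with all three VMs serving

    for vms_in_rotation in (3, 2, 1):
        share = requests_per_hour / vms_in_rotation
        # Assume, for illustration, that per-VM CPU scales with its share of traffic.
        estimated_cpu = min(cpu_at_three_vms * 3 / vms_in_rotation, 100)
        print(f"{vms_in_rotation} VM(s) serving: ~{share:,.0f} requests/hour each, "
              f"estimated CPU ~{estimated_cpu:.0f}%")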

This particular problem has not happened before, either on Cluster 1 or on any of the other Clusters.

Root Cause

Identifying the Root Cause has been challenging:

  • VerseOne's team made no changes to Cluster 1 — either to the VM configurations or to the site configurations — in the period immediately prior to Thursday;
  • our analysis shows that there has not been any significant increase in total traffic to Cluster 1 (bearing in mind that we are logging approximately 1,000,000 requests per hour);
  • the Lucee Java application (which ultimately serves the sites) is something of a "black box", making analysis of CPU usage difficult.

Actions

Our priority is to stabilise the environment, keeping our customers' sites up and running and so giving us the time to understand where the Root Cause lies. As such, we are today taking the following actions:

  • we are doubling the size of Cluster 1 — increasing to 6x VMs;
  • using our Web Application Firewalls (WAF), we will "soft" split Cluster 1 into 3x sub-Clusters of 2x VMs each;
  • all of the sites currently pointed at Cluster 1 will be redistributed to one of the 3x sub-Clusters (a conceptual sketch of this split follows this list).
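
The routing itself is handled in the WAF layer, but conceptually the split amounts to something like the sketch below. All hostnames, pool names and VM names are hypothetical: each site is pinned to one sub-Cluster of two VMs, and re-pointing a site at a quieter sub-Cluster is a single mapping change.

    # Conceptual sketch only: the real routing lives in the WAF configuration, and
    # all names here are hypothetical. Each sub-Cluster is a pool of two VMs, and
    # every site is pinned to exactly one sub-Cluster.
    SUB_CLUSTERS = {
        "sub-a": ["vm1", "vm2"],
        "sub-b": ["vm3", "vm4"],
        "sub-c": ["vm5", "vm6"],
    }

    # Redistribution of sites that previously all pointed at the single Cluster 1 pool.
    SITE_TO_SUB_CLUSTER = {
        "site-one.example.org": "sub-a",
        "site-two.example.org": "sub-b",
        "site-three.example.org": "sub-c",
    }

    def backend_pool(hostname: str) -> list:
        """Return the two VMs that should serve requests for this hostname."""
        return SUB_CLUSTERS[SITE_TO_SUB_CLUSTER[hostname]]

    # Mitigation during an incident is then a one-line change, e.g. re-pointing a
    # busy site at a quieter sub-Cluster:
    # SITE_TO_SUB_CLUSTER["site-one.example.org"] = "sub-c"

    print(backend_pool("site-two.example.org"))   # -> ['vm3', 'vm4']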

This approach gives us four core benefits:

  • we have more confidence that the sites will stay up;
  • if/when the issue recurs, fewer VMs will be affected — so affecting fewer customers;
  • if/when the issue recurs, we can very quickly redirect traffic for individual sites to different sub-Clusters, so providing faster mitigation and keeping sites up;
  • it will enable us to better pinpoint which installations (if any) are causing the primary issue.

We will continue to monitor the environment over the weekend, and maintain the focus into next week.

Root Cause Investigation

In the meantime, we are refocusing our attempts to understand where the problem lies:

  • instead of analysing all traffic, we are now going to analyse traffic to the Lucee application specifically — total traffic might not have changed significantly, but if the Lucee traffic has increased by, say, 20%, this would have a significant effect (a simplified sketch of this check follows this list);
  • we are using specialist analysis tools to delve into the Lucee Java application, to try to identify any "threads" that might be looping or over-running and so consuming excessive CPU resources.
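
For the traffic side of this, the first pass amounts to something like the sketch below. The log path and the pattern used to mark Lucee requests are hypothetical, and this is not our internal tooling; the idea is simply to count per-hour requests that reach the Lucee application separately from total requests, so a relative rise in dynamic traffic stands out even when overall volume looks flat.

    # Illustrative sketch only: hypothetical access-log path and request patterns,
    # not our real analysis tooling. Counts per-hour requests that reach the Lucee
    # application separately from total requests.
    import re
    from collections import Counter

    LOG_FILE = "access.log"                                   # hypothetical path
    HOUR_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2})")      # e.g. [21/Nov/2024:09
    LUCEE_RE = re.compile(r"\.cfm|/lucee/", re.IGNORECASE)    # crude marker for dynamic requests

    total_per_hour = Counter()
    lucee_per_hour = Counter()

    with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = HOUR_RE.search(line)
            if not match:
                continue
            hour = match.group(1)
            total_per_hour[hour] += 1
            if LUCEE_RE.search(line):
                lucee_per_hour[hour] += 1

    # Lexical sort of the hour keys is fine for a single day's log.
    for hour in sorted(total_per_hour):
        dynamic_share = 100 * lucee_per_hour[hour] / total_per_hour[hour]
        print(f"{hour}  total={total_per_hour[hour]:>8}  "
              f"lucee={lucee_per_hour[hour]:>8}  ({dynamic_share:.1f}% dynamic)")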

As stated above, there is no obvious cause for this issue, and so analysis is likely to take some days. Our priority is to ensure that our customer sites continue to load, and to load as swiftly as possible.

We are sorry that this issue is affecting your service, but please be assured that the stability and performance of your service is our number one priority — we will keep working at this until we have solved the current issues.

We will continue to post updates here, and also to proactively contact you where appropriate.


Cluster 1 Server Issues update, 22 November 2024

We are still experiencing issues across Cluster 1, although we have improved stability through isolating some installations.

We are preparing some new servers to add into the Cluster 1 "pool" to try to mitigate the symptoms, but we are still trying to identify the Root Cause — and why it should only have manifested for the first time on Thursday morning.

We continue to investigate, and thank you for bearing with us.


Cluster 1 Issues, 22 November 2024

This morning we have seen a recurrence of the same CPU spikes that caused issues yesterday. Once again, it is limited to a single application server Cluster.

[Image: Cluster 1 CPU spikes]

Despite working through millions of lines of logs yesterday, we have not yet identified a Root Cause. However, we have formulated and enacted our mitigation strategies, so we should be able to restore full service much more quickly. We have also enabled additional logging, to give us more information to help track down the current issues.

Please be assured that we are treating this as the highest priority, and thank you for bearing with us.

Actions

  • we are spinning up a new server in the Cluster;
  • using this, we will attempt to isolate each site in turn, to identify and prove which are causing the problem (an outline of this isolation approach follows this list);
  • we are continuing to prioritise stabilisation first, and Root Cause identification next.
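
In outline, the isolation step works like the sketch below. The helper functions, hostnames and threshold are hypothetical placeholders; in practice the re-pointing happens at the WAF / load-balancer level and the CPU readings come from our monitoring system. The idea is to move one site at a time onto the spare server, observe its CPU in isolation, and then return it to the main pool.

    # Outline of the isolation approach, with hypothetical helper names and thresholds.
    # In practice the re-pointing is a WAF / load-balancer change and the CPU
    # figures come from the monitoring system, not from this script.
    SITES = ["site-one.example.org", "site-two.example.org", "site-three.example.org"]
    SPIKE_THRESHOLD = 95            # CPU % that we treat as reproducing the problem

    def point_site_at(site, server):
        """Placeholder for the real routing change."""
        print(f"routing {site} -> {server}")

    def observed_peak_cpu(server):
        """Placeholder for querying monitoring for the server's peak CPU %."""
        return 40.0

    suspects = []
    for site in SITES:
        point_site_at(site, "isolation-server")
        if observed_peak_cpu("isolation-server") >= SPIKE_THRESHOLD:
            suspects.append(site)             # this site reproduces the spikes on its own
        point_site_at(site, "cluster1-pool")  # return the site to the main pool

    print("sites that reproduce the spikes in isolation:", suspects)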