As some of you will know, we have had ongoing issues with customer sites located on Cluster 1 (of 5 Clusters) in our Managed Cloud Services hosting environment:
- Cluster 1 consists of 3x load-balanced "virtual machine" (VM) application servers;
- on Thursday 21 November, at a few minutes past 09:00, CPU (processor) usage on all three VMs spiked to 100%;
- when a VM's CPU hits 100%, the load balancer stops sending traffic to that VM;
- as traffic to each VM stopped, its CPU usage dropped to almost nothing, but the load on the remaining VMs increased;
- this cascade is what causes the massive up/down spikes that you can see in the image attached to this post (a simplified illustration of the effect follows below).
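To make the cascade above a little more concrete, the short Python sketch below models it in a purely illustrative way. It is not our actual load-balancer logic: the total demand figure, the 100% health threshold and the even split of load across in-service VMs are all assumptions, made only to show how pulling one saturated VM out of rotation pushes the remaining VMs over the same threshold in turn.

```python
# Purely illustrative toy model of the cascade; not the real load balancer.
# Assumptions: total demand is split evenly across in-service VMs, and a VM is
# taken out of rotation as soon as its CPU demand reaches 100%.

TOTAL_DEMAND = 330   # hypothetical total CPU demand, expressed as "% of one VM"
CPU_LIMIT = 100      # a VM saturates at 100% utilisation

def simulate(vm_count: int) -> None:
    in_service = list(range(1, vm_count + 1))
    while in_service:
        share = TOTAL_DEMAND / len(in_service)   # even split across healthy VMs
        print(f"{len(in_service)} VM(s) in service, each asked for ~{share:.0f}% CPU")
        if share < CPU_LIMIT:
            print("Cluster is stable at this level of demand.")
            return
        removed = in_service.pop()               # load balancer drops a saturated VM
        print(f"VM {removed} hits 100% and is removed; its load shifts to the rest")
    print("Every VM has been driven to 100% in turn.")

simulate(3)
```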
This particular problem has not happened before, either on Cluster 1 or on any of the other Clusters.
Root Cause
Identifying the Root Cause has been challenging:
- VerseOne's team made no changes to Cluster 1 (neither to the VM configurations nor to the site configurations) in the period immediately before Thursday;
- our analysis shows that there has not been any significant increase in total traffic to Cluster 1 (bearing in mind that we are logging approximately 1,000,000 requests per hour);
- the Lucee Java application (which ultimately serves the sites) is something of a "black box", making analysis of CPU usage difficult.
Actions
Our priority is to stabilise the environment, keeping our customers' sites up and running and so giving us the time to understand where the Root Cause lies. As such, we are today taking the following actions:
- we are doubling the size of Cluster 1 from 3x to 6x VMs;
- using our Web Application Firewalls (WAF), we will "soft" split Cluster 1 into 3x sub-Clusters of 2x VMs each;
- all of the sites currently pointed at Cluster 1 will be redistributed across the 3x sub-Clusters, with each site assigned to a single sub-Cluster (a simplified sketch of this routing split follows below).
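As a simplified sketch of what the WAF-level "soft" split gives us, the Python below shows the idea of pinning each site to one sub-Cluster of two VMs and re-pointing a single site on demand. The sub-Cluster names, hostnames and VM labels are invented for the example; the real rules live in the Web Application Firewall configuration, not in application code.

```python
# Hypothetical illustration of the "soft" split; all names below are invented.
SUB_CLUSTERS = {
    "sub-cluster-a": ["vm1", "vm2"],
    "sub-cluster-b": ["vm3", "vm4"],
    "sub-cluster-c": ["vm5", "vm6"],
}

# Example site-to-sub-Cluster assignments (in reality this mapping is held in the WAF).
SITE_ROUTING = {
    "site-one.example.org": "sub-cluster-a",
    "site-two.example.org": "sub-cluster-b",
    "site-three.example.org": "sub-cluster-c",
}

def backends_for(hostname: str) -> list[str]:
    """Return the pair of VMs that currently serve a given site."""
    return SUB_CLUSTERS[SITE_ROUTING[hostname]]

def move_site(hostname: str, target_sub_cluster: str) -> None:
    """Re-point a single site at a different sub-Cluster, e.g. during an incident."""
    SITE_ROUTING[hostname] = target_sub_cluster

print(backends_for("site-one.example.org"))    # ['vm1', 'vm2']
move_site("site-one.example.org", "sub-cluster-c")
print(backends_for("site-one.example.org"))    # ['vm5', 'vm6']
```

Because the mapping is held per site, moving an individual site away from a struggling sub-Cluster is a single routing change rather than a redeployment.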
This approach delivers four key benefits:
- we have more confidence that the sites will stay up;
- if / when the issue recurs, fewer VMs, and therefore fewer customers, will be affected;
- if / when the issue recurs, we can very quickly redirect traffic for individual sites to a different sub-Cluster, providing faster mitigation and keeping sites up;
- it will enable us to better pinpoint which installations (if any) are causing the primary issue.
We will continue to monitor the environment over the weekend, and will maintain that focus into next week.
Root Cause Investigation
In the meantime, we are refocusing our efforts to understand where the problem lies:
- instead of analysing all traffic, we are now analysing the traffic that reaches the Lucee application: total traffic might not have changed significantly, but if the Lucee traffic alone has increased by, say, 20%, this would have a significant effect (see the sketch after this list);
- we are using specialist analysis tools to delve into the Lucee Java application, to try to identify any "threads" that might be looping or over-running and so consuming excessive CPU resources.
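As a rough sketch of the first line of that investigation, the snippet below counts, hour by hour, how many logged requests were passed through to the Lucee application compared with requests in total. The log format, the field layout and the test used to recognise a "Lucee request" (a ".cfm" path or an assumed upstream marker) are all assumptions for illustration; our real analysis runs against our own logging pipeline.

```python
# Illustrative sketch: compare hourly Lucee-bound request counts with total requests.
# Assumes a combined-style access log; the ".cfm" / "upstream=lucee" tests are
# assumptions made for this example only.
from collections import Counter
import re

LOG_HOUR = re.compile(r'\[(?P<hour>\d{2}/\w{3}/\d{4}:\d{2})')   # e.g. [21/Nov/2024:09

def hourly_counts(path: str) -> tuple[Counter, Counter]:
    total, lucee = Counter(), Counter()
    with open(path) as handle:
        for line in handle:
            match = LOG_HOUR.search(line)
            if not match:
                continue
            hour = match.group("hour")
            total[hour] += 1
            if ".cfm" in line or "upstream=lucee" in line:   # assumed Lucee marker
                lucee[hour] += 1
    return total, lucee

total, lucee = hourly_counts("access.log")    # hypothetical log file name
for hour in sorted(total):
    share = 100 * lucee[hour] / total[hour]
    print(f"{hour}:00  total={total[hour]:>8}  lucee={lucee[hour]:>8}  ({share:.1f}% Lucee)")
```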
As stated above, there is no obvious cause for this issue, and so analysis is likely to take some days. Our priority is to ensure that our customers' sites continue to load, and to load as swiftly as possible.
We are sorry that this issue is affecting your service, but please be assured that the efficiency of your service is our number one priority: we will keep working on this until we have resolved the current issues.
We will continue to post updates here, and also to proactively contact you where appropriate.