As part of AIS Managed Services, we provide proactive management and reactive support of infrastructure and applications at a predictable monthly cost. Recently, during a routine infrastructure health check, we noticed that Azure was failing to take backups for a particular virtual machine. Why?
The client is a medium-sized outdoor equipment vendor. For this enterprise customer, we have configured Azure Recovery Services to take a daily backup of all the virtual machines in the production environment. The environment is set up with four domain controllers: two are hosted in Azure and the other two are hosted on-premises. All domain controllers are running Windows Server 2008 R2. Both domain controllers hosted in Azure have 120 GB System Drives and run only the Active Directory Domain Services and DNS Server roles.
You could easily guess the purpose of the server involved in this case (psst, it’s the Domain Controller). It was important to ensure that a backup of this domain controller went through because there was an ongoing migration of the master domain controller (hosted on-site) and we wanted to ensure that there was no downtime.
When we checked the Error log for further details, Azure showed us the following:
Well, that was strange! We checked the network and everything was fine. The domain controller was also responding to PING requests properly.
Furthermore, Azure was not showing metrics like CPU and RAM for this server.
We logged into the server and, to our surprise, saw that there was zero free disk space left on the System Drive. Looking around, we found that the TEMP folder (C:\Windows\TEMP) was over 100 GB in size and contained the following type of files:
After digging around further, we noticed that these files were being created by Windows Resource Protection, which is associated with SFC (System File Checker). We found an unusually large log file in the CBS (C:\Windows\Logs\CBS) Logs folder:
To ensure that logs are backed up, Windows Resource Protection creates a cabinet archive of them in the TEMP folder. In this case, it was failing to create the archive because the log file was simply too big. It kept retrying, and every failed attempt left orphaned files behind in the TEMP folder.
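The two symptoms described above (an oversized log file and a TEMP folder full of leftover files) can be checked with a short script. The following is only a minimal sketch, not the tooling we used: the function names, the 2 GiB threshold, and the `pattern` parameter are assumptions made for illustration, since the article only says the log was unusually large and the file names from the screenshots are not reproduced here.

```python
from pathlib import Path

# Assumption: 2 GiB is an illustrative threshold only; adjust it to taste.
SIZE_LIMIT_BYTES = 2 * 1024**3


def find_oversized_logs(log_dir, limit_bytes=SIZE_LIMIT_BYTES):
    """Return (path, size) pairs for files in log_dir larger than limit_bytes."""
    oversized = []
    for entry in Path(log_dir).iterdir():
        if entry.is_file():
            size = entry.stat().st_size
            if size > limit_bytes:
                oversized.append((entry, size))
    # Largest file first, so the worst offender is at the top.
    return sorted(oversized, key=lambda item: item[1], reverse=True)


def summarize_dir(temp_dir, pattern="*"):
    """Count the files directly inside temp_dir that match pattern.

    Returns a (file_count, total_size_in_bytes) tuple. The pattern is a
    hypothetical hook for narrowing the scan to the orphaned files.
    """
    files = [p for p in Path(temp_dir).glob(pattern) if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

On the affected server, one would call `find_oversized_logs(r"C:\Windows\Logs\CBS")` and `summarize_dir(r"C:\Windows\Temp")` from an elevated prompt and compare the totals against the size of the System Drive.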
To fix the problem, we took the following steps:
- Move the CBS log files to another drive with enough free space.
- Delete all the files inside C:\Windows\Temp. (Do not delete the folder but rather the files inside it!)
- Initiate a shutdown of the domain controller so that everything gets re-initialised. (The Event Log was still showing several errors caused by the lack of free space.)
- Shut down the VM in Azure.
- Start the Virtual Machine.
- Ensure that the TEMP folder does not contain those files and check the CBS folder to ensure that the log files are being created from scratch.
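The file-level steps in the list above (moving the CBS logs and emptying the TEMP folder without removing the folder itself) could be scripted roughly as follows. This is a hedged sketch under stated assumptions: `move_logs`, `empty_folder`, and the `dry_run` flag are hypothetical names, the destination folder is whatever drive has enough space, and files locked by a running service are not handled beyond being skipped during deletion. The shutdown and restart steps are not covered.

```python
import shutil
from pathlib import Path


def move_logs(cbs_dir, archive_dir):
    """Move every file in cbs_dir to archive_dir; return the moved file names."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in Path(cbs_dir).iterdir():
        if entry.is_file():
            shutil.move(str(entry), str(archive / entry.name))
            moved.append(entry.name)
    return sorted(moved)


def empty_folder(temp_dir, dry_run=True):
    """Delete the files inside temp_dir but keep the folder itself.

    Returns the paths that were removed or, in a dry run, would be removed.
    """
    removed = []
    for entry in Path(temp_dir).iterdir():
        if not entry.is_file():
            continue  # Only files are touched; subfolders are left alone.
        try:
            if not dry_run:
                entry.unlink()
            removed.append(entry)
        except OSError:
            # A file that is still in use cannot be deleted; leave it in place.
            continue
    return removed
```

Running `empty_folder` with the default `dry_run=True` first gives a list to review before anything is actually deleted.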
Our proactively-minded engineers then decided it was best to check other similar servers for the same problem, and sure enough, we found another domain controller that was well on its way to running out of free space on the System Drive. We performed the steps above for this domain controller as well, and the issues were fixed. Azure completed the backup successfully and the metrics were visible again.
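A proactive check like the one described above can be as simple as comparing the free space on each drive against a threshold. The snippet below is a minimal sketch that runs locally on one server; the function names and the 10 percent default are assumptions for illustration, and collecting the results from several servers is left out.

```python
import shutil


def free_space_percent(path):
    """Return the percentage of free space on the volume that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def low_space_drives(paths, min_free_percent=10.0):
    """Return (path, percent_free) for each path below min_free_percent."""
    flagged = []
    for path in paths:
        percent = free_space_percent(path)
        if percent < min_free_percent:
            flagged.append((path, round(percent, 1)))
    return flagged
```

For example, `low_space_drives(["C:\\"])` returns an empty list while the System Drive has at least 10 percent free, and a one-item list once it drops below that.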
Fortunately, the server did not crash during this incident. Otherwise, it would have failed to boot because of the lack of free space, and we would have had to mount the drive on another VM before we could even start troubleshooting. Having AIS on hand as their Managed Services provider ensured this problem was detected before it impacted end users. It also ensured a prompt resolution by a highly qualified individual who understood the customer’s installation and had a large pool of reach-back expertise, all for a pre-set fixed price without additional hidden fees!