We generally hear plenty of use cases where vSAN is running out of space and the fix is to add new drives or scale the servers up or out. But have you ever worked in an environment that is over-provisioned, where the hardware vendor is charging hundreds of dollars per month per drive and management is pushing you to deliver real savings by eliminating the over-provisioning? I have the solution here to remove disks from vSAN-enabled hosts without any hassle. Here we go:
Pre checks:
A fair amount of maths is involved before you start removing the extra DGs from the hosts in a specific cluster; here are the basic details for reference:
- Check the current free space in the vSAN datastore; there should be enough free space to sustain a host/DG failure in the cluster. I would also suggest considering future growth after discussing it with the different application teams. For example: assume you have 41 TB free out of 52 TB total (~21 % utilisation) in a cluster of 5 hosts, each host having 3 disk groups (3 x 3.49 TB capacity drives and 3 x 960 GB cache drives). If we remove 1 DG from each host, the total datastore capacity drops by 33 %, from 52 TB to roughly 35 TB. Based on the FTT policy used in the environment, calculate the storage sizes accordingly and proceed further only if the cluster can still support n failures after the disk groups are removed.
- Check the vSAN cluster health and DO NOT PROCEED if there are any errors/warnings.
- Check for any resyncing or inaccessible objects and DO NOT PROCEED if there are any (a CLI sketch for both checks follows this list).
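If you prefer to run these prechecks from the ESXi shell instead of the UI, recent vSAN releases (6.6 and later) expose them through esxcli. Treat this as a minimal sketch and cross-check against the health UI, since the exact namespaces vary by version:

# Run on any host in the cluster (vSAN 6.6+).
# Overall cluster health - do not proceed on errors/warnings:
esxcli vsan health cluster list

# Objects currently resyncing - wait for this to report zero:
esxcli vsan debug resync summary get

# Object health summary - watch for inaccessible objects:
esxcli vsan debug object health summary get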
Which DG to pick, and the steps for removing the disks:
Once the above pre checks are completed successfully, proceed with the below steps:
1. Check the health of the drives in the host and prefer to pick the one that has a bad history (or) media errors (a SMART sketch follows below).
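If you want to pull SMART data from the shell to spot a drive with media errors, something like this works on most SAS/SATA devices (the NAA ID below is hypothetical; output fields vary by drive and controller):

# Substitute your own device ID from disk management:
esxcli storage core device smart get -d naa.5000c500a1b2c3d4
# Look at the Reallocated Sector Count and Read/Write Error Count rows.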
2. Make sure to note the NAA ID of the drive; this can be done in two ways:
- Just copy the disk NAA ID from the vCenter disk management page. (or)
- SSH to the host and run the command esxcli storage core device list, then copy the device name by cross-checking against the NAA ID shown in disk management.
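The full device list can be long; if you already have the NAA ID from the UI, you can confirm it directly (same hypothetical ID as above):

esxcli storage core device list -d naa.5000c500a1b2c3d4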
3. Delete the specific disk group, or just the disk, with Full data migration; this will trigger a resync based on the data present on the drives.
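The deletion is normally done from the vCenter disk management page, but for reference the same can be driven from the host with esxcli. The evacuation-mode option is only available on newer builds and the exact value casing can differ by release, so verify with esxcli vsan storage remove --help before running anything:

# Remove a single capacity disk with full evacuation (hypothetical NAA ID):
esxcli vsan storage remove -d naa.5000c500a1b2c3d4 -m EvacuateAllData

# Or remove the whole disk group by pointing at its cache-tier SSD:
esxcli vsan storage remove -s naa.5000c500a9b8c7d6 -m EvacuateAllData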
4. Once the resync is completed, you will see the number of drives or disk groups reduced under Disk Management.
5. Using the NAA ID noted in step 2, run the below command on the specific host to flash the locator LED; the flashing LED helps you pinpoint the exact drive in the server.
esxcli storage core device set -d <NAA ID of drive> -l=locator -L=240
Note: since the LED flash stops automatically after 240 seconds, I would suggest running the command again or automating it (see the loop below). If you try this via the UI, the behaviour is a little strange. You'll find icons for turning the LEDs on and off when you select a disk drive in the UI, but the task will report as complete whether it succeeded or failed; you have to go into the events view to see whether it actually worked.
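A quick way to keep the LED blinking from the ESXi shell is to re-arm the locator just before the timeout expires (same hypothetical NAA ID as earlier):

# Re-issue the locator command every 230 s so the LED never goes dark.
# Press Ctrl+C once the drive has been identified.
while true; do
  esxcli storage core device set -d naa.5000c500a1b2c3d4 -l locator -L 240
  sleep 230
done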
6. If the above command executes successfully, it simply returns to the prompt. If not, an error is shown and the reason for the failure can be found in the hostd logs.
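To dig out the failure reason, grep hostd's log for the device; the log path below is the standard one on ESXi:

grep -i naa.5000c500a1b2c3d4 /var/log/hostd.log | tail -n 20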
7. Move the host into MM with Ensure accessibility, and raise the CLOM repair delay to 120 or 180 mins (commands below). Though most of the hardware vendors support hot removal of drives, I would still put the host into MM to ensure that VMs are not affected even if the wrong drive is pulled from the server.
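Both actions can be done over SSH. The repair delay is the VSAN.ClomRepairDelay advanced setting; note that it is per-host, so change it on every host in the cluster, and on older builds clomd needs a restart to pick it up. A sketch:

# Raise the repair delay from the default 60 min to 120 min (repeat on all hosts):
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 120
/etc/init.d/clomd restart

# Enter maintenance mode with Ensure accessibility:
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility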
8. Physically remove the drive from the server and rescan the storage devices: vCenter > Host > Configure > Storage > Storage Devices (or from the CLI, as shown after step 9).
9. Exit the host from MM.
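The rescan and the MM exit can also be done from the same SSH session:

# Rescan all storage adapters after pulling the drive:
esxcli storage core adapter rescan --all

# Bring the host back out of maintenance mode:
esxcli system maintenanceMode set -e false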
Post checks:
- Check the vSAN health for any warnings, errors, inaccessible objects, or resyncs; the same esxcli one-liners from the pre checks apply here.
If the post checks are successful and a few more hosts in the same cluster are due for removal of disk groups or disks, just follow the same steps as mentioned above.