VMware vCloud Director multi-cell features

Products

VMware Cloud Director

Issue/Introduction

vCloud Director must keep its services running when hardware fails and balance them across the available cells. This article describes how vCloud Director handles failures of its critical components and keeps them highly available.

Environment

VMware Cloud Director 1.5.x
VMware Cloud Director 1.0.x

Resolution

A vCloud Director infrastructure can contain one or more cells. Multi-cell communication is achieved by using a Message Bus.

A multi-cell deployment requires a session-aware load balancer. Although every cell continues to run all the services, cells can be given special roles and made responsible for certain services. Cells learn about each other when they register with the same Oracle database.

Notes:

All cells in a multi-cell environment must be configured to use a centralized NTP server.
NTP synchronization is also required between all ESX hosts.

VMware Cloud Director critical components (services) provide high availability and survive hardware failures.

Some of the critical clustered components are:

Monitoring Service
Heartbeat Service
Console Proxy
VC Proxy
Image Transfer Service
Activity Log Cleaner
LDAP Synchronizing Service

Of these, the following services do not need to be clustered in VMware Cloud Director because they run on every cell:

Console Proxy – This component runs on every cell and is stateless. All instances are capable of doing work, so the failure of any one instance does not affect user requests because the load balancer redirects them.
Image Transfer Service – This component also runs on every cell and is stateless. All instances are capable of doing work, so the failure of any one instance does not affect user requests.
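
The redirect behavior described for these stateless components can be illustrated with a minimal Python sketch. It is not vCloud Director or load-balancer code: the `StatelessPool` class, its round-robin policy, and the cell names are assumptions made for illustration only. It shows that when every instance can do the work, a failed cell is simply skipped.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StatelessPool:
    """Hypothetical load-balancer view of cells that run a stateless service."""

    cells: list[str]
    down: set[str] = field(default_factory=set)
    _next: int = 0  # position of the next round-robin candidate

    def mark_down(self, cell: str) -> None:
        """Record that a cell has failed and must not receive requests."""
        self.down.add(cell)

    def route(self) -> str:
        """Return the next healthy cell, skipping any cell marked down."""
        for _ in range(len(self.cells)):
            cell = self.cells[self._next % len(self.cells)]
            self._next += 1
            if cell not in self.down:
                return cell
        raise RuntimeError("no healthy cell available")
```

Because no instance holds state, the sketch needs no handover step: marking a cell down is enough for later requests to land on the remaining cells.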

Note: VMware vCloud Director cells are stateless. They can be restarted at any time without the risk of data loss. The only caveat is that current requests are interrupted.

VMware vCloud Director has two types of cells:

Coordinator (or primary) cell
- One cell is designated as the coordinator cell
- The coordinator cell designates which services run on the secondary cells
- The coordinator cell is responsible for ensuring that all required critical services are running on vCloud Director cells by monitoring them.
- The coordinator cell has these responsibilities:
  - Generate a task list of the services that should be running on each cell and allocate the tasks to the respective cells. The cells can then start these services.
  - Monitor the "liveness" of each started service. In case of failure, it restarts the service.
    
    Note: This is typically referred to as the failover service and is done by monitoring the heartbeat entry in the cell table in the database.
  - Detect new cells added to vCloud Director and rebalance the load if applicable.
  - Report a heartbeat so that secondary cells can monitor the coordinator.
Secondary cells
- All cells other than the coordinator cell are secondary cells.
- The secondary cell has these responsibilities:
  - Report a heartbeat by updating a table entry in the database, so that the coordinator cell can determine whether the secondary cell is alive
  - Periodically check whether the coordinator cell is alive
  - Listen for messages submitted by the coordinator cell and perform actions based on them (such as which services to start)
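
The division of work between the coordinator and the secondary cells can be sketched in Python as follows. This is an illustrative model only: the round-robin allocation policy, the `CellTable` class (an in-memory stand-in for the cell table in the database), and the timeout value are assumptions, not the product's actual schema or algorithm.

```python
def allocate(services, cells):
    """Spread services across cells round-robin (assumed policy).

    Returns a task list of the form {cell: [service, ...]}.
    """
    plan = {cell: [] for cell in cells}
    for i, service in enumerate(services):
        plan[cells[i % len(cells)]].append(service)
    return plan


class CellTable:
    """In-memory stand-in for the heartbeat entries in the cell table."""

    def __init__(self):
        self.last_seen = {}  # cell name -> time of last heartbeat (seconds)

    def beat(self, cell, now):
        """Called by a cell to report that it is alive at time `now`."""
        self.last_seen[cell] = now

    def alive(self, now, timeout):
        """Return the cells whose last heartbeat is within `timeout` seconds."""
        return sorted(
            cell for cell, seen in self.last_seen.items() if now - seen <= timeout
        )
```

For example, allocating three services to two cells gives the first cell two services and the second cell one, and a cell that stops calling `beat` drops out of `alive` once its entry is older than the timeout.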

If a process fails on a secondary cell, the watch-dog on that machine restarts the cell. The coordinator cell takes into account the time the watch-dog needs to restart services on the failed secondary cell. If the secondary cell is still deemed to be dead, the coordinator chooses another cell on which to start the service.
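
The failover decision above can be sketched in Python. This is a hedged illustration, not the product's implementation: the function name, the `watchdog_grace` parameter, and the rule of moving a service to the least-loaded live cell are all assumptions. The point it shows is that a cell is declared dead only after the heartbeat timeout plus the allowance for a watch-dog restart has elapsed.

```python
def reassign_dead_cells(plan, last_seen, now, heartbeat_timeout, watchdog_grace):
    """Move services off cells that stayed silent past timeout + grace.

    plan:      {cell: [service, ...]} as currently allocated
    last_seen: {cell: time of last heartbeat}
    Returns a new plan that contains only live cells.
    """
    deadline = heartbeat_timeout + watchdog_grace
    # A cell with no heartbeat entry at all is treated as dead.
    live = [
        cell for cell in plan
        if now - last_seen.get(cell, float("-inf")) <= deadline
    ]
    if not live:
        raise RuntimeError("no live cell left to take over services")

    new_plan = {cell: list(plan[cell]) for cell in live}
    for cell, services in plan.items():
        if cell in new_plan:
            continue  # cell is alive (or still within the watch-dog grace period)
        for service in services:
            # Assumed policy: hand the service to the least-loaded live cell.
            target = min(live, key=lambda c: len(new_plan[c]))
            new_plan[target].append(service)
    return new_plan
```

A cell that has missed the heartbeat timeout but is still inside the grace period keeps its services, which avoids starting a duplicate service while the watch-dog is bringing the cell back.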

A new coordinator cell is elected when the secondary cells detect that the coordinator is dead. All secondary cells monitor the heartbeat of the coordinator. When the coordinator is detected as dead, the secondary cells try to acquire the coordinator lock. The newly elected coordinator then determines whether all required services are running and starts any that are missing.
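
The election can be modeled as a race for a single lock. The Python sketch below is an assumption-laden illustration: `CoordinatorLock`, `elect`, and `missing_services` are hypothetical names, the lock lives in memory rather than in the database, and the dead coordinator's lock is released explicitly, whereas a real system would need its own way to expire it.

```python
import threading


class CoordinatorLock:
    """In-memory stand-in for the coordinator lock (hypothetical)."""

    def __init__(self):
        self._guard = threading.Lock()  # makes check-and-set atomic
        self.holder = None

    def try_acquire(self, cell):
        """Return True only for the first cell that claims the free lock."""
        with self._guard:
            if self.holder is None:
                self.holder = cell
                return True
            return False

    def release(self, cell):
        """Free the lock if `cell` currently holds it."""
        with self._guard:
            if self.holder == cell:
                self.holder = None


def elect(lock, dead_coordinator, secondaries):
    """Every secondary tries the lock; the one that gets it is the winner."""
    lock.release(dead_coordinator)  # assumption: the dead holder's lock is cleared
    winners = [cell for cell in secondaries if lock.try_acquire(cell)]
    return winners[0] if winners else None


def missing_services(required, running):
    """Services the new coordinator still has to start."""
    return sorted(set(required) - set(running))
```

Because `try_acquire` checks and sets the holder under one guard, at most one secondary can win even if several cells detect the failure at the same moment.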