|
Component
|
Errors
|
Possible Reasons
|
Possible Solutions
|
|
PassiveNodeUp
|
Failure
|
Active node cannot communicate with the passive node
Cluster service has stopped or failed on the passive node
Public network or heartbeat network cannot reach the passive node
|
Validate that the cluster service is in a running state in Services MMC
If the cluster services start and stop frequently, please review the event viewer for error logs.
Validate that the network interface cards are connected and available in the cluster and that there is a working connection between the two nodes
For more information see the following link: http://msexchangeteam.com/archive/2008/04/03/448615.aspx
|
|
ClusterNetwork
|
Failure
|
Network connectivity outage
Switch Down
NIC hardware problem
NIC disabled in the OS
|
Check network connectivity between the active and passive nodes
Validate you can communicate with your default gateway through the network switch
Validate there are no hardware error messages in the event viewer
Validate that all network cards used in the cluster are enabled and configured with a static IP address
|
|
QuorumGroup
|
Failure
|
Cluster IP address is in a failed state
IP Address conflict
File Share Witness is in a failed state (see below)
|
If the Cluster IP address fails, the Cluster Network will not be brought online, because it depends on the Cluster IP address (please see the DnsRegistrationStatus item).
If the File Share Witness is in a failed state, please review the FileShareQuorum solutions and workarounds.
|
|
FileShareQuorum
|
Failure
|
Cluster cannot reach the server that hosts the file share witness.
The file share was created with the wrong shared permissions.
The file share was created with the wrong NTFS permissions.
|
Check if the server hosting the file share witness is available and if the folder still exists.
Validate that the cluster's computer accounts have Full Control over the shared folder, in both the share permissions and the NTFS permissions.
The Everyone group must exist and have Read permissions on the shared folder.
For more information on how to create the file share witness, please review the articles below:
http://technet.microsoft.com/en-us/library/bb676490(EXCHG.80).aspx
http://msexchangeteam.com/archive/2008/04/03/448615.aspx
|
|
CmsGroup
|
Failure
|
If any of the Exchange CMS services are in a stopped state, this item will be in a critical state. This includes, for example, the Microsoft Exchange Information Store, the Microsoft Exchange System Attendant, and any of the storage groups and mailbox databases.
|
Check which service is stopped and look in the event viewer for error messages.
If both cluster nodes were shut down unexpectedly, the group may come up in a failed state when they are brought back online. Try to stop one of the servers and start the services on the active node.
If you are performing a failover, this item may be in a critical state for a moment and then return to normal once your Exchange CMS group is back online.
|
|
NodePaused
|
Failure
|
The cluster node was put in a paused state by an administrator.
|
Open Cluster Administrator or the Failover Cluster Management tool and resume the cluster service on the paused node.
|
|
DnsRegistrationStatus
|
Failure
|
Cluster service cannot register the DNS entry into the Active Directory DNS zone.
|
Validate that the DNS service is up and running.
Validate that the DNS servers configured on the public network interface can be reached from every node of the cluster.
Run ipconfig /flushdns && ipconfig /registerdns, then check the event viewer for any DNS-related errors.
Try to ping the cluster name from another server on the network.
Validate that your Active Directory DNS zone supports secure updates and that there are no replication issues in your AD infrastructure.
NOTE: The cluster service uses Kerberos authentication to register its entry in the AD/DNS zone.
|
|
ReplayService
|
Failure
|
Microsoft Exchange Replication service is not running.
|
Try to start the Microsoft Exchange Replication service manually through Services MMC. If the service stops repeatedly, review the event viewer for error messages. Once the service starts, it triggers replication of your storage group copies from the active to the passive node.
|
|
DBMountedFailover
|
Failure
|
The database is not mounted.
|
If the cluster service is up and running and the replication services are started, go to Exchange Management Shell or Cluster Administrative Tool and try to mount the database.
If the mount fails, review the application and system logs for errors. Your database may be in a dirty shutdown state or corrupted. You may want to contact Microsoft PSS or call us for assistance.
|
|
SGCopySuspended
|
Checks whether any storage group copy is in the 'Suspended' state.
|
Storage Group Copy was put in a suspended state manually or automatically.
|
Halting replication stops all propagation of the changes from the active storage group to the copy for the period of the suspension. Should a failover happen during this time, the storage group copy will not have the latest changes. Depending on the volume of changes that has occurred on the active node, the lack of recent updates is likely to prevent the system from mounting the copy on the passive node. Thus, you can either use the available version of the storage group on the passive node or wait until the original server recovers.
Keep the time that replication is halted as short as possible to limit this exposure. Please review this article for more detail on how to handle storage group copies: http://technet.microsoft.com/en-us/library/aa997676(EXCHG.80).aspx
|
|
SGCopyFailed
|
The passive node was shut down, and when it was brought back online the storage group copy failed to initialize.
|
Microsoft Exchange Replication Service has stopped.
|
In this case you should reseed the whole storage group from the active node to the passive node. Use the Update-StorageGroupCopy -Identity <StorageGroupName> -DeleteExistingFiles cmdlet with the right parameters to initiate a full reseed of your storage group.
Please note that this command should be executed from the passive node of your cluster.
For more information, please see the following article: http://technet.microsoft.com/en-us/library/aa998853(EXCHG.80).aspx
|
|
SGInitializing
|
The passive node was shut down, and when it was brought back online the storage group copy failed to initialize.
|
A node of the cluster was shut down and brought back online.
The Microsoft Exchange Information Store was stopped and brought back online.
The Microsoft Exchange Replication service was shut down and brought back online.
|
Checks to see if any storage groups are in the Initializing state.
Verify if another administrator created a new storage group.
Verify if another administrator failed over the cluster with the Loss Less option selected.
If the initialization process does not progress, suspend the storage group copy with the Suspend-StorageGroupCopy -Identity <StorageGroupName> cmdlet and then, on the passive node, perform a full reseed of your storage group copy using Update-StorageGroupCopy -Identity <StorageGroupName> -DeleteExistingFiles.
For more information review the following article: http://technet.microsoft.com/en-us/library/aa998182(EXCHG.80).aspx
|
|
SGCopyQueueLength
|
Failure
|
The copy queue length is above the warning or failure thresholds. There are many items in the queue waiting to be delivered to the passive node.
|
This item monitors how many log files are still waiting to be shipped from the active to the passive node.
If there are many items in the storage group copy queue, review the following possible causes:
Massive log generation caused by a large volume of e-mail being sent and received in the system.
A malware infection inside the network that has spread to the e-mail system.
Inappropriate use of the e-mail system to send massive numbers of messages (e-mail marketing campaigns, internal applications, etc.).
SMTP relay attacks that get past security controls and gain permission to send large amounts of e-mail.
|
|
SGReplayQueueLength
|
Failure
|
The replay queue length is above the warning or failure thresholds. There are many items in the replay queue waiting to be applied on the passive node.
|
Checks to see if any storage group has a replication replay queue length greater than best practice thresholds. Currently, these thresholds are:
Warning: queue length is 30–59 log files.
Failure: queue length is 60 or more log files.
A long replay queue can also indicate that logs are queuing on the active node for delivery to the passive node. Please review the SGCopyQueueLength item for more details.
Restart the Microsoft Exchange Replication service on the cluster nodes and check whether the queue returns to normal.
Review the event viewer for any events on the cluster nodes.
If the queue does not go down, suspend replication and perform a manual reseed of the database so that replication resumes in a healthy state.
|
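The SGReplayQueueLength thresholds listed in the table (warning at 30–59 log files, failure at 60 or more) can be expressed as a small classification function. The following is only an illustrative sketch in Python: the names `QueueStatus`, `WARNING_MIN`, `FAILURE_MIN`, and `classify_replay_queue` are hypothetical and are not part of Exchange or of any monitoring product. It assumes the queue length has already been obtained elsewhere as an integer number of log files.

```python
from enum import Enum


class QueueStatus(Enum):
    """Hypothetical status labels mirroring the table's Warning/Failure wording."""
    OK = "ok"
    WARNING = "warning"
    FAILURE = "failure"


# Thresholds taken from the SGReplayQueueLength row of the table.
WARNING_MIN = 30  # 30-59 log files -> warning
FAILURE_MIN = 60  # 60 or more log files -> failure


def classify_replay_queue(length: int) -> QueueStatus:
    """Map a replay queue length (in log files) to a status."""
    if length < 0:
        raise ValueError("queue length cannot be negative")
    if length >= FAILURE_MIN:
        return QueueStatus.FAILURE
    if length >= WARNING_MIN:
        return QueueStatus.WARNING
    return QueueStatus.OK


if __name__ == "__main__":
    # Example readings; real values would come from your monitoring data.
    for sample in (5, 30, 59, 60, 250):
        print(sample, classify_replay_queue(sample).value)
```

Checking the failure threshold first keeps the two ranges from overlapping, so any value of 60 or more is never reported as a mere warning.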