Wednesday, February 11, 2009

Exchange 2007 SP 1 High Availability and Disaster Recovery Options

With more organizations moving to the new Exchange 2007 platform every day, Exchange administrators now have more high availability and disaster recovery options available to them.

The goal of this article is to clearly define the built-in options available in Exchange 2007. Several configurations exist, and each has its pros and cons.

Single Copy Clusters [SCC];

I suppose the best way to describe an Exchange 2007 Single Copy Cluster (or SCC) is to think about it in the traditional sense of Exchange 2000 or Exchange 2003 clustering (although with knobs on).

Essentially, in production an SCC requires a minimum of two nodes (you can have one, although it defeats the object of clustering), a private link between the nodes (the heartbeat), a public connection to your local LAN, and a shared storage array – the diagram below depicts a very basic SCC cluster configuration:

The traditional idea behind this model is that when the primary node fails for any reason, all of the services that the primary node was responsible for will be passed over to the passive node, and normal operation of the Exchange server will resume.

To all intents and purposes the model above looks exactly like the clustering format that was used by both Exchange 2003 and Exchange 2000 – however in Exchange 2007 Microsoft introduced the following improvements:

  • In Exchange 2003, once you had configured your Windows cluster you would have to install and configure the clustered MSDTC, install the Exchange 2003 binaries on the first node, and then manually use the Windows Cluster Administrator to create the Exchange Virtual Server (EVS) IP address and Network Name, allocate storage and create the Exchange resources (MSExchangeSA). In Exchange 2007 SCC clusters you still need to have created an MSDTC resource, but the rest of the process is fully automated.
  • In Exchange 2003 the management of the Exchange Virtual Server (for example starting and stopping services) was accomplished via the Windows Cluster Administrator. In Exchange 2007 you can accomplish all of these tasks via the EMS (Exchange Management Shell), and in Exchange 2007 SP1 the Exchange Management Console (EMC) provides this functionality as well – cluster and application administration all in one place!
  • Again, in Exchange 2003, when you had finally got your Exchange EVS up and running you would still have a number of little things that you needed to tweak. In Exchange 2007 all of this has been done for you (an example would be memory configuration – remember those “interesting” boot.ini and registry tweaks: stand on one leg, recite the Pledge of Allegiance, face north...).

Some concept changes:

In Exchange 2003 the common term for a clustered Exchange server was “Exchange Virtual Server” (EVS). In Exchange 2007 the term is replaced with “Clustered Mailbox Server”, the reason being that Exchange 2007 clusters do not support roles such as CAS, Hub Transport or Unified Messaging – they are purely mailbox servers – whereas in Exchange 2003 your Exchange EVS would also support direct MAPI, OWA and SMTP.

Each node in the cluster can be in a position to take control of the “Clustered Mailbox Server”, but as in Exchange 2003 the nodes still have and retain their own network identity. In essence each node has its own NetBIOS name and IP address, but can also take over and support the Clustered Mailbox Server in the event of a fail-over (whether this is manual or the result of a hardware issue).

Another welcome change is that the concept of Active / Active clusters has been abandoned for all forms of clustering in Exchange 2007: you can no longer have an Active / Active SCC cluster (or CCR for that matter). There are many reasons for this, but essentially it boils down to scalability and performance. Exchange 2003 A/A clusters did not scale much beyond 1900 users, and could end up performing like a dog should one node fail. As Exchange 2007 is 64 bit (for production), you can pile power into your Primary and Passive nodes (for example one of my Primary Cluster nodes has 24GB of RAM and 8 processors), which makes the concept of “Load Balanced” fail-over in Active / Active redundant.

Pros and Cons of SCC;

As in all scenarios there are pros and cons to any configuration – the following are the arguments for and against SCC clustering in Exchange 2007:

Pros:

  • It’s a familiar clustering model for those that have set up and configured Exchange 2000 and 2003 clusters
  • Providing that the hardware is certified (to the MS HCL) it is a pretty simple type of clustering to set up and configure
  • Provides a reasonable amount of fault tolerance from a node perspective
  • Good option for larger companies that are limited on sites – but have the money to invest in a locally fault tolerant solution

Cons:

  • Is typically expensive to set up – this is mainly down to the fact that shared storage is required between both the nodes. This is usually SAN (FC-AL) based, but in a number of installations is SCSI – generally speaking you will require a significant hardware overhead to accommodate SCC
  • The shared storage is a single point of failure – lose the shared disk array = lose the cluster – unless you are employing some form of replication software across sites (more expense, and if you are, you need to consider CCR)
  • Due to the shared storage requirement of SCC both your cluster nodes need to be in the same location
  • Requires a very specific hardware configuration to run on
  • Requires Enterprise Versions of Exchange and Windows

However, with the above said, a number of us who are planning a move to Exchange 2007 will currently have the equivalent of SCC clusters in our Exchange 2003 installations – what do we do about the investment that we have made here?

Well, from my perspective, I plan to relocate a node from my existing Exchange 2003 environment to a remote site and take the shared storage with it. I will then plumb the remaining node at my home site into another SAN that we have, and configure a CCR cluster between the sites (although I am aware that this oversimplifies the process). In order to do this, however, I will require a server that can (temporarily) take the load of at least one of my clusters (please see the “Bunny Hop” method).

Cluster Continuous Replication  [CCR];

Wow – what an idea, what an implementation! Where has it been all my life? (Yes, I am drooling, and yes, it is sad.)

CCR makes use of a type of Windows Clustering called MNS (Majority Node Set), which is combined with a new Exchange 2007 technology called “Log Shipping” – there will be more on that later.

Some of you may not have heard of the “Majority Node Set” idea – if you would like further information on this type of Windows clustering please have a read of the article:

http://technet.microsoft.com/en-us/library/cc784005.aspx

How does it work?

Firstly, before we go into the detail of how it works, let’s have a quick look at the minimum requirements to implement CCR clustering:

Two cluster nodes which roughly meet the following criteria:

  • Exist in the same subnet (unless you are running Exchange 2007 SP1 and Windows 2008)
  • Have enough storage, whether based on DAS, iSCSI or SAN. It is sensible to ensure that each node’s storage is a match from a capacity perspective – remember, each node in this type of clustering uses its own storage to function, not the shared array principle that we have seen used in Exchange 2000 / 2003 and Exchange 2007 SCC
  • A third server which can perform the role of the File Share Witness (or FSW) – this is normally hosted on an Exchange 2007 Hub Transport server, but any Windows server that can host a file share will do.
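The reason for the third server becomes clearer if you think of the cluster in terms of votes. The following is a minimal sketch in Python, assuming that each node and the File Share Witness carries one vote and that the cluster only stays up while a strict majority of votes is reachable. The function and variable names are purely illustrative and are not part of Windows or Exchange:

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Return True when a strict majority of the votes is reachable."""
    return votes_online > total_votes // 2


# Assumed layout: two CCR nodes plus one File Share Witness = 3 votes.
TOTAL_VOTES = 3

# One node lost, the survivor can still reach the witness: 2 of 3 votes.
survivor_with_witness = has_quorum(2, TOTAL_VOTES)

# One node lost and the witness unreachable as well: 1 of 3 votes.
survivor_alone = has_quorum(1, TOTAL_VOTES)

# Without a witness there are only 2 votes, so a lone node never has a majority.
two_node_no_witness = has_quorum(1, 2)
```

In this simplified model a surviving node keeps the cluster running only when it can still see the witness, which is why the witness is best placed somewhere that does not share a failure domain with either node.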

The idea behind CCR is that there are two copies of the Exchange database, one active (which resides on the storage of the primary node) and one passive (which resides on the storage of the passive node). Transaction logs from the active database are asynchronously “shipped” to the passive node’s database and then replayed to give you a fairly current copy of the data – more on this a bit later.

The process of shipping can occur over a WAN link to a separate data centre (as long as it exists in the same subnet as the active node; NOTE: this requirement changes with Exchange 2007 SP1 and Windows Server 2008), as log files are around 1024 KB in size. Alternatively, you can have a node in the same data centre or building without the restriction of having to be in the same rack or room, as the shared storage aspect is eliminated.

When you initially install the passive node in a CCR cluster, each storage group and its associated database are copied from the primary to the passive node (this is called seeding). From there on in, log files are shipped to the passive node and replayed on a constant basis.

Logs are shipped from the primary node to the passive node only once they are “closed”, which means that the passive node does not always have a copy of every single log from the primary node, and the database on the passive node might not be totally up to date. However, this can be rectified once you have resolved the issues with the primary node and then performed a fail-back.

There is an exception to the situation where your database is not completely up to date, which is when the Exchange administrator issues the Move-ClusteredMailboxServer command from the EMS (Exchange Management Shell). This would normally be done when maintenance is required on the primary node, and a log sync is performed between the nodes when this command is run.
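Because each log is roughly 1024 KB, a rough feel for the WAN link that log shipping needs comes down to simple arithmetic. The sketch below is in Python and assumes a log generation rate that you have measured yourself (the 1800 logs per hour figure is hypothetical). It ignores protocol overhead and bursts, so treat the result as a floor rather than a sizing recommendation:

```python
def required_mbps(logs_per_hour: float, log_size_kb: int = 1024) -> float:
    """Average link speed (megabits per second) needed to keep up with log generation."""
    bits_per_hour = logs_per_hour * log_size_kb * 1024 * 8
    return bits_per_hour / 3600 / 1_000_000


# Hypothetical busy-hour figure: 1800 logs generated per hour.
busy_hour_mbps = required_mbps(1800)
print(f"{busy_hour_mbps:.2f} Mbit/s sustained")
```

With these made-up numbers the link needs to sustain a little over 4 Mbit/s just to avoid falling behind.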
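To see why the passive copy can trail the active one, here is a toy model in Python of “only closed logs are shipped”. It assumes a log closes after a fixed number of records (real logs close at roughly 1 MB); the class and function names are illustrative only and do not correspond to anything in Exchange:

```python
from dataclasses import dataclass, field


@dataclass
class ActiveCopy:
    """Toy model of the active storage group: closed logs plus one open log."""
    closed_logs: list = field(default_factory=list)
    open_log: list = field(default_factory=list)
    records_per_log: int = 3  # hypothetical; real logs close at about 1 MB

    def write(self, record: str) -> None:
        self.open_log.append(record)
        if len(self.open_log) == self.records_per_log:
            # The log is full, so it is closed and becomes eligible for shipping.
            self.closed_logs.append(self.open_log)
            self.open_log = []


def ship_closed_logs(active: ActiveCopy, passive_logs: list) -> None:
    """Copy any closed logs that the passive side has not yet received."""
    passive_logs.extend(active.closed_logs[len(passive_logs):])


active = ActiveCopy()
passive_logs = []
for n in range(7):
    active.write(f"txn-{n}")
ship_closed_logs(active, passive_logs)

replicated = [record for log in passive_logs for record in log]
at_risk = active.open_log  # never shipped if the active node dies right now
```

After seven writes, two logs have closed and been shipped, while the seventh record is still sitting in the open log on the active node. That last record is the kind of data that is at risk in an unplanned fail-over.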

A diagram is provided below which depicts a simplified version of how a CCR cluster can be configured over three sites (two nodes at two separate sites and a third site for the file share witness):

Of course the above diagram does not take into account other roles (such as CAS and Hub) within the respective sites (A) and (B); I will be looking at this in a separate article. For information, in the example given above you would require a CAS role in sites (A) and (B) to maintain client connectivity to your Exchange environment should site (A) go down (you could also have both CAS servers running the Hub role).

One key thing to bear in mind with CCR clustering is that unless you are using Windows 2008 and Exchange 2007 SP1 each cluster node needs to be in the same IP subnet – therefore unless you are using some fancy routing between sites you cannot place nodes in disparate IP ranges.

Unless you are planning to use Windows 2008, this might limit the initial attractiveness of CCR. From a personal point of view, however, I would seriously consider Windows 2008 as your Exchange platform of choice, especially if you are working on a “Green Field” build of Exchange.

If you do implement Exchange 2007 SP1 on Windows 2008 you can gain the benefit of having your cluster nodes dispersed over diverse subnets spanning separate sites or perhaps countries (if you have the bandwidth). 

Pros and Cons of CCR;

Pros:

  • When using a multi site scenario it represents an excellent fault tolerant, and high availability solution with DR and Business Continuity
  • Doesn’t specifically require a special hardware configuration
  • Not tied to close proximity based clustering
  • Doesn’t require third party replication tools
  • Ideal for larger Exchange Organisations with multiple sites where you could locate an additional Exchange installation
  • Major benefits realised when using Windows 2008

Cons:

  • Not as simple to configure as Traditional Clustering
  • Works best with multiple sites (from a DR and BC perspective)
  • Requires the Enterprise version of Exchange and Windows 2003
  • Can only contain one CCR enabled database per storage group
  • Major benefits are only realised when using Windows 2008

Understanding the Transport Dumpster:

The Transport Dumpster is a feature found on Exchange 2007 Hub Transport servers. Essentially its main task is to manage the delivery of messages in a Hub Transport queue which are destined for mailboxes on a CCR mailbox server (e.g. to make sure that they do not get deleted).

As explained above, the replication between an active CCR database and a passive CCR database is asynchronous, which means that the passive database is always slightly out of date (unless you have run the Move-ClusteredMailboxServer cmdlet). Therefore when a failure occurs on the active node, it is a fair bet that the most recent logs will not have been shipped over to the passive CCR node – this can result in missing mail items.

The Transport dumpster is used in this scenario – essentially when a CCR fail-over occurs, the Hub Transport is asked to re-deliver lost mail. Bear in mind that this process is for mail that has to all intents and purposes already been delivered – and transient mail is held in local submission until the store comes back online.
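The redelivery idea can be sketched in a few lines of Python. This is a conceptual model only: it assumes the Hub Transport keeps a timestamped copy of recently delivered mail and that the mailbox store drops anything it already holds. The function name, the data layout and the timestamps are all hypothetical:

```python
def redeliver_after_failover(dumpster, passive_mailbox, last_replicated_at):
    """Resubmit dumpster mail that is newer than the last replicated log.

    dumpster: list of (delivered_at, message_id) kept on the Hub Transport.
    passive_mailbox: set of message ids already in the newly active database.
    Returns the ids that were actually restored.
    """
    restored = []
    for delivered_at, message_id in dumpster:
        if delivered_at <= last_replicated_at:
            continue  # already covered by logs that were shipped in time
        if message_id in passive_mailbox:
            continue  # duplicate, silently dropped
        passive_mailbox.add(message_id)
        restored.append(message_id)
    return restored


# Hypothetical timeline, in seconds since some arbitrary start.
dumpster = [(100, "msg-a"), (160, "msg-b"), (175, "msg-c")]
mailbox_on_passive = {"msg-a"}  # made it across before the failure
restored = redeliver_after_failover(dumpster, mailbox_on_passive, 150)
```

Here the last log to reach the passive node was closed at second 150, so the two messages delivered after that point are pulled back out of the dumpster and the mailbox ends up complete again.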

The Transport Dumpster is configured to work with CCR and LCR (from Service Pack 1) enabled mailbox servers.

Local Continuous Replication [LCR]:

OK, the best way to consider this is as the same idea as CCR clustering, except that it happens within a single server (well, OK, that’s not a cluster, but the shipping and replay technology is similar, only it occurs at a disk and controller level).

CCR clustering is often referred to as resilience at a site level, whereas LCR can be considered resilience at a server level.

What do you need for LCR?

In order to make use of LCR your server should meet the following requirements:

  • A Server capable of running x64 Exchange 2007
  • The server should have two independent RAID controllers (you can configure it using a single controller, but if you lose that controller you will not be able to get at the replayed data).
  • Separate storage per RAID controller. For example, on the primary RAID controller you might have a single Exchange database sitting on a RAID 5 array and all of your logs sitting on a mirror; these will (and should) be separate disks, and this configuration should be replicated on your passive RAID controller.

The following is a simplified diagram which depicts LCR operation – the orange areas of the diagram represent separate disks attached to separate controllers on a single server:

During normal operation with LCR, the active database’s logs are shipped to and then replayed into the passive database. In the event of a fault on either the primary RAID controller or the primary disk array, you can manually “activate” the passive copy of the Exchange data. Activation can be accomplished via one of the following means:

  • Changing the Active Storage group and database paths via the EMS (Restore-StorageGroupCopy) or EMC (Restore-StorageGroup task)
  • Via the Operating System (reconfiguring Disk mount points / drive paths)

Pros and Cons of LCR;

Pros:

  • Great solution for smaller firms that have the money to invest in a single well spec’ed Exchange server
  • Only requires the Standard Edition of Windows and Exchange
  • For smaller enterprises it represents a good level of fault tolerance within a single box
  • Easy to set up

Cons:

  • Not really suitable for larger organisations where mail is critical
  • Does require a server that can handle enough disks and two RAID controllers for it to really be effective (this could put it out of SME’s price range)
  • Can only contain one Database per LCR enabled storage group

Standby Continuous Replication [SCR] – Service Pack 1 for Exchange 2007:

SCR is a feature that is introduced in Service Pack 1 for Exchange 2007.

Essentially SCR allows an Exchange database to be replicated to a target elsewhere (a different data centre or Exchange server) on a per storage group basis. One of the great things about SCR (and a key difference from LCR and CCR) is that you can replicate your data to multiple targets and multiple target types – for example:

Your source Exchange server can ship its data to an offline standby server in a geographically dispersed data centre, whilst also shipping data to a specific storage group on an active Exchange cluster within your main building.

The following diagram depicts a basic SCR scenario:

In the example given above, we have site A which is replicating its data to Sites (B) and (C) – site B contains a clustered Exchange SCC instance and site C contains a standby basic instance of Exchange. It is a wonderful “belt and braces” scenario and can be further adapted.

It should be noted that the target in site (B) is the PASSIVE node of a fail-over cluster (SCC).

As you can see SCR has great potential as an additional line of defence from losing your data, however there are some things worthy of note about this configuration:

  • The database and log paths MUST be the same on the source and target servers
  • A target standby server must not have LCR enabled for any storage group contained on it
  • A target must have the Exchange 2007 mailbox role installed (even if it is not hosting any mailboxes)
  • SCR can be administratively delayed

Again, as with LCR, the process of switching (or activating) between active and passive copies of your database is a manual operation.
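The “administratively delayed” point is easiest to picture as logs that have been copied to the target but are held back from replay until they reach a certain age. The following Python sketch assumes a 24 hour delay; the function name and the log file names are illustrative only:

```python
from datetime import datetime, timedelta


def logs_ready_for_replay(copied_logs, now, replay_lag):
    """Return the names of copied logs that are old enough to replay on the target."""
    return [name for name, copied_at in copied_logs if now - copied_at >= replay_lag]


now = datetime(2009, 2, 11, 12, 0)
copied_logs = [
    ("E0000000001.log", now - timedelta(hours=30)),
    ("E0000000002.log", now - timedelta(hours=25)),
    ("E0000000003.log", now - timedelta(hours=2)),
]
ready = logs_ready_for_replay(copied_logs, now, replay_lag=timedelta(hours=24))
```

With a 24 hour lag the two older logs are replayed and the newest one waits, which gives you a window in which a problem on the source has not yet been replayed into the standby copy.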

Pros and Cons of SCR;

Pros:

  • Highly resilient and allows for multiple targets for your data
  • Requires only the Standard Editions of Windows and Exchange
  • Works for Enterprises of all sizes
  • Allows for a built in delay in replication

Cons:

  • Can only be managed from the command shell (this means that it could be tricky to set up and manage)
  • One database per storage group

We hope that you have enjoyed this quick run-through of the options available for high availability in Exchange 2007.
