Tuesday, January 19, 2010

Understanding Exchange 2010 Storage Architecture: Part 2

By Mahmoud Magdy

In Part 1 of our series on the Exchange 2010 storage architecture, we went back to the basics by reviewing Microsoft’s ESE (Extensible Storage Engine), then moved on to discuss the new enhancements that further reduce IOPS (input/output operations per second).

In Part 2, we will continue our journey through the Exchange 2010 storage enhancements by exploring the concepts of logical and physical changes to the Microsoft ESE database. But first I would like to revisit a few important topics that deserve elaboration--namely, the SIS (Single Instance Storage) removal and the Lazy View Updates.


SIS (Single Instance Storage) Removal:

SIS, or single instance storage, was introduced to the Exchange server product suite in Version 4.0 and remained there until the release of Exchange 2007 (Version 12). The role of SIS was to store a single copy of an email or attachment in a Mailbox database, so that every recipient within that database who received the message accessed the same single instance. The greatest asset of SIS was its ability to prevent attachments from being duplicated, which yielded substantial disk space savings.

SIS in Action:

Consider the following example:

When User A sends a message with a 1 MB attachment to a DL (Distribution List) or a group of 100 users, SIS steps in and delivers only 1 copy of the attachment to the mailbox store on which this particular group of users is located. Thus, instead of User A forcing that database to store all 100 MB, or 100 copies of the attachment, he or she saves approximately 99 MB of space on the Mailbox store.
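The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name `stored_mb` and the four-database split are assumptions made here, not anything taken from Exchange itself. It also shows why SIS only helps within a single database, since each database that holds a recipient must keep its own copy.

```python
def stored_mb(attachment_mb, recipients_per_db, sis=True):
    """Return the MB stored across all databases for one attachment.

    recipients_per_db lists how many recipients live in each database.
    With SIS, each database that has at least one recipient keeps one copy;
    without SIS, every recipient gets a private copy.
    """
    total = 0
    for recipients in recipients_per_db:
        if recipients == 0:
            continue  # nothing is delivered to this database
        total += attachment_mb if sis else attachment_mb * recipients
    return total


# 100 recipients in one database, 1 MB attachment (the example above).
saved_single_db = stored_mb(1, [100], sis=False) - stored_mb(1, [100], sis=True)
print(saved_single_db)  # 99

# The same 100 recipients spread over four databases (hypothetical split).
split = [25, 25, 25, 25]
saved_four_dbs = stored_mb(1, split, sis=False) - stored_mb(1, split, sis=True)
print(saved_four_dbs)  # 96
```

The more databases the recipients are spread across, the less SIS saves, which is the trend described in the next paragraph.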

Many people were concerned when they heard SIS was being removed from Exchange Server 2007, but Microsoft had its reasons. In 1996, when Exchange 4.0 was released, disks were physically bigger, slower, and more expensive than today's storage. Since SIS is only effective within a single database, it was the perfect solution for reducing the size of mailbox stores at a time when many companies had only one database. The trend in storage architecture shifted as disks became more compact, higher in capacity, faster, and cheaper, meaning that most companies now have multiple databases and store more users on fewer disks.

As disk storage became less expensive and the database engine itself evolved from the mid 1990s through the turn of the century, Microsoft acknowledged that SIS no longer delivered the benefits it once did. In fact, studies have indicated that the 20% database reduction savings were never fully realized, and that the more accurate figure was closer to 10% and in some cases as low as 5%. If you recall from Part 1 of our series, Microsoft decided to make a dramatic change to the ESE, but in order to do so they had to make a choice: keep SIS or provide better performance? Providing better performance meant increasing the IO size to 32 KB, so that the ESE performs larger IOs and reads and writes less frequently. Incorporating these changes for the sake of better performance required bidding SIS farewell.

After implementing these changes, however, Microsoft found that space hints and the new B+ tree architecture added approximately 20% to the size of the Exchange 2010 database, so Microsoft introduced a new feature called Database Compression, or LV (Long Value) Compression.

Before we dive into Long Value Compression, let’s first answer the question: what is a long value (LV)? As many of you know, in Exchange 2010 the page size was increased to 32 KB, and to understand why, you must first understand the basics of how data is stored in Exchange databases. In Exchange, all data stored in databases is held in B+ trees, which are further divided into pages. The page is the unit used for caching in the database, and it is the minimum amount of data that can be read from or written to the database. Since operating on data in memory is much faster than reading it directly from the disk, increasing the page size to 32 KB allowed the ESE to reduce IOPS. The result is improved performance, since each larger page that is read is cached in memory.
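To make the effect of the page size concrete, here is a minimal sketch that counts how many pages (and therefore, under a simplifying assumption, how many page reads) an item of a given size needs. It assumes one page read equals one IO and ignores page headers and per-record overhead; the 20 KB message size and the helper name `pages_needed` are hypothetical.

```python
import math

KB = 1024


def pages_needed(item_bytes: int, page_bytes: int) -> int:
    """Number of whole pages required to hold an item of item_bytes."""
    return max(1, math.ceil(item_bytes / page_bytes))


message = 20 * KB  # hypothetical message size

print(pages_needed(message, 4 * KB))   # 5  (Exchange 2003 page size)
print(pages_needed(message, 8 * KB))   # 3  (Exchange 2007 page size)
print(pages_needed(message, 32 * KB))  # 1  (Exchange 2010 page size)
```

With the larger page, the same message is fetched with a single page read instead of several smaller ones.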

Now back to the explanation of Long Values. Since the page size in Exchange 2010 is 32 KB, emails larger than this value end up consuming extra pages and space within the database. LV Compression is the solution to this problem: such emails are stored in a separate table and are then compressed to save space.
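The following sketch illustrates why compressing text and HTML bodies can recover a meaningful amount of space. It uses Python's `zlib` purely as a stand-in: this article does not say which algorithm Exchange uses, and the sample HTML body is invented for the illustration.

```python
import zlib


def compression_ratio(body: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(body)) / len(body)


# Hypothetical HTML body, larger than one 32 KB page.
# HTML is repetitive, so it tends to compress well.
html_body = (
    b"<html><body>"
    + b"<p>Quarterly status update for the whole team.</p>" * 1000
    + b"</body></html>"
)

print(len(html_body) > 32 * 1024)        # True
print(compression_ratio(html_body) < 1)  # True

# Compression must be lossless: the original body is fully recoverable.
assert zlib.decompress(zlib.compress(html_body)) == html_body
```

Real message bodies are far less repetitive than this sample, so actual savings would be smaller than what this toy input shows.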







The above figure illustrates the database file analysis and comparison between E12 and E14. E12 wins in the analysis for RTF messages; however, as you all know, most emails are text or HTML-based, so the LV compression technique yields a better overall space saving. Even with the removal of SIS, the Exchange 2010 database file is about 12% smaller than the E12 database.



Lazy View Update:
Another dramatic change to the ESE brought about by Exchange 2010 is the Lazy View Update. To examine this in further detail, let’s consider the following example:


In E12, if a User (who is using OWA, i.e. Outlook Web Access) has 5 views in their inbox, then the next time the User gets an email, Exchange instantly updates all 5 views. While this improved the end-user experience, it forced Exchange to do 2 things:
1. Perform unnecessary IOs. (For example, the user might be out of the office, or the email might have been received in the middle of the night, forcing Exchange to pay for IOs that are not necessary.)
2. Since the update is done per email, Exchange had to issue an excessive number of small IOs to update the views.



Microsoft has solved this problem with the introduction of Lazy View updates. Going back to our example, if the above User is using OWA or Outlook Online, the view will not be updated until that User opens the view. Although this might be slower on the backend than in previous versions, the larger and now sequential IOs that are performed prevent the User from noticing any performance impacts during viewing or opening the views.
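The difference between the two behaviors can be sketched with a toy model. This is not how the Exchange store is implemented; the `Folder` class, the per-view pending queue, and the `view_writes` counter are all assumptions introduced here to count how many separate view updates each strategy performs.

```python
class Folder:
    """Toy model of a folder with several views (hypothetical, not Exchange code)."""

    def __init__(self, view_names, lazy):
        self.lazy = lazy
        self.views = {name: [] for name in view_names}
        self.pending = {name: [] for name in view_names}
        self.view_writes = 0  # number of separate view update operations

    def deliver(self, message):
        for name in self.views:
            if self.lazy:
                # Lazy: just remember that the view is stale.
                self.pending[name].append(message)
            else:
                # Eager (E12 style): update every view on every delivery.
                self.views[name].append(message)
                self.view_writes += 1

    def open_view(self, name):
        if self.pending[name]:
            # Lazy: apply all outstanding changes in one larger batch.
            self.views[name].extend(self.pending[name])
            self.pending[name].clear()
            self.view_writes += 1
        return self.views[name]


view_names = ["v1", "v2", "v3", "v4", "v5"]
eager = Folder(view_names, lazy=False)
lazy = Folder(view_names, lazy=True)

for i in range(10):  # ten messages arrive overnight
    eager.deliver(f"msg{i}")
    lazy.deliver(f"msg{i}")

lazy.open_view("v1")  # the user opens a single view in the morning

print(eager.view_writes)  # 50
print(lazy.view_writes)   # 1
```

In this model the lazy folder does one batched update for the one view the user actually opens, while the eager folder pays for fifty small updates whether or not anyone looks at the views.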





ESE Logical Contiguity:


Microsoft has made dramatic changes to the ESE storage in order to allow better IO utilization using sequential IO; a single hard disk cannot exceed 200 random IOs per second, while a regular SATA disk can easily do 300+ sequential IOs per second.

Now, to better reflect the changes in the ESE architecture, try to envision the following scenario in your head. (I recommend this approach as it has greatly helped me during my own Exchange sessions.)

Imagine that you are looking at the ESE database through two transparent films: one is a logical film and one is a physical film.

The logical film is how data is structured in the ESE database, and includes tables, indexes, LV (Long Value) tables, etc. Once data is located on the logical film, you must find its reflection on the physical film, that is, its physical location within the ESE database. (Remember that this is where the pages, which are written directly to the hard disk, are stored inside the ESE database file.)




In Part 1 of this series, we introduced the concept of logical contiguity. Let us complete our exploration of this topic by looking at the following diagram:




Microsoft has changed the table architecture in the mailbox store from a table per database to a table per mailbox. This allows fewer yet larger size sequential IOs to be committed against the ESE database, and thus optimizes the IO operations at the logical layer.

SIS removal, table architecture change, LV Compressions and Lazy View Updates are all fundamental components of the logical architecture changes to the ESE engine.

ESE Physical Contiguity:


Now that we have explored logical contiguity, let us take a look at the physical structure inside the ESE Database. Recall from Part 1 that the ESE data is stored based on the B+ tree model, which consists of properties which are stored in records which are in turn placed in a node that is stored in a page.

In the previous versions of Exchange (E12 and below), data was stored inside the database in a random manner, which was the reason for placing logs on separate disks or spindles, apart from the database files. This was done because the logs were written with sequential IO while the Exchange database was written with random IO.

This behavior negatively impacted the Exchange storage design and performance, and over time the database became fragmented and offline defragmentation of the database was necessary. In order to improve this behavior, Microsoft has changed the ESE writing behavior so that it stores the ESE pages in a contiguous manner.

To understand it better, one must visualize the design. Take a look at the following diagram:


The above diagram compares the B+ tree in the previous version of Exchange to the current Exchange 2010 version. As you can see, in Exchange 2007 pages are committed to the database in a random manner, causing the database to become fragmented over time and forcing Exchange to commit IOs in small random orders.

In Exchange 2010, the B+ tree design has been modified: pages are now stored in a contiguous manner where they are written and read in a sequential manner, thus improving the physical contiguity of the ESE file.
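One way to picture the benefit is to count how many contiguous runs of pages a read has to touch, on the simplifying assumption that each contiguous run can be served by one sequential IO. The page numbers and the helper name `count_io_runs` below are made up for the illustration.

```python
def count_io_runs(page_numbers):
    """Count contiguous runs of page numbers in the order they are read.

    Under the simplifying assumption used here, each run of consecutive
    pages can be fetched with a single sequential IO.
    """
    runs = 0
    previous = None
    for page in page_numbers:
        if previous is None or page != previous + 1:
            runs += 1  # a jump on disk starts a new IO
        previous = page
    return runs


scattered = [17, 3, 42, 8, 29, 11]     # pages placed at random (older layout)
contiguous = [10, 11, 12, 13, 14, 15]  # pages placed side by side (Exchange 2010)

print(count_io_runs(scattered))   # 6
print(count_io_runs(contiguous))  # 1
```

The same six pages cost six separate IOs when scattered and one IO when they sit next to each other.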

There remain some missing pieces to the puzzle. For instance, what happens if a read/write IO has to be committed and it cannot be done sequentially? This mystery, along with others, will be discussed in Part 3 of this series.








Tuesday, January 5, 2010

Understanding Exchange 2010 Storage Architecture: Part 1

By Mahmoud Magdy

In this article, we will take a close look at the Exchange 2010 storage architecture. First we will go back to the basics by reviewing the ESE storage engine, and then we will delve into the new enhancements that were introduced with Exchange 2010. Microsoft’s Extensible Storage Engine (ESE) is an ISAM (Indexed Sequential Access Method) data storage technology. Its purpose is to allow applications to store and retrieve data via indexed and sequential access. The ESE is suitable for server applications since it handles highly concurrent transactions, yet it is lightweight enough to work well for auxiliary applications. Worried about losing stored data in the event of a system crash? The ESE provides transacted data update and retrieval, meaning that its crash recovery mechanism maintains data consistency should your system crash.


As you all know, ESE relies on the B+ tree in order to store data. The following diagram features a simple tree that illustrates how information is stored in the data tree:

Since sorting and searching through mounds of data is time-consuming, ESE stores data in trees in order to optimize their sorting and searching behavior. In addition, the regular tree model has been updated using the B+ tree to allow for faster, more efficient sorting of data.

There are 2 types of data sorting: internal and external. Internal data sorting means that the system can store and sort the data in memory. However, since the data does not always fit in memory, the system is forced to store data on the disk, and this is where the B+ Tree comes in.


Data in the ESE is stored based on the following hierarchy:

  • A property is created, generated and placed in a table record. Keep in mind that MAPI uses properties in order to define data and their structure at the lowest level.


  • Multiple properties are placed in a record.


  • The record is stored on a node, and a corresponding key is used both to index the record and to access it quickly. One thing to remember is that the leaf nodes (the end nodes) are logically linked together to allow horizontal crawling and movement of data within the B+ Tree.


  • A record is placed into lines which are then stored on a page, with the page being the smallest unit of the database that is read from or written to the hard disk. Page sizes in previous versions of Exchange: In Exchange 2003 the page size was 4 KB. That number doubled to 8 KB in Exchange 2007, and then quadrupled to 32 KB in Exchange 2010.
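The linked leaf nodes mentioned in the hierarchy above are what make range scans cheap: once the first leaf is found, the scan simply walks sideways. The sketch below models only the leaf level of a B+ tree with a `next` pointer. It is a simplified illustration with invented names (`Leaf`, `range_scan`), not the ESE's actual on-disk structure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Leaf:
    """One leaf node: sorted keys, their records, and a link to the next leaf."""
    keys: List[int]
    records: List[str]
    next: Optional["Leaf"] = None


def range_scan(start_leaf: Leaf, low: int, high: int) -> List[Tuple[int, str]]:
    """Collect all (key, record) pairs with low <= key <= high.

    Walks horizontally through the linked leaves instead of descending
    from the root again for every key.
    """
    found = []
    leaf = start_leaf
    while leaf is not None:
        for key, record in zip(leaf.keys, leaf.records):
            if key > high:
                return found  # keys are sorted, so nothing further can match
            if key >= low:
                found.append((key, record))
        leaf = leaf.next
    return found


# Three linked leaves holding six records.
third = Leaf([5, 6], ["e", "f"])
second = Leaf([3, 4], ["c", "d"], next=third)
first = Leaf([1, 2], ["a", "b"], next=second)

print(range_scan(first, 2, 5))  # [(2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

In a full B+ tree the starting leaf would be located by descending from the root using the keys; that part is omitted here to keep the sketch short.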

How did Microsoft improve the storage engine in Exchange 2010?

Exchange 2007 introduced significant enhancements to storage usage and optimization, but Microsoft wanted to improve on them further with the release of Exchange 2010. While doing preliminary research to determine which areas of storage use and optimization most needed attention, Microsoft found that enterprises suffer from several challenges with the current storage technologies, including but not limited to:

  • Random IO and disk limits: The current technologies provide limited random IO throughput; however, most of the current systems can perform several hundred sequential IO requests.


  • Storage Design flexibility: As email communication increases, enterprises are continually demanding improved and flexible options for storing users’ growing amounts of data.


  • Using SATA Disks and JBOD technologies: Enterprises were constrained by the capacity limits of SAS/SCSI disks, whereas 2 TB SATA disks are now available (provided that Exchange can work within the more limited throughput of a SATA disk).

Task 1: change the ESE storage scheme:





In previous versions of Exchange, as illustrated in the first diagram, there were multiple tables per database that contained the users’ data. In figure 2 (and in Exchange 2007) there were multiple tables (for example: mailbox table, folders table, messages table, etc) per mailbox database. Thus, in order to open a user’s mailbox, Exchange required multiple small IOs to be performed.

In Exchange 2010, Microsoft moved to a table per mailbox, making it faster and easier to open a user’s mailbox. Opening a mailbox and reading specific email messages stored inside now requires fewer, larger IOs. This is because the underlying architecture of the storage design was modified in Exchange 2010 in order to reduce IOPS (input/output operations per second). Microsoft dramatically reduced IOPS with Exchange 2010: a full 70% reduction compared with Exchange 2007 and a 90% reduction compared with Exchange 2003.
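As a quick sanity check on these two percentages, the following sketch works out what they imply about Exchange 2007 relative to Exchange 2003. It assumes that both figures refer to the same per-mailbox workload, which the figures quoted here do not state explicitly; the helper name `remaining_fraction` is made up.

```python
def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the original IOPS left after a percentage reduction."""
    return 1 - reduction_percent / 100


e2010_vs_2003 = remaining_fraction(90)  # Exchange 2010 needs ~10% of 2003's IOPS
e2010_vs_2007 = remaining_fraction(70)  # Exchange 2010 needs ~30% of 2007's IOPS

# If both figures describe the same workload, then
# IOPS(2007) / IOPS(2003) = e2010_vs_2003 / e2010_vs_2007.
e2007_vs_2003 = e2010_vs_2003 / e2010_vs_2007

print(f"{e2007_vs_2003:.2f}")  # 0.33
```

Under that assumption, Exchange 2007 itself needed roughly a third of the IOPS of Exchange 2003, and Exchange 2010 cut that remaining third by a further 70%.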

In addition to the aforementioned features introduced in Exchange 2010, other enhancements have also been made to further reduce IOPS, including the Lazy View update and the usage of the ‘pay to play’ method. Remember that in previous versions of Exchange, custom views were updated as soon as the store received an email. Although this technique provided the end users with a better experience, it had a negative impact on Exchange, forcing the system to continuously update the views and issue small random IOs in order to keep every view in the store up to date. With the Lazy View update, a view is only updated when requested by the end user.

Exchange 2010 utilizes Lazy View technology in which the views are updated when the user attempts to access them. Although this increased the time it takes to open the view, it dramatically enhanced the Exchange IO performance by using the notion that it is faster for the disk to read data stored in larger, sequential pieces versus the disk head having to gather smaller chunks of data spread out across the disk.

In order for Microsoft to create a table per mailbox, they had to remove SIS (Single Instance Storage). Some of you may complain about this initially, but never fear: Microsoft provided a work-around known as Database Compression. This technology is used to compress the content of the database (especially text and HTML content), and it compensates for the space lost through the removal of SIS.

Now take another look at the Exchange 2010 ESE and compare it to Exchange 2007’s ESE. In Exchange 2007, in order to open a message in Joe’s mailbox, Exchange had to open the mailbox table, read the message header, open the message and read the attachment (examples of small random IOs.)

In Exchange 2010, the Exchange system can open the mailbox table, read the message header, and open the message directly. It is important to note that since these tables are now logically connected it is more convenient for Exchange to access them, and thanks to the new page size in Exchange 2010, E14 can read the entire message body in a single IO. If additional IOs are needed they can be done, but in order to streamline the data gathering process, these commands are now grouped in larger, sequential IOs.

Let us pause at this point and revisit our discussion of Microsoft’s enhancements to the ESE in Exchange 2010 in Part 2, at which time we will delve deeper into the topics of physical and logical contiguity.
