Chapter 15

Operating and Monitoring Exchange Server 2013

Why monitor and report on your Exchange service? Exchange monitoring, reporting, and alerting are fundamentally about one thing—keeping your messaging infrastructure running sufficiently to meet your service availability targets. Given this fact, it often surprises us that project teams will go to great lengths designing highly available clustered solutions, with overly complex redundant components, and then assume that installing an operations-monitoring product with its out-of-box configuration will be sufficient to keep things running.

Predictably, this approach rarely results in a great experience, and instead it tends to swamp the operations teams with initially interesting though largely irrelevant alerts that eventually get disabled to reduce “noise.” These environments inevitably end up having an expensive monitoring solution installed with most of its alerts disabled. The main form of service alerting then comes from the most accurate barometer of Exchange service problems—the end users.

This is clearly an extreme scenario. Hopefully, though, it highlights a couple of points:

Monitoring Exchange should be about the service it provides to its customers, not just miniscule variations in disk latency counters.
If you are going to deploy an expensive monitoring solution and then ignore most of the alerts, you might as well have ignored the built-in product alerting system in the first place.
Not every failure needs fixing immediately.

The final point here is worthy of a little more discussion. Quite often in enterprise environments with 24x7 operations teams, a server failure, or blue screen, will result in someone being paged to come and fix it. Obviously, this may be necessary in some cases. However, if your Exchange 2013 infrastructure has multiple database copies, and the service workload has not been interrupted significantly, then paging an expensive resource to come by and resolve a problem that does not demand immediate attention makes no sense. The difficult decision for an operations team is how to assess the actual impact of a specific failure.

Unless the Exchange design team has provided specific guidance for alerting, then the operations teams will often end up applying the same logic used for the previous system to the newly deployed service. This may or may not be appropriate, and the risk is that the operations teams may not be able to make best use of the new platform.

We will make reference to and give examples of how the Microsoft Exchange team operates the Office 365 service at extreme scale (many millions of mailboxes). As you read through this chapter, above all remember that the primary goal of service monitoring, reporting, and alerting is to maintain the level of service defined for your messaging service and your end users. Also remember that things fail—it's a normal part of running a service. Making intelligent decisions when dealing with failure is what separates great operations teams from the rest.

While preparing this chapter, we were lucky enough to spend some time with Matt Gossage, principal program manager lead in the Exchange product group, who was responsible for running Office 365. Matt made one thing crystal clear to us—you should expect failures within your service. Failure is a normal part of running a service at scale; the way in which you deal with these failures is the important part.

Monitoring

Historically, Exchange monitoring was performed via an external monitoring solution such as System Center Operations Manager (SCOM) or SolarWinds. These types of applications gather information from the service such as performance data, event logs, mail delivery metrics, and so on. Then they store that data in a database for analysis. Following that, the monitoring solution takes action based on the recorded data. For example, SCOM ships with management packs containing rules and monitors that can trigger when a service is not running or when a performance counter is outside acceptable thresholds.

The problem with these types of applications is that they require fine-tuning for most deployments. Even for the Exchange Online part of Office 365, the SCOM deployment requires heavy customization to meet the operating requirements.

The root cause of this monitoring problem is simple. Albert Einstein said it best: “Not everything that can be counted counts, and not everything that counts can be counted.” Just because we can alert on how many milliseconds an operation took to complete doesn't mean that we should. What matters is how well the system is perceived to be meeting the demands of its end users. Recording how long an I/O operation took to complete is simple. Likewise, it's simple to define thresholds and take action if they are exceeded. But what if a threshold was exceeded and the end-user experience was just fine? How can we monitor the experience provided to the customers of our messaging service?

The truth is that a quality monitoring solution will require as much effort to design as the underlying service that it is monitoring. Not only will it monitor easy-to-harvest system data from event logs, performance counters, and message tracking logs, but it will also simulate user requests from outside the organization to observe how the system is performing from a user perspective. The solution should also be able to take action to fix common scenarios, such as failing over workload from an unhealthy DAG node to a healthy alternative.
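One simple way to capture that user perspective is a synthetic transaction: periodically perform a user-style operation, time it, and classify the result. The sketch below is our own illustration; the warn/fail thresholds and the health-check URL in the usage note are hypothetical, not product guidance.

```python
import time

def probe(check_fn, warn_ms=1000.0, fail_ms=5000.0):
    """Run one synthetic user check and classify the observed latency.

    check_fn simulates a user operation (for example, an OWA logon or an
    EWS call) and should raise an exception on failure. The warn/fail
    thresholds are illustrative defaults, not Microsoft guidance.
    """
    start = time.perf_counter()
    try:
        check_fn()
    except Exception as exc:
        # The operation failed outright; latency is meaningless here.
        return {"status": "failed", "latency_ms": None, "detail": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000.0
    if latency_ms >= fail_ms:
        status = "unhealthy"
    elif latency_ms >= warn_ms:
        status = "degraded"
    else:
        status = "healthy"
    return {"status": status, "latency_ms": latency_ms, "detail": None}
```

A real deployment would wire `check_fn` to something meaningful — for example, `probe(lambda: urllib.request.urlopen("https://mail.example.com/owa/", timeout=10))` — run it from outside the organization, and record the results over time.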

We will discuss some of the fundamental changes in Exchange 2013 that address these shifting requirements later in this chapter. We will also discuss how the Office 365 team approaches service operations.

Alerting

Alerting is what a monitoring solution does when something unusual has occurred. Historically, monitoring solutions ship with a database of thresholds for events and performance counters that, if exceeded, result in some form of notification being dispatched—a Simple Network Management Protocol (SNMP) trap, an email, an SMS message, or a combination of these. The expectation is that a human being will eventually be tasked with resolving whatever anomaly is present.

As IT systems have evolved and become ever more complex, so too have the systems designed to monitor them and produce alerts. The bottom line is that it is a waste of time and effort to page an expensive on-call resource to fix something that is not directly affecting service. More significantly, the risk of human error increases dramatically when problems in complex infrastructure are resolved under pressure by on-call resources.

One way to approach the issue of when to contact such resources is to consider service impact as part of the alert. This can be complicated, however, and so it generally requires human involvement. (For example, it may not be an Exchange problem that has taken down service.) As a general rule, there are three types of failure in an Exchange 2013 system:

Non-service-affecting failures
Service-affecting failures
Data corruption events

In the case of a non-service-affecting failure, the resolution will vary depending on what exactly has failed. If you have a DAG and other high-availability features, a single server failure or storage chassis failure could fall into this category. If you have four copies of the data spread across four DAG nodes and one node fails, you still have three left, and so there is no real requirement to summon on-call resources. Instead, the failure should be investigated and resolved by the normal operations teams during normal working hours.

Service-affecting failures are significant in that there are end users without service. This is a no-brainer—on-call resources should be mobilized, but “who you gonna call?” as the trio from Ghostbusters would say. Some of the worst cases we have worked on with Exchange systems had their root cause in a disaster recovery event where resolution was first attempted by the wrong resource and hence was done incorrectly. The single most important aspect of a service-affecting failure is to get the right resource to the right place as quickly as possible. It also helps if there is a set of predetermined scenarios readily available to resolve predictable problems, such as database reseeds or DAG node rebuilds.

Data corruption events are the worst type of alert, and they should be treated with the highest priority (so-called red alerts). There are various kinds of potential data corruption within Exchange, and all should be acted on immediately. By far the worst, however, is something called a lost flush. A lost flush occurs when Exchange believes it has written data to the disk, but that data never got there—despite the operating system receiving confirmation that it did. Exchange attempts to detect these issues, and it will raise an alert if it finds one. Our guidance is to treat data corruption events with the highest possible priority—even if they are not affecting service. Have a remediation plan for data corruption, and make sure that your team is drilled in its use. If a lost flush is detected, remove the suspect storage hardware from service immediately—do not put it back into production until the root cause is identified and resolved.
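The triage logic just described can be captured in a few lines. This is our own sketch of the decision, with invented priority names; your environment's thresholds (for example, how many healthy copies you insist on) will differ.

```python
def classify_failure(service_affected, data_corruption,
                     healthy_copies=None, min_copies=2):
    """Map a failure event to an alert priority.

    A sketch of the three failure classes discussed above; the priority
    names and the min_copies threshold are illustrative, not guidance.
    """
    if data_corruption:
        # Red alert: highest priority even if users are unaffected.
        return "red-alert"
    if service_affected:
        # End users are without service: mobilize on-call resources.
        return "page-on-call"
    if healthy_copies is not None and healthy_copies < min_copies:
        # Redundancy is nearly exhausted: treat as service affecting.
        return "page-on-call"
    # Non-service-affecting: resolve during normal working hours.
    return "business-hours"
```

Losing one of four database copies—`classify_failure(False, False, healthy_copies=3)`—stays at business-hours priority, which matches the earlier advice not to page anyone for a failure the DAG is tolerating.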

Reporting

Reporting covers a myriad of things, such as informing customers and the business of system availability metrics, or producing system usage and capacity reports showing availability headroom and trending.

Types of System Availability

It sometimes surprises design teams that there are two types of system availability: one that is presented to the business and one that is used within IT.

Business availability is generally a straightforward statement of system availability. That is, it is a statement of the availability of the system over a specific duration. Downtime during system maintenance windows and so forth is considered to be unavailable time. Sometimes, this value is actually calculated manually based on incident records and is not generated by system availability monitoring software.

IT availability data is usually more detailed, and it will take scheduled system maintenance windows into account. Typically, this data will be generated via a system monitoring application, and it will record periods of service outage while counting downtime during scheduled maintenance windows as available time—even though, strictly speaking, the system was down.

Most system monitoring tools offer the ability to generate IT availability data, and the reports and charts that are produced are tailored to the IT audience; that is, they consist of relatively technical, jargon-rich information. On the other hand, business availability reports tend to be jargon free and simply state the system availability number for the month.

Trending

Trending takes historical observations and uses that data to predict the future. As anyone who has tried to predict the future knows, it's not often easy or accurate. Some things are easier to predict than others, and, generally speaking, the farther into the future we wish to predict, the more imprecise things become.

Thirty years ago, weather forecasts were mostly inaccurate. In 2012, the Met Office (the United Kingdom's national weather service) ranked their five-day predictions to be as accurate as their 24-hour predictions were 30 years ago. This increase in prediction accuracy is mostly attributed to having precise historical data and more extensive observations from around the globe to work with.

The same logic applies when you are trying to make predictions about your Exchange infrastructure—the more information you have available, the more accurately you can predict what will happen next. Storage capacity usage is a common example. Imagine that you had capacity data for 100 mailboxes over the last six weeks compared to data for 10,000 mailboxes over the last six years. The smaller sample size will yield a less-accurate picture of growth within the organization, especially if it was taken during a holiday period or during a particularly busy time. Moreover, a few of the 100 mailboxes may be especially heavy users, skewing the average. Having a significant historical data size available plus a representative sample size on which to base prediction trends is vital. There are various statistical methods for determining minimum sample sizes for a population. For Exchange trending, however, we recommend taking the simplest approach possible and recording trending data for as many users as you possibly can.

The next step is to know what kind of information is available that you should trend. As a general rule, the following areas benefit most from some form of trending:

Storage capacity
Network utilization
System resources
Service usage
Failure occurrence

Storage capacity trending is obviously not unique to Exchange. However, Exchange does bring with it some interesting nuances. Exchange storage capacity trending is made up of various subcategories, including the following:

Mailbox databases
Transaction logs
Content indexes
Message queues
Tracking and protocol logs

Mailbox Databases Trending

It is important to understand that Exchange databases are not just made up of mailbox data; there is other “stuff” in the database, including database indexes (structures that speed up data access), white space (empty pages that can be used to write data), and items that have been deleted but are still retained. Also, an Exchange database file will never shrink on disk—even if the data inside the database does. For instance, if you moved half of the mailboxes from one database to another, the source mailbox database would remain the same size, even once all of the deleted items had been purged.

This is important because many factors can influence the size of mailbox databases. For example, mailboxes moved from database to database leave behind white space. This white space exists within the database file and is not returned to the disk, so the actual storage capacity used on the disk remains the same. However, when new data is written to the database, Exchange will try to use the white space before extending the file on disk. This behavior can lead to unusual storage capacity reports on Exchange servers, and it is another reason why it's useful to have as much historical data available as possible. Short-term capacity trending for Exchange databases is usually meaningless. Try to get at least 12 months' worth of “steady state” capacity data before trying to predict storage capacity trends for Exchange.

Transaction Log Trending

Transaction log capacity trending may not initially seem like an interesting metric. However, it provides two useful pieces of information. First, the log generation rate is a very good way to observe the rate of change within the database. Second, if we know how many logs are generated per hour, we know how long we can survive a sustained database copy outage (or a missed backup if native data protection is not being used).

Customers who have deployed a multicopy native data protection DAG often miss the second point; that is, they are relying on Exchange database copies, and potentially lagged copies, rather than taking backups. They will then configure Continuous Replication Circular Logging (CRCL) to truncate the transaction log files automatically once they have been successfully copied, inspected on lagged copies, and replayed on all other nonlagged database copies. The key here is that the transaction log files do not get truncated unless they have been copied, inspected, and replayed on all other database copies. If you have four database copies and one is in a failed state, the transaction log drives on all of the other database copies will begin to fill up. If this is left unchecked, the database will dismount. Therefore, having good trending data for your transaction log generation rate tells you how long the support teams have to resolve database problems before they affect service. Quite often, this value will have been predicted during the design phase but never checked or validated in production. Furthermore, the rate of transaction log file generation can change from service pack to service pack. This situation affected many customers when they deployed Exchange Server 2007 SP1, which radically increased the number of transaction logs generated.
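Because each Exchange transaction log file is 1 MB, a measured generation rate translates directly into how long a failed copy can be tolerated before the log LUN fills. A minimal sketch of that arithmetic (the figures in the usage example are invented):

```python
def log_runway_hours(free_bytes, logs_per_hour, log_size_bytes=1024 ** 2):
    """Hours until the transaction log LUN fills if truncation stops.

    logs_per_hour should be your measured peak generation rate, not the
    average; each Exchange transaction log file is 1 MB by default.
    """
    if logs_per_hour <= 0:
        raise ValueError("logs_per_hour must be positive")
    return free_bytes / (logs_per_hour * log_size_bytes)
```

For example, 50 GB of free log space at a peak rate of 500 logs per hour gives roughly 102 hours—about four days—for the support team to restore the failed copy before databases start dismounting.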


Tip
The easiest way to gather transaction log file generation data is from a lagged copy, since these are persisted for 24 hours or more.

Content Index Trending

The content index (CI) has changed significantly in Exchange 2013. The content index provides the ability to search mailbox data more quickly and effectively by creating a keyword index for each item stored within the Exchange database. In previous versions, there was a rough rule of thumb that said to account for 10 percent of the mailbox database for CI data. This guidance has not yet been formally released for Exchange Server 2013. From our lab environment observations, the new CI database in Exchange 2013 appears to be roughly 7 percent as large as the mailbox database that it is indexing. It is not generally required to record or trend the CI space usage explicitly, but be aware that it exists in the same folder as the mailbox database and it requires additional space.

Message Queue Trending

Although there is no longer a dedicated transport role in Exchange Server 2013—the old Hub Transport role has been incorporated into the Mailbox role—message delivery is now a combination of the Front End Transport service on the Client Access role and the Transport service on the Mailbox role. This means that there is still a mail.que database on every Exchange 2013 Mailbox server, but no mail.que on a Client Access server unless it is colocated with the Mailbox role. The transport queue holds email messages that are queued for delivery or that Exchange is retaining as part of the new Safety Net feature. Because of this feature, most organizations will store the mail.que on a dedicated LUN. Trending storage capacity for the message queue database is nontrivial since, like the mailbox database file, it does not shrink as data is removed. From a trending perspective, it is sufficient to monitor the storage capacity required for this database file. It is rare, however, to see problems caused by a lack of space for the transport database, because of the large size of today's hard drives and the relatively small size of the database.


Warning
Watch out for the transport queue database growing in the event of a failure that impacts message delivery. The database can grow very quickly, and even once the fault has been resolved, the database file will not shrink. Our advice is to ensure that you have a process in place to shrink the mail.que file after an event that causes an exceptionally large number of messages to be queued.

Because Safety Net data is stored within the database, the recommended way to remove white space is to perform an offline defragmentation via the eseutil /d command. This will require taking the transport services offline on that server for the duration of the process. If you have multiple Exchange servers within the AD site, however, it is possible to take them offline one at a time and defragment their queue databases until they are all back to a normal size without affecting message delivery.

Tracking/Protocol Log Trending

Message tracking logs are text files that contain routing and delivery information for every message handled by an Exchange server. These files are used to track messages throughout the organization. The longer the files are retained, and the more messages your organization processes, the greater the amount of space that will be required.

We have seen cases where a monitoring solution has been deployed that increased the duration for which message tracking logs were kept. The customer had not changed the location for the tracking logs, and they were still on the system drive when, a few weeks later, the customer's systems began running out of space. (Luckily, their monitoring system spotted this before it affected the system.) The bottom line is that even though these files are not especially large, they are stored within the default installation path unless moved and thus could potentially cause a problem down the line. Trending the storage space required over time is useful, although not typically vital.

Network Utilization Trending

Network utilization trending data has always been a critical resource for Exchange. It is becoming more and more important, however, as organizations look to consolidate datacenter locations and synchronize data between them to improve resilience to failure scenarios. The most important thing about network usage trending is to be aware of the data direction. Most network links are full duplex; this means that they can transfer data in both directions at the same time, although not necessarily at the same speed.

Network links may be symmetric (the same speed in both directions) or asymmetric (faster in one direction than the other). This is an important distinction, since when we are planning and trending network usage, we need to specify in which direction the data is moving—for example, from Server1 to Server2 or from load balancer to client. By far the two most important areas of Exchange network utilization are traffic from Exchange to the end user and replication traffic between DAG nodes.

End-user network traffic is typically asymmetric, with much more data passing from the Exchange service to the client than vice versa. If there is insufficient network capacity to meet end-user requirements, network latency will increase. Latency is a value that expresses how long network data packets take to pass between hosts. As network latency increases, the end-user experience will slow down, because data transfers and operations take longer to complete. We will discuss this in more detail in Chapter 12, “Exchange Clients.” From a monitoring perspective, however, it is vital to record and trend network utilization information, especially for the Exchange Online part of Office 365 or consolidated environments.

DAG replication traffic is generated to keep database copies and their associated content indexes up to date. Database replication traffic is highly asymmetric, occurring almost entirely in the direction from the active database copy to the replica copies. If the network link used for this replication traffic is insufficient, latency will rise, and this may delay transaction logs being copied to replicas. This matters because the delay in copying transaction log files between database copies increases the recovery point objective (RPO). RPO is a value stating how much data you are prepared to lose in the event of a failure within your service. Obviously, if a failure occurs and the replica copy is 100 log files behind the active copy, then you have lost at least 100 MB of data on that database. Monitoring network links used for replication, and trending how capacity usage changes during the day, can help prevent unexpected data loss during a failure.
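The copy queue length (reported per copy by Get-MailboxDatabaseCopyStatus) can be converted into data at risk in exactly this way. A trivial sketch, assuming the default 1 MB log file size; the RPO target in the usage note is invented:

```python
def data_at_risk_mb(copy_queue_length, log_size_mb=1.0):
    """Approximate megabytes lost if the active copy failed right now.

    copy_queue_length is the number of log files not yet shipped to the
    replica; each log file is 1 MB, so this figure is a lower bound.
    """
    return copy_queue_length * log_size_mb

def rpo_breached(copy_queue_length, rpo_target_mb, log_size_mb=1.0):
    """True if the current replication lag exceeds the agreed RPO."""
    return data_at_risk_mb(copy_queue_length, log_size_mb) > rpo_target_mb
```

Trending this value across the day—for example, against a hypothetical 50 MB RPO target—shows when replication links are closest to breaching the RPO, which is usually during bulk operations such as mailbox moves or reseeds.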

Many organizations already have some form of enterprise network link-monitoring software, but not all of these programs will trend usage patterns. By far the most common tool is the Multi Router Traffic Grapher (MRTG). This is partly because it is free and partly because it uses Simple Network Management Protocol (SNMP) to query routers, switches, and load balancers, so it's a relatively straightforward process to get the data you need. MRTG displays this data in real time and historically in chart format. However, it will not show predicted future trends, so trend prediction must be performed manually.

Figure 15.1 shows sample output from MRTG taken from my home broadband router. (None of my customers wanted to share their link data!) You can see from the chart that there are periods of 100 percent link utilization. In my case, this is usually for synchronizing data to SkyDrive or downloading ISO files from MSDN. In an enterprise environment, this could easily be a scheduled backup occurring or perhaps someone reseeding an Exchange database across the network. Short periods of 100 percent utilization on a network link are relatively normal; however, they should be brief and ideally infrequent during the work day.

Figure 15.1 Sample network utilization graph from MRTG


Monitoring and trending network usage are relatively easy, but they're only half the story. As mentioned earlier, network latency tends to have the most dramatic effect on data throughput and client experience.

Network latency is usually expressed as round-trip time (RTT) in milliseconds. It is common for network teams to monitor link latency within their datacenters between switches and routers. Nonetheless, it is not always common for this data to be used for future trend prediction.

System Resource Trending

System resource trending is the process of recording how servers within the Exchange service use critical system resources such as processor, memory, and disk. Additionally, it is often useful to trend some key Exchange performance counters, such as RPC Averaged Latency. This is a particularly important counter for Exchange, since it shows the average time taken for Exchange to satisfy the previous 1,024 RPC packets. We highly recommend trending the MSExchange RpcClientAccess\RPC Averaged Latency value for all servers within your Exchange 2013 service.

Many teams perform this type of trending simply by recording performance monitor counters via the standard Performance Monitor tool included with Windows. Other customers prefer to use a third-party tool to do this. The most important thing about resource trending as opposed to alerting is that you are trying to predict when the system will begin to trigger alert thresholds. We will examine how to use Excel to generate future predictions later in this chapter.

Service Usage Trending

Service usage trending is performed by recording historical information about how the service was used and then attempting to predict future growth patterns. This may include the number of active users on the system, the number of messages processed, the number of TCP connections, or Exchange performance counters such as RPC Operations/sec. Service usage trending is primarily used to explain changes in system performance behavior.

Imagine a scenario where the RPC Averaged Latency during the workday has gradually increased from 5 ms to 10 ms over a three-month period. Although this is useful information, you need to determine what is causing this change. Do you have more users on the system? Are the users working harder (increase in RPC Operations/sec)? Have you reached a system bottleneck (system resource exhaustion, processor, disk, memory, and so on)?
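Answering these questions usually starts with checking how strongly the latency series tracks each candidate cause. A plain Pearson correlation, sketched here without any libraries, is often enough to decide which counter to investigate first:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance and standard deviations (unnormalized; factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a series with zero variance has no correlation")
    return cov / (sd_x * sd_y)
```

If daily RPC Averaged Latency correlates strongly with RPC Operations/sec (a coefficient near 1.0), the latency growth is probably driven by rising workload; a weak correlation points instead toward a resource bottleneck.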

The job of service usage data is to help you understand what the customers of the service are doing, when they do it, and how their behavior impacts the performance of your Exchange service.

Failure Trending

Recording failure occurrences is something that most organizations do as a matter of course, largely due to the widespread adoption of the Information Technology Infrastructure Library (ITIL), which also recommends trending of problems. Defining how to deal with such problems is outside the scope of this chapter. From a trending perspective, however, you are interested in when the problem occurred, how significant it was, whether it is happening regularly, and whether you can do anything to stop it from happening again. Remember the advice from Matt Gossage at the beginning of this chapter: expect failure, because you will never be able to design the perfect system. Instead of attempting to stop all problems from ever occurring, consider designing the system to tolerate failure without affecting service availability—especially if that is easier than fixing the root cause.

Matt also provided one additional piece of advice: evaluate data corruption events seriously and quickly. Exchange 2010 introduced ESE lost flush detection. A lost flush is a form of logical data corruption that could impact all database copies, and these events should trigger the highest possible level of alerting and receive immediate attention. Any event that results in database corruption should be considered an extreme high priority. You must work to understand, resolve, and take action on such events immediately, and work to prevent them from occurring in the future.

Using Excel to Predict Trends

We have talked about some interesting things to record and trend so far, but we have not discussed how to perform trending. Our favorite way to deal with trending is by using Excel. Ideally, this process should be completed roughly every three months, although some organizations do this on an annual basis only. The process begins with collecting the data and then converting it into a usable format. The potential sources of data are too varied to discuss here. Nevertheless, you'll eventually want something that you can import into Excel, ideally in a comma-separated value (CSV) format.

Once the data is in Excel, you can plot it against time and then use one of the Excel trend-line functions. Excel provides six options: Exponential, Linear, Logarithmic, Polynomial, Power, and Moving Average. Our advice is to begin by adding Linear trending to your chart and then experiment with the other options to find the best fit.

Figure 15.2 illustrates our recommended approach to trending. This example shows the disk LUN capacity data for a mailbox database. In this example, the customer wanted to know when their database LUNs would reach 65 percent capacity in order to allow them sufficient time to commission more storage. The chart shows two sets of data, an initial prediction based on 12 months of historical data and another prediction based on 24 months of historical data. A trend was identified and a prediction date derived for when the LUN would reach 65 percent capacity.

Figure 15.2 Example trend chart in Excel


Initially, the historical data was plotted against the date axis and then a linear trend line was added. In this case, a linear prediction fits our data very well. We then drew a line at 65 percent capacity and looked to see where the trend line crossed the 65 percent capacity line. If we draw a line directly down to the date axis, this will give us a predicted date when the LUN will reach 65 percent capacity.
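The same calculation Excel performs here can be scripted if you prefer an automated quarterly report. This sketch fits an ordinary least-squares line to (day offset, percent used) samples and solves for the day the line crosses a threshold; the sample data in the usage note is invented:

```python
def days_until_threshold(samples, threshold):
    """Fit a linear trend to (day_offset, percent_used) samples and return
    the day offset at which the fitted line reaches `threshold`.

    Returns None if the trend is flat or shrinking (no crossing ahead).
    Equivalent to extending Excel's Linear trend line to the threshold.
    """
    n = len(samples)
    sum_x = sum(day for day, _ in samples)
    sum_y = sum(used for _, used in samples)
    sum_xx = sum(day * day for day, _ in samples)
    sum_xy = sum(day * used for day, used in samples)
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        return None  # all samples taken on the same day; no trend to fit
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    if slope <= 0:
        return None  # capacity flat or falling; threshold never reached
    return (threshold - intercept) / slope
```

Given monthly samples growing one percentage point per month—`[(0, 50), (30, 51), (60, 52)]`—the function predicts the 65 percent line is crossed about 450 days after the first sample. Remember the caveat below about not predicting farther ahead than the history you hold.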

Figure 15.2 shows clearly that our prediction may change as more data is analyzed. This is why it is important to reevaluate your trending predictions regularly. We recommend that you update trending predictions quarterly.

As stated previously, though, the farther into the future the prediction looks, the less accurate it is likely to be. Likewise, the more historical data you have, the more likely the prediction is going to be accurate.

As a general rule, do not attempt to predict any farther into the future than the amount of historical data you have. That is, if you have 12 months' worth of data, only predict 12 months into the future. Attempting to apply a relatively short historical sample to predict well into the future is little more than guesswork and is unlikely to be accurate.

Inventory

Maintaining a list of hardware and software components that make up your Exchange service is extremely useful. Patching and maintaining an Exchange service can be complex, especially at scale. In the past, we have seen organizations that believed that all of their Exchange Servers were at a specific patch level, only to discover that some were not when we verified them physically. This problem only gets worse once you begin to include operating system patches, hardware drivers, and, potentially, even things such as hardware load balancer operating systems. All of these things can impact the service and the way the components interact with each other.

By far the best way to approach inventory management is to use a script or tool to do it for you. Luckily, Exchange 2013 has PowerShell, which makes it easy to find this kind of information. There are even some freely available scripts that you can download, which will provide a rich set of data. Steve Goodman wrote our favorite example of an Exchange inventory script, and it is available for free at the following URL:

http://www.stevieg.org/2013/01/exchange-environment-report-v1-5-6/

At the time of this writing, version 1.5.6 is listed on Steve's blog.

Figure 15.3 shows a sample output of this script. The script provides most of the critical information that an Exchange team will require to operate their service successfully.

Figure 15.3 Example Exchange Environment Report


We recommend keeping track of the following core information about your Exchange platform:

Organization information

Per-server information

DAG and database information

The free script provides an amazing starting point for this information. It does not gather everything, but what it does gather is useful and easy to generate. It also shows what is possible natively with PowerShell.

For many enterprise organizations, a free script will not meet their requirements adequately. Connecting to their corporate inventory system may be a part of the production handover process, but it may still be useful for the Exchange operations teams to maintain their own specific data about their service, since it will be more readily available and easier to modify if additional information is required. It is not unusual to see multiple inventory systems in use in large enterprises due to different teams requiring different information. For example, IT leadership may be interested in server numbers, costs, and licensing but probably not the device driver for the RAID controller in use on mailbox servers.

Our recommendation is to give Steve's free PowerShell script a test run, modify and update it as appropriate, and schedule it to update the HTML daily.
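On Windows Server 2012, one way to automate the daily refresh is with the ScheduledTasks module. The script path and the -HTMLReport parameter below are assumptions based on the version of the script available at the time of writing; check them against the copy you download.

```powershell
# Schedule the inventory report to regenerate its HTML daily at 06:00.
# Script path and -HTMLReport parameter are assumptions; verify against
# the version of the script you downloaded.
$action = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -File C:\Scripts\Get-ExchangeEnvironmentReport.ps1 -HTMLReport C:\Reports\Exchange.html'
$trigger = New-ScheduledTaskTrigger -Daily -At 6am
Register-ScheduledTask -TaskName 'Exchange Environment Report' -Action $action -Trigger $trigger
```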

Monitoring Enhancements in Exchange 2013

Exchange 2013 is the first version of Exchange to be written since the Exchange product group became responsible for Exchange Online within the Office 365 platform. This is not often discussed outside of Microsoft. However, there is no denying that having Exchange developers and program managers called out to resolve faults within Office 365 has dramatically affected the direction in which Exchange 2013 was developed. The benefits of running Office 365 are obvious when you look at a few new features in Exchange 2013.

Managed Availability

Managed Availability (MA) is a total shift in approach for Exchange. At its core, MA is a native health monitoring and recovery system. Historically, Exchange has provided a number of performance counters and application events that described how the system was performing. It was then the job of some other program or utility to analyze that information and do something with it. In Exchange 2013, this has been taken to a whole new level. Exchange MA is not only aware of how the system is behaving from both a performance and health perspective, but it also has predefined recovery actions that it can trigger to attempt to resolve any issues itself.

Exchange is now aware of itself in three ways: whether its services are available, whether the end-user experience (latency) is acceptable, and whether errors are occurring. These items together define the health of the Exchange server.

The MA service is made up of three main components: the probe engine, the monitor, and the responder engine.

The probe engine is responsible for gathering data about the running Exchange server. This data is then passed to the monitor, which applies a set of logic to determine system health. If the system is found to be unhealthy, the responder engine uses its own logic to determine the correct recovery action to take at the time of the event.

Like most things in Exchange, the monitors can be queried via PowerShell. A monitor reports an alert value of either Healthy or one of several unhealthy states (Degraded, Disabled, Unavailable, or Repairing). Once a monitor has entered an unhealthy state, a responder will take recovery action. The action taken depends on the event and also on how many times recovery has already been attempted. It may be as simple as terminating and restarting a service, or as significant as forcibly restarting the server.
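To see which probes, monitors, and responders make up a given health set, you can use the Get-MonitoringItemIdentity cmdlet. The server name below is from our lab; substitute your own.

```powershell
# List the probes, monitors, and responders behind the Network health set
# (server name is from our lab; substitute your own).
Get-MonitoringItemIdentity -Identity Network -Server org7ex2013 |
    Format-Table ItemType, Name -AutoSize
```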

One of the best sources of information on Managed Availability is Ross Smith IV and Greg (“The Father of Modern Exchange HA”) Thiel's blog post:

http://blogs.technet.com/b/exchange/archive/2012/09/21/lessons-from-the-datacenter-managed-availability.aspx

When this feature was first presented to the messaging community, many people commented that there would be no need for Exchange operations teams or third-party monitoring solutions. This was nonsense, obviously, but Managed Availability is likely to reduce the frequency of someone being summoned to deal with an easily rectifiable problem. If Exchange experiences a failure that MA cannot resolve, then in all likelihood the resolution process will require a skilled third-party support resource.

Viewing Server Health

Managed Availability brings with it a record of server health. To view the health information for a single server, run the following PowerShell command, replacing the server name with that of your own server.

Get-ServerHealth -Server org7ex2013 | Get-HealthReport

Figure 15.4 shows an example from one of our lab machines. Some unhealthy monitors are highlighted.

Figure 15.4 Sample health report output


We can tweak this command a bit to show the unhealthy monitors:

Get-ServerHealth -Server org7ex2013 | Get-HealthReport | where { $_.alertvalue -eq "Unhealthy" }

Figure 15.5 shows the output of this command, listing the unhealthy monitors for this server. In this example, you can see that we have a problem with the Monitoring and Network healthsets, but it doesn't give us any more information.

Figure 15.5 Unhealthy monitor example


To get more information, you need to pipe the previous command to Format-List (fl).

Get-ServerHealth -Server org7ex2013 | Get-HealthReport | where { $_.alertvalue -eq "Unhealthy" } | fl

Figure 15.6 shows the detailed output of our failed health monitors. You can see from this that we have a number of problems with system resources:

MonitorCount : 5
Entries      : {MaintenanceFailureMonitor,
               ProcessIsolation/PrivateWorkingSetWarningThresholdExceeded,
               ProcessIsolation/ProcessProcessorTimeErrorThresholdExceeded,
               ProcessIsolation/PrivateWorkingSetWarningThresholdExceeded,
               ProcessIsolation/ProcessProcessorTimeErrorThresholdExceeded}

Figure 15.6 Monitor detail example


Remember that this is a lab system, and so it has less than the minimum recommended system resources available. Thus, it would be a surprise if Managed Availability reported it as healthy. The report also shows us that this particular monitor has been triggered five times previously.

The second event is from the Network healthset, and its entries suggest that we have a problem with this server's DNS host record. The MonitorCount of 1 suggests that this is the first time that this problem has occurred:

MonitorCount : 1
Entries      : {Network DnsHostRecordMonitor}

Hopefully, this brief walkthrough has highlighted the monitoring enhancement possibilities of Exchange Server 2013. For some environments, it may be possible to use Managed Availability and PowerShell to create an adequate monitoring and alerting system. Obviously, this is not to say that you will never need an additional solution for Exchange 2013. Nevertheless, many organizations will be able to operate without one for Exchange 2013.
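As a starting point for such a homegrown system, the check below mails the operations team whenever any monitor on a server reports Unhealthy. The server name, addresses, and SMTP server are placeholders; you would typically run a check like this from a scheduled task.

```powershell
# Minimal alerting sketch: mail the operations team when any monitor
# on a server reports Unhealthy. Names and addresses are placeholders.
$unhealthy = Get-ServerHealth -Server org7ex2013 | Get-HealthReport |
    Where-Object { $_.AlertValue -eq 'Unhealthy' }

if ($unhealthy) {
    $body = $unhealthy | Format-List | Out-String
    Send-MailMessage -To 'ops@contoso.com' -From 'monitor@contoso.com' `
        -Subject 'Unhealthy monitors on org7ex2013' -Body $body `
        -SmtpServer 'smtp.contoso.com'
}
```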

Workload Management

Workload Management (WLM) is another new process within Exchange 2013 that aims to reduce the workload on the operations staff. The primary goal of WLM is to prioritize user experience over system tasks. WLM is also responsible for stopping a single “bad” client from overconsuming system resources.

Before we discuss what WLM does, we need to address what system resources actually are and how we think about them. System resources represent things like processor capacity, physical memory, or storage IOPS. Historically, these tended to be reported by percentage utilized; that is, how much of your system is actively at work at a given time. The tendency here is for teams to view low percentage utilized as good and high percentage utilized as bad. This behavior is understandable to some degree, since having a low percentage utilization of core system resources is likely to mean that your end users are experiencing good performance.

In recent years, however, there has been an increasing focus on service running costs. Thus, having systems running in the single-digit percentage utilization range represents a large amount of wasted resources. Ideally, you want to make full use of your system resources without impacting end-user performance. This is partly what WLM tries to do.

What Is Workload Management?

Workload Management (WLM) means trying to optimize resource utilization while at the same time preserving the end-user experience. It does this in two main ways: by scheduling background system work into spare capacity, and by throttling individual clients that consume more than their fair share.

WLM is aware of how Exchange Server is handling its end-user requests by monitoring performance counters. It is then able to schedule background system tasks to use the spare resources available. If these system tasks begin to impact end-user performance, they are throttled back until the impact subsides, or they may be delayed from running until the system has more resources available. This is a simple but very ingenious way of removing the requirement to plan and schedule things like maintenance tasks. An important point to remember here is that background database maintenance, the process that performs physical database integrity checking, is never throttled, regardless of system resource usage.

The impact of WLM on operations is interesting. For example, consider moving mailboxes. A mailbox move is subject to system WLM, and so it will be throttled back if it impacts end-user experience. This means that you could perform mailbox moves during the day knowing that WLM would protect service performance. The mailbox moves may take a little longer than if you performed them during off hours, but since they can happen online, end users may not even notice.
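For example, a daytime batch move might be submitted at reduced priority, letting WLM and the move priority together keep end-user workload responsive. The database and batch names below are placeholders.

```powershell
# Submit a daytime batch move at reduced priority. WLM will throttle the
# moves if they begin to impact end users. Names are placeholders.
Get-Mailbox -Database DB01 |
    New-MoveRequest -TargetDatabase DB02 -Priority Low -BatchName 'Daytime moves'
```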

Also, think about our trending and resource planning. If WLM is making use of system resources for background tasks, then this will impact how your system resource trending patterns will look; that is, they may appear as more highly utilized than with Exchange 2010 across the workday. However, the huge spikes in nightly maintenance tasks will not be apparent. The concept of trending system resources is still vital for Exchange 2013. Jeff Mealiffe, the program manager responsible for Exchange performance and sizing, often refers to this as “smoothing peaks and filling in the valleys.” The workload remains the same, however, because WLM is just spreading it out more evenly.

WLM also introduces an improved end-user throttling system to prevent monopolization of system resources. The idea behind this is to promote fair use of Exchange Server resources for all users. Exchange Server 2010 introduced user throttling, and Exchange Server 2013 takes this further by tracking resource utilization more accurately, using shorter client back-off delays, and introducing a token bucket, which allows short bursts of activity while capping sustained consumption.

From a deployment perspective, it is worth noting that if you are deploying in coexistence with Exchange Server 2010, mailboxes hosted on 2010 servers will use the 2010 policies, while mailboxes migrated to Exchange 2013 will make use of the newer policies.

Our recommendation for throttling policies remains the same in Exchange 2013 as it did in Exchange 2010: Leave the default global policy at its default values, and create and apply a new policy for system mailboxes where required.
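A sketch of that recommendation follows. The policy name, mailbox name, and parameter value are illustrative only; size any limits for the workload in question.

```powershell
# Create a dedicated policy for a system mailbox rather than editing the
# global defaults. The value shown is illustrative, not a recommendation.
New-ThrottlingPolicy -Name 'SystemMailboxPolicy' -EwsMaxConcurrency 50

# Associate the policy with the system mailbox (name is a placeholder)
Set-ThrottlingPolicyAssociation -Identity 'svc-migration' -ThrottlingPolicy 'SystemMailboxPolicy'
```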

Summary

This chapter walked you through some of the more interesting aspects of monitoring, reporting, alerting, trending, and operating an Exchange service. One of the challenges of writing this material is that Exchange is operated in such a wide variety of environments that what is important to one organization is irrelevant to another.

Hopefully, we have managed to convey the more important aspects of this topic for design teams; that is, design your solution with operations in mind. When you are making your design decisions, think of how those decisions may affect operations and support teams. The cost to operate and manage an Exchange service is usually an order of magnitude higher than the cost of designing and installing it in the first place.

Alerting should have some intelligence built in. Just because something has stopped working does not necessarily mean that you need to summon an on-call resource immediately, especially if the service is highly available and the end users are unaffected.

The goal of trending is to head off problems before they occur. It may take a little manipulation in Excel or the purchase of a third-party product, but being able to predict accurately when your system will run out of core resources is a vital competence for most IT shops.

Above all else, go and talk to the operations teams that will be responsible for running your designs, and make sure that you understand their requirements. It is generally much easier to hand a service over to production when the operations teams are included in the design and testing process. Don't forget that operations teams are also customers of your design.