Appendix F - Monitoring and Managing Hybrid Applications


A typical hybrid application comprises a number of components, built using a variety of technologies, distributed across a range of sites and connected by networks of varying bandwidth and reliability. With this complexity, it is very important to be able to monitor how well the system is functioning, and quickly take any necessary restorative action in the event of failure. However, monitoring a complex system is itself a complex task, requiring tools that can quickly gather performance data to help you analyze throughput and pinpoint the causes of any errors, failures, or other shortcomings in the system.

The range of problems can vary significantly, from simple failures caused by application errors in a service running in the cloud, through issues with the environment hosting individual elements, to complete systemic failure and loss of connectivity between components whether they are running on-premises or in the cloud.

Once you have been alerted to a problem, you must be able to take the appropriate steps to correct it and keep the system functioning. The earlier you can detect issues and the more quickly you can fix them, the less impact they will have on the operations that your business depends on, and on the customers using your system.

It is important to follow a systematic approach, not only when managing hybrid applications but also when deploying and updating the components of your systems. You should try to minimize the performance impact of the monitoring and management process, and you should avoid making the entire system unavailable if you need to update specific elements.

Collecting diagnostic information about the way in which your system operates is also a fundamental part of determining the capacity of your solution, which in turn can affect any service level agreement (SLA) that you offer users of your system. By monitoring your system you can determine how it consumes resources as the volume of requests increases or decreases, and this can help you assess the resources required and the running costs associated with maintaining an agreed level of performance.

This appendix explores the challenges encountered in keeping your applications running well and fulfilling your obligations to your customers. It also describes the technologies and tools that the Windows Azure™ technology platform provides to help you monitor and manage your solutions in a proactive manner, as well as assisting you in determining the capacity and running costs of your systems.

Use Cases and Challenges

Monitoring and managing a hybrid application is a nontrivial task due to the number, location, and variety of the various moving parts that comprise a typical business solution. Gathering accurate and timely metrics is key to measuring the capacity of your solution and monitoring the health of the system. Additionally, well defined procedures for recovering elements in the event of failure are of paramount importance. You may also be required to collect routine auditing information about how your system is used, and by whom.

The following sections describe some common use cases for monitoring and maintaining a hybrid solution, and summarize many of the challenges you will encounter while performing the associated tasks.

Measuring and Adjusting the Capacity of Your System

Description: You need to determine the capacity of your system so that you can identify where additional resources may be required, and the running costs associated with these resources.

In a commercial environment, customers paying to use your service expect to receive a quality of service and level of performance defined by their SLA. Even nonpaying visitors to your web sites and services will anticipate that their experience will be trouble-free; nothing is more annoying to potential customers than running up against poor performance or errors caused by a lack of resources.

Note (Bharath says):
Customers care about the quality of service that your system provides, not how well the network or the cloud environment is functioning. You should ensure that your system is designed and optimized for the cloud as this will help you set (and fulfill) realistic expectations for your users.

One way to measure the capacity of your system is to monitor its performance under real world conditions. Many of the issues associated with effectively monitoring and managing a system also arise when you are hosting the solution on-premises using your organization's servers. However, when you relocate services and functionality to the cloud, the situation can become much more complicated for a variety of reasons, including:

This is clearly a much more challenging environment than the on-premises scenario. You must consider not just how to gather the statistics and performance data from each instance of each service running on each server, but also how to consolidate this information into a meaningful view that an administrator can quickly use to assess the health of your system and determine the cause of any performance problems or failures. In turn, this requires you to establish an infrastructure that can unobtrusively collect the necessary information from your services and servers running in the cloud, and persist and analyze this data to identify any trends that can highlight scope for potential failure, such as excessive request queue lengths, processing bottlenecks, and slow response times.

You can then take steps to address these trends, perhaps by starting additional service instances, deploying services to more datacenters in the cloud, or modifying the configuration of the system. In some cases you may also determine that elements of your system need to be redesigned to better handle the throughput required. For example, a service processing requests synchronously may need to be reworked to perform its work asynchronously, or a different communications mechanism might be required to send requests to the service more reliably.
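As a minimal sketch of the kind of trend analysis described above, the following Python fragment consolidates request queue length samples reported by several service instances and flags the instances whose recent average exceeds a threshold. The instance names, the window size, and the threshold are hypothetical values chosen for illustration; a real system would read the samples from wherever the diagnostic data is persisted.

```python
from collections import defaultdict, deque
from statistics import mean


class QueueLengthMonitor:
    """Keeps a sliding window of queue-length samples per service instance."""

    def __init__(self, window_size=5, threshold=100.0):
        self.window_size = window_size
        self.threshold = threshold
        # Each instance gets its own fixed-size window; old samples fall off.
        self._samples = defaultdict(lambda: deque(maxlen=window_size))

    def record(self, instance, queue_length):
        self._samples[instance].append(queue_length)

    def averages(self):
        """Consolidated view: instance name -> average of its recent samples."""
        return {name: mean(window) for name, window in self._samples.items()}

    def overloaded(self):
        """Instances whose window is full and whose average exceeds the threshold."""
        return sorted(
            name
            for name, window in self._samples.items()
            if len(window) == self.window_size and mean(window) > self.threshold
        )


monitor = QueueLengthMonitor(window_size=3, threshold=100.0)
for length in (40, 50, 60):
    monitor.record("orders-0", length)   # hypothetical instance name
for length in (90, 120, 150):
    monitor.record("orders-1", length)   # hypothetical instance name

print(monitor.averages())
print(monitor.overloaded())
```

Requiring a full window before flagging an instance avoids raising an alert on a single early spike; whether that trade-off suits your system depends on how quickly you need to react.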

Monitoring Services to Detect Performance Problems and Failures Early

Description: You need to maintain a guaranteed level of service.

In an ideal situation, software never fails and everything always works. This is unrealistic in the real world, but you should aim to give users the impression that your system is always running perfectly; they should not be aware of any problems that might occur.

However, no matter how well tested a system is, there will be factors outside your control that can affect how it functions; the network is a prime example. Additionally, unless you have spent considerable time and money performing a complete and formal mathematical verification of the components in your solution, you cannot guarantee that they are free of bugs. The key to maintaining a good quality of service is to detect problems before your customers do, diagnose their cause, and either repair these failures quickly or reconfigure the system to transparently redirect customer requests around them.

Note (Poe says):
Remember that testing can only prove the presence and not the absence of bugs.

If designed carefully, the techniques and infrastructure you employ to monitor and measure the capacity of your system can also be used to detect failures. It is important that the infrastructure flags such errors early so that operations staff can take the proper corrective actions rapidly and efficiently. The information gathered should also be sufficient to enable operations staff to spot any trends, and if necessary dynamically reconfigure the system to add or remove resources as appropriate to accommodate any change in demand.

Recovering from Failure Quickly

Description: You need to handle failure systematically and restore functionality quickly.

Once you have determined the cause of a failure, the next task is to recover the failed components or reconfigure the system. In a live environment spanning many computers and supporting many thousands of users, you must perform this task in a thoroughly systematic, documented, and repeatable manner, and you should seek to minimize any disruption to service. Ideally, the steps that you take should be scripted so that you can automate them, and they must be robust enough to record and handle any errors that occur while these steps are being performed.
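The following Python sketch illustrates one way to script recovery steps so that each step is recorded and any error is handled rather than left unnoticed. The step names and functions are hypothetical placeholders; in practice each step would call whatever management tooling your organization uses.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("recovery")


def run_recovery(steps):
    """Run named recovery steps in order, recording the outcome of each one.

    Stops at the first failing step so that later steps never run against a
    system in an unexpected state. Returns a list of (name, status) pairs.
    """
    outcomes = []
    for name, action in steps:
        log.info("starting step: %s", name)
        try:
            action()
        except Exception as exc:
            # Record the failure and halt; an operator decides what happens next.
            log.error("step failed: %s (%s)", name, exc)
            outcomes.append((name, "failed"))
            break
        log.info("completed step: %s", name)
        outcomes.append((name, "ok"))
    return outcomes


# Hypothetical steps used only to exercise the runner.
def drain_traffic():
    pass


def restart_service():
    raise RuntimeError("instance did not respond")


def restore_traffic():
    pass


results = run_recovery([
    ("drain traffic", drain_traffic),
    ("restart service", restart_service),
    ("restore traffic", restore_traffic),
])
print(results)
```

Because the runner returns the outcome of every step it attempted, the same record can be written to an audit log to document what was done during the recovery.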

Logging Activity and Auditing Operations

Description: You need to record all operations performed by every instance of a service in every datacenter.

You may be required to maintain logs of all the operations performed by users accessing each instance of your service, performance variations, errors, and other runtime occurrences. These logs should be a complete, permanent, and secure record of events. Logging may be a regulatory requirement of your system, but even if it is not, you may still need to track the resources accessed by each user for billing purposes.

An audit log should include a record of all operations performed by the operations staff, such as service shutdowns and restarts, reconfigurations, deployments, and so on. If you are charging customers for accessing your system, the audit log should also contain information about the operations requested by your customers and the resources consumed to perform these operations.

An error log should provide a date and time-stamped list of all the errors and other significant events that occur inside your system, such as exceptions raised by failing components and infrastructure, and warnings concerning unusual activity such as failed logins.

A performance log should provide sufficient data to help monitor and measure the health of the elements that comprise your system. Analytical tools should be available to identify trends that may cause a subsequent failure, such as a SQL Azure™ technology platform database nearing its configured capacity, enabling the operations staff to perform the actions necessary to prevent such a failure from actually occurring.
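To illustrate how performance log data might be used to anticipate a failure such as a database reaching its configured capacity, the following Python sketch makes a simple straight-line projection from daily size samples. The sample values, the capacity, and the 14-day warning horizon are hypothetical, and a linear projection is only one of many possible estimators.

```python
def days_until_full(daily_sizes_gb, capacity_gb):
    """Estimate days until capacity is reached from daily size samples.

    Uses the average daily growth across the samples (a straight-line
    projection). Returns None when the data is not growing or there are
    fewer than two samples.
    """
    if len(daily_sizes_gb) < 2:
        return None
    growth_per_day = (daily_sizes_gb[-1] - daily_sizes_gb[0]) / (len(daily_sizes_gb) - 1)
    if growth_per_day <= 0:
        return None
    remaining = capacity_gb - daily_sizes_gb[-1]
    return max(remaining, 0) / growth_per_day


sizes = [40.0, 41.0, 42.0, 43.0, 44.0]   # hypothetical daily database sizes in GB
estimate = days_until_full(sizes, capacity_gb=50.0)
if estimate is not None and estimate < 14:   # hypothetical warning horizon in days
    print(f"warning: database projected to be full in {estimate:.1f} days")
```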

Deploying and Updating Components

Description: You need to deploy and update components in a controlled, repeatable, and reliable manner whilst maintaining availability of the system.

As in the use case for recovering from system failure, all component deployment and update operations should be performed in a controlled and documented manner, enabling any changes to be quickly rolled back in the event of deployment failure; the system should never be left in an indeterminate state. You should implement procedures that apply updates in a manner that minimizes any possible disruption for your customers; the system should remain available while any updates occur. In addition, all updates must be thoroughly tested in the cloud environment before they are made available to live customers.
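One way to avoid leaving the system in an indeterminate state is to pair every update step with an action that reverses it. The Python sketch below shows this pattern; the in-memory `state` dictionary and the step functions are hypothetical stand-ins for a real deployment.

```python
def apply_update(steps):
    """Apply (name, do, undo) steps in order; on failure undo completed steps.

    Returns True if every step succeeded, False if the update was rolled back.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            print(f"step '{name}' failed: {exc}; rolling back")
            # Undo in reverse order so the system returns to its prior state.
            for done_name, done_undo in reversed(completed):
                done_undo()
                print(f"rolled back '{done_name}'")
            return False
        completed.append((name, undo))
    return True


# Hypothetical in-memory stand-in for the state of a deployment.
state = {"version": "1.0", "config": "old"}


def upgrade_binaries():
    state["version"] = "1.1"


def downgrade_binaries():
    state["version"] = "1.0"


def migrate_config():
    raise RuntimeError("configuration schema mismatch")


def restore_config():
    state["config"] = "old"


succeeded = apply_update([
    ("upgrade binaries", upgrade_binaries, downgrade_binaries),
    ("migrate config", migrate_config, restore_config),
])
print(succeeded, state)
```

Only the steps that completed are undone; the failing step is assumed to have made no lasting change, which is an assumption you would need to verify for each real step.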

Cross-Cutting Concerns

This section summarizes the major cross-cutting concerns that you may need to address when implementing a strategy for monitoring and managing a hybrid solution.

Performance

Monitoring a system and gathering diagnostic and auditing information will have an effect on the performance of the system. The solution you use should minimize this impact so as not to adversely affect your customers.

For diagnostic information, in a stable configuration, it might not be necessary to gather extensive statistics. However, during critical periods collecting more information might help you to detect and correct any problems more quickly. Therefore any solution should be flexible enough to allow tuning and reconfiguration of the monitoring process as the situation dictates.

The policy for gathering auditing information is unlikely to be as flexible, so you should determine an efficient mechanism for collecting this data, and a compact mechanism for transporting and storing it.

Security

There are several security aspects to consider concerning the diagnostic and auditing data that you collect:

Additionally, the monitoring, management, deployment, and maintenance tasks associated with the components that comprise your system are sensitive tasks that must only be performed by specified personnel. You should take appropriate steps to prevent unauthorized staff from being able to access monitoring data, deploy components, or change the configuration of the system.

Windows Azure and Related Technologies

Windows Azure provides a number of useful tools and APIs that you can employ to supervise and manage hybrid applications. Specifically you can use:

Note (Bharath says):
You can configure Remote Desktop access for the roles running your applications and services in the cloud. This feature enables you to access the Windows logs and other diagnostic information on these machines directly, by means of the same procedures and tools that you use to obtain data from computers hosted on-premises.

The following sections provide more information about the Windows Azure Service Management API and Windows Azure Diagnostics, and summarize good practice for utilizing them to support a hybrid application.

Monitoring Services, Logging Activity, and Measuring Performance in a Hybrid Application by Using Windows Azure Diagnostics

An on-premises application typically consists of a series of well-defined elements running on a fixed set of computers, accessing a series of well-known resources. Monitoring such an application requires being able to transparently trap and record the various requests that flow between the components, and noting any significant events that occur. In this environment, you have total control over the software deployed and the configuration of each computer. You can install tools to capture data locally on each machine, and you can combine this data to give you an overall view of how well the system is functioning.

In a hybrid application, the situation is much more complicated; you have a dynamic environment where the number of compute nodes (implementing web and worker roles) running instances of your services and components might vary over time, and the work being performed is distributed across these nodes. You have much less control of the configuration of these nodes as they are managed and maintained by the datacenters in which they are hosted. You cannot easily install your own monitoring software to assess the performance and well-being of the elements located at each node. This is where Windows Azure Diagnostics is useful.

Windows Azure Diagnostics provides an infrastructure running on each node that enables you to gather performance and other diagnostic data about the components running on these nodes. It is highly configurable; you specify the information that you are interested in, whether it is data from event logs, trace logs, performance counters, IIS logs, crash dumps, or other arbitrary log files. For detailed information about how to implement Windows Azure Diagnostics and configure your applications to control the type of information that Windows Azure Diagnostics records, see "Collecting Logging Data by Using Windows Azure Diagnostics" on MSDN.

Windows Azure Diagnostics is designed specifically to operate in the cloud. As a result, it is highly scalable while attempting to minimize the performance impact that it has on the roles that are configured to use it. However, the diagnostic data itself is held locally in each compute node being monitored. This data is lost if the compute node is reset. Also, the Windows Azure diagnostic monitor applies a 4GB quota to the data that it logs; if this quota is exceeded, information is deleted on an age basis. You can modify this quota, but you cannot exceed the storage capacity specified for the web or worker role. In many cases you will likely wish to retain this data, and you can arrange to transfer the diagnostic information to Windows Azure storage, either periodically or on demand. The topics "How to Schedule a Transfer" and "How to Perform an On-Demand Transfer" provide information on how to perform these tasks.
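The effect of a quota that deletes information on an age basis can be illustrated with a small Python sketch. This is not how the diagnostic monitor is implemented; the `QuotaBuffer` class and the tiny 10-byte quota are hypothetical and exist only to show that, once the quota is exceeded, the oldest entries are the first to be lost.

```python
from collections import deque


class QuotaBuffer:
    """Holds log entries up to a byte quota, discarding the oldest first."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self._entries = deque()   # oldest entry on the left

    def append(self, entry: bytes):
        self._entries.append(entry)
        self.used_bytes += len(entry)
        # Evict by age until the buffer fits inside the quota again.
        while self.used_bytes > self.quota_bytes and self._entries:
            oldest = self._entries.popleft()
            self.used_bytes -= len(oldest)

    def entries(self):
        return list(self._entries)


buffer = QuotaBuffer(quota_bytes=10)   # tiny quota for illustration only
for entry in (b"aaaa", b"bbbb", b"cccc"):
    buffer.append(entry)
print(buffer.entries())
```

The sketch makes the practical consequence visible: data that has not been transferred elsewhere before the quota is reached is no longer available.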

Note (Markus says):
Transferring diagnostic data to Windows Azure storage on demand may take some time, depending on the volume of information being copied and the location of the Windows Azure storage. To ensure best performance, use a storage account hosted in the same datacenter as the compute node running the web or worker role being monitored. Additionally, you should perform the transfer asynchronously so that it minimizes the impact on the response time of the code.
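The general idea of performing a slow transfer asynchronously can be sketched in Python as follows. The `transfer_diagnostics` function is a hypothetical stand-in that merely sleeps; it does not call Windows Azure Diagnostics or Windows Azure storage. The point is only that the request handler returns before the transfer completes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transfer_diagnostics(batch):
    """Hypothetical stand-in for copying a batch of diagnostic data to storage."""
    time.sleep(0.2)   # simulate a slow copy
    return len(batch)


def handle_request(executor, batch):
    """Start the transfer in the background and return without waiting for it."""
    future = executor.submit(transfer_diagnostics, batch)
    return "response sent", future


with ThreadPoolExecutor(max_workers=1) as executor:
    started = time.monotonic()
    response, future = handle_request(executor, ["entry-1", "entry-2", "entry-3"])
    response_time = time.monotonic() - started   # time taken to produce the response
    transferred = future.result()                # wait here only to show the outcome

print(response, transferred)
```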

Windows Azure storage is independent of any specific compute node and the information that it contains will not be lost if any compute node is restarted. You must create a storage account for holding this data, and you must configure the Windows Azure Diagnostics Monitor with the address of this storage account and the appropriate access key. For more information, see "How to Specify a Storage Account for Transfers." Event-based logs are transferred to Windows Azure table storage and file-based logs are copied to blob storage. The appropriate tables and blobs are created by the Windows Azure Diagnostics Monitor; for example, information from the Windows Event Logs is transferred to a table named WADWindowsEventLogsTable, and data gathered from performance counters is copied to a table named WADPerformanceCountersTable. Crash dumps are transferred to a blob storage container under the path wad-crash-dumps and IIS 7.0 logs are copied to another blob storage container under the path wad-iis-logfiles.
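The destinations named above can be summarized as a simple lookup, which may be convenient in tools that read the transferred data. In this Python sketch the table and container names come from the text, while the dictionary keys and the `destination_for` helper are labels invented for illustration.

```python
# Storage kind and name for each diagnostic source mentioned in the text.
# The keys are hypothetical labels; the table and container names are not.
DESTINATIONS = {
    "windows_event_logs": ("table", "WADWindowsEventLogsTable"),
    "performance_counters": ("table", "WADPerformanceCountersTable"),
    "crash_dumps": ("blob", "wad-crash-dumps"),
    "iis_logs": ("blob", "wad-iis-logfiles"),
}


def destination_for(source):
    """Return (storage kind, name) for a diagnostic source, or raise KeyError."""
    return DESTINATIONS[source]


for source in sorted(DESTINATIONS):
    kind, name = destination_for(source)
    print(f"{source}: {kind} storage, {name}")
```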

Guidelines for Using Windows Azure Diagnostics

From a technical perspective, Windows Azure Diagnostics is implemented as a component within the Windows Azure SDK that supports the standard diagnostic APIs. This component is called the Windows Azure diagnostic monitor, and it runs in the cloud alongside each web role or worker role that you wish to gather information about.

You can configure the diagnostic monitor to determine the data that it should collect by using the Windows Azure Diagnostics configuration file, diagnostics.wadcfg. For more information, see "How to Use the Windows Azure Diagnostics Configuration File." Additionally, an application can record custom trace information by using a trace log. Windows Azure Diagnostics provides the DiagnosticMonitorTraceListener class to perform this task, and you can configure this type of tracing by adding the appropriate <system.diagnostics> section to the application configuration file of your web or worker role. See "How to Configure the TraceListener in a Windows Azure Application" for more information.

If you are building a distributed, hybrid solution, using Windows Azure Diagnostics enables you to gather the data from multiple role instances located on distributed compute nodes, and combine this data to give you an overall view of your system. You can use System Center Operations Manager, or alternatively there is an increasing number of third-party applications available that can work with the raw data available through Windows Azure Diagnostics to present the information in a variety of easy-to-digest ways. These tools enable you to determine the throughput, performance, and responsiveness of your system and the resources that it consumes. By analyzing this information, you can pinpoint areas of concern that may impact the operations that your system implements, and also evaluate the costs associated with performing these operations.

The following list suggests opportunities for incorporating Windows Azure Diagnostics into your solutions:

Guidelines for Securing Windows Azure Diagnostic Data

Windows Azure Diagnostics requires that the roles being monitored run with full trust; the enableNativeCodeExecution attribute in the service definition file, ServiceDefinition.csdef, for each role must be set to true. This is actually the default value.

The diagnostic information recorded for your system is a sensitive resource and can yield critical information about the structure and security of your system. This information may be useful to an attacker attempting to penetrate your system. Therefore, you should carefully guard the storage accounts that you use to record diagnostic data and ensure that only authorized applications have access to the storage account keys. You should also consider protecting all communications between the Windows Azure storage service and your on-premises applications by using HTTPS.

If you have built on-premises applications or scripts that can dynamically reconfigure the diagnostics configuration for any role, ensure that only trusted personnel with the appropriate authorization can run these applications.

Deploying, Updating, and Restoring Functionality by Using the Windows Azure Service Management API and PowerShell

The Windows Azure Management Portal provides the primary interface for managing Windows Azure subscriptions. You can use this portal to upload applications, and to manage hosted services and storage accounts. However, you can also manage your Windows Azure applications programmatically by using the Windows Azure Service Management API. You can utilize this API to build custom management applications that deploy, configure, and manage your web applications and services. You can also access these APIs through the Windows Azure PowerShell cmdlets; this approach enables you to quickly build scripts that administrators can run to perform common tasks for managing your applications.

The Windows Azure SDK provides tools and utilities to enable a developer to package web and worker roles, and to deploy these packages to Windows Azure. Many of these tools and utilities also employ the Windows Azure Service Management API, and some are invoked by several of the Microsoft Build Engine (MSBuild) tasks and wizards implemented by the Windows Azure templates in Visual Studio.

Note (Markus says):
You can download the code for a sample application that provides a client-side command line utility for managing Windows Azure applications and services from Windows Azure ServiceManagement Sample.
You can download the Windows Azure PowerShell cmdlets at http://www.windowsazure.com/en-us/manage/downloads/.

Guidelines for Using the Windows Azure Service Management API and PowerShell

While the Management Portal is a powerful application that enables an administrator to manage and configure Windows Azure services, it is an interactive tool that requires users to have a detailed understanding of the structure of the solution, where the various elements are deployed, and how to configure the security requirements of these elements. It also requires that the user knows the Windows Live® ID and password associated with the Windows Azure subscription for your organization, and any user who has this information has full authority over your entire solution. If these credentials are disclosed to an attacker or another user with malicious intent, they can easily disrupt your services and damage your business operations.

The following scenarios include suggestions for mitigating these problems:

Guidelines for Securing Management Access to Windows Azure Subscriptions

The Windows Azure Service Management API ensures that only authorized applications can perform management operations, by enforcing mutual authentication using management certificates over SSL.

When an on-premises application submits a management request, it must provide, as part of the request, the key to a management certificate installed on the computer running the application. A management certificate must have a key length of at least 2048 bits and must be installed in the Personal certificate store of the account running the application. This certificate must also include the private key.

The same certificate must also be available in the management certificates store in the Windows Azure subscription that manages the web applications and services. You should export the certificate from the on-premises computer as a .cer file without the private key and upload this file to the Management Certificates store by using the Management Portal. For more information about creating management certificates, see the topic "How to: Manage Management Certificates in Windows Azure."

Note (Poe says):
Remember that the Windows Azure SDK includes tools that enable a developer to package web and worker roles, and to deploy these packages to Windows Azure. These tools also require you to specify a management certificate. However, you should be wary of letting developers upload new versions of code to your production site. This is a task that must be performed in a controlled manner and only after the code has been thoroughly tested. For this reason, you should either refrain from provisioning your developers with the management certificates necessary for accessing your Windows Azure subscription, or you should retain a separate Windows Azure subscription (for development purposes) with its own set of management certificates if your developers need to test their own code in the cloud.

More Information

All links in this book are accessible from the book's online bibliography available at: http://msdn.microsoft.com/en-us/library/hh968447.aspx.