12 Service Management and Reporting

Service Management:

Defining and treating systems as business services; understanding service availability, business criticality and continuity requirements; and measuring and reporting of service performance.

This chapter will help you:

• define technology services within your organisation

• evaluate how to measure and report service quality

Before we look at managing and reporting on technology services, we must ask ourselves a fundamental question. What is a technology service?

Within reason it could be whatever you want it to be as long as:

• it can be recognised by people throughout the organisation (not just inside the IT team)
• the service’s boundaries are clearly defined within the IT operation
• you can measure it and report on it.

I think of a service as a connected group of hardware and software components that deliver a business function or process. This might be an internal activity such as maintaining the organisation’s accounts, or a customer-facing online ordering system.

For example, a charity might define its services as (i) donors and donations, (ii) grant applications and awards, (iii) finance and accounts, (iv) marketing and promotions, (v) human resources, (vi) office collaboration, (vii) email, and (viii) internet portal.

Whilst these all interact in some way, they can each be seen as individual technology services supporting the charity’s business processes. The internet portal could be used as a channel by many of the other services.

The technology landscape in larger organisations is likely to be more involved. Delineating services will be easier if the application architecture is modular rather than monolithic. For example, a bank might define its access channels (including ATMs, internet banking, telephone banking and call centre) as separate services which provide alternative routes to business functions such as account balance, statements, and moving money.

• If telephone banking is unavailable, the customer may still be able to use internet banking.
• If the statements service is unavailable, the customer may still be able to perform other banking actions.

DEFINING SERVICES

Documented definitions set boundaries between the organisation’s technology services and bring clarity when discussing operational targets and delivery with non-IT people in the organisation.

A service definition should include a summary of what it is used for, who uses it, where it is accessed from, and when it is required to be available (more on that point later).

Information about criticality of the service and when the peak periods occur should also be recorded. This will be valuable when managing any service incidents. It is important to identify key individuals who will act in a ‘service ownership’ capacity, maintaining the relationship between IT and the relevant business area.

I don’t recommend putting details of the underlying hardware and software in the service definition as they will typically become out of date as things change. This information is best stored in a configuration database with references to the relevant service names.
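As a minimal sketch, the items listed above could be captured in a simple record. The class name, field names and sample values below are my own assumptions for illustration, not a prescribed schema. Hardware and software details are deliberately left out; entries in the configuration database would refer back to the service by its name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceDefinition:
    """Hypothetical record of what a service definition should cover."""
    name: str                    # recognisable across the organisation
    purpose: str                 # what the service is used for
    users: str                   # who uses it
    access_locations: List[str]  # where it is accessed from
    service_hours: str           # when it is required to be available
    criticality: str             # assumed scale: "high", "medium" or "low"
    service_owner: str           # key individual linking IT and the business area
    peak_periods: List[str] = field(default_factory=list)


# Illustrative values only, loosely based on the charity example
donations = ServiceDefinition(
    name="Donors and donations",
    purpose="Record donors and process donations",
    users="Fundraising team",
    access_locations=["Head office"],
    service_hours="Mon-Fri 08:00-18:00",
    criticality="high",
    service_owner="Head of Fundraising",
    peak_periods=["December appeal"],
)
```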

SERVICE EXPECTATIONS

Every technology system you deliver will have been built for a business reason and part, or all, of the organisation will rely on it. There will be expectations about when services should be available, prompting a negotiation to agree service hour commitments.

Will it be needed between 8am and 6pm on Mondays to Fridays? What about weekends? Will customers access it via the internet at any time throughout the entire week? Night and day?

You will often receive a request for 24 by 7 service. This may be a lazy response, one not supported by a true business need. If round-the-clock service is really required, you must stand firm and negotiate scheduled downtime periods when system changes and software maintenance can be applied.

Once you have ascertained the service hours, you can move on to discuss reliability expectations and set some availability targets. Before you launch into a debate about 99.9% availability, take a moment to consider what measures would be most useful and what targets would be realistic.

Everybody wants technology to run without interruption but some degree of imperfection is inevitable. Would it be preferable to suffer one large incident and be out of action for a whole day, or to have a half hour service interruption every week? While neither is ideal, you should consider parameters like these when devising your service availability targets.

Don’t set up service availability targets blindly. Be aware that 99.9% availability for a 24 by 7 service boils down to about 10 minutes of downtime per week, or one outage of roughly 40 minutes every month. You don’t need much of an incident to blow your target. Make sure you explore the arithmetic before engaging in discussions about service level targets.
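The arithmetic is easy to explore with a few lines of code. This is only a sketch; the function name is my own, and the 30-day month and the business-hours week (8am to 6pm on weekdays plus a Saturday morning, 54 hours) are assumptions chosen for illustration.

```python
def downtime_budget_minutes(availability_pct: float, service_hours: float) -> float:
    """Minutes of downtime permitted over a period containing
    `service_hours` of agreed service at the given availability target."""
    return service_hours * 60 * (1 - availability_pct / 100)


# 24 by 7 service at 99.9%
weekly = downtime_budget_minutes(99.9, 24 * 7)     # about 10 minutes
monthly = downtime_budget_minutes(99.9, 24 * 30)   # about 43 minutes (30-day month)

# The same target for a service with 54 agreed hours per week
business_week = downtime_budget_minutes(99.9, 54)  # a little over 3 minutes

print(round(weekly, 1), round(monthly, 1), round(business_week, 1))
```

Note how the budget shrinks as the agreed service hours shrink: the same percentage target is much harder to hit for a business-hours service.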

SERVICE QUALITY REPORTING

What will you report back to the functional heads in your organisation? A percentage availability figure is the most obvious starting place, but how will you measure it?

You will need a reliable calculation of the maximum possible service hours for every service and a log of all outage times. Consider monthly reporting for a service that is used from 8am to 6pm Monday to Friday, and from 8am to 12 noon on a Saturday. Every month has a different number of working days, and it may have four or five Saturdays. What about Bank Holidays?

As you can imagine, this can quickly become a cottage industry.
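To give a flavour of the calculation, here is a minimal sketch for the example service (8am to 6pm Monday to Friday, 8am to 12 noon on Saturday). The function names are hypothetical, and treating Bank Holidays as days with no agreed service is an assumption on my part; your organisation may decide otherwise.

```python
import calendar
from datetime import date

# Agreed service hours per weekday (Monday=0 ... Sunday=6).
# Sunday is absent, so it contributes no service hours.
HOURS_BY_WEEKDAY = {0: 10, 1: 10, 2: 10, 3: 10, 4: 10, 5: 4}


def max_service_hours(year, month, holidays=frozenset()):
    """Maximum possible service hours in a calendar month.
    Assumption: dates in `holidays` count as zero-hour days."""
    days_in_month = calendar.monthrange(year, month)[1]
    total = 0
    for day in range(1, days_in_month + 1):
        d = date(year, month, day)
        if d in holidays:
            continue
        total += HOURS_BY_WEEKDAY.get(d.weekday(), 0)
    return total


def availability_pct(max_hours, outage_minutes):
    """Percentage availability given the logged outage minutes."""
    return 100 * (1 - outage_minutes / (max_hours * 60))


# February 2021 has exactly four weeks; March 2025 has five Saturdays
print(max_service_hours(2021, 2), max_service_hours(2025, 3))
```

Even this simple version ignores partial outages and slow performance, which is where the real effort tends to go.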

What about the outage itself – was the entire service unavailable? Or was the damage limited to one particular function, leaving the rest of the system working as normal?

How should one account for periods of slow performance? The service may be available, but sluggish response times still have an impact on user efficiency.

Another way of representing service to the user community is by reporting the number of uninterrupted days’ service there have been. You may have seen something similar on building sites or in industrial operations – they often report the number of consecutive accident-free days as an indicator of the site’s safety record.

This concept could be used to give a high-level view of the quality of service IT has delivered. For more complex organisations, the reporting could be presented at a divisional level by grouping relevant services into baskets for each part of the company.

In a previous role, I was responsible for producing monthly availability statistics. This included reporting on the number of ‘clean days’ experienced across the organisation. For certain business areas, a small set of the most important services were chosen to represent IT service quality for the month. Incidents that affected any of the selected services were checked against an agreed list of criteria to see whether they triggered an ‘unclean’ day.

Where the criteria were very black and white (a full service outage of more than 15 minutes, for example) this was relatively straightforward. However, some criteria, such as periods of slow response, were more subjective and prone to lengthy debate.
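The black-and-white criterion mentioned above (a full outage of more than 15 minutes on one of the selected services) is simple to express in code. The sketch below is illustrative only: the `Incident` record and function names are my own assumptions, and the subjective criteria such as slow response are deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Incident:
    """Hypothetical incident log entry."""
    day: date
    service: str
    full_outage: bool
    duration_minutes: int


def is_unclean(incident, selected_services, threshold_minutes=15):
    """True if the incident triggers an 'unclean' day under the single
    criterion modelled here: a full outage longer than the threshold
    on one of the selected services."""
    return (
        incident.service in selected_services
        and incident.full_outage
        and incident.duration_minutes > threshold_minutes
    )


def clean_days(start, end, incidents, selected_services):
    """List the clean days between `start` and `end` inclusive."""
    unclean = {i.day for i in incidents if is_unclean(i, selected_services)}
    total_days = (end - start).days + 1
    all_days = [start + timedelta(days=n) for n in range(total_days)]
    return [d for d in all_days if d not in unclean]
```

A partial outage, or a full outage on a service outside the selected set, leaves the day clean under this rule, which is exactly the sort of point that needs agreeing up front.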

There is no perfect way of reporting service performance. It’s all flawed in one way or another. So, pick the best option for you and your business. And keep it simple.

THE COST OF DOWNTIME

When discussing service delivery with the various areas of the organisation it helps if you talk their language. Putting availability in business terms goes some way towards recognising the impact of downtime in pound notes, though I recommend caution when doing so.

Many factors can contribute to the cost of downtime, some being more tangible than others:

• production losses – an outage could result in fewer products being made that day
• staff costs – either in terms of wasted hours, or overtime payments to catch up on work that didn’t get done
• business fines – regulator penalties for missing deadlines or failing to meet standards (a data protection breach, for example)
• short term business loss – customers placed orders with competitors simply because your IT service was down at the time of purchase
• longer term business loss – customers will move elsewhere if your systems are persistently unreliable or insecure
• brand image – you will not attract new business if your brand conjures up an image of unreliability and poor service

You could arrive at a rule-of-thumb cost per hour of downtime by using some of the more tangible items from this list. The question is whether this will be deemed reasonable by the heads of the functional business areas. You might be challenged about how your cost of downtime takes into account variation by day of the week, or peak hour activity during each day, or seasonal peak periods.

• An online betting system being down for an hour on the morning of the annual Grand National horse race would represent a much greater loss of trade than it would on a lot of other Saturdays during the year.
• A failure in an inter-bank payments system would cause more financial damage later in the afternoon than it would at 9:30 in the morning because of agreed transfer deadlines between the clearing banks.
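A rule-of-thumb figure built from the more tangible items might look something like the sketch below. The function names, the choice of inputs and the single peak multiplier are all assumptions for illustration, and the numbers in the usage lines are made up. A crude multiplier is about as far as I would go in recognising peak periods.

```python
def cost_per_hour(lost_output_per_hour, staff_affected, staff_hourly_cost):
    """Rule-of-thumb hourly cost from two tangible items:
    lost production plus the wasted time of affected staff."""
    return lost_output_per_hour + staff_affected * staff_hourly_cost


def outage_cost(hours, base_cost_per_hour, peak_multiplier=1.0):
    """Estimated cost of an outage. `peak_multiplier` is a single crude
    adjustment for peak periods, not a detailed model."""
    return hours * base_cost_per_hour * peak_multiplier


# Made-up figures purely to show the shape of the calculation
base = cost_per_hour(lost_output_per_hour=2000, staff_affected=50, staff_hourly_cost=30)
normal_day = outage_cost(2, base)
peak_day = outage_cost(2, base, peak_multiplier=3)
print(base, normal_day, peak_day)
```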

All outages are equal, but some outages are more equal than others.

And this is where I recommend caution. It is very easy to get drawn into a swamp by attempting to satisfy all your organisation’s business quirks. Trying to address the myriad of special circumstances opens you up to more and more questions which only serve to make the list longer and the whole task too complicated.