Information Technology Infrastructure Library (ITIL): September 2011

Thursday, September 22, 2011

ITIL Problem Management

Problem Management is structured to address the causes of incidents that pose the greatest (negative) risk. It therefore focuses on the heavy-hitting, recurring, service-affecting events; it doesn’t find the root cause or a permanent fix for every incident. Success is measured in terms of what has been removed from the environment:

§ How many problems are identified and removed from our IT environment.

§ Problems which have a status of resolved and closed.

So let’s walk through the problem management process.

First an incident occurs. An incident is any unplanned outcome from the operation of an information system. Incidents interrupt the IT service which the customer receives. Incidents are normally reported to the service desk, and an incident record is created.

Next, the incident is assessed. If the cause of the incident isn’t known, then the incident is escalated to a problem. A problem is an incident whose cause is not known.

As the problem is reviewed, the cause of the problem and a workaround may be determined. As soon as both are established, the problem is changed to a known error.

Finally, the known error is assessed to determine whether the symptoms of the incident match an existing problem record. If so, the new incident is cross-referenced to that problem.

However, if the known error doesn’t match any existing symptoms, a new problem record is created.
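The walk-through above can be sketched as a small state model. This is only an illustration: the `Record` type, its field names, and the exact-match comparison on symptoms are assumptions, not part of ITIL or of any particular service management tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    INCIDENT = "incident"
    PROBLEM = "problem"
    KNOWN_ERROR = "known error"


@dataclass
class Record:
    """Hypothetical record type; field names are illustrative only."""
    symptoms: str
    cause: Optional[str] = None
    workaround: Optional[str] = None
    status: Status = Status.INCIDENT
    linked_incidents: List[str] = field(default_factory=list)


def assess(record: Record) -> Record:
    """Escalate an incident to a problem when its cause is not known."""
    if record.status is Status.INCIDENT and record.cause is None:
        record.status = Status.PROBLEM
    return record


def review(record: Record, cause: str, workaround: str) -> Record:
    """Once a cause and a workaround exist, a problem becomes a known error."""
    record.cause = cause
    record.workaround = workaround
    if record.status is Status.PROBLEM:
        record.status = Status.KNOWN_ERROR
    return record


def match_or_create(incident_id: str, symptoms: str,
                    problems: List[Record]) -> Record:
    """Cross-reference an incident to a matching problem record, or open a new one.

    Assumption: symptoms match only when the text is identical. A real tool
    would use a much looser comparison.
    """
    for problem in problems:
        if problem.symptoms == symptoms:
            problem.linked_incidents.append(incident_id)
            return problem
    new_problem = Record(symptoms=symptoms, status=Status.PROBLEM,
                         linked_incidents=[incident_id])
    problems.append(new_problem)
    return new_problem
```

A record starts as an incident, `assess` turns it into a problem, and `review` turns it into a known error; `match_or_create` covers the final cross-referencing step.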

The terms incident, problem, and known error describe the effects and root causes of unexpected events in an information system. Identifying the cause of these events and minimizing their impact is the primary purpose of the problem management process.

The goal of problem management activities is to ascertain the root causes of incidents and to minimize their impact on the business operations of a company. This is done through the following processes:

§ Problem control - The purpose of problem control is to identify problems within an IT environment and to record information about those problems. Problem control identifies the configuration items at the root of a problem and provides the service desk with information on workarounds.

§ Error control - The purpose of error control is to keep track of known errors and to determine the resource effort needed to resolve them. Error control monitors known errors and removes them when it's feasible and worthwhile.

§ Proactive problem management - The purpose of proactive problem management is to find potential problems and errors in an IT infrastructure before they cause incidents. Stopping incidents before they occur provides improved service to users.

The primary measure of the success of the problem management process is how many problems are identified and removed from an IT infrastructure. Therefore, the primary output of this IT service management process is problems that are resolved and closed.

The work of problem management produces the following outcomes:

§ Records of known errors and available workarounds - These records are kept in the configuration management database (CMDB), and they provide information to the service desk and other ITSM processes.

§ Requests for change (RFCs) - RFCs describe changes needed to remove a known error. Problem management does not approve or perform the change. RFCs are sent to another ITSM process, change management.

§ Changed records in the CMDB - Information about a known error and any affected CIs is forwarded to the configuration management process, the IT service management process that maintains the CMDB.

When the problem management process is used to identify the root causes of problems, it's far more likely that they will be diagnosed correctly and fixed properly. As a result, problems are permanently eliminated.

Problem management includes the following two types of approaches to address problems:

§ Reactive problem management - Reactive problem management seeks to cure the symptoms of problems. The reactive approach responds to reports of incidents that have already occurred. Reactive problem management can be viewed as two groups of activities:

§ Problem control activities - The major problem control activities are:

§ Identification and recording - Problem management receives information about reported incidents from the incident management process and the service desk. Members of the problem management team analyze this information, looking for similarities in the symptoms of reported incidents. They look for records of previously identified problems that can explain the symptoms. If none can be found, a record describing a new problem is created.

§ Classification - This control activity identifies the importance of new problems and designates resources for addressing them.

Problems are classified by category, such as hardware, software, or other types. Then they can be assigned to the corresponding support personnel. Problems are also classified by priority ranking. Problems with higher priority rankings are addressed before problems with lower priority rankings.

§ Investigation and diagnosis - Problem management teams look for the root cause of problems. If the cause is determined, problem management recommends a workaround or a temporary fix for the problem.

§ Identify cause of problem and devise a workaround - In the automated service management system, the status of the problem is changed to that of a known error.

When an IT department applies problem control activities, it prioritizes the problems that present the biggest threat to the information system or the company's ability to conduct business. When the root cause of a problem has been found and a workaround has been devised, problem control activities end. Then the second group of activities in reactive problem management begins.
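The classification step can be illustrated with a short sketch. The team names, the category-to-team mapping, and the convention that priority 1 is the most urgent are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Problem:
    problem_id: str
    category: str  # e.g. "hardware", "software", or another type
    priority: int  # assumption: 1 is the highest priority


# Hypothetical mapping from problem category to support personnel.
SUPPORT_TEAMS: Dict[str, str] = {
    "hardware": "hardware support",
    "software": "application support",
}


def assign_team(problem: Problem) -> str:
    """Route a problem to the team for its category, with a fallback."""
    return SUPPORT_TEAMS.get(problem.category, "general support")


def work_queue(problems: List[Problem]) -> List[Problem]:
    """Order problems so that higher priority rankings are addressed first.

    sorted() is stable, so problems with equal priority keep arrival order.
    """
    return sorted(problems, key=lambda p: p.priority)
```

For example, `work_queue` places a priority 1 software problem ahead of a priority 3 hardware problem regardless of the order in which they were recorded.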

§ Error control activities - Now the problem becomes a known error in the IT infrastructure, and error control activities begin. Error control activities include:

§ Error identification and recording - This means creating a record that identifies a known error and all the configuration items (CIs) that cause the error or are affected by it.

§ Error assessment - This activity prioritizes errors and places them into groups according to their importance.

§ Error resolution recording - The resolution to a known error may include changes to hardware or software, user training, or operational procedures. Error control creates a request for change (RFC) and forwards it to change management. The RFC is cross-referenced to the known error in the automated service management system.

§ Error resolution monitoring - Changes are planned and implemented by other IT service management processes. Problem management monitors the effect of problems on service provided to users and the progress of requested changes until they're complete.

§ Error closure - The final error control activity is error closure. When recommended changes to fix a known error have been completed, the known error record in the service management system can be closed. Records of incidents and problems associated with that known error may also be closed.

§ Proactive problem management - Proactive problem management seeks to inoculate IT systems against problems. The proactive approach identifies potential problems before they emerge.

§ Trend analysis - This is the process of examining problem and incident reports to discover what types of problems are happening more frequently. Trend analysis of existing problems and incidents can reveal where similar problems may occur in other places within the infrastructure. It can also show that repeated failures have not been adequately resolved and are likely to continue to happen.

§ Targeting preventative action - This process applies the same techniques used in reactive problem management to a select few potential problems with a high degree of business impact. Targeting preventative action may include creating RFCs, training users and service desk team members, or recommending procedural changes within the IT department.

The three groups of problem management activities (problem control, error control, and proactive problem management) identify and resolve the problems that have the greatest potential impact on a company's business.
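Trend analysis, as described under proactive problem management, amounts to counting how often similar incidents recur. Here is a minimal sketch, assuming each incident is reduced to a (configuration item, symptom) pair and that three or more occurrences is worth flagging; both choices are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple

IncidentKey = Tuple[str, str]  # (configuration item, symptom), an assumed shape


def recurring_failures(incidents: Iterable[IncidentKey],
                       threshold: int = 3) -> List[Tuple[IncidentKey, int]]:
    """Return the incident types seen at least `threshold` times, most frequent first."""
    counts = Counter(incidents)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]
```

The flagged entries are candidates for targeted preventative action, since repeated failures suggest the underlying problem has not been adequately resolved.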

The success of problem management depends on having the right people performing the right actions. Responsibility for leading the problem management process is assigned to one person designated as the problem manager. The roles of the problem manager are:

1. To maintain and develop problem control activities - It's the problem manager's job to make sure that information about incidents within the system is being received and reviewed in a systematic way.

2. To monitor the effectiveness of error control activities and make recommendations for improvement - The problem manager must also ensure that relationships among configuration items are considered in proposed solutions to problems.

3. To cascade information about workarounds or fixes to those who need it - Communication with the service desk and incident management is a key role performed by the problem manager.

4. To monitor the progress of problems and known errors toward a final resolution - If solutions aren't implemented as quickly as necessary, the problem manager may follow procedures to escalate the priority of the problem.

Each of these four roles contributes to the ability of problem management to identify and resolve problems and known errors quickly. The problem manager will also perform typical supervisory roles to direct the activities of any other problem management team members.

The problem manager's duties should never be combined with the duties of the service desk supervisor. The priorities of the service desk and problem management are often incompatible.

The success of problem management also relies on critical factors before, during, and after the main activities in the problem management process. The critical factors for success are:

§ Performance targets - It's important to decide how the performance of problem management will be measured before the process is implemented. If possible, use statistics from the previous support activities to set goals for problem management.

§ Periodic audits - Perform periodic audits to determine whether problem management procedures are being followed. Problems that aren't properly reported or investigated are more likely to cause interruptions of service to users or a major impact on the business.

§ Problem reviews - Conduct major problem reviews after problems with high urgency or impact have been resolved. Look for ways to improve the way problems are identified and resolved. Problem management procedures should be continually improved.

Problem management will succeed when an effective problem manager fills the required roles, and critical factors for the success of the process are included in everyday operating procedures.

Implementing problem management brings many benefits to a company and its IT department. However, there are also some problems and costs that arise during the implementation of problem management.

Among the most common problems companies experience is a difficulty establishing adequate communication between problem management and another IT service management process, incident management. Communication between the two can be difficult because they pursue the following conflicting goals:

§ Problem management - The goal of problem management is to investigate the root cause of a problem. The speed with which a solution is found is an important, but secondary, consideration.

§ Incident management - The goal of incident management is to recover from incidents and restore service to users as quickly as possible. Determining the cause of a problem is less important.

Companies also often have difficulty establishing lines of communication between the software development process and problem management. Programmers and developers are frequently aware of known errors in the software they create, but they can be reluctant to identify them.

In many companies, employees resist new procedures. Many companies report that employees cling to previous informal problem management methods. It takes time for employees to accept the discipline of problem management.

Companies should expect to incur some costs with the implementation of problem management. However, it isn't necessary to create a vast problem management process that's capable of handling every single problem that arises. As a result, the incremental costs of problem management are negligible. The hardware and software tools needed are shared with other IT service management processes, and the additional personnel costs are small.

Problems and costs arise frequently during the introduction of problem management. However, the problems and costs are manageable and bring worthwhile improvements in the performance of the IT infrastructure.

Problem management seeks to identify the underlying causes of incidents in an IT infrastructure and to remove those causes. The problem management process addresses the causes of incidents reactively and proactively.

Incident Management

You know the call. We have all received it: the break-fix call. It can come at any time of day, and it tells you three main things: someone's unhappy, someone has to handle this, and surely there is a better way to manage this aspect of support. The good news is that ITIL has best practice guidelines for dealing with incident management.

Looking at an incident from a high level, there is a pattern of actions that can be taken to resolve it. Like all other processes, incident management has inputs, outputs, and management activities.

The parts of an incident management process are:

§ Inputs - Inputs are key to the process. Incident details are received from the service desk, network, or computer operations. Inputs take many forms: break-fix issues, service requests, and automatic monitoring alerts.

§ Outputs - The most obvious outputs of incident management are the closed incident and restored application availability. At a higher level, the outputs also include user satisfaction, improved productivity for all, customer follow-up and communication, incident report documentation, and management information.

§ Incident Management Activities - Incident management activities are detection and reporting; classification and initial support; investigation and diagnosis; resolution and recovery; incident closure; and incident ownership, monitoring, tracking, and communication.

Just to review the flow of events: most customer and user incidents are initially reported to the service desk. This gives the service desk ownership of handling and tracking the incident from beginning to end, even though the work may be completed in coordination with other departments.

The activities of an incident management process are:

§ Incident detection and reporting - Incident detection and reporting is the act of learning an incident has occurred and recording the basic details related to it.

§ Classification and initial support - Classification and initial support covers categorizing the incident by matching it against the knowledge base of issues, assigning a priority, assessing whether it is related to configuration details, providing initial support, and then either closing the incident or routing it to a specialist group.

§ Investigation and diagnosis - Investigation and diagnosis covers assessing the incident details, collecting and analyzing information about the incident and its resolution, and then routing the incident to line support.

§ Resolution and recovery - Resolution and recovery covers resolving the incident by applying a solution or workaround, or by raising a request for change.

§ Incident closure - Incident closure is the act of confirming the resolution with the reporter of the incident and closing the incident.

§ Incident ownership, monitoring, tracking, and communications - These are the activities that surround monitoring the incident, escalating it, and informing the user of the latest status, key accomplishments, and next steps.
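The order of the activities above can be sketched as a table of allowed transitions. The stage names and the transition table are assumptions drawn from the descriptions in the list, not an official ITIL state model.

```python
from enum import Enum
from typing import Dict, Set


class Stage(Enum):
    DETECTED = "detected and reported"
    CLASSIFIED = "classified, initial support given"
    INVESTIGATING = "investigation and diagnosis"
    RESOLVED = "resolved and recovered"
    CLOSED = "closed"


# Assumed transitions. Initial support may close the incident directly
# or route it to a specialist group for investigation.
ALLOWED: Dict[Stage, Set[Stage]] = {
    Stage.DETECTED: {Stage.CLASSIFIED},
    Stage.CLASSIFIED: {Stage.INVESTIGATING, Stage.CLOSED},
    Stage.INVESTIGATING: {Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move an incident to the next stage, rejecting out-of-order moves."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Ownership, monitoring, tracking, and communication are not stages in this sketch because they run alongside every stage rather than after any one of them.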

Imagine how helpful this would be if it were in place. Wouldn't anyone work toward this ideal?

To elaborate on this point a little further, as important as roles and responsibilities are to any effective plan, the tools to get the job done are just as important. You simply need the right tools to be able to work effectively.

Tools commonly used in incident management are:

§ Automatic incident logging and alerting - This tool can automatically log incidents and alert support personnel in the event of fault detection on mainframes, networks, servers, and possibly through an interface to system management tools.

§ Automatic escalation facilities - Automatic escalation facilities help ensure the timely handling of incidents and service requests. Imagine automatic notification instead of constantly checking a group's queue in a worklist.

§ Highly flexible routing of incidents - This is a requirement; when control staff members are located in multiple sites or collocated in an operational bridge, the incident calls can be routed efficiently and effectively.

§ Automatic extraction of data records - Automatic extraction of the data records of a failed item and affected items from the configuration management database (CMDB) is helpful.

§ Specialized software - This software improves the speed and effectiveness of incident handling. BMC is an ideal system. It can help with very accurate classification of incidents and successful matching at the point of alert.

§ Telephone systems integration - Telephone systems integration can be used to automatically register the names and phone numbers of users.

§ Diagnostic tools - These tools can assist with the diagnostic process so that the support staff can more quickly diagnose the source of incidents.
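As a rough illustration of what an automatic escalation facility checks, the sketch below compares an incident's age with a target time for its impact code. The impact codes and the target durations are hypothetical; in practice they would come from the service level agreements.

```python
from datetime import datetime, timedelta
from typing import Dict

# Hypothetical resolution targets per impact code.
RESOLUTION_TARGETS: Dict[str, timedelta] = {
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def needs_escalation(opened_at: datetime, impact: str, now: datetime) -> bool:
    """Return True when an open incident has exceeded the target for its impact code."""
    return now - opened_at > RESOLUTION_TARGETS[impact]
```

A scheduler could run this check over each group's open incidents and notify the group, which replaces the manual queue checking described above.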

One constant refrain is that you can't manage what you cannot measure. Normally the incident manager is accountable and responsible for reporting the performance of the incident management process. Reporting requires clearly defined objectives with measurable targets that can provide performance information.

Common metrics used to report the effectiveness and efficiency of the incident management process are:

§ Incident volume refers to the total number of incidents that are handled by the incident management process.

§ Mean elapsed time shows how much time was taken to achieve incident resolution or circumvention. The time is broken down by impact code.

§ Incident response time refers to the percentage of incidents handled within the agreed upon response times, which may have been specified in service level agreements by impact code, for example.

§ Average incident cost refers to the average cost of each incident.

§ The percentage of incidents closed refers to the percentage of incidents closed by the service desk without reference to other levels of support.

§ The number and percentage of incidents resolved remotely refers to those incidents that were taken care of off-site, with no physical visit.
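The metrics listed above can be computed from a list of closed incident records. This is a minimal sketch: the `Incident` fields and the report keys are assumptions chosen to mirror the list, not the schema of any real tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical closed-incident record."""
    impact: str                   # impact code, e.g. "high" or "low"
    hours_to_resolve: float       # elapsed time to resolution or circumvention
    within_target: bool           # handled within the agreed response time
    closed_by_service_desk: bool  # closed without other levels of support
    resolved_remotely: bool       # no physical visit was needed
    cost: float


def percentage(part: int, whole: int) -> float:
    return 100.0 * part / whole if whole else 0.0


def report(incidents: List[Incident]) -> Dict[str, object]:
    total = len(incidents)
    hours_by_impact: Dict[str, List[float]] = {}
    for inc in incidents:
        hours_by_impact.setdefault(inc.impact, []).append(inc.hours_to_resolve)
    remote = sum(inc.resolved_remotely for inc in incidents)
    return {
        "incident_volume": total,
        "mean_elapsed_hours_by_impact": {
            impact: mean(hours) for impact, hours in hours_by_impact.items()
        },
        "pct_within_response_target": percentage(
            sum(inc.within_target for inc in incidents), total),
        "average_incident_cost": mean(inc.cost for inc in incidents) if total else 0.0,
        "pct_closed_by_service_desk": percentage(
            sum(inc.closed_by_service_desk for inc in incidents), total),
        "resolved_remotely": remote,
        "pct_resolved_remotely": percentage(remote, total),
    }
```

Breaking the mean elapsed time down by impact code keeps a few slow low-impact incidents from hiding how quickly high-impact ones are resolved.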

The relationships between the incident management process and other IT Service Management processes are:

§ The configuration management database defines the relationships among resources, services, users, and service levels. For example, let's say a server fails. With the configuration management database, all existing processes, applications, and interfaces would be documented, so downstream effects would be noted immediately.

§ Problem management provides information about problems, known errors, workarounds, and quick fixes.

§ Change management yields information about scheduled changes and their status.

§ Service level management monitors the service level agreements with the customer about the support to be provided.

§ Availability management measures the aspects of the availability of services and uses the incident records and the status monitoring provided by configuration management.

§ Capacity management assures that storage capacity matches the evolving demands of the business. It is concerned with incidents that relate to this objective, such as incidents caused by a shortage of disk space or slow response time.
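The server failure example under the configuration management database can be sketched as a walk over recorded dependencies. The item names and the shape of the `DEPENDENTS` mapping are hypothetical; a real CMDB would expose these relationships through its own interface.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical CMDB extract: each configuration item maps to the items
# that depend on it directly.
DEPENDENTS: Dict[str, List[str]] = {
    "server-01": ["database", "file-share"],
    "database": ["order-app"],
    "order-app": ["billing-interface"],
}


def downstream_effects(failed_ci: str,
                       dependents: Dict[str, List[str]]) -> Set[str]:
    """Return every item affected, directly or indirectly, by a failed item."""
    affected: Set[str] = set()
    queue = deque([failed_ci])
    while queue:
        ci = queue.popleft()
        for dependent in dependents.get(ci, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    affected.discard(failed_ci)  # guard against circular dependencies
    return affected
```

With this data, a failure of `server-01` reaches the database, the file share, the order application, and the billing interface.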

A point to remember, and even reinforce, is that the incident management process is interwoven with the other IT service management processes. The processes work only as long as they support each other.

Finally, there are some common barriers, in the form of costs and problems, to implementing an incident management process.

The common costs are implementation and operating costs, as is standard with almost any implementation. Implementation costs cover training, the tools needed, process and workflow definition, and the resources expended in the implementation. Operating costs are the continuing maintenance license fees and the operating resources expended.

Some of the common recurring problems that affect all organizations are:

§ Users and IT staff bypassing incident management procedures - This results in the IT organization not obtaining important information about the service level and the number of errors.

§ Incident overload and backlog - This overload makes it difficult to record incidents effectively. Escalations may occur if incidents are not resolved quickly enough.

§ Incomplete service catalogs and service level agreements - These documents define the time in which an incident or request for service needs to be solved or escalated. If they don't exist or are incomplete, the caller may not be able to get the issue resolved, and get back online, as quickly as possible.

§ Lack of commitment - This is a problem because effective incident management requires real staff commitment, not just involvement.