Business Continuity and Resilience Engineering: How Organizations Prepare to Survive Disruptions to Vital Digital Infrastructure THESIS



Business Continuity and Resilience Engineering: How Organizations Prepare to Survive Disruptions to Vital Digital Infrastructure

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By
Jessica Romine
Graduate Program in Industrial and Systems Engineering

The Ohio State University
2012

Master's Examination Committee:
Professor David Woods, Advisor
Professor Phil Smith

Copyright by Jessica Romine 2012

Abstract

This paper explores business continuity and resilience engineering by examining exercises one socio-technical organization used to assess the consequences of breakdowns in vital digital infrastructure. The first exercise, an implementation of disaster recovery plans, was conducted with practitioners and technologists at the sharp end of the organizational hierarchy. The second exercise, a risk-based scenario to test large-scale business continuity (aggregate execution of multiple disaster recovery plans), was conducted with administrators and business leaders at the blunt end of the organizational hierarchy. Together, these two exercises compose the organization's processes to build resilience. The results contrast how an organization implements disaster recovery tests at smaller, more manageable scales with the use of risk-based simulations at larger scales to prepare to respond to challenge events. The results also demonstrate how a company can build resilience by preparing to manage disruptions to minimize downtime and quickly restore normal business operations.

Acknowledgments

I would like to sincerely thank the members of CSEL for the many discussions, both related and unrelated to my thesis. I thank Dr. Matthieu Branlat for attending the risk-based scenario with me; his collaboration helped develop many ideas found in this paper. I thank my advisor, Dr. David Woods, for enlightenment from the very first class and for broadening my understanding of the engineering field. Finally, I thank my family and friends for playing instrumental roles in the completion of this milestone.

Vita

August Teays Valley High School
B.S. Industrial and Systems Engineering, The Ohio State University

Fields of Study

Major Field: Industrial and Systems Engineering

Table of Contents

Abstract
Acknowledgments
Vita
List of Tables
List of Figures
Chapter 1: Introduction
  Overview of Exercises and Observations
  Business Continuity and Vital Digital Infrastructure
    Vignette: Single Server with Two Database Infrastructure
    Vignette: Single Server with Recovery Strategy
    Vignette: Single Datacenter Infrastructure
    Vignette: Multiple Datacenter Infrastructure
    Introduction to Complex Adaptive Systems
  Resilience
Chapter 2: Study Methodology
  2.2 Disaster Recovery Exercise
    Introduction and Background
    Methodology
    Data
    Results
    Discussion
  Risk-based Scenario Exercise
    Introduction and Background
    Methodology
    Data
    Results
    Discussion
Chapter 3: Discussion: Study and Resilience
  Recommendations for Effective Exercises
    Simply Conducting Exercises
    Test Environment versus Real World
    Appreciate Interdependencies
  Vital Digital Infrastructure Modeled as an Adaptive System
    Adaptive Capacity
    3.2.2 Base Adaptive Capacity
    Second Order/Extra Adaptive Capacity
  Three Ways Adaptive Systems Fail
    Decompensation
    Working Across Purposes
    Getting Stuck in Outdated Behaviors
Chapter 4: Potential Future Work and Conclusions
  Study Challenges
  Future Directions
  Conclusion
References

List of Tables

Table 1: Cognitive Task Analysis tools for Phases of Study
Table 2: Prompt Summary per Timeframe
Table 3: Comments Related to System Premise "Perspectives"
Table 4: Comments Related to System Premise "Cross Scale Interactions"
Table 5: Comments Related to System Premise "Emergence"
Table 6: General Themes from Organization Results
Table 7: Lesson Learned from Organization Business Units
Table 8: Three Ways Adaptive Systems Fail (adapted from Woods & Branlat, 2011)

List of Figures

Figure 1: Example of Single Server with Two Database Infrastructure
Figure 2: Example of Single Server with Two Database Infrastructure with Recovery Database
Figure 3: Simple Network Infrastructure Example Expands to Datacenter
Figure 4: Single Datacenter Expands to Multi Datacenters Managed by One Organization
Figure 5: Recovery Event Preparation Resource Team Coordination Map
Figure 6: Recovery Event Infrastructure Shutdown Resource Team Coordination Map
Figure 7: Recovery Event Infrastructure Startup Resource Team Coordination Map
Figure 8: Focus Group Analysis Example
Figure 9: Blunt versus Sharp End (Adapted from Woods & Hollnagel, 2006)
Figure 10: Blunt versus Sharp End with Exercises Labeled
Figure 11: Organization Governs Blunt and Sharp Ends
Figure 12: Stress-Strain State Space with Restructuring (Woods & Wreathall, 2008)
Figure 13: Organization Acute versus Chronic Goal Tradeoff
Figure 14: Organization Efficiency versus Thoroughness Tradeoff
Figure 15: Margin of Maneuver (adapted from Stephens, 2010)
Figure 16: Organizational Broad versus Narrow Perspective Tradeoff
Figure 17: Organizational Robustness versus Fragility Tradeoff

Chapter 1: Introduction

This paper explores organizational resilience by examining how an organization prepares to manage disruptions caused by external events, such as extreme weather, to minimize downtime and to restore normal business operations. The results are based on observing the development and use of crisis management exercises that test and train the company's ability to respond to outside disrupting events.

Events such as extreme weather can impact business organizations on a large scale by affecting vital infrastructure or global supply chains. These challenge events can appear to be distant from a business's operations or appear to affect only minor, isolated aspects of business operations. In today's highly interconnected world, events can have effects which propagate and cascade in surprising ways to disrupt business operations. Businesses may be unprepared or underprepared for handling challenge events on two dimensions: (1) how to prevent or minimize their impact on operations, and (2) how to cope with unexpected difficulties that arise.

Preparation and practice on how to maintain or restore business continuity in anticipation of outside challenge events have grown in importance as companies have witnessed the surprising reverberations of large-scale events such as Hurricane Katrina, Japan's 2011 tsunami, and the terrorist attacks of 9/11. These events caused widespread outages of networks and impacted supply chains with long-term consequences and interruption of services.

Companies have recognized that managing the risk of loss of business continuity requires the organization to prepare for and respond to challenge events by developing new capabilities with respect to information, decision making, and coordination. This preparation is made more difficult because it is impossible to know in advance the exact nature of the events to be prepared for and the specific ways they will affect business continuity. Organizations need to be agile by anticipating the challenges posed by events, rapidly assessing how these events can produce cascades of effects that undermine business continuity, and effectively coordinating decision making across units to recover and restore normal business operations smoothly. In other words, organizations need to develop resilience (Hollnagel et al., 2006; Hamel & Valikangas, 2003; Gulati, 2010).

1.1 Overview of Exercises and Observations

The empirical study described in this thesis consists of observations of two exercises. Analyzed together, they represent efforts in one organization to build resilience with respect to business continuity in the case of a loss of vital digital infrastructure.

The first exercise, or phase one of the study, was a practice implementation of disaster recovery (DR) plans. These plans address how to recover from the loss of a single data center. The observed DR exercise involved approximately fifty engineers and technologists who work daily with the system in question. They practice recovery by conducting the shutdown of a data center. The shutdown is planned in advance to be run during non-production hours and analyzed to avoid impact on normal business operations. The exercise allows the personnel to practice components of recovery that would be needed to avoid or minimize any disruption to business continuity should a data center shut down due to unplanned internal or external causes. The DR plans, despite the company's label for this exercise, address disruptions at a relatively small scale compared with the total size of the company's digital infrastructure. The author observed a particularly complex exercise of DR plans that went on for approximately 16 hours. After observing the exercise, the author had the opportunity to interview experienced personnel on the DR process and on how DR exercises were planned and run. The author also convened a focus group where experienced personnel discussed the current processes and opportunities to improve recovery tests. The session provided an avenue for participants to discuss end-to-end processes across teams concerning DR testing.

The second exercise, or phase two of the study, was a new scenario designed to test the company's ability to handle an external event disrupting several aspects of vital infrastructure together. The company called the exercise a risk-based scenario (RBS). The RBS was run twice with a total of over 150 administrators and business leaders participating. The observer team included two cognitive engineers with access to scripts and a video feed, and with permission to audio record the exercise. The first run included employees intimately familiar with business processes and the second run included higher-level management and administrators. Organizers simulated a large technology outage impacting multiple infrastructure locations. The scenario engaged personnel who would make critical decisions if an actual disaster were to undermine the company's infrastructure.

Chapter Two describes the method, data and results for each exercise, DR and RBS. Chapter Three combines observations from the two exercises and discusses how the results reflect on the ability of a company to manage disruptions to digital infrastructure and maintain business continuity.

1.2 Business Continuity and Vital Digital Infrastructure

The types of organizations under consideration in this thesis provide services to customers, similar to organizations in the financial sector. Research on resilience in financial organizations is growing (Sundström & Hollnagel, 2011). These organizations operate under pressure from stakeholders, shareholders, customers and employees to continuously become more efficient, provide higher quality reliable service, maintain company reputation, and keep costs low. However, these organizations also have come to recognize the need to manage internal or external events that produce shocks to the company's ability to provide services.

These organizations are divided into functional units or lines of business (LOB). Each LOB specializes in a specific service offering so that customers receive tailored service in each area. Each LOB has resources specific to its business, which are managed by that LOB, as well as resources shared across the organization. For example, information technology service and troubleshooting is shared across lines. Goals are divided among the LOBs, which contribute to larger global goals. Each LOB has a hierarchical management structure from technologists to administrators.

More and more organizations today rely on digital infrastructure to operate. The infrastructure is managed both internally and with support from external third-party vendors. Applications required for company processes are configured within a network of datacenters, servers and databases. A healthy digital infrastructure is essential for supporting applications that support business processes. Business processes support the ability to serve customers. Serving customers results in profit. How does a company ensure the digital infrastructure is healthy to enable business continuity?

The infrastructure itself is a network laced with interconnectivity between hardware, software and security measures. Fast-paced business culture requires frequent updates as network components such as operating systems become outdated quickly. A robust infrastructure has redundancy built into the system so that when a component goes down, the application can still pull information from secondary sources. Reliability and business continuity may even be achieved through maintaining a secondary production environment on standby during business as usual. Redundancy increases complexity due to increased interconnectivity in the network. It is similar to adding more highways and byways to an ever-expanding city. The already complex environment becomes even more difficult to manage due to an expanding collection of technology. Security measures necessary for protecting these systems from hackers further complicate the structure. These interconnections, recovery elements and security functions create a highly interdependent network that can be modeled as a complex adaptive system (Woods, 2006). The network continually evolves and changes in order to meet new business demands and customer desires (Gulati, 2010; Sheffi, 2005). In order for this complex adaptive system to operate effectively and flexibly, domain experts manage everyday operations and control changes while administration manages organizational strategy.

Before a controlled change is executed, resource teams conduct analyses to identify impacts across the interdependent network of hardware, software, locations and functions. Even with close study, surprises occur when controlled changes are introduced due to the complexity of the system interdependencies. The label "controlled change" refers to intent rather than result, as any change is susceptible to unknown impacts and unanticipated cascades of effects can occur. Even actions to mitigate the effects of a disruption can inadvertently contribute to cascades of unforeseen consequences across network interdependencies.

In contrast to internal changes, when external events disrupt the system, a thorough impact analysis is impossible prior to the event. External impacts emerge from a range of catalysts including vendor mishaps, extreme weather, political mandates or security breaches (Hamel & Valikangas, 2003; Woods & Cook, 2006; Cook & O'Connor, 2005). Vulnerabilities can be monitored, for example by keeping track of network health; however, once these types of events occur, direct impacts and downstream effects may be difficult to prepare for due to the interdependencies within the network. External events cause surprising effects within the infrastructure due to the same interdependencies between components that complicate implementing internal changes.

One type of surprising consequence of interdependence is the result of common mode failure. Common modes exist when components depend on a functioning entity to operate, such as a power source or internet connection (Woods, Dekker, Cook, Johannesen, & Sarter, 2010). A common mode failure occurs when a disturbance impacts the resource that components depend on and causes multiple components to fail simultaneously (Woods & Hollnagel, 2006; see Perry et al., 2005 for a specific example). Common modes are not always explicit and external events have a higher tendency to impact unknown common modes. Common mode failures contribute to cascading effects from multiple areas, producing interdependent problems that are difficult to understand. Interdependent components can cause downstream reactions based on unsuspected commonalities.

The issue of common mode effects illustrates that it is not the disrupting event per se that matters; rather it is how classes of disrupting events create surprising cascades that propagate effects across interdependencies and across scales. It is these interactions that produce the risk of major failures of business continuity. The vignettes in the next section help illustrate these characteristics of a complex adaptive system as applied to business continuity. The more swiftly a company can recognize cascading disturbances and coordinate across levels to execute the solution, the more likely customers will experience little to no service interruption. How does a company foster this ability, assess its performance and become effective at this skill?

1.3 Vignettes and Introduction to Complex Adaptive Systems

Consider the following story: An engineer decides to create a network to support an application in his garage. The application requires a large amount of data to support requests from the user, so the engineer builds a database and connects it to the server which houses the application. The engineer then realizes the application requires more data from a source that already manages the data, so instead of creating his own set of information he writes requirements and connects to the second database as a third-party customer. Imagine this engineer's application becomes very popular and grows in size so that he needs someone to help him manage the infrastructure. Right now, Engineer One knows everything about his network since he wrote the requirements and built it. Engineer Two, who is experienced at network administration, has to learn the idiosyncrasies of this network. When Engineer One makes any changes to his network, they need to be communicated to Engineer Two, as each component has hundreds of requirements to function. Now, imagine the application grows and requires a secondary application to support new functionality requested by users. Engineers One and Two hire Engineers Three and Four to help manage the system and the cost of coordination increases.

This system of technology and human governance has already become an example of a complex adaptive system. Adaptive systems change and reorganize their component parts to adapt themselves to the problems posed by their surroundings. Some examples of complex systems that are also adaptive are ecosystems, the biosphere, economies, organisms and our brains. An essential aspect of an adaptive system is nonlinearity, leading to multiple possible outcomes. These systems are difficult to define and even more difficult to control as the goal of the system could actually be a moving target. For example, the goal of our engineers is to support ever-changing customer desires.

The first system boundary drawn in the example vignette was the garage surrounding the infrastructure and the cognition of the supporting engineering team. Engineers begin digital infrastructures on a small scale, and due to the adaptive nature of that system, it may grow into an enterprise or company. While the garage story is illustrative, the next few vignettes will expand system boundaries and begin to include multiple scales as they capture the growth of a company's infrastructure. Each focuses on a common characteristic of complex adaptive systems: interdependencies.

Vignette: Single Server with Two Database Infrastructure

Figure 1: Example of Single Server with Two Database Infrastructure (diagram labels: Application on a Server; Network; Database #1; Database #2)

Figure 1 is a basic network with one server dedicated to one application and two databases. The application provides a user interface for customer interaction. The server has a UNIX operating system. Assume the application is an Internet Movie Catalog which retrieves information about a movie for a customer. The data for the actors and actresses in the films is kept in Database #1, which contains raw, unfiltered information. Database #1 is an Oracle database owned and maintained by the Actors Guild. Database #2 is owned by the company operating the server and houses the information concerning producers, release dates and movie summaries. The two databases exchange information in order to provide a full set of information back to the server. The server is connected to the databases via a cloud network. Each network entry and exit point is connected to a router and switch. Information has to pass through a firewall for security.

Every time the customer uses the application to look up data about a movie, the application makes a call to Database #1 and Database #2 to request information. Each database has to receive the request, select the correct information and pass it back to the application. The application has to accept and process the information before passing it back to the customer. The entire infrastructure system collaborates to successfully provide the customer with accurate information.

Now what happens when the engineers receive a notification that something has gone wrong with the system? When alarms occur indicating poor system performance, how is the engineer informed of the actual defect? Most likely the only indication for the engineers that something has gone wrong is feedback from an angry customer. A root cause analysis is virtually impossible with ubiquitous component coupling. There are dependencies on connectivity to the network, from the server to the network and from the databases to the network. Database #1 network connectivity is governed by the Actors Guild so the company has little authority to investigate. It is the network expert's responsibility to ask questions about system health. Is application memory capacity high enough to handle the information received from the databases? Does proper feedback for the customer exist if there is an error? Is the network down? Is the firewall malfunctioning? Was a component updated without a proper compatibility test to assess how the update would impact the system?

Most problematic of all, during the investigation, the customer is experiencing a lack of business continuity. The governing organization will now have to properly address customer concern about the lack of performance or lose the customer. Since reliability is imperative, especially when the service provided becomes more critical than an internet movie search engine, the organization adapts in order to decrease the chances of loss of service. The adaptation will provide flexibility to meet customer demand. The company invests in the recovery database shown in Figure 2.
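The request path in this vignette, and the difficulty of learning which dependency failed, can be illustrated with a minimal Python sketch. It is not taken from the organization studied. The names `Dependency` and `lookup_movie` and the record fields are hypothetical. The sketch only shows the idea of recording which dependency raised an error instead of waiting for customer feedback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Dependency:
    """A data source the application calls (hypothetical stand-in for a database)."""
    name: str
    fetch: Callable[[str], dict]  # returns part of the movie record


def lookup_movie(title: str, deps: List[Dependency]) -> Tuple[dict, Dict[str, str]]:
    """Call every dependency, merge what succeeds, and note which ones failed."""
    record: dict = {}
    failures: Dict[str, str] = {}
    for dep in deps:
        try:
            record.update(dep.fetch(title))
        except Exception as exc:  # assumption: any exception counts as a failed call
            failures[dep.name] = f"{type(exc).__name__}: {exc}"
    return record, failures
```

Even in this toy form, the `failures` map names only the dependency that failed, not the cause (network, firewall, or an untested update), which is the gap the diagnostic questions in the vignette point to.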

1.2.2 Vignette: Single Server with Recovery Strategy

Figure 2: Example of Single Server with Two Database Infrastructure with Recovery Database (diagram labels: Application on a Server; Network; Database #1; Database #2; Recovery Database #2)

The company creates a mirrored database to recover operations should Database #2 fail (Figure 2). The new database also introduces more network connections and additional complexity. Coordination requirements between resource teams overseeing the infrastructure increase. The scale of the network may grow, as the recovery database may be sited in another location. This example demonstrates a recovery database; however, the company may also invest in recovery servers or redundant networks. Sometimes recovery systems may be exploited to aid with load balancing when capacity grows.
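As a rough illustration of the recovery strategy described above, the following Python sketch tries the primary copy of Database #2 and falls back to the mirrored recovery copy. The function name `fetch_with_failover` is hypothetical. Treating an outage as an `OSError` (which covers `ConnectionError` and `TimeoutError`) is an assumption made for the example, not a description of the organization's systems.

```python
from typing import Callable, Tuple

Fetch = Callable[[str], dict]  # hypothetical signature for a database query


def fetch_with_failover(title: str, primary: Fetch, recovery: Fetch) -> Tuple[dict, str]:
    """Query the primary database; on an outage, query the recovery mirror."""
    try:
        return primary(title), "primary"
    except OSError:
        # Assumption: outages surface as OSError subclasses.
        # If the recovery copy is also down, its error propagates to the caller.
        return recovery(title), "recovery"
```

The fallback path is itself a new dependency with its own network connection, which reflects the vignette's point that recovery elements add complexity as well as protection.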

Any technology system is created to be as tolerant of the hundreds of editable characteristics of the code as possible. Some applets handle different languages and program interfaces more easily than others. Experienced engineers learn idiosyncrasies concerning application tolerance to code or network changes. This knowledge affords easier transitions when technology changes or updates are needed. If any one of these components changes in the slightest or causes outages, application owners, developers and engineers have to manage the system for continuous service to customers.

Vignette: Single Datacenter Infrastructure

Figure 3: Simple Network Infrastructure Example Expands to Datacenter (diagram labels: Application on a Server; Network; Database #1; Database #2; Recovery Database #2)

If the infrastructure system is sizeable, a company may need to house the infrastructure within a datacenter. Consider the example system as a large-scale organization which includes a datacenter. A datacenter is a facility used to house computer systems and components with certain environmental characteristics to enable large technology systems to function. A datacenter usually supports one line of business and contains hundreds of servers, which house hundreds of applications. Generally, a datacenter will include redundancy built into the system as well as recovery systems. The ability to seamlessly switch from production to recovery sites enables continuity of business operations.

Periodically, these datacenters are intentionally shut down to test recovery plans. Technology infrastructure included in the shutdown is made up of servers, operating systems, databases and applications, and the people involved are the resource teams governing them. The key component to these shutdowns and startups is coordination within and across resource teams. Gaps arise between the plan and the world and people are the adaptive element (Woods, 2006). The necessity of coordination was apparent even in the two-engineer garage network and increases dramatically with more critical services and growth in the network. What happens when the team is locked into a plan and the plan does not match what occurred in the world? Even though thorough impact analyses are conducted leading up to the exercise, hidden linkages are always stumbled upon during the actual shutdown.

Assume during a technological lull period, administration shuffled resource teams through a partial reorganization for reasons unrelated to DR testing. When it came time to perform another DR test the team relied on past success to carry them through, without evaluating how the new team organization would impact execution. The result was a less experienced person in a role previously held by an expert with knowledge of the network interdependencies. The event resulted in many applications going down unexpectedly and negative customer impacts. The failure event highlighted the critical need for coordination of knowledge to restore operations. This example event was a reality for the organization under consideration, and the actions taken afterward are detailed in Chapter Two.

Vignette: Multiple Datacenter Infrastructure

Figure 4: Single Datacenter Expands to Multi Datacenters Managed by One Organization

To continue the vignettes, redraw the example system boundary to a scale that includes multiple datacenters, all of which support one organization. Each datacenter is associated with an LOB, but may also serve as the recovery site for other datacenters.

Datacenters may be in close proximity or spatially distributed. What happens when a common mode failure impacts these datacenters? Common mode failures occur when external events, such as flooding, vendor-caused outages or new governmental legislation, interconnect seemingly unconnected units. Factors that affect vital digital infrastructure can also arise internally as occurred in Perry et al. (2005). Common mode failure lays the foundation for the third vignette.

Two facilities happen to be located in low-lying areas and near enough to each other to be affected by the same severe weather events or seasonal patterns. The organization managing these facilities did not account for this common mode failure, wide-area flooding, when designing its business continuity strategy. If a flooding event were to occur it would disable multiple centers at one time. Organizational testing of recovery ability did not include weather events that could produce wide-area flooding. In fact, a flood caused immediate multiple impacts the company did not foresee. An incidental correlation (spatial dependency) caused cascading failures unaccounted for in recovery strategies.

The difficulty lies in assuming the recovery will function as planned. The organization planned a perfectly functional reserve for a one-datacenter outage. A common mode failure caused outages at two datacenters concurrently and capacity demand outweighed reserve resources. Everyone was counting on the same reserve; however, it became oversubscribed. Local adaptations contribute to working across the purposes of other functional units. It turned out the reserve was not available because another LOB was already using it.
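The two mechanisms in this vignette, a shared dependency that links otherwise separate sites and a reserve sized for a single outage, can be sketched in a few lines of Python. The function names, site names, resource labels and capacity figures below are all hypothetical and chosen only for illustration. Nothing here comes from the organization studied.

```python
from collections import defaultdict
from typing import Dict, List, Set


def common_modes(depends_on: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return each resource that two or more sites depend on, with those sites."""
    users: Dict[str, List[str]] = defaultdict(list)
    for site, resources in depends_on.items():
        for resource in resources:
            users[resource].append(site)
    return {r: sorted(sites) for r, sites in users.items() if len(sites) > 1}


def reserve_shortfall(failed_load: Dict[str, int], reserve_capacity: int) -> int:
    """Demand from failed sites that the shared reserve cannot absorb (0 if none)."""
    return max(0, sum(failed_load.values()) - reserve_capacity)
```

For example, with `{"dc_a": {"floodplain", "grid_1"}, "dc_b": {"floodplain", "grid_2"}}`, `common_modes` reports `floodplain` as shared by both sites. A reserve of 100 units covers one failed site carrying 100 units but leaves a shortfall of 80 when a second site carrying 80 units fails at the same time. A check like this can only find common modes that someone has already listed, whereas the text stresses that many are not explicit.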

In this scenario, workplace availability might also be impacted by the flood. Buildings may become inaccessible, leading employees to work from home or alternate locations. The recovery plan assumed employees could relocate to alternate locations. However, the disaster impacted the ability to travel. Travel availability was not considered a threatened variable in recovery planning. The complexities of planning for multi-datacenter outages across various spatial, temporal and technological scales challenge organizational efforts to sustain business continuity. The vignettes illustrate the challenges associated with maintaining business continuity for vital digital infrastructure.

Introduction to Complex Adaptive Systems

Successful implementation of recovery plans is dependent on how well plans match the variability of the real world. Complex adaptive systems exist in an open world and the processes for business continuity given vital digital infrastructure can be thought of as a kind of complex adaptive system (Anderson & Doyle, 2010). "Open world" and "closed world" are labels used in the description of systems. These labels provide a framework for describing system attributes such as boundaries, inputs, outputs, internal units/agents and external units/agents. Any system could be considered an open world without boundary as it exists within a universe of catalysts, unforeseen circumstances, extensive interdependencies and wide-ranging interactions with the environment. Labeling a system as open or closed means making certain assumptions about the system. For example, assume a glass filled with milk sitting on a table is a closed system.

28 A narrow system boundary could be drawn around the table, glass and milk. An external disturbance to the system is a child bumping into the table causing the glass of milk to spill. We could have drawn a different system boundary that includes the house in which the table, glass, milk reside. The child s disturbance or other human activities now become part of the system. Vignettes in this chapter provide four examples of expanding the boundary for defining the system of interest. As the system boundary expands more interdependencies become visible and more events can be recognized that would challenge business continuity. How does an organization concerned with business continuity recognize which interdependencies matter when? As the vignettes demonstrate, preparing for business continuity challenges is not just crisis preparation. Studying the rules by which complex adaptive systems work will help an organization create resilience. 1.3 Resilience The use of resilience in simple, traditional sense is the ability of a system to rebound from an adverse event (Sheffi, 2005). How well does the body recover after a broken bone or virus? How well do supply chains for a global car manufacturer recover after a seemingly distant tsunami causes major disruptions? This definition captures some essential concepts of resilience such as flexibility, adaptability, active response to disruptions and coping with the unexpected to maintain system performance. It requires mobilization of resources, willingness to monitor system inputs and impromptu coordination. Coutu s (2002) article in Harvard Business Review describes resilience as 18

merely the skill and the capacity to be robust under conditions of enormous stress and change. However, as research on resilience continues, more concise and actionable definitions have emerged. Using words such as resilience and continuity shifts the focus from rebounding from unexpected disturbances to preparation and proactive action. This vocabulary changes the connotation of crisis management practice from reaction to events to positive forward thinking. The terminology is not simply another buzzword; rather, companies have recognized that managing the risk of losing business continuity requires the organization to develop new capabilities. Preparation for and response to both internal and external disrupting events require attention to decision making, coordination and information flow.

Across disciplines, conceptual frameworks for engineering resilience are in flux. This results in varied terminology across disciplines and communities. The following paragraphs highlight the diverse uses of resilience and clarify the resilience of complex adaptive systems as related to the exercises studied.

Resilience engineering was developed in response to limits encountered in making system safety more proactive (Hollnagel et al., 2006). In business continuity, authors also began to emphasize resilience as a proactive process to continuously anticipate and adjust to trends prior to experiencing major changes or disruptions (Hamel & Valikangas, 2003). Both lines of work emphasize how the system interacts with a dynamically changing environment and how the system is able to monitor, anticipate and learn to accommodate changes.

In systems safety, engineering resilience is concerned with domains where failures injure or kill workers and the public (Amalberti, 2006; Woods, 2005; Cook & O'Connor, 2005; Dijkstra, 2006). In the case of business continuity, the argument for enhancing resilience shifts from increased safety to maintaining longer-term system viability. Loss of business continuity threatens the viability or sustainability of the business organization due to extended downtime or slow restoration of normal business operations (or slower recovery as compared to competitors). The emphasis on resilience also resulted from the increasing complexity of infrastructure systems, which makes it more difficult to expertly manage interdependencies and avoid cascading failures to ensure business continuity.

Today's highly interconnected world results in events that propagate and cascade in surprising ways to challenge existing plans. The system reacts to the cascades in both successful and unsuccessful ways, and it is important to study both in order to learn how to build resilience. Resilience engineering looks beyond incidents to include the study of a system's performance in general rather than limiting itself to things that go wrong (Hollnagel, 2010).

Resilience Engineering is a paradigm for business viability that focuses on how to help people cope with complexity when under pressure to achieve success. Organizations that govern highly critical, complex systems need the ideas of resilience to proactively manage for business continuity. What does an organization use to monitor, learn, assess and respond to changes in its world? How does an organization with a global footprint track interdependencies between functional units and the environment? This thesis addresses these questions by investigating two kinds of

exercises a company uses to build resilience and sustain business continuity in anticipation of future challenge events. The results identify some of the abilities needed to respond to events, monitor ongoing developments, anticipate future threats and opportunities, and learn from past failures and successes (Hollnagel, 2010).

Chapter 2: Study

Chapter Two describes the study of this thesis. First, the objectives and methodology are described for the study as a whole. Then, the two phases are described in greater detail in their respective introduction, method, data, result and discussion subsections.

2.1 Methodology

The objective of this study is to capture how an organization prepares for challenges to business continuity so that it can prevent and stop the spread of cascades due to interdependencies. The study contributes to the understanding of how disturbances propagate and how an organization learns about, tracks and prepares for cascades. The study can be divided into two phases in the form of two exercises conducted by the organization.

Using various tools and techniques in combination to capture cognition at work provides greater leverage and deeper insight (Crandall, Klein & Hoffman, 2006). Each of the techniques used contributes to a better understanding of the organization's attempts to create resilience. This study includes multiple cognitive task analysis methods, as displayed in Table 1.

Phase 1: Recovery Test        Phase 2: Risk-based Scenario
Unobtrusive observation       Observed exercise preparation
Semi-structured interviews    Observed exercise execution
Focus group                   Observed lessons learned

Table 1: Cognitive Task Analysis tools for Phases of Study

Only through synthesis of the data collected can we understand the practices and tools that contribute to resilience. By studying work in context and those who make critical decisions, we can learn how to track, prepare for and prevent cascades that undermine business continuity, as well as develop strategies for resilience.

In the first phase we studied the orchestrated shutdown of a datacenter by using three techniques in cognitive task analysis:

- Unobtrusive observation
- Semi-structured interviews
- Focus group

The organization simulated an emergency situation requiring a datacenter shutdown and then carried out the actual shutdown of infrastructure. A particularly complex datacenter shutdown is the basis for Section 2.2 of this thesis. We observed this exercise from the command center, taking notes. Our observation was unobtrusive. The focus of observation was coordination and teamwork. Then, we conducted seventeen semi-structured interviews with expert participants from various teams involved in DR exercises. The focus of our interviews was to capture the details hidden from observation

of coordination efforts between teams during preparation and execution of the recovery test. The interviews were then compiled into a Coordination Map, which was used as the main artifact to facilitate a discussion among a focus group of experts. During the focus group session, experts discussed their respective processes and opportunities to improve execution of recovery tests.

In the second phase of this study, we had the opportunity to observe three phases of executing a risk-based scenario (RBS):

- Observed exercise preparation
- Observed exercise execution
- Observed lessons learned analysis

We investigated the preparation and execution of an RBS that proposed an outage of multiple datacenters impacting business continuity. RBS allows for direct observation of phenomena of interest by tailoring the staged scenario to probe participants (Woods, 2006; Voshell, 2009). This particular exercise was mixed fidelity in that business leaders were flown into one location and placed in a boardroom to hear the script played out, receive real-world alerts and watch newscasts. However, the scenario was entirely simulated and no actual shutdowns of technology occurred.

We were involved in the design of the staged world exercise at a high level and in the execution and observation. To capture participant behavior during the scenario execution, we relied on a fixed camera in the boardroom and observers' handwritten notes. The two critical spaces for observation were the boardroom and the scenario-control rooms. We were also invited to conferences

held in the weeks after the exercise to discuss lessons learned. The focus of our observation was on how decision makers handled interdependencies.

Each technique has various dimensions that contribute to the validity of the data collected. Participants in each exercise are the employees who perform everyday work, the subject matter experts for the digital infrastructure, the decision makers in actual events and stakeholders in the value of testing. Participants in each exercise were highly invested in the outcomes. In the recovery exercise, infrastructure is literally shut down in a high-fidelity experiment. The staged world exercise is a walkthrough and necessarily lower fidelity; however, the business leaders are given the opportunity to practice decision making in critical situations. Action plans determined during the RBS in response to the scenario were not implemented, but discoveries of potential risks were brought back to teams for analysis.

2.2 Disaster Recovery Exercise

This section describes the Disaster Recovery (DR) exercise. It begins with a general background on the support of a technology system, followed by a description of the method for capturing cognition and decision making in this specific context. Data is presented and analyzed initially in the results section. More general discussion of how the DR exercise relates to a study of firm-wide resilience is included in the discussion chapter following the RBS section.

2.2.1 Introduction and Background

The socio-technical organization observed in this analysis periodically tests the effectiveness of recovery plans for digital infrastructure viability. The scale of a recovery test includes one datacenter. A datacenter is a facility used to house computer systems and components, with certain environmental characteristics that enable large computer systems to function. For example, the temperature in the building is highly regulated to cool the server rooms. Disruptions to normal service encountered in this space may include malfunction of the environmental regulators, hardware malfunctions, severe weather conditions, facility maintenance and general hardware or software maintenance. Generally a datacenter will include redundant components for recovery systems in cases when the primary components fail. Sometimes, when anticipating what may go wrong, a company may choose to set up recovery sites in different locations than the primary datacenter. When disruptions to business as usual occur, the ability to seamlessly switch from production to DR sites enables continuity of business operation.

Disaster recovery (DR) exercises test the transfer of operations from the primary site to the secondary site. In each DR exercise, technology is shut down during non-production hours in order to practice business continuity plans, which might involve moving production data to planned secondary environments, running emergency backups or going without capabilities for that resource. The shutdowns occur in a considerably controlled environment. Impact analyses are conducted in the weeks beforehand in an attempt to

capture downstream effects of shutting down resources.

The cognitive system involved includes the technology infrastructure, made up of servers, operating systems, databases and applications, and the people serving as the resource teams managing them. Various teams are responsible for different parts of the infrastructure. Each team is separated as a functional unit governing a certain resource type, such as operating systems, databases or applications. For large DR tests, a command center of resource team managers is hosted in one large conference room. These managers are typically experts in their respective fields, routinely responding to incidents and solving service issues during day-to-day operations. Each resource team manager directs a team of experts in the shutdown and startup of the test. The teams are distributed across the globe and are connected by instant messaging and phone lines. Most shutdown and startup of the technology occurs remotely from the hardware site via network connections, algorithms and scripts.

Methodology

An exploration was conducted to better understand the processes behind DR testing and the coordination necessary to successfully execute a test. It was suspected that sources of resilience and knowledge of interdependencies resided in the coordination and interaction between resource teams. Three methods were used to conduct a cognitive task analysis to extract the details of DR test preparation and execution. First, we observed a particularly complex DR test in real time. Second, semi-structured interviews were conducted with seventeen expert participants from six different relevant resource teams.

These interviews were compiled into a coordination map. Third, this coordination map was presented for review in a focus group of experienced technicians and managers.

Real Time Observation

We observed both shutdown and startup from the command center. While observing the DR test we were given permission to view inter-team instant messages where status updates were given. Our positions enabled us to see resource managers at work and listen to telephone conference lines where most intra-team status updates were given. We also took notes of whiteboard communications such as lessons learned.

Semi-Structured Interviews

The goal of the interviews was to gather information that would help map the coordination efforts of resource teams and track how status messages were passed. Status messages indicate the transfer of pertinent knowledge about network environment health as well as dependencies. Once a technician had collected all required status messages for dependent upstream components, he could take the next action. How did infrastructure engineers determine it was safe to shut down in the midst of complex networks and interdependencies?

The interviews were semi-structured and guided towards understanding coordination between teams. A short verbal presentation was given before each interview to inform participants of our main interest: understanding the coordination surrounding DR tests.
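
The status-message rule described in this section (a technician takes the next action only once every required status message for dependent upstream components has been collected) can be sketched as a simple gate. The following Python sketch is an illustration only and not the organization's tooling; the class name, method names and component names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ShutdownGate:
    """Tracks the status messages one component needs before it may act."""

    component: str
    # Hypothetical names of upstream components whose status must arrive first.
    required_from: set = field(default_factory=set)
    received_from: set = field(default_factory=set)

    def record_status(self, sender: str) -> None:
        """Record a status message; senders that are not required are ignored."""
        if sender in self.required_from:
            self.received_from.add(sender)

    def still_waiting_on(self) -> set:
        """Return the upstream components that have not yet reported."""
        return self.required_from - self.received_from

    def has_green_light(self) -> bool:
        """True once every required status message has been collected."""
        return not self.still_waiting_on()


# Example: a database may be taken down only after both applications report down.
gate = ShutdownGate("orders-db", required_from={"app-1", "app-2"})
gate.record_status("app-1")
print(gate.has_green_light(), gate.still_waiting_on())
gate.record_status("app-2")
print(gate.has_green_light())
```

The sketch makes one limitation visible: the gate is only as good as its list of required senders, which in the organization studied resided largely in expert knowledge rather than in a data structure.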

Any time a participant began to describe process improvement or ideal methods, we reminded the interviewee to describe current activity. There are three main time periods of interest for each DR test. The first is event preparation: the actions the team took from the first indications that a DR test was going to occur. The second timeframe is infrastructure shutdown, and the third is infrastructure startup.

Timeframe     Prompt
Preparation   You have just learned a DR test is going to take place impacting certain applications and resources. Where did you receive this information first and what does your team do next?
Shutdown      It's Saturday evening and the infrastructure teams are arriving onsite. The command center is set up, cords running underneath the conference tables and people are up and running connected over the internet to their remote teams
Startup       It's Sunday morning and the infrastructure teams are arriving onsite. The command center is set up from the work the night before. Incidents from the night before are written on a whiteboard

Table 2: Prompt Summary per Timeframe

Each time period was framed with a small introductory scenario. Before the first question of the event preparation timeframe, interviews began with, "You have just learned a DR test is going to take place impacting certain applications and resources. Where did you receive this information first and what does your team do next?"

The second timeframe, actual shutdown, began with, "It's Saturday evening and the infrastructure teams are arriving onsite. The command center is set up, cords running underneath the conference tables and people are up and running connected over the internet to their remote teams. Team chat windows are open and conference lines are started. Snacks are served on the buffet table and you've got your refreshments and are sitting at your computer. From whom do you receive your first signal to begin and what do you do?"

The third timeframe, actual startup, began with, "It's Sunday morning and the infrastructure teams are arriving onsite. The command center is set up from the work the night before. Incidents from the night before are written on a whiteboard and detailed in your email. Team chat windows are open and running, already buzzing with summaries of shutdown activity. From whom do you receive your signals to begin startup of technology?"

For each timeframe, experts were asked where information was given and received from within their resource team, outside their resource team, and outside the infrastructure teams. It was also important to record how the information was conveyed and which communication medium was used, from yelling in the command center to emails to the collaboration spreadsheets. Experts were also prompted to explain, as if they were going to be absent from the DR test and were filling in the novice taking over their role, the typical challenges encountered within each timeframe.

Experts were not asked to give specific timelines, since each DR test was announced and prepared for with varying amounts of lead time. Also, experts were

steered away during interviews from going into depth on the actual running of scripts to shut down or start up computers and on similar technological tasks. Sometimes describing these functions would nonetheless help the interviewee imagine the DR tests, which enabled the gathering of more accurate data.

Expert Focus Group

After the seventeen interviews from six different resource teams were conducted and information concerning coordination and status messages was collected, a coordination map was created (Figure 5, Figure 6 and Figure 7). This product served as the single representation of knowledge elicited from many experts. Each interview was analyzed in relation to other relevant interviews from the perspective of information flow. Resource teams are dependent upon one another for data. It was expected that information concerning interaction initiation in one interview would match interaction receipt in the corresponding interviews. Almost every time an interviewee mentioned providing a status update, the receiver of the status update mentioned the expected receipt. When conflicts arose in the analysis, each participating interviewee was asked to clarify the information until the conflict was resolved.

Due to the large amount of data collected on coordination, only information required for key decisions and used as cues was included on the map. If more than a few interviewees mentioned the information, or if the information was key to avoiding failure during the test, the coordination surrounding the information was included.
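
The consistency check described above, in which a status update mentioned by its sender is expected to appear in the receiver's account as well, can be illustrated with a short sketch. This is a hypothetical Python rendering of the idea, not the procedure actually used in the study, which was carried out by hand; the team names, messages and the `cross_check` function are invented for illustration.

```python
from typing import NamedTuple


class Exchange(NamedTuple):
    """One status update as described by an interviewee."""

    sender: str
    receiver: str
    message: str


def cross_check(reported_sent, reported_received):
    """Compare senders' accounts with receivers' accounts.

    Returns two sorted lists: exchanges only the sender mentioned, and
    exchanges only the receiver mentioned. Both call for follow-up questions.
    """
    sent, received = set(reported_sent), set(reported_received)
    return sorted(sent - received), sorted(received - sent)


# Hypothetical interview data; team and message names are invented.
reported_sent = [
    Exchange("application", "database", "application down"),
    Exchange("database", "server", "database down"),
]
reported_received = [
    Exchange("application", "database", "application down"),
    Exchange("administration", "server", "green light"),
]

sender_only, receiver_only = cross_check(reported_sent, reported_received)
print("Mentioned only by sender:", sender_only)
print("Mentioned only by receiver:", receiver_only)
```

Any exchange that appears in only one list corresponds to a conflict of the kind that was taken back to the interviewees for clarification.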

The final step was to present the coordination map to the interviewees and target participants of DR tests for review. The three maps were projected onto a large wall for group viewing as an artifact to discuss the coordination of teams executing DR tests, similar to a kaizen event [1]. The presentation began with a walkthrough of each of the three time periods displayed on the map. Then, the teams were encouraged to discuss the maps by determining where the coordination system creates or reacts to waste in system components, one time period at a time.

Data

The seventeen interviews were compiled into three maps of the coordination space, one for each timeframe: event preparation (Figure 5), shutdown (Figure 6), and startup (Figure 7).

Observation and Interview Data

The nodes of the map are the resource teams and general tasks. The arcs contain the information passed between nodes. Circular nodes are teams external to the infrastructure teams, and the open rectangle is the collaboration tool.

[1] A Kaizen event is driven by waste reduction or continuous improvement. Team members may have a special meeting to focus on defining activities, improving supplier/customer connections and achieving flow for a particular process.

[Figure 5: Recovery Event Preparation Resource Team Coordination Map. The map covers the period from 90 days before the event to the start of shutdown and shows event notifications, requirements, impact lists and wave assignments passing between external teams, internal resource teams and administration teams.]

[Figure 6: Recovery Event Infrastructure Shutdown Resource Team Coordination Map. The map covers 8 PM to midnight and shows green lights, status updates and collaboration website updates passing between internal resource teams, administration teams and external application teams.]

[Figure 7: Recovery Event Infrastructure Startup Resource Team Coordination Map. The map covers 6 AM to the afternoon and shows green lights, status updates, validation communication and collaboration website updates passing between internal resource teams, administration teams and external application teams.]

Focus Group Data

[Figure 8: Focus Group Analysis Example]

The focus group narrowed in on primarily three areas of DR testing, as shown by the sticky notes in Figure 8:

- Communication with teams outside infrastructure in all three timeframes
- Source data quality concerning technology inventory
- Difficulties of transferring expert knowledge of infrastructure

These three areas demonstrate participant acknowledgement of the difficulties with passing status information and coordination. Participants often focused on suggestions for process improvement for each interaction for communication with teams

outside infrastructure. They expressed the difficulties with presenting requests and updates consistently from each resource team to external teams.

Participants also kept mentioning source data quality concerning technology inventory. Participants described how incorrect inventory records often surface during DR tests, since they rely on inventory records to analyze impacts. They noted that novices often rely on the inventory data source more than they should for impact analysis, which feeds into the third area participants in our focus group discussed: the importance of expert knowledge and the transfer of that knowledge.

It was common for the focus group participants to refer to difficulties of transferring knowledge of infrastructure from experts to novices. Participants recognized that experts not only knew the inventory almost better than the inventory system did, but also knew which status updates were important to pass along to the next resource team. Managers in this meeting even formed a focus group to meet regularly in order to determine how to capture the expert knowledge imperative to successful shutdowns. One participant explained that this meeting provided more value to them than any other meeting concerning DR tests had in years.

Results

The key component of these shutdowns and startups is coordination within and across resource teams. The technology infrastructure is intimately connected, and the functional units governing it should be as well. This is illustrated by the many teams and communication lines of the coordination maps. The application must be shut down before

the database, which in turn must be shut down before the server can be. As the multiple communications from team to team demonstrate, the network complexity is compounded because an application can live on multiple servers and databases at once and simultaneously with other applications. A DR test event could involve hundreds of applications and hundreds more servers. We observed that the coordination of resource teams in the command center was integral to success, tying up loose ends of infrastructure connections. The web of communication between participants is as tangled as the imaginary wires enabling the web of technological connection.

Interviews showed that there were procedures in place outlining how to run scripts and where the figurative switches to flip the systems on and off were located, but the focus group showed that the knowledge of when to begin shutdown or startup resided in expertise, understanding of the infrastructure, and the status message exchange between teams. The coordination maps demonstrated how unorganized the status message exchange may become. Communication occurs by talking across the room in the command center, conference lines, online chat applications, emails and real-time collaboration spreadsheets. In fact, coordination for DR tests is so widely distributed that no single resource has a holistic perspective of all the work involved. Instead, success relies heavily on individual expert knowledge. This reality spurred a tumultuous DR test we observed, as described in the next few paragraphs.

As these events were run successfully more often, experts became more narrowly focused on their own resources. Experienced technicians retained knowledge of network

interdependencies through many executions of DR tests, decreasing the need for coordination with other resource teams over time. Engineers waited on a green light to run scripts, were passed the green light through a communication mechanism and essentially carried out isolated activities. Engineers acted on their knowledge of infrastructure interdependencies to carry out the described tasks successfully. Then, during a lull period between DR tests, administration shuffled resource teams through a partial reorganization for reasons unrelated to DR testing. When it came time to perform another DR test, the team relied on past success to carry them through, without evaluating how the new team organization would impact execution. Even though thorough impact analyses were conducted leading up to the exercise, hidden linkages were stumbled upon during the actual shutdown. As always, engineers were expected to adapt and reconcile. The result of the reorganization was a less experienced person in the role previously held by an expert with knowledge of interdependencies. Actions were taken which unintentionally caused negative impacts to downstream components.

Discussion

The following subsections present results and initial analysis related to the DR test alone. After the RBS chapter, the DR test exercise is analyzed further in relation to the RBS exercise and organizational resilience.
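
As a concrete illustration of the ordering constraint reported in the results (applications come down before databases, which come down before servers) and of how a hidden linkage can undermine a plan, the following Python sketch derives a shutdown sequence from a dependency map and checks a planned sequence against it. It is a minimal sketch under stated assumptions: the component names, the dependency map and both functions are hypothetical, and the organization studied relied on expert knowledge and status messages rather than on any such program.

```python
from graphlib import TopologicalSorter


def shutdown_order(must_be_down_first):
    """Return one valid shutdown sequence.

    `must_be_down_first` maps each component to the set of components that
    have to be shut down before it.
    """
    return list(TopologicalSorter(must_be_down_first).static_order())


def ordering_violations(sequence, must_be_down_first):
    """List (component, prerequisite) pairs where the prerequisite comes later."""
    position = {name: index for index, name in enumerate(sequence)}
    return [
        (component, prerequisite)
        for component, prerequisites in must_be_down_first.items()
        for prerequisite in sorted(prerequisites)
        if position[prerequisite] > position[component]
    ]


# Hypothetical dependencies as they actually exist in the infrastructure.
actual = {
    "db-1": {"app-1", "app-2"},  # both applications use this database
    "server-1": {"db-1"},        # the database lives on this server
}

print(shutdown_order(actual))

# A sequence that respects records which omit app-2's use of db-1, but that
# violates the actual, undocumented dependency.
planned = ["app-1", "db-1", "app-2", "server-1"]
print(ordering_violations(planned, actual))
```

The second call reports the pair in which the database is taken down before an application that still uses it, which is the kind of downstream impact a hidden linkage can produce when the person holding that knowledge is no longer in the role.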

Systems Thinking

Taking a systemic approach has three core premises: perspectives, cross-scale interactions and emergence (Woods & Hollnagel, 2006). Comments highlighting each premise were captured from the participants' evaluation of the coordination map. The following paragraphs evaluate resource team members' insights captured during the focus group in relation to the three premises.

The coordination map helped employees step out of their typical, daily perspective into a systems perspective. These comments show that no person or group of people in a system has an omniscient view covering all events at the relevant scales, or has knowledge of all constraints that require consideration (Woods & Hollnagel, 2006). There is immense value to resource teams in collaborating to handle disturbances to the system and in learning from one another the interdependencies present in the network. The coordination map allowed resource teams to discuss the limitations of each other's perspective concerning the overall network infrastructure.

System Premise: Perspectives

Everyone has a different view on this right? If we take down application 1 we have to take down application 2 and all the rest of them. Where other resource team says they have servers in such and such wave is their concern. There are multiple sources, do they consult their own internal sources first and then bounce that against ours? Do they look to reconcile? We don't have to solve their problems but it would be good for us to know how they get their data. Resource Team gives out a data list of our managed servers but no one trusts the data so everyone goes and creates their own list. They will always want to know what is going on prior to their needing to know.

Table 3: Comments Related to System Premise "Perspectives"

Comments were also captured when resource teams noticed another premise of the systems perspective: cross-scale interactions between infrastructure teams, the technology and the other teams supporting the test. To study a system at one particular scale, it is also necessary to consider the scales surrounding the one of interest for a more complete analysis.

System Premise: Cross Scale Interactions

It seems like we have a duplication of efforts. We are both talking to them at the same time about the same issue. But other group has additional information to gather from them. I'm confused, the first thing that gets shut down is the applications, followed by hardware. Even if I start with the application, does the application teams notify other team and let them know they can start? If other team 1 takes longer than planned, we need to somehow let other team 2 know since they are not on our call bridge. And then there is us. Other team doesn't know the impacts of what they do.

Table 4: Comments Related to System Premise "Cross Scale Interactions"

A few comments from team experts below demonstrate the existence of emergent phenomena resulting from complex interactions in the technology infrastructure.

System Premise: Emergence

Who knows the sequence? There is some way we need to have the startup dependencies listed out.
Q: Why would an application team need to assist with shutdown?
A: Because their business processes need to finish before we can take it. They touch servers that aren't impacted or are going into disaster recovery.
How much of this stuff do we record after we identify the impacts? What do we do when we figure out, hey these are actually impacted?

Table 5: Comments Related to System Premise "Emergence"
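One participant comment in Table 5 asks for the startup dependencies to be listed out. As a minimal sketch of what such a listing could support, the Python example below derives a startup and a shutdown sequence from an explicit dependency list using the standard library's `graphlib` module (Python 3.9 or later). The component names and their dependencies are hypothetical illustrations, not data from the organization studied, and the organization did not necessarily use any such tool.

```python
from graphlib import TopologicalSorter

# Hypothetical components. Each key maps to the set of components that
# must already be running before it can start (illustrative names only).
STARTUP_DEPENDENCIES = {
    "server": set(),
    "database": {"server"},
    "application_1": {"database"},
    "application_2": {"database", "application_1"},
}


def startup_order(dependencies):
    """Return one valid startup sequence in which prerequisites come first."""
    return list(TopologicalSorter(dependencies).static_order())


def shutdown_order(dependencies):
    """Shutdown reverses startup, so dependents stop before what they rely on."""
    return list(reversed(startup_order(dependencies)))


if __name__ == "__main__":
    print("startup: ", startup_order(STARTUP_DEPENDENCIES))
    print("shutdown:", shutdown_order(STARTUP_DEPENDENCIES))
```

In this sketch a circular dependency causes `static_order()` to raise `graphlib.CycleError`, which would surface a hidden linkage before a test rather than during the actual shutdown. A written list of this kind complements, and does not replace, the expert knowledge of when to begin shutdown or startup described above.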

The experts within the system know each resource area very well and handle most anomalies with little difficulty. When major disturbances occur, the concepts from systems thinking (perspectives, cross-scale interactions and emergence) interplay with the complexity of the environment. This contributes to cascades of disturbances in the interactions between infrastructure components.

Successful DR tests may indicate resilient operations, and it is just as important to study why a test succeeded as why one failed. The focus group was attuned to asking "why does this work?" in addition to "why does this fail?", and insight into sources of resilience and brittleness became apparent. Specialized knowledge and expertise are a source of resilience, yet limited perspective is a source of brittleness. Without the system perspective, efforts to improve or understand the DR process would be futile, as knowledge of the system resides in different resource teams.

DR tests were consistently performed successfully, resulting in business decisions to take further advantage of the success. While production infrastructure was shut down, the business began to use the unique opportunity to perform system upgrades separate from the DR test initiative. Startup became more than charging up the existing infrastructure; it included testing the newly upgraded systems in addition to the transition back from DR. The inclusion of system updates also added another team to the coordination efforts, which meant more people requiring additional status updates.

As resource teams become more successful, more may be demanded, pushing the teams to juggle demands that will eventually exceed capabilities. Greater coordination efforts between more teams will be required to complete tests. More advanced forms of

real-time collaboration status updates will need to be developed, and the web of communication required to complete successful DR tests will become more befuddling. With new capabilities, effective leaders will adapt to exploit the new margins by demanding higher tempos, greater efficiency, new levels of performance, and more complex ways of working. This set of observations illustrated the law of stretched systems (footnote 2) in operation (Woods & Hollnagel, 2006).

2.3 Risk-based Scenario Exercise

This section describes the Risk-based Scenario (RBS) exercise. It begins with a general background on the support of a socio-technical organization, followed by a method section. Three methods are described: the method for developing the script, the method for presenting the script and the method for observing the execution. Data are presented and given an initial analysis in the results section. More general discussion of how the RBS exercise relates to firm-wide resilience is included in the discussion chapter.

Introduction and Background

Testing business continuity plans in full is nearly impossible considering the amount of resources that would be required. It is possible to test portions of these plans at a time, as in the DR tests described previously. However, the aggregate invocation of many DR plans across all business lines is not practical or feasible at such a large scale.

Footnote 2: Under resource pressure, the benefits of change are taken in increased productivity, pushing the system back to the edge of the performance envelope.

An RBS is one option to test the current plans and evaluate changing system boundaries without suffering the costs of failed performance. RBSs are used in many contexts for various objectives such as training and, as examined in this paper, testing plans prior to implementation. They also offer opportunities for objectives such as conducting cognitive systems engineering research (Voshell, 2009; Branlat, 2011) and eliciting expertise (Crandall, Klein & Hoffman, 2006). The main challenge of conducting an RBS is to meet the goals of all stakeholders, which may be multiple, overlapping and conflicting. In his study of running large-scale scenario exercises as learning laboratories, Voshell (2009) claims that the foundation for designing an RBS is the purposes and goals of stakeholders. From the perspective of the organization studied, the risk-based scenario provides an opportunity for businesses to test business continuity plans in order to assess the adaptive capacity of LOBs and the resilience of business continuity plans.

Methodology

By considering other accidents across multiple domains, organizations and researchers can determine common vulnerabilities that contribute to failures across complex settings. These common vulnerabilities can be exploited to create hypothetical scenarios that challenge business continuity plans, hence the term risk-based scenario. In this case, the scenario was a technology outage caused by hackers and weather. We were involved in the design of the staged world exercise at a high level and were permitted direct observation of test execution. There are two aspects to the methodology employed

in this phase of the study. The first subsection describes the logistics of RBS preparation and execution. The second subsection describes the methods for observation and capturing data during preparation and on the day of execution. In designing RBS execution and shaping the conditions for observation, we were most interested in learning how interdependencies between functional units affected performance.

Logistics of Preparation and Execution

Storyline creation is crucial to the success of RBS exercises. The fundamental attribute of scenario-based methods is the organizers' ability to design them (Woods & Hollnagel, 2006). Something must be known about the challenging situations that create difficult work environments for players in order to recreate the situation. Organizers have to elicit information about work environments, guided by stakeholder objectives for the scenario. If the simulated work environment is done well, participation will occur naturally.

This particular RBS was mixed fidelity in that business leaders were flown into one location and placed in a boardroom to hear the script played out, receive real-world type alerts and watch professionally recorded newscasts. However, the scenario was entirely simulated and no actual shutdowns of technology occurred. The script, structure and logistics provided sincerity factors to encourage participant engagement. Participation levels determine to a large extent the success or failure of the experiment.

To ensure script relevancy and accuracy, representatives from many LOBs provided inject information and aided in scenario design. Injects for each LOB were unique to that business's typical alerts received during an LOB-specific production

crisis. Collection and administration of these injects was not an easy task, as there are many LOBs and no one organizer has knowledge of the technology alerts, problem-handling escalation avenues or personnel roles for each business function. Adding to the challenge of collecting detailed injects, organizers' job functions were sometimes too high level for them to know the details of problem management needed to create a realistic situation for participants. The result of these challenges was a decision to expand the scenario delivery to two groups of participants. Two run-throughs of the scenario would be conducted, the first with experts and technologists closer to problem management and the second with higher-level administration participants. The technologist run-through was titled Day One and the business leader run-through was titled Day Two. Objectives for Day One varied from Day Two as described in the next paragraph.

On the day of RBS execution, Day Two, business leaders from various LOBs were brought together at one site for four hours before lunch. They sat in a large room around a U-shaped table with the main script readers at the end of the U. A large projector screen was in the room to show video injects. A breakout room specific to their LOB supported each business leader. The breakout rooms could hear the boardroom via a conference bridge line but could only interact with the business leaders through e-mails and personal phones. Each business leader had a cellular phone with access to their e-mail, alarm systems and contacts.

Methods for Observation

The main purpose of the RBS, from the perspective of the firm, was to test business continuity plans when disturbances are larger than business as usual. The firm did not manage observation opportunities explicitly. Little time was devoted to capturing data on the day of the event or to coordinating observers to capture joint cognitive system challenges. Increases in the complexity of large-scale exercises traditionally require increases in observation coordination.

My role in execution of the scenario was to monitor audiovisual components and ensure general logistics ran smoothly. My familiarity with the script meant people could depend on me to facilitate certain components, which limited my ability to capture data. Due to the already tight space constraints and the delicate nature of the work, only one outside observer was permitted. A research scientist from our lab was enlisted to help gather observations on the side, since the organization had no formal plan to do so besides a single recording from one vantage point. The purpose of this video feed was more to display the command center to the control room, so the selection of vantage point was based on those objectives rather than observation for later reflection.

The command center's view of the boardroom was through a video feed. The vantage point was placed twenty feet or so from the tables in order to provide the command center with a full view of the tables. With the macro view, small but important aspects of the room were missed, such as quiet conversations and nonverbal cues between participants. Small microphones, one per participant, recorded sound into a

conference bridge line. If a participant spoke into the main microphone or a small one at their seat, the conversation was saved along with the video feed. My role required that I remain behind the control board, a prime spot for observations of the room at large but not for detailed conversations between business leaders. My associate spent the majority of the RBS in the command center with organizers, where most injects were handled.

Neither the observation team nor the organizers were able to capture events in the breakout rooms. The conversations between business leaders and their teams were not monitored or studied. The only insights into these conversations came from business leaders highlighting them during roundtable discussions. Valuable insights into coordination and business functions may have been missed.

Data

Instead of using process tracing methods or other observation capture techniques often used for analysis of simulated events (Branlat, 2011), this paper uses the organization's results from the exercise as data. This data set is then expounded upon in the results section. From the perspective of the organization and business units, general themes of the event were identified (Table 6). The discussion of results is a discussion of what the organization considered its results from conducting the RBS, in order to capture how an organization prepares to manage disturbances.

Theme | Detail
General | Organizers used scenario for deep dive testing prior to simulation which created a more complete set of information during the simulation
Command and Control | Strong leadership and engagement from key decision makers during scenario is critical for effective implementation of recovery solutions
LOB Interdependencies | Cross-LOB discussion is key; ensure business units are comfortable with the recovery strategies of other business you rely on
Recovery Facilities | Several LOBs rely on one site's recovery seats, need to determine cross unit prioritization

Table 6: General Themes from Organization Results

As part of the exercise, action items were developed by each LOB on lessons learned from the exercise. The following table is an excerpt from different LOBs on their specific action takeaways from the exercise. Details indicating specific teams or data types have been excluded.

# | Lesson Learned | Planned Remediation
1 | During this exercise we identified that a set of data critical to one group does not have recovery in place. | Will work with the group and identify the type of recovery they would like to deploy and take the necessary step to put the recovery in place.
2 | Primary recovery strategies involve site facilities recovering to other sites located nearby. Unfortunately, while the facilities are officially in different zones, the facilities are still in close proximity and it is conceivable they could be impacted by the same event, as was the scenario for this test. | For LoBs that have this situation, they need to formally look for alternate recovery solutions which take into account all the facilities in one area.
3 | Processes with reliance on incoming mail should be reviewed to determine if there is a suitable plan to receive mail if the building is inaccessible. | Review applicable processes and update plans where applicable.
4 | Create Crisis call information template and establish within Crisis plan | Common framework for crisis communication updates needs to be developed to provide and level set expectations from business units. Once developed at company wide level, each team should include framework into plans
5 | Regular testing not performed by certain groups at some of the impacted offices. | Schedule testing to ensure there are no connectivity issues in the event we need to relocate.
6 | May want to look into capturing an Internet Service Provider in case a situation occurs where some work from home is impacted by network outages. | Will discuss with management to see if/how this can be accomplished.
7 | Verify best effort steps, to help LOB move production when both production and recovery sites are down based on the same disaster scenario. | Review Plan
8 | Business area managers will take responsibility for updating their own plans with items identified as needed in their plans during the exercise. | Business Area manager in the process of updating plans

Table 7: Lesson Learned from Organization Business Units

Results

The main storyline was a localized technology outage challenge, spurred by a cyber-attack on routers and severe weather from the previous night. The script presented a cascade of outage problems that began with seemingly unconnected applications,

eventually spreading to indicate infrastructure outages. Critical third-party network providers were down for an unknown amount of time, isolating business processes. The following sections describe the results of this exercise.

Presentation of Scenario to Participants

The scenario was conveyed to participants through a myriad of production crisis alerts and in-room script readings. The first of three high-fidelity videos showed various employees unable to complete their work due to network outages. Certain locations were indicated as blacked out, with the usual coordination lines disabled. The second video showed a local newscast concerning the outages, to demonstrate that company public relations would need to be addressed. At the simulated three-hour outage mark, when causes and effects were still unclear, business leaders from each LOB were sent to breakout sessions. In the sessions they worked with their technology teams to ensure business processes continued regardless of the technology disturbances. Upon return to the command center from breakout sessions, a roundtable discussion was conducted for each business leader to summarize their team's continuity plans and resource activation.

The scenario was then fast-forwarded to five days later by PowerPoint slides and in-room script readings. Participants were informed that the outages were caused by a hacker attempting to take advantage of power outages from inclement weather. Outages continued for five days, but third-party providers were able to get the network up and running. A third video showed a local newscast reporting on the business status. Participants were given another slew of injects via alerts and phone calls.

Business leaders were again sent to breakout sessions with their teams to discuss and plan. Upon return to the command center, another roundtable discussion was conducted to summarize LOB actions at the five-day mark. Finally, in the last moments of the exercise, the participants were invited to reflect on the exercise and provide their greatest lesson learned.

Participants were told to identify interdependencies between business units and to notice other LOBs' execution of business continuity plans. They were instructed to accept the scenario as plausible, as seemingly implausible real events had disrupted operations in recent years. Organizers stressed to participants that the RBS was not a test to determine rights and wrongs but an exploration of existing plans. They were encouraged not to hide findings when the scenario challenged plans, but to share these experiences explicitly and describe the alternative courses of action developed.

Organizer Decision Making

The decision to expand the RBS to two groups of participants meant two run-throughs of the script. The purpose of Day One was for experts closer to the scene of action to work through the master script in preparation for their support role during Day Two. While running through the outline, participants recorded key figures and statistics concerning impacted infrastructure, functionality, business units and customer agreements while noting cross impacts. These facts were then translated into detailed injects for Day Two participants. When Day Two participants questioned Day One participants in the breakout sessions for more information, these figures were kept readily

available. This test format enabled Day Two scenario execution to be carried out within a three-hour time frame. LOB-specific injects and readily available detailed answers to business leader questions contributed to exercise fidelity in the relatively small amount of time executives scheduled to take part in the exercise. Otherwise, instead of spending time making decisions, business leaders and teams would have spent time gathering information.

In organizing this exercise, design decisions such as these were made in seconds while dialed into a phone conference.

Example Exchange:
Should we make the Day One exercise 3 days long instead of one, giving us time to analyze our business continuity plans in the midst of production?
[Short pause]
Yeah I think that wouldn't be a problem. We'll get the invites changed today.

Example Exchange:
(Resilience Manager) wants more color in the injects. Do we have third party vendor interactions we could add?
[Short pause]
Yeah I'll check the vendor logs, that's a good idea. Any way we can get them on site for the exercise?

These exchanges do not illustrate planning deficiencies but rather the pressures on resource time for all organizers. One organizer even mentioned the constraint of production pressures while his team would be taking part in the exercise. Many decisions critical to meeting the objectives of the RBS were made on the spot, with organizers quickly running through critiques of options that, given more time, would have been weighed more carefully against exercise objectives.

2.3.5 Discussion

The following subsections present the analysis related to the RBS only. First, an analysis of the planning process is conducted, followed by an analysis of the organization's results. In the next chapter, global analysis of both exercises is discussed with regard to business continuity, organizational resilience and capturing how an organization prepares for disturbances.

Planning Process

Cognitive Systems Engineering has a long history of conducting research in complex domains utilizing effective staged and scaled world design techniques to support, explore and illustrate the critical cognitive challenges of practitioners (Voshell, 2009). What happens when organizational teams balancing multiple agendas run a similar exercise? In this exercise, a few elements clearly differ from an RBS run by cognitive systems engineering researchers. Business organizational teams prioritize much less time for planning and coordinating overall, have competing job-related tasks, are often less familiar with recent academic publications on the subject and have less experience with the logistical requirements of large-scale exercises. These elements ultimately change the framework for the exercise and limit the learning that can be gained from it.

Research in cognitive systems engineering largely supports the planning and designing of staged world experiments as deliberate development and use of probe events, pacers, artifacts, injects, scripts, stimuli and logistical layouts (Woods &

Hollnagel, 2006; Voshell, 2009; Branlat, 2011; Woltjer, Trnka, Lundberg & Johansson, 2006). As discussed previously, decisions concerning planning and design were made in seconds during phone conferences. Conducting RBSs naturally seems less urgent than the production requirements of the moment, as often the absence of failure is taken as a positive indication that hazards are not present or that countermeasures are effective (Woods, 2005). However, organizations need to balance goals of high productivity and resilience given the certainty of the changing world related to business continuity planning. The organization did provide resources to participate in the scenario; however, management maintained concern for daily production handling, which in turn competed with organizing the scripts, injects and logistics for the RBS.

In every large-scale observation, there is a risk that logistical issues will overcome the opportunity to learn. Voshell (2009) explores a future-case field observation event that illustrated the challenges of conducting a large-scale exercise while balancing stakeholder goals. The exercise failed from logistical breakdowns across multiple levels that ultimately resulted in exercise fragmentation. In Voshell's example field observation, there was a lack of explicit planning, visible organizers, and pacing that updated the state of the simulated world. Without attention to recent published work on proper execution to maximize learning from an RBS, and/or experience with running these events, the risk of failure is high.

The stakeholders in this exercise were not specifically labeled by the business or brought to the table as suggested by Voshell (2009). One outcome of running this RBS

was that the business leaders began to reflect on who the main stakeholders of the scenario might be, as addressed in the after-action roundtable. One business leader commented that the ultimate stakeholders in any action by the organization are customers. The lack of stakeholder clarity, the realization of which may have been dormant, did contribute to challenges in the design, planning and execution processes.

People involved in scenario design and structure played multiple roles in exercise preparation and execution. Day One participants were scriptwriters and stakeholders, as they received valuable insights into their processes while writing and developing relevant injects. When conducting large-scale exercises, key personnel become stakeholder groups who commit resources, benefit from learning and are the participants. Most Day One participants became invested in the exercise in all three stakeholder categories.

Lessons Learned

The LOBs submitted their main lessons learned and a planned remediation for each lesson. An excerpt from the LOBs, exhibited in Table 7, demonstrates key learning points the organization considered important. These learning points provide insight into how the organization reflected on the exercise. The exercise was considered a success, and during the roundtable open remarks many business leaders said participation in the exercise was valuable for various reasons. It was left up to participants to determine what was interesting during the day related to their LOB and then to develop action plans. The following paragraphs discuss their lessons learned.

Rows one (concerning data recovery) and five (concerning regular testing) demonstrate the ability of the scenario to highlight inefficiencies in planning recovery sites. Some plan inefficiencies were known by organizers and the scenario targeted these specifically, while other inefficiencies were highlighted inadvertently. The exercise identified various components of the infrastructure that did not have recovery sites at all. In a few instances, recovery sites did exist; however, the ability to recover had not been tested recently enough to ensure business continuity.

Rows two (concerning recovery strategies) and six (concerning network outage impact) demonstrate the ability of the scenario to disrupt existing recovery plans and highlight bottlenecks. Single points of failure are often bottlenecks. Recovery sites located in the same building or even within close proximity become single points of failure. LOBs have remediation plans to address these situations, but will they develop remediation plans to review their infrastructure across the board? Will LOBs generalize from searching for single points of failure within the scope of the scenario to looking for single points of failure at other sites or even other types of dependencies?

Row three (concerning reliance on mail) demonstrates the ability of the scenario to highlight dependencies on third parties. Vendors, suppliers, government systems and network providers all experience challenges to business continuity, which in turn challenge the processes of the observed organization. For example, one particular LOB relies on receipt of mail for some critical operations and through the exercise determined there was no recovery plan if this system failed.
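The question raised for row two, whether LOBs will look for single points of failure beyond the scope of the scenario, can be made concrete with a simple check. The sketch below is an illustration only: it flags production and recovery site pairs that sit closer together than a chosen separation, regardless of how the sites are officially zoned. The site names, coordinates, recovery plan and 50 km threshold are all hypothetical assumptions and do not come from the organization studied.

```python
import math

# Hypothetical facilities: name -> (latitude, longitude) in degrees.
SITES = {
    "primary_a": (40.00, -83.00),
    "recovery_a": (40.05, -83.02),  # a different zone on paper, but nearby
    "recovery_b": (41.50, -81.70),
}

# Hypothetical recovery plan: production site -> designated recovery site.
RECOVERY_PLAN = {"primary_a": "recovery_a"}

EARTH_RADIUS_KM = 6371.0


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def shared_fate_pairs(plan, sites, min_separation_km):
    """Return (production, recovery) pairs closer than the chosen separation."""
    return [(prod, rec) for prod, rec in plan.items()
            if distance_km(sites[prod], sites[rec]) < min_separation_km]


if __name__ == "__main__":
    # 50 km is an arbitrary illustrative threshold, not a recommendation.
    print(shared_fate_pairs(RECOVERY_PLAN, SITES, min_separation_km=50))
```

Distance is only one kind of shared dependency. The same pattern could be applied to a shared network provider, power feed or mail facility, which is the kind of generalization the questions above ask LOBs to consider.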

Row four (concerning crisis communication) demonstrates the ability of the scenario to challenge crisis management methodologies. When business leaders reviewed the effectiveness of roundtable updates, they were troubled by both too much detail and not enough detail in the presentations. Too much detail took too much time to present during precious minutes, while too little detail would inhibit discovery of interdependent issues or learning from each other's actions. It was decided that a common framework for crisis communication would help set expectations for sharing and establish a balance between too much information and not enough. The organization should not emphasize the framework or a fill-in-the-blank template as the solution to this challenge. Instead, the organization should train business leaders to reflect on which pieces of information should be shared, rather than giving a chronological list of "we did this and then this and then this." Important information will change according to context. For example, a practice from resilience engineering would suggest reporting the conclusions of difficult tradeoff decisions, especially when actions affect other LOBs.

Rows seven (concerning same-site production and recovery) and eight (concerning updating contingency plans) demonstrate use of the activity surrounding the RBS to update plans, a task to which resources might otherwise not have been delegated. In addition, the scenario provides some ideas for business leaders to test their recovery plans. For example, business leaders might think of cross-LOB impacts and third-party impacts. They might conjure situations in which the plan will break and then evaluate whether the plan should be changed to accommodate them.

Organizers conducted phone conferences to continue discussing lessons learned in the weeks following the RBS exercise. These conversations often turned to fine-tuning logistics for RBS execution rather than reflection on the lessons learned by the LOBs. Reflection on the lessons is more challenging and would require time that was not prioritized in busy schedules. It is likely that the lack of observation into LOB activities during the event contributed to the lack of analysis performed on the value of lessons learned.

Many comments were made on the fidelity of the scenario, for example how the roundtable during exercises may be carried out differently than real-time roundtables during crisis events. How should organizers keep the ability to observe the exercise and maintain scenario fidelity to the real world? This question is a basis for the cognitive systems engineering design of joint cognitive systems studies, as efforts to study cognition generally disrupt the current system. In the design of staged world studies, decisions concerning fidelity are resolved when the creation of the staged world is explicitly related to the stakeholder objectives of interest.

Overall, many lessons learned targeted by the organization are based on improving contingency plans and attempting thorough anticipation. The organization considered the exercise a success; however, did it increase resilience? The next chapter explores resilience engineering of vital digital infrastructure beyond these concepts in order to ensure business continuity for the customer.

Chapter 3: Discussion: Study and Resilience

"The complexity of technology is exploding around us but in ways that remain largely hidden. Modern institutions and technologies facilitate robustness and accelerate evolution but enable catastrophes on a scale unimaginable without them (from cascading failures in networks to market crashes, war, epidemics, and global warming)." - Alderson & Doyle, 2010

This chapter is the primary discussion of the exercises as they relate to resilience. It begins with a section proposing opportunities for business leaders to reflect upon resilience in their operations. Then, the exercises in Chapter Two are used to demonstrate general concepts of resilience. Business leaders may use concepts from studies of complex adaptive systems and resilience as a launching point to determine sources of resilience or brittleness.

3.1 Recommendations for Effective Exercises

This section describes some frameworks and concepts that business leaders may use to begin reflecting upon their business continuity strategy. Vital digital infrastructure provides flexibility for a company to operate and yet can be a source of vulnerability and brittleness. Neither exercise presented in this thesis alone provides the holistic picture of how to prepare for disturbances. Together they provide the opportunity for an organization to better prepare for disturbances and assess resilience. The following

paragraphs suggest three recommendations to consider for effectively managing a system for resilience:

1. Simply conducting exercises is not enough for a company to become resilient.
2. Recognize the difference between a testing environment and real world variation.
3. Appreciate the role of interdependencies.

Simply Conducting Exercises

Simply conducting exercises is not enough for a company to become resilient. So much work goes into orchestrating an exercise at any scale that many organizations assume the return is valuable learning. Maximizing learning potential is more important than conducting a logistically successful exercise, even though a logistically sound exercise is necessary for learning to occur. Maximizing learning depends on defining clear objectives from stakeholders in the exercise and seeing those objectives through the planning, preparation, execution and evaluation stages. Voshell (2009) extensively demonstrates ways to create learning labs when designing and conducting exercises.

Any exercise is contextual in the time, place and scale where it is conducted. The immense value in conducting exercises emerges when participants and organizations are able to generalize the lessons learned to the broad array of real world variations. For example, a participant may notice the uncommon configuration of an application/server/database system during an exercise. How much more valuable it might

be if the participant were encouraged to generalize the uncommon configuration beyond the context of that particular exercise event.

It is evident that LOBs had specific takeaways from the exercise; however, because these were not specifically captured, the question remains whether local learning is generalizable to other scenarios or even across LOBs. Besides the end-of-exercise lessons learned roundtable, very little attention was paid to studying cognitive work during the exercise. Those who participated in the exercise may have learned a great deal, but the lessons were not collected or combined in a meaningful way to produce valuable results that go beyond local learning (Voshell, 2009). This is no easy task for employees with various other job demands competing for time, as learning opportunities have to be extracted from complex distributed teams in multiple locations in order to analyze large scale team performance.

After the exercise, common practice is to focus on the failures during the exercise. Resilience assessment includes what makes things work as well as what makes things break. The exercise evaluation should include an eager analysis of what did work, and an attempt to understand why. Again, for full learning potential from the exercise, this analysis should be generalized beyond the immediate entities involved. By studying the exercises in these ways, business leaders will gain a better understanding of how these complex systems actually work with regard to decision making and automation.

Erik Hollnagel (2010) proposes that the engineering of resilience comprises four capabilities: the ability to respond to events, to monitor ongoing developments, to anticipate future threats and opportunities, and to learn from past failures and successes alike.

Exercises may help assess the presence of these capabilities as long as an organization goes beyond simply conducting the exercise.

One way to move beyond simply conducting exercises, and beyond assuming that learning has occurred past simple lessons learned, is to ensure the role of observer is present. An experienced observer understands the concepts in resilience (Sections 3.3, 3.4, and 3.5) and is familiar with the specific-to-general transition. The observer may play other low-key roles in the exercise, with a main objective of understanding exercise events. Voshell (2009) reserves an entire chapter for observation techniques and shaping conditions of observation. The observer may also conduct a digital infrastructure assessment with regard to Hollnagel's (2010) four capabilities of resilience. Hollnagel proposes the use of the Resilience Analysis Grid (RAG) to measure the resilience of a system.

Test Environment versus Real World

Recognize the difference between a testing environment or controlled shutdown and real world disturbances. Just as simply conducting an exercise will not increase resilience, treating performance in testing environments as the absolute indicator of performance in real world disturbances is incorrect. Performance during exercises and tests can provide assessment and evaluation opportunities. However, the very actions taken to make these exercises safe to perform, such as conducting them off hours, inhibit the fidelity of the exercise. Care should be taken, with regard to exercise design, when extrapolating exercise activity to real world activity.

For example, the controlled environments of DR shutdowns include weeks of preparation and impact analysis. Performance in these exercises is an indication of performance in dealing with natural catastrophes, not an absolute indication of successful recovery. In a real world datacenter crisis, there is no time for proactive impact analysis, and the shutdown has external rather than internal causes, which may include downstream effects. The most experienced staff may not be available. Also, testing environments often include one datacenter at a time. Datacenters located in close proximity have a high probability of going down together when disturbances of spatial scale occur. Do successful exercises for each datacenter shutdown indicate a successful recovery when both datacenters shut down at once? Not necessarily.

Within a single test or exercise, the participants may become painfully aware that the test environment is different from real world variation. The term "controlled shutdown" might be a misnomer. When infrastructure is shut down intentionally, real world variation will creep in, instigating unknown effects. What was once a low risk exercise may become an actual risk of customer impact if participants cannot maneuver the technology back onto the test plan or reconfigure.

While some exercises are higher fidelity, such as running through the actual shutdown of technology, others are lower fidelity in some aspects due to practical factors such as the scale of implementation. Lower fidelity exercises have different objectives, such as offering opportunities for training or practice rather than measuring success (Smith, 2010). Even where it may at first appear more obvious that the testing environment differs from the real world, in analysis of the experiment it is still important to step back and reflect upon this

phenomenon. In the large scale exercise described in Chapter Two, the implementation plans of business leaders in response to the scenario were presented in a roundtable discussion. There was very little challenge or pushback between participants as to whether the implementations would be feasible. These even included temporary relocation of many crucial business operations, the success of which was assumed for the sake of the exercise schedule.

Appreciate Interdependencies

Managing for resilience includes an appreciation for the role of interdependencies. System practitioners commonly attempt to perform root cause analysis to diagnose a single cause of failure. The practice of root cause analysis may actually impose barriers to understanding complex systems. Acknowledging the effects of interactions is more accurate than assuming a linear sequence of causation. In a complex system of interdependent components, distinguishing one cause of failure from other contributors is incorrect (Woods et al., 2010). The following phrases may help system users realize they may be dealing with interdependencies:

"Whole is greater than the sum of its parts"
"Unintended consequences"

When evaluating systems, one must take into account latent factors and multiple contributors when attempting causal analysis. Multiple small (or large) actions or

mistakes combine to create system failures of large consequence. Sometimes, if any one of the proximate causes is removed from the chain of events leading up to the event, then the event does not occur. Interdependencies contribute to emergent phenomena not apparent from component analysis.

Simply appreciating the role of interdependencies in causal analysis gives engineers the ability to formulate a more realistic system model. If digital infrastructure managers keep the role of interdependencies in mind when performing work on the system, the probability of successful implementations with fewer downstream consequences increases. Engineers would evaluate how the interdependencies may impact downstream operations before taking actions, no matter how small. This type of thinking and analysis, rather than linear root cause analysis, will enable engineers to create more accurate mental models of the system.

Appreciating interdependencies will also highlight links between seemingly unconnected functional units. For example, LOBs seem unconnected based on functionality for the customer and management hierarchies, yet they depend on the same company brand and will suffer common consequences or rewards due to brand perception. To unearth dependencies between separate functional units, look for shared resources and networked components.

The three recommendations presented in this section may help business leaders begin to think of exercises in crisis management and business continuity within the framework of resilience. For those readers interested in further discussions of resilience of vital digital infrastructure, the next two sections demonstrate digital infrastructure as a complex adaptive system. Then, the final section of Chapter Three describes the three

ways adaptive systems fail, as presented by Woods (2010), with examples from the exercises. Acknowledging these concepts in overall business delivery will help in designing a strategy for business continuity that helps people cope with an uncertain world. Simply demanding business continuity is infeasible, as eventually tradeoff decisions concerning resources, space, etc. will need to be made. Resilience engineering helps business leaders make decisions concerning business viability in the tradeoff spaces.

3.2 Vital Digital Infrastructure Modeled as an Adaptive System

To begin modeling an organization's governance of vital digital infrastructure, the concept of the sharp and blunt ends of systems will be used to describe echelons of management (Figure 9, see Woods et al., 2010). A digital infrastructure system is a cognitive system, with people supervising and managing the technology (Woods & Hollnagel, 2006).

At the sharp end of a complex system, practitioners, such as pilots, spacecraft controllers, and, in medicine, nurses, physicians, technicians, and pharmacists, directly interact with the process under control. They are responsible for handling variations in the world in order to reach goals. The resource teams effectively running DR tests represent the sharp end of the technology infrastructure system.

At the blunt end of the system, regulators, administrators, economic policy makers, and technology suppliers control the resources, constraints, and multiple incentives and demands that sharp end practitioners must integrate and balance to accomplish goals (Woods, 2006). The business leaders and administration managing the individual LOBs constitute the blunt end of the technology infrastructure system.

Figure 9: Blunt versus Sharp End (Adapted from Woods & Hollnagel, 2006)

The RBS exercise was conducted with management and business leaders at the blunt end of the organizational hierarchy, who relied on sharp end practitioners and engineers for sharp end knowledge of the system. Figure 10 shows the relationship of the exercises to the blunt versus sharp end framework.

Figure 10: Blunt versus Sharp End with Exercises Labeled

But how are the exercises related to one another? To evaluate dynamics between the LOBs, resource teams and the exercises, another dimension is needed in the representation. The organization (Figure 11) governs the goals for the blunt end, which then translate into objectives for the sharp end. The rest of this chapter will use this base model to show vital digital infrastructure as a complex adaptive system and to describe three common ways adaptive systems fail.

Figure 11: Organization Governs Blunt and Sharp Ends

Adaptive Capacity

Adaptive behavior is imperative for system viability. If the system is responding to customer desires in order to turn a profit, then it needs to adapt to changing customer desires. If a system must endure disturbances from inclement weather, then it needs to adapt to the weather to avoid collapse. Every system's ability to adapt or absorb change has limits. Every designed system involving people makes assumptions about the types of changes the system can expect. Changes that are predictable are incorporated into the system's base adaptive capacity. Unplanned changes are unforeseen and require the system to take extra measures for adaptation (extra adaptive capacity). The following analysis uses the two exercises to draw a distinction between strategies for predicted challenges and strategies for unforeseen challenges in adaptive systems such as the management of vital digital infrastructure.

Ways to characterize and measure an organization's resilience can be based on an analogy from the world of materials engineering: that of the relationship between stress (the varying loads placed on a mechanical structure) and the resulting strain (how the structure stretches in response), as described by Woods and Wreathall (2008). The curve (Figure 12) can be described as two regions: the uniform region, where the organization stretches smoothly and uniformly in response to an increase in anticipated demands; and an extra region, where sources of resilience are drawn on to compensate for non-uniform stretching (risks of gaps in the work) in response to increases in unforeseen demands (Woods & Wreathall, 2008).

Figure 12: Stress-Strain State Space with Restructuring (Woods & Wreathall, 2008)
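The two regions of the curve can be made concrete with a small sketch. The function name `classify_region` and the thresholds `uniform_limit` and `failure_point` are illustrative assumptions; they do not come from Woods and Wreathall (2008) or from the observed organization, and a real assessment would not reduce demand to a single number.

```python
def classify_region(demand, uniform_limit, failure_point):
    """Place a demand level in a region of a simplified stress-strain curve.

    All thresholds are hypothetical and only illustrate the analogy.
    """
    if not 0 <= uniform_limit <= failure_point:
        raise ValueError("expected 0 <= uniform_limit <= failure_point")
    if demand <= uniform_limit:
        return "uniform"   # anticipated demand: the organization stretches smoothly
    if demand < failure_point:
        return "extra"     # unforeseen demand: extra sources of resilience compensate
    return "failure"       # capacity exhausted unless the system restructures


# Hypothetical demand levels for illustration only.
for demand in (40, 85, 120):
    print(demand, classify_region(demand, uniform_limit=70, failure_point=110))
```

The point of the sketch is only that the same increase in demand has different consequences depending on which region the organization is already operating in.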

3.2.2 Base Adaptive Capacity

Disaster recovery tests enable the firm to deliver normal operations without inhibition from commonly faced perturbations. Suppose a server reaches a high CPU threshold, requiring DR invocation. System processes are moved according to the DR plan and carried out at the secondary location so system performance is sustained. DR is maintained until the primary server is repaired, additional processing capacity is added, or other contingency plans are complete. This process is mostly uniform across platforms. By testing DR plans, the firm knows which applications have effective contingency plans when perturbations occur. After DR tests are carried out successfully in a controlled environment, the system can be relied upon to handle similar demands without further challenges.

The DR tests are exemplary of Base or First Order adaptive capacity testing. All systems have some capacity to adapt to changing demands built into the plans, procedures, and roles designed into the system. The costs of stretching to meet these forms or levels of demands are built into how the organization operates (Woods & Wreathall, 2008). The resource teams execute tests to prepare for typical disturbances in normal operations. It is relatively common in the industry for hardware or software failures to require invocation of DR plans.

Woods (2006) determined that organizations with developed plans, procedures, training, personnel and related operational resources that can stretch uniformly as demand varies are operating in the uniform response region. This is considered the on-plan performance area. DR tests enable the firm to stretch smoothly in the uniform region of the stress-strain state as a response to increased demands. This movement to secondary DR location

is fairly typical within a week's time, considering the high number of applications governed by the team. Often business operations are not affected by invoking DR, as DR testing has enabled consistent, successful adaptation to these events.

Second Order/Extra Adaptive Capacity

From the perspective of the observed organization, the goal of running an RBS is to analyze the performance of business continuity efforts when surprises occur. A surprise or challenge event is any type of disturbance that presents challenges to business as usual. As described in the DR test, some disturbances happen so often that mitigating them becomes business as usual. The disasters tested in RBSs are the type of disturbance that interrupts contingency plans and challenges the decision making abilities of senior business leaders. For example, in the exercise described previously, recovery sites prescribed in contingency plans were also impacted by the technology outages. An RBS tests the organization's ability to respond to changing demands, as people need to coordinate new decisions or resources in order to change their response effectively.

From the stress-strain analogy, an organization's ability to stretch to respond to unforeseen demand is related to Second Order or Extra Adaptive Capacity. This is the section of the curve where more compensation is required in the non-uniform stretching region in order to avoid the failure point. The compensation requires more work, drawing on additional resources, or business leaders creating new strategies on the fly to keep up with the demand. According to Woods and Wreathall (2008), the system will continue to cope with increasing demands until either the second-order adaptive abilities are

exhausted and the system fails, or until the system reorganizes and functions in a new restructured mode. The RBS exercise was designed specifically to challenge adaptive response to demand and the need to restructure to avoid failure.

The question remains whether the restructuring efforts of the business leaders would indeed result in successful business continuity. Since the outage and response were only a simulation, the success or failure of business continuity is not measurable. However, the RBS highlights the ability of LOBs to transition from the uniform to the non-uniform region. To recognize gaps in the process, LOBs should be aware of their actions in the different regions and be taught to recognize when the strain is transitioning to severe levels.

3.3 Three Ways Adaptive Systems Fail

Conducting exercises to test business continuity plans at various scales in the organization, and using system perspectives to analyze them collectively, will help an organization monitor overall changes in adaptive ability. The stress-strain analogy provides one framework to integrate the exercise analyses into one picture. For organizations wanting to explore resilience engineering as a strategy to outlast competitors and ensure business continuity, this section is the key.

What should an organization search for when managing an adaptive system? Woods and Branlat (2011) have identified three basic patterns of adaptive system failures based on studies conducted in healthcare, mission control, military operations and urban firefighting.

Three Ways Adaptive Systems Fail
Decompensation
Working Across Purposes
Getting Stuck in Outdated Behaviors

Table 8: Three Ways Adaptive Systems Fail (adapted from Woods & Branlat, 2011)

Results of this study can be used to illustrate each adaptive system failure pattern. Woods and Branlat suggest that adaptive systems usually correspond to more than one pattern and, as in the case of this study, all three patterns are often at play.

Resilience engineering suggests that balancing and monitoring tradeoff decisions will help monitor each failure pattern. When it comes to ensuring business continuity, tradeoff decisions are eventually required (Hoffman & Woods, 2011). For example, at some point it will become too expensive to continue allocating resources to remedy failure, and the organization must risk loss of continuity. In addition to each pattern of failure presented in detail in the following subsections, the respective key tradeoff decisions are highlighted.

Resilience engineering is an optimistic approach to managing complexity. It suggests that the human ability to reflect on actions enables systems to manage complexity and build adaptive capacity (Woods & Branlat, 2011). With complex adaptive systems such as digital infrastructure, simply demanding business continuity from employees and resources will ultimately lead to collapse and failure. This mindset does not take into account interdependencies, goal conflict or learning capabilities. Even experts analyzing impacts for weeks, such as in the DR test, will encounter surprises in execution.

3.3.1 Decompensation

Decompensation occurs when an adaptive system exhausts its capacity to adapt as disturbances challenge and cascade (Woods & Branlat, 2011). As examined in this study, as demands increase, the costs of coordination and resource activation increase. In vital digital infrastructure, if the recovery plan is impacted by a disturbance, does the organization reevaluate and communicate plans, or collapse?

Often before collapse, components and reserves are stretched to performance maximums and overall control deteriorates. To avoid decompensating, management can monitor how hard components are working to keep the system under control. However, this analysis takes trained people, resources, and time, which may or may not directly raise or save money for the company. Most organizations will admit they know the value of reflecting on performance and exercises but lack the ability to do so when acute pressures from the world are strong. Reflection takes production time away from both sharp and blunt end resources, as both perspectives are required to understand performance. Yet, if a company wants to be resilient, it must invest in knowing how to become resilient and how it can improve where it is lacking. The tradeoff at work here is long term goals versus short term pressures, or chronic versus acute goals (Figure 13).

Failure Pattern: Decompensation
Tradeoff Decision: Acute vs. Chronic

Figure 13: Organization Acute versus Chronic Goal Tradeoff

Recognizing adaptive capacity and managing it requires constant decision making about which capabilities to include in common testing and which are less important to assess continually. This particular RBS tested a technology outage, but many other outages may impact the company. Why not create more comprehensive scenarios to challenge decision makers? Applications are selected for DR sites based on certain criteria. Why not back up all applications with plans? Testing all possible disaster scenarios is not cost effective, and backing up every application is extremely expensive.

A resilient organization balances thoroughness versus efficiency when making decisions that impact or build adaptive capacity (Figure 14). Variation in the world will

never be encompassed by adaptive capacity without prohibitive monetary costs, so tradeoff decisions must be made.

Failure Pattern: Decompensation
Tradeoff Decisions: Acute vs. Chronic, Efficiency vs. Thoroughness

Figure 14: Organization Efficiency versus Thoroughness Tradeoff

Working Across Purposes

Working across purposes occurs when various functional units take actions that undermine the activity of another functional unit. This pattern is directly related to the complexity of the environment and the system itself. Actions taken in the realm of dependencies have implications at scales other than the level of the initial action. To manage working across purposes, organizations must realize how attainment of goals from a local

perspective may cause conflict or deteriorate performance for others or at the global perspective.

For example, the RBS targeted production sites and recovery sites within close proximity to experience network outages. Only a few site locations nearby were operable. Physical seats became scarce and in demand, which resulted in many LOBs requesting more seats than were available (a seat represents a workstation with a computer connected to the network). Multiple business leader participants mentioned physical seat-space constraints impacting their contingency plans. Discussions ensued regarding who might determine priorities for seats, especially when each LOB has unique, urgent needs that are separate from, yet dependent upon, those of the others. The seat allocation challenge is empirical support for working at cross purposes in joint systems.

In shared resource contexts, monitoring working across purposes means balancing margins of maneuver. Margins of Maneuver (Figure 15) are cushions of potential actions and additional resources that allow the system to continue functioning despite unexpected demands (Stephens, Woods, Branlat & Wears, 2010). Maintaining a margin sustains the capacity to adapt to the next disturbance. In this specific case, the margin of maneuver would be the excess workstations during business as usual, or the ability to create additional workstations at sites not affected by outages.
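The seat shortage can be read as a simple margin calculation. The sketch below uses hypothetical LOB names and seat counts (the actual figures from the exercise are not reported here) and the helper `seat_margin` is introduced only for illustration. It shows how requests that are each reasonable locally can jointly exhaust a shared resource.

```python
def seat_margin(available_seats, requests):
    """Return the shared margin left after all seat requests are granted.

    A negative result means the LOBs are jointly asking for more seats
    than exist, even if each request looks reasonable on its own.
    """
    return available_seats - sum(requests.values())


# Hypothetical numbers for illustration only.
requests = {"LOB A": 120, "LOB B": 90, "LOB C": 60}
available = 200

margin = seat_margin(available, requests)
oversized = [lob for lob, seats in requests.items() if seats > available]

print(margin)     # -70: combined demand exceeds supply
print(oversized)  # []: no single request exceeds supply on its own
```

A positive result would correspond to the cushion of spare workstations described above; a negative one signals that someone must set priorities across LOBs.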

Figure 15: Margin of Maneuver (adapted from Stephens, 2010)

Organizational units have a number of potential types of strategies for reorganization to maintain their individual margins of maneuver (Stephens, Woods, Branlat & Wears, 2011). Each LOB is a unit with individual goals for the firm. Each LOB's maintenance of its margin of maneuver may impact the ability of other LOBs to maintain margin. Care should be taken to avoid global mal-adaptation due to local adaptations.

The creation and maintenance of margins of maneuver is one strategy for resilience in the face of shifting demands. Adaptive capacity is tuned to the future. This requires viewing adaptations from more than one perspective, which are blunt vs. sharp
