This file contains the presentation foils for the Software Engineering 1 course (Part Ia/IIg/Dip). It is in ASCII for the benefit of blind and partially sighted students. Many thanks to Guita Ramsurun for typing these foils up.

These foils cover five of the six lectures. The remaining lecture will be given by Dr Robert Brady of Brady plc. He will discuss the aspects of software engineering that are of most importance in the development of package software.

1. SOFTWARE ENGINEERING
CST IA/IIG/Dip
ROSS ANDERSON

2. OUTLINE OF COURSE
* The 'Software Crisis'
* The Software Life Cycle
* Critical Software
* Quality Assurance
* Tools
* Large Systems

3. RESOURCES
* The newsgroup comp.risks
* `Software Engineering', R S Pressman
* `Safeware', N Leveson
Additional reading:
* `The Mythical Man-Month', F P Brooks
* `Computer-Related Risks', P G Neumann
* `Digital Woes', L R Wiener
* `Report of the Inquiry into the London Ambulance Service', SW Thames RHA
Recommend: wide reading in whichever application area(s) interest you (aviation, healthcare, banking, ...)

4. The 'Software Crisis'
* The reality of software development has lagged behind the apparent promise of the hardware
* Most large projects fail - either they are abandoned, or do not deliver the anticipated benefits
  - LSE Taurus GBP 400m
  - Denver Airport $200m
  - CONFIRM $160m
* Some software failures cost lives or cause large material losses
  - Therac 25
  - Ariane
  - Pentium
  - NY Bank
  - and Y2K in general
* Some combine project failure with loss of life, e.g. London Ambulance Service

5. THE LONDON AMBULANCE SERVICE SYSTEM
* Manual operation:
  - 999 calls written on forms; map reference looked up; conveyor belt to central point
  - controller de-duplicates and passes to NE/NW/S district
  - division controller identifies vehicle and puts note in its 'activation box'
  - form passed to radio dispatcher
* This takes about 3 minutes, and 200 staff (of 2,700 total). Some errors (esp. deduplication), some queues (esp.
radio), call-backs are laborious to deal with
* Attempt to automate in 1980s failed - the system failed load test
* Industrial relations poor - pressure to cut costs
* Decided to go for fully automated system: controller answering 999 call would have on-screen map and could send 'email' directly to ambulance
* Consultancy study said this might cost GBP 1.5m and take 19 months, provided a packaged solution could be found, and excluding an automatic vehicle location system (AVLS)

6. LAS (2)
* Idea of a GBP 1.5m system stuck. Idea of AVLS added. Proviso of packaged solution forgotten. New IS director hired. Tender put out 7/2/91 with completion deadline 1/92
* 35 firms looked at tender; 19 proposed; most said timescale unrealistic, and only partial automation possible by 1/92
* Tender awarded to consortium of Systems Options Ltd, Apricot and Datatrak at bid of GBP 937,463 - GBP 700K cheaper than next bidder
* Design work 'done' July; main contract August; mobile data subcontract September; told in December that only partial implementation would be possible in January - front end for call taking, gazetteer + docket printing
* Already in June 91, a progress meeting had minuted:
  - 6 month timescale for 18 month project
  - methodology unclear, no formal meeting program
  - LAS had no full time user on project
* Also observed that SO relied on 'cozy assurances' from subcontractors

7. LAS (3)
* Problems apparent with 'phase 1' system included client & server lockup
* 'Phase 2' introduced radio messaging.
Problems included blackspots, channel overload at shift change, inability to cope with 'established working practices' such as taking the 'wrong' ambulance
* System never stable in 1992, yet under management pressure the full system - with automatic allocation - went live on 26/10/92
* CE said 'no evidence to suggest that the full system software, when commissioned, will not prove reliable'
* An independent review had stated that volume testing was needed, with a written implementation strategy, change control and training. It was ignored.
* On 26 Oct, room reconfigured to use terminals not paper. Resource allocators separated from radio operators and exception rectifiers. No backup system. No network managers.

8. LAS DISASTER
* 26/27 October - vicious circle:
  - system progressively lost track of vehicles
  - exception messages built up, scrolled off screen and were lost to rectifiers
  - incidents held as allocators searched for vehicles
  - callbacks from patients increased workload
  - data delays
  - voice congestion
  - crew frustration
  - pressing wrong buttons and taking wrong vehicles
  - many vehicles sent, or none
  - slowdown and congestion proceeded to collapse
* Switch back to semi-manual operation on 27th. However, irretrievable crash at 2AM, 4 November due to memory leak: 'unlikely that it would have been detected through conventional programmer or user testing'
* The real reason for failure was poor management throughout

9. The Software Crisis
* Emerged during the 1960s when large and powerful mainframes (such as the IBM 360) made large and complex systems possible
* People began to ask why project failures, cost overruns and so on were so much more common than with large projects in civil engineering, aerospace engineering, ...
* The term 'software engineering' was coined in 1968; the hope was that by applying engineering disciplines such as project planning, documentation and testing, things could be got under control
* These techniques certainly help and we shall discuss them
* Firstly, let us look at how software differs from machinery, and where its unique problems and opportunities lie

9. WHAT MAKES SOFTWARE DISTINCTIVE?
* The features that make programming 'fun':
  - joy of making things that are useful to others
  - fascination of building puzzle-like objects from interlocking moving parts
  - joy of a nonrepeating task - continuous learning
  - delight of a tractable medium - 'pure thought stuff'
* The (related) things that make it hard:
  - the requirement of perfection
  - the need to satisfy user objectives, and conform with existing artefacts, standards, and interfaces, that are outside our control
  - larger systems become qualitatively more complex (unlike large ships or large bridges)
  - the tractability of software leads users to demand 'flexibility' and frequent changes
  - the structure of software can be hard to visualise or model
  - a lot of hard slog in debugging and testing which accumulates at the end of a project - when the excitement is spent, the budget is overspent and the deadline (competition) is looming

10. LONDON AMBULANCE SERVICE REPORT:
http://www.cs.ucl.ac.uk/staff/A.Finkelstein/las.html

11. (pictures)

12. THE SOFTWARE LIFE CYCLE
* The cost of owning a system is not just the development cost but the whole cost over its life cycle: development - testing - operations - replacement
* In the days of 'bespoke' software, it was common for 90% of an IT department's programming effort to be devoted to the maintenance of old systems rather than the development of new ones
* Most research on software costs and methods focuses on this business model.
We will discuss it in this lecture
* Different business models apply to safety critical and related software (lecture 3) and to package software (lecture 4). However many lessons learned in one model apply to the others too

13. COMMON DIFFICULTIES
* Although code doesn't 'wear out' the way gears do, both the platform and the application requirements change over time. Code becomes more complex, less well documented, harder to maintain, more buggy
* Its failure rate mirrors that of machinery (but for different reasons!)
(diagram - number of bugs falls initially, then is steady for a long period, then starts to rise again)
* When it is redeveloped (or developed for the first time), there are often unrealistic expectations of price versus performance (as hardware gets cheaper, software seems more expensive)
* Two of the main causes of project failure are requirements that are incomplete/changing/misunderstood, and insufficient time
* These and other factors lead to the 'tar pit' - any individual problem can be solved, but the number and complexity of them gets out of control

14. LIFE CYCLE COSTS
* Development costs (Boehm, 75)

                      Requirements/Spec   Implement   Test
  Command & Control          46%             20%       34%
  Space                      34%             20%       46%
  O/S                        33%             17%       50%
  Scientific                 44%             26%       30%
  Business                   44%             28%       28%

* Maintenance costs: typically ten times as much again
* By the late 60s it had become 'intuitively clear' that
  - well built software cost less to maintain
  - effort spent getting the specification right more than paid for itself by reducing the time spent implementing and testing, and the cost of subsequent maintenance.

15. WHAT DOES CODE COST?
* Common measure is KLOC (thousand lines of code)
* First IBM measures (60s):
  - 1.5 KLOC / man year (operating system)
  - 5 KLOC / man year (compiler)
  - 10 KLOC / man year (app)
* AT&T measures:
  - 0.6 KLOC / man year (compiler)
  - 2.2 KLOC / man year (switch)
* More sophisticated measures:
  - Halstead (entropy of operators, operands)
  - McCabe (graph complexity of control structures)
  - `function point analysis'
* Two lessons learned:
  - main gains come from using an appropriate high level language (each KLOC does more)
  - wide variations between individuals (>10X)

16. BROOKS' LAW
* 'The Mythical Man-Month' attacked the idea that men and months are interchangeable
* More people - more communications complexity (n people means n(n-1)/2 channels and 2^n cliques)
* Adding people - productivity drops as they are trained
* E.g., consider a project estimated at 3 men X 4 months
* Design takes 2 months not 1! So there are two months left to do work that was originally estimated at 9 man-months
* If time slippage disastrous, add 6 men. (Training takes 1 month so all the 9 man-months must be done in the last month.)
* However, the work that 3 men could do in 3 months can't be done by 9 men in 1 month (complexity, interdependencies, testing, ...). Hence:
* Brooks' Law: 'Adding manpower to a late software project makes it later'

17. BOEHM'S EMPIRICAL STUDY
* Brooks' Law was enunciated in 1975. It led to empirical studies, notably by Barry Boehm ('Software Engineering Economics', 1981):
  - The cost-optimum schedule time to first shipment, T (in months) = 2.5 times the cube root of the total number of man months
  - With more time, the cost rises slowly ('people with more time take more time')
  - With less time, the cost rises sharply
  - Hardly any projects succeed in less than 3/4T, regardless of the number of people employed!
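
(Worked sketch, not part of the original foils.) The two formulas above, n(n-1)/2 channels for n people and T = 2.5 times the cube root of the man-months, can be tried out in a few lines of Python. The function names are illustrative assumptions, and T is taken to be in months.

```python
# Sketch only: the function names are illustrative, not from the course.

def channels(n):
    """Pairwise communication channels among n people: n(n-1)/2."""
    return n * (n - 1) // 2


def cliques(n):
    """Possible subgroups (subsets) of n people: 2^n."""
    return 2 ** n


def optimum_schedule(man_months):
    """Boehm's cost-optimum time to first shipment, T, in months."""
    return 2.5 * man_months ** (1.0 / 3.0)


def shortest_feasible_schedule(man_months):
    """Hardly any projects succeed in less than 3/4 of T."""
    return 0.75 * optimum_schedule(man_months)


if __name__ == "__main__":
    # Growing the team from 3 to 9 people, as in the Brooks example
    print(channels(3), channels(9))  # 3 36
    # A 1000 man-month project: T is about 25 months, 3/4 T about 19
    print(optimum_schedule(1000))
    print(shortest_feasible_schedule(1000))
```

Going from 3 to 9 people multiplies the number of channels twelvefold (3 to 36), which is the communications overhead behind Brooks' Law.
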
* Other studies show that if more people are to be added, they should be added early rather than late
* Some projects have more and more resources thrown at them yet are never finished at all (e.g. CONFIRM); others are years late

18. STRUCTURED DESIGN
* The only practical way to build large programs is to divide them up into modules
* This enables the architect to control complexity
* Typically high level components/subsystems under control of project teams (e.g., general ledger, loans, tellers, ATMs, ...), with each of these divided into modules under control of individual programmers and testers (calculate interest, update file, ...)
* Often the subdivision of tasks is straightforward
* Sometimes it isn't
* Sometimes - worst case - it just seems to be!
* There are a number of methodologies (SSADM, Jackson, Yourdon, ...). Some are more data driven, others oriented towards functionality. We will discuss tools in more detail in lecture 5.

19. THE WATERFALL MODEL
* Royce, 1970; now a US DoD standard

  Requirements
    (drives)
  Specification
    (drives)
  Implementation & unit testing
    (drives)
  Integration & system testing
    (drives)
  Operations & maintenance

* Requirements are written in the user's language
* Specification is written in system language
* Unit testing checks units against the spec
* System testing checks the requirements are met

20. ELABORATION - FEEDBACK
* Validation operations provide feedback from Specification to Requirements and from Implementation/unit testing to Specification
* Verification operations provide feedback from Integration/system testing to Implementation/unit testing, and from Operations/maintenance back to Integration/system testing
* What's the difference between `validation' and `verification'?
* Validation: `are we building the right system?'
* Verification: `are we building it right?'
* It might seem logical to add another feedback path - validation - from operations and maintenance back to requirements
* However this would change the development model and erode much of its value....

21. ADVANTAGES OF THE WATERFALL MODEL
* It makes the project manager's task easier by providing definite milestones to aim at
* It enables the developer to make appropriate charges for changes to the requirements (each stage may be a separate contract!)
* It compels early clarification of system goals, architecture and interfaces, and is conducive to good design practices
* It is compatible with a wide range of tools and detailed design strategies
* Where it is applicable, it is usually the best approach
* The critical factor is whether the requirements can be defined in detail, in advance of any development or prototyping work. Sometimes they can (e.g. a compiler); sometimes they can't (e.g. a human-computer interface)

22. OBJECTIONS TO THE WATERFALL MODEL
* 'Reality isn't like that'
* Iteration is important in the software development process, especially where:
  - the requirements are not yet understood by the development team
  - the requirements are not yet understood by the customer
  - in some types of applications, e.g. interface development
  - the technology is changing
  - the legal environment is changing
  - the customer environment is changing, e.g. from one customer to many
* The quality improvement that a top-down approach can yield may be unimportant over the system lifecycle
* Specific objections from safety-critical and package software developers

23. WHERE IS THE LIKELIHOOD OF FAILURE HIGH?

  Requirements     Very High
  Specification    Low
  Design           Low
  Implementation   Low
  Installation     High
  Operation        Enormous
  Maintenance      Very High

24. REQUIREMENTS
Requirements are developed by at least two groups of people who speak different languages and who come from different disciplines.

25.
LOW RISK ACTIVITIES
Specification, Design and Implementation are done by a group of single-discipline professionals who usually can communicate with one another.

26. INSTALLATION
Installation is usually done by people who don't really understand the issues or the problem or the solution.

27. OPERATION
After a start-up period, Operation is almost always left to people who don't understand the issues, the ethics, the problem or the solution (and often understand little else).

28. MAINTENANCE
Maintenance is usually performed by inexperienced people who have forgotten much of what they once knew about the problem or the solution.

29. OPERATION IS THE BIG SOFT SPOT
Robert Courtney, a New York security consultant, examined thousands of security breaches in both industry and government and found that 68% of them were due to careless or incompetent operations.

30. A CAUTIONARY TALE
* In 1985, a large bank decided to replace a mixture of old systems with a centralised IBM mainframe
* Decided to buy in a retail banking package and customise it as they had 'no experience at specifying a next generation banking system'
* A proprietary variant of waterfall was adopted
* A user team prepared a list of requirements changes needed to adapt the package from its original US environment
* When the system was fielded in the first branches, people realised that these changes had made it functionally almost identical to the old system
* The many changes meant that the code was incompatible with the next release of the package
* 'Instant legacy system' at a nine-figure cost

31. ITERATIVE DEVELOPMENT
* Some systems need iterative development to clarify requirements
* Others can benefit from making operations as fail-safe as possible
* Naive approach

  Develop outline spec ----> Build system ----> Use system
                                  ^                 |
                                  |                 V
                                  ---- NO <---- System OK?
                                                    |
                                                    V
                                              Deliver System

* This algorithm needn't terminate (satisfactorily)
* Can we get a combination of the management benefits of Waterfall, with the flexibility of iterative development?

32. THE SPIRAL MODEL (Boehm, 88)
* fixed number of iterations of form: identify alternatives, assess + choose, build, evaluate
(diagram: presented as an outward spiral from the starting point, with successive iterations of each of these steps on the same radial)
* driven by risk management
* iterative prototyping applied to relevant parts of the system (e.g., human computer interface)

33. CRITICAL SOFTWARE
* Many systems have the property that a certain class of failures is to be avoided if at all possible
  - safety critical systems - failure could cause death, injury or property damage
  - security critical systems - failure could result in leakage of classified data, confidential business data, personal information
  - business critical systems - failure could affect important operations
* Critical computer systems have a lot in common with critical mechanical or electrical systems (bridges, flight controls, brakes, locks, ...)
* Start out by studying how systems fail

34. EXAMPLE - PATRIOT MISSILE
* Failed to intercept an Iraqi SCUD missile on 25/2/91; SCUD struck a US barracks in Dhahran
* Other SCUDs got through to Saudi Arabia, Israel
* Reason for failure:
  - measured time in 1/10 sec, truncated from binary representation .0001100110011....
  - as system upgraded from anti-aircraft to anti-missile, greater accuracy introduced - but not everywhere in the code
  - two modules got out of step by 1/3 sec after 100 hours operation. Target not acquired
  - defect not found in testing as the spec called for 14 hour continuous operation only
* Many critical systems failures are multifactorial: 'a reliable system can't fail in a simple way!'

35.
DEFINITIONS
* error: design flaw or deviation from intended state
* failure: non-performance of the system within some subset of the specified environmental conditions
* fault: careful!!
    Elec. eng: (error -->) failure --> fault
    Comp. sci: error --> fault --> failure
* reliability: probability of failure-free operation over a set period of time. Sometimes expressed as 'mean time to (between) failure' - mttf (mtbf)
* accident: undesired, unplanned event that results in a specified kind (and level) of loss
* hazard: set of conditions of a system, which together with conditions on the environment, will lead to an accident (Thus, failure + hazard --> accident)
* risk: hazard level combined with: danger (prob. of hazard --> accident) and latency (hazard exposure or duration)
* safety: freedom from accidents

36. SYSTEM SAFETY PROCESS
* Obtain support of top management, involve users, and develop a system safety program plan:
  - identify hazards and assess risks
  - decide strategy for each hazard (avoidance, constraint, ...)
  - trace hazards to hardware/software interface: which will manage what?
  - trace constraints to code, and identify critical components and variables to developers
  - develop safety-related test plans, descriptions, procedures, code, data, test rigs, ...
  - perform special analyses such as iteration of human-computer interface prototype and test
  - develop documentation system to support certification, training, ...
* Safety needs to be designed in from the start. It cannot be retrofitted

37. REAL-TIME SYSTEMS
* Many safety critical systems are real time systems used in monitoring or control. They have particular problems
* Often, very extensive application domain knowledge is needed for design
* Criticality of timing makes many design verification techniques inadequate
* Exception handling is particularly problematic.
E.g., Ariane 5 (4/6/96):
  - Ariane 5 accelerated faster than Ariane 4
  - alignment code in IN (inertial navigation) set had an 'operand error' on float-to-integer conversion
  - core dumped
  - core interpreted as flight data
  - full nozzle deflection --> 20 degrees angle of attack --> booster separation --> self destruct

38. HAZARD ANALYSIS
* There may be several hazard categories. For example, the Motor Industry Software Reliability Association uses:
  - Uncontrollable: failures whose outcomes cannot be influenced by human response and are most likely to lead to extremely severe outcomes
  - Difficult to control: failures whose effects could, under favourable circumstances, be influenced but are likely to lead to very severe outcomes
  - Debilitating: effects usually controllable, with a reduction in safety margin; outcome at worst severe
  - Distracting: operational limitations, but a normal human response limits outcome to minor
  - Nuisance: affects customer satisfaction, but not normally safety
* Different hazard categories require different failure rates and different levels of investment in varying software engineering techniques

39. HAZARD ANALYSIS (2)
* In complex or high-risk systems, we may want hazard analysis to be much more structured. E.g., a nuclear-capable US Navy cruiser/destroyer missile programme had:
  - preliminary hazard analysis, leading to:
  - system hazard analysis - interfaces between components
  - operating hazard analysis - human machine interfaces
  - maintenance hazard analysis
  - computer program safety analysis
  - subsystem hazard analysis
  - radiation hazard analysis
  - nuclear safety analysis
  - inadvertent launch analysis
  - weapon control interface analysis
* In other words, a number of overlapping and interlocking studies that drive the safety programme

40. HAZARD ELIMINATION
* Many hazards can be completely eliminated by small changes in design.
E.g., motor reversing circuit in which the output of a battery is connected to the poles of a double-pole double-throw switch whose contacts are connected to the motor (if pole A has contacts 1A and 2A, and pole B has contacts 1B and 2B, where the state `1' indicates forward and `2' reverse, then contacts 1A and 2B are wired to one end of the motor and contacts 1B and 2A are wired to the other end)
* Hazard: if the switches don't move together (e.g. connection is made simultaneously with 1A and 2B) then the result is a battery short circuit, causing a fire in the battery
* Redesign: connect the motor to the poles and the battery to the contacts. That way, if there's a short circuit it's only the motor that gets short circuited, which is less likely to cause a fire
* The 'holy grail' is intrinsically safe software. However, this needs a system level approach as hazard elimination techniques usually involve more than just software

41. THERAC-25
* 25 MeV 'Therapeutic accelerator' with two modes of operation:
  - 25 MeV focussed electron beam on a target that generates X-rays for treating deep tumours
  - 5-25 MeV spread electron beam for direct treatment of surface tumours
(picture of machine over hospital bed in shielded room, with operator console outside)
(picture of turntable situated between patient and electron beam. This has a scan magnet for steering a direct beam, a target for generating X-rays, a mirror for alignment, a counterweight to the X-ray target, and microswitch actuators on the rim so the turntable position can be detected)

42. THERAC ACCIDENTS
* The focussed electron beam used in X-ray therapy has 100 times the beam current of the beam used in electron therapy; it is highly dangerous to living tissue
* Previous models (Therac 6 and 20) had fuses and mechanical interlocks to prevent the high intensity beam being selected unless the X-ray target was in place
* In the Therac 25, these safety mechanisms were replaced by software.
The fault tree analysis arbitrarily assigned a probability of 10^{-11} to 'computer selects wrong energy'.
* However, between 1985 and 1987, there were at least six accidents in which patients were directly irradiated with the high energy beam. Three died as a direct result
* Major factors were a poor human computer interface and very poorly written, unstructured code.

43. THERAC ACCIDENTS (2)
* Marietta, Georgia, June 1985: woman's shoulder burnt. Sued and settled out of court. Not reported to FDA, or explained
* Ontario, July 1985: woman's hip burnt. Died of cancer. AECL found that a 1-bit microswitch error might have caused it, but could not reproduce the fault. Software changed.
* Yakima, Washington, December 85: woman's hip burnt. Survived. 'Could not be a malfunction'
* Tyler, Texas, March 86: man burned in neck and died. AECL denied knowledge of any hazard
* Tyler, Texas, April 86: 2nd man burnt on face and died. Hospital physicist managed to recreate the fault: if the parameters were edited too quickly, the interlock was overwritten
* This had also happened with the Therac-20 but resulted in a blown fuse
* Yakima, Washington, January 87: man burned in chest and died - due to different bug thought now to have also caused the Ontario accident

44. THERAC - LESSONS LEARNED
* AECL ignored the safety aspects of software; assumed when doing risk analysis - and investigating Ontario - that hardware must be at fault
* Confused reliability with safety - since the software worked, & accidents rare, assumed it was ok
* Lack of defensive design - machine couldn't verify that it was working correctly
* Failure to tackle root causes - Ontario accident not properly explained at the time (nor was first Yakima incident ever!)
* Complacency - medical accelerators had a good safety record up till then
* Unrealistic risk assessments ('think of a number and double it')
* Inadequate reporting, follow-up and government oversight
* Inadequate software engineering practices (spec an afterthought, complicated design, dangerous coding practices, little testing, careless human interface and documentation design)

45. FAULT TREE ANALYSIS
* Idea: work back systematically from each identified hazard
(picture of tree. The root is `wrong or inadequate treatment administered'. Next level: `vital signs exceed critical limits but not corrected in time', `vital signs erroneously reported as exceeding limits', etc. Next level: `nurse does not respond to alarm', `vital signs not reported', `computer fails to raise alarm', `frequency of measurement too low', etc. Followed back to specific sensor failures, operator mistakes, design errors, processing faults, etc)
* This enables you to identify where the redundancy is, and which events are critical

46. FAILURE MODES AND EFFECTS ANALYSIS
* FMEA is the heart of NASA's safety methodology
* Look at each component's functional modes and list the potential failures
* Describe the worst-case effect on the system
    1 - loss of life
    2 - loss of mission
    3 - other
* Secondary mechanisms are used to deal with interactions
* Software not within this system. However, FMEA used on software by other organisations

47. REDUNDANCY
* Some systems, like Stratus & Tandem, have highly redundant hardware for 'non-stop processing'
(picture of four CPUs grouped in two pairs, each pair connected to two bus lines by a comparator. In the event of CPUs disagreeing, the comparator goes open circuit. Fault detection hardware then orders a spare board from the manufacturer.)
* But then the software is where things break
* The 'hot spare' inertial navigation set on Ariane 5 failed first!
* Idea: multi-version programming
* But: significantly correlated errors, and failure to understand requirements comes to dominate (Knight, Leveson 86/90)
* Also, many problems with redundancy management. For example, 737 crashes Panama/Kegworth

48. EXAMPLE - PANAMA CRASH
* When flying in instrument meteorological conditions, it is critical to know which way is up
* Traditional approach: artificial horizon plus turn-and-slip. Measure different things in different ways; can fly with either
(picture: traditional instrument panel. Artificial horizon driven by a 2-axis gyro powered pneumatically. Turn indicator driven by a captive gyro powered electrically. Slip indicator is a ball bearing in a U-tube)
* New generation of airliners also have Electronic Flight Instrument System (one each side)
(picture: large multifunction display, twice the size of a traditional artificial horizon, with side windows containing airspeed, altitude etc)
* You might think that this added redundancy, with a 'state-of-the-art' human computer interface, would be safer!
(cause of crash: plane flew with one EFIS gyro faulty, and drove both EFIS screens from single gyro. Redundancy thought adequate as both traditional instruments still functional. However when connector to remaining EFIS failed, crew believed erroneous reading - later simulations showed this a common error as the EFIS is twice the size of the artificial horizon and centrally located in front of the pilot)

49. EXAMPLE - KEGWORTH CRASH
* British Midland 737-400 left Heathrow 8/1/89 for Belfast with 8 crew, 118 pax
* Climbing through 28,300', a fan blade fractured in the #1 (left) engine. Caused vibration, shuddering, smoke, fire
* Crew mistakenly shut down the #2 engine and cut throttle to #1 to descend to East Midlands Airport. Vibration reduced, until throttle opened again on final approach
* Crashed next to M1 at Kegworth.
39 pax died in crash and 8 later in hospital; all but 5 of 79 survivors seriously injured
* Initial assessment: engine vibration sensors cross-wired by accident
* Mature assessment: crew had failed to assimilate information from new, digital, instruments
* Recommendations included human factors evaluations of flight systems, clear `attention getting facility', video cameras showing aircraft interior and exterior

50. `HUMAN ERROR' RATES
* Extraordinary errors - difficult to conceive how they would occur. Stress free, powerful cues to success: 10^{-5}
* Errors in regularly performed, common simple tasks with minimum stress: 10^{-4}
* Pressing wrong button / reading wrong display - complex tasks, little time, some cues necessary: 10^{-3}
* Dependence on situation and memory; unfamiliar task with little feedback and some distraction: 10^{-2}
* Highly complex task, considerable stress, little time: 10^{-1}
* Process involving creative thinking, unfamiliar and complex operations, time short and stress high: O(10^0)

51. MODES OF AUTOMATION
(a) Computer provides information and advice to controller, perhaps by reading sensors directly
(picture: operator in loop with displays, sensors, process, controls and actuators, computer off to the side or mediating some sensor input)
(b) Computer reads and interprets sensor data for operator
(picture: the computer is now in the control loop with the operator, between the sensors and the displays)
(c) Computer interprets and displays data for operator and issues commands; operator makes varying levels of decisions
(picture: there are now two loops, both of them incorporating the computer: one has the operator, the displays and the controls, while the other has the process, the sensors and the actuators)
(d) Computer assumes complete control of process with operator providing advice or high-level direction
(picture: the operator is now out of the loop, merely acting as a peripheral to the computer.
The computer is in the loop with the sensors, the actuators and the process)

51a MYTHS OF SOFTWARE SAFETY
* 'Computers are cheaper than analogue or electromechanical devices' - shuttle software costs $10^8 pa to maintain
* 'Software is easy to change' - but hard (and expensive) to change safely
* 'Computers are more reliable' - shuttle software has had 16 potentially fatal bugs found since 1980 - and half of them had flown
* 'Increasing software reliability increases safety' - perfectly functioning software still causes accidents
* 'Testing or formal verification can remove all errors' - exhaustive testing usually impossible, and proofs can have errors too
* 'Reuse increases safety' - using software in a new environment is likely to find more errors, e.g. F-16, ATC
* 'Automation can reduce risk' - potential not always realised, humans still need to intervene but may not be 'in the loop'

52. TOOLS
* We commonly use tools when some parameter of our task exceeds our native capability
  - heavy object: raise with lever
  - tough object: cut with axe
* Software engineering tools deal with complexity. There are two kinds of complexity:
* Incidental complexity dominated programming in the early days. E.g., writing machine code is tedious and error prone. Solution: high level language
* Intrinsic complexity of applications is the main problem nowadays. E.g., complex system with large team working on it. `Solution': waterfall/spiral model to structure development, project management tools, etc.
* We can aim to eliminate the incidental complexity but we have to manage the intrinsic complexity

53a INCIDENTAL COMPLEXITY (1)
* The greatest single improvement in programmer productivity came with the introduction of high level languages, starting with FORTRAN
  - 2000 KLOC/year goes much further in Java than assembler
  - code is easier to understand and maintain
  - more appropriate level of abstraction - data structures, functions, objects rather than bits, registers, branches
  - structure enables many typos etc to be found at compile time
  - code may be portable; at least, the machine specific detail is hidden. Device drivers etc can be written once only rather than embedded in each application
* Objections:
  - compilers have errors (but: programmers make more!)
  - performance (so: optimise only where needed)
* Performance gain (of programmers) 5-10 times
* Now that coding is about 1/6 of the total effort in a project, no similar performance gain is available from anything else

53b INCIDENTAL COMPLEXITY (2)
* Most advances since the early high level languages have focussed on helping the programmer to structure and manage his code
* Don't use `goto' (Dijkstra, 68); structured programming; Pascal (Wirth, 71)
* Basic idea: combining information hiding with `proper' control structures facilitates stepwise refinement and correct abstraction
* Object-oriented programming
  - Simula (Nygaard, Dahl, 60s)
  - Smalltalk (Xerox, 70s)
  - C++, Java, ...
* Basic idea: bundle the code and data into an `object'. Really a design philosophy rather than a family of languages - but increasingly successful as a result of language success
* Well covered in the rest of the course. Don't forget the main purpose is to manage complexity! (Y2K, Ariane, Patriot, ...)

54. INCIDENTAL COMPLEXITY (3)
* Early batch systems were very tedious for the developer
* Time sharing systems allowed online test - debug - fix - recompile - test
* Still needed a lot of 'scaffolding' and a carefully thought out debugging plan
* Next iteration: tools for snapshots, dump analysis, ..., source level debuggers, ...
* Led to integrated programming environments (TSS, Unix, Smalltalk, Turbo Pascal, ...)
* Some of these start to support tools to deal with the intrinsic complexity of managing large projects - 'CASE'

55. FORMAL METHODS
* Pioneers such as Turing talked of proving programs using mathematics
* Program verification started with Floyd (67); followed up by Hoare (71) and others
* Now there's a wide range of techniques and tools for both software and hardware, ranging from the very general to the highly specialised
  - Z, based on set theory, for specifications
  - LOTOS for checking communication protocols
  - HOL for hardware
  - BAN for cryptographic protocols
* These are not infallible - proofs have mistakes too - but force us to be very explicit and check designs in great detail. Many bugs found
* Considerable debate on value for money etc. Personal view: effective in situations such as security protocols where intuition often fails

56. PROJECT MANAGEMENT
* A manager's job is to
  - plan
  - motivate
  - control
* The skills involved are primarily interpersonal rather than technical. Yet managers must retain the respect of technical staff
* Growing capable managers has been one of the perpetual problems of the 'software crisis'. One hears sayings such as 'managing programmers is like herding cats'
* However there are a number of tools that can help with at least the planning and controlling aspects of the task
* A particular problem is managing the time allocated to subprojects

57. ACTIVITY CHARTS
* Show a project's tasks and milestones (with allowable variation)
  (picture: sideways bar chart showing different tasks occupying varying lengths of time - Week 1 Week 2 Week 3 Week 4 Week 5 Task 1 XXXXX XXXXX ????? Task 2 XXXXX XXXXX XXXXX Task 3 XXXXX XXXXX etc)
* Problem: relatively hard to visualise interdependencies and knock-on effects of any milestone being late

58. CRITICAL PATH ANALYSIS
* Drawing the activity chart as a graph with dependencies makes the critical path easier to find and monitor
  (picture: graph of tasks at nodes and dependencies as edges)
* PERT charts are similar but with pessimistic/expected/optimistic task durations
* Such techniques can help maintain 'hustle' and warn of approaching trouble in time to take action
* However, a mechanical approach isn't enough. Skill and experience count. E.g., overestimates of duration come down steadily during the tasks; underestimates are usually covered up until a few weeks short of the deadline!

59. DOCUMENTATION
* A project has a number of management documents:
  - contracts
  - budgets
  - activity charts & graphs
  - staff schedules
  plus a number of engineering documents:
  - requirements
  - hazard analysis
  - specification
  - test plan
  - code
* How do we keep all these in step? Computer science tells us it's hard to keep independent files in sync
* Possible solutions
  - high tech: CASE tool
  - bureaucratic: plans and controls dept
  - convention: self documenting code

60. AN ALTERNATIVE PHILOSOPHY
* Some programmers are very much more productive than others - by a factor of ten or more
* 'Chief programmer teams', developed at IBM (70-72), seek to capitalise on this
* Build teams of one chief programmer, one apprentice/assistant, plus a toolsmith, a librarian, an administrative assistant, etc to get the maximum productivity from the available talent
* 'A surgical team, not a hog butchering team'
* Can be very effective during the implementation stage of a project
* However, each team can only do so much
* Complementary to, rather than opposed to, Waterfall/Spiral and other project management methodologies

61. MORE ALTERNATIVE PHILOSOPHIES
* 'Egoless programming' - the code should be owned by the team, not by any individual (Weinberg, 1971). In direct opposition to the 'chief programmer team' idea
* 'Literate programming' - the code should be a work of art, designed not just for the machine but for subsequent human readers/maintainers (Knuth et al)
Objections:
* Group ownership can lead to wrong design decisions becoming more entrenched, and being defended and propagated more passionately
* 'Creeping elegance' may be symptomatic of a project sliding out of control
No silver bullet!

62. CONFIGURATION MANAGEMENT & CHANGE CONTROL
* One of the most critical, yet often poorly performed, software tasks - from the point of view of reliability, safety, security, ...
* The idea is to control the process

    Development ------\
                       -> test --------> production
    Package purchase --/

* The test process may have multiple stages (for home written software) or be a simple compatibility check (for package upgrades)
* Either way, someone must assess the residual risk and take responsibility for live running
* Fewer changes are easier to manage (e.g., AT&T exchange code updated quarterly)
* Need to manage:
  - backup and recovery
  - rollback
  - interim bug fixes

63. TESTING
* Testing is neglected in academic texts, but is the focus of great industrial interest - being maybe half the cost
* Bill Gates: 'are we in the business of writing software, or test harnesses?'
* It takes place at a number of levels:
  - validation of the initial design
  - module test after coding
  - system test after integration
  - beta test / field trial
  - subsequent litigation
  - ...
* Cost per bug removed rises dramatically as we go down this list
* Common failing is to test late, because testing early wasn't designed for. This is expensive. We must design for testability

64. TESTING (2)
* Package software developers consider the main advance in software engineering over the past ten years has been in testing - design for testability, plus regression testing
* Regression testing - checking that the new version of the software gives the same answers as the old version
* Use a large database of test cases, including all bugs ever found. Specific advantages:
  - customers are much more upset by failure of a familiar feature than of a new one
  - otherwise each bug fix will have a ~ 20% probability of reintroducing a problem into the set of already tested behaviours
  - reliability of software is relative to a set of inputs. Best test the inputs that users actually generate!
* Try to model reliability growth, so we know 'when to stop testing'

65. TESTING (3)
* Reliability growth models help us assess mean time to failure (MTTF), number of bugs remaining, economics of further testing, ...
* Empirically, the failure rate of software drops exponentially at first, then settles down to decrease as K/T
* Changing testers brings new bugs to light
* Lessons learned:
  - early parallelism is best (most economic)
  - to get an MTTF of 10^9 hours, need 10^9 hours testing

66. TESTING (4)
* Failure to understand the conditions in which the system will actually be operated leads to expensive 'testing failures'
  - brown-out meter bug
  - 'Kentucky Fried Chip'
  - PCF
* Also, some failures couldn't reasonably have been foreseen
* Military approach: hostile review followed by prolonged field testing
* Some sympathy among utilities, but in general, corporate politics prevents wider uptake - consultants hired for 'respectability' rather than capability

67. RISK REDUCTION VS DUE DILIGENCE
* Most of the techniques we have discussed are about risk reduction
* However, as we have seen, risk reduction can be fuzzy and open-ended. We may know 'how much' to test but not 'what' to test
* Organisations are highly averse to such uncertainty and prefer to avoid residual risk issues
* 'Tell me what I must do to be saved'
* Strong cultural pressures (eg, aviation, banking) to do as the others do; legal pressures everywhere (negligence judged 'by the standards of the industry')
* Hence risk reduction gets replaced with 'due diligence' - following a standard checklist, hiring a big-name consultant, complying with BS xxxx or ISO yyy
* This is often more expensive than doing the job properly; it can also lead to 'structural' disasters

68. PARTICULAR PROBLEMS OF LARGE SYSTEMS
* Study of problems with 17 large, demanding systems (Curtis, Krasner, Iscoe, 1988)
* Team and organisational factors in project failure investigated in 97 interviews
* Main findings - large software projects fail because of
  (1) thin spread of application domain knowledge
  (2) fluctuating and conflicting requirements
  (3) breakdown of communication and coordination
* These were very often linked, and the typical progression to disaster was (1) --> (2) --> (3)

69. LARGE SYSTEM PROBLEMS (2)
* Thin spread of application domain knowledge
  - how many people understand all aspects of running a telephone service/bank branch network/hospital?
  - many aspects are jealously guarded secrets
  - in some areas, structured effort to overcome this, eg in pilot training
  - otherwise, with luck, you may get a genuine 'guru'
  - even then (and certainly otherwise), expect specification mistakes
* Even without specification mistakes, the specification may still change in midstream
  - computing products, new standards, new equipment, new focus on networking, fashion, ...
  - changing company environment (takeover, election, recession, refocus, ...)
  - new customers, e.g. overseas, with different requirements
Success and failure both bring their own changes!

70. LARGE SYSTEM PROBLEMS (3)
* Problems with communications already mentioned in combinatoric terms - N participants means N(N-1)/2 channels and 2^N subgroups
* Traditional way of coping - hierarchy - has the problem that if information flows via 'lowest common manager', the managers get overloaded
* Usual result - proliferation of committees
* Side effect - politicking, avoidance of responsibility, blame shifting
* Fights between 'line' and 'staff' departments
* Management attempts to gain control may result in constriction of particular interfaces, e.g. to the customer
* Managers are often loth to believe bad news, much less pass it on
* Informal networks are often vital, but are disrupted by 'reorganisation'

Caius Petronius (AD 66): `We trained hard, but it seemed that every time we were beginning to form up into teams, we would be reorganised. I was to learn later in life that we tend to meet any new situation by reorganising, and a wonderful method it can be for creating the illusion of progress while producing confusion, inefficiency and demoralisation.'

71. THE CAPABILITY MATURITY MODEL
* By the mid-80's, people had begun to realise the importance of keeping teams together.
  The ability to work as a team productively is something that grows over time
* A shift of emphasis from the 'product' to the 'process' has been found in many areas
* A team itself isn't enough. We need repeatable, manageable performance, not an outcome that depends on individual genius or heroics
* The 'market leading' approach to this problem is the capability maturity model (CMM), developed at CMU with DoD funding
* It identifies five levels of increasing maturity in a software team or organisation, and provides a guide to moving up from one level to the next

72. CMM (2)
* Empirical model based on observations and refined over a number of years
* Level 1 (bottom) - chaotic. Success depends on luck + heroism
* Level 2 - repeatable
* Level 3 - defined
* Level 4 - managed
* Level 5 (top) - best in the industry: safety critical software teams of Boeing, IBM, etc

73. CMM (3)
* How to move up the ladder - focus at each stage on what is most lacking
* Much detail; see article in handout

74. CONCLUSIONS
* Software engineering is hard, because it is about managing complexity
* We can remove much of the incidental complexity using modern tools (such as high level languages and development environments) but the intrinsic complexity remains
* Dealing with that means understanding the requirements, partitioning the problem into manageable subproblems, and using project management techniques to hold it all together
* Further intrinsic complexity comes from the size of many modern products
* Although a top down approach is necessary, it is not sufficient. We may need to iterate the design
* The maturity of the process is important. Rome wasn't built in a day; neither was Microsoft
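
Illustration for the critical path analysis foil (58): one of the project management techniques referred to in the conclusions can be sketched in a few lines. This is only a minimal sketch, not a method prescribed by the course. It assumes a single fixed duration per task (not the three PERT estimates) and an acyclic dependency graph. The function name, task names, durations and dependencies are all invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, depends_on):
    """Return (total_duration, path) for the longest dependency chain.

    durations:  task name -> duration (here: weeks)
    depends_on: task name -> set of tasks that must finish first
    """
    graph = {task: depends_on.get(task, set()) for task in durations}
    finish = {}  # earliest possible finish time of each task
    via = {}     # the predecessor that determines that finish time
    # static_order() yields every task after all of its predecessors
    for task in TopologicalSorter(graph).static_order():
        last = max(graph[task], key=lambda p: finish[p], default=None)
        start = finish[last] if last is not None else 0
        finish[task] = start + durations[task]
        via[task] = last
    # Walk back from the task that finishes last to recover the path
    node = max(finish, key=finish.get)
    total, path = finish[node], []
    while node is not None:
        path.append(node)
        node = via[node]
    return total, path[::-1]


# Hypothetical project, durations in weeks (illustrative numbers only)
durations = {"spec": 2, "design": 3, "code": 4, "test plan": 1, "test": 2}
depends_on = {
    "design": {"spec"},
    "code": {"design"},
    "test plan": {"spec"},
    "test": {"code", "test plan"},
}
total, path = critical_path(durations, depends_on)
print(total, path)
```

In this made-up example `test plan` is off the critical path, so it has some slack; a slip in `spec`, `design`, `code` or `test` delays the whole project by the same amount.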