
Communications on Communication

Another disaster on which I commented without finding a publisher.

How the Glitch Stole
Martin Luther King's Birthday

There's an old saying among traffic people that "you don't engineer for Mothers' Day." That is, you don't install facilities that are needed only on days of peak demand. Martin Luther King's birthday turned out to be something else again. There is, as yet, no tradition for calling people up because of the nature of the holiday and, because the King birthday is not universally observed, there isn't as big a drop in business traffic as there is for Christmas. All things considered, one might expect slightly lighter traffic than normal, with those attempting calls encountering less delay or difficulty than usual.

This year it didn't work out that way. The AT&T toll network crashed at 2:30 on the afternoon of January 15, 1990, giving users an announcement saying "all circuits are busy, try again later," a fast busy signal (reorder tone), or nothing at all. But not universally. Every so often, a call would go through as if nothing were wrong, and once it was set up, it encountered no further difficulty. People who kept trying got through, but sometimes it took eight or ten attempts. What had hit the company that invented reliability?

It appears to have been a glitch in some new software that had been installed back in December in all 114 of AT&T's 4ESS switches to improve their reliability. Once triggered by a minor and unrelated hardware failure in a 4ESS in New York, its results ricocheted through the signaling network that directs toll network operations, momentarily interrupting call set-up procedures. It is not known how many of the 4ESS switches were affected, but apparently those involved went in and out of service repeatedly until emergency crews were able to restore order about 11:30 p.m.

To understand what happened, a little history is required. The 4ESS, which first went into service in 1976, was designed to work with common channel interoffice signaling, or CCIS, a signaling network completely independent of the voice channels used by telephone customers. The original version of CCIS used signaling links running at 2.4 Kbps; although not high speed by today's standards, a link could handle the signaling and control information for more than 2000 trunks.

Because failure of a link handling the signaling for so many circuits could be a major catastrophe, the entire signaling network was duplicated with geographically diverse facilities. The country was divided into ten regions, each with a pair of widely separated packet switches called signal transfer points (STPs); every 4ESS in a region homed on both of its STPs, and every STP had a direct signaling link to all the others. In addition, the STP pairs within each region were tied together by signaling links so that if the desired link from one were down, the link from the other could be used.
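The redundancy in this arrangement can be counted. The sketch below is only an illustration: it assumes exactly one link between each pair of STPs (the text does not say how many there were), and the function names are made up for the example.

```python
from itertools import combinations


def build_stp_mesh(regions):
    """Return every STP-to-STP link for `regions` regions with two STPs each."""
    stps = [(region, side) for region in range(regions) for side in ("A", "B")]
    # Full mesh: one link per unordered pair of STPs (mate links included).
    return {frozenset(pair) for pair in combinations(stps, 2)}


def links_between(mesh, region_a, region_b):
    """Links whose two endpoints sit in exactly these two regions."""
    wanted = {region_a, region_b}
    return [link for link in mesh if {stp[0] for stp in link} == wanted]


mesh = build_stp_mesh(10)
print(len(mesh))                       # total links in the ten-region mesh
print(len(links_between(mesh, 0, 1)))  # direct links joining two given regions
```

Under these assumptions the ten-region network has 190 links, and any two regions are joined by four direct STP-to-STP links, so losing one link still leaves three direct routes between them.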

The number of pairs of STPs was increased from 10 to 16, and the data link speed was doubled to 4.8 Kbps. By 1985, this was inadequate, so new STPs, based on AT&T's 3B20D computer, were installed and linked together and to their 4ESS clients via 56 Kbps channels running Signaling System 7. Finally, to off-load the signaling effort required of the 4ESS common control, each 4ESS was provided with its own 3B20D processor, in this instance called a Direct Link Node, or DLN, to act as an interface to the signaling network. A 3B20D is internally redundant and, as a result, highly reliable. It is also very powerful; as a signaling network interface, it is lazing along at about half occupancy.

The mid-December upgrade for the 4ESSs consisted of arranging these switches in pairs so that each could share the other's DLN, adding reliability by duplication and also by providing a second and diverse path into the signaling network. This required a separate data link from each 4ESS to its mate's DLN, but that is no problem for AT&T. The now infamous glitch was apparently in the DLN software upgrade required for this duplication procedure.

Everything appears to have worked fine for about a month. Then, on January 15, some small mechanical failure took place in a 4ESS in New York. The internal automatic maintenance capability of the system went into action to resolve the problem, a procedure that normally takes four to six seconds and is invisible to customers. To facilitate this process, the 4ESS, via its DLN, sent out a message on the signaling system telling its connected switches that it was momentarily not accepting new calls. This information was stored in status tables in the DLNs, which can be thought of as part of the connected 4ESS switches.

As soon as the sick widget was taken care of, the New York 4ESS returned to originating its own calls. The DLNs in the connected switches, upon seeing new traffic coming from New York, began the process of resetting their routing tables to show New York back on line and ready to accept calls. However, because it was the middle of the afternoon busy hour, New York sent additional call originating messages before the DLNs had completed their table updates.

The glitch was such that, if the DLN at a distant 4ESS received two call origination messages from New York within 10 milliseconds of each other while it was updating New York in its status tables, some data would be damaged and the DLN would "reinitialize"; i.e., it would tell all its connected switches that it was not accepting new calls until it checked itself out. Its own 4ESS could still use the DLN of its mate, thanks to the new upgrade, but two call originations to that DLN in the 10 ms window while it was fixing its own tables would cause it to reinitialize, too, leaving the 4ESS isolated. Note that both DLNs serving a given 4ESS had to be hit with a double burst of call originations in the 10 ms window to instigate the problem. Although extensive laboratory testing had been done, it appears that this particular combination had just not come up.
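The timing condition is easier to see in a toy model. What follows is a sketch of the behavior as described above, not AT&T's code: the class, the helper, and the 15 ms update duration are all assumptions made for illustration, and only the 10 ms window comes from the account given here. The model also assumes that the message which starts the table update counts as the first of the two.

```python
WINDOW_MS = 10.0   # from the article: the critical spacing between messages
UPDATE_MS = 15.0   # hypothetical: how long a status-table update takes


class DLN:
    """Toy model of one DLN's view of a single neighboring switch."""

    def __init__(self):
        self.neighbor_in_service = False  # neighbor was reported out of service
        self.update_started = None        # time the table update began, if running
        self.last_message = None          # arrival time of the previous origination
        self.reinitialized = False

    def on_call_origination(self, t_ms):
        """Process one call origination message arriving at time t_ms."""
        # An update that has had UPDATE_MS to run is treated as complete.
        if self.update_started is not None and t_ms - self.update_started >= UPDATE_MS:
            self.update_started = None
            self.neighbor_in_service = True

        if self.update_started is not None:
            # Update still running: a message close behind the previous one trips the fault.
            if t_ms - self.last_message <= WINDOW_MS:
                self.reinitialized = True
        elif not self.neighbor_in_service:
            # First sign of life from the neighbor: start updating the table.
            self.update_started = t_ms

        self.last_message = t_ms
        return self.reinitialized


def survives(arrival_times_ms):
    """Feed a sequence of arrival times to a fresh DLN; True if it never reinitializes."""
    dln = DLN()
    for t in arrival_times_ms:
        dln.on_call_origination(t)
    return not dln.reinitialized


print(survives([0.0, 12.0, 30.0]))  # messages spaced wider than the window
print(survives([0.0, 4.0]))         # second message lands mid-update, 4 ms behind
```

With light traffic the messages arrive far enough apart and the update finishes quietly; in the busy hour, the second case becomes the common one.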

After reinitializing, the DLNs would let other switches know they had come back on stream by sending out call origination messages via the signaling network. But now, with traffic backing up, they would send two such messages during the 10 ms window, and shove other DLNs into reinitialization. This went on all afternoon and most of the evening. Trouble-shooting teams tried an array of standard procedures to stabilize the network, to no avail. What finally worked was cutting out the signaling via the back-up links to the mate DLNs, thereby reducing the load on the DLNs. This was finished by 11:30 p.m. The following morning, the faulty program update was removed and taken back to the lab for study where the experts were ultimately able to reproduce the problem. The previous program was put back in the system, where it is working fine.

AT&T claims it completed 83 million calls on January 15 (35 million during the problem), just about as many as it expected to, although the number of call ATTEMPTS shot up to 148 million. Customers making multiple attempts were delayed and annoyed. They were even more annoyed when AT&T operators refused to provide instructions for using Sprint or MCI, a practice eventually corrected. Those who persisted apparently got through. I made countless attempts to reach Roanoke, VA, and San Francisco, CA, for a total of three calls.
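A little arithmetic on AT&T's figures shows what this meant for the average caller. The sketch below uses only the numbers quoted above; the variable names are mine, and the ratios are whole-day averages, since no count of attempts during the failure itself is given.

```python
completed_calls = 83_000_000       # calls AT&T says it completed on January 15
completed_in_outage = 35_000_000   # of those, completed during the problem
call_attempts = 148_000_000        # total call attempts that day

failed_attempts = call_attempts - completed_calls
completion_ratio = completed_calls / call_attempts
attempts_per_completed_call = call_attempts / completed_calls

print(f"failed attempts: {failed_attempts:,}")
print(f"completion ratio: {completion_ratio:.1%}")
print(f"attempts per completed call: {attempts_per_completed_call:.2f}")
```

Averaged over the whole day, a little over half of all attempts went through. The figure for the nine hours of trouble must have been worse, but the numbers quoted do not allow it to be computed.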

It is tempting to blame common channel signaling for the King Birthday Breakdown. After all, "...the concentration of the signaling software and hardware into a subnetwork means greater vulnerability than if the signaling function were spread through the entire network..." as it used to be prior to 1976, to quote from a somewhat bizarre report from the National Research Council entitled "Growing Vulnerability of the Public Switched Networks." But it would appear that Common Channel Interoffice Signaling and Signaling System 7 really had nothing to do with the problem. If one must blame modern technology, the concept of stored program control has to be considered. How else could 114 major computers be upgraded nearly simultaneously to an identical set of operating instructions all containing the same glitch? But before we rise up in righteous wrath to smite the villain hip and thigh, we must note it was stored program control that permitted all 114 switches to be fixed in one afternoon from a central point, an achievement of some moment.

This is the third major network failure AT&T has experienced in the last several years. The first was due to a human error in the Wayne, PA, office and the second to the accidental breaking of a fiber optic pipe in Rahway, NJ. This is the first which can properly be identified with an actual design glitch. If the New York office hadn't triggered it, it might have lain hidden for months before surfacing, perhaps at a more inconvenient time. AT&T's competitors have shown remarkable restraint in the aftermath of the debacle; it is possible that there are additional glitches in the complex programs that control their networks as well, just waiting for the right confluence of circumstances to unleash them. The one thing we can be sure of is that the King Birthday Breakdown is not the last example of human fallibility we are likely to encounter.

Sidebar: Star Wars, Anyone?

With nearly 100 million calls a day to exercise the telephone network's software, the most obvious glitches have long since been discovered and eliminated. But this software is perhaps an order of magnitude less complex than that which would be required for the "Star Wars" program, and the number of daily real-world tests would be appreciably smaller. Although famous physicists and crusading housewives may believe that such software can be developed and debugged on the basis of laboratory simulations alone, the King Birthday Breakdown suggests that it is just possible that even AT&T’s Bell Labs, one of the few companies with appropriate credentials to bid on Star Wars software development, might no longer be as certain the job could be done as their spokesman before Congress suggested. A glitch in Star Wars software could provide something more annoying than reorder tone.



Copyright 2005 Lee Goeller. All Rights Reserved.