Communications
on Communication
Another disaster on
which I commented without finding a publisher.
How the Glitch Stole
Martin Luther King's Birthday
There's an old saying among traffic people
that "you don't engineer for Mothers' Day." That is, you don't
install facilities that are needed only on days of peak demand.
Martin Luther King's birthday turned out to be something else again.
There is, as yet, no tradition for calling people up because of the
nature of the holiday and, because the King birthday is not
universally observed, there isn't as big a drop in business traffic
as there is for Christmas. All things considered, one might expect
slightly lighter traffic than normal, with those attempting calls
encountering less delay or difficulty than usual.
This year it didn't work out that way. The
AT&T toll network crashed at 2:30 on the afternoon of January 15,
1990, giving users an announcement saying "all circuits are busy,
try again later," a fast busy signal (reorder tone), or nothing at
all. But not universally. Every so often, a call would go through as
if nothing were wrong, and once it was set up, it encountered no
further difficulty. People who kept trying got through, but sometimes
it took eight or ten attempts. What had hit the company that
invented reliability?
It appears to have been a glitch in some new
software that had been installed back in December in all 114 of
AT&T's 4ESS switches to improve their reliability. Once triggered by
a minor and unrelated hardware failure in a 4ESS in New York, its
results ricocheted through the signaling network that directs toll
network operations, momentarily interrupting call set-up procedures.
It is not known how many of the 4ESS switches were affected, but
apparently those involved went in and out of service repeatedly
until emergency crews were able to restore order about 11:30 p.m.
To understand what happened, a little history
is required. The 4ESS, which first went into service in 1976, was
designed to work with common channel interoffice signaling, or CCIS,
a signaling network completely independent of the voice channels
used by telephone customers. The original version of CCIS used
signaling links running at 2.4 Kbps; although not high speed by
today's standards, a link could handle the signaling and control
information for more than 2000 trunks.
Because failure of a link handling the
signaling for so many circuits could be a major catastrophe, the
entire signaling network was duplicated with geographically diverse
facilities. The country was divided into ten regions, each with a
pair of widely separated packet switches called signal transfer
points (STPs); every 4ESS in a region homed on both of its STPs, and
every STP had a direct signaling link to all the others. In
addition, the STP pairs within each region were tied together by
signaling links so that if the desired link from one were down, the
link from the other could be used.
The number of pairs of STPs was later increased
from 10 to 16, and the data link speed was doubled to 4.8 Kbps. By
1985, this was inadequate, so new STPs, based on AT&T's 3B20D
computer, were installed and linked together and to their 4ESS
clients via 56 Kbps channels running Signaling System 7. Finally, to
off-load the signaling effort required of the 4ESS common control,
each 4ESS was provided with its own 3B20D processor, in this
instance called a Direct Link Node, or DLN, to act as an interface
to the signaling network. A 3B20D is internally redundant and, as a
result, highly reliable. It is also very powerful; as a signaling
network interface, it is lazing along at about half occupancy.
The mid-December upgrade for the 4ESSs
consisted of arranging these switches in pairs so that each could
share the other's DLN, adding reliability by duplication and also by
providing a second and diverse path into the signaling network. This
required a separate data link from each 4ESS to its mate's DLN, but
that is no problem for AT&T. The now infamous glitch was apparently
in the DLN software upgrade required for this duplication procedure.
Everything appears to have worked fine for
about a month. Then, on January 15, some small mechanical failure
took place in a 4ESS in New York. The internal automatic maintenance
capability of the system went into action to resolve the problem, a
procedure that normally takes four to six seconds and is invisible
to customers. To facilitate this process, the 4ESS, via its DLN,
sent out a message on the signaling system telling its connected
switches that it was momentarily not accepting new calls. This
information was stored in status tables in the DLNs, which can be
thought of as part of the connected 4ESS switches.
As soon as the sick widget was taken care of,
the New York 4ESS returned to originating its own calls. The DLNs in
the connected switches, upon seeing new traffic coming from New
York, began the process of resetting their routing tables to show
New York back on line and ready to accept calls. However, because it
was the middle of the afternoon busy hour, New York sent additional
call originating messages before the DLNs had completed their table
updates.
The glitch was such that, if the DLN at a
distant 4ESS received two call origination messages from New York
within 10 milliseconds of each other, while it was updating New York
in its status tables, some data would be damaged and the DLN would
"reinitialize;" i.e., it would tell all its connected switches that
it was not accepting new calls until it checked itself out. Its own
4ESS could still use the DLN of its mate, thanks to the new upgrade,
but two call originations to that DLN in the 10 ms window while it
was fixing its own tables would cause it to reinitialize, too,
leaving the 4ESS isolated. Note that both DLNs serving a given 4ESS
had to be hit with a double burst of call originations in the 10 ms
window to instigate the problem. Although extensive laboratory
testing had been done, it appears that this particular combination
had just not come up.
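
The failure condition described above can be sketched as a toy model. This is only an illustration, not AT&T's code: the class ToyDLN, its method names, and the assumption that a table update lasts exactly as long as the 10 ms window are all hypothetical. Only the 10 ms figure and the "two originations during an update" trigger come from the account above.

```python
from dataclasses import dataclass, field

WINDOW_MS = 10  # the 10 ms window described in the article


@dataclass
class ToyDLN:
    """Hypothetical model of one DLN's status table; not AT&T's actual code."""
    name: str
    status: dict = field(default_factory=dict)        # peer -> "out" | "updating" | "in"
    update_began: dict = field(default_factory=dict)  # peer -> time (ms) the update started
    reinitializing: bool = False

    def mark_out(self, peer):
        """Peer announced it is momentarily not accepting new calls."""
        self.status[peer] = "out"

    def on_origination(self, peer, t_ms):
        """Handle a call origination message from peer arriving at t_ms."""
        if self.reinitializing:
            return "ignored"
        state = self.status.get(peer, "in")
        if state == "out":
            # First sign of life: begin putting the peer back in service.
            self.status[peer] = "updating"
            self.update_began[peer] = t_ms
            return "update started"
        if state == "updating":
            if t_ms - self.update_began[peer] < WINDOW_MS:
                # Second message lands mid-update: the faulty path.
                self.reinitializing = True
                return "reinitialize"
            self.status[peer] = "in"  # update finished before this message arrived
        return "ok"


dln = ToyDLN("distant 4ESS")
dln.mark_out("New York")
print(dln.on_origination("New York", 0))  # update started
print(dln.on_origination("New York", 4))  # reinitialize
```

In this sketch a second origination arriving 15 ms after the first would be handled normally. A 4ESS would be isolated only if both its own DLN and its mate's DLN took the faulty path.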
After reinitializing, the DLNs would let
other switches know they had come back on stream by sending out call
origination messages via the signaling network. But now, with
traffic backing up, they would send two such messages during the 10
ms window, and shove other DLNs into reinitialization. This went on
all afternoon and most of the evening. Trouble-shooting teams tried
an array of standard procedures to stabilize the network, to no
avail. What finally worked was cutting out the signaling via the
back-up links to the mate DLNs, thereby reducing the load on the
DLNs. This was finished by 11:30 p.m. The following morning, the
faulty program update was removed and taken back to the lab for
study, where the experts were ultimately able to reproduce the
problem. The previous program was put back in the system, where it
is working fine.
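
The self-sustaining character of the failure, and the reason that reducing the load stopped it, can be caricatured in a few lines. This is a deliberately crude sketch under stated assumptions: the function simulate_cascade, the fully connected handful of switches, and the round-by-round timing are all invented for illustration. The only rule taken from the account above is that a recovering DLN that sends two messages inside the window knocks its neighbors into reinitialization.

```python
def simulate_cascade(n_switches, msgs_in_window, max_rounds=6):
    """Toy round-based model of the reinitialization ping-pong.

    Assumptions (not from the article): every switch is linked to every
    other, and each recovering switch announces itself with
    msgs_in_window messages per 10 ms window. Returns, for each round,
    the sorted list of switches coming back from reinitialization.
    """
    recovering = {0}  # switch 0 plays the part of New York
    history = []
    for _ in range(max_rounds):
        if not recovering:
            break
        history.append(sorted(recovering))
        if msgs_in_window >= 2:
            # A double burst hits every switch that is currently up.
            hit = set(range(n_switches)) - recovering
        else:
            # A single message per window is handled normally.
            hit = set()
        recovering = hit
    return history


print(simulate_cascade(4, msgs_in_window=2, max_rounds=4))
# [[0], [1, 2, 3], [0], [1, 2, 3]]
print(simulate_cascade(4, msgs_in_window=1))
# [[0]]
```

With two messages per window the toy network never settles; with one, the disturbance dies after the first recovery, which is roughly the effect attributed above to cutting out the back-up links.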
AT&T claims it completed 83 million calls on
January 15 (35 million during the problem), just about as many as it
expected to, although the number of call ATTEMPTS shot up to 148
million. Customers making multiple attempts were delayed and
annoyed. They were even more annoyed when AT&T operators refused to
provide instructions for using Sprint or MCI, a practice eventually
corrected. Those who persisted apparently got through. I made
countless attempts to reach Roanoke, VA, and San Francisco, CA, for
a total of three calls.
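
For a sense of scale, the figures AT&T quoted can be turned into two ratios. The arithmetic below uses only the day-long totals given above; the variable names are mine, and nothing here says how the attempts were distributed between the hours before and during the trouble.

```python
attempts = 148_000_000   # call attempts on January 15, per AT&T
completed = 83_000_000   # completed calls on January 15, per AT&T

completion_ratio = completed / attempts        # share of attempts that went through
attempts_per_completed = attempts / completed  # average tries per completed call

print(f"{completion_ratio:.1%} of attempts completed")
print(f"{attempts_per_completed:.2f} attempts per completed call")
```

Averaged over the whole day, that is a little over half of all attempts completing, or somewhat under two tries per completed call. The eight or ten tries some callers needed during the trouble are hidden inside that average.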
It is tempting to blame common channel
signaling for the King Birthday Breakdown. After all, "...the
concentration of the signaling software and hardware into a
subnetwork means greater vulnerability than if the signaling
function were spread through the entire network..." as it used to be
prior to 1976, to quote from a somewhat bizarre report from the
National Research Council entitled "Growing Vulnerability of the
Public Switched Networks." But it would appear that Common Channel
Interoffice Signaling and Signaling System 7 really had nothing to
do with the problem. If one must blame modern technology, the
concept of stored program control has to be considered. How else
could 114 major computers be upgraded nearly simultaneously to an
identical set of operating instructions all containing the same
glitch? But before we rise up in righteous wrath to smite the
villain hip and thigh, we must note it was stored program control
that permitted all 114 switches to be fixed in one afternoon from a
central point, an achievement of some moment.
This is the third major network failure AT&T
has experienced in the last several years. The first was due to a
human error in the Wayne, PA, office and the second to the
accidental breaking of a fiber optic pipe in Rahway, NJ. This is the
first which can properly be identified with an actual design glitch. If
the New York office hadn't triggered it, it might have lain hidden
for months before surfacing, perhaps at a more inconvenient time.
AT&T's competitors have shown remarkable restraint in the aftermath
of the debacle; it is possible that there are additional glitches in
the complex programs that control their networks as well, just
waiting for the right confluence of circumstances to unleash them.
The one thing we can be sure of is that the King Birthday Breakdown
is not the last example of human fallibility we are likely to
encounter.
Sidebar: Star Wars, Anyone?
With nearly 100
million calls a day to exercise the telephone network's
software, the most obvious glitches have long since been
discovered and eliminated. But this software is perhaps an
order of magnitude less complex than that which would be
required for the "Star Wars" program, and the number of
daily real-world tests would be appreciably smaller.
Although famous physicists and crusading housewives may
believe that such software can be developed and debugged on
the basis of laboratory simulations alone, the King Birthday
Breakdown suggests that it is just possible that even AT&T’s
Bell Labs, one of the few companies with appropriate
credentials to bid on Star Wars software development, might
no longer be as certain the job could be done as their
spokesman before Congress suggested. A glitch in Star Wars
software could provide something more annoying than reorder
tone.