Service outages, ISPnet Staten Island colocation customers, postmortem

Bob Tinkelman bob at TINK.COM
Fri Apr 3 14:40:31 EDT 2015


INTRODUCTION
------------

This ispnet-announce posting is being sent to all ISPnet
customers, but will be of interest primarily to those customers
with equipment at the Telehouse colocation facility at 7 Teleport
Drive, Staten Island.

ISPnet's infrastructure at this site includes two redundant sets
of routers, switches and circuits.  Dual-connected customers have
Telehouse cross-connects to both ISPnet cabinets and receive
service with no single points of failure outside their own
cabinets.  For single-connected customers, there are at least the
following single points of failure: the cross-connect and the
equipment on which the cross-connect terminates at the ISPnet and
the customer ends.

Over the past three months, there have been two incidents during
which power-related issues involving one of ISPnet's cabinets
caused a problem that interrupted service to some of ISPnet's
single-connected customers at the site.

The purpose of this email is to provide our customers additional
information about these events and to describe the steps we are
taking to minimize the chances of recurrence.


ELECTRICAL INFRASTRUCTURE
-------------------------

As at most high-level data centers, the Telehouse facility is
designed to provide continuous high-quality electrical power to
customers, insulating them from fluctuations or outages in the
utility-provided power.

Telehouse's extensive electrical infrastructure includes backup
generators, UPSes with sufficient battery to power the entire
building for 15 minutes (much longer than it takes for the
generators to start) and multiple PDUs to distribute the power.

At all the sites where it has equipment, ISPnet extends this
redundancy into its own cabinets.  ISPnet cabinets are always
powered by pairs of electrical circuits configured as
primary-fallback pairs using one or more APC Automatic Transfer
Switches (ATSes).

An ATS is designed to provide continuous power on its output side
as long as at least one of its input circuits has power,
switching between the two, generally in less than a half cycle
(1/120 of a second).
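
As a rough illustration of the behavior described above, here is a
toy model in Python.  It is not APC's actual switching logic.  The
function names, the "A"/"B" labels and the 90-volt usability
threshold are assumptions made only for this sketch.

```python
from typing import Optional

LINE_FREQUENCY_HZ = 60.0   # North American utility frequency
MIN_USABLE_VOLTS = 90.0    # assumed threshold, for this sketch only


def half_cycle_seconds(frequency_hz: float = LINE_FREQUENCY_HZ) -> float:
    """Duration of half of one AC cycle, in seconds."""
    return 1.0 / (2.0 * frequency_hz)


def select_source(volts_a: float, volts_b: float,
                  preferred: str = "A") -> Optional[str]:
    """Return the input a toy ATS would feed its output from.

    The preferred input is used while it is usable; otherwise the
    other input is used.  None means neither input is usable.
    """
    usable = {
        "A": volts_a >= MIN_USABLE_VOLTS,
        "B": volts_b >= MIN_USABLE_VOLTS,
    }
    fallback = "B" if preferred == "A" else "A"
    if usable[preferred]:
        return preferred
    if usable[fallback]:
        return fallback
    return None
```

At 60 Hz, half_cycle_seconds() is 1/120 of a second, or roughly
8.3 milliseconds.  In this model the output loses power only when
select_source() returns None, that is, when both inputs are
unusable at the same time.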


SEQUENCE OF EVENTS
------------------

Similar sequences of events occurred in the late afternoon of
Dec 24, 2014, and the early morning of Mar 12, 2015.

On both occasions, Telehouse recorded a "power hit" (a drop in
voltage typically lasting less than a second) on their ConEd
utility power feed but no loss or change in the power being
provided to customers through its UPSes and PDUs.  Since this was
a very short-lived event, during which backup batteries supported
the building power load, there was no cutover to generator power.

On both occasions, the circuit breakers tripped on both the
primary and backup power circuits to one of ISPnet's cabinets. 
All equipment in that cabinet lost power and single-connected
customers connected to that cabinet lost service.

The cabinet was repowered and service restored when Telehouse
reset the breakers that had tripped.

In both instances, Telehouse reported to ISPnet that, as no other
Telehouse customer had been affected, they believed that there
was a fault with some equipment in the ISPnet cabinet.


POST-MORTEM ANALYSIS
--------------------

After the Dec 24 outage, ISPnet visited the site and tested its
Automatic Transfer Switches, detecting no problems.  We verified
they were still capable of transferring their load between their
A and B inputs, as needed.  

Although the ATSes were functioning, we felt we needed to
understand what had happened so we could make all changes
necessary to prevent recurrences.

We consulted with APC, the manufacturer of the ATS.  They
validated our configuration and confirmed that the units were
running the latest firmware.  They stated they knew of no reason
why the sequence of events should have occurred.

We purchased an additional ATS of the same model used in the
cabinet along with test gear we could use to conduct experiments
in a lab environment, varying the input voltage on the ATS's two
input circuits.  Our tests indicated that the ATS would continue
to function until both its input voltages dropped to a low level,
around 90 volts, much lower than Telehouse's PDUs are designed to
deliver.

The testing suggested a scenario in which a large voltage drop on
both input circuits could cause the ATS to act in a way that
would cause input breakers to trip.  This scenario would occur
only in the case where the phases of the two input circuits were
not in synch with each other, which was the case for the two
circuits supplying power to our cabinet.

It is normal for phases on different circuits to be out of synch.
In fact, in almost every circuit breaker panel, the circuits
served by adjacent breakers will be 180 degrees or 120 degrees
out of phase with each other, depending on whether the power is
split-phase (often loosely called 2-phase) or 3-phase.
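
The arithmetic behind those angles can be checked with a short
calculation.  The Python sketch below assumes two ideal sine waves
of equal RMS voltage measured against a common neutral.  The
120-volt figure and the function name are assumptions chosen for
illustration, not measurements from our cabinet, and the sketch is
not meant as an explanation of the ATS behavior.

```python
import math


def rms_between_legs(rms_volts: float,
                     phase_offset_degrees: float) -> float:
    """RMS voltage between two equal sine waves offset in phase.

    For v1 = V*sin(wt) and v2 = V*sin(wt - phi), the difference
    v1 - v2 is a sine wave of amplitude 2*V*sin(phi/2), so the
    RMS value scales by the same factor.
    """
    half_angle = math.radians(phase_offset_degrees) / 2.0
    return 2.0 * rms_volts * abs(math.sin(half_angle))


for offset in (0, 120, 180):
    volts = rms_between_legs(120.0, offset)
    print(f"{offset:3d} degrees: {volts:6.1f} V")
```

With 120-volt circuits this gives about 208 volts between circuits
that are 120 degrees apart and 240 volts between circuits that are
180 degrees apart, while circuits that are in synch show no voltage
difference between them.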

APC indicated (both in their documentation and in response to our
queries) that the ATS had no restrictions regarding "synch'ed
phases" and that we were running an approved configuration.

All the above analysis was completed in January, at which point
we decided not to replace our ATSes at this site immediately, but
to wait and see.  Among other reasons, we wanted to avoid
imposing on our single-connected customers the additional service
interruption that replacing the ATSes would necessitate.

A second outage occurred on March 12.  The symptoms were mostly
the same, with only a few differences.  On the positive side,
given our experience, we knew immediately what had happened and
Telehouse was able to restore power to our cabinet much more
quickly.  On the negative side, when we visited the site later
that day, we found that one of our transfer switches was
definitely damaged and was now unable to transfer power between
primary and secondary feeds.

For the short term, to prevent any issues involving "A vs B
feeds" from causing additional problems, we totally disconnected
the B feed from both ATSes in the cabinet, pending the
replacement of both ATSes.

We conferred with Telehouse's facilities department and,
together, we agreed on a plan that we believe has the best chance
to avoid future problems.


WHAT COMES NEXT
---------------

Telehouse will provide ISPnet with a pair of replacement
electrical circuits to the cabinet.  These will be engineered so
that their phases are in synch with each other.  At Telehouse's
suggestion, these circuits will be higher capacity than the
current ones, both allowing more room for growth and helping
avoid small surges from tripping breakers.

ISPnet has already purchased replacement transfer switches.  Once
Telehouse has installed the new circuits, we will test them and
our new ATSes.  Then we will schedule a maintenance window in
which we will perform the upgrade.

Afterwards, we plan to ship the old transfer switches to APC for
analysis.

We will post an announcement when the maintenance is scheduled.
Our current expectation is that it will involve service
interruptions to the same set of single-connected customers as
were affected by the 12/24 and 3/12 outages.  No other customers
should be affected.


ACKNOWLEDGEMENTS
----------------

Over ISPnet's entire life, Telehouse has been one of our most
valued technology partners.  We have always found them to be
professional, technically competent and customer solution
oriented.  

Their willingness to provide replacement electrical circuits in a
situation where all their systems report that their equipment has
been functioning correctly is, on the one hand, "above and
beyond" but, on the other, totally consistent with our history
with them.


