Service interruptions - 15-Jan-2016, 13:30-14:30 (approx)

Bob Tinkelman bob at TINK.COM
Fri Jan 15 15:50:30 EST 2016


SUMMARY
-------
Due to problems on one of ISPnet's core routers at 85 Tenth Ave,
some customers experienced problems reaching portions of the net,
ranging from a few web sites to the entire net.

The immediate problem was identified.  A fix was installed for
the router having difficulties and we will be making similar
changes to other routers in future announced maintenance windows
to prevent the problem from recurring elsewhere in our network.


THE PROBLEM (MORE DETAILS)
--------------------------
The ISPnet routers in question are cisco routers.  Cisco releases
new versions of the software ("IOS") fairly frequently to support
new features and new hardware and to fix security holes and other
bugs.  ISPnet's upgrade policy has been fairly conservative,
upgrading software images relatively infrequently and, as much as
possible, staying with releases that cisco labels "long term
support".

Today, we discovered a problem with the IOS version running on
gw1.nycsnyoo.ispnet.net while making a (trivial) config update.
The router became unstable, dropping many of its routing sessions
with other ISPnet routers and with some customer routers.

The first thing we did, of course, before realizing the problem
was IOS-related, was to back out the config change.  When that
didn't work, we restarted the router with the newest appropriate
release, one that's documented to fix quite a few problems (most
of which we haven't encountered).  After the reload, everything
returned to normal operation.


PROBLEM SYMPTOMS
----------------
The problem occurred on gw1.nycsnyoo.ispnet.net, one of a pair of
routers configured in a redundant manner at our 85 Tenth Ave POP.
Traffic passing through the POP generally encounters one of gw1
or gw2, but not both.  If one of the two routers is down, the
other picks up all the traffic, but if both are "up", it's
difficult to predict, ahead of time, which router will be
involved in each traffic flow.

Customer traffic passing through gw1 was affected 13:30-14:30 EST
(approximately).  Traffic through gw2 was unaffected.

During this period, the path of various types of traffic changed
multiple times, partly due to work we were doing to fix the
problem.

The effect was that the unreachable portions of the net varied
not only from customer to customer but also over time.
If a customer's normal name servers were unreachable, the effect
was, of course, an inability to reach any web site.
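
For customers who want to tell these two cases apart (name
servers unreachable versus the site itself unreachable), here is
a minimal sketch in Python.  The function name, the default port,
and the timeout are assumptions made for illustration; it is not
a tool we provide or support.

```python
import socket


def classify_reachability(host, port=80, timeout=3.0,
                          resolve=socket.getaddrinfo,
                          connect=socket.create_connection):
    """Return "dns-failure", "connect-failure", or "ok" for host:port.

    Hypothetical helper: resolve and connect are parameters only so
    the logic can be exercised without touching the network.
    """
    # Step 1: name resolution.  If the name servers cannot be
    # reached, this fails before any web traffic is attempted.
    try:
        infos = resolve(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: TCP connection to the first resolved address.
    # The sockaddr is (ip, port) for IPv4 and longer for IPv6,
    # so keep only the first two fields.
    address = infos[0][4][:2]
    try:
        conn = connect(address, timeout=timeout)
    except OSError:
        return "connect-failure"
    conn.close()
    return "ok"
```

A "dns-failure" result for every site, while connections to
known IP addresses still work, would match the name server
symptom described above.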


THE FIX
-------
As mentioned above, we have upgraded the software on gw1 and will
schedule a roll-out of the newer software to our other routers,
at this and at our other POPs, during announced maintenance
windows over the next month or so.

We should be able to make use of the fact that our routers are
all deployed in redundant pairs to perform these upgrades without
any service interruptions.
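
As a rough illustration of the pairing idea, the sketch below
splits each redundant pair across two maintenance windows and
checks that a router's partner is up before that router is
reloaded.  The function names are hypothetical, and this is only
a sketch under those assumptions, not a description of our actual
tooling or schedule.

```python
from typing import Dict, List, Tuple


def plan_rolling_upgrade(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Split redundant pairs into two windows, one router per pair each.

    Because the two members of a pair never share a window, one
    member of every pair stays in service during each window.
    """
    first_window = [first for first, _ in pairs]
    second_window = [second for _, second in pairs]
    return [first_window, second_window]


def build_partner_map(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each router to the other member of its redundant pair."""
    partner_of: Dict[str, str] = {}
    for first, second in pairs:
        partner_of[first] = second
        partner_of[second] = first
    return partner_of


def safe_to_upgrade(router: str,
                    partner_of: Dict[str, str],
                    is_up: Dict[str, bool]) -> bool:
    """Allow a reload only if the router's partner is known to be up."""
    # A partner with unknown status is treated as down.
    return is_up.get(partner_of[router], False)
```

The check in safe_to_upgrade is deliberately pessimistic: if the
partner's status is unknown, the reload is held back rather than
risking both members of a pair being down at once.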

We'll include more details when we announce the windows.

While we'll continue to reassess our software update policy from
time to time, at present we expect to continue with the
conservative policy that seems to have served well over time.




In closing, I want to offer my apologies to customers who were
affected by today's problems and my assurance that stable network
operation is our number one priority.


--
Bob Tinkelman          <bob at tink.com>
ISPnet, Inc.    http://www.ispnet.net
+1 (718) 464-4747  office
+1 (800) 806-NETS  toll free


