1997 Reports
23 December 1997
We were offline this evening from 1:30 AM until 2:10 AM EST making adjustments
to our file sharing software. In addition, we have performed routine random
file erasing to ensure the integrity of our system and to keep our customers
on their toes. Furthermore, we have disconnected the T1 lines and inserted
them into our freezer and are performing our own experiments with food irradiation.
As a side experiment, we are testing to see if frozen bits move faster.
We hypothesize that they do. If you notice any speed difference, please
send a note to us.
12 December 1997
BellSouth is experiencing some intermittent errors today in their routing
systems throughout the southeast US. The net result is that our T1 lines
are bouncing up and down at irregular intervals, though they rarely all
do so at the same time. Even then, any general outage resulting from a simultaneous
bounce on all lines lasts no more than about 10 seconds - not even long
enough for a timeout to occur. As of 3:15 PM eastern time, all telco is
up and running stable; for how long only BellSouth knows for sure.
4 December 1997
We have no idea...
The web servers crashed at 12:02 PM today for no apparent reason. They
were down for about two minutes and we're trying to track down the problem.
It seems (for now) that it was just a transient event that is not repeatable.
We will continue our attempt to determine what happened and find the appropriate
client to blame it on. We're thinking Gregory at this moment, but Michael
or Barbara might do just as well. :)
28 November 1997
We are provisioning in a new T1 line today. Unfortunately, things involving
BellSouth that require so many different people need to be done during business
hours. Installing a new T1 is one of those things. Since this is a holiday
weekend, we figured that it would be best to perform the install now rather
than wait till a normal weekday. What we are installing is a load balanced
T1 which is why this is so difficult and taking so long. Normally when a
new line goes up, it is a redundant T1 - not load shared. By load sharing,
we not only get increased throughput, but we also get full redundancy if
one or more of the T1s hose.
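For the curious, here is roughly what load sharing buys us over a hot spare. This is only a sketch in Python under our own assumptions, not how the router actually schedules packets: the `Link` class, the `pick_link` function, the three line names, and the simple round-robin policy are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One T1 line (hypothetical model: a name and an up/down flag)."""
    name: str
    up: bool = True


def pick_link(links, packet_no):
    """Round-robin over whichever lines are currently up."""
    alive = [link for link in links if link.up]
    if not alive:
        return None  # every line is down at once: a real outage
    return alive[packet_no % len(alive)]


links = [Link("T1-a"), Link("T1-b"), Link("T1-c")]

# All lines up: packets are spread across every line.
spread = [pick_link(links, n).name for n in range(6)]

# One line bounces: the survivors quietly absorb its share.
links[1].up = False
after_failure = [pick_link(links, n).name for n in range(4)]
```

With every line up, traffic rotates across all of them. When one drops, the rest pick up the slack, and only losing every line at the same moment counts as being down.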
We will be bouncing up and down starting at 2:30 PM EST and we expect
the entire process to take about three hours to get all the telco lines
rebalanced. During that window, you will experience intermittent loss of
connectivity, though we expect to be up more than we are down. As always,
email will be stacked up for delivery when everything is stable.
Addendum:
Well, it looks like we weren't the only ones to make the decision to
do maintenance today. The internet as a whole looks pretty hosed. And the
problems are widespread with most carriers affected with at least a 30 percent
packet loss. As of 5:27 PM EST, we are back humming - fully load balanced.
Yeah...
27 November 1997
What are you people doing awake anyway?!? Every one of you should be passed
out from turkey bloat by now which is why we decided that it would be a
good time to do some maintenance on the server. We took it down for about
an hour and when I get back to my email, there are a half-dozen messages
in there telling me the server is down. Duh... :)
Anyway, it was time to rebuild some of the drives and upgrade some of
the software, so that is now done. Next time, I'll give some warning specifically
when we do something like that. Now, go to sleep.
18 November 1997
Hurry hurry hurry. Step right up and get your fresh hot roasted "DNS
name not found" errors. There's only a limited supply folks so get
yours while they last. Where you say? Why, come on down to Big NIC's (InterNIC's
that is) J Root Server Emporium. Where you can find a fine selection of
corrupted and incorrect DNS names to suit just about anyone's tastes.
Now don't get hornswoggled by one of those other root servers that only
hand out "correct" DNS entries. What you need my friend is a genuine
leather bit bucket so you can load up on all these great domain names that
have gotten completely lost in the shuffle.
Deals deals deals is what you'll find at Big NIC's. Oh, and tell 'em
that Mikey sent ya!
17 November 1997
This morning was proof that if something can go wrong, it will. We have
a full system check in place that will warn us if anything goes wrong. Alarms
go off...bells ring...and lasers kill intruders if something in the system
fails. Well, we found a particular failure this morning that was completely
unexpected.
The web server went south at 8:37 AM EST. But the machine was still pingable.
And worse yet, the file sharing system was still up and running. As far
as anything was concerned, the machine was still up and running and the
web server was still listening for incoming requests -- it was responding
to those requests, but not sending anything back. At the same time, FTP
services, email, secure services, and database services were still performing
flawlessly so we didn't even have any back-up warning condition to trigger
an alarm (as we would have had if the server had died completely.)
Normally, even a situation as bizarre as that would have been caught
in moments IF anyone had been awake at the time. Unfortunately, this
morning was one of the rare times that all of us were asleep at the same
time. When I awoke at 11:30 AM, it was immediately obvious that something
was wrong and a simple relaunch got us up and running within seconds.
We will be spending the balance of the afternoon on the phone with vendors
and searching through technical documents to see if what happened is logged
anywhere. And then we will add still another checking mechanism to our testing
program to prevent any further occurrence of this same type.
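The kind of extra check we have in mind looks something like the following. Treat it as a minimal sketch in Python, assuming a plain HTTP server. The function name `check_web_server`, the three status strings, and the five-second default timeout are our own inventions for the example, not part of any monitoring product.

```python
import socket


def check_web_server(host, port=80, timeout=5.0):
    """Return 'ok', 'hung', or 'down' for the web server at host:port.

    A ping or a bare connect is not enough: the server must actually
    send something back in answer to a request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n"
            sock.sendall(request.encode("ascii"))
            try:
                reply = sock.recv(64)
            except socket.timeout:
                # Connection accepted, request taken, nothing returned.
                return "hung"
            # An empty or non-HTTP reply is treated as a hang as well.
            return "ok" if reply.startswith(b"HTTP/") else "hung"
    except OSError:
        # Refused, unreachable, reset, or the connect itself timed out.
        return "down"
```

The point of the design is the middle case: a machine that answers pings and accepts connections but returns nothing reports "hung" instead of passing as healthy.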
Sorry if anyone was inconvenienced.
15 October 1997
The primary peering point for the Eastern United States, MAE-EAST, has
been experiencing major routing problems since last Thursday night. Traffic
through MAE-EAST has been intermittent at best. Today, MAE-EAST is completely
hosed and all traffic between major carriers is essentially non-existent.
This is gonna be a really long day for the Web.
We are fine and our telco link is fine. And almost without exception,
the ISP you are dialing into as well as their telco link is fine; it's just
no one is talking to each other. The only thing to do is ride it out. Time
to go to lunch... :)
14 October 1997
Today was going to be my day off. I had been programming furiously for
the last several months and it was time to just sit back and relax. So naturally,
I hopped in my car and drove down to the local RouterShack to gawk. I had
just gotten through looking at a WimpyBit 250 router when, in the distance,
I heard a blood-curdling scream. Naturally, I was concerned, but it was
probably that same concern that caused me to overlook a critical detail
about the scream. It would not be my last mistake.
I started making my way through the store looking for someone in distress.
As I neared the fusion-powered router aisle, I heard it again, but this
time, there were hundreds, or even thousands of screams! I rounded the corner
and then stared in shock. It couldn't be...
There in front of me was a demonstration of a RouteMaster 2000 (a.k.a.
the Packet-Knosher)! The screams of the millions of bits being slammed down
the demonstration T1 were deafening. Naturally, I bought one.
I hooked it up this morning and true to the demo, the bits are screaming
through it. I realize it's a little disconcerting to hear these types of
sounds coming from your modem but you can rest assured that no one is harming
the bits in any way. Those of you who are concerned that the bits are being
overworked can send your complaints to the Bit Abuse Hotline.
1 October 1997
Problem. Big problem. InterNIC operates nine name servers named "A"
through "I". (Actually, they have several more, but they are experimental
and are not in hot use.) Anyway, when you type in a URL or hit one of your
bookmarks, your request goes to the local cache on your dial-up provider
to resolve that name. If the URL is not in the local cache, your local dial-up
provider sends out a request to one of the InterNIC name servers to resolve
it. Normally, the InterNIC name servers would return the information where
it would then be placed into the cache of your dial-up provider. That way
if you or anyone else from your dial-up service wanted to look at that domain
again, the URL request would go to cache, saving traffic on the InterNIC
name servers.
If one of the InterNIC name servers is completely down, then there is
no real harm done. Your request simply goes to another name server to pick
up the information. But if the information returned from the InterNIC name
server is wrong, then that wrong information is cached at your local dial-up
provider. If you or anyone else on that system then places the same URL
request, that wrong information is returned locally and the InterNIC name
servers are not even queried again.
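Here is the same idea as a toy model in Python. It is a sketch under our own assumptions and not real resolver code: the `resolve` function, the three stand-in "servers", and the use of `None` for a "No DNS found" answer are all hypothetical, and 192.0.2.10 is a documentation address standing in for the right answer.

```python
def resolve(name, servers, cache):
    """Look a name up in the local cache first, then ask servers in order."""
    if name in cache:
        return cache[name]            # answered locally, right or wrong
    for server in servers:
        try:
            answer = server(name)
        except ConnectionError:
            continue                  # server is down: just ask the next one
        cache[name] = answer          # whatever came back is now cached
        return answer
    raise LookupError("no name server reachable")


def good_server(name):
    return "192.0.2.10"               # stand-in for the correct address


def down_server(name):
    raise ConnectionError("server is off the air")


def bad_server(name):
    return None                       # stand-in for "No DNS found"


# A dead server costs nothing: the next one answers correctly.
healthy_cache = {}
via_failover = resolve("example.com", [down_server, good_server], healthy_cache)

# A wrong answer sticks: later lookups never reach the good server.
poisoned_cache = {}
first_try = resolve("example.com", [bad_server, good_server], poisoned_cache)
second_try = resolve("example.com", [good_server], poisoned_cache)
```

In the second case both lookups come back empty even though a perfectly good server was available the whole time.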
What is happening today is that the "H" InterNIC name server
is partially corrupted. It appears that all the .com, .org, and .net addresses
are corrupted. It does not appear that .edu or .gov are affected. If anyone
sends out a URL request to a .com address and the "H" name server
is hit then that "No DNS found" response is getting placed in
the dial-up provider's cache. Again, once that happens, any subsequent request
for that URL on that local system continues to show "No DNS found"
errors even though the domain is live and running just fine.
As the morning progresses, more and more of the requests are going to
statistically hit that "H" server and the erroneous DNS information
will get cached all over the place. Unless InterNIC intercedes quickly,
and I mean before the east coast of the United States wakes up and starts
browsing, today will be a very bad day for the World Wide Web.
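To put a rough number on "statistically": suppose each uncached lookup lands on one of the nine name servers uniformly at random. That is our simplifying assumption, not anything InterNIC publishes, and real resolvers may favor some servers over others. The function name `chance_of_bad_entry` is likewise made up for this Python sketch.

```python
def chance_of_bad_entry(lookups, servers=9, bad=1):
    """Chance that at least one of `lookups` uncached queries hit a bad server.

    Assumes each query independently picks one of `servers` at random,
    `bad` of which hand out corrupted answers.
    """
    chance_all_clean = ((servers - bad) / servers) ** lookups
    return 1 - chance_all_clean


one_lookup = chance_of_bad_entry(1)     # one chance in nine
ten_lookups = chance_of_bad_entry(10)   # already better than even odds
```

Under that assumption a single lookup has a one-in-nine chance of going wrong, and a provider that has made just ten uncached lookups is more likely than not to be holding at least one bad entry.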
This problem is not only affecting the Web, it is also affecting email,
FTP services, and anything else that relies on DNS name resolution. At best,
it will be an inconvenience if you can't update your pages or browse URLs
today. At worst, it will mean that emails will get trashed, never to be
delivered.
Unfortunately, all this is totally out of our control and the control
of your dial-up provider. The best thing that can be done is for your local
ISP to flush their cache regularly all day long until the problem with the
"H" server is resolved. Those flushes will have to take place
with increasing frequency as the day progresses and more and more people
hit that bad name server to cache bad information. We have our cache set
with a five minute time to live which means that any DNS entry in our cache
expires in five minutes. Many providers, though, set their cache for one
hour, 12 hours, 24 hours, or longer.
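The difference a time to live makes can be shown with a small Python sketch. The `TtlCache` class and its methods are hypothetical, written only for this illustration, and time is passed in as plain seconds so nothing depends on the wall clock.

```python
class TtlCache:
    """A name cache whose entries expire after ttl_seconds (toy model)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}             # name -> (answer, time stored)

    def put(self, name, answer, now):
        self.entries[name] = (answer, now)

    def get(self, name, now):
        """Return (hit, answer); an expired entry counts as a miss."""
        if name in self.entries:
            answer, stored = self.entries[name]
            if now - stored < self.ttl:
                return True, answer
            del self.entries[name]    # too old: forget it and ask again
        return False, None

    def flush(self):
        self.entries.clear()


five_minutes = TtlCache(5 * 60)
one_day = TtlCache(24 * 60 * 60)

# Both caches pick up the same bad answer at time zero.
five_minutes.put("example.com", None, now=0)
one_day.put("example.com", None, now=0)

# Ten minutes later the short cache has forgotten it; the long one has not.
short_result = five_minutes.get("example.com", now=600)
long_result = one_day.get("example.com", now=600)

# A manual flush is the only quick cure for the long cache.
one_day.flush()
after_flush = one_day.get("example.com", now=601)
```

A short time to live means more queries going upstream, but bad information ages out on its own within minutes. A long one keeps serving the bad answer until it expires or somebody flushes by hand.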
Let's just hope that InterNIC gets its act together and shuts down the
"H" name server quickly.
InterNIC Update:
We spoke with InterNIC at about 6:30 AM and told them of the problem.
At 6:50 AM, they called us back to inform us of the steps that have been
taken. According to InterNIC, an email has been sent to the sysop of the
"H" server explaining the situation. Unfortunately, the problem
may not resolve itself any time soon.
The way the system is set up, InterNIC has direct control over the "A"
server only; the administration of the others is contracted out and their
upkeep is dependent on the individuals who are contracted to maintain the
servers. The "H" server appears to be located at the Aberdeen
Proving Grounds in Maryland. Assuming that the individual who is responsible
for maintaining that server gets the email, reads it, and responds appropriately
it may be several hours before anything is done. Worst case is that the
responsible individual is off today and does not get the email. That will
mean that nothing will be done until the normal InterNIC updates which occur
overnight.
Begin Editorial Comment*****
You would think that InterNIC would have a better way of resolving root
server problems than simply sending an email. We are going to do a whois
on the "H" server and see if we can get a contact number.
End Editorial Comment*****
Additional InterNIC update:
We reached the sysop responsible for maintaining the "H" server by phone
at 7:05 AM and notified him of the problem. It
is being attended to as I write this. Looks like the crisis is averted.
Yeah...
Begin Editorial Comment*****
We found that phone number with a simple whois. Now, why couldn't InterNIC
have done the same thing?
End Editorial Comment*****
Additional, additional InterNIC update:
At 7:40, the problem still exists with the "H" server and with
the east coast waking up and hitting the Web, sending emails, and the like,
things should start bogging down very shortly. Remember, once the "H"
server corrupted information gets cached on your ISP's system, it will not
matter even if "H" comes back up; that information will live in
cache and be corrupted until that cache flushes. It's beginning to look
like a long day again...
Another additional InterNIC update:
8:35 and still hosed. Oh well. I did what I could do so I'm going to sleep.
It looks like the way the system is set up, the only time you can correct
errors to the name server tables is on the overnight updates. At least that
is what I would have to presume; there has certainly been enough time to
pass to have done something by now -- even if it is to simply turn it off.
Final InterNIC update:
As of 12:25 PM, the "H" name server appears to be now responding
properly. I can't determine exactly when it started working again since
I just woke up. And now I am going back to sleep. See y'all this evening...
10 September 1997
We have been experiencing intermittent problems today with the web server.
The cause has been isolated to a piece of crap software called NetCloak
which is responsible for text (but not graphical) counters. We have removed
the offending piece of garbage which will cause all text counters to immediately
stop. If you are using a text counter, please go to our tutorials page for
instructions on temporarily substituting any one of our available graphic
counters for your text counter. Please email or call us if you need
assistance with this process. We apologize for any inconvenience. We will
get this straightened out as soon as possible.
For anyone using counters as page statistics, please email us and we will
be more than glad to provide log statistics free of charge for the month
of September.
5 September 1997
You may notice that nothing of any significant function is currently
working. We know that (and this is gonna be a long night.) We have completely
rewritten how back end processes are handled by the various servers and
are in the process of installing the new software. (For those of you keeping
up with the implementation of MGI, this is round one. The next major upgrade
comes in October.) The whole thing should take until about 5 AM to finish
at which time the servers will be running so fast and efficiently that you
will need SuperMikey to slow things down so that people can actually read
them. We will return control when you place $50,000 in small bills in a
brown paper bag...no wait, that's a different scenario (gotta quit watching
Showtime.)
Anyway, email, the web server, secure services, and all other non-CGI
functions will not go down.
22 August 1997
Well, this is the deal. We are experiencing some sort of bizarre instability
involving the web server on port 80. This has been going on for two days
now. There does not seem to be a particular pattern to the freezes, nor
have we altered anything in the past several weeks that would cause a problem.
The glitch seems to involve an interaction between our forms processor,
the server software, and AppleTalk. What happens is that the CGI will come
forward as the frontmost application and lock up, but the machine running
the processor does not itself freeze. From the perspective of the internal LAN,
nothing seems to be wrong. AppleTalk still functions and the machine still
takes pings over TCP/IP. Nothing gets served out, though. And the machine
itself does not unmount from the network.
When this happens, you will be able to open an FTP connection and see
your hierarchy, load things into your folder, and get files from your folder.
Database services and secure services are not affected either, so if someone
has already entered those areas, they will be successful until they try
and get back to the standard web server machine. Then they get a No-DNS
error.
We have developed a process that constantly checks the web server and
sounds an alarm if a freeze should occur. Unfortunately, all it can now
do is warn us when something is wrong; the auto-reboot systems will not
work since there is technically nothing wrong with the computer or filesharing.
Fortunately, we live with the servers and are here to do a manual reboot
in the event of a freeze. In most cases, the reboot will occur so quickly
that an end user will not even time out, but occasionally they may see a
No-DNS found message. In that event, when they try again, the server will
be back up and running.
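In outline, the checking process amounts to a loop like the one below. This is a sketch in Python and not the actual program we run: `watch`, `probe`, and `alarm` are hypothetical names, `probe` is assumed to return True when a page actually gets served, and the ten-second interval is an arbitrary choice.

```python
import time


def watch(probe, alarm, interval=10.0, max_checks=None, sleep=time.sleep):
    """Call probe() over and over; call alarm() whenever the server freezes.

    The alarm fires once per freeze (on the change from healthy to
    frozen), not on every failed check. max_checks=None means run forever.
    """
    was_healthy = True
    checks = 0
    while max_checks is None or checks < max_checks:
        healthy = probe()
        if was_healthy and not healthy:
            alarm()                   # all we can do is wake up a human
        was_healthy = healthy
        checks += 1
        sleep(interval)
```

The alarm only warns. Since the machine and file sharing both look fine from the outside, the manual reboot is still up to whoever hears it.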
This problem has occurred five times in the past 65 hours and at this
time represents little more than an irritation for everyone involved except
us as we rip our hair out trying to figure out what the problem is. I'll
keep you posted.
20 August 1997 - Part Two
Computers are like knees; they act as precursors to evil things when
they start acting up. I should have known this morning that when we started
to have intermittent instability for no reason we were in for some real
problems. Well, those problems arrived in the form of a massive front line
shortly after 7 PM EDT. We're talking violent lightning, hail, tornadoes,
and other Acts of God extending in a line from Richmond, Virginia down,
and past, Atlanta, Georgia. Major damage...death and destruction...sheep
with anthrax roaming wild...rabid bats...
We are actually up and running; the UPS power backup kicked in perfectly
and everything worked just fine internally. Unfortunately, much of the telco
in the southeast region was down completely or having so many problems with
network overload that it was effectively down. We have engineers crawling
all over Raleigh and the surrounding area and we do know that our telco
back to the managed facility is running, but beyond that...well, reports
are coming in that there is nothing beyond that.
So there is only one thing left to do at this moment...grill some rib-eyes.
And you are welcome to join us. The only problem is that by the time the
network is back up and you have read this, we will have finished eating
and there will be none left. You snooze...you lose. :)
Number of disgusted Steves: potentially one if the steaks burn
on the grill
Number of frazzled Mikeys: one, who has been peering into the T1
throughout the night waiting to see packet movement. He'd better watch it
because if those packets start moving while he has that line to his eye,
he's liable to get pinged.
Number of culinary Vals: one, who is womaning the grill
Number of really ticked off telco engineers: many as they wander
around looking for problems while trying to avoid sheep with anthrax and
rabid bats.
20 August 1997 - Part One
We've been having some trouble with the secure server this morning. The
problems started at 3:34 AM and have caused some instability in the system.
We think we have it fixed now and we will continue to monitor it throughout
the day and tonight. It's interesting, though...when you have a bad head
cold is when things start messing up big time. Never happens when I'm wide
awake and well...
Number of ill Steves: one
Number of Vals who are tired of ill Steves: one
Number of Mikeys who refuse to come over and work on this himself for
fear of getting a really bad headcold: one
19 July 1997
Do I look like I'm in a bad mood? I have no idea why you would say that
just because the TELCO HAS BEEN OUT FOR FOUR HOURS!!!!!!
In a stunning example of irrational incompetence, GTE has demonstrated
just why many of their customers hate them. Oh, did I not tell you? Though
we contract with BellSouth (the most reliable and efficient provider of
telco services in the world,) because of the decision issued by Judge Greene
(who would best be served by joining Judge Crater) BellSouth is not allowed
to provide their world-class services without interference...uh, mandatory
technical assistance. And that help comes in the form of GTE for one little
hop of a large copper wire as it passes through Research Triangle Park,
North Carolina. You see, RTP is serviced by GTE and GTE alone which is BS
because of Greene. And there is no way around it. But when GTE drops the
ball like they are world famous for doing, everyone else is affected.
In this particular instance, their ball-dropping resulted in taking down
not only our telco, but also most of the T1 services in the region, including
but not limited to, credit card authorization services and ATM machines.
On a Saturday afternoon. If you think I was torqued, you should have seen
the retail store managers in Crabtree Mall who don't know a T1 from a Black
Snake (and wouldn't know what to do with either if handed to them.)
The problem turned out to be a card in a router in a building in RTP,
the control of which fell under GTE. Any other company would have gotten
a tech there in about 10 minutes (no wait...any other company would have
had techs on-site 24/7) and swapped out the card with one that they keep
in redundant reserve. Of course, that would be too much of an effort for
GTE, not only to get a tech on site in anything short of four hours, but
to actually have a redundant card handy. So what they did -- get this --
is to rig a temporary ring around the router to get the telco back up. And
once it was back up, they then waited until 2 AM to take the 30 seconds
it took to swap out the card.
From rumors I have heard, GTE is apparently being read the riot act by
some very high-up people at BellSouth and, with any luck and the constant
repetition of the words, "breach of contract," GTE may actually
get their act together.
12 July 1997
2:08 PM:
Mikey: We're down...
Steve: Duh...
Mikey: Check the web server and DNS. I'll check the router.
Steve: Duh...OK...
2:10 PM:
Mikey: The router is fine
Steve: All internal systems are fine.
Mikey: We're still down.
Steve: Duh...
2:11 PM:
Mikey: Time to call BellSouth.
Steve: OK...
Val: Good morning
Steve and Mikey: Good morning.
Steve: We're down.
Val: (Who is not quite awake yet...) Duh, OK...
2:12 PM
BellSouth: Network Operations Center, can I help you.
Steve: We're down.
BellSouth: Duh!
Mikey: What's going on?
BellSouth: Major network outage...seven states completely off line...router
burned...death and destruction...locusts, boils, and assorted pestilence
sweeping the landscape...
Steve: Duh.
Mikey: Any ETFTS? (Estimated Time to Fix The Screw-up)
BellSouth: We're on top of it. No time of ETFTS.
Mikey: Duh...we'll call back.
5:58 PM
During the past three and one-half hours, we have entertained ourselves
by making some software upgrades, doing the laundry, taking a moment to
create life in a test-tube and calling BellSouth every half-hour at the
top and bottom of every hour. We're beginning to get bored.
Steve: We're still down.
Mikey: Duh.
Val: Did you turn the air conditioning up last night?
Steve: No. Let's call BellSouth again.
Mikey: OK...
BellSouth: Network Operat....
Steve: We know who you are. We know where you live. When are we
going back on line?
BellSouth: Should be within the hour.
Val: Are you sure you didn't turn the air up last night?
Steve, Mikey, and BellSouth: YES!!! WE'RE SURE!!!
6:02 PM
Mikey: Hey...We're up!!!
Steve: Yeah!!!
Mikey report: Sent out a total of 24,000 packets, all of which
were lost during the outage. At one point, Mikey got so frustrated that
he grabbed a shotgun and started skeetshooting the bounced packets that
were getting returned.
Steve Report: Blood Pressure back to normal.
Val report: Still wondering if anyone messed with the air conditioning
settings last night. If you would like to take responsibility for the air
conditioning, please email Val and do so.
25 May 1997
We just had our first real-life test of the UPS systems under actual
power outage conditions. Wow...that was cool!!! After 46 minutes, the battery
back-ups were still going strong without even breaking a sweat. Didn't
even come close to cranking up the generator. Of course, the beeping every
30 seconds drove me completely insane (not to mention all the phones, most
of which are not hooked up to the back-up power) and I found myself having
the strong desire to root around in the mud puddles forming outside.
Number of chirping phones: Five
Number of mud puddles: One, but it is very large.
Number of happy Mikeys: One - he slept through the entire incident.
Number of happy Steves: Zero - not that I am not ecstatic that the
system worked as intended, but the thunder storms are putting a real damper
on the IHOS festivities for this evening.
Number of happy Vals: One - she took the opportunity to take a shower
and now feels much better.
2 May 1997
Hi folks. I'm about at the end of my rope with this whole Internet thing.
The phone has been ringing off the hook for the past three days. PagePlop
customers are calling in left and right to say that they can't reach their
web sites or FTP accounts. Unfortunately, there is absolutely nothing I
can do about it.
PagePlop has not been down for one second. We have not had a single
packet loss in our transmission. Our provider, BellSouth, is also completely
stable, up and running, without a single hitch. The problem is that Sprint,
UUNet, AlterNET, IBM network, and a half-dozen other telco providers cannot
keep their systems up to save their lives.
This morning alone, all of the above mentioned providers have been completely
down for substantial periods of time and have had significant transmission
problems the rest of the time. If you are using a dial-up provider that
is on Sprint, then it is most likely that you can see anything else that
is also on the Sprint network, but that is all you're going to get. They
are cut off from the rest of the world. And you certainly can't reach us.
Neither can anyone else who is hung on any local ISP who uses Sprint. And
no one from the rest of the world can see anything hosted on a Sprint network
system.
One of the primary reasons that we chose BellSouth is that they do not
have these problems - at all. You are paying PagePlop to provide world-class
service when it comes to your web hosting needs. We are doing just that.
And part of that is going with a telco provider that will be our partner
in giving you that world-class service. You are also paying money to get
connectivity via your local ISP. They are apparently not concerned that
their providers couldn't care less about their customers.
So help me, if your automobile or refrigerator gave you this many problems
you would be screaming at the place that sold you the product, the Attorney
General's office, and the Better Business Bureau. Why don't you do the same
about your dial-up providers and the national networks they get their service
from? Your providers can correct their problems anytime they want to by
installing the appropriate equipment for the task and by hiring a team
of professionals to manage it. They are not doing so. Make them.
26 April 1997
Holy bit bucket, Batman --
Something's wrong...
Right you are boy wonder. It seems that a pernicious presence provider
has been peddling problematic packets.
Oh no! Not nefarious newbies nuking the net!
Fortunately, the people at PagePlop protected their patrons from peril.
What could cause such a cascading cosmic catastrophe?
Rogue routines running routers ragged, Robin.
Status of Mikey:
Sleepy (can't you tell :)
15 April 1997
Don't worry, nothing is wrong. It has been several months since the
last Mikey update and some of you have expressed concern about my well-being. I'm
fine, but I've been very busy. About three months ago, we entered into
a series of secret internal discussions over a revolutionary change in
the way... Well, I can't give away too much -- yet. I have been working
on the project since then and all I can say is that it's big
and it's going to blow you and the competition (not that we have any :-)
away.
Stay tuned in the coming weeks for more info...
Oh, and on a totally unrelated note, I just thought I would point out
that today is our eight month anniversary! Here are some interesting statistics
compiled since we opened:
- Number of pizzas devoured: 436
- Number of blinky lights in the server room: 7000 (it's so kewl!)
- Number of PagePlan designers who have shed their vehicles to visit the
spaceship behind the comet: 0
- Number of successful attempts to hack a Macintosh anywhere in the world: 0
4 April 1997
Whoops...my fault. We had a crash of the file server that lasted less
than twenty seconds, so short a time that no one noticed; we experienced
no timeouts. But I forgot to re-link the FTP server in the aftermath. So
although the FTP server was actually working, it could not find its homespace.
Fortunately it was Friday and no one cared to do any work anyway so it
was a moot point.
6 March 1997
The secure server was down for approximately one hour this afternoon
as we upgraded the software. It was necessary to do so in the middle of
the business day to coordinate efforts with secure service customers at
a time that was most convenient for them. No other services were affected.
In the next three weeks, we will be upgrading our web serving software,
the DNS software, and the email software. Each of those upgrades will occur
between 2 and 5 AM EST and will result in interruptions for only as long
as it takes to reboot the given machine.