PagePlop 1997 Web Hosting Server Status

1997 Reports

23 December 1997

We were offline this evening from 1:30 AM until 2:10 AM EST making adjustments to our file sharing software. In addition, we have performed routine random file erasing to ensure the integrity of our system and to keep our customers on their toes. Furthermore, we have disconnected the T1 lines and inserted them into our freezer and are performing our own experiments with food irradiation. As a side experiment, we are testing to see if frozen bits move faster. We hypothesize that they do. If you notice any speed difference, please send a note to us.

12 December 1997

BellSouth is experiencing some intermittent errors today in their routing systems throughout the southeast US. The net result is that our T1 lines are bouncing up and down at irregular intervals, though they rarely all do so at the same time. Even then, any general outage resulting from a simultaneous bounce on all lines lasts no more than about 10 seconds - not even long enough for a timeout to occur. As of 3:15 PM eastern time, all telco is up and running stable; for how long only BellSouth knows for sure.

4 December 1997

We have no idea...

The web servers crashed at 12:02 PM today for no apparent reason. They were down for about two minutes and we're trying to track down the problem. It seems (for now) that it was just a transient event that is not repeatable. We will continue our attempt to determine what happened and find the appropriate client to blame it on. We're thinking Gregory at this moment, but Michael or Barbara might do just as well. :)

28 November 1997

We are provisioning in a new T1 line today. Unfortunately, things involving BellSouth that require so many different people need to be done during business hours. Installing a new T1 is one of those things. Since this is a holiday weekend, we figured that it would be best to perform the install now rather than wait till a normal weekday. What we are installing is a load balanced T1 which is why this is so difficult and taking so long. Normally when a new line goes up, it is a redundant T1 - not load shared. By load sharing, we not only get increased throughput, but we also get full redundancy if one or more of the T1s hose.
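The difference between a load shared line and a plain redundant one can be put into numbers with a rough Python sketch. The only hard figure is the T1 line rate of 1.544 Mbit/s; the two-line setup and the function names are made up for illustration and do not describe the actual PagePlop configuration.

```python
T1_MBITS = 1.544  # line rate of a single T1, in Mbit/s


def load_shared_capacity(lines_total, lines_down=0):
    """Usable bandwidth when traffic is spread across every working line."""
    lines_up = max(lines_total - lines_down, 0)
    return lines_up * T1_MBITS


def standby_capacity(lines_total, lines_down=0):
    """Usable bandwidth when the extra lines only sit idle as spares."""
    return T1_MBITS if lines_total - lines_down > 0 else 0.0


# Hypothetical two-line setup: compare both schemes with every line up,
# then with one line down.
for down in (0, 1):
    shared = load_shared_capacity(2, down)
    standby = standby_capacity(2, down)
    print(f"{down} down: load shared {shared:.3f} Mbit/s, standby {standby:.3f} Mbit/s")
```

With every line up, the load shared pair carries twice what the standby pair does; with one line down, both schemes fall back to a single T1.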

We will be bouncing up and down starting at 2:30 PM EST and we expect the entire process to take about three hours to get all the telco lines rebalanced. During that window, you will experience intermittent loss of connectivity, though we expect to be up more than we are down. As always, email will be stacked up for delivery when everything is stable.

Addendum:

Well, it looks like we weren't the only ones to make the decision to do maintenance today. The internet as a whole looks pretty hosed. And the problems are widespread with most carriers affected with at least a 30 percent packet loss. As of 5:27 PM EST, we are back humming - fully load balanced. Yeah...

27 November 1997

What are you people doing awake anyway?!? Every one of you should be passed out from turkey bloat by now, which is why we decided that it would be a good time to do some maintenance on the server. We took it down for about an hour and when I got back to my email, there were a half-dozen messages in there telling me the server was down. Duh... :)

Anyway, it was time to rebuild some of the drives and upgrade some of the software, so that is now done. Next time, I'll give some warning specifically when we do something like that. Now, go to sleep.

18 November 1997

Hurry hurry hurry. Step right up and get your fresh hot roasted "DNS name not found" errors. There's only a limited supply folks so get yours while they last. Where you say? Why, come on down to Big NIC's (InterNIC's that is) J Root Server Emporium. Where you can find a fine selection of corrupted and incorrect DNS names to suit just about anyone's tastes.

Now don't get hornswoggled by one of those other root servers that only hand out "correct" DNS entries. What you need my friend is a genuine leather bit bucket so you can load up on all these great domain names that have gotten completely lost in the shuffle.

Deals deals deals is what you'll find at Big NIC's. Oh, and tell 'em that Mikey sent ya!

17 November 1997

This morning was proof that if something can go wrong, it will. We have a full system check in place that will warn us if anything goes wrong. Alarms go off...bells ring...and lasers kill intruders if something in the system fails. Well, we found a particular failure this morning that was completely unexpected.

The web server went south at 8:37 AM EST. But the machine was still pingable. And worse yet, the file sharing system was still up and running. As far as any of our checks were concerned, the machine was still up and running and the web server was still listening for incoming requests -- it was accepting those requests, but not sending anything back. At the same time, FTP services, email, secure services, and database services were still performing flawlessly so we didn't even have any back-up warning condition to trigger an alarm (as we would have had if the server had died completely).

Normally, even a situation as bizarre as that would have been caught in moments IF anyone had been awake at the time. Unfortunately, this morning was one of the rare times that all of us were asleep at the same time. When I awoke at 11:30 AM, it was immediately obvious that something was wrong and a simple relaunch got us up and running within seconds.

We will be spending the balance of the afternoon on the phone with vendors and searching through technical documents to see if what happened is logged anywhere. And then we will add still another checking mechanism to our testing program to prevent any further occurrence of this same type.
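A check of this kind might look something like the sketch below: instead of pinging the machine, it asks the web server for a page and counts the server as up only if an HTTP reply actually comes back before a timeout. This is a minimal Python illustration, not the monitoring system described above; the function name, the timeout, and the choice of a bare HTTP/1.0 GET are all assumptions.

```python
import socket


def web_server_is_serving(host, port=80, path="/", timeout=5.0):
    """Return True only if the server sends back the start of an HTTP reply.

    A machine that answers pings, or even accepts the TCP connection,
    but never sends a response is reported as down.
    """
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    try:
        # The timeout covers the connect, the send, and the wait for a reply.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            first_bytes = conn.recv(64)
    except OSError:
        # Connection refused, host unreachable, or no reply within the timeout.
        return False
    return first_bytes.startswith(b"HTTP/")
```

Run from a second machine every minute or so, a probe like this would flag the "listening but silent" condition that a ping or a file sharing check cannot see.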

Sorry if anyone was inconvenienced.

15 October 1997

The primary peering point for the Eastern United States, MAE-EAST, has been experiencing major routing problems since last Thursday night. Traffic through MAE-EAST has been intermittent at best. Today, MAE-EAST is completely hosed and all traffic between major carriers is essentially non-existent. This is gonna be a really long day for the Web.

We are fine and our telco link is fine. And almost without exception, the ISP you are dialing into as well as their telco link is fine; it's just no one is talking to each other. The only thing to do is ride it out. Time to go to lunch... :)

14 October 1997

Today was going to be my day off. I had been programming furiously for the last several months and it was time to just sit back and relax. So naturally, I hopped in my car and drove down to the local RouterShack to gawk. I had just gotten through looking at a WimpyBit 250 router when, in the distance, I heard a blood-curdling scream. Naturally, I was concerned, but it was probably that same concern that caused me to overlook a critical detail about the scream. It would not be my last mistake.

I started making my way through the store looking for someone in distress. As I neared the fusion-powered router aisle, I heard it again, but this time, there were hundreds, or even thousands of screams! I rounded the corner and then stared in shock. It couldn't be...

There in front of me was a demonstration of a RouteMaster 2000 (a.k.a. the Packet-Knosher)! The screams of the millions of bits being slammed down the demonstration T1 were deafening. Naturally, I bought one.

I hooked it up this morning and true to the demo, the bits are screaming through it. I realize it's a little disconcerting to hear these types of sounds coming from your modem but you can rest assured that no one is harming the bits in any way. Those of you who are concerned that the bits are being overworked can send your complaints to the Bit Abuse Hotline.

1 October 1997

Problem. Big problem. InterNIC operates nine name servers named "A" through "I". (Actually, they have several more, but they are experimental and are not in hot use.) Anyway, when you type in a URL or hit one of your bookmarks, your request goes to the local cache on your dial-up provider to resolve that name. If the URL is not in the local cache, your local dial-up provider sends out a request to one of the InterNIC name servers to resolve it. Normally, the InterNIC name servers would return the information where it would then be placed into the cache of your dial-up provider. That way if you or anyone else from your dial-up service wanted to look at that domain again, the URL request would go to cache, saving traffic on the InterNIC name servers.

If one of the InterNIC name servers is completely down, then there is no real harm done. Your request simply goes to another name server to pick up the information. But if the information returned from the InterNIC name server is wrong, then that wrong information is cached at your local dial-up provider. If you or anyone else on that system then places the same URL request, that wrong information is returned locally and the InterNIC name servers are not even queried again.
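The two cases above (a dead name server versus a wrong one) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not real resolver code: the class name, the round-robin choice of server, the "NXDOMAIN" marker standing in for a "No DNS found" answer, and the example address are all hypothetical.

```python
class CachingResolver:
    """Toy model of a dial-up provider's name cache."""

    def __init__(self, name_servers, ttl_seconds):
        self.name_servers = name_servers  # callables: name -> answer
        self.ttl = ttl_seconds
        self.cache = {}                   # name -> (answer, expires_at)
        self.next_server = 0

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]              # served locally, right or wrong
        for _ in range(len(self.name_servers)):
            server = self.name_servers[self.next_server]
            self.next_server = (self.next_server + 1) % len(self.name_servers)
            try:
                answer = server(name)
            except ConnectionError:
                continue                  # server is down: just ask the next one
            self.cache[name] = (answer, now + self.ttl)
            return answer
        raise ConnectionError("no name server reachable")


def healthy(name):
    return "192.0.2.10"                   # documentation-range address


def down(name):
    raise ConnectionError("server unreachable")


def corrupted(name):
    return "NXDOMAIN"                     # wrong answer for a live domain


# A dead server costs nothing: the next server answers.
print(CachingResolver([down, healthy], 300).resolve("example.com", now=0))

# A wrong answer sticks until the cache entry expires.
resolver = CachingResolver([corrupted, healthy], 3600)
print(resolver.resolve("example.com", now=0))     # bad answer gets cached
print(resolver.resolve("example.com", now=1800))  # still the cached bad answer
print(resolver.resolve("example.com", now=3600))  # expired: asks upstream again
```

The second resolver never gets to benefit from the healthy server until the cached entry times out, which is why a long cache lifetime makes a corrupted name server so much worse than a dead one.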

What is happening today is that the "H" InterNIC name server is partially corrupted. It appears that all the .com, .org, and .net addresses are corrupted. It does not appear that .edu or .gov are affected. If anyone sends out a URL request to a .com address and the "H" name server is hit then that "No DNS found" response is getting placed in the dial-up provider's cache. Again, once that happens, any subsequent request for that URL on that local system continues to show "No DNS found" errors even though the domain is live and running just fine.

As the morning progresses, more and more of the requests are going to statistically hit that "H" server and the erroneous DNS information will get cached all over the place. Unless InterNIC intercedes quickly, and I mean before the east coast of the United States wakes up and starts browsing, today will be a very bad day for the World Wide Web.
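To put a rough number on "statistically", here is a back-of-the-envelope sketch in Python. It assumes each uncached lookup picks one of the nine name servers uniformly at random, which is a simplification and not necessarily how any real resolver chooses; the function name is made up for illustration.

```python
def chance_of_bad_answer(lookups, servers=9, bad_servers=1):
    """Probability that at least one of `lookups` uncached queries lands
    on a corrupted server, assuming each query picks a server uniformly
    at random and independently of the others."""
    healthy_fraction = (servers - bad_servers) / servers
    return 1 - healthy_fraction ** lookups


for n in (1, 5, 10, 25):
    print(f"{n:>2} uncached lookups: {chance_of_bad_answer(n):.0%} chance of hitting the bad server")
```

Under that assumption a single lookup has a one in nine chance of landing on the bad server, and the odds climb quickly as a provider makes more uncached lookups over the course of the morning.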

This problem is not only affecting the Web, it is also affecting email, FTP services, and anything else that relies on DNS name resolution. At best, it will be an inconvenience if you can't update your pages or browse URLs today. At worst, it will mean that emails will get trashed, never to be delivered.

Unfortunately, all this is totally out of our control and the control of your dial-up provider. The best thing that can be done is for your local ISP to flush their cache regularly all day long until the problem with the "H" server is resolved. Those flushes will have to take place with increasing frequency as the day progresses and more and more people hit that bad name server to cache bad information. We have our cache set with a five minute time to live which means that any DNS entry in our cache expires in five minutes. Many providers, though, set their cache for one hour, 12 hours, 24 hours, or longer.

Let's just hope that InterNIC gets its act together and shuts down the "H" name server quickly.

InterNIC Update:

We spoke with InterNIC at about 6:30 AM and told them of the problem. At 6:50 AM, they called us back to inform us of the steps that have been taken. According to InterNIC, an email has been sent to the sysop of the "H" server explaining the situation. Unfortunately, the problem may not resolve itself any time soon.

The way the system is set up, InterNIC has direct control over the "A" server only; the administration of the others is contracted out and their upkeep is dependent on the individuals who are contracted to maintain the servers. The "H" server appears to be located at the Aberdeen Proving Ground in Maryland. Assuming that the individual who is responsible for maintaining that server gets the email, reads it, and responds appropriately, it may be several hours before anything is done. Worst case is that the responsible individual is off today and does not get the email. That will mean that nothing will be done until the normal InterNIC updates which occur overnight.

Begin Editorial Comment*****

You would think that InterNIC would have a better way of resolving root server problems than simply sending an email. We are going to do a whois on the "H" server and see if we can get a contact number.

End Editorial Comment*****

Additional InterNIC update:

At 7:05 AM, we reached by phone the sysop who is responsible for maintaining the "H" server and notified him of the problem. It is being attended to as I write this. Looks like the crisis is averted. Yeah...

Begin Editorial Comment*****

We found that phone number with a simple whois. Now, why couldn't InterNIC have done the same thing?

End Editorial Comment*****

Additional, additional InterNIC update:

At 7:40, the problem still exists with the "H" server and with the east coast waking up and hitting the Web, sending emails, and the like, things should start bogging down very shortly. Remember, once the "H" server corrupted information gets cached on your ISP's system, it will not matter even if "H" comes back up; that information will live in cache and be corrupted until that cache flushes. It's beginning to look like a long day again...

Another additional InterNIC update:

8:35 and still hosed. Oh well. I did what I could, so I'm going to sleep. It looks like the way the system is set up, the only time you can correct errors to the name server tables is on the overnight updates. At least that is what I would have to presume; certainly enough time has passed to have done something by now -- even if it is to simply turn it off.

Final InterNIC update:

As of 12:25 PM, the "H" name server appears to be now responding properly. I can't determine exactly when it started working again since I just woke up. And now I am going back to sleep. See y'all this evening...

10 September 1997

We have been experiencing intermittent problems today with the web server. The cause has been isolated to a piece of crap software called NetCloak which is responsible for text (but not graphical) counters. We have removed the offending piece of garbage which will cause all text counters to immediately stop. If you are using a text counter, please go to our tutorials page for instructions on temporarily substituting any one of our available graphic counters for your text counter. Please email or call us if you need assistance with this process. We apologize for any inconvenience. We will get this straightened out as soon as possible. For anyone using counters as page statistics, please email us and we will be more than glad to provide log statistics free of charge for the month of September.

5 September 1997

You may notice that nothing of any significant function is currently working. We know that (and this is gonna be a long night.) We have completely rewritten how back end processes are handled by the various servers and are in the process of installing the new software. (For those of you keeping up with the implementation of MGI, this is round one. The next major upgrade comes in October.) The whole thing should take until about 5 AM to finish at which time the servers will be running so fast and efficiently that you will need SuperMikey to slow things down so that people can actually read them. We will return control when you place $50,000 in small bills in a brown paper bag...no wait, that's a different scenario (gotta quit watching Showtime.)

Anyway, email, the web server, secure services, and all other non-CGI functions will not go down.

22 August 1997

Well, this is the deal. We are experiencing some sort of bizarre instability involving the web server on port 80. This has been going on for two days now. There does not seem to be a particular pattern to the freezes, nor have we altered anything in the past several weeks that would cause a problem. The glitch seems to involve an interaction between our forms processor, the server software, and AppleTalk. What happens is that the CGI will come forward as the frontmost application and lock up, but the machine running the processor does not itself freeze. From the perspective of the internal LAN, nothing seems to be wrong. AppleTalk still functions and the machine still takes pings over TCP/IP. Nothing gets served out, though. And the machine itself does not unmount from the network.

When this happens, you will be able to open an FTP connection and see your hierarchy, load things into your folder, and get files from your folder. Database services and secure services are not affected either, so if someone has already entered those areas, they will be successful until they try and get back to the standard web server machine. Then they get a No-DNS error.

We have developed a process that constantly checks the web server and sounds an alarm if a freeze should occur. Unfortunately, all it can now do is warn us when something is wrong; the auto-reboot systems will not work since there is technically nothing wrong with the computer or filesharing. Fortunately, we live with the servers and are here to do a manual reboot in the event of a freeze. In most cases, the reboot will occur so quickly that an end user will not even time out, but occasionally they may see a No-DNS found message. In that event, when they try again, the server will be back up and running.
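The general shape of such a checking process might be sketched as below. This is a hypothetical Python outline, not the actual process described above: `probe` stands for whatever test decides that pages are really being served, `alarm` for whatever wakes a human up, and both names, the 30-second interval, and the scripted results in the demo are assumptions.

```python
import time


def watch(probe, alarm, interval_seconds=30.0, max_checks=None, sleep=time.sleep):
    """Call probe() over and over; raise the alarm each time the server
    goes from answering to not answering.

    max_checks=None means run forever; a number is handy for testing.
    """
    was_up = True
    checks = 0
    while max_checks is None or checks < max_checks:
        is_up = probe()
        if was_up and not is_up:
            alarm("web server stopped answering")
        was_up = is_up
        checks += 1
        sleep(interval_seconds)


# Demo with a scripted probe and no real waiting: two separate freezes,
# so the alarm fires twice.
results = iter([True, False, False, True, False])
alarms = []
watch(lambda: next(results), alarms.append, max_checks=5, sleep=lambda seconds: None)
print(alarms)
```

The alarm fires only on the transition from up to down, so a freeze that lasts through several checks rings once rather than every 30 seconds.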

This problem has occurred five times in the past 65 hours and at this time represents little more than an irritation for everyone involved except us as we rip our hair out trying to figure out what the problem is. I'll keep you posted.

20 August 1997 - Part Two

Computers are like knees; they act as precursors to evil things when they start acting up. I should have known this morning that when we started to have intermittent instability for no reason we were in for some real problems. Well, those problems arrived in the form of a massive front line shortly after 7 PM EDT. We're talking violent lightning, hail, tornadoes, and other Acts of God extending in a line from Richmond, Virginia down, and past, Atlanta, Georgia. Major damage...death and destruction...sheep with anthrax roaming wild...rabid bats...

We are actually up and running; the UPS power backup kicked in perfectly and everything worked just fine internally. Unfortunately, much of the telco in the southeast region was down completely or having so many problems with network overload that it was effectively down. We have engineers crawling all over Raleigh and the surrounding area and we do know that our telco back to the managed facility is running, but beyond that...well, reports are coming in that there is nothing beyond that.

So there is only one thing left to do at this moment...grill some rib-eyes. And you are welcome to join us. The only problem is that by the time the network is back up and you have read this, we will have finished eating and there will be none left. You snooze...you lose. :)

Number of disgusted Steves: potentially one if the steaks burn on the grill
Number of frazzled Mikeys: one, who has been peering into the T1 throughout the night waiting to see packet movement. He'd better watch it because if those packets start moving while he has that line to his eye, he's liable to get pinged.
Number of culinary Vals: one, who is womaning the grill
Number of really ticked off telco engineers: many as they wander around looking for problems while trying to avoid sheep with anthrax and rabid bats.

20 August 1997 - Part One

We've been having some trouble with the secure server this morning. The problems started at 3:34 AM and have caused some instability in the system. We think we have it fixed now and we will continue to monitor it throughout the day and tonight. It's interesting, though...when you have a bad head cold is when things start messing up big time. Never happens when I'm wide awake and well...

Number of ill Steves: one
Number of Vals who are tired of ill Steves: one
Number of Mikeys who refuse to come over and work on this himself for fear of getting a really bad head cold: one

19 July 1997

Do I look like I'm in a bad mood? I have no idea why you would say that just because the TELCO HAS BEEN OUT FOR FOUR HOURS!!!!!!

In a stunning example of irrational incompetence, GTE has demonstrated just why many of their customers hate them. Oh, did I not tell you? Though we contract with BellSouth (the most reliable and efficient provider of telco services in the world), because of the decision issued by Judge Greene (who would best be served by joining Judge Crater), BellSouth is not allowed to provide their world-class services without interference...uh, mandatory technical assistance. And that help comes in the form of GTE for one little hop of a large copper wire as it passes through Research Triangle Park, North Carolina. You see, RTP is serviced by GTE and GTE alone, which is BS because of Greene. And there is no way around it. But when GTE drops the ball like they are world famous for doing, everyone else is affected.

In this particular instance, their ball-dropping resulted in taking down not only our telco, but also most of the T1 services in the region, including but not limited to, credit card authorization services and ATM machines. On a Saturday afternoon. If you think I was torqued, you should have seen the retail store managers in Crabtree Mall who don't know a T1 from a Black Snake (and wouldn't know what to do with either if handed to them.)

The problem turned out to be a card in a router in a building in RTP, the control of which fell under GTE. Any other company would have gotten a tech there in about 10 minutes (no wait...any other company would have had techs on-site 24/7) and swapped out the card with one that they keep in redundant reserve. Of course, that would be too much of an effort for GTE, not only to get a tech on site in anything short of four hours, but to actually have a redundant card handy. So what they did -- get this -- is to rig a temporary ring around the router to get the telco back up. And once it was back up, they then waited until 2 AM to take the 30 seconds it took to swap out the card.

From rumors I have heard, GTE is apparently being read the riot act by some very high-up people at BellSouth and, with any luck and the constant repetition of the words, "breach of contract," GTE may actually get their act together.

12 July 1997

2:08 PM:

Mikey: We're down...
Steve: Duh...
Mikey: Check the web server and DNS. I'll check the router.
Steve: Duh...OK...

2:10 PM:

Mikey: The router is fine
Steve: All internal systems are fine.
Mikey: We're still down.
Steve: Duh...

2:11 PM:

Mikey: Time to call BellSouth.
Steve: OK...
Val: Good morning
Steve and Mikey: Good morning.
Steve: We're down.
Val: (Who is not quite awake yet...) Duh, OK...

2:12 PM

BellSouth: Network Operations Center, can I help you?
Steve: We're down.
BellSouth: Duh!
Mikey: What's going on?
BellSouth: Major network outage...seven states completely off line...router burned...death and destruction...locusts, boils, and assorted pestilence sweeping the landscape...
Steve: Duh.
Mikey: Any ETFTS? (Estimated Time to Fix The Screw-up)
BellSouth: We're on top of it. No time of ETFTS.
Mikey: Duh...we'll call back.

5:58 PM

During the past three and one-half hours, we have entertained ourselves by making some software upgrades, doing the laundry, taking a moment to create life in a test-tube and calling BellSouth every half-hour at the top and bottom of every hour. We're beginning to get bored.

Steve: We're still down.
Mikey: Duh.
Val: Did you turn the air conditioning up last night?
Steve: No. Let's call BellSouth again.
Mikey: OK...
BellSouth: Network Operat....
Steve: We know who you are. We know where you live. When are we going back on line?
BellSouth: Should be within the hour.
Val: Are you sure you didn't turn the air up last night?
Steve, Mikey, and BellSouth: YES!!! WE'RE SURE!!!

6:02 PM

Mikey: Hey...We're up!!!
Steve: Yeah!!!

Mikey report: Sent out a total of 24,000 packets, all of which were lost during the outage. At one point, Mikey got so frustrated that he grabbed a shotgun and started skeetshooting the bounced packets that were getting returned.
Steve Report: Blood Pressure back to normal.
Val report: Still wondering if anyone messed with the air conditioning settings last night. If you would like to take responsibility for the air conditioning, please email Val and do so.

25 May 1997

We just had our first real-life test of the UPS systems under actual power outage conditions. Wow...that was cool!!! After 46 minutes, the battery back-ups were still going strong without even breaking a sweat. Didn't even come close to cranking up the generator. Of course, the beeping every 30 seconds drove me completely insane (not to mention all the phones, most of which are not hooked up to the back-up power) and I found myself having the strong desire to root around in the mud puddles forming outside.

Number of chirping phones: Five
Number of mud puddles: One, but it is very large.
Number of happy Mikeys: One - he slept through the entire incident.
Number of happy Steves: Zero - not that I am not ecstatic that the system worked as intended, but the thunderstorms are putting a real damper on the IHOS festivities for this evening.
Number of happy Vals: One - she took the opportunity to take a shower and now feels much better.

2 May 1997

Hi folks. I'm about at the end of my rope with this whole Internet thing. The phone has been ringing off the hook for the past three days. PagePlop customers are calling in left and right to say that they can't reach their web sites or FTP accounts. Unfortunately, there is absolutely nothing I can do about it.

PagePlop has not been down for one second. We have not had a single packet loss in our transmission. Our provider, BellSouth, is also completely stable, up and running, without a single hitch. The problem is that Sprint, UUNet, AlterNET, IBM network, and a half-dozen other telco providers cannot keep their systems up to save their lives.

This morning alone, all of the above mentioned providers have been completely down for substantial periods of time and have had significant transmission problems the rest of the time. If you are using a dial-up provider that is on Sprint, then it is most likely that you can see anything else that is also on the Sprint network, but that is all you're going to get. They are cut off from the rest of the world. And you certainly can't reach us. Neither can anyone else who is hung on any local ISP who uses Sprint. And no one from the rest of the world can see anything hosted on a Sprint network system.

One of the primary reasons that we chose BellSouth is that they do not have these problems - at all. You are paying PagePlop to provide world-class service when it comes to your web hosting needs. We are doing just that. And part of that is going with a telco provider that will be our partner in giving you that world-class service. You are also paying money to get connectivity via your local ISP. They are apparently not concerned that their providers couldn't care less about their customers.

So help me, if your automobile or refrigerator gave you this many problems you would be screaming at the place that sold you the product, the Attorney General's office, and the Better Business Bureau. Why don't you do the same about your dial-up providers and the national networks they get their service from? Your providers can correct their problems anytime they want to by installing the appropriate equipment for the task and by hiring a team of professionals to manage it. They are not doing so. Make them.

26 April 1997

Holy bit bucket, Batman -- Something's wrong...

Right you are boy wonder. It seems that a pernicious presence provider has been peddling problematic packets.

Oh no! Not nefarious newbies nuking the net!

Fortunately, the people at PagePlop protected their patrons from peril.

What could cause such a cascading cosmic catastrophe?

Rogue routines running routers ragged, Robin.

Status of Mikey: Sleepy (can't you tell :)

15 April 1997

Don't worry, nothing is wrong. It has been several months since the last Mikey update and some of you have expressed concern about my well-being. I'm fine, but I've been very busy. About three months ago, we entered into a series of secret internal discussions over a revolutionary change in the way... Well, I can't give away too much -- yet. I have been working on the project since then and all I can say is that it's big and it's going to blow you and the competition (not that we have any :-) away.

Stay tuned in the coming weeks for more info...

Oh, and on a totally unrelated note, I just thought I would point out that today is our eight-month anniversary! Here are some interesting statistics compiled since we opened:

Number of pizzas devoured: 436
Number of blinky lights in the server room: 7000 (it's so kewl!)
Number of PagePlan designers who have shed their vehicles to visit the spaceship behind the comet: 0
Number of successful attempts to hack a Macintosh anywhere in the world: 0

4 April 1997

Whoops...my fault. We had a crash of the file server that lasted less than twenty seconds, so short a time that no one noticed; we experienced no timeouts. But I forgot to re-link the FTP server in the aftermath. So although the FTP server was actually working, it could not find its homespace. Fortunately it was Friday and no one cared to do any work anyway so it was a moot point.

6 March 1997

The secure server was down for approximately one hour this afternoon as we upgraded the software. It was necessary to do so in the middle of the business day to coordinate efforts with secure service customers at a time that was most convenient for them. No other services were affected.

In the next three weeks, we will be upgrading our web serving software, the DNS software, and the email software. Each of those upgrades will occur between 2 and 5 AM EST and will result in interruptions for only as long as it takes to reboot the given machine.
