In case of error, crash server.



  • Had a client frantically calling and emailing me that his website wasn't loading and was displaying a MySQL "too many connections" error message.

    So I dutifully logged on via SSH and tried to crank up MySQL to have a look at the process list: "no space on /var/lib/mysql to perform this operation".

    Hmm, that's a new one ... how about disk space? Oops: 100% full. Which is strange on a dedicated box that has only ever been 4% full in the last two years of operation.

    After a while spent trying to find out where my disk space had gone, I happened across the Apache error logs. All 150 of them archived, and the current one at about 300GB - which in itself is strange, as the cron job should be rotating the logs after 7 archives. Then I noticed that all these error logs had timestamps within the last 24 hours, and the active log was filled with the same string repeated over and over again.

    "PANIC: fatal region error detected; run recovery"

    So it looks like Apache has been filling god knows how many petabytes of this string into the error logs, which are then gzipped and archived into 150 neat little "zip bombs".

    So I killed the Apache process, deleted all the error logs, and restarted Apache - and within 5 seconds the error log was back up to 100MB. Killed Apache again.

    Google to the rescue, and it was the work of a moment to discover that the problem lay in the mod_gnutls module, an alternative to the usual mod_ssl module. It uses a Berkeley database "cache", and this cache had somehow become corrupted - bear in mind the cache itself is only 24 kilobytes.

    When the cache becomes corrupted, mod_gnutls writes a helpful (sic) little message to the error log which identifies neither the failing module, nor the real problem, nor any suggested solution. And then it tries again to perform the same operation that led to the error in the first place. Again, and again, and again, until the end of the universe.

    Turns out all that is needed is to delete the corrupted cache database and restart Apache, whereupon the module will make a brand new cache and sanity is restored.

    So WHY can't this be automated as the solution to the error, instead of spamming the error log until the entire hard disk is filled with gzipped error logs? (Something along the lines of the sketch at the end of this post.)

    This has been known about since April 2010, and it seems that, 9 minor revisions later, it still hasn't been fixed.

    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=576676

    I can understand the fail-retry graceful recovery attempt, but wouldn't you think after 2^63-1 attempts, he might be thinking "hey, this doesn't seem to be working"?

    Arghh.
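
    By way of illustration, the automated fix wouldn't need to be anything smarter than a watchdog along the lines of this minimal Python sketch. The log and cache paths are assumptions for illustration, not mod_gnutls defaults - adjust for your own layout:

        #!/usr/bin/env python3
        # Hypothetical watchdog: recover from the mod_gnutls cache corruption
        # described above, instead of letting it fill the disk.
        import os
        import subprocess
        import time

        ERROR_LOG = "/var/log/apache2/error.log"        # assumed log path
        CACHE_FILE = "/var/cache/apache2/gnutls_cache"  # assumed GnuTLSCache path
        PANIC = "PANIC: fatal region error detected; run recovery"

        def follow(path):
            # Yield lines as they are appended to path, like `tail -f`.
            with open(path) as f:
                f.seek(0, os.SEEK_END)
                while True:
                    line = f.readline()
                    if line:
                        yield line
                    else:
                        time.sleep(1)

        for line in follow(ERROR_LOG):
            if PANIC in line:
                subprocess.run(["apachectl", "stop"], check=False)
                if os.path.exists(CACHE_FILE):
                    os.remove(CACHE_FILE)  # delete the corrupted Berkeley DB cache
                subprocess.run(["apachectl", "start"], check=False)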



  • @daveime said:

    So WHY can't this be automated as the solution to the error

    It can.
    @daveime said:
    This has been known about since April 2010, and it seems that, 9 minor revisions later, it still hasn't been fixed.

    So fix it.



  • sigh …

    Yet Another Reason why Windows people like me think that Linux is such a 'yee-haw!' OS by comparison.

    And I agree with SuperJames74: if you're running a Linux box (especially a server), you just have to accept that you'll spend a lot of time patching and fixing your way around other people's (largely undocumented) fails.

    So … fix it, already. :)



  • *popcorn*



  • This is why "total cost of ownership" is more important than the up-front licensing cost. ;)



  • @Cad Delworth said:

    Yet Another Reason why Windows people like me think that Linux is such a 'yee-haw!'
     

    My organization is required by law to use free/open-source software. Nearly nonexistent documentation, projects abandoned, then forked, then abandoned again, bugs going years without being fixed... I see it all the time. And I've been here just a few months.



  • @Cad Delworth said:

    sigh …

    Yet Another Reason why Windows people like me think that Linux is such a 'yee-haw!' OS by comparison.

    And I agree with SuperJames74: if you're running a Linux box (especially a server), you just have to accept that you'll spend a lot of time patching and fixing your way around other people's (largely undocumented) fails.

    So … fix it, already. :)

     

     Someone told me that

    @some haxor said:

    the difference between IIS and Apache doesn't matter at all.

    Hence, if the HTTP server is responsible for maybe 0.5% of your users' application experience, you shouldn't spend more than 0.5% of your development time talking about it.




  • @atipico said:

    @Cad Delworth said:

    Yet Another Reason why Windows people like me think that Linux is such a 'yee-haw!'
     

    My organization is required by law to use free/open-source software. Nearly nonexistent documentation, projects abandoned, then forked, then abandoned again, bugs going years without being fixed... I see it all the time. And I've been here just a few months.

     

    Leave, while you are still sane.

     



  • @Shoreline said:

    *popcorn*
     

    Don't mind if I do.

    I'm gonna join you on this one, it looks --

    EUUGH!  SALTY POPCORN! HEATHEN!



  • @Cassidy said:

    @Shoreline said:

    *popcorn*
     

    Don't mind if I do.

    I'm gonna join you on this one, it looks --

    EUUGH!  SALTY POPCORN! HEATHEN!


    Lol.



  • @daveime said:

    So WHY can't this be automated as the solution to the error, instead of spamming the error log until the entire hard disk is filled with gzipped error logs?

     If you could automate all error management, you would be making error-free programs :)

    @daveime said:

    This has been known about since April 2010, and it seems that, 9 minor revisions later, it still hasn't been fixed.

    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=576676

     

    Also note that since April 2010 not a single user has cared about it, so this isn't a priority. I guess if someone had the same issue and commented on it, it would get more attention?

     

    Also, now you have learned the hard way why you have to keep your data away from your log file. You can never predict when a bug in the code will go rogue and fill your logs :) I make sure to always have a separate partition for log files (a sample fstab line follows below).

    In the past I have seen a server go down due to a disk full of legitimate log entries. The glorious days of some IIS virus trying to run cmd.exe like crazy, increasing traffic to 100 or 1,000 times normal and thus generating several gigs of query logs every day.
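
    For reference, that separate partition is a single line in /etc/fstab - the device name below is a placeholder for whatever volume you dedicate to logs:

        # hypothetical fstab entry: logs on their own volume
        /dev/mapper/vg0-log   /var/log   ext4   defaults   0   2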

     



  • @tchize said:

    Also note that since April 2010 not a single user has cared about it, so this isn't a priority.
     

    If it happened often enough then users would sit up and take note and it'd be escalated in priority. I'm guessing it's infrequent enough to be off the radar.

    @tchize said:

    I guess if someone had the same issue and commented on it, it would get more attention?

    Yup - more incident reports -> more affected users -> greater impact and driver for change.

    @tchize said:

    Also, now you have learned the hard way why you have to keep your data away from your log file. You can never predict when a bug in the code will go rogue and fill your logs :)

    I'm not sure separating out logs and data would have helped in this case - ISTR that Apache fails to start if logfile growth is impeded (full disk partition, quota exceeded), on the basis that if it can't write any auditing, it prefers you to clear the files down and/or turn logging off.

    @tchize said:

    I make sure to always have a separate partition for log files.

    The modern way is to dump all the variable stuff into one partition and use quotas to limit its growth. I used to have plenty of separate partitions until I started using LVM, then stuck to managing quotas more dynamically.

    The first WTF I spot is that there appears to be no server monitoring that would have drawn attention to an unexpected leap in disk consumption (logwatch is your friend). Managing servers and deliverable services should be a proactive activity, not a reactive one.



  •  The log rotators in Linux can have limits set on how large log files are allowed to become, and on how many archives to keep before discarding old ones. I know it's a bit late for that now, but it would have prevented this from happening.
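
    For example, a logrotate stanza along these lines caps both the size and the number of archives - the path and the numbers are illustrative, not distribution defaults:

        # hypothetical /etc/logrotate.d/apache2
        /var/log/apache2/*.log {
            # rotate any log that has grown past 100MB
            # (checked each time logrotate runs)
            size 100M
            # keep at most 7 archives, discarding the oldest
            rotate 7
            compress
            missingok
            notifempty
        }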



  • @ASheridan2 said:

    The log rotators in Linux can have limits set on how large log files are allowed to become, and on how many archives to keep before discarding old ones. I know it's a bit late for that now, but it would have prevented this from happening.
     

    That's a nice thought, but...

    @daveime said:

    Then I noticed that all these error logs had timestamps within the last 24 hours

    Logrotate usually only runs once a day.  If your system is sufficiently determined and capable of filling up all disk space in less than that time, you're doomed anyway.

     

     



  • @SuperJames74 said:

    So fix it

    FixItYourself(TM), a classic FOSS trademark. Who cares if said issue is outside your domain knowledge? You have the source!


  • Considered Harmful

    @Soviut said:

    @SuperJames74 said:
    So fix it

    FixItYourself(TM), a classic FOSS trademark. Who cares if said issue is outside your domain knowledge? You have the source!

    Also, good luck getting your commits accepted upstream.



  • @Soviut said:

    http://www.tmrepository.com/

    WTF?



  • @Soviut said:

    @SuperJames74 said:
    So fix it

    FixItYourself(TM), a classic FOSS trademark. Who cares if said issue is outside your domain knowledge? You have the source!

     

    Yeah. "If you try fix, we'll sue you. But no, we'll not fix. And if you try to gather atention around it, we'll sue you (and if it a security flaw, we'll try to put you in jail)." is a much better way to deal with bugs.

     And, yeah again, you can keep saying that IIS is better than Apache because IIS never fails and neither does Windows.



  • @Cad Delworth said:

    sigh …

    Yet Another Reason why Windows people like me think that Linux is such a 'yee-haw!' OS by comparison.

    And I agree with SuperJames74: if you're running a Linux box (especially a server), you just have to accept that you'll spend a lot of time patching and fixing your way around other people's (largely undocumented) fails.

    So … fix it, already. :)

    So one obscure module with one obscure bug means that the whole of OSS is flawed? OK... good to know.

    Two questions, really:

    1. What's wrong with mod_ssl?
    2. Why keep logs and data on the same filesystem on a production server?




  • @Cad Delworth said:

    sigh …

    Yet Another Reason why Windows people like me think that Linux is such a 'yee-haw!' OS by comparison.

    And I agree with SuperJames74: if you're running a Linux box (especially a server), you just have to accept that you'll spend a lot of time patching and fixing your way around other people's (largely undocumented) fails.

    So … fix it, already. :)



     

     

    How long have you been in IT, again? Does any type of software come without flaws, bugs, or missing features? What are you actually trying to say? That MS writes superior code? How can you tell, if it's closed source? Your comparison of Linux and Windows does not make a lot of sense in the realm of software development, since you don't have the possibility to see MS' code. I would rather have undocumented code than no code at all... which is, of course, flawless and gets even better by installing service packs.

     

     



  • The idea is, they suppose that because there is a commercial company behind their software, it will more or less work and they can get a reasonably quick fix for any bug, while open-source software is supposedly never backed by a company and never gets any bugfixes.

     Depending on the software you use, that ranges between an exaggeration and flat-out wrong. In particular, there is often one or more firms that will gladly fix a bug in open software for a fee, which is exactly the same thing as paying licenses or a maintenance contract for closed-source software.

     Here the question is: why use an obscure and underused alternative to the mainstream option (mod_gnutls being an alternative to mod_ssl) without budgeting time to iron out bugs? When you change from one big commercial software shop to some obscure company, you take the time to test the product and see whether it seems acceptable or not. It's no different because it's open.

    (And I don't even hope that they will change opinions anytime soon, even towards a balanced stance. That would require that they actually use open software for a task, that their experience is significantly better than with competing closed-source software, and that they have the will to admit that their ideology is almost always a big exaggeration. And I can't see any direct benefit for them to even try.)



  • @Severity One said:

    Why keep logs and data on the same filesystem on a production server?
     

    In my case, one of my servers was partitioned to have all static stuff in one FS and dynamic stuff in another, meaning that /var/www and /var/log were in the same filesystem. They didn't need to be, just that website content and webserver logs fell into the same partition.

    Another server of mine (CentOS) shoves everything (except boot partition) in one big root FS and lets you use quotas to limit growth. So far this has been fairly successful - someone filling up their vhost area won't impact upon another's, and I can dynamically reallocate capacity by tweaking quota settings.

    I think gone are the days when data used to be physically separated to avoid bus or disk contention - most production systems would be using some hardware RAID or NAS where the total storage is presented as one disk to the OS, and different partitions may not actually relate to different physical disks anymore.



  • @brian banana said:

    How long have you been in IT, again?

    37 years. You?

    All I was trying to say was that IME MS-OS server PCs since NT Server tend on the whole to Just Work and not fall over. Also IME, Linux servers do by comparison seem to take an awful lot more work to get them running and keep them running.

    @Severity One said:

    So one obscure module with one obscure bug means that the whole of OSS is flawed? OK... good to know

    Try reading my post. I was dissing LINUX, not OSS in general. I use a number of OSS applications in Windows, after testing them and making sure they work at least as well as paid-for alternatives (which they usually do); but all of them are kind of niche applications for specific purposes.



  • @Cad Delworth said:

    Also IME, Linux servers do by comparison seem to take an awful lot more work to get them running and keep them running.
     

    Which Linux distros/versions have you experienced? And did you find they all required a large effort to bring up and keep up?



  • @Cad Delworth said:

    ... if you're running a <AnyOS> box (especially a server), you just have to accept that you'll spend a lot of time patching and fixing your way around other people's (largely undocumented) fails.

    So … fix it,already. :)

     

    FTFY



  • @daveime said:

    but wouldn't you think after 2^63-1 attempts, he might be thinking "hey, this doesn't seem to be working"?

    Who exactly is this "Person" who is supposed to notice 2^63-1 attempts? Apache? The CPU? Are you assuming that Apache is e-mailing its developer a copy of every log line from every instance of Apache in existence?

    WTF? How is any person supposed to notice some internal state of a program they wrote, let alone an instance of one that they can't attach a debugger to, since it is running on some unknown PC in some unknown location? You are TRWTF!

    Remember: computers are DUMB. They only execute instructions; if the computer is instructed to output a log line, it will output that log line every single time that condition comes up.

    Edit: why is CS eating my linebreaks?



  • @Cad Delworth said:

    @brian banana said:

    How long have you been in IT, again?

    37 years. You?

    All I was trying to say was that IME MS-OS server PCs since NT Server tend on the whole to Just Work and not fall over. Also IME, Linux servers do by comparison seem to take an awful lot more work to get them running and keep them running.

     

     

    12 (8 years admin and networking with Windows and Linux, and for the last 4 years a code monkey)!

    All I was trying to say was that IME there is no big difference between administering Linux or Windows once the server's initial setup is complete and the specific software (daemons, services) is installed and configured. And when it comes to stability... well... in regard to the initial WTF these comments relate to, I think that any buggy code can pull a server into the abyss... whether it be Windows or Linux.

     

     



  • @Cad Delworth said:

    @Severity One said:
    So one obscure module with one obscure bug means that the whole of OSS is flawed? OK... good to know
    Try reading my post. I was dissing LINUX, not OSS in general. I use a number of OSS applications in Windows, after testing them and making sure they work at least as well as paid-for alternatives (which they usually do); but all of them are kind of niche applications for specific purposes.
    Problems with computers usually start when you start using obscure stuff. It could be an obscure peripheral with an obscure driver from some obscure Asian country, or an obscure piece of software from pretty much anywhere in the world.

    For our Subversion/Jenkins/Artifactory server, I specifically went from Solaris x86 to Linux (CentOS) because of the ease with which you can install packages in Linux, with those packages maintained by someone else. Something similar for Solaris either doesn't exist or requires a subscription. And I value Solaris highly, higher than both Linux and Windows, but maintenance is just a lot more work. So when it comes to running our (Java) applications, everybody is happy with Solaris, but as soon as you start using third-party OSS software, Linux has the upper hand.

    And yes, we run production services perfectly reliably on Linux. The only time I need to reboot that Subversion/Jenkins/Artifactory server is when the SAN goes down.

    I'm not a Linux fanboy, far from it, mostly because of the disturbing fanaticism that some of its followers show. But if you stick to stuff that is Not Obscure, it Just Works.

     



  • I was going to post something but then Safari on iPad crashed. Apple. It Just Works.


  • ♿ (Parody)

    @Zemm said:

    I was going to post something but then Safari on iPad crashed. Apple. It Just Works.

    Probably a javascript thing. You should have fired up a debugger.



  • @esoterik said:

    @daveime said:
    but wouldn't you think after 2^63-1 attempts, he might be thinking "hey, this doesn't seem to be working"?

    Who exactly is this "Person" who is supposed to notice 2^63-1 attempts? Apache? The CPU? Are you assuming that Apache is e-mailing its developer a copy of every log line from every instance of Apache in existence?

    WTF? How is any person supposed to notice some internal state of a program they wrote, let alone an instance of one that they can't attach a debugger to, since it is running on some unknown PC in some unknown location? You are TRWTF!

    Remember: computers are DUMB. They only execute instructions; if the computer is instructed to output a log line, it will output that log line every single time that condition comes up.

    Edit: why is CS eating my linebreaks?

     

    Presumably, the "person" in question is the database library that mod_gnutls uses to cache data. Are you seriously telling me that this is a good design pattern ?

    10 READ SOMETHING FROM DATABASE
    20 IF SUCCESS, RETURN
    30 ADD MESSAGE TO ERROR LOG
    40 GOTO 10

    No limits on the number of retries - just keep trying, and trying, and trying to redo an operation that you know has just failed, until the end of the universe? (A bounded version is sketched at the end of this post.)

    This isn't fucking rocket science, this is programming 101 - excuse the language, but your tone and implication are that it's somehow MY fault, because I was foolish enough to trust that an Apache module might be stable enough not to go into an endless loop and eat my fucking hard disk, rather than just exit gracefully.

    Bear in mind, this isn't multiple page requests triggering one error each; this is a SINGLE page request that causes an endless loop. It has been known about for getting on 3 years and it's marked as "important", but seemingly because it's a rare edge case the developers felt it better to leave it unfixed, and users only find out the hard way once their server goes down due to a design pattern my dog would laugh at.

    Your comment is everything that is wrong with FOSS; it's just a small mercy you didn't tell me to fix it myself. After all, I've got nothing better to do than patch obscure bugs in software that runs on the majority of the world's webservers.
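
    What I'm asking for is roughly this - a sketch only, with hypothetical names (CacheCorruptError and the two callables stand in for the Berkeley DB layer):

        import logging

        class CacheCorruptError(Exception):
            # Stand-in for the Berkeley DB "fatal region error".
            pass

        MAX_RETRIES = 3  # any small bound will do; the point is that one exists

        def read_with_recovery(read_cache, rebuild_cache):
            # Bounded retry: try a few times, then rebuild the cache and try
            # once more, instead of logging the same error until the disk fills.
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    return read_cache()
                except CacheCorruptError:
                    logging.error("cache read failed, attempt %d/%d",
                                  attempt, MAX_RETRIES)
            rebuild_cache()  # the one-line fix I had to apply by hand
            return read_cache()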



  • @TheLazyHase said:

    Here the question is: why use an obscure and underused alternative to the mainstream option (mod_gnutls being an alternative to mod_ssl) without budgeting time to iron out bugs?

    http://modgnutls.sourceforge.net/

    It's been running for 2 years without a hitch on our server; the endless-loop condition only occurs when the underlying storage database becomes corrupted. And the reason for using the TLS mod rather than the SSL mod was to allow SSL certificates for multiple virtual domains on the same IP. All the modern browsers support this extension, and it's curious that the SSL group itself hasn't implemented it.

    I'm not sure how we could have budgeted time to iron out a bug that only manifested itself in response to a corruption issue after 2 years of operation ...

     



  • I'm on your side in general, but specifically for:

    @daveime said:

    I'm not sure how we could have budgeted time to iron out a bug that only manifested itself in response to a corruption issue after 2 years of operation ...

     

    It reminds me of a situation I found myself dealing with more than once at one of my previous companies. We fixed a large number of issues during my time there to do with environment development and deployment (it started out with a test and a production environment on the same server, with no version control). I am specifically reminded of a situation where a cronned script I had developed started outputting errors on the production environment (after it was separated from the testing environment) but not on the test system. Since the result was that it generated all its data but did not kick off the next script to generate the rest of the data, we added a cron job to kick off the next script. The next day the first script kicked off its next script as it was designed to, and the cron kicked off that script as well, slowing down the server and corrupting the data.

    It turned out the mass of data on the production environment was causing it to run out of memory (my senior had told me to program it a certain way to make it faster, or at least executable in a reasonable time, and I didn't feel I had the right not to follow my senior's instructions). I'd like to say we modified the design of the script, but my senior simply expanded the amount of memory available. I got told I should be testing things more carefully (by somebody less technical, of course). At least the problem never occurred again.

    It's quite a tangent from your issue, as we had access to our code, and I knew its structure, but it was more typical than it should have been for things not to work on the production environment despite working in the test environment, so we may as well have budgeted time for it.

     

     

    mod: fixed linebreaks –dh



  • @daveime said:

    ... but your tone and implication are that it's somehow MY fault, because I was foolish enough to trust that an Apache module might be stable enough not to go into an endless loop and eat my fucking hard disk, rather than just exit gracefully.

    You seem to believe that computers can think. Guess what, they can't!

    (I was going to just post "whoosh" and leave it at that; you can't seriously believe that Apache is sentient and can notice itself being in an infinite loop.)



  • @esoterik said:

    @daveime said:
    ... but your tone and implication are that it's somehow MY fault, because I was foolish enough to trust that an Apache module might be stable enough not to go into an endless loop and eat my fucking hard disk, rather than just exit gracefully.

    You seem to believe that computers can think. Guess what, they can't!

    (I was going to just post "whoosh" and leave it at that; you can't seriously believe that Apache is sentient and can notice itself being in an infinite loop.)

    I would expect a fatal error to stop execution of the program, rather than write a log entry and retry.



  • @daveime said:

    I'm not sure how we could have budgeted time to iron out a bug that only manifested itself in response to a corruption issue after 2 years of operation ...

    You can't always.

    We discovered a bug in the operating system of a stackable network switch. A stackable network switch is similar to a modular switch, except that each switch can act on its own or participate as part of a unit. The purpose of a stackable network switch is to be able to add additional members, of whatever flavor of port, whenever you want (put two 48-port 10/100 switches together, then add a 24-port gig switch, then a 12-port gig SFP-based switch, then another 48-port 10/100, etc.). When the switch was freshly booted, no problems. But some time down the road (where "some time" was undefined but generally considered to be greater than 1 year), if you removed the cables on the backplane ring to wire in a new member, RPC communication was lost or got screwed up and the whole stack went into a tizzy. Members would reboot, or become unmanageable (they still passed Layer 2 traffic, but CLI [yes Blakey] and SNMP would report that the member didn't exist), or otherwise end up in an unpredictable state.

    Their response?  "We can't replicate the bug, but it's fixed anyway because we rewrote the stack management code for this latest version."



  • @daveime said:

    @TheLazyHase said:

    Here the question is: why use an obscure and underused alternative to the mainstream option (mod_gnutls being an alternative to mod_ssl) without budgeting time to iron out bugs?

    http://modgnutls.sourceforge.net/

    It's been running for 2 years without a hitch on our server; the endless-loop condition only occurs when the underlying storage database becomes corrupted. And the reason for using the TLS mod rather than the SSL mod was to allow SSL certificates for multiple virtual domains on the same IP. All the modern browsers support this extension, and it's curious that the SSL group itself hasn't implemented it.

    I'm not sure how we could have budgeted time to iron out a bug that only manifested itself in response to a corruption issue after 2 years of operation ...

     

    Support for SNI (the feature you're referring to) was implemented in the version of mod_ssl included in Apache 2.2.12, when compiled against OpenSSL 0.9.8j or newer. That was released in July 2009...
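
    For reference, name-based SSL virtual hosts via SNI look something like this in mod_ssl - the hostnames and certificate paths are placeholders:

        # Needs Apache >= 2.2.12 built against OpenSSL >= 0.9.8j.
        NameVirtualHost *:443

        <VirtualHost *:443>
            ServerName www.example.com
            SSLEngine on
            SSLCertificateFile    /etc/ssl/certs/example.com.crt
            SSLCertificateKeyFile /etc/ssl/private/example.com.key
        </VirtualHost>

        <VirtualHost *:443>
            ServerName www.example.org
            SSLEngine on
            SSLCertificateFile    /etc/ssl/certs/example.org.crt
            SSLCertificateKeyFile /etc/ssl/private/example.org.key
        </VirtualHost>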

     

