Effective logging management system & policy?

2011-11-10 12:16:43 +08:00
 zhuang
I've long been dreaming about an effective system & policy for log management. Ideally, it should be *usable under extreme conditions.

Yes, I'm talking about USABILITY. I guess most system administrators, experienced or newbie, have failed at disaster recovery at least once, even with thousands of backups at hand. Backups are important, but what often goes missing is the procedure to actually recover.

When it comes to logging systems or policies, the question becomes: are you ready for the crime scene investigation? Unfortunately, this is not a joke to me. I often see myself as a detective, or sometimes a firefighter. Imagine such a scenario: a server is down and it's not maintained by you. Now it's your turn to find out what happened and to make it right by any means. You'd better be fast.

So you grab a copy of the log files, expecting to find some obvious clues. I have to admit I'd take a deep breath before diving into such a deep sea of information. Wait a moment, you think you've got all the available log files? You are too naive. Unix and Linux systems, whether commercial or free distributions, vary widely in where and how they log.

Take a typical Linux-based web server as an example. You may first check the log rotation configs and estimate the time window in which the exceptions occurred. The syslog is the general one, but it is far from enough. The web server and database daemons have their own logs. Once you start digging into the problem, you may need network-related logs as well, say iptables and friends. If nothing seems weird, you may take the package management system into consideration. Sometimes account auditing will force you to check the su/secure/auth logs. In a fatal condition like a hacker invasion, these logs are probably no longer reliable and you first have to make sure no rootkit exists. By the way, if the machine is unluckily kernel-hardened, all of this work may take three times as long or more before you can get close to your target.
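
Just to make that concrete, here is roughly where I would start on a typical Debian- or Red Hat-style box; the exact paths are assumptions and differ between distributions:

  # Where and how often are logs rotated?
  cat /etc/logrotate.conf
  ls /etc/logrotate.d/

  # General system messages (the location depends on the distribution)
  less /var/log/syslog      # Debian/Ubuntu
  less /var/log/messages    # Red Hat/CentOS

  # Daemon-specific and security-related logs
  less /var/log/apache2/error.log   # or /var/log/httpd/error_log
  less /var/log/mysql/error.log
  less /var/log/auth.log            # or /var/log/secure

  # Kernel/netfilter messages (iptables LOG rules usually end up here)
  dmesg | grep 'IN=.*OUT='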

Remember I mentioned *alienation? Some developers have tried hard to keep the management work easy and clear, so applause to the Gentoo community. Commercial vendors could do better; Mac OS X seems to consolidate log information system-wide. But I still have complaints. To Solaris: why the hell are there 30+ directories under /var/log/? To HP: can you explain your philosophy? If your logging system is defined by roles like admins/users, who the hell is the network user named nettl? Could a log filename be uglier than nettl.LOG000? And to AIX: has your proprietary implementation brought you any business success?

Feel free to blame me for the dirty words; I actually tried hard to stay calm. This kind of additional work f*cks me over so often, and there is no pleasure in it at all.

Now we are just about to read the logs, but usually several hours have already passed. As far as I can tell, cat/grep/tail are among the most powerful tools for log analysis, especially if you are familiar with regular expressions. When troubleshooting, no visual solution, say a web search engine hooked up to a log database, can give you more detail than that.
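
A typical session looks roughly like this; the file name, the date and the patterns are of course made up for illustration:

  # Follow the log live while reproducing the problem
  tail -f /var/log/syslog

  # Pull out a time window, then narrow it down with a pattern
  grep -E 'Nov 10 (0[89]|1[0-2]):' /var/log/syslog | grep -iE 'error|fail|denied'

  # Count which daemons are the noisiest (the 5th field is the process tag)
  awk '{print $5}' /var/log/syslog | sort | uniq -c | sort -rn | head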

If you happen to have some knowledge of software development, you must know that end users rarely understand what the error messages mean. Neither do system administrators. A more common case is like this: you sorted the logs by level and some FATAL ones appeared to be interesting, but 30 minutes of research turned out to be a waste of time, because it was either a segmentation fault or an out-of-memory failure. Of course, profiling a web application is another topic entirely.
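
Sorting by level is usually nothing more than a grep anyway; the level names and the file name app.log below are assumptions about a conventional format:

  # Count entries per level, then look at the FATAL ones first
  grep -oE 'DEBUG|INFO|WARN|ERROR|FATAL' app.log | sort | uniq -c
  grep 'FATAL' app.log | tail -n 20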

Believe me, this is not the worst case. Some logs are simply unreadable, since they were not written for system administrators. Among the readable lines, finding what is really useful is something of a word-guessing game. A log file is typically rotated at 500KB or once per week, so reading it through is mission impossible. All I can do is try different keyword combinations; if I'm lucky enough, there will be some hints. (Web application coders may understand this well: if someone used automated SQL-injection scripts and broke into the system, you probably had to read every HTTP request to locate your bug.)
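
To show what that keyword-guessing game looks like in the SQL-injection case: the access log path, the IP and the patterns below are pure assumptions, and real attacks are usually encoded anyway:

  # Look for suspicious query strings in the web server access log
  grep -iE 'union.*select|information_schema|sleep\(|benchmark\(' /var/log/apache2/access.log

  # Once a suspect client shows up, read everything it sent
  grep '^203.0.113.42 ' /var/log/apache2/access.log | less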

Here is my story about why logging systems may fail: they do log well, but they are not handy enough to reproduce the crime scene. I wonder if you have any advice or solutions. Thank you.


P.S. I originally posted this article on my mailing list. I will post a Chinese summary later when I get to my PC.