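The most common process-monitoring requirement is bringing a process back up automatically after it dies. Container platforms such as Kubernetes now handle this automatically, but what about processes that still run on physical servers or virtual machines? The answer is a third-party tool such as Monit or Supervisor. In our production environment we use Monit to monitor the Core Logical Service and Supervisor for Codis Dashboard/FE/Proxy. Our experience matches the comparison write-ups found online, which are quoted later in this article, and we recommend Monit over Supervisor for automated service management and monitoring.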
Using Monit Instead of Supervisor for Automated Service Management and Monitoring: A Summary
January 15, 2020 - First draft
Original article - https://wsgzao.github.io/post/monit/
Further reading
NAME
Monit - utility for monitoring services on a Unix system
SYNOPSIS
monit [options] <arguments>
DESCRIPTION
Monit is a utility for managing and monitoring processes, programs, files, directories and filesystems on a Unix system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations. E.g. Monit can start a process if it does not run, restart a process if it does not respond and stop a process if it uses too much resources. You can use Monit to monitor files, directories and filesystems for changes, such as timestamps changes, checksum changes or size changes.
Monit is controlled via an easy to configure control file based on a free-format, token-oriented syntax. Monit logs to syslog or to its own log file and notifies you about error conditions via customisable alert messages. Monit can perform various TCP/IP network checks, protocol checks and can utilise SSL for such checks. Monit provides a HTTP(S) interface and you may use a browser to access the Monit program.
WHAT TO MONITOR?
You can use Monit to monitor daemon processes or similar programs running on localhost. Monit is particularly useful for monitoring daemon processes, such as those started at system boot time. For instance sendmail, sshd, apache and mysql. In contrast to many other monitoring systems, Monit can act if an error situation should occur, e.g.; if sendmail is not running, monit can start sendmail again automatically or if apache is using too many resources (e.g. if a DoS attack is in progress) Monit can stop or restart apache and send you an alert message. Monit can also monitor process characteristics, such as how much memory or cpu cycles a process is using.
You can also use Monit to monitor files, directories and filesystems on localhost. Monit can monitor these items for changes, such as timestamps changes, checksum changes or size changes. This is also useful for security reasons - you can monitor the md5 or sha1 checksum of files that should not change and get an alert or perform an action if they should change.
Monit can monitor network connections to various servers, either on localhost or on remote hosts. TCP, UDP and Unix Domain Sockets are supported. Network test can be performed on a protocol level; Monit has built-in tests for the main Internet protocols, such as HTTP, SMTP etc. Even if a protocol is not supported you can still test the server because you can configure Monit to send any data and test the response from the server.
Monit can be used to test programs or scripts at certain times, much like cron, but in addition, you can test the exit value of a program and perform an action or send an alert if the exit value indicates an error. This means that you can use Monit to perform any type of check you can write a script for.
Finally, Monit can be used to monitor general system resources on localhost such as overall CPU usage, Memory and System Load.
https://mmonit.com/monit/documentation/monit.html
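To make the manual's apache scenario concrete, here is a minimal control-file sketch. The pid-file path, init-script paths and thresholds are assumptions chosen for illustration, not values from the manual excerpt above, so adjust them to your own system.

```
# Hypothetical paths and thresholds; adapt before use.
check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program  = "/etc/init.d/httpd stop"
  # Restart if the process stops answering HTTP on port 80.
  if failed port 80 protocol http then restart
  # Restart if it uses too many resources for several polling cycles in a row.
  if cpu > 80% for 5 cycles then restart
  if total memory usage > 500 MB for 5 cycles then restart
  # Give up and stop monitoring if restarts keep failing.
  if 5 restarts within 5 cycles then unmonitor
```

With a stanza like this, Monit also starts the process through the start program whenever the pid file does not point at a running process.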
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It shares some of the same goals of programs like launchd, daemontools, and runit. Unlike some of these programs, it is not meant to be run as a substitute for init as “process id 1”. Instead it is meant to be used to control processes related to a project or a customer, and is meant to start like any other program at boot time.
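Besides Monit there are other third-party monitoring options (e.g. Supervisor). Our reasons for choosing Monit are:
Pros
Cons
Pros
Cons
On balance, monit looks like the more generally applicable choice.
That said, this prompted a bold idea: use supervisor to manage multiple processes inside a container, with monit itself running as a supervised process under supervisor. Programs that cannot run in the foreground could then be monitored through monit, while services that are sensitive to interruption would sit directly under supervisor. It looks like a good approach and is worth trying sometime, haha.
Judging from how they actually behave inside containers, monit frequently runs into assorted unexplained failures, while supervisor is very stable.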
# monit -h
Usage: monit [options]+ [command]
Options are as follows:
-c file Use this control file
-d n Run as a daemon once per n seconds
-g name Set group name for monit commands
-l logfile Print log information to this file
-p pidfile Use this lock file in daemon mode
-s statefile Set the file monit should write state information to
-I Do not run in background (needed when run from init)
--id Print Monit's unique ID
--resetid Reset Monit's unique ID. Use with caution
-B Batch command line mode (do not output tables or colors)
-t Run syntax check for the control file
-v Verbose mode, work noisy (diagnostic output)
-vv Very verbose mode, same as -v plus log stacktrace on error
-H [filename] Print SHA1 and MD5 hashes of the file or of stdin if the
filename is omited; monit will exit afterwards
-V Print version number and patchlevel
-h Print this text
Optional commands are as follows:
start all - Start all services
start <name> - Only start the named service
stop all - Stop all services
stop <name> - Stop the named service
restart all - Stop and start all services
restart <name> - Only restart the named service
monitor all - Enable monitoring of all services
monitor <name> - Only enable monitoring of the named service
unmonitor all - Disable monitoring of all services
unmonitor <name> - Only disable monitoring of the named service
reload - Reinitialize monit
status [name] - Print full status information for service(s)
summary [name] - Print short status information for service(s)
report [up|down|..] - Report state of services. See manual for options
quit - Kill the monit daemon process
validate - Check all services and start if not running
procmatch <pattern> - Test process matching pattern
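Getting Monit to work reliably for us has a very low learning curve: you only need to learn a handful of Monit commands and the control-file syntax.
# options
- monit
- monit -t # run a syntax check on the control file
- monit -c /var/monit/monitrc # use the specified control file
- monit -g <groupname> start/stop # Monit lets you group checks; use this to act on a whole group at once
# arguments
- monit reload # reinitialize Monit
- monit quit # kill the Monit daemon process
- monit start/stop/restart/monitor/unmonitor <name>/all # <name>: every check has a unique name, described later; all: every monitored service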
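The -g option acts on services that carry the same group statement in the control file. A minimal sketch, in which the service names, paths and the group name app are all hypothetical:

```
# Two services placed in the same (hypothetical) group "app".
check process app-web with pidfile /run/app/web.pid
  group app
  start program = "/etc/init.d/app-web start"
  stop program  = "/etc/init.d/app-web stop"

check process app-worker with pidfile /run/app/worker.pid
  group app
  start program = "/etc/init.d/app-worker start"
  stop program  = "/etc/init.d/app-worker stop"
```

Under these assumptions, monit -g app stop would stop both services at once, and running monit -t before monit reload confirms that the file still parses.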
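Detailed configuration: there are nine check types in total, and every one of them follows the rules below.
- Process
CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
<path> Absolute path to the pid file. If the pid file does not exist, or no running process corresponds to it, Monit runs the start method.
<regex> A regular expression matched against the process name. You can test the pattern from the command line: monit procmatch "regex-pattern"
- File
CHECK FILE <unique name> PATH <path>
<path> Absolute path to the file.
- Fifo
CHECK FIFO <unique name> PATH <path>
<path> Absolute path to the fifo.
- Filesystem
CHECK FILESYSTEM <unique name> PATH <path>
<path> A device/disk, a mount-point path, or an NFS/CIFS/FUSE connection string. If the filesystem is unavailable, Monit runs the start method.
- Directory
CHECK DIRECTORY <unique name> PATH <path>
<path> Absolute path to the directory.
- Remote host
CHECK HOST <unique name> ADDRESS <host>
<host> A hostname or an IP address, e.g. "tildeslash.com" or "64.87.72.95".
- System
CHECK SYSTEM <unique name>
<unique name> Usually the local host name (you can use $HOST), though any other name works. It is used in mail alerts and as the initial name in M/Monit.
This check type monitors system resources (CPU, memory, load average...)
- Program
CHECK PROGRAM <unique name> PATH <executable file> [TIMEOUT <number> SECONDS]
<executable file> Absolute path to an executable program or script, whose exit status can then be tested. If the program does not finish within <number> seconds, Monit terminates it. The default is 300 s.
The program's output is recorded for use in the user interface and in alerts, 512 bytes by default (adjustable via set limits).
- Network
CHECK NETWORK <unique name> <ADDRESS <ipaddress> | INTERFACE <name>>
<ipaddress> The IPv4/IPv6 address of the monitored network interface. An interface name such as eth0 also works.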
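As a rough illustration of how these declarations look once tests are attached, the following sketch pairs several of the check types with common tests from the Monit manual. Every service name, path, address and threshold here is a placeholder rather than something taken from our setup.

```
# File: alert when a file that should not change does change.
check file sshd_config with path /etc/ssh/sshd_config
  if changed checksum then alert

# Filesystem: alert when the root filesystem fills up.
check filesystem rootfs with path /
  if space usage > 80% then alert

# Remote host: a ping test plus a protocol-level test.
check host example-site with address example.com
  if failed ping then alert
  if failed port 443 protocol https then alert

# System: overall load, memory and CPU on the local machine.
check system $HOST
  if loadavg (1min) > 4 then alert
  if memory usage > 75% then alert
  if cpu usage > 90% for 5 cycles then alert

# Program: act on the exit status of a script (hypothetical path).
check program backup-check with path /usr/local/bin/backup-check.sh with timeout 60 seconds
  if status != 0 then alert
```

Each stanza starts with one of the CHECK declarations listed above, and the if lines underneath decide what Monit does when a test fails.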
For more configuration details, see the official Monit documentation and examples
https://mmonit.com/documentation/
https://mmonit.com/wiki/Monit/ConfigurationExamples
# Create the common configuration: logging and mail alerts
vim basic.j2
# log to monit.log
set logfile /var/log/monit.log
set daemon {{ monit_poll_interval }}
set eventqueue basedir /var/lib/monit/events slots 5000
set mailserver smtp.xxx.com port 465
set alert xxx@xxx.com { nonexist, timeout, resource }
set mail-format {
from: xxx@xxx.com
subject: monit alert -- $SERVICE $EVENT at $DATE
message: $EVENT Service $SERVICE
Date: $DATE
Action: $ACTION
Host: $HOST
Description: $DESCRIPTION
Your faithful employee,
Monit
}
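One thing the template above does not show is Monit's embedded HTTP interface. In recent Monit 5.x releases, CLI commands such as monit status and monit summary talk to the running daemon through that interface, so it generally needs to be enabled somewhere in the control file. A minimal sketch, assuming access from localhost only; 2812 is Monit's conventional port and the credentials are placeholders:

```
set httpd port 2812 and
    use address localhost   # listen on the loopback interface only
    allow localhost         # accept connections from localhost
    allow admin:monit       # placeholder user:password, change before use
```

If the interface is already configured in another included file, this fragment is unnecessary.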
# Create monitoring for standard applications
vim daemon_set.j2
check process xxx with pidfile /run/xxx/daemon.pid
start program = "/usr/bin/python2 /bin/xxx restart"
stop program = "/usr/bin/python2 /bin/xxx stop"
if 10 restarts within 10 cycles then unmonitor
check process xxxx with matching xxxx
start program = "/etc/init.d/xxxx start"
stop program = "/etc/init.d/xxxx stop"
if 10 restarts within 10 cycles then unmonitor
# Create monitoring for non-standard applications
vim logic_service.j2
check process {{ service_name }} with pidfile {{ root_dir }}/{{ service_name }}/deploy/{{ monit_name }}.pid
start program = "/bin/bash -c 'cd {{ root_dir }}/{{ service_name }}/deploy && ./start.sh &>start.log '"
stop program = "/bin/bash -c 'cd {{ root_dir }}/{{ service_name }}/deploy && ./stop.sh &>stop.log '"
if 5 restarts within 15 cycles then unmonitor
{% if memory_usage is defined %}
if total memory usage > {{ memory_usage }} for 10 cycles then restart
{% endif %}
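For illustration, suppose the template is rendered with service_name and monit_name both set to order_service, root_dir set to /data/app, and memory_usage set to 2 GB. These values are hypothetical and not taken from the production setup described here. The resulting stanza would read roughly as follows:

```
# Rendered from logic_service.j2 with hypothetical variable values.
check process order_service with pidfile /data/app/order_service/deploy/order_service.pid
  start program = "/bin/bash -c 'cd /data/app/order_service/deploy && ./start.sh &>start.log '"
  stop program = "/bin/bash -c 'cd /data/app/order_service/deploy && ./stop.sh &>stop.log '"
  if 5 restarts within 15 cycles then unmonitor
  if total memory usage > 2 GB for 10 cycles then restart
```

When memory_usage is not defined, the last line is simply omitted and only the restart limit applies.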