Why does MySQL keep crashing?

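I'm running a Django site on a Linode with 1 GB of RAM: http://readfree.me/
The database is MySQL 5.5 with InnoDB, the default storage engine since 5.5. There are usually no more than 20 users online at the same time.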

Every day I get dozens of error emails from Django, all of them about failed database queries:
......
OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)")

But these are all ordinary queries. They have never failed when debugging locally; they only fail occasionally on the server.

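When I browse the site myself, I also hit a 500 error now and then, but a refresh usually fixes it. Right afterwards I receive the email above.
Occasionally the site goes down and stays down. When I ssh into the server, I find the mysql service has stopped, and starting it again fixes things.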

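I was puzzled: the database seemed to be running fine and traffic is low, so why would it keep dropping out?
I looked at MySQL's error.log and found that the database has been crashing frequently. See the end of this post.
Most of the time it recovers automatically, but sometimes memory allocation fails during recovery, mysql dies, and the site goes down with it.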

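Recently I've been adjusting the parameters in my.cnf over and over, but the problem has never been fully resolved.
Has anyone run into something similar? Any ideas would be appreciated.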

==================MySQL error.log=====================
140109 11:26:10 InnoDB: The InnoDB memory heap is disabled
140109 11:26:10 InnoDB: Mutexes and rw_locks use GCC atomic builtins
140109 11:26:10 InnoDB: Compressed tables use zlib 1.2.3.4
140109 11:26:10 InnoDB: Initializing buffer pool, size = 180.0M
140109 11:26:10 InnoDB: Completed initialization of buffer pool
140109 11:26:10 InnoDB: highest supported file format is Barracuda.
InnoDB: Log scan progressed past the checkpoint lsn 3846798084
140109 11:26:10 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Warning: database page corruption or a failed
InnoDB: file read of space 0 page 15686.
InnoDB: Trying to recover it from the doublewrite buffer.
InnoDB: Recovered the page from the doublewrite buffer.
InnoDB: Warning: database page corruption or a failed
InnoDB: file read of space 0 page 49.
InnoDB: Trying to recover it from the doublewrite buffer.
InnoDB: Recovered the page from the doublewrite buffer.
InnoDB: Doing recovery: scanned up to log sequence number 3846799817
140109 11:26:13 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
140109 11:26:13 InnoDB: Waiting for the background threads to start
140109 11:26:14 InnoDB: 5.5.34 started; log sequence number 3846799817
140109 11:26:14 [Note] Server hostname (bind-address): '127.0.0.1'; port: 3306
140109 11:26:14 [Note] - '127.0.0.1' resolves to '127.0.0.1';
140109 11:26:14 [Note] Server socket created on IP: '127.0.0.1'.
140109 11:26:15 [Note] Event Scheduler: Loaded 0 events
140109 11:26:15 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.34-0ubuntu0.12.04.1-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 (Ubuntu)

arbeitandy

2014-01-10 07:40:37 +08:00

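* Judging from the error log, this isn't a problem with mysql itself. It looks like the system couldn't allocate enough memory and the OOM killer killed the mysql process. You can check the system log (syslog).
See http://dba.stackexchange.com/questions/25077/mysql-innodb-crash-post-mortem
At the time of the crash, some other process may have quickly taken up far more memory than planned (python and php, for example, are both potential memory hogs, especially if file IO is also involved).
* I'd also suggest running the mysqltuner test only after the server has been up for a while.
"Up for: 27m 23s": at that point the cache may not have warmed up yet, so the hit rate will look low. Going by the numbers alone, though, this my.cnf doesn't seem to be affected by that.
* http://dba.stackexchange.com/questions/25165/intermittent-mysql-crashes-with-error-fatal-error-cannot-allocate-memory-for-t
This one is a fairly long and thorough discussion of checking the mysql configuration on a low-spec machine.
* If traffic isn't going to increase in the short term and there are no slow queries, max_connections in my.cnf can be lowered a bit more.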
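
To check the syslog for signs of the OOM killer, something like the following might help. It is only a rough sketch: the message patterns are my assumptions about what the kernel typically logs (the wording differs between kernel versions), and `find_oom_events` is just an illustrative name.

```python
import re

# Assumed kernel log wordings; adjust these to what your kernel actually prints.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
    re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
]


def find_oom_events(lines):
    """Return (process_name_or_None, line) for every line that looks OOM-related."""
    events = []
    for line in lines:
        for pattern in OOM_PATTERNS:
            match = pattern.search(line)
            if match:
                # The "invoked oom-killer" pattern has no named groups,
                # so the process name is None for those lines.
                name = match.groupdict().get("name")
                events.append((name, line.rstrip("\n")))
                break
    return events
```

On Ubuntu you could feed it the system log, e.g. `find_oom_events(open("/var/log/syslog", errors="replace"))`, and see whether any hits name mysqld and whether their timestamps fall just before the "Database was not shut down normally!" entries in error.log.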

guoqiao

2014-01-10 16:36:46 +08:00

@arbeitandy @VYSE
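Thanks to you both, your analysis led me to the problem.
My site has a few real-time statistics pages that need to run join queries over a large amount of data, and every query is very slow.
Because they are so sluggish, I rarely use them and never paid them much attention.
This morning I took a deliberate look: every time I click a statistics page, one of the mysql processes shoots up to over 100% CPU and the whole server becomes sluggish. That is most likely what causes mysql to crash.
I removed the statistics pages and watched for a day; so far I haven't received a single mysql error.
Thinking back, because these pages are so slow I rarely clicked them myself and had forgotten about them. I imagine users who clicked once didn't want to click a second time either. But whenever a new user comes in, they may casually click one without thinking, and that crashes mysql. Given the number of users on my site and the number of error emails I get per day, this hypothesis roughly fits.
Of course, I haven't observed for long enough to draw a firm conclusion, but it's probably about right.
This problem had me doubting mysql's performance for a while. Now it looks like the problem was my own badly designed queries.
I'm embarrassed.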
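
If the statistics pages come back at some point, one option would be to stop running the heavy query on every click and cache its result for a while instead. Below is a minimal sketch of that idea in plain Python. `cached_for` and `compute_site_stats` are made-up names for illustration, not anything from my actual code, and Django's own cache framework could probably do the same job.

```python
import time
from functools import wraps


def cached_for(seconds, clock=time.monotonic):
    """Cache the result of a zero-argument function for `seconds` seconds."""
    def decorator(func):
        state = {"expires": None, "value": None}

        @wraps(func)
        def wrapper():
            now = clock()
            # Recompute only on the first call or once the cached value expired.
            if state["expires"] is None or now >= state["expires"]:
                state["value"] = func()
                state["expires"] = now + seconds
            return state["value"]

        return wrapper
    return decorator


@cached_for(600)
def compute_site_stats():
    # Hypothetical placeholder: the expensive join query would run here.
    return {"note": "replace with the real statistics query"}
```

With this, the expensive query runs at most once every ten minutes per process no matter how many visitors click the page. The sketch is not thread-safe and keeps one cache per worker process, so it limits the load rather than eliminating it.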