Why does MySQL keep crashing?

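I run a Django site on a Linode with 1G of RAM: http://readfree.me/
The database is MySQL 5.5 with InnoDB, the default storage engine since 5.5. There are usually no more than 20 users online at the same time.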

Every day I receive dozens of error emails from Django, all reporting failed database queries:
......
OperationalError: (2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)")

These are all ordinary queries, though. They have never failed when debugging locally; they only fail occasionally on the server.

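When I browse the site myself, I also hit a 500 error from time to time, but a refresh usually fixes it. Right afterwards I receive an email like the one above.
Occasionally the site goes down and does not come back. When I ssh into the server, I find that the mysql service has stopped, and starting it again fixes things.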

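I was puzzled: the database seemed to be running fine and traffic is low, so why would it drop out from time to time?
I looked at MySQL's error.log and found that the database is crashing frequently. See the end of this post.
Most of the time it recovers automatically, but sometimes memory allocation fails during recovery, mysql dies, and the site goes down with it.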

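Recently I have adjusted the parameters in my.cnf over and over, but the problem has never been fully resolved.
Has anyone run into a similar problem? Could you offer some ideas?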

==================MySQL error.log=====================
140109 11:26:10 InnoDB: The InnoDB memory heap is disabled
140109 11:26:10 InnoDB: Mutexes and rw_locks use GCC atomic builtins
140109 11:26:10 InnoDB: Compressed tables use zlib 1.2.3.4
140109 11:26:10 InnoDB: Initializing buffer pool, size = 180.0M
140109 11:26:10 InnoDB: Completed initialization of buffer pool
140109 11:26:10 InnoDB: highest supported file format is Barracuda.
InnoDB: Log scan progressed past the checkpoint lsn 3846798084
140109 11:26:10 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Warning: database page corruption or a failed
InnoDB: file read of space 0 page 15686.
InnoDB: Trying to recover it from the doublewrite buffer.
InnoDB: Recovered the page from the doublewrite buffer.
InnoDB: Warning: database page corruption or a failed
InnoDB: file read of space 0 page 49.
InnoDB: Trying to recover it from the doublewrite buffer.
InnoDB: Recovered the page from the doublewrite buffer.
InnoDB: Doing recovery: scanned up to log sequence number 3846799817
140109 11:26:13 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
140109 11:26:13 InnoDB: Waiting for the background threads to start
140109 11:26:14 InnoDB: 5.5.34 started; log sequence number 3846799817
140109 11:26:14 [Note] Server hostname (bind-address): '127.0.0.1'; port: 3306
140109 11:26:14 [Note] - '127.0.0.1' resolves to '127.0.0.1';
140109 11:26:14 [Note] Server socket created on IP: '127.0.0.1'.
140109 11:26:15 [Note] Event Scheduler: Loaded 0 events
140109 11:26:15 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.34-0ubuntu0.12.04.1-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 (Ubuntu)
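
To see how often the server actually goes through crash recovery, the error.log can be scanned for the "Database was not shut down normally!" line shown above. The following is only a minimal sketch: the function name is made up for illustration, and it assumes the MySQL 5.5 timestamp layout seen in this log (YYMMDD followed by H:MM:SS).

```python
import re
from collections import Counter

# Marker line that InnoDB writes when it starts up after an unclean shutdown.
CRASH_MARKER = "Database was not shut down normally!"
# Assumed MySQL 5.5 timestamp prefix: YYMMDD, whitespace, H:MM:SS.
STAMP = re.compile(r"^(\d{6})\s+\d{1,2}:\d{2}:\d{2}\s")


def crash_recoveries_per_day(log_text):
    """Count crash-recovery startups per day (keyed by the YYMMDD prefix)."""
    counts = Counter()
    for line in log_text.splitlines():
        if CRASH_MARKER not in line:
            continue
        match = STAMP.match(line)
        day = match.group(1) if match else "unknown"
        counts[day] += 1
    return dict(counts)


# A few lines copied from the log above, used as sample input.
sample = """140109 11:26:10 InnoDB: Initializing buffer pool, size = 180.0M
140109 11:26:10 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
140109 11:26:14 [Note] /usr/sbin/mysqld: ready for connections.
"""
print(crash_recoveries_per_day(sample))
```

On the server, the same function could be fed the contents of the real error.log to see whether the crashes cluster at particular times.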

cloud107202

2014-01-09 14:17:51 +08:00

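I ran into something similar. I don't know MySQL very well, but here is a summary of the steps I took at the time, in the hope that it helps the OP:
Environment: a VPS with 512M of RAM, MySQL 5.6, running WordPress
Problem: MySQL would start, but the database crashed as soon as WordPress was accessed
Attempt 1: changed innodb_buffer_pool_size from 128M to 5M. WordPress became accessible, but it still went down after a short while
Attempt 2:
Adjusted the following parameters to: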
performance_schema_max_table_instances=600
table_definition_cache=400
table_open_cache=256
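After this, MySQL used only 40–60M of memory after startup
Below are the 5.6 default settings, which take up at least 400M of memory and may have been the cause of the crashes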
performance_schema_max_table_instances 12500
table_definition_cache 1400
table_open_cache 2000
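Attempt 3: Attempt 2 solved the problem. But a month later, as the number of posts grew, the database started crashing again. I upgraded the VPS memory to 1G on the fly, and the problem went away for good

All in all, I think insufficient memory is the most likely main cause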

guoqiao

2014-01-09 14:55:59 +08:00

@timchou
=========================
Full error.log:
https://dl.dropboxusercontent.com/u/55214241/error.log
Note: this Linode also hosts a Discuz forum that I run for a friend. The log shows that tables belonging to ultrax have crashed:
140109 8:24:35 [ERROR] /usr/sbin/mysqld: Table './ultrax/pre_forum_promotion' is marked as crashed and should be repaired

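At first I suspected this Discuz forum was the cause. But I shut it down and watched for a while, and the database still kept crashing, so it probably has little to do with it. Besides, I was already getting error emails regularly before the Discuz forum existed.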
=========================
Full my.cnf:
https://dl.dropboxusercontent.com/u/55214241/my.cnf
=========================
htop screenshot of memory usage:
https://dl.dropboxusercontent.com/u/55214241/htop.png

guoqiao

2014-01-09 19:14:14 +08:00

@mahone3297
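1. In htop there are more than 20 mysql processes; I don't really understand which parameter determines this either
2. The reason innodb_buffer_pool_size is set this high is that I ran the mysqltuner tool against my configuration, with the following result: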
-------- Performance Metrics -------------------------------------------------
[--] Up for: 27m 23s (19K q [11.997 qps], 941 conn, TX: 48M, RX: 4M)
[--] Reads / Writes: 96% / 4%
[--] Total buffers: 344.0M global + 2.7M per thread (20 max threads)
[OK] Maximum possible memory usage: 397.8M (40% of installed RAM)
[OK] Slow queries: 0% (11/19K)
[OK] Highest usage of available connections: 35% (7/20)
[OK] Key buffer size / total MyISAM indexes: 64.0M/1.5M
[OK] Key buffer hit rate: 100.0% (456K cached / 7 reads)
[OK] Query cache efficiency: 40.7% (6K cached / 15K selects)
[OK] Query cache prunes per day: 0
[OK] Sorts requiring temporary tables: 0% (0 temp sorts / 1K sorts)
[OK] Temporary tables created on disk: 23% (191 on disk / 827 total)
[OK] Thread cache hit rate: 99% (7 created / 941 connections)
[OK] Table cache hit rate: 25% (450 open / 1K opened)
[OK] Open file limit used: 62% (641/1K)
[OK] Table locks acquired immediately: 100% (10K immediate / 10K locks)
[OK] InnoDB data size / buffer pool: 146.0M/180.0M

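The last line there reports that my database size is 146M, and my buffer size has to be larger than that, so I set it to 180M. If I set it to less than 146M, then after the database crashes, memory allocation fails and it won't come back up.
3. The other parameters were also changed following mysqltuner's suggestions.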
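
As a rough check of the mysqltuner figures quoted above, the "Maximum possible memory usage" line appears to be the global buffers plus the per-thread buffers multiplied by the maximum number of threads. That formula is a reading of the report's own "Total buffers" line, not something confirmed in this thread, and the function name below is made up for illustration. The sketch also assumes 1G = 1024M of installed RAM.

```python
def max_possible_memory_mb(global_buffers_mb, per_thread_mb, max_threads):
    """Upper bound on MySQL's own buffer memory, in megabytes.

    Assumed formula: global buffers + per-thread buffers * max threads.
    """
    return global_buffers_mb + per_thread_mb * max_threads


# Figures taken from the report: 344.0M global + 2.7M per thread (20 max threads).
total = max_possible_memory_mb(344.0, 2.7, 20)
print(round(total, 1), "MB")

# Share of a nominal 1024M machine; the report says 40% of installed RAM.
print(round(100 * total / 1024, 1), "% of 1024M")
```

This gives roughly 398M, close to the 397.8M in the report; the small gap is presumably rounding in the displayed figures. The bound covers only MySQL's own buffers, not anything else running on the same machine.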

mahone3297

2014-01-09 20:23:14 +08:00

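@guoqiao "htop shows more than 20 processes, but they don't each use that much memory; it's about 12% in total."
1. How can you tell that it's 12% in total?
I think it's 10.9% each, so 20 of them would be 200%...
Could you post the output of free -m?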

2. Why do I see only one mysqld process on my server, while you have so many? Why isn't anyone questioning this? Does everyone have that many processes?

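3. As for the MySQL configuration parameters, it sounds like you set all of them based on the mysqltuner tool. Does this tool take your other circumstances into account (for example, that you are running Discuz)? Does its analysis assume that the server runs nothing but MySQL (when in fact you run other things too, such as a web server)?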

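4. "The last line there reports that my database size is 146M, and my buffer size has to be larger than that, so I set it to 180M. If I set it to less than 146M, then after the database crashes, memory allocation fails and it won't come back up."
By your logic, the buffer pool must be larger than the database... So if I had a 10G database, would I have to set the buffer pool to 10G?
PS: In my production environment the buffer pool is only 128M, and my DB size is certainly larger than 128M (in fact it's over 10G). Of course, I'm not saying my buffer pool is configured correctly, since everything I read online says it should be set fairly large. But at least it runs fine here without problems, or at least without crashing. Then again, I don't have highly concurrent queries here either.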

The above is for reference only...
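
On questions 1 and 2, one possible explanation (not confirmed in this thread) is that htop lists threads as separate rows by default, so a single multi-threaded mysqld would show up as 20-odd rows that all report the same shared memory rather than 20 separate copies of it. On Linux this can be checked from /proc/<pid>/status, which has a Threads field and a VmRSS field. Below is a minimal sketch of parsing that file; the function name and the sample values are made up for illustration and are not taken from the OP's server.

```python
def parse_proc_status(text):
    """Return (thread_count, resident_kb) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    threads = int(fields.get("Threads", "0"))
    # VmRSS looks like "111616 kB"; keep only the number.
    rss_kb = int(fields.get("VmRSS", "0 kB").split()[0])
    return threads, rss_kb


# Hypothetical sample; real values come from the mysqld process on the server.
sample = "Name:\tmysqld\nVmRSS:\t  111616 kB\nThreads:\t24\n"
threads, rss_kb = parse_proc_status(sample)
print(threads, "threads sharing", rss_kb // 1024, "MB resident")
```

If the Threads value for the one mysqld PID roughly matches the number of rows in htop, the per-row memory percentages should not be added together.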