AWS's post-mortem report~

2011-05-02 00:16:05 +08:00
 xatest
http://aws.amazon.com/message/65648/

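This serious AWS outage affected Quora, Foursquare, and many other customers. Has anyone read the official post-mortem?
It took me two hours to get through it. The analysis of how the incident unfolded is gripping!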
5617 views
Node: Amazon Web Services
6 replies
MarkFull
2011-05-02 06:11:40 +08:00
ry_wang
2011-05-02 11:24:24 +08:00
Why couldn't they roll back promptly once they noticed the anomaly? That's what I'm curious about.
xatest
2011-05-02 11:48:30 +08:00
@ry_wang Their flawed rollback strategy caused an avalanche, which triggered errors on a much wider scale~
ry_wang
2011-05-02 12:10:05 +08:00
@xatest Oh? Is there a more detailed explanation? I can't make sense of the official document.
xatest
2011-05-02 14:52:17 +08:00
@ry_wang That's not detailed enough? It's 5700 words~
Sidney
2011-05-02 22:02:25 +08:00
Reading it made my heart pound.

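Two takeaways:
1. Manual operations are error-prone. I once ran into a case where hosting staff made a mistake while configuring a firewall; the firewall hung and the website became unreachable.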
The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.

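2. The consequences of the rollback were underestimated, which set off a cascading avalanche. Unless you understand a system's modules and the interactions between them well enough, you may well step on a landmine.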
Incorrect traffic shift was rolled back and network connectivity was restored. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica.

https://www.v2ex.com/t/12413
