[Help] nginx server becomes slow once concurrency reaches 300, with responses taking up to 8 seconds

Server OS: CentOS
Server specs: Alibaba Cloud instance type ecs.c5.large, 2 cores, 4 GB RAM
Server environment: nginx 1.14.1 + PHP 5.6.36 + MySQL 5.7.22

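Problem: at a concurrency of 300, response times become very long for reasons I cannot identify, reaching 8 seconds and sometimes more than 10 seconds.

Test setup: I used Apache's ab load-testing tool against one page of the site. The page does nothing except have PHP return a JSON string. The command was .\abs -c 300 -n 500 URL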

Test results
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256

Document Path: /
Document Length: 166 bytes

Concurrency Level: 300
Time taken for tests: 12.325 seconds
Complete requests: 500
Failed requests: 257
(Connect: 0, Receive: 0, Length: 257, Exceptions: 0)
Non-2xx responses: 243
Total transferred: 124917 bytes
HTML transferred: 47791 bytes
Requests per second: 40.57 [#/sec] (mean)
Time per request: 7395.060 [ms] (mean)
Time per request: 24.650 [ms] (mean, across all concurrent requests)
Transfer rate: 9.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      557 1737  783.3   1517    3500
Processing:    30 2514 2053.8   1862    6267
Waiting:       28 2360 2152.5   1711    6267
Total:       1248 4251 2325.3   5031    8428

Percentage of the requests served within a certain time (ms)
50% 5031
66% 5784
75% 6284
80% 6574
90% 7450
95% 7755
98% 8107
99% 8338
100% 8428 (longest request)
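
A quick way to sanity-check the summary above is to recompute ab's headline numbers from its raw counters. This is only a minimal Python sketch: `ab_summary` is a hypothetical helper name, and the formulas are the ones implied by ab's own labels (mean time per request = concurrency * time taken / completed requests).

```python
def ab_summary(concurrency, completed, seconds, non_2xx=0):
    """Recompute ab's headline figures from the counters it prints."""
    return {
        # Completed requests divided by wall-clock time.
        "requests_per_second": round(completed / seconds, 2),
        # Mean latency as seen by one of the concurrent clients.
        "time_per_request_ms": round(concurrency * seconds / completed * 1000, 1),
        # Mean wall-clock time per request across all clients.
        "time_per_request_across_all_ms": round(seconds / completed * 1000, 2),
        # Share of responses whose status was not 2xx.
        "non_2xx_share": round(non_2xx / completed, 3),
    }


# Counters copied from the run above.
first_run = ab_summary(concurrency=300, completed=500, seconds=12.325, non_2xx=243)
for name, value in first_run.items():
    print(f"{name}: {value}")
```

The recomputed values agree with the report (about 40.57 requests/s and about 7395 ms per request). Note that 243 of the 500 responses were non-2xx, so nearly half of the measured requests were error responses rather than the JSON page, which suggests the latency figures mix successful and failed requests.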

Configuration:
nginx.conf

user www www;

worker_processes auto;

error_log /home/wwwlogs/nginx_error.log crit;

pid /usr/local/nginx/logs/nginx.pid;

#Specifies the value for maximum file descriptors that can be opened by this process.
worker_rlimit_nofile 51200;

events
{
    use epoll;
    worker_connections 51200;
    accept_mutex off;
    multi_accept on;
}

http
{
    include mime.types;
    default_type application/octet-stream;

    server_names_hash_bucket_size 128;
    client_header_buffer_size 32k;
    large_client_header_buffers 4 32k;
    client_max_body_size 50m;

    sendfile on;
    tcp_nopush on;

    keepalive_timeout 0;

    tcp_nodelay on;

    fastcgi_connect_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_read_timeout 300;
    fastcgi_buffer_size 64k;
    fastcgi_buffers 4 64k;
    fastcgi_busy_buffers_size 128k;
    fastcgi_temp_file_write_size 256k;

    gzip on;
    gzip_min_length 1k;
    gzip_buffers 4 16k;
    gzip_http_version 1.1;
    gzip_comp_level 2;
    gzip_types text/plain application/javascript application/x-javascript text/javascript text/css application/xml application/xml+rss;
    gzip_vary on;
    gzip_proxied expired no-cache no-store private auth;
    gzip_disable "MSIE [1-6]\.";

    #limit_conn_zone $binary_remote_addr zone=perip:10m;
    ##If enable limit_conn_zone,add "limit_conn perip 10;" to server section.

    server_tokens off;
    access_log off;
}

php-fpm.conf

[global]
pid = /usr/local/php/var/run/php-fpm.pid
error_log = /home/wwwlogs/php-fpm.log
log_level = notice

[www]
listen = /tmp/php-cgi.sock
listen.backlog = -1
listen.allowed_clients = 127.0.0.1
listen.owner = www
listen.group = www
listen.mode = 0666
user = www
group = www
pm = static
pm.max_children = 32
pm.start_servers = 30
pm.min_spare_servers = 30
pm.max_spare_servers = 200
request_terminate_timeout = 100
request_slowlog_timeout = 0
slowlog = var/log/slow.log

ryd994

2019-05-27 17:33:24 +08:00

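The virtual network may have run out of processing capacity, especially since there are other tenants and the host cannot hand all of its resources to you.
This is easy to rule out: go through loopback and run the test against 127.0.0.1.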

Once the network is ruled out, the more likely cause is too few PHP workers. You can judge that from the statistics in Nginx.

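You said CPU usage is 56%. That figure means nothing on its own: one core may already be saturated while the others sit idle. Look at the per-core statistics, for example run top and press the number key 1.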
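
For a scriptable version of the per-core check, here is a rough Python sketch that compares two snapshots of Linux's /proc/stat. The helper name `per_core_busy` and the snapshot numbers are made up for illustration; only the first eight counters of each `cpuN` line (user, nice, system, idle, iowait, irq, softirq, steal) are used.

```python
def per_core_busy(before, after):
    """Return {core: busy fraction} from two /proc/stat snapshots given as text."""
    def parse(text):
        cores = {}
        for line in text.splitlines():
            parts = line.split()
            # Keep per-core lines such as "cpu0"; skip the aggregate "cpu" line.
            if parts and parts[0].startswith("cpu") and parts[0][3:].isdigit():
                values = [int(v) for v in parts[1:9]]
                idle = values[3] + values[4]  # idle + iowait
                cores[parts[0]] = (sum(values), idle)
        return cores

    first, second = parse(before), parse(after)
    usage = {}
    for name, (total_a, idle_a) in first.items():
        if name not in second:
            continue
        total = second[name][0] - total_a
        idle = second[name][1] - idle_a
        usage[name] = 0.0 if total <= 0 else 1 - idle / total
    return usage


# Synthetic snapshots (not measured on the server in this thread).
SNAPSHOT_BEFORE = """cpu 200 0 100 1700 0 0 0 0
cpu0 100 0 50 850 0 0 0 0
cpu1 100 0 50 850 0 0 0 0"""
SNAPSHOT_AFTER = """cpu 295 0 115 1790 0 0 0 0
cpu0 190 0 60 850 0 0 0 0
cpu1 105 0 55 940 0 0 0 0"""

usage = per_core_busy(SNAPSHOT_BEFORE, SNAPSHOT_AFTER)
for core, busy in sorted(usage.items()):
    print(f"{core}: {busy:.0%} busy")
```

In this made-up example the average is 55%, yet cpu0 is fully busy while cpu1 is almost idle, which is exactly the case an overall figure hides. On a real machine you would read /proc/stat twice, about a second apart, and pass both texts in.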

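I just saw you say that performance is the same with keepalive enabled. In that case it may not be a network problem. Still, before ruling the network out, capture packets to confirm that keepalive is really in effect. If the configuration is wrong, keepalive may not actually be enabled.

If you confirm it is a network problem, you can contact customer support. Or switch to a network-enhanced instance, the kind with SR-IOV, and try again.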

q937298063

2019-05-28 11:17:02 +08:00

Here are the latest test results so far, requesting a static page. The command was .\abs -k -c 300 -n 500 URL.

Fields added to the log format
$upstream_connect_time $upstream_header_time $upstream_response_time $request_time
- - - 0.000

Log entries
125.109.131.170 - - [28/May/2019:11:08:26 +0800] "GET /500.html HTTP/1.0" 200 26 "-" "ApacheBench/2.3" "-"- - - 0.000
125.109.131.170 - - [28/May/2019:11:08:26 +0800] "GET /500.html HTTP/1.0" 200 26 "-" "ApacheBench/2.3" "-"- - - 0.000
125.109.131.170 - - [28/May/2019:11:08:26 +0800] "GET /500.html HTTP/1.0" 200 26 "-" "ApacheBench/2.3" "-"- - - 0.000
125.109.131.170 - - [28/May/2019:11:08:26 +0800] "GET /500.html HTTP/1.0" 200 26 "-" "ApacheBench/2.3" "-"- - - 0.000
125.109.131.170 - - [28/May/2019:11:08:26 +0800] "GET /500.html HTTP/1.0" 200 26 "-" "ApacheBench/2.3" "-"- - - 0.000
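
The last four fields of each entry are the variables listed above. As a rough illustration of how to read them, the Python sketch below pulls them out of one of the lines; `timing_fields` is a hypothetical helper, and it assumes a single upstream value per field (nginx can log comma-separated lists when several upstreams are tried).

```python
import re

# Matches the four trailing fields:
# $upstream_connect_time $upstream_header_time $upstream_response_time $request_time
TIMING_RE = re.compile(
    r"(?P<upstream_connect_time>-|[\d.]+) "
    r"(?P<upstream_header_time>-|[\d.]+) "
    r"(?P<upstream_response_time>-|[\d.]+) "
    r"(?P<request_time>[\d.]+)$"
)


def timing_fields(line):
    """Return the trailing timing fields as floats, or None where nginx logged '-'."""
    match = TIMING_RE.search(line.rstrip())
    if match is None:
        return None
    return {
        name: (None if value == "-" else float(value))
        for name, value in match.groupdict().items()
    }


SAMPLE = (
    '125.109.131.170 - - [28/May/2019:11:08:26 +0800] '
    '"GET /500.html HTTP/1.0" 200 26 "-" "ApacheBench/2.3" "-"- - - 0.000'
)
fields = timing_fields(SAMPLE)
print(fields)
```

A `-` in the upstream fields means nginx did not contact an upstream for that request, which is expected for a static file. `$request_time` is measured from the first bytes read from the client, so a value of 0.000 suggests that whatever delay the client sees happens before nginx starts handling the request.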

Test output
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256

Document Path: /500.html
Document Length: 26 bytes

Concurrency Level: 300
Time taken for tests: 3.758 seconds
Complete requests: 500
Failed requests: 0
Keep-Alive requests: 500
Total transferred: 127500 bytes
HTML transferred: 13000 bytes
Requests per second: 133.07 [#/sec] (mean)
Time per request: 2254.504 [ms] (mean)
Time per request: 7.515 [ms] (mean, across all concurrent requests)
Transfer rate: 33.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0 1286 1262.2   1119    3688
Processing:    18   30    4.8     30      38
Waiting:       18   30    4.8     30      38
Total:         18 1315 1263.2   1153    3719

Percentage of the requests served within a certain time (ms)
50% 1153
66% 1979
75% 2440
80% 2699
90% 3213
95% 3471
98% 3624
99% 3676
100% 3719 (longest request)
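
To see where the time goes in this run compared with the first one, the mean Connect and Processing times from the two Connection Times tables can be turned into shares of the total. A minimal Python sketch follows; `phase_shares` is a hypothetical helper, and reading Connect as including the TLS handshake for an HTTPS target is my assumption about how ab accounts for time, not something the output states.

```python
def phase_shares(connect_ms, processing_ms):
    """Split a mean total time into connect and processing shares."""
    total = connect_ms + processing_ms
    return {
        "connect": connect_ms / total,
        "processing": processing_ms / total,
    }


# Mean values copied from the two ab runs in this thread.
php_run = phase_shares(connect_ms=1737, processing_ms=2514)
static_run = phase_shares(connect_ms=1286, processing_ms=30)

print(f"PHP page:    {php_run['connect']:.0%} of mean time spent connecting")
print(f"Static page: {static_run['connect']:.0%} of mean time spent connecting")
```

In the static-page run almost all of the mean total is spent before the request is sent, while processing stays around 30 ms. That is consistent with the 0.000 request times in the log.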

I won't paste any more, it is all the same. I think it is probably as @dragonsunmoon said: the problem is on the Alibaba Cloud server side.