readme.fr

Hot opensource news

Troubleshooting session (MaxClients exceeded / http timeout)

troubleshootingThe last week we had a production incident on one of our customer

Monitoring was red, check_http was in timeout state and the web server doesn’t wanted to handle new connexions.

So what is our problem, what happened to this web site, it was fine for long time and today, we have just a lot of connexions timeout ?

I will give you directly the answer, each web page made a mysql select query with order on big table without index on 97 000 raw.

select … ORDER BY DESC LIMIT 1;

One of my colleague (guits) handled this alert. The solution was pretty simple but I want share with you the way this issue was troubleshooted.

Reproduce the issue

Try yourself the troubleshooting part with this quick solution to reproduce the issue.

The closer way to have the same issue is to write a small troubleshooting.php script like this :

<?php $dbuser = "username"; $dbpass = "password"; echo 'start...'; mysql_connect("localhost", $dbuser, $dbpass); mysql_query('select SLEEP(60);'); echo '...end'; ?>
troubleshooting.php

To have more time than 60 sec, you could have a look php and nginx timeout configuration.

And then curl your php page :

curl http://foo.com/troubleshooting.php
curl troubleshooting.php

 

For lazy one, just open a mysql term :

mysql -H localhost --user=... -p
mysql -H localhost --user=... -p

It not really looks like a production environment but, it works too.

Troubleshooting

On production server we mainly have mysql on different server than the web server. So we are in TCP socket.

Note : Even if you are not using mysql server with unix socket, I really recommend you to read the “Unix socket section” of this article anyway.

Let’s go, launch your curl and go ahead.

Start the most difficult part, find the php process that handle the curl.

ps faux | grep fpm

www-data 17899 0.0 0.1 221468 6508 ? S 00:32 0:00 \_ php-fpm: pool www
www-data 18916 0.0 0.1 221468 6500 ? S 00:35 0:00 \_ php-fpm: pool www
www-data 18932 0.0 0.8 225400 33884 ? S 00:35 0:01 \_ php-fpm: pool www
www-data 25070 0.0 0.1 221468 6460 ? S 00:47 0:00 \_ php-fpm: pool www
ps faux

 

Have a look of what all these process are doing.

Start to strace process. On this sandbox server I will strace all process for demo purpose. In production it could display a huge amount of logs. Prefer to do that on one of them and, try to be lucky ;).

strace -p $(echo $(ps faux | grep '[f]pm: pool' | awk '{print $2}' | tr '\n' ' ') | sed 's/ / -p /g')
strace -p 17899 -p 18916 -p 18932 -p 25070 -e read
[pid 29969] write(4, "8\0\0\0\3select SLEEP(60)
[pid 29969] read(4,
strace

On this example the write output is pretty obvious. In a bigger php script, it could be not so clear. Let’s ignore this and continue.

The important thing is your process number 29969 is waiting to read something on the file descriptor 4.

Let’s find on which inode are pointing the file descriptor :

ls -l /proc/29969/fd/4
lrwx------ 1 www-data www-data 64 oct. 17 01:00 /proc/29969/fd/4 -> socket:[509706165]

Nice, the file descriptor is pointing on a socket with the inode 509706165.

Call our friend netstat to find the established connexion refering to this inode

netstat -laputnewx   | grep 509706165
tcp 0 0 10.0.253.2:37133 10.0.253.2:3306 ESTABLISHED 33 509706165 29969/php-fpm: pool
netstat

That netstat tell us : This socket go to a mysql “10.0.253.2:3306” with the source port “10.0.253.2:37133

Now check mysql queries and try to find what is the query for our process on this source port :

mysql -e 'show full processlist;' | grep 37133
| 83 | user | 10.0.253.2:37133 | NULL | Query   |   27 | User sleep | select SLEEP(60) |
mysql processlist

That’s it, We find the query launched !

Note : You can skip the search of the inode, a simple netstat with the PID can allow you to determine directly.

netstat -laputnewx | grep 29969
tcp 0 0 10.0.253.2:37133 10.0.253.2:3306 ESTABLISHED 33 509706165 29969/php-fpm: pool
tcp 8 0 127.0.0.1:9000 127.0.0.1:47319 ESTABLISHED 33 509706131 29969/php-fpm: pool
netstat

Unix socket

In case of unix socket it’s more difficult for 2 reasons :

  • On kernel < 3.3 you can’t easily find the link between fd and the socket destination.
  • Mysql processlist will show you only localhost for the host identifier.

kernel < 3.3

First strace give you very good informations on the php file and socket called.

strace -p 17899 -p 18916 -p 18932 -p 25070
[pid 13421] stat("/var/www/foo.com/troubleshooting.php", {st_mode=S_IFREG|0644, st_size=413, ...}) = 0
[pid 13421] connect(4, {sa_family=AF_FILE, path="/var/run/mysqld/mysqld.sock"}, 110) = 0
[pid 13421] write(4, "8\0\0\0\3select SLEEP(60)
[pid 13421] read(4,
strace socket

You can see the connect call to /var/run/mysqld/mysqld.sock. With command like ss -xp you will be able to know what process handle this socket.

ss -xp | grep sock
u_str ESTAB 0 0 /var/run/mysqld/mysqld.sock 513132085 * 0 users:(("mysqld",11414,20))
ss

 

Don’t worry, with the file descriptor it’s still possible to deduce the socket peer from the fd number : (http://unix.stackexchange.com/questions/16300/whos-got-the-other-end-of-this-unix-socketpair)

The number shown in /proc/$pid/fd/$fd is the socket’s inode number in the virtual socket filesystem. When you create a pipe or socket pair, each end successively receives an inode number. The numbers are attributed sequentially, so there is a high probability that the numbers differ by 1, but this is not guaranteed (either because the first socket was N and N+1 was already in use due to wrapping, or because some other thread was scheduled between the two inode allocations and that thread created some inodes too).

Let’s try get the PID and the FD:

strace -p 17899
[pid 16882] write(4, "\22\0\0\0\3select sleep(60);", 22) = 22
[pid 16882] read(4, ^C <unfinished ...>
strace

Find the inode for this FD :

ls -l /proc/16882/fd/4 
lrwx------ 1 www-data www-data 64 oct. 17 17:01 /proc/16882/fd/4 -> socket:[510406006]
ls

As described above, try to see what we get if we increment the inode number (inode + 1) :

netstat -laputnewx | grep $((510406006+1))
unix 3 [ ] STREAM CONNECTE 510406007 11414/mysqld /var/run/mysqld/mysqld.sock
# or with ss
ss -xp | grep $((510406006+1))
u_str ESTAB 0 0 /var/run/mysqld/mysqld.sock 510406007 * 0 users:(("mysqld",11414,45))
netstat / ss

 

Note : That it’s also possible to do that in the reverse order to know who is connected to mysql.

Since kernel 3.3

Good news for OPS, since kernel 3.3, the Local and Peer port for a socket is now displayed.

So, It is possible to find what we want without increment the inode number anymore.

# Get inode from the fd
ls -l /proc/603/fd/3
lrwx------ 1 root root 64 Oct 17 15:48 3 -> socket:[80514]

# Get informations from the inode number
ss -xp | grep 80514
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 0 0 * 80514 * 81937 users:(("mysql",pid=603,fd=3))
u_str ESTAB 0 0 /var/run/mysqld/mysqld.sock 81937 * 80514
ss since 3.3 kernel

 

Important thing to note is the Local port go to the Peer port (80514 -> 81937) and that is our mysql. Bingo !

 

Update : There is also some others commands to help you debug this

 

lsof -ni | grep fpm
others commands
gaelL
gaelL on GithubgaelL on LinkedingaelL on Wordpress

, , ,

Leave a Reply

Your email address will not be published. Required fields are marked *