19 10 2015
Troubleshooting session (MaxClients exceeded / http timeout)
The last week we had a production incident on one of our customer
Monitoring was red, check_http was in timeout state and the web server doesn’t wanted to handle new connexions.
So what is our problem, what happened to this web site, it was fine for long time and today, we have just a lot of connexions timeout ?
I will give you directly the answer, each web page made a mysql select query with order on big table without index on 97 000 raw.
select … ORDER BY DESC LIMIT 1;
One of my colleague (guits) handled this alert. The solution was pretty simple but I want share with you the way this issue was troubleshooted.
Reproduce the issue
Try yourself the troubleshooting part with this quick solution to reproduce the issue.
The closer way to have the same issue is to write a small troubleshooting.php script like this :
To have more time than 60 sec, you could have a look php and nginx timeout configuration.
And then curl your php page :
For lazy one, just open a mysql term :
It not really looks like a production environment but, it works too.
Troubleshooting
On production server we mainly have mysql on different server than the web server. So we are in TCP socket.
Note : Even if you are not using mysql server with unix socket, I really recommend you to read the “Unix socket section” of this article anyway.
Let’s go, launch your curl and go ahead.
Start the most difficult part, find the php process that handle the curl.
Have a look of what all these process are doing.
Start to strace process. On this sandbox server I will strace all process for demo purpose. In production it could display a huge amount of logs. Prefer to do that on one of them and, try to be lucky ;).
On this example the write output is pretty obvious. In a bigger php script, it could be not so clear. Let’s ignore this and continue.
The important thing is your process number 29969 is waiting to read something on the file descriptor 4.
Let’s find on which inode are pointing the file descriptor :
Nice, the file descriptor is pointing on a socket with the inode 509706165.
Call our friend netstat to find the established connexion refering to this inode
That netstat tell us : This socket go to a mysql “10.0.253.2:3306” with the source port “10.0.253.2:37133”
Now check mysql queries and try to find what is the query for our process on this source port :
That’s it, We find the query launched !
Note : You can skip the search of the inode, a simple netstat with the PID can allow you to determine directly.
Unix socket
In case of unix socket it’s more difficult for 2 reasons :
- On kernel < 3.3 you can’t easily find the link between fd and the socket destination.
- Mysql processlist will show you only localhost for the host identifier.
kernel < 3.3
First strace give you very good informations on the php file and socket called.
You can see the connect call to /var/run/mysqld/mysqld.sock. With command like ss -xp you will be able to know what process handle this socket.
Don’t worry, with the file descriptor it’s still possible to deduce the socket peer from the fd number : (http://unix.stackexchange.com/questions/16300/whos-got-the-other-end-of-this-unix-socketpair)
The number shown in /proc/$pid/fd/$fd is the socket’s inode number in the virtual socket filesystem. When you create a pipe or socket pair, each end successively receives an inode number. The numbers are attributed sequentially, so there is a high probability that the numbers differ by 1, but this is not guaranteed (either because the first socket was N and N+1 was already in use due to wrapping, or because some other thread was scheduled between the two inode allocations and that thread created some inodes too).
Let’s try get the PID and the FD:
Find the inode for this FD :
As described above, try to see what we get if we increment the inode number (inode + 1) :
Note : That it’s also possible to do that in the reverse order to know who is connected to mysql.
Since kernel 3.3
Good news for OPS, since kernel 3.3, the Local and Peer port for a socket is now displayed.
So, It is possible to find what we want without increment the inode number anymore.
Important thing to note is the Local port go to the Peer port (80514 -> 81937) and that is our mysql. Bingo !
Update : There is also some others commands to help you debug this
Rabbitmq split brain Openshift 3 : My node can’t spaw more than 40 pods, why ?