Deauville - Trouville - Caen - Bayonne
License: GPLv2, available here on GitHub
A tool for detailed tracking and logging of PostgreSQL on Linux CPU use by origin, database and user.
installation:
Compile on a machine with same Debian version or distribution version and CPU type (instructions shown here for Debian).
Compilation has been tested on Wheezy and Jessie with both g++ and clang++.
user@devel:~$ sudo apt-get install build-essential libprocps-dev
user@devel:~$ git clone http://github.com/JMD-tech/pgcputrack
user@devel:~$ cd pgcputrack
user@devel:~/pgcputrack$ make
g++ -std=c++11 -c -o pgcputrack.o pgcputrack.cpp
cc -lprocps -lstdc++ -lrt -g pgcputrack.o -o pgcputrack
g++ -std=c++11 -c -o showpgcpuse.o showpgcpuse.cpp
cc -lprocps -lstdc++ -lrt -g showpgcpuse.o -o showpgcpuse
Then copy the binaries to the server: note: runtime version of libprocps is already installed by default for ps/top etc. TODO: Debian package
user@devel:~/pgcputrack$ scp pgcputrack showpgcpuse root@db-server:/root/
usage:
First use the pgcputrack tool to record PostgreSQL CPU use:
root@db-server:~# ./pgcputrack > pgcpustats.txt
Use CTRL+C to stop recording.
Or use nohup+kill to capture without keeping console open:
root@db-server:~# nohup ./pgcputrack > pgcpustats.txt &
root@db-server:~# killall pgcputrack
TODO: add daemon mode
Then use the showpgcpuse tool to display statistics:
root@db-server:~# ./showpgcpuse pgcpustats.txt
CPU BY DB:
1869 13320 ms 4%
2325 13330 ms 4%
2809 20620 ms 7%
7148 149455 ms 54%
8016 11310 ms 4%
collector 3805 ms 1%
global 30290 ms 10%
launcher 785 ms 0%
worker 2825 ms 1%
writer 65 ms 0%
CPU BY USER:
apiv1 95875 ms 34%
apiv2 144985 ms 52%
autovacuum 3610 ms 1%
dataupdater 27035 ms 9%
stats 3805 ms 1%
wal 65 ms 0%
CPU BY ORIGIN:
10.45.8.79 10070 ms 3%
127.0.0.1 6965 ms 2%
172.19.203.101 134300 ms 48%
172.19.203.102 106165 ms 38%
[local] 395 ms 0%
process 7480 ms 2%
Total CPU USED: 275.385 s (avg 59.1% during 465.504 s)
note: The totals doesn't add-up as for readability i've removed most of the less significant data (30 subsidiaries) and also changed users, IP and DB names. Doesn't add-up with actual data anyway (about 85%), the rest being connection overhead and very short queries that fall under the polling interval. But still shows which subsidiary had a performance problem.
TODO: make some Lazarus frontend to display nice shiny graphics charts... useless but more marketing value.
While faced with increasing load on 2 database servers at one of our main customers,
we needed to identify where was the most CPU used.
As PostgreSQL forks a client handler per connection, and sets its
process title to include database and user, it would not be too difficult to
track this by process information tools and APIs, right?
Obviously top and htop only show an instant view, and with about 100+ concurrent
connections it isn't practical at all.
After a quick tour of existing PostgreSQL monitoring solutions,
none was providing this accurate and persistent picture of DB CPU load.
Also as it was to run on production servers that experienced high load,
it was mandatory that it was an external tool with no interaction on
the database server process, ie not requiring an active DB connection nor
injection of some debugging library or additionnal module into postgres process.
So i decided it was time to write the highly needed but not yet existing tool :)
So the idea was to write a small tool that does exactly 3 things:
This sounds simple, however i stumbled upon many annoyances, each
was addressable with various workarounds or switching of API/libs but
the process was less straightforward that i first thought.
While searching for process operations notifications, i quickly came across
an excellent piece of code by BewareOfGeek that shows an example of netlink sockets
for process notification events.
A sample code self contained in a single c file that compiles out of the box
on default options, produces a minimalist tool that can already be useful on its own,
and as the code is very neat and readable it makes a perfect starting block
to build upon for everything process-tracking based.
(another unrelated idea came to mind for a reliable and tamper-proof way to keep track
of all commands issued by sudoers even when they would sudo su out of lazy habit
or maliciously spawn a shell from some userland app, maybe i'll code this one too
if i have the need)
This was easy, straightforward and relaxing.
Handle fork and identify postgres forked processes, handle exit and get cmdline, stime+utime
Next, on to extracting process "title" and resource usage.
getrusage sounded good, if only it could retrieve information on any process instead of
caller and its child processes.
Some searches and digging into ps source code directed me at libprocps.
Had some doubt on the ability to extract process information on exit, as
we are notified by a socket, so the process information we need might as well
have been cleaned by the kernel by the time we get the event.
Working around this would simply require storing information we got at start in an associative array
by pid and use those data when notified of process exit. Oh, except that stime+utime
actually change during process life, so we would need periodic updates too...
sounds worse both for implementation cleanliness and CPU overhead.
Decided to first ignore this concern and implement reading data at exit anyway, and see how that'll do.
Worked half-good:
utime and stime data were present, and some tests with a dummy-controlled-load process
showed those values to be consistent and accurate. good.
The process name was there in proc_t.cmd but with only 'postgres' name,
proc_t.cmdline with full process title was definately empty at exit.
And trying to get it at fork time... obviously at that time the cmdline
is simply 'postgres' because the just forked child process has not yet
changed its title to include origin, database and user name.
2 options left:
Finally tested various polling interval, and settled to 10ms as it used about 2% CPU use on peak-load while achieving maximal accuracy (utime and stime provided by libprocps are in 0.01s units)