Surprising and fun rwasa+hhvm bugfix
Jeff Marrison
Disclaimer: The views expressed here are solely those of the author, and do not necessarily represent the views of 2 Ton Digital.
The basic skinny:
The other day I received a bug report from Petar Koretić (Thanks!) in relation to rwasa returning 502 Had a blue with the old fella
(bad gateway) responses when dealing with HHVM and a one-liner PHP file. In this case, he was testing rwasa with phpinfo();
, and indeed was getting 502 errors. I got such a kick out of this particular bug discovery process that it definitely deserved a narrative, ha!
Our fine bug reporter kindly included strace -f
output from rwasa that clearly demonstrated the 502 issue, but wasn't exactly enlightening as to why, so I setup an Ubuntu 15.04 server VM and got HHVM 3.7.1 running on it and recreated his test setup and got the exact same results. I created a one line phpinfo();
file and stuck it in /srv/http/index.php
and started up both as:
# hhvm --mode server -vServer.Type=fastcgi -vServer.FileSocket=/dev/shm/hhvm.sock -p 9000 # ./rwasa -bind 80 -sandbox /srv/http -fastcgi .php /dev/shm/hhvm.sock -cpu 1 -foreground -indexfiles index.php
Here's where things got interesting
I enabled webserver_debug
, added some FastCGI debug output for 502 errors, and recompiled rwasa. As it turns out, rwasa was seeing socket closed errors on the small percentage of requests that were returning 502, indicating that no data was received prior to the close event. Running this same test with "plain jane" PHP in FastCGI mode did not cause the error whatsoever, so I decided to strace -p ...
all of the HHVM processes and run the test again to see if I could get anymore information out of the other side. Much to my surprise, with strace
running against HHVM NO 502 ERRORS OCCURRED.
Come again?
Ha! I have been dealing with commercial software development for something like 28 years, and in all that time, the process of watching what a given program is doing via strace does not make the problem go away. In this case however, much like a quantum physics experiment, the act of watching HHVM with strace
completely eliminated the problem and not a single 502 error was returned.
What was revealed however in reviewing the strace
output from HHVM was that in very rapid succession, for each short-lived connection from rwasa there were two sendmsg
calls, immediately followed by a close
call. Interestingly, for any larger-than-one-liner PHP activity, the error also did not occur. It was only for these one-liner simple and highly optimized PHP responses that it was occurring.
Enlightenment
I am not entirely sure how no other test scenario that was run against rwasa or webslap in the entire history of its development and subsequent releases never caught this case. What was actually happening was that the EPOLLHUP and EPOLLIN were coming through to the HeavyThing library's epoll event at the same time. Since versions prior to this bug always took precedent on EPOLLHUP and EPOLLERR, the FastCGI layer for rwasa was receiving no data and doing exactly what I told it to do, return a 502 error.
The bit that I found so interesting about it is that the act of running strace
against HHVM was causing them to be sent separately, thus completely eliminating the issue. Suffice it to say that programming is rarely like a quantum physics experiment, and watching things has never before caused behaviour changes like it did with this issue. Ha!
Fixing the bug wasn't difficult at all, and finding it was fun! Version 1.11 and up of all of our library and programs contains the fix.
Very cool update!
Thanks to JohnFound over on the board for pointing out that this situation actually has a proper name that I had never heard before: Heisenbug! Learn something new every day, haha! I am sure I would not still be laughing about this had I not been so readily able to figure it out.