Honey Pot Details
Apache HTTP Server
I ran the honey pot on an Apache HTTP server. Apache's mod_rewrite module helped me answer requests for (emulated) blogs whose URIs I couldn't know in advance, and requests for PHP files in plugins or themes that the fake blog claimed to have installed successfully, when in fact it had only saved the zip file.
I used an extended Apache log file format: LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined. A typical line in access_log looks like this:
192.168.1.114 - - [08/Jul/2013:22:16:46 -0600] "GET / HTTP/1.1" 403 179 "-" "Mozilla/5.0 (X11; Linux i686; rv:22.0) Gecko/20100101 Firefox/22.0"
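To make the fields of that log line concrete, here is a minimal Python sketch that splits a line in the format above into named parts. The function name parse_access_line and the dictionary keys are my own choices for illustration, not part of the honey pot code, and the regular expression has only been checked against the sample line shown above.

```python
import re
from datetime import datetime

# One named group per field of the LogFormat shown above.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" '
    r'"(?P<agent>(?:[^"\\]|\\.)*)"'
)

def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # %t is logged like 08/Jul/2013:22:16:46 -0600
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # %b logs "-" when no body bytes were sent
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

SAMPLE = ('192.168.1.114 - - [08/Jul/2013:22:16:46 -0600] "GET / HTTP/1.1" '
          '403 179 "-" "Mozilla/5.0 (X11; Linux i686; rv:22.0) '
          'Gecko/20100101 Firefox/22.0"')
entry = parse_access_line(SAMPLE)
```

The month abbreviation is parsed with the current locale, so this assumes an English or C locale.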
p0f Passive Scanner
I ran p0f v2.0, a passive OS fingerprinting daemon, on the same server. This allows guessing the client's operating system on a basis other than the User-Agent string. A typical output line (one per SYN packet) looks like this:
<Thu Sep 26 21:26:15 2013> 10.0.0.3:48234 - Linux 3.0-1 (1) (possibly Ubuntu 11.10, FC 16, Gentoo 11.2, OpenSUSE 12.x) (up: 247 hrs) -> 89.187.142.144:80 (distance 0, link: ethernet/modem)
p0f v3 became available shortly after I started using p0f v2, but I like the single-line format of p0f v2's output, so I stayed with v2. I used fingerprint file p0f.fp.2012032901.
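The single-line format is convenient because one regular expression can pick it apart. The following Python sketch is based only on the sample p0f line shown above; other p0f output lines may have shapes it does not handle. The function name parse_p0f_line and the field names are hypothetical.

```python
import re
from datetime import datetime

# Pattern derived from the single sample line above (an assumption).
P0F_RE = re.compile(
    r'<(?P<time>[^>]+)> '
    r'(?P<src_ip>\d+\.\d+\.\d+\.\d+):(?P<src_port>\d+) - '
    r'(?P<os>.*?) -> '
    r'(?P<dst_ip>\d+\.\d+\.\d+\.\d+):(?P<dst_port>\d+)'
    r'(?: \((?P<link>.*)\))?$'
)

def parse_p0f_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = P0F_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # The timestamp carries no time zone, so the result is a naive datetime.
    rec["time"] = datetime.strptime(rec["time"], "%a %b %d %H:%M:%S %Y")
    rec["src_port"] = int(rec["src_port"])
    rec["dst_port"] = int(rec["dst_port"])
    return rec

SAMPLE = ('<Thu Sep 26 21:26:15 2013> 10.0.0.3:48234 - Linux 3.0-1 (1) '
          '(possibly Ubuntu 11.10, FC 16, Gentoo 11.2, OpenSUSE 12.x) '
          '(up: 247 hrs) -> 89.187.142.144:80 '
          '(distance 0, link: ethernet/modem)')
rec = parse_p0f_line(SAMPLE)
```

Because the p0f timestamp has no zone while access_log timestamps do, any comparison between the two has to assume which zone p0f logged in.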
Database
I used a PostgreSQL database to store and correlate access_log entries, p0f observations and the information that the WordPress blog emulation logged. It has a "star schema" layout, with three fact tables and a number of supporting dimension tables.
It is possible to match p0f OS identifications to Apache log file entries, so my database design does that as thoroughly as possible. Since the Apache HTTP server runs the PHP interpreter, it's relatively easy to match the per-request information that my honey pot and malware emulation code saved to a given HTTP request: both carry the IP address that the request came from, the requested URI, and a timestamp. The tricky part is not matching more than one honey pot record to a single Apache log file line. If no exact match on IP address, URI and timestamp exists (one that a previous honey pot request has not already claimed), you have to look for Apache log file entries within a few seconds of the honey pot record's timestamp.
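The matching rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the database code I actually ran: the record keys "ip", "uri" and "time", the function name match_requests, and the 5-second window are all hypothetical. Doing every exact match before any approximate one is also a choice made for this sketch, so that an approximate match never takes a log line that some other record matches exactly.

```python
from datetime import datetime, timedelta

def match_requests(honeypot_records, log_entries, window_seconds=5):
    """Pair each honey pot record with at most one access_log entry.

    Returns a list parallel to honeypot_records holding the index of the
    matched log entry, or None when nothing suitable was found.
    """
    window = timedelta(seconds=window_seconds)
    used = set()                       # log entries already claimed
    matches = [None] * len(honeypot_records)

    def candidates(rec):
        # Unclaimed log entries with the same IP address and URI.
        return [i for i, e in enumerate(log_entries)
                if i not in used
                and e["ip"] == rec["ip"] and e["uri"] == rec["uri"]]

    # Pass 1: exact match on IP address, URI and timestamp.
    for h, rec in enumerate(honeypot_records):
        for i in candidates(rec):
            if log_entries[i]["time"] == rec["time"]:
                matches[h] = i
                used.add(i)
                break

    # Pass 2: nearest unclaimed entry within a few seconds.
    for h, rec in enumerate(honeypot_records):
        if matches[h] is not None:
            continue
        near = [i for i in candidates(rec)
                if abs(log_entries[i]["time"] - rec["time"]) <= window]
        if near:
            best = min(near,
                       key=lambda i: abs(log_entries[i]["time"] - rec["time"]))
            matches[h] = best
            used.add(best)
    return matches
```

The used set is what prevents two honey pot records from being matched to the same Apache log file line.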
Emulated Software
I captured HTML from a real WordPress 2.9.2 installation. I wrote small amounts of PHP code to display the captured HTML appropriately. The PHP that executed on honey pot page access logged the values in $_SERVER, $_REQUEST and $_COOKIE, and kept the files named in $_FILES, along with other info. Most of the interaction with my fake WordPress blog came from programs: the CSS files referenced in the HTML rarely got requested. Any access by a real browser would have requested CSS files, or at least favicon.ico.
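The CSS and favicon.ico observation can be turned into a rough filter. The Python sketch below flags client addresses that never asked for a stylesheet or for favicon.ico. The function name likely_automated and the (ip, path) input shape are hypothetical, and the rule is only a heuristic: a real browser with cached CSS might not request it again.

```python
from collections import defaultdict

def likely_automated(requests):
    """requests: iterable of (ip, uri_path) pairs taken from access_log.

    Returns the set of IP addresses that never requested a .css file
    or favicon.ico.
    """
    saw_asset = defaultdict(bool)
    for ip, path in requests:
        # Ignore any query string, e.g. style.css?ver=1
        path_only = path.split("?", 1)[0].lower()
        is_asset = (path_only.endswith(".css")
                    or path_only.endswith("/favicon.ico"))
        saw_asset[ip] = saw_asset[ip] or is_asset
    return {ip for ip, seen in saw_asset.items() if not seen}

example = [
    ("1.1.1.1", "/"),
    ("1.1.1.1", "/wp-content/themes/default/style.css?ver=1"),
    ("2.2.2.2", "/wp-login.php"),
    ("3.3.3.3", "/favicon.ico"),
]
flagged = likely_automated(example)
```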
The ability to appear to install malware was key to tricking attackers into downloading much of the malware. In addition to faking WordPress theme and plugin downloads and installs, and faking the acceptance of new posts, I emulated parts of the popular WSO 2.5.1 web shell. I installed a WSO shell file on a second machine behind a firewall. I captured HTML for the WSO login screen, the main screen and the results of a successful file upload. I wrote PHP code that served the various chunks of HTML at appropriate times. The PHP code also captured any uploaded files, and recorded a lot of the information available to PHP at the time of access.
I believe that generating WSO-like HTML that shows a file got downloaded is essential to getting lots of downloads. Showing a "main page" that does not list the new file after a download attempt doesn't seem to cue attackers to invoke what they downloaded, and perhaps download something else. WSO emulators that show a directory listing with more than just the last file uploaded may draw more invocations of downloaded malware.
I also emulated some pieces of malware that came along: an SMTP tester downloaded via the 7c334.php backdoor, and the qq.php emailer. I had to read and understand all downloaded code before I could emulate it adequately, so I did not finish the qq.php emulation in time to catch any spamming attempts. I did catch malware with the 7c334.php emulator.