Sunday, 20 January 2013

Tornado IOLoop Web Server Statistics Collector

Last week I blogged about how often certain web servers are used on the public internet. Here is the script I used to collect that data. I used it to try out async network programming, coroutines, closures and multi-threading in Python, and also to test the scalability of my OSs (Darwin/Linux) and of tornado. It wasn't a well-defined test, but Darwin died at 10'000 concurrent connections, while Linux easily managed 80'000 connections on the same hardware.
The most important rule of async programming: Never ever block!
#           _______  _______  _       _________ _        _______ 
# |\     /|(  ___  )(  ____ )( (    /|\__   __/( (    /|(  ____ \
# | )   ( || (   ) || (    )||  \  ( |   ) (   |  \  ( || (    \/
# | | _ | || (___) || (____)||   \ | |   | |   |   \ | || |      
# | |( )| ||  ___  ||     __)| (\ \) |   | |   | (\ \) || | ____ 
# | || || || (   ) || (\ (   | | \   |   | |   | | \   || | \_  )
# | () () || )   ( || ) \ \__| )  \  |___) (___| )  \  || (___) |
# (_______)|/     \||/   \__/|/    )_)\_______/|/    )_)(_______)
#
# If you use this script, your ISP might think you've got a trojan
# and sandbox you, ban you or whatever they think is appropriate.
#
# This script collects the Monte Carlo web-server statistics by
# connecting to random web servers and asking each one for its name.
# The results are stored in a dictionary with each identification string
# as key and the count of web-servers found as value.

#
# If you want to test the maximum speed / concurrent connections
# remove these lines
#        if hcount > 10000:
#            time.sleep(1)
# and run a process per core on your machine. Processes have to have
# different working directories!
#
# Features:
#
# * Defining maximum number of concurrent connections. This is important
#   for OS X and maybe other BSD-based systems. They tend to lock up beyond
#   9000 connections. I even had random reboots on OS X.
# * Linux on the other hand just scales and scales and scales. ;-)
# * I was able to maintain 80'000 connections on linux with four processes
#   -> Then I hit the limit of the upstream-bandwidth at home.
# * It only tries to access valid IPs (ie. ignores private IPs)
# * It dumps snapshots of the collected data every 5000 successful connections
# * It uses tornado's supercool read_until_regex function
# * IPs are fed to the ioloop by a separate thread
# * It properly cleans up used connections after 6 seconds
# -> To make the script faster you can reduce this timeout, although then
#    you might miss some slow servers/connections.
# * It locks shared datastructures.
# * I used tornado.gen to write async code as a single function using
#   coroutines. Coroutines are one reason I love lua and python!
#   Async-code gets so much more readable!
# * It's not tested on Python 2; use Python 3.2 or higher
# * Use 2to3-3.x to convert the iptools module
#    2to3-3.2 -w
#    python setup.py install
# !! CONFIGURE YOUR OS for the maximum number of concurrent connections you
#    want to test. If the hard limit is already high enough, the script will
#    set a limit of 10240.
# * Remove the resource.setrlimit code if your OS doesn't support it.
# * It uses closures to pass settings to callbacks
# * I hope the tornado.iostream methods are threadsafe. In a production system
#   you should definitely move these calls to the main thread.
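
Two of the steps listed in the header can be illustrated without touching the network: picking a random public IPv4 address and pulling the server name out of a raw HTTP response. The following is only a minimal sketch, not the original script. The names `random_public_ip` and `parse_server_header` are made up for this example, and it uses the standard-library `ipaddress` module (Python 3.4 or higher for `is_global`) instead of iptools.

```python
import ipaddress
import random
import re
from collections import Counter

# Matches the Server header line in the head of a raw HTTP response.
SERVER_RE = re.compile(
    rb"^Server:[ \t]*(.*?)[ \t]*\r?$", re.IGNORECASE | re.MULTILINE
)


def random_public_ip(rng=random):
    """Draw random IPv4 addresses until one is publicly routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        # is_global rules out private, loopback and link-local ranges
        # (among others); multicast is excluded separately.
        if addr.is_global and not addr.is_multicast:
            return addr


def parse_server_header(raw):
    """Return the Server header value from raw response bytes, or None."""
    match = SERVER_RE.search(raw)
    if match is None:
        return None
    return match.group(1).decode("latin-1")


# Identification string -> number of servers seen, as in webservers.json.
stats = Counter()
response = b"HTTP/1.1 200 OK\r\nServer: nginx/1.0.11\r\nContent-Length: 0\r\n\r\n"
name = parse_server_header(response)
if name is not None:
    stats[name] += 1
```

In a multi-threaded collector like the one described above, updates to `stats` would additionally need to be guarded by a lock.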

Monday, 14 January 2013

Monte Carlo Web-Server Statistics using Pandas and Matplotlib


I collected the Monte Carlo web-server statistics by connecting to random web servers and asking each one for its name. I'll blog about this next time. It was quite exciting: I was able to maintain 80'000 concurrent connections on Linux using tornado's ioloop before I hit the limit of the upstream bandwidth at home.

Download webservers.json.gz here. The file is a dictionary with each identification string as key and the count of web-servers found as value.

Forgive me for using terms like "Monte Carlo": it sounds great as a blog title and I hope it is not completely and utterly wrong.

In [1]:
import pandas
import adsy.ipython as ip
import re
import io
import adsy.plotenhance as mp
import sys
import json
%pylab inline
('Python', sys.version, 'Pandas', pandas.version.version)
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
Out[1]:
('Python',
 '3.2.3 (default, Oct 19 2012, 19:53:16) \n[GCC 4.7.2]',
 'Pandas',
 '0.10.0')
In [2]:
wstat = json.load(io.open("webservers.json", "r"))
The file is read into a pandas DataFrame. Version 0.10 of pandas seems to support literal indexes; I believe that is new. I add a column 'wtype' for the most common web-server types.
In [3]:
df = pandas.DataFrame.from_dict(wstat, orient='index')
df.columns = ['wcount']
df = df.sort(columns=['wcount'], ascending=False)
df['wtype'] = "other"
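
The cell above was written for pandas 0.10. In later pandas releases `DataFrame.sort` was removed in favour of `sort_values`, so on a current installation the same steps would look roughly like the sketch below. The three-entry `wstat` dictionary is a stand-in for the real webservers.json, used here only to keep the example self-contained.

```python
import pandas

# Small stand-in for webservers.json: identification string -> count.
wstat = {
    "Apache": 27085,
    "RomPager/4.07 UPnP/1.0": 37162,
    "nginx/1.0.11": 5128,
}

# One row per identification string, one unnamed column holding the count.
df = pandas.DataFrame.from_dict(wstat, orient="index")
df.columns = ["wcount"]

# sort_values replaces the old df.sort(columns=[...]) call.
df = df.sort_values("wcount", ascending=False)
df["wtype"] = "other"
```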
I always use adsy-python's display_html because it suppresses the creation of vertical scrollbars, which pandas adds by default.
In [4]:
ip.dh(df[:10])
Out[4]:
                        wcount  wtype
RomPager/4.07 UPnP/1.0   37162  other
Apache                   27085  other
AkamaiGHost              25167  other
Microsoft-IIS/6.0        14432  other
micro_httpd              10862  other
Microsoft-IIS/7.5         8838  other
Apache/2.2.3 (CentOS)     8336  other
GoAhead-Webs              7807  other
nginx/1.0.11              5128  other
Microsoft-IIS/7.0         3343  other
I detect the most common web servers and write the result to the wtype column; in the next step I'll group by this column.
In [5]:
re_rompager = re.compile('rompager', re.IGNORECASE)
re_apache = re.compile('apache', re.IGNORECASE)
re_iis = re.compile('microsoft-iis', re.IGNORECASE)
re_akamai = re.compile('akamai', re.IGNORECASE)
re_nginx = re.compile('nginx', re.IGNORECASE)
re_micro_httpd =  re.compile('micro_httpd', re.IGNORECASE)
def getwtype(arg):
    if re_rompager.search(arg.name):
        arg['wtype'] = 'rompager'
    elif re_apache.search(arg.name):
        arg['wtype'] = 'apache'
    elif re_iis.search(arg.name):
        arg['wtype'] = 'iis'
    elif re_akamai.search(arg.name):
        arg['wtype'] = 'akamai'    
    elif re_nginx.search(arg.name):
        arg['wtype'] = 'nginx'
    elif re_micro_httpd.search(arg.name):
        arg['wtype'] = 'micro_httpd'
    else:
        arg['wtype'] = 'other'
    return arg
In [6]:
df = df.apply(getwtype, axis=1)
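
The row-wise `apply` above works, but it hands every row to Python as a Series. As an alternative, here is a sketch that classifies the index strings directly with a plain substring test instead of compiled regexes. The `PATTERNS` table and the `classify` helper are names invented for this example; the priority order mirrors the if/elif chain in `getwtype`.

```python
import pandas

# (label, lowercase substring) pairs, in the same priority order as getwtype.
PATTERNS = [
    ("rompager", "rompager"),
    ("apache", "apache"),
    ("iis", "microsoft-iis"),
    ("akamai", "akamai"),
    ("nginx", "nginx"),
    ("micro_httpd", "micro_httpd"),
]


def classify(name):
    """Return the wtype label for one identification string."""
    lowered = name.lower()
    for label, needle in PATTERNS:
        if needle in lowered:
            return label
    return "other"


# Four rows taken from the table above, enough to exercise the classifier.
df = pandas.DataFrame(
    {"wcount": [37162, 27085, 14432, 7807]},
    index=["RomPager/4.07 UPnP/1.0", "Apache", "Microsoft-IIS/6.0", "GoAhead-Webs"],
)
# The identification strings live in the index, so classify those.
df["wtype"] = [classify(name) for name in df.index]
```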
In [7]:
ip.dh(df[:10])
Out[7]:
                        wcount  wtype
RomPager/4.07 UPnP/1.0   37162  rompager
Apache                   27085  apache
AkamaiGHost              25167  akamai
Microsoft-IIS/6.0        14432  iis
micro_httpd              10862  micro_httpd
Microsoft-IIS/7.5         8838  iis
Apache/2.2.3 (CentOS)     8336  apache
GoAhead-Webs              7807  other
nginx/1.0.11              5128  nginx
Microsoft-IIS/7.0         3343  iis
Now the beautiful pandas statement: first group by wtype and then sum wcount.
In [8]:
ndf = df.groupby('wtype').sum()
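
To see what the group-and-sum does, here is a small self-contained sketch that applies the same statement to just the ten rows shown in Out[7]. The sums it produces are therefore only for those ten rows, not for the full data set.

```python
import pandas

# The ten rows displayed in Out[7]: (identification string, count, type).
rows = [
    ("RomPager/4.07 UPnP/1.0", 37162, "rompager"),
    ("Apache", 27085, "apache"),
    ("AkamaiGHost", 25167, "akamai"),
    ("Microsoft-IIS/6.0", 14432, "iis"),
    ("micro_httpd", 10862, "micro_httpd"),
    ("Microsoft-IIS/7.5", 8838, "iis"),
    ("Apache/2.2.3 (CentOS)", 8336, "apache"),
    ("GoAhead-Webs", 7807, "other"),
    ("nginx/1.0.11", 5128, "nginx"),
    ("Microsoft-IIS/7.0", 3343, "iis"),
]
df = pandas.DataFrame(
    [(count, wtype) for _, count, wtype in rows],
    index=[name for name, _, _ in rows],
    columns=["wcount", "wtype"],
)

# All rows sharing a wtype collapse into one row holding the summed wcount.
ndf = df.groupby("wtype").sum()
print(ndf)
```

The two Apache rows collapse into a single 'apache' row and the three IIS rows into a single 'iis' row; the result is indexed by wtype.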
I can pass the DataFrame's wcount column directly to matplotlib. Note that the metallic pie chart from my previous post is now part of adsy-python.
In [9]:
figure(1, figsize=(12,12))
ax = axes([0.1, 0.1, 0.8, 0.8])
ax.pie(
    ndf['wcount'],
    explode=[0.02 for x in ndf['wcount']],
    labels=ndf.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=[
        '#FF8A8A',
        '#86BCFF',
        '#33FDC0',
        '#FFFFAA',
        '#A6CAA9',
        '#F0C4F0',
        '#BBEBFF'
    ]
);
mp.metallic_pie(ax)
This is how the summed DataFrame looks:
In [10]:
ip.dh(ndf.sort(columns=['wcount'], ascending=False))
Out[10]:
wtype        wcount
apache        79055
other         60610
rompager      41158
iis           28556
akamai        25167
nginx         13845
micro_httpd   10862
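
The percentages that `autopct` prints on the pie chart are simply each group's share of the total. A short sketch in plain Python, using the counts from the table above, reproduces them:

```python
# Summed counts per web-server type, copied from the table above.
counts = {
    "apache": 79055,
    "other": 60610,
    "rompager": 41158,
    "iis": 28556,
    "akamai": 25167,
    "nginx": 13845,
    "micro_httpd": 10862,
}

total = sum(counts.values())
# Share of each type in percent, which is what the pie slices represent.
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print("%-12s %5.1f%%" % (name, pct))
```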
Let's find out what is in the 'other' group. Group by wtype again:
In [11]:
grp = df.groupby('wtype')
Get the 'other' group, sort it by wcount, and use the original DataFrame to display these entries.
In [12]:
ndf = df.ix[grp.groups['other']].sort(columns=['wcount'], ascending=False)
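
The `.ix` indexer used above was removed from later pandas releases, as was `DataFrame.sort`. On a current installation the same selection could be sketched with `.loc` and `sort_values`; the four-row DataFrame below is a stand-in for the real data.

```python
import pandas

# Tiny stand-in for the classified DataFrame.
df = pandas.DataFrame(
    {
        "wcount": [27085, 2703, 7807, 5128],
        "wtype": ["apache", "other", "other", "nginx"],
    },
    index=["Apache", "Microsoft-HTTPAPI/2.0", "GoAhead-Webs", "nginx/1.0.11"],
)

grp = df.groupby("wtype")
# grp.groups maps each wtype to the index labels of its rows, so .loc can
# pull exactly those rows out of the original DataFrame.
others = df.loc[grp.groups["other"]].sort_values("wcount", ascending=False)
```

`grp.get_group("other")` returns the same rows.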
In [13]:
ip.dh(ndf[:10])
Out[13]:
                                    wcount  wtype
GoAhead-Webs                          7807  other
Microsoft-HTTPAPI/2.0                 2703  other
cisco-IOS                             2688  other
NET-DK/1.0                            2629  other
mini_httpd/1.19 19dec2003             2093  other
httpd                                 2083  other
lighttpd/1.4.28                       2020  other
SonicWALL                             1503  other
Mini web server 1.0 ZTE corp 2005.    1465  other
Boa/0.94.14rc21                        978  other
At the end of the table are some of the more exotic web-servers.
In [14]:
ip.dh(ndf[-30:])
Out[14]:
wcount wtype
Crucial Web Hosting 1 other
LANCOM 1611+ 7.58.0045 / 14.11.2008 1 other
iptoX GmbH 1 other
pvparena 1 other
WEBrick/1.3.1 (Ruby/1.8.5/2006-08-25) 1 other
EWS-NIC5/98.41 1 other
VPOP3 Mail Http Server 1 other
BSTNMA-VFTTP-113 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
ArtBlast/3.5.5 1 other
QTSS/5.5.4 (Build/489.0.5; Platform/MacOSX; Release/Update; ) 1 other
LSANCA-VFTTP-155 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
EWS-NIC4/10.26 1 other
SR-S716C2 1 other
Werkzeug/0.8.3 Python/2.6.5 1 other
NWRKNJ-VFTTP-132 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
BT Web Server 1 other
PHLAPA-VFTTP-83 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
kangle/2.9.9 1 other
NVFWS 1 other
kangle/2.9.6 1 other
s2.33.2 1 other
Helix Universal Media Server/15.0.0.289 (win-x86_64-vc10) 1 other
eIDC32 WebServer 1 other
HP HTTP Server; HP Photosmart eStn C510 series - CQ140A; Serial Number: CN08N1N0AU05KN; Zeus Built:Mon Jul 25, 2011 04:08:52PM {ZEP1CN1130AR, ASIC id 0x00320104} 1 other
ECAcc (fcn/40AA) 1 other
Jetty/4.2.27 (Linux/2.4.22-1.2174.nptlsmp i386 java/1.4.1_04) 1 other
ECAcc (tko/1222) 1 other
Cougar/9.01.01.3844 1 other
CISCO IOS 12a Copyright (c) 1995-2002 by Cisco Systems mod_perl/2.0.4 Perl/v5.10.1 1 other
LANCOM 1781A 8.62.0029 / 20.06.2012 1 other