Monday, 14 January 2013

Monte Carlo Web-Server Statistics using Pandas and Matplotlib


I collected the Monte Carlo web-server statistics by connecting to random web-servers and asking each one for its name. I'll blog about this next time; it was quite exciting. I was able to maintain 80'000 concurrent connections on Linux using Tornado's IOLoop, at which point I hit the limit of the upstream bandwidth at home.

Download webservers.json.gz here. The file is a dictionary with each identification string as the key and the number of web-servers that reported it as the value.
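The gzipped file can also be read directly, without unpacking it first. A minimal sketch using only the standard library; the helper name load_counts is my own, and the demo writes a tiny stand-in file (two counts taken from the results further down) so that it runs without the real download:

```python
import gzip
import json
import os
import tempfile


def load_counts(path):
    """Read a gzipped JSON dictionary of {identification string: count}."""
    # "rt" opens the gzip stream in text mode, so json.load sees str, not bytes
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in file for the demo; point load_counts at webservers.json.gz instead
sample = {"Apache": 27085, "AkamaiGHost": 25167}
demo_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    json.dump(sample, handle)

counts = load_counts(demo_path)
print(counts)
```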

Forgive me for using terms like "Monte Carlo": it sounds great as a blog title, and I hope it is not completely and utterly wrong.

In [1]:
import pandas
import adsy.ipython as ip
import re
import io
import adsy.plotenhance as mp
import sys
import json
%pylab inline
('Python', sys.version, 'Pandas', pandas.version.version)
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
Out[1]:
('Python',
 '3.2.3 (default, Oct 19 2012, 19:53:16) \n[GCC 4.7.2]',
 'Pandas',
 '0.10.0')
In [2]:
with io.open("webservers.json", "r") as f:
    wstat = json.load(f)
The file is read into a pandas DataFrame. Version 0.10 of pandas seems to support literal indexes; I believe that is new. I add a column 'wtype' for the most common web-server types.
In [3]:
df = pandas.DataFrame.from_dict(wstat, orient='index')
df.columns = ['wcount']
df = df.sort(columns=['wcount'], ascending=False)
df['wtype'] = "other"
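The cell above targets pandas 0.10. In later pandas releases DataFrame.sort was removed in favour of DataFrame.sort_values, so on a current installation the same steps would look roughly like this. The three sample counts are taken from the results in this post; the variable name sample is mine:

```python
import pandas

# A three-entry stand-in for the full wstat dictionary
sample = {
    "Apache": 27085,
    "RomPager/4.07 UPnP/1.0": 37162,
    "AkamaiGHost": 25167,
}

df = pandas.DataFrame.from_dict(sample, orient="index")
df.columns = ["wcount"]
# sort_values(by=...) replaces the older sort(columns=...)
df = df.sort_values(by="wcount", ascending=False)
df["wtype"] = "other"
print(df)
```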
I always use adsy-python's display_html because it suppresses the creation of vertical scrollbars, which pandas adds by default.
In [4]:
ip.dh(df[:10])
Out[4]:
                        wcount  wtype
RomPager/4.07 UPnP/1.0   37162  other
Apache                   27085  other
AkamaiGHost              25167  other
Microsoft-IIS/6.0        14432  other
micro_httpd              10862  other
Microsoft-IIS/7.5         8838  other
Apache/2.2.3 (CentOS)     8336  other
GoAhead-Webs              7807  other
nginx/1.0.11              5128  other
Microsoft-IIS/7.0         3343  other
I detect the most common web-servers and write the result to the wtype column. In the next step I'll group by this column.
In [5]:
re_rompager = re.compile('rompager', re.IGNORECASE)
re_apache = re.compile('apache', re.IGNORECASE)
re_iis = re.compile('microsoft-iis', re.IGNORECASE)
re_akamai = re.compile('akamai', re.IGNORECASE)
re_nginx = re.compile('nginx', re.IGNORECASE)
re_micro_httpd = re.compile('micro_httpd', re.IGNORECASE)
def getwtype(arg):
    if re_rompager.search(arg.name):
        arg['wtype'] = 'rompager'
    elif re_apache.search(arg.name):
        arg['wtype'] = 'apache'
    elif re_iis.search(arg.name):
        arg['wtype'] = 'iis'
    elif re_akamai.search(arg.name):
        arg['wtype'] = 'akamai'
    elif re_nginx.search(arg.name):
        arg['wtype'] = 'nginx'
    elif re_micro_httpd.search(arg.name):
        arg['wtype'] = 'micro_httpd'
    else:
        arg['wtype'] = 'other'
    return arg
In [6]:
df = df.apply(getwtype, axis=1)
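The if/elif chain can also be written as an ordered list of (pattern, label) pairs, which keeps patterns and labels side by side. This is a sketch of an alternative, not the code that produced the results in this post; WTYPE_PATTERNS and classify are names I made up for it:

```python
import re

# Checked in order; the first matching pattern wins, as in the if/elif chain
WTYPE_PATTERNS = [
    (re.compile("rompager", re.IGNORECASE), "rompager"),
    (re.compile("apache", re.IGNORECASE), "apache"),
    (re.compile("microsoft-iis", re.IGNORECASE), "iis"),
    (re.compile("akamai", re.IGNORECASE), "akamai"),
    (re.compile("nginx", re.IGNORECASE), "nginx"),
    (re.compile("micro_httpd", re.IGNORECASE), "micro_httpd"),
]


def classify(name):
    """Return the wtype label for one identification string."""
    for pattern, label in WTYPE_PATTERNS:
        if pattern.search(name):
            return label
    return "other"


for name in ["RomPager/4.07 UPnP/1.0", "Microsoft-IIS/7.5", "GoAhead-Webs"]:
    print(name, "->", classify(name))
```

On the DataFrame from this post, df['wtype'] = [classify(name) for name in df.index] should fill the column without going through apply row by row.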
In [7]:
ip.dh(df[:10])
Out[7]:
                        wcount  wtype
RomPager/4.07 UPnP/1.0   37162  rompager
Apache                   27085  apache
AkamaiGHost              25167  akamai
Microsoft-IIS/6.0        14432  iis
micro_httpd              10862  micro_httpd
Microsoft-IIS/7.5         8838  iis
Apache/2.2.3 (CentOS)     8336  apache
GoAhead-Webs              7807  other
nginx/1.0.11              5128  nginx
Microsoft-IIS/7.0         3343  iis
Now the beautiful pandas statement: first group by wtype and then sum wcount.
In [8]:
ndf = df.groupby('wtype').sum()
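To see what the statement does on a small scale, here is the same group-and-sum applied to only the ten rows shown in Out[7]. The sums differ from the full result because the sample is tiny; the variable names rows, top10 and small are mine:

```python
import pandas

# The ten rows from Out[7]: (identification string, wcount, wtype)
rows = [
    ("RomPager/4.07 UPnP/1.0", 37162, "rompager"),
    ("Apache", 27085, "apache"),
    ("AkamaiGHost", 25167, "akamai"),
    ("Microsoft-IIS/6.0", 14432, "iis"),
    ("micro_httpd", 10862, "micro_httpd"),
    ("Microsoft-IIS/7.5", 8838, "iis"),
    ("Apache/2.2.3 (CentOS)", 8336, "apache"),
    ("GoAhead-Webs", 7807, "other"),
    ("nginx/1.0.11", 5128, "nginx"),
    ("Microsoft-IIS/7.0", 3343, "iis"),
]

top10 = pandas.DataFrame(
    [(count, wtype) for _, count, wtype in rows],
    index=[name for name, _, _ in rows],
    columns=["wcount", "wtype"],
)

# One row per wtype, with the wcount values of that group added up
small = top10.groupby("wtype").sum()
print(small)
```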
I can pass the DataFrame's wcount column directly to matplotlib. Note that the metallic pie chart from my previous post is now part of adsy-python.
In [9]:
figure(1, figsize=(12,12))
ax = axes([0.1, 0.1, 0.8, 0.8])
ax.pie(
    ndf['wcount'],
    explode=[0.02 for x in ndf['wcount']],
    labels=ndf.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=[
        '#FF8A8A',
        '#86BCFF',
        '#33FDC0',
        '#FFFFAA',
        '#A6CAA9',
        '#F0C4F0',
        '#BBEBFF'
    ]
);
mp.metallic_pie(ax)
This is how the summed DataFrame looks:
In [10]:
ip.dh(ndf.sort(columns=['wcount'], ascending=False))
Out[10]:
             wcount
wtype
apache        79055
other         60610
rompager      41158
iis           28556
akamai        25167
nginx         13845
micro_httpd   10862
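The percentages in the pie chart follow directly from these sums. A quick check in plain Python, with the numbers copied from Out[10]; the names totals, grand_total and shares are mine:

```python
# wcount per wtype, copied from Out[10]
totals = {
    "apache": 79055,
    "other": 60610,
    "rompager": 41158,
    "iis": 28556,
    "akamai": 25167,
    "nginx": 13845,
    "micro_httpd": 10862,
}

grand_total = sum(totals.values())
# Share of each wtype in percent, the quantity autopct formats on each wedge
shares = {wtype: 100.0 * count / grand_total for wtype, count in totals.items()}

for wtype, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print("%-12s %5.1f%%" % (wtype, share))
```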
Let's find out what is in the 'other' group. Group by wtype again:
In [11]:
grp = df.groupby('wtype')
Get the row labels of the 'other' group, use them to select those entries from the original DataFrame, and sort the result by wcount.
In [12]:
ndf = df.ix[grp.groups['other']].sort(columns=['wcount'], ascending=False)
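Both .ix and DataFrame.sort are gone from later pandas releases. On a current installation the same selection could be sketched with .loc and sort_values; the five sample rows come from the results in this post, and the names sample and others are mine:

```python
import pandas

sample = pandas.DataFrame(
    {
        "wcount": [27085, 2703, 7807, 5128, 2688],
        "wtype": ["apache", "other", "other", "nginx", "other"],
    },
    index=[
        "Apache",
        "Microsoft-HTTPAPI/2.0",
        "GoAhead-Webs",
        "nginx/1.0.11",
        "cisco-IOS",
    ],
)

grp = sample.groupby("wtype")
# grp.groups maps each wtype to the row labels in that group;
# .loc selects those rows by label, as .ix did in the original cell
others = sample.loc[grp.groups["other"]].sort_values(by="wcount", ascending=False)
print(others)
```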
In [13]:
ip.dh(ndf[:10])
Out[13]:
                                    wcount  wtype
GoAhead-Webs                          7807  other
Microsoft-HTTPAPI/2.0                 2703  other
cisco-IOS                             2688  other
NET-DK/1.0                            2629  other
mini_httpd/1.19 19dec2003             2093  other
httpd                                 2083  other
lighttpd/1.4.28                       2020  other
SonicWALL                             1503  other
Mini web server 1.0 ZTE corp 2005.    1465  other
Boa/0.94.14rc21                        978  other
At the end of the table are some of the more exotic web-servers.
In [14]:
ip.dh(ndf[-30:])
Out[14]:
wcount wtype
Crucial Web Hosting 1 other
LANCOM 1611+ 7.58.0045 / 14.11.2008 1 other
iptoX GmbH 1 other
pvparena 1 other
WEBrick/1.3.1 (Ruby/1.8.5/2006-08-25) 1 other
EWS-NIC5/98.41 1 other
VPOP3 Mail Http Server 1 other
BSTNMA-VFTTP-113 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
ArtBlast/3.5.5 1 other
QTSS/5.5.4 (Build/489.0.5; Platform/MacOSX; Release/Update; ) 1 other
LSANCA-VFTTP-155 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
EWS-NIC4/10.26 1 other
SR-S716C2 1 other
Werkzeug/0.8.3 Python/2.6.5 1 other
NWRKNJ-VFTTP-132 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
BT Web Server 1 other
PHLAPA-VFTTP-83 (12.1.1 patch-0.3 [BuildId 14015]) 1 other
kangle/2.9.9 1 other
NVFWS 1 other
kangle/2.9.6 1 other
s2.33.2 1 other
Helix Universal Media Server/15.0.0.289 (win-x86_64-vc10) 1 other
eIDC32 WebServer 1 other
HP HTTP Server; HP Photosmart eStn C510 series - CQ140A; Serial Number: CN08N1N0AU05KN; Zeus Built:Mon Jul 25, 2011 04:08:52PM {ZEP1CN1130AR, ASIC id 0x00320104} 1 other
ECAcc (fcn/40AA) 1 other
Jetty/4.2.27 (Linux/2.4.22-1.2174.nptlsmp i386 java/1.4.1_04) 1 other
ECAcc (tko/1222) 1 other
Cougar/9.01.01.3844 1 other
CISCO IOS 12a Copyright (c) 1995-2002 by Cisco Systems mod_perl/2.0.4 Perl/v5.10.1 1 other
LANCOM 1781A 8.62.0029 / 20.06.2012 1 other
