I collected the Monte Carlo web-server statistics by connecting to random web-servers and asking each one for its name. I'll blog about this next time; it was quite exciting. I was able to maintain 80'000 concurrent connections on Linux using Tornado's IOLoop, at which point I hit the limit of the upstream bandwidth at home.
Download webservers.json.gz here. The file is a dictionary with each identification string as key and the count of web-servers found as value.
Forgive me for using terms like "Monte Carlo": it sounds great as a blog title and I hope it is not completely, utterly wrong.
In [1]:
import pandas
import adsy.ipython as ip
import re
import io
import adsy.plotenhance as mp
import sys
import json
%pylab inline
('Python', sys.version, 'Pandas', pandas.__version__)
Out[1]:
In [2]:
wstat = json.load(io.open("webservers.json", "r"))
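The download is gzipped, while the cell above reads an already decompressed webservers.json. As a minimal sketch, the file could also be read without unpacking it first. The helper name load_counts and the sample data are my own illustration and not part of the original notebook.

```python
import gzip
import json
import os
import tempfile


def load_counts(path):
    """Load a {identification string: count} dictionary from .json or .json.gz."""
    # Pick the gzip-aware opener only when the file name ends in .gz.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Tiny stand-in file with made-up counts, so the sketch runs without the download.
sample = {"Apache": 3, "nginx": 2, "RomPager/4.07 UPnP/1.0": 5}
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    json.dump(sample, handle)

loaded = load_counts(sample_path)
```

With the real file this would be called as load_counts("webservers.json.gz"), assuming the download keeps that name.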
The dictionary is loaded into a pandas DataFrame. Version 0.10 of pandas seems to support literal indexes; I believe that is new. I add a column 'wtype' for the most common web-server types.
In [3]:
df = pandas.DataFrame.from_dict(wstat, orient='index')
df.columns = ['wcount']
df = df.sort(columns=['wcount'], ascending=False)
df['wtype'] = "other"
I always use adsy-python's display_html because it suppresses the creation of vertical scrollbars, which pandas adds by default.
In [4]:
ip.dh(df[:10])
Out[4]:
I detect the most common web-servers and write the result to the wtype column. In the next step I'll group by this column.
In [5]:
re_rompager = re.compile('rompager', re.IGNORECASE)
re_apache = re.compile('apache', re.IGNORECASE)
re_iis = re.compile('microsoft-iis', re.IGNORECASE)
re_akamai = re.compile('akamai', re.IGNORECASE)
re_nginx = re.compile('nginx', re.IGNORECASE)
re_micro_httpd = re.compile('micro_httpd', re.IGNORECASE)
def getwtype(arg):
    # arg is one row of the DataFrame; arg.name is its index label,
    # i.e. the identification string of the web-server.
    if re_rompager.search(arg.name):
        arg['wtype'] = 'rompager'
    elif re_apache.search(arg.name):
        arg['wtype'] = 'apache'
    elif re_iis.search(arg.name):
        arg['wtype'] = 'iis'
    elif re_akamai.search(arg.name):
        arg['wtype'] = 'akamai'
    elif re_nginx.search(arg.name):
        arg['wtype'] = 'nginx'
    elif re_micro_httpd.search(arg.name):
        arg['wtype'] = 'micro_httpd'
    else:
        arg['wtype'] = 'other'
    return arg
In [6]:
df = df.apply(getwtype, axis=1)
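The if/elif chain can also be written as an ordered table of patterns, where the first match wins. This is only a sketch of the same classification on plain strings. The names PATTERNS, COMPILED and classify are hypothetical, and the example identification strings are made up for illustration.

```python
import re

# Ordered (pattern, label) pairs; the first match wins, as in an if/elif chain.
PATTERNS = [
    ("rompager", "rompager"),
    ("apache", "apache"),
    ("microsoft-iis", "iis"),
    ("akamai", "akamai"),
    ("nginx", "nginx"),
    ("micro_httpd", "micro_httpd"),
]
COMPILED = [(re.compile(pattern, re.IGNORECASE), label) for pattern, label in PATTERNS]


def classify(name):
    """Return the web-server type for one identification string."""
    for regex, label in COMPILED:
        if regex.search(name):
            return label
    return "other"


# Made-up identification strings, not taken from the collected data.
examples = [
    "Apache/2.2.22 (Debian)",
    "Microsoft-IIS/7.5",
    "RomPager/4.07 UPnP/1.0",
    "lighttpd/1.4.28",
]
labels = [classify(name) for name in examples]
```

Assuming the DataFrame index holds the identification strings, the column could then be filled with df['wtype'] = [classify(name) for name in df.index], which avoids building a row object for every entry.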
In [7]:
ip.dh(df[:10])
Out[7]:
Now the beautiful pandas statement: first group by wtype and then sum wcount.
In [8]:
ndf = df.groupby('wtype').sum()
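For readers new to groupby, here is what the group-and-sum step computes, spelled out in plain Python. The rows are invented numbers, not results from the scan, and the percentage format mirrors the autopct string used for the pie chart.

```python
from collections import defaultdict

# Made-up (identification string, wtype, wcount) rows standing in for the DataFrame.
rows = [
    ("Apache", "apache", 40),
    ("Apache/2.2.3 (CentOS)", "apache", 10),
    ("nginx", "nginx", 15),
    ("lighttpd", "other", 5),
    ("GoAhead-Webs", "other", 2),
]

# Group by wtype and sum wcount within each group.
totals = defaultdict(int)
for _name, wtype, wcount in rows:
    totals[wtype] += wcount

# Share of each group in percent, formatted like autopct='%1.1f%%'.
grand_total = sum(totals.values())
shares = {
    wtype: "%1.1f%%" % (100.0 * count / grand_total)
    for wtype, count in totals.items()
}
```

Each key of totals corresponds to one row of the grouped DataFrame, and shares holds the labels a pie chart would print for those rows.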
I can pass the DataFrame's wcount column directly to matplotlib. Note that the metallic pie chart from my previous post is now part of adsy-python.
In [9]:
figure(1, figsize=(12,12))
ax = axes([0.1, 0.1, 0.8, 0.8])
ax.pie(
    ndf['wcount'],
    explode=[0.02 for x in ndf['wcount']],
    labels=ndf.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=[
        '#FF8A8A',
        '#86BCFF',
        '#33FDC0',
        '#FFFFAA',
        '#A6CAA9',
        '#F0C4F0',
        '#BBEBFF'
    ]
);
mp.metallic_pie(ax)
This is how the summed DataFrame looks:
In [10]:
ip.dh(ndf.sort(columns=['wcount'], ascending=False))
Out[10]:
Let's find out what is in the 'other' group. Group by wtype again:
In [11]:
grp = df.groupby('wtype')
Get the 'other' group, sort it by wcount, and use the original DataFrame to display these entries.
In [12]:
ndf = df.ix[grp.groups['other']].sort(columns=['wcount'], ascending=False)
In [13]:
ip.dh(ndf[:10])
Out[13]:
At the end of the table are some of the more exotic web-servers.
In [14]:
ip.dh(ndf[-30:])
Out[14]:
