Saturday, 15 December 2012

DBpedia in SQLite

When I learned about DBpedia, I wanted to have it installed locally, I read the tutorials on sparql and how to install DBpedia: that has to be simpler. I kind of worship Simple and I didn't want to learn yet another query language. So I hacked an SQLite import script for DBpedia types and infobox-properties in one evening. The script takes about 50 hours to read instance_types_en.nt and infobox_properties_en.nt. It creates a table per type and a table per property, after tables are created everything gets an index, the whole database is analyzed and finally vacuumed.

All the fields are parsed, URLs and other noise is stripped away. I also parse the values, the script supports string, date, int and float. What type the script has recognized is stored in the field type_hint of every prop_* table. The parsed fields have the suffix _s. There is the resources ID in the field resource and a stripped version in the field resource_s. The original value is in the field value and the parsed value in the field value_s. The value field is SQLite TEXT, actually in SQLite everything is TEXT anyway. You can cast to the correct type if you need sorting. The parsing is NOT completely error-proof, actually it's quite sloppy, but it's one of those good enough solutions. You can always fall back to the original value, although these usually don't allow sorting. Dates are stored in the python toordinal() format.

The database structure doesn't allow displaying all properties of one resource, that's where I trade functionality for simplicity. But you can enter a resource's URL in your browser and get all properties or you can enter the name in Wikipedia and get the article.

The DB doesn't enforce uniqueness or other constraints, duplicated data in Wikipedia will end-up in the database. So you have to hide it by yourself, often the duplicated data has different values, humans made Wikipedia after all.

Below in the chapter Navigating in DBpedia you can learn how to inspect the tables of the database. I also recommend using Wikipedia and DBpedia to find resources, the name in the URL of a wikipedia article is the resource ID. You can use this name to find the DBpedia resource and that page will show you what type this resource has.

Links (these might not work forever, sorry):


type tables structure:
property tables structure:
.
In [ ]:
from adsy.ipython import *
import sqlite3
If you need adsy.ipython: git clone git://github.com/adfinis-sygroup/adsy-python.git adsy
In [2]:
db = sqlite3.connect("dbpedia.sqlite")
db.execute('PRAGMA cache_size   = -1048576');
Analyizing the database improved query performance.
In [76]:
dh(db.execute("""
    SELECT
        count(*)
    FROM
         sqlite_stat1
"""))
Out[76]:
count(*)
280029
In [14]:
dh(db.execute("""
    SELECT
        count(*)
    FROM
         sqlite_stat3
"""))
Out[14]:
count(*)
1769580

Countries by size

In [15]:
dh(db.execute("""
    SELECT DISTINCT
        c.resource_s,
        a.value_s AS areaKm
    FROM
        `type_Country` c
    JOIN
         prop_callingCode cc
    USING("resource_s")
    JOIN
         prop_areaKm a
    USING("resource_s")
    ORDER BY
        CAST(a.value_s AS INTEGER) DESC
"""))
Out[15]:
resource_sareaKm
African_Union29757900
Russia17075400
Canada9984670
United_States9826675
China9640821
Brazil8514877
Australia7617930
Australian_Antarctic_Territory5896500
India3287263
Argentina2780400
Kazakhstan2724900
Algeria2381741
Democratic_Republic_of_the_Congo2345409
Saudi_Arabia2250000
Denmark2220093
Greenland2166086
Mexico1972550
Indonesia1919440
Sudan1886068
Libya1759541
Iran1648195
Mongolia1564116
Peru1285216
Chad1284000
Niger1267000
Angola1246700
Mali1240192
South_Africa1221037
Colombia1141748
Ethiopia1104300


Wallis_and_Futuna264
Saint_Kitts_and_Nevis261
Niue260
Akrotiri_and_Dhekelia254
Saint_Pierre_and_Miquelon242
Cook_Islands240
Tristan_da_Cunha207
American_Samoa199
Marshall_Islands181
Aruba179
Easter_Island164
Liechtenstein160
British_Virgin_Islands153
Christmas_Island135
Saint_Helena122
Jersey119
Rodrigues109
Montserrat102
Wallis_(island)96
Nevis93
Anguilla91
Ascension_Island88
Guernsey78
San_Marino61
Bermuda53
Collectivity_of_Saint_Martin53
Pitcairn_Islands47
Rotuma46
Carriacou_and_Petite_Martinique38
Norfolk_Island35
Sint_Maarten34
Macau30
Tuvalu26
Nauru21
Sint_Eustatius21
Saint_Barth%C3%A9lemy21
Cocos_(Keeling)_Islands14
Avatele14
Saba13
%C3%8Eles_des_Saintes13
Tokelau10
Gibraltar7
Sark5
Monaco2
Vatican_City0
State_of_Palestine"Palestine: 1 6,020"

Oldest Elvises

In [20]:
dh(db.execute("""
    SELECT DISTINCT
        b.resource_s,
        ((d.value_s - b.value_s) / 365) AS age,
        d.type_hint AS deathType,
        d.value AS deathDate,
        d.value_s AS parsedDeath,
        b.type_hint AS birthType,
        b.value AS birthDate,
        b.value_s as parsedBirth
    FROM
        prop_birthDate b
    JOIN
        prop_deathDate d
    USING("resource_s")
    WHERE
        b.resource_s like "elvis%"
        AND
        b.property = "<http://dbpedia.org/property/birthDate>"
        AND
        d.property = "<http://dbpedia.org/property/deathDate>"
        AND
        b.type_hint = "date"
        AND
        d.type_hint = "date"
    ORDER BY
        age DESC
    LIMIT 10
"""))
Out[20]:
resource_sagedeathTypedeathDateparsedDeathbirthTypebirthDateparsedBirth
Elvis_Jacob_Stahr,_Jr.82date"1998-11-11"^^<http://www.w3.org/2001/XMLSchema#date>729704date"1916-03-09"^^<http://www.w3.org/2001/XMLSchema#date>699507
Elvis_Presley42date"1977-08-16"^^<http://www.w3.org/2001/XMLSchema#date>721947date"1935-01-08"^^<http://www.w3.org/2001/XMLSchema#date>706386

Oldest Johns (filtering for sensible data)

In [79]:
dh(db.execute("""
    SELECT DISTINCT
        b.resource_s,
        ((d.value_s - b.value_s) / 365) AS age,
        d.type_hint AS deathType,
        d.value AS deathDate,
        d.value_s AS parsedDeath,
        b.type_hint AS birthType,
        b.value AS birthDate,
        b.value_s as parsedBirth
    FROM
        prop_birthDate b
    JOIN
        prop_deathDate d
    USING("resource_s")
    WHERE
        b.resource_s like "John%"
        AND
        birthDate like '%#date>'
        AND
        deathDate like '%#date>'
        AND
        b.property = "<http://dbpedia.org/property/birthDate>"
        AND
        d.property = "<http://dbpedia.org/property/deathDate>"
        AND
        birthType = "date"
        AND
        deathType = "date"
    ORDER BY
        age DESC
    LIMIT 30
"""))
Out[79]:
resource_sagedeathTypedeathDateparsedDeathbirthTypebirthDateparsedBirth
John_Painter_(supercentenarian)112date"2001-03-01"^^<http://www.w3.org/2001/XMLSchema#date>730545date"1888-09-20"^^<http://www.w3.org/2001/XMLSchema#date>689476
John_Babcock109date"2010-02-18"^^<http://www.w3.org/2001/XMLSchema#date>733821date"1900-07-23"^^<http://www.w3.org/2001/XMLSchema#date>693799
John_J._Doles109date"2004-04-16"^^<http://www.w3.org/2001/XMLSchema#date>731687date"1895-04-26"^^<http://www.w3.org/2001/XMLSchema#date>691885
John_Whitfield_Bunn_and_Jacob_Bunn106date"1920-06-07"^^<http://www.w3.org/2001/XMLSchema#date>701058date"1814-03-18"^^<http://www.w3.org/2001/XMLSchema#date>662261
John_Gray_(American_Revolutionary_War)104date"1868-03-29"^^<http://www.w3.org/2001/XMLSchema#date>681996date"1764-01-06"^^<http://www.w3.org/2001/XMLSchema#date>643928
John_Aikman_Stewart104date"1926-12-18"^^<http://www.w3.org/2001/XMLSchema#date>703443date"1822-08-26"^^<http://www.w3.org/2001/XMLSchema#date>665344
John_Kenley103date"2009-10-23"^^<http://www.w3.org/2001/XMLSchema#date>733703date"1906-02-20"^^<http://www.w3.org/2001/XMLSchema#date>695837
John_Hubert_Craigie101date"1989-02-26"^^<http://www.w3.org/2001/XMLSchema#date>726159date"1887-12-08"^^<http://www.w3.org/2001/XMLSchema#date>689189
John_Wingblade101date"1984-05-14"^^<http://www.w3.org/2001/XMLSchema#date>724410date"1883-02-26"^^<http://www.w3.org/2001/XMLSchema#date>687443
John_Daley_(baseball)101date"1988-08-08"^^<http://www.w3.org/2001/XMLSchema#date>725957date"1887-05-25"^^<http://www.w3.org/2001/XMLSchema#date>688992
John_N._Heiskell100date"1972-12-28"^^<http://www.w3.org/2001/XMLSchema#date>720255date"1872-11-02"^^<http://www.w3.org/2001/XMLSchema#date>683675
John_L._Hines100date"1968-10-13"^^<http://www.w3.org/2001/XMLSchema#date>718718date"1868-05-21"^^<http://www.w3.org/2001/XMLSchema#date>682049
John_William_Finn100date"2010-05-27"^^<http://www.w3.org/2001/XMLSchema#date>733919date"1909-07-23"^^<http://www.w3.org/2001/XMLSchema#date>697086
John_C._Zimmerman,_Sr.100date"1935-10-26"^^<http://www.w3.org/2001/XMLSchema#date>706677date"1835-05-12"^^<http://www.w3.org/2001/XMLSchema#date>669986
John_Richard_Anthony_Oldfield100date"1999-12-11"^^<http://www.w3.org/2001/XMLSchema#date>730099date"1899-07-05"^^<http://www.w3.org/2001/XMLSchema#date>693416
John_Chessell_Buckler100date"1894-01-10"^^<http://www.w3.org/2001/XMLSchema#date>691414date"1793-12-08"^^<http://www.w3.org/2001/XMLSchema#date>654857
John_Nance_Garner99date"1967-11-07"^^<http://www.w3.org/2001/XMLSchema#date>718377date"1868-11-22"^^<http://www.w3.org/2001/XMLSchema#date>682234
John_Wooden99date"2010-06-04"^^<http://www.w3.org/2001/XMLSchema#date>733927date"1910-10-14"^^<http://www.w3.org/2001/XMLSchema#date>697534
John_R._Steelman99date"1999-07-14"^^<http://www.w3.org/2001/XMLSchema#date>729949date"1900-06-23"^^<http://www.w3.org/2001/XMLSchema#date>693769
John_Tileston_Edsall99date"2002-06-12"^^<http://www.w3.org/2001/XMLSchema#date>731013date"1902-11-03"^^<http://www.w3.org/2001/XMLSchema#date>694632
John_Hollison99date"1969-08-19"^^<http://www.w3.org/2001/XMLSchema#date>719028date"1870-03-03"^^<http://www.w3.org/2001/XMLSchema#date>682700
Johnny_Williams_(blues_musician)99date"2006-03-06"^^<http://www.w3.org/2001/XMLSchema#date>732376date"1906-05-15"^^<http://www.w3.org/2001/XMLSchema#date>695921
John_Spencer_Hardy99date"2012-05-01"^^<http://www.w3.org/2001/XMLSchema#date>734624date"1913-05-07"^^<http://www.w3.org/2001/XMLSchema#date>698470
John_Cramer_(Australian_politician)98date"1994-05-18"^^<http://www.w3.org/2001/XMLSchema#date>728066date"1896-02-18"^^<http://www.w3.org/2001/XMLSchema#date>692183
John_R._Alison98date"2011-06-06"^^<http://www.w3.org/2001/XMLSchema#date>734294date"1912-11-21"^^<http://www.w3.org/2001/XMLSchema#date>698303
John_Aloysius_Morgan98date"2008-05-21"^^<http://www.w3.org/2001/XMLSchema#date>733183date"1909-10-09"^^<http://www.w3.org/2001/XMLSchema#date>697164
John_Weld98date"2003-06-14"^^<http://www.w3.org/2001/XMLSchema#date>731380date"1905-02-24"^^<http://www.w3.org/2001/XMLSchema#date>695476
John_Sanford_(author)98date"2003-03-05"^^<http://www.w3.org/2001/XMLSchema#date>731279date"1904-05-31"^^<http://www.w3.org/2001/XMLSchema#date>695207
John_Gill_(coach)98date"1997-03-04"^^<http://www.w3.org/2001/XMLSchema#date>729087date"1898-11-27"^^<http://www.w3.org/2001/XMLSchema#date>693196
John_Kenneth_Galbraith97date"2006-04-29"^^<http://www.w3.org/2001/XMLSchema#date>732430date"1908-10-15"^^<http://www.w3.org/2001/XMLSchema#date>696805

Navigating in DBpedia

Is there a type for vehicels?
In [41]:
dh(db.execute("""
    SELECT
        *
    FROM
        sqlite_master
    WHERE
        name like "type_veh%"
    LIMIT 10
"""))
Out[41]:
typenametbl_namerootpagesql
No, wikipedia calls it human powered-transport: http://en.wikipedia.org/wiki/Vehicle
In [45]:
dh(db.execute("""
    SELECT
        *
    FROM
        sqlite_master
    WHERE
        name like "type_%transport%"
    LIMIT 10
"""))
Out[45]:
typenametbl_namerootpagesql
tabletype_MeanOfTransportationtype_MeanOfTransportation152CREATE TABLE `type_MeanOfTransportation` ( `type_MeanOfTransportationID` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, `resource` TEXT, `type` TEXT, `resource_s` TEXT )
tabletype_RouteOfTransportationtype_RouteOfTransportation201CREATE TABLE `type_RouteOfTransportation` ( `type_RouteOfTransportationID` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, `resource` TEXT, `type` TEXT, `resource_s` TEXT )
indextype_MeanOfTransportation_resource_idxtype_MeanOfTransportation71278CREATE INDEX `type_MeanOfTransportation_resource_idx` ON `type_MeanOfTransportation` (`resource`)
indextype_MeanOfTransportation_type_idxtype_MeanOfTransportation71280CREATE INDEX `type_MeanOfTransportation_type_idx` ON `type_MeanOfTransportation` (`type`)
indextype_MeanOfTransportation_resource_s_idxtype_MeanOfTransportation71281CREATE INDEX `type_MeanOfTransportation_resource_s_idx` ON `type_MeanOfTransportation` (`resource_s`)
indextype_RouteOfTransportation_resource_idxtype_RouteOfTransportation71407CREATE INDEX `type_RouteOfTransportation_resource_idx` ON `type_RouteOfTransportation` (`resource`)
indextype_RouteOfTransportation_type_idxtype_RouteOfTransportation71408CREATE INDEX `type_RouteOfTransportation_type_idx` ON `type_RouteOfTransportation` (`type`)
indextype_RouteOfTransportation_resource_s_idxtype_RouteOfTransportation71409CREATE INDEX `type_RouteOfTransportation_resource_s_idx` ON `type_RouteOfTransportation` (`resource_s`)
No, that doesn't seem to be consistant to DBpedia.

Lets find a specific car model on DBpedia and navigate from there:
http://www.google.ch/search?q=dbpedia+BMW+3-Series
  • Ah, mean of transportation was correct: http://live.dbpedia.org/page/BMW_3_Series_(E21)
  • Mean of transportation has the property wheelbase. lets find the longest! (update: I just learned that in DBpeadia types don't define the properties as in OO languages)
In [49]:
dh(db.execute("""
    SELECT
        *
    FROM
        type_MeanOfTransportation
    LIMIT 10
"""))
Out[49]:
type_MeanOfTransportationIDresourcetyperesource_s
1<http://dbpedia.org/resource/Airbus_A300><http://dbpedia.org/ontology/MeanOfTransportation>Airbus_A300
2<http://dbpedia.org/resource/Ariane_5><http://dbpedia.org/ontology/MeanOfTransportation>Ariane_5
3<http://dbpedia.org/resource/Lockheed_AC-130><http://dbpedia.org/ontology/MeanOfTransportation>Lockheed_AC-130
4<http://dbpedia.org/resource/CIM-10_Bomarc><http://dbpedia.org/ontology/MeanOfTransportation>CIM-10_Bomarc
5<http://dbpedia.org/resource/Boeing_767><http://dbpedia.org/ontology/MeanOfTransportation>Boeing_767
6<http://dbpedia.org/resource/North_American_B-25_Mitchell><http://dbpedia.org/ontology/MeanOfTransportation>North_American_B-25_Mitchell
7<http://dbpedia.org/resource/Northrop_Grumman_B-2_Spirit><http://dbpedia.org/ontology/MeanOfTransportation>Northrop_Grumman_B-2_Spirit
8<http://dbpedia.org/resource/Boeing_747><http://dbpedia.org/ontology/MeanOfTransportation>Boeing_747
9<http://dbpedia.org/resource/Benjamin_Franklin_class_submarine><http://dbpedia.org/ontology/MeanOfTransportation>Benjamin_Franklin_class_submarine
10<http://dbpedia.org/resource/Blue_Streak_(missile)><http://dbpedia.org/ontology/MeanOfTransportation>Blue_Streak_(missile)
There are planes, do they have a wheelbase? No, not all resources of the type are enforced to have wheelbase in DBpedia.
In [58]:
dh(db.execute("""
    SELECT
        wb.resource_s,
        wb.value_s AS wheelbase,
        wb.value AS unparsedValue
    FROM
        prop_wheelbase wb
    ORDER BY
        wb.value_s DESC
    LIMIT 10
"""))
Out[58]:
resource_swheelbaseunparsedValue
Honda_Magna99600.0"99600.0"^^<http://dbpedia.org/datatype/second>
DSB_Class_EA9940.0"9940.0"^^<http://dbpedia.org/datatype/millimetre>
Ford_Tempo99.9"99.9"^^<http://dbpedia.org/datatype/inch>
Ford_Tempo99.9"99.9"^^<http://dbpedia.org/datatype/inch>
Honda_EV_Plus99.6"99.6"^^<http://dbpedia.org/datatype/inch>
Aston_Martin_DB_Mark_III99.0"99.0"^^<http://dbpedia.org/datatype/inch>
Ducati_989898"9898"^^<http://www.w3.org/2001/XMLSchema#int>
Mitsubishi_Eclipse98.8"98.8"^^<http://dbpedia.org/datatype/inch>
Buick_Reatta98.5"98.5"^^<http://dbpedia.org/datatype/inch>
Mercury_Tracer98.4"98.4"^^<http://dbpedia.org/datatype/inch>
BTW not joining to the type is cheaper and in this case okay, I suppose no other type has a wheelbase.
Units inch and milimeter are mixed, we take those and ignore the rest.
In [75]:
limit = 20
dh(db.execute("""
    SELECT * FROM (
        SELECT
            wb.resource,
            wb.value_s * 25.4 / 1000 AS wheelbase,
            wb.value AS unparsedValue
        FROM
            prop_wheelbase wb
        WHERE
            unparsedValue like "%inch>"
        ORDER BY
            wheelbase DESC
        LIMIT :limit
    )
    UNION SELECT * FROM (
        SELECT
            wb.resource,
            wb.value_s / 1000 AS wheelbase,
            wb.value AS unparsedValue
        FROM
            prop_wheelbase wb
        WHERE
            unparsedValue like "%millimetre>"
        ORDER BY
            wheelbase DESC
        LIMIT :limit
    )
    ORDER BY
        wheelbase DESC
    LIMIT :limit
        
""", locals()))
Out[75]:
resourcewheelbaseunparsedValue
<http://dbpedia.org/resource/DSB_Class_EA>9.94"9940.0"^^<http://dbpedia.org/datatype/millimetre>
<http://dbpedia.org/resource/Dodge_Ram>5.189220000000001"204.3"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Dodge_Ram>4.88442"192.3"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Dodge_Ram>4.78282"188.3"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Dodge_Ram>4.27482"168.3"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Cadillac_Series_70>4.19862"165.3"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Dodge_Ram>4.173220000000001"164.3"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Thames_Trader>4.064"160.0"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Ford_Super_Duty>4.013199999999999"158.0"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Ford_Super_Duty>3.9674799999999997"156.2"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Checker_Aerobus>3.9242999999999997"154.5"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Thames_Trader>3.8608"152.0"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Cadillac_Series_70>3.8481"151.5"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Hispano-Suiza_J12>3.81"150.0"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Rolls-Royce_Phantom_II>3.81"150.0"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Cadillac_Series_70>3.80492"149.8"^^<http://dbpedia.org/datatype/inch>
<http://dbpedia.org/resource/Karsan_J10>3.3"3300.0"^^<http://dbpedia.org/datatype/millimetre>
<http://dbpedia.org/resource/Citro%C3%ABn_Traction_Avant>3.27"3270.0"^^<http://dbpedia.org/datatype/millimetre>
<http://dbpedia.org/resource/Peugeot_402>3.15"3150.0"^^<http://dbpedia.org/datatype/millimetre>
<http://dbpedia.org/resource/Citro%C3%ABn_Traction_Avant>3.09"3090.0"^^<http://dbpedia.org/datatype/millimetre>
Wheelbase is in meters, the only sane length unit :-) Note: the multiple values from the same resource are because the infobox describes the wheelbase of different sub-models.

What the hell is a DSB_Class_EA?
  • http://dbpedia.org/page/DSB_Class_EA
  • http://en.wikipedia.org/wiki/DSB_Class_EA
It's a lokomotive
That's cheating!

Most queries against my dbpedia.sqlite need manual filtering or additional filters to remove noise and garbage.

No comments:

Post a Comment