When I learned about DBpedia, I wanted to have it installed locally, I read the tutorials on sparql and how to install DBpedia: that has to be simpler. I kind of worship Simple and I didn't want to learn yet another query language. So I hacked an SQLite import script for DBpedia types and infobox-properties in one evening. The script takes about 50 hours to read instance_types_en.nt and infobox_properties_en.nt. It creates a table per type and a table per property, after tables are created everything gets an index, the whole database is analyzed and finally vacuumed.
All the fields are parsed, URLs and other noise is stripped away. I also parse the values, the script supports string, date, int and float. What type the script has recognized is stored in the field type_hint of every prop_* table. The parsed fields have the suffix _s. There is the resources ID in the field resource and a stripped version in the field resource_s. The original value is in the field value and the parsed value in the field value_s. The value field is SQLite TEXT, actually in SQLite everything is TEXT anyway. You can cast to the correct type if you need sorting. The parsing is NOT completely error-proof, actually it's quite sloppy, but it's one of those good enough solutions. You can always fall back to the original value, although these usually don't allow sorting. Dates are stored in the python toordinal() format.
The database structure doesn't allow displaying all properties of one resource, that's where I trade functionality for simplicity. But you can enter a resource's URL in your browser and get all properties or you can enter the name in Wikipedia and get the article.
The DB doesn't enforce uniqueness or other constraints, duplicated data in Wikipedia will end-up in the database. So you have to hide it by yourself, often the duplicated data has different values, humans made Wikipedia after all.
Below in the chapter Navigating in DBpedia you can learn how to inspect the tables of the database. I also recommend using Wikipedia and DBpedia to find resources, the name in the URL of a wikipedia article is the resource ID. You can use this name to find the DBpedia resource and that page will show you what type this resource has.
Links (these might not work forever, sorry):
type tables structure:
property tables structure:
.
All the fields are parsed, URLs and other noise is stripped away. I also parse the values, the script supports string, date, int and float. What type the script has recognized is stored in the field type_hint of every prop_* table. The parsed fields have the suffix _s. There is the resources ID in the field resource and a stripped version in the field resource_s. The original value is in the field value and the parsed value in the field value_s. The value field is SQLite TEXT, actually in SQLite everything is TEXT anyway. You can cast to the correct type if you need sorting. The parsing is NOT completely error-proof, actually it's quite sloppy, but it's one of those good enough solutions. You can always fall back to the original value, although these usually don't allow sorting. Dates are stored in the python toordinal() format.
The database structure doesn't allow displaying all properties of one resource, that's where I trade functionality for simplicity. But you can enter a resource's URL in your browser and get all properties or you can enter the name in Wikipedia and get the article.
The DB doesn't enforce uniqueness or other constraints, duplicated data in Wikipedia will end-up in the database. So you have to hide it by yourself, often the duplicated data has different values, humans made Wikipedia after all.
Below in the chapter Navigating in DBpedia you can learn how to inspect the tables of the database. I also recommend using Wikipedia and DBpedia to find resources, the name in the URL of a wikipedia article is the resource ID. You can use this name to find the DBpedia resource and that page will show you what type this resource has.
Links (these might not work forever, sorry):
type tables structure:
property tables structure:
.
In [ ]:
from adsy.ipython import *
import sqlite3
If you need adsy.ipython: git clone git://github.com/adfinis-sygroup/adsy-python.git adsy
In [2]:
db = sqlite3.connect("dbpedia.sqlite")
db.execute('PRAGMA cache_size = -1048576');
Analyizing the database improved query performance.
In [76]:
dh(db.execute("""
SELECT
count(*)
FROM
sqlite_stat1
"""))
Out[76]:
In [14]:
dh(db.execute("""
SELECT
count(*)
FROM
sqlite_stat3
"""))
Out[14]:
Countries by size
In [15]:
dh(db.execute("""
SELECT DISTINCT
c.resource_s,
a.value_s AS areaKm
FROM
`type_Country` c
JOIN
prop_callingCode cc
USING("resource_s")
JOIN
prop_areaKm a
USING("resource_s")
ORDER BY
CAST(a.value_s AS INTEGER) DESC
"""))
Out[15]:
Oldest Elvises
In [20]:
dh(db.execute("""
SELECT DISTINCT
b.resource_s,
((d.value_s - b.value_s) / 365) AS age,
d.type_hint AS deathType,
d.value AS deathDate,
d.value_s AS parsedDeath,
b.type_hint AS birthType,
b.value AS birthDate,
b.value_s as parsedBirth
FROM
prop_birthDate b
JOIN
prop_deathDate d
USING("resource_s")
WHERE
b.resource_s like "elvis%"
AND
b.property = "<http://dbpedia.org/property/birthDate>"
AND
d.property = "<http://dbpedia.org/property/deathDate>"
AND
b.type_hint = "date"
AND
d.type_hint = "date"
ORDER BY
age DESC
LIMIT 10
"""))
Out[20]:
Oldest Johns (filtering for sensible data)
In [79]:
dh(db.execute("""
SELECT DISTINCT
b.resource_s,
((d.value_s - b.value_s) / 365) AS age,
d.type_hint AS deathType,
d.value AS deathDate,
d.value_s AS parsedDeath,
b.type_hint AS birthType,
b.value AS birthDate,
b.value_s as parsedBirth
FROM
prop_birthDate b
JOIN
prop_deathDate d
USING("resource_s")
WHERE
b.resource_s like "John%"
AND
birthDate like '%#date>'
AND
deathDate like '%#date>'
AND
b.property = "<http://dbpedia.org/property/birthDate>"
AND
d.property = "<http://dbpedia.org/property/deathDate>"
AND
birthType = "date"
AND
deathType = "date"
ORDER BY
age DESC
LIMIT 30
"""))
Out[79]:
Navigating in DBpedia
Is there a type for vehicels?
In [41]:
dh(db.execute("""
SELECT
*
FROM
sqlite_master
WHERE
name like "type_veh%"
LIMIT 10
"""))
Out[41]:
No, wikipedia calls it human powered-transport: http://en.wikipedia.org/wiki/Vehicle
In [45]:
dh(db.execute("""
SELECT
*
FROM
sqlite_master
WHERE
name like "type_%transport%"
LIMIT 10
"""))
Out[45]:
No, that doesn't seem to be consistant to DBpedia.
Lets find a specific car model on DBpedia and navigate from there:
http://www.google.ch/search?q=dbpedia+BMW+3-Series
Lets find a specific car model on DBpedia and navigate from there:
http://www.google.ch/search?q=dbpedia+BMW+3-Series
- Ah, mean of transportation was correct: http://live.dbpedia.org/page/BMW_3_Series_(E21)
- Mean of transportation has the property wheelbase. lets find the longest! (update: I just learned that in DBpeadia types don't define the properties as in OO languages)
In [49]:
dh(db.execute("""
SELECT
*
FROM
type_MeanOfTransportation
LIMIT 10
"""))
Out[49]:
There are planes, do they have a wheelbase? No, not all resources of the type are enforced to have wheelbase in DBpedia.
In [58]:
dh(db.execute("""
SELECT
wb.resource_s,
wb.value_s AS wheelbase,
wb.value AS unparsedValue
FROM
prop_wheelbase wb
ORDER BY
wb.value_s DESC
LIMIT 10
"""))
Out[58]:
BTW not joining to the type is cheaper and in this case okay, I suppose no other type has a wheelbase.
Units inch and milimeter are mixed, we take those and ignore the rest.
Units inch and milimeter are mixed, we take those and ignore the rest.
In [75]:
limit = 20
dh(db.execute("""
SELECT * FROM (
SELECT
wb.resource,
wb.value_s * 25.4 / 1000 AS wheelbase,
wb.value AS unparsedValue
FROM
prop_wheelbase wb
WHERE
unparsedValue like "%inch>"
ORDER BY
wheelbase DESC
LIMIT :limit
)
UNION SELECT * FROM (
SELECT
wb.resource,
wb.value_s / 1000 AS wheelbase,
wb.value AS unparsedValue
FROM
prop_wheelbase wb
WHERE
unparsedValue like "%millimetre>"
ORDER BY
wheelbase DESC
LIMIT :limit
)
ORDER BY
wheelbase DESC
LIMIT :limit
""", locals()))
Out[75]:
Wheelbase is in meters, the only sane length unit :-) Note: the multiple values from the same resource are because the infobox describes the wheelbase of different sub-models.
What the hell is a DSB_Class_EA?

That's cheating!
Most queries against my dbpedia.sqlite need manual filtering or additional filters to remove noise and garbage.
What the hell is a DSB_Class_EA?
- http://dbpedia.org/page/DSB_Class_EA
- http://en.wikipedia.org/wiki/DSB_Class_EA
That's cheating!
Most queries against my dbpedia.sqlite need manual filtering or additional filters to remove noise and garbage.
No comments:
Post a Comment