Arc vs Dbpedia... Dbpedia wins
From: paul@devonianfarm.com
Subject: Arc vs Dbpedia... Dbpedia wins
Date: Fri, 3 Oct 2008 15:27:04 -0400 (EDT)
Hello, I'm looking to get the triple sets from

http://dbpedia.org/

into a local RDF store. My software is largely PHP-based, so ARC is a
desirable solution. One of the larger sets (though not the largest) is the
infobox set, which has 32M triples. Long-term I'd like to load all of the
sets (about 100M triples) into a single store, since it doesn't seem
possible to efficiently 'join' between stores.

I ran into a few problems doing this, which I'll talk about.

First of all, there is an incorrect regex in the turtle parser for parsing
strings. Somewhere around 50k triples I ran into a string that looked like
"\\\\" which killed the parser. I got rid of that string from the input
file, and found it got killed by "... some text.. \\". I ran grep -v to zap
those lines out.

Loading was pretty slow, but acceptable (I only have to do it once). It
started out around 1000 triples/sec and slowed to around 300/sec by about
1M triples. 24 hours later I had about 13M triples loaded. A few hours
later, I hit the wall: most of the identifiers in ARC are mediumints, so
the system has a 16M tuple limit.

Performance was mostly I/O bound. The machine consistently saw the disk
100% busy, mostly with write activity. For the record, it's running on a
late-model dual-core server with 4G of RAM, RHEL 5 64-bit. The disk system
is a RAID 1 on which MySQL performance is less than stellar.

It looks like a 5 GB .nt file is going to puff up to an 8 GB MySQL
directory. I can accept that. In fact, once I get it all loaded, I'm going
to run it through the MySQL table compressor, which ought to cut the size
by at least 50%.

------

Here's the plan going forward:

(1) Performance test at 16M tuples (is it even worth loading more?)
(2) Get a faster disk dedicated to this kind of work
(3) Fix the bad regex
(4) Expand mediumint -> plain int
(5) Investigate which indices are used in the build process; drop all
    other indices, and rebuild them when the build process is done

Does this make sense?
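
As a rough sanity check on the 16M limit and on step (4), here is a minimal Python sketch. It assumes the id columns are unsigned MEDIUMINT (which matches the roughly 16M ceiling described above) and, as a simplification, that the store needs at least one id per triple. The function names are illustrative only and are not part of ARC.

```python
# Byte widths of the two MySQL integer types under discussion.
MEDIUMINT_BYTES = 3
INT_BYTES = 4


def max_unsigned(num_bytes):
    # Largest value an unsigned integer of this byte width can hold.
    return 2 ** (8 * num_bytes) - 1


def fits(id_count, num_bytes):
    # True if id_count distinct ids (numbered from 1) fit in the column.
    return id_count <= max_unsigned(num_bytes)


# Approximate dataset sizes quoted in the message.
INFOBOX_TRIPLES = 32_000_000
ALL_TRIPLES = 100_000_000

for label, count in (("infobox", INFOBOX_TRIPLES), ("all sets", ALL_TRIPLES)):
    print(label,
          "fits mediumint:", fits(count, MEDIUMINT_BYTES),
          "fits int:", fits(count, INT_BYTES))
```

Under those assumptions neither the 32M infobox set nor the full set of about 100M triples fits in a MEDIUMINT id, while a plain unsigned INT leaves ample headroom.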
""" ;
ns1:returnPath "<paul@devonianfarm.com>" ;
ns1:xOriginalTo "arc-dev@semsol.org" ;
ns1:deliveredTo "web11p1@p15192371.pureserver.info" ;
ns1:received """by webmail.mailtrust.com
(Authenticated sender: paul@devonianfarm.com, from: paul@devonianfarm.com)
with HTTP; Fri, 3 Oct 2008 15:27:04 -0400 (EDT)""" ;
ns1:date "Fri, 3 Oct 2008 15:27:04 -0400 (EDT)" ;
ns1:subject "Arc vs Dbpedia... Dbpedia wins" ;
ns1:from "paul@devonianfarm.com" ;
ns1:to "arc-dev@semsol.org" ;
ns1:mIMEVersion "1.0" ;
ns1:contentType 'multipart/alternative;boundary="----=_20081003152704_64205"' ;
ns1:importance "Normal" ;
ns1:xPriority "3 (Normal)" ;
ns1:xType "html" ;
ns1:messageID "<1223062024.763326905@192.168.1.201>" ;
ns1:xMailer "webmail6.8" ;
ns1:xSpamCheckerVersion """SpamAssassin 2.64 (2004-01-11) on
p15192371.pureserver.info""" ;
ns1:xSpamLevel "" ;
ns1:xSpamStatus """No, hits=0.3 required=5.0 tests=AWL,HTML_MESSAGE,NO_REAL_NAME
autolearn=no version=2.64