Mailing list ARC-DEV: Archives

Arc vs Dbpedia... Dbpedia wins

From: paul@devonianfarm.com
Subject: Arc vs Dbpedia...  Dbpedia wins
Date: Fri, 3 Oct 2008 15:27:04 -0400 (EDT)


Hello, I'm looking to get the triple sets from

http://dbpedia.org/

into a local RDF store. My software is largely PHP-based, so ARC is a desirable solution. One of the largest (but not the largest) sets is the infobox set, which has 32M triples. Long-term I'd like to load all of the sets (about 100M triples) into a single store, since it doesn't seem possible to efficiently 'join' between stores.

I ran into a few problems doing this, which I'll describe here.

First of all, there is an incorrect regex in the Turtle parser for parsing strings. Somewhere around 50k triples I ran into a string that looked like "\\\\" which killed the parser. I got rid of that string from the input file, and found the parser then got killed by "... some text.. \\". I ran grep -v to zap those lines out.

Loading was pretty slow, but acceptable (I only have to do it once). It started out around 1000 triples/sec, and slowed to something around 300/sec by around 1M triples. 24 hours later I had about 13M triples loaded. A few hours later, I hit the wall. Most of the identifiers in ARC are mediumints, so the system has a 16M tuple limit.

Performance was mostly I/O bound. The machine was consistently seeing the disk 100% busy with mostly write activity. For the record, it's running on a late-model dual-core server with 4 GB of RAM, RHEL 5 64-bit. The disk system is a RAID 1 on which MySQL performance is less than stellar.

It looks like a 5 GB .nt file is going to puff up to an 8 GB MySQL directory. I can accept that. In fact, once I get it all loaded, I'm going to run it through the MySQL table compressor, which ought to cut the size by at least 50%.

------

Here's the plan going forward:

(1) Performance test at 16M tuples (is it even worth loading more?)
(2) Get a faster disk dedicated to this kind of work
(3) Fix the bad regex
(4) Expand mediumint -> plain int
(5) Investigate which indices are used in the build process. Drop all other indices, and rebuild them when the build process is done

Does this make sense?
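
The string-literal failure described in the message can be illustrated with a small sketch. This is not ARC's parser code (ARC is written in PHP, and its regex is not quoted in the message); it is a Python illustration, using the hypothetical names STRING_LITERAL and read_literal, of a pattern that consumes backslash escapes as pairs, so that a literal ending in an escaped backslash still closes at the right quote.

```python
import re

# Body of a double-quoted literal: either a character that is neither a
# quote nor a backslash, or a backslash followed by any single character.
# Consuming escapes as pairs means a literal that ends in an escaped
# backslash still terminates at the correct closing quote.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def read_literal(text):
    # Return the quoted literal at the start of text, or None if none is found.
    match = STRING_LITERAL.match(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # The two shapes of string reported in the message.
    print(read_literal(r'"\\\\" .'))
    print(read_literal(r'"... some text.. \\" .'))
```

Both reported inputs are read back as complete literals, while a literal whose final quote is itself escaped yields None instead of running on into the rest of the line. Whether this matches the intended fix for ARC is an assumption; it only shows one common way to avoid the problem.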

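The 16M ceiling and the loading times in the message follow from simple arithmetic, sketched here. It assumes unsigned MySQL integer columns (the message does not say whether ARC's columns are signed, nor which column hit the limit) and uses only the figures quoted in the message; the helper names are made up for illustration.

```python
# Storage widths of two MySQL integer types, in bytes.
MYSQL_INT_BYTES = {"MEDIUMINT": 3, "INT": 4}


def max_unsigned(type_name):
    # Largest value an unsigned column of this type can hold.
    return 2 ** (8 * MYSQL_INT_BYTES[type_name]) - 1


def average_rate(triples, hours):
    # Average load rate in triples per second.
    return triples / (hours * 3600)


def hours_to_load(triples, rate_per_sec):
    # Hours needed to load the given number of triples at a constant rate.
    return triples / rate_per_sec / 3600


if __name__ == "__main__":
    print(max_unsigned("MEDIUMINT"))  # ceiling behind the 16M limit
    print(max_unsigned("INT"))        # ceiling after widening to INT
    rate = average_rate(13_000_000, 24)
    print(round(rate))                # average triples/sec over the first day
    print(round(hours_to_load(100_000_000, rate)))
```

An unsigned MEDIUMINT tops out at 16,777,215, which lines up with the reported wall, and an unsigned INT raises that to 4,294,967,295. The 13M triples in 24 hours works out to roughly 150 triples/sec on average, so 100M triples would need about 185 hours if that rate held, which the reported slowdown suggests it would not.
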
""" ;
         ns1:returnPath "<paul@devonianfarm.com>" ;
         ns1:xOriginalTo "arc-dev@semsol.org" ;
         ns1:deliveredTo "web11p1@p15192371.pureserver.info" ;
         ns1:received """by webmail.mailtrust.com
    (Authenticated sender: paul@devonianfarm.com, from: paul@devonianfarm.com) 
    with HTTP; Fri, 3 Oct 2008 15:27:04 -0400 (EDT)""" ;
         ns1:date "Fri, 3 Oct 2008 15:27:04 -0400 (EDT)" ;
         ns1:subject "Arc vs Dbpedia...  Dbpedia wins" ;
         ns1:from "paul@devonianfarm.com" ;
         ns1:to "arc-dev@semsol.org" ;
         ns1:mIMEVersion "1.0" ;
         ns1:contentType 'multipart/alternative;boundary="----=_20081003152704_64205"' ;
         ns1:importance "Normal" ;
         ns1:xPriority "3 (Normal)" ;
         ns1:xType "html" ;
         ns1:messageID "<1223062024.763326905@192.168.1.201>" ;
         ns1:xMailer "webmail6.8" ;
         ns1:xSpamCheckerVersion """SpamAssassin 2.64 (2004-01-11) on 
	p15192371.pureserver.info""" ;
         ns1:xSpamLevel "" ;
         ns1:xSpamStatus """No, hits=0.3 required=5.0 tests=AWL,HTML_MESSAGE,NO_REAL_NAME 
	autolearn=no version=2.64