This is a public chat log generated from the #semsol IRC channel.
09:25:33
good easter break everyone?
09:27:55
yeah, you, too?
09:28:54
yeah thanks, it was ok, got out in the hills a bit
12:15:18
bengee: do you include related bnodes in the results of a DESCRIBE on a resource ?
12:16:21
descriptions of bnode objects should be included
12:16:31
ta
12:27:33
hiya bengee
12:27:45
hey
12:27:56
got your email, thx
12:28:30
the problem Peter Krantz was having (http://arc.semsol.org/community/irc/logs/2008/03/05) should be solved by the BINARY change..
12:28:40
thx 4 the patch
12:28:43
yeah
12:28:58
Before, every time you issued the UNION SELECT, it had to do a full table scan..
12:29:00
but it's prolly not enough for your use case
12:29:08
I have another issue now..
12:29:45
don't mess around with a hash just yet, when I have all my data loaded I will do extensive testing..
12:30:34
I don't want any unnecessary data, with 59M triples I am trying to keep the data as small as possible
12:30:53
heh, right
12:31:11
I have an issue with the ID UNION lookup..
12:31:36
I'll get rid of the unions in the next rev
12:32:13
the lookup is in the order id2val, s2val, o2val
12:32:29
I assumed mysql would optimize them away once a select matches, looking at the LIMIT 1, but it apparently doesn't
12:33:52
For example if I have 30K Subjects and 50K Objects already in store
12:34:30
and I want to load another object "http://www..." which it finds in Subject Id 36
12:34:57
it will then write to table o2val with Id 36 overwriting previous object 36
12:36:57
that shouldn't happen, IDs are globally unique
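A minimal sketch of the "globally unique IDs" point, not ARC's actual code: the three tables named in the log (id2val, s2val, o2val) are modelled as plain Python dicts, and the class and method names are hypothetical. Because all tables draw from one counter, an ID found via s2val can only ever refer to the same value in o2val, so writing it there cannot overwrite a different object.

```python
class TermStore:
    """Toy model: three id -> value tables that share ONE id counter."""

    # Lookup order as described in the log.
    LOOKUP_ORDER = ("id2val", "s2val", "o2val")

    def __init__(self):
        self.tables = {name: {} for name in self.LOOKUP_ORDER}
        self.next_id = 1  # single counter shared by all tables

    def find_id(self, value):
        """Return the existing id for value, checking each table in order."""
        for name in self.LOOKUP_ORDER:
            for term_id, stored in self.tables[name].items():
                if stored == value:
                    return term_id
        return None

    def store(self, value, table):
        """Reuse the id if the value is known anywhere, else allocate a new one."""
        term_id = self.find_id(value)
        if term_id is None:
            term_id = self.next_id
            self.next_id += 1
        existing = self.tables[table].get(term_id)
        if existing is not None and existing != value:
            # Cannot happen while ids come from the shared counter.
            raise ValueError("id collision in " + table)
        self.tables[table][term_id] = value
        return term_id


store = TermStore()
subject_id = store.store("http://example.org/a", "s2val")
literal_id = store.store("some literal", "o2val")
# The same URI seen again as an object reuses the subject's id.
object_id = store.store("http://example.org/a", "o2val")
```

If the IDs were instead allocated per table, the last call could clash with an unrelated object that happened to hold the same number, which is the overwrite described above.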
12:38:27
ok.. checking..
12:40:39
ok.. are you having any problems with character encoding..
12:41:20
personally no, but that's a bit mysql-dependent
12:42:14
I am loading data which is ISO-8859-1 and I am getting corrupted data into mysql
12:42:30
i am trying to track the code as it passes from file to store
12:42:36
are you doing any encoding
12:43:51
the earlier versions of arc did, but now I'm relying on php's xml parser to get things right. everything is parsed/converted to utf-8
12:44:08
ok
12:45:03
using the query("LOAD... how do you force a data type..
12:45:07
eg. ntriples
12:45:51
the format is auto-detected
12:46:20
i understand this but is there any way I can override the auto-detection
12:46:44
hmm, don't think so
12:46:54
sometimes it is getting it wrong.. still trying to find answers
12:47:33
when you have already loaded a few million triples it can become frustrating to go back and remove data
12:49:30
ah, wait
12:49:41
you can set a format via the configuration
12:50:11
e.g. when you instantiate the store
12:50:44
'format' => 'ntriples'
12:51:11
but this is then set for all LOADs
12:51:22
np
12:52:21
do not worry about the UNION query now.. with LIMIT 1 and now using index it will only return 1 row from each table and then finally take the first result and return this...
12:53:07
the BINARY change has improved my speed by approx 55 times at 11M triples
12:53:14
yeah, still inefficient, though, esp. as the o2val index will not always fit into mem
12:53:15
for "LOAD...
12:53:54
have you looked at splitting the tables like Redland Bnodes, Literals, Resources
12:53:55
11M triples is a lot already, wow
12:54:58
yeah, had a look at all the other table layouts out there
12:55:15
more than just a look, actually ;)
12:55:56
I have been using Redland for a few years.. but it's only compatible with PHP3/4..
12:56:15
oh, really?
12:56:27
as my C skills are not sharp I have no hope to improve performance by using PHP5..
12:56:39
but my PHP skills are more natural
12:58:05
anyway off to play.. report more later..
12:58:15
do not worry about messing with MERGE tables just yet..
12:58:19
great, thanks
12:58:39
I will do plenty of testing with table sharding and this is what I will need..
12:58:56
did you try querying your 11m triples already?
12:59:10
yes, i am trying to work out your temp tables..
12:59:36
i only picked up ARC on Saturday so still learning your code..
12:59:49
you seem to be quick at it ;)
13:00:03
I must say your work is brilliant and I will try to help where I can..
13:00:25
any work on a GRDDL parser in the future..
13:00:27
oh, thanks. Don't look at the locking code, it's an ugly mess
13:01:08
someone had a grddl extension
13:01:16
bengee checks ml
13:01:36
he started with an arc plugin, but then made it fully stand-alone, IIRC
13:01:44
ok
13:03:31
Daniel O'Connor
13:04:03
i have been using Raptor parser..
13:04:20
ok, thx..
13:04:31
http://code.google.com/p/xmlgrddl/
13:04:37
thx..
13:05:02
What's the status with Trice..??
13:05:46
simplified the core during the Easter holidays, getting closer to a 1st release
13:07:19
request dispatching to handlers works, templating, too, so the core stuff is more or less done
13:07:45
access control needs a bit more work
13:07:48
excellent
13:07:56
would be interested to see the results..
13:08:04
bengee too ;)
13:08:10
i will hang around here more often now..
13:08:16
great
13:09:09
I am in China so GMT+8
13:10:02
ah, late evening then?
13:10:25
hi folks, any ideas on using sparql to export rdf (with bnodes in it) in multiple stages, preserving resource identity, and avoiding duplication of bnodes?
13:11:31
hmm, tricky
13:13:08
use Talis or ARC, they either don't have bnodes, or use addressable ones ;)
13:13:51
bengee: well, the current use case is exporting from arc to Talis (and vice versa) :p
13:15:11
I don't see a way to do it without keeping a record of all the bnode ids so far, and generating horribly long filters :p
13:16:24
ah, right, the importing store will generate different internal IDs from different chunks
13:16:40
yeah
13:16:59
I think ARC-> ARC would be ok because of the preserve bnode option?
13:17:07
possibly
13:17:24
and Talis -> Arc would be ok because there ain't any bnodes
13:17:51
though Arc -> Talis -> Arc would always have the problem of bnodes becoming uris
13:18:08
I guess you have to do a DESCRIBE and then insert the docs
13:18:26
and hope for CBDs from the queried store
13:19:20
yes - i would still get duplication though, right?
13:19:45
hmm, right
13:19:49
oh, maybe not if I were to do a 2nd order describe?
13:20:30
DESCRIBE ?x ?y WHERE { ?x ?p ?o OPTIONAL {?x ?p ?y} } LIMIT 10 OFFSET {$offset}
13:21:08
would that reduce the risk of duplication?
13:21:44
a bnode object may well be shared by two different resources, so you'd get the bnode twice
13:22:07
then, if ?y is a bnode, I would only get 10 resources in the set, instead of 10 + n
13:23:02
at the sql level I'd get it twice maybe, but if you're returning the simpleindex, i can only get each resource once
13:24:00
right? or not?
13:24:07
I was thinking <res1> ex:prop _:bn1 . <res123456> ex:otherprop _:bn1 .
13:25:00
I guess whatever tactic you have, the descriptions of bn1 will be part of two chunks
13:25:20
yeah, you're right
13:26:35
you probably need serious app logic
13:26:37
also, even if you manage to group the linked graph together, if the number of linked bnodes is greater than the LIMIT, it doesn't work :(
13:26:44
right
13:27:23
maybe it's simpler to try to think of a way to delete the duplicates afterwards
13:27:27
hey cerealtom
13:27:40
hey kwijibo
13:28:03
e.g. replicate all triples where s and o are not bnodes, then get the triples where s is non-blank and o is bnode
13:28:38
loop thru those, build CBDs for the bnodes, find all incoming links, ...
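The steps just described can be sketched roughly as follows. This is only an illustration in Python over in-memory tuples, not something either store provides: triples are (s, p, o) strings, a term counts as a bnode if it starts with "_:", and all function names are hypothetical. The data reuses the res1 / res123456 example from earlier in the log, with one made-up ex:label triple added.

```python
def is_bnode(term):
    return term.startswith("_:")


def partition(triples):
    """Split into bnode-free triples and non-blank-subject -> bnode links."""
    plain = [t for t in triples if not is_bnode(t[0]) and not is_bnode(t[2])]
    links = [t for t in triples if not is_bnode(t[0]) and is_bnode(t[2])]
    return plain, links


def bnode_description(triples, bnode):
    """CBD-style closure: triples about the bnode, following nested bnodes."""
    seen, todo, out = set(), [bnode], []
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for t in triples:
            if t[0] == node:
                out.append(t)
                if is_bnode(t[2]):
                    todo.append(t[2])
    return out


def incoming_links(links, bnode):
    """All non-blank subjects pointing at the bnode."""
    return [t for t in links if t[2] == bnode]


triples = [
    ("<res1>", "ex:prop", "_:bn1"),
    ("<res123456>", "ex:otherprop", "_:bn1"),
    ("<res1>", "ex:name", '"one"'),
    ("_:bn1", "ex:label", '"shared"'),
]
plain, links = partition(triples)
description = bnode_description(triples, "_:bn1")
incoming = incoming_links(links, "_:bn1")
```

Here the description of _:bn1 is built once and both incoming links are attached to it, rather than the bnode being exported once per referring resource. The scan over all triples per bnode is what makes this hard to scale.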
13:29:06
not trivial
13:29:17
yeah, totally not
13:29:37
hard to scale reliably
13:30:12
same kinds of issues with splitting any large dataset with bnodes
13:30:35
any ideas cerealtom ? you wrote a bulk loader ! :)
13:31:04
you could write a talis service that does the bnode to uri conversion somehow, i.e. if the source store uses its internal bnodes for results (which most do, I guess), you could stream the chuks through a service that rewrites the bnodes to URIs
13:31:23
s/chuks/chunks/
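A rough sketch of that rewriting idea, assuming the chunks are N-Triples text and that the source store emits the same bnode label for the same node in every chunk (as suggested above). The base URI and function name are made up for illustration; this is not a Talis or ARC API.

```python
import re

# Match IRIs and literals first so that "_:" inside them is left alone;
# only the third alternative (a bnode label) captures group 1.
TOKEN = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"|_:([A-Za-z0-9][A-Za-z0-9_-]*)')


def rewrite_bnodes(chunk, base):
    """Replace every bnode label in an N-Triples chunk with <base + label>."""
    def repl(match):
        label = match.group(1)
        if label is None:  # an IRI or a literal: keep unchanged
            return match.group(0)
        return "<" + base + label + ">"
    return TOKEN.sub(repl, chunk)


BASE = "http://example.org/bnode/"  # hypothetical URI prefix
chunk1 = '<http://example.org/res1> <http://example.org/prop> _:bn1 .\n'
chunk2 = (
    '<http://example.org/res123456> <http://example.org/otherprop> _:bn1 .\n'
    '_:bn1 <http://example.org/label> "not _:a bnode" .\n'
)
out1 = rewrite_bnodes(chunk1, BASE)
out2 = rewrite_bnodes(chunk2, BASE)
```

Since both chunks now name the node with the same URI, the importing store can merge them without minting two different internal bnode IDs. The label pattern here is deliberately narrower than the full N-Triples grammar.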
13:37:09
kwijibo: fraid my bulk loader is for loading to the contentbox, not the metabox; ie it just posts files from the filesystem into the contentbox of a platform store
16:29:55
bengee?
16:31:58
ah sweet, nevermind
