Channel #semsol: Logs

This is a public chat log generated from the #semsol IRC channel.

09:25:33 kwijibo: good easter break everyone?
09:27:55 bengee_: yeah, you, too?
09:28:54 kwijibo: yeah thanks, it was ok, got out in the hills a bit
12:15:18 kwijibo: bengee: do you include related bnodes in the results of a DESCRIBE on a resource?
12:16:21 bengee: descriptions of bnode objects should be included
12:16:31 kwijibo: ta
12:27:33 chameleon95: hiya bengee
12:27:45 bengee: hey
12:27:56 chameleon95: got your email, thx
12:28:30 chameleon95: the problem Peter Krantz was having (http://arc.semsol.org/community/irc/logs/2008/03/05) should be solved by the BINARY change..
12:28:40 bengee: thx 4 the patch
12:28:43 bengee: yeah
12:28:58 chameleon95: Before, every time you issued the UNION SELECT, it had to do a full table scan..
12:29:00 bengee: but it's prolly not enough for your use case
12:29:08 chameleon95: I have another issue now..
12:29:45 chameleon95: don't mess around with a hash just yet, when I have all my data loaded I will do extensive testing..
12:30:34 chameleon95: I don't want any unnecessary data, with 59M triples I am trying to keep the data as small as possible
12:30:53 bengee: heh, right
12:31:11 chameleon95: I have an issue with the ID UNION lookup..
12:31:36 bengee: I'll get rid of the unions in the next rev
12:32:13 chameleon95: the lookup is in the order id2val, s2val, o2val
12:32:29 bengee: I assumed mysql would optimize them away once a select matches, looking at the LIMIT 1, but it apparently doesn't
12:33:52 chameleon95: For example if I have 30K Subjects and 50K Objects already in store
12:34:30 chameleon95: and I want to load another object "http://www..." which it finds in Subject Id 36
12:34:57 chameleon95: it will then write to table o2val with Id 36 overwriting previous object 36
12:36:57 bengee: that shouldn't happen, IDs are globally unique
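A minimal sketch of the point made above, assuming a simplified layout in which the three lookup tables (id2val, s2val, o2val) each hold (id, val) pairs and a term keeps one ID across all of them. The schema, the function names, and the use of SQLite are illustrative assumptions, not ARC's actual code. It checks the tables one after another and stops at the first hit, rather than issuing a UNION over all three.

```python
import sqlite3

# Lookup order as mentioned in the log; the schema itself is an assumption.
TABLES = ("id2val", "s2val", "o2val")


def open_store():
    """Create an in-memory store with three (id, val) tables."""
    db = sqlite3.connect(":memory:")
    for table in TABLES:
        db.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, val TEXT UNIQUE)")
    return db


def find_id(db, val):
    """Return the existing ID for val, checking tables in order and stopping at the first hit."""
    for table in TABLES:
        row = db.execute(
            f"SELECT id FROM {table} WHERE val = ? LIMIT 1", (val,)
        ).fetchone()
        if row:
            return row[0]
    return None


def next_id(db):
    """Allocate a new ID that is unused in every table (globally unique)."""
    highest = max(
        db.execute(f"SELECT COALESCE(MAX(id), 0) FROM {table}").fetchone()[0]
        for table in TABLES
    )
    return highest + 1


def store_term(db, val, table):
    """Store val in the given table, reusing its ID if it is already known anywhere."""
    term_id = find_id(db, val)
    if term_id is None:
        term_id = next_id(db)
    db.execute(
        f"INSERT OR IGNORE INTO {table} (id, val) VALUES (?, ?)", (term_id, val)
    )
    return term_id
```

Because an ID is only ever handed out to one value, writing a term that was first seen as a subject into o2val reuses its ID without colliding with a different object.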
12:38:27 chameleon95: ok.. checking..
12:40:39 chameleon95: ok.. are you having any problems with character encoding..
12:41:20 bengee: personally no, but that's a bit mysql-dependent
12:42:14 chameleon95: I am loading data which is ISO-8859-1 and I am getting corrupted data into mysql
12:42:30 chameleon95: i am trying to track the code as it passes from file to store
12:42:36 chameleon95: are you doing any encoding
12:43:51 bengee: the earlier versions of arc did, but now I'm relying on php's xml parser to get things right. everything is parsed/converted to utf-8
12:44:08 chameleon95: ok
12:45:03 chameleon95: using the query("LOAD... how do you force a data type..
12:45:07 chameleon95: eg. ntriples
12:45:51 bengee: the format is auto-detected
12:46:20 chameleon95: i understand this but is there any way I can override the auto-detection
12:46:44 bengee: hmm, don't think so
12:46:54 chameleon95: sometimes it is getting it wrong.. still trying to find answers
12:47:33 chameleon95: when you have already loaded a few million triples it can become frustrating to go back and remove data
12:49:30 bengee: ah, wait
12:49:41 bengee: you can set a format via the configuration
12:50:11 bengee: e.g. when you instantiate the store
12:50:44 bengee: 'format' => 'ntriples'
12:51:11 bengee: but this is then set for all LOADs
12:51:22 chameleon95: np
12:52:21 chameleon95: do not worry about the UNION query now.. with LIMIT 1 and now using index it will only return 1 row from each table and then finally take the first result and return this...
12:53:07 chameleon95: the BINARY change has improved my speed by approx 55 times at 11M triples
12:53:14 bengee: yeah, still inefficient, though, esp. as the o2val index will not always fit into mem
12:53:15 chameleon95: for "LOAD...
12:53:54 chameleon95: have you looked at splitting the tables like Redland Bnodes, Literals, Resources
12:53:55 bengee: 11M triples is a lot already, wow
12:54:58 bengee: yeah, had a look at all the other table layouts out there
12:55:15 bengee: more than just a look, actually ;)
12:55:56 chameleon95: I have been using Redland for a few years.. but it's only compatible with PHP3/4..
12:56:15 bengee: oh, really?
12:56:27 chameleon95: as my C skills are not sharp I have no hope to improve performance by using PHP5..
12:56:39 chameleon95: but my PHP skills are more natural
12:58:05 chameleon95: anyway off to play.. report more later..
12:58:15 chameleon95: do not worry about messing with MERGE tables just yet..
12:58:19 bengee: great, thanks
12:58:39 chameleon95: I will do plenty of testing with table sharding and this is what I will need..
12:58:56 bengee: did you try querying your 11m triples already?
12:59:10 chameleon95: yes, i am trying to work out your temp tables..
12:59:36 chameleon95: i only picked up ARC on Saturday so still learning your code..
12:59:49 bengee: you seem to be quick at it ;)
13:00:03 chameleon95: I must say your work is brilliant and I will try to help where I can..
13:00:25 chameleon95: any work on a GRDDL parser in the future..
13:00:27 bengee: oh, thanks. Don't look at the locking code, it's an ugly mess
13:01:08 bengee: someone had a grddl extension
13:01:16 bengee: bengee checks ml
13:01:36 bengee: he started with an arc plugin, but then made it fully stand-alone, IIRC
13:01:44 chameleon95: ok
13:03:31 bengee: Daniel O'Connor
13:04:03 chameleon95: i have been using Raptor parser..
13:04:20 chameleon95: ok, thx..
13:04:31 bengee: http://code.google.com/p/xmlgrddl/
13:04:37 chameleon95: thx..
13:05:02 chameleon95: What's the status with Trice..??
13:05:46 bengee: simplified the core during the Easter holidays, getting closer to a 1st release
13:07:19 bengee: request dispatching to handlers works, templating, too, so the core stuff is more or less done
13:07:45 bengee: access control needs a bit more work
13:07:48 chameleon95: excellent
13:07:56 chameleon95: would be interested to see the results..
13:08:04 bengee: bengee too ;)
13:08:10 chameleon95: i will hang around here more often now..
13:08:16 bengee: great
13:09:09 chameleon95: I am in China so GMT+8
13:10:02 bengee: ah, late evening then?
13:10:25 kwijibo: hi folks, any ideas on using sparql to export rdf (with bnodes in it) in multiple stages, preserving resource identity, and avoiding duplication of bnodes?
13:11:31 bengee: hmm, tricky
13:13:08 bengee: use Talis or ARC, they either don't have bnodes, or use addressable ones ;)
13:13:51 kwijibo: bengee: well, the current use case is exporting from arc to Talis (and vice versa) :p
13:15:11 kwijibo: I don't see a way to do it without keeping a record of all the bnode ids so far, and generating horribly long filters :p
13:16:24 bengee: ah, right, the importing store will generate different internal IDs from different chunks
13:16:40 kwijibo: yeah
13:16:59 kwijibo: I think ARC-> ARC would be ok because of the preserve bnode option?
13:17:07 bengee: possibly
13:17:24 kwijibo: and Talis -> Arc would be ok because there ain't any bnodes
13:17:51 kwijibo: though Arc -> Talis -> Arc would always have the problem of bnodes becoming uris
13:18:08 bengee: I guess you have to do a DESCRIBE and then insert the docs
13:18:26 bengee: and hope for CBDs from the queried store
13:19:20 kwijibo: yes - i would still get duplication though, right?
13:19:45 bengee: hmm, right
13:19:49 kwijibo: oh, maybe not if I were to do a 2nd order describe?
13:20:30 kwijibo: DESCRIBE ?x ?y WHERE { ?x ?p ?o OPTIONAL {?x ?p ?y} } LIMIT 10 OFFSET {$offset}
13:21:08 kwijibo: would that reduce the risk of duplication?
13:21:44 bengee: a bnode object may well be shared by two different resources, so you'd get the bnode twice
13:22:07 kwijibo: then, if ?y is a bnode, I would only get 10 resources in the set, instead of 10 + n
13:23:02 kwijibo: at the sql level I'd get it twice maybe, but if you're returning the simpleindex, i can only get each resource once
13:24:00 kwijibo: right? or not?
13:24:07 bengee: I was thinking <res1> ex:prop _:bn1 . <res123456> ex:otherprop _:bn1 .
13:25:00 bengee: I guess whatever tactic you have, the descriptions of bn1 will be part of two chunks
13:25:20 kwijibo: yeah, you're right
13:26:35 bengee: you probably need serious app logic
13:26:37 kwijibo: also, even if you manage to group the linked graph together, if the number of linked bnodes is greater than the LIMIT, it doesn't work :(
13:26:44 bengee: right
13:27:23 kwijibo: maybe it's simpler to try to think of a way to delete the duplicates afterwards
13:27:27 kwijibo: hey cerealtom
13:27:40 cerealtom: hey kwijibo
13:28:03 bengee: e.g. replicate all triples where s and o are not bnodes, then get the triples where s is non-blank and o is bnode
13:28:38 bengee: loop thru those, build CBDs for the bnodes, find all incoming links, ...
13:29:06 bengee: not trivial
13:29:17 kwijibo: yeah, totally not
13:29:37 kwijibo: hard to scale reliably
13:30:12 kwijibo: same kinds of issues with splitting any large dataset with bnodes
13:30:35 kwijibo: any ideas cerealtom ? you wrote a bulk loader ! :)
13:31:04 bengee: you could write a talis service that does the bnode to uri conversion somehow, i.e. if the source store uses its internal bnodes for results (which most do, I guess), you could stream the chuks through a service that rewrites the bnodes to URIs
13:31:23 bengee: s/chuks/chunks/
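A rough sketch of the rewriting idea, assuming the chunks arrive as N-Triples text and that the source store really does reuse the same bnode label for the same node in every chunk (the condition bengee mentions). The base URI and function names are made up for illustration, and the regular expression is a line-level approximation, not a full N-Triples parser.

```python
import re

# Subject, predicate, then everything up to the terminating " ." as the object.
TRIPLE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")


def to_uri(term, base):
    """Turn _:label into <base + label>; leave URIs and literals untouched."""
    if term.startswith("_:"):
        return "<" + base + term[2:] + ">"
    return term


def rewrite_chunk(ntriples, base="http://example.org/bnode/"):
    """Rewrite bnodes in subject and object position of each N-Triples line."""
    out = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line)
        if not match:
            out.append(line)  # blank or unrecognised lines pass through unchanged
            continue
        s, p, o = match.groups()
        out.append(f"{to_uri(s, base)} {p} {to_uri(o, base)} .")
    return "\n".join(out)
```

Since the mapping depends only on the label, a bnode that shows up in two chunks is rewritten to the same URI both times, and the importing store merges the two descriptions instead of minting two separate bnodes.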
13:37:09 cerealtom: kwijibo: fraid my bulk loader is for loading to the contentbox, not the metabox; ie it just posts files from the filesystem into the contentbox of a platform store
16:29:55 kwijibo: bengee?
16:31:58 kwijibo: ah sweet, nevermind