Channel #semsol: Logs

This is a public chat log generated from the #semsol IRC channel.

22:39:49 kwijibo: kwijibo wondering if maybe INSERT INTO <http://example.org> { ?s ?p ?o } shouldn't parse
07:08:35 kwijibo: bengee: you about?
07:09:25 bengee: hey, resurfacing from rick watching?
07:09:34 kwijibo: ooh
07:09:40 kwijibo: thanks for reminding me
07:09:53 kwijibo: kwijibo starts watching rick astley again
07:10:47 kwijibo: http://n2.talis.com/svn/playground/kwijibo/PHP/arc/plugins/trunk/talis/Talis_StorePlugin.php my Talis_Store::import() method eventually runs out of memory, depending on the memory limit and the size of the data - I was wondering if you could see a way to make it not run out of memory :p
07:16:35 bengee: hmm
07:18:44 bengee: do you think it's the arc store not garbage-collecting properly?
07:19:03 kwijibo: I don't really understand how the memory usage works - I'd have thought that at the end of the loop it would all be cleared by the garbage collector and start over
07:19:11 kwijibo: but apparently not
07:20:07 bengee: is it php that uses the memory, or mysql perhaps?
07:20:19 bengee: (just to be sure)
07:20:23 kwijibo: php is the one that says it runs out of memory
07:20:27 bengee: ok
07:21:05 bengee: did you try to comment-out the $this->insert call
07:21:20 bengee: i.e. just loop through the record sets first
07:21:23 kwijibo: no - I'll try that
07:24:15 bengee: and doesn't (empty($data)) evaluate to true even if the construct is empty?
07:24:29 bengee: ah, no
07:24:33 bengee: "raw"
07:27:35 kwijibo: hmm, good call, that doesn't seem to be stopping
07:27:48 kwijibo: although I wish I'd lowered the memory before starting
07:30:09 kwijibo: hmm, seems to have slowed down considerably
07:30:29 kwijibo: I wonder if I'm going to get a row for D.O.Sing the platform again :p
07:31:58 kwijibo: ok, gotta go for a train now, cheers bengee cu l8r
07:32:05 bengee: ok
07:32:32 kwijibo: i wonder what's wrong with the insert call
12:03:11 kwijibo: hey bengee
12:03:15 kwijibo: back online now
12:03:35 kwijibo: it was because I was creating a new parser with each ->insert()
12:04:04 kwijibo: I'd have thought php could've coped with clearing up that, but apparently not
12:05:40 bengee: ah
12:09:26 kwijibo: I also thought iand told me that you needed a new parser for each document with arc2, but that seems not to be the case?
12:09:48 kwijibo: it seems to work just reusing the parser instance anyway
12:10:23 bengee: it depends, I think
12:10:43 bengee: some local variables may not be reset
12:11:10 bengee: unless you manually call __init()
12:11:13 kwijibo: what about if __init() was called at the start of parse() ?
12:11:18 bengee: ;)
12:12:55 bengee: the reader stuff may be problematic
12:13:32 bengee: you may have to unset $this->reader
12:13:40 bengee: so that a new socket can be opened etc
12:14:12 bengee: it's probably less dangerous to create new parser objects
12:14:43 bengee: there might be other dependencies
12:15:28 bengee: or conflicts
12:19:05 kwijibo: in this case, the parser is being passed the document, so shouldn't need to open a new socket
12:19:22 bengee: ok, then it may work
12:19:46 kwijibo: seems less harmful than the script dying from memory overflow anyway
12:21:00 kwijibo: can I request a reusable parser for a future revision? :D
12:21:21 kwijibo: or is it more problematic than it sounds?
12:21:45 bengee: it feels problematic
12:21:50 bengee: not sure if it is
12:22:15 bengee: the parser->sub_parser chain makes things a bit complicated
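
The idea floated above, calling a reset routine at the start of parse() so that one parser object can be reused, can be sketched as follows. This is a minimal illustration in Python, not ARC2 code: the class `ReusableParser`, its `_reset()` method, and the whitespace-separated "triple" format are all hypothetical, and it ignores the sub-parser chain that bengee flags as the complicated part.

```python
class ReusableParser:
    """Hypothetical parser that clears its own state before each parse."""

    def __init__(self):
        self._reset()

    def _reset(self):
        # Keep all per-document state in one place so nothing leaks
        # from one document into the next.
        self.triples = []
        self.base = None
        self.reader = None  # a real parser would reopen its reader on demand

    def parse(self, base, data):
        self._reset()  # start from a clean slate on every call
        self.base = base
        for line in data.splitlines():
            parts = line.split()
            if len(parts) == 3:  # toy format: "subject predicate object"
                self.triples.append(tuple(parts))
        return self.triples


parser = ReusableParser()
first = parser.parse("http://example.org/a", "s1 p1 o1\ns2 p2 o2")
second = parser.parse("http://example.org/b", "s3 p3 o3")
```

Because `_reset()` rebinds `self.triples` to a new list, the result of the first call is not altered by the second one.
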
12:22:35 kwijibo: i wonder why the parser object is hanging around in memory
12:23:03 bengee: don't I unset() it after parsing?
12:23:10 kwijibo: oh well, as I said, I don't understand how the garbage collector works
12:23:27 kwijibo: even if I unset it manually, I still get the memory overflow
12:23:55 kwijibo: I mean: unset($parser); return $foo;
12:25:36 bengee: maybe the xml parser has some weird global scope and isn't freed
12:25:59 kwijibo: that sounds plausible
12:28:36 bengee: hmm, which parser are we talking about, btw? the rdfxml one?
12:28:52 bengee: and you pass in a string?
12:29:16 bengee: maybe it's the reader that consumes all the memory
12:30:14 bengee: a data reader keeps the string in memory and uses a pointer to walk through it.
12:30:49 bengee: so, if the reader isn't properly killed, you'll keep all the data strings in mem
12:32:05 bengee: arc calls closeStream after parsing, though, so this shouldn't happen
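
The log never settles why the parser stays in memory after unset(). One possible explanation, offered here only as an assumption, is a reference cycle: if a parser and its sub-parser point at each other, a runtime that relies on reference counting alone never frees them (PHP versions before 5.3 had no cycle collector). The sketch below reproduces that effect in Python by switching the cycle collector off; the `Parser` class and its `parent`/`sub_parser` attributes are hypothetical stand-ins, not ARC2's actual structure.

```python
import gc
import weakref


class Parser:
    """Hypothetical parser node with a link to its parent and sub-parser."""

    def __init__(self, parent=None):
        self.parent = parent
        self.sub_parser = None


def make_chain():
    parser = Parser()
    # The sub-parser points back at its parent, forming a cycle.
    parser.sub_parser = Parser(parent=parser)
    return parser


gc.disable()  # mimic a runtime that only does reference counting

parser = make_chain()
probe = weakref.ref(parser)  # lets us observe the object without owning it
del parser  # comparable to unset($parser)

# The sub-parser still references the parent, so nothing has been freed.
alive_after_del = probe() is not None

gc.collect()  # an explicit cycle collection reclaims the pair
alive_after_collect = probe() is not None

gc.enable()
```

If this were the cause, breaking the back-reference before dropping the parser, or reusing a single parser instance as kwijibo ended up doing, would both avoid the build-up.
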
12:34:55 bengee: maybe there should be a streaming dumpTrix in arc which could then be accessed from a streaming trix loader/parser
12:35:36 bengee: i.e. $new_store->query("LOAD <oldendpoint/dumpTrix>")
12:36:50 kwijibo: bengee: emm, I think the turtle one actually - though I was using getRDFParser()
12:37:37 kwijibo: and I pass in a string
12:42:20 bengee: s/trix/sparql xml result/ would probably work, too, if ARC could stream those somehow
12:43:24 kwijibo: not sure I get the point of the streaming trix ?
12:43:53 bengee: you wouldn't have the offset/multi-parsing problem
12:44:16 bengee: you stream-insert in one store what's streaming out of another store
12:47:46 kwijibo: how would you stream the output?
12:48:11 bengee: yeah, that's the big question ;)
12:48:57 bengee: I'd need a single query that can generate g, s, p, o, s_type, o_type, o_dt, o_lang from ARC's normalized tables
12:49:49 bengee: + mysql_unbuffered_query()
12:50:00 bengee: + echo + flush()
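
The recipe bengee lists (one query, an unbuffered result set, then echo and flush per row) amounts to streaming rows to the output without ever holding the full result in memory. A rough Python sketch of that pattern follows. It uses an in-memory SQLite database as a stand-in for MySQL, and the flat `quads` table, the tab-separated output, and the `stream_quads` function are assumptions made for the example; ARC's real tables are normalized and would need the single joined query he mentions.

```python
import io
import sqlite3


def stream_quads(conn, out, flush_every=1000):
    """Write one line per row to `out`, fetching rows lazily from the cursor."""
    cur = conn.cursor()
    cur.execute("SELECT g, s, p, o FROM quads")
    count = 0
    for g, s, p, o in cur:  # iterating the cursor pulls rows one at a time
        out.write(f"{g}\t{s}\t{p}\t{o}\n")
        count += 1
        if count % flush_every == 0:
            out.flush()  # push what we have to the client
    out.flush()
    return count


# Hypothetical data, just enough to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quads (g TEXT, s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO quads VALUES (?, ?, ?, ?)",
    [("g1", f"s{i}", "p", f"o{i}") for i in range(3)],
)

out = io.StringIO()
written = stream_quads(conn, out, flush_every=2)
lines = out.getvalue().splitlines()
```

A consumer on the other end could then parse and insert each line as it arrives, which is the stream-in, stream-out arrangement described above.
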
12:51:52 kwijibo: the OFFSET thing isn't a hardship for me, because I have to lump the triples into documents to send over http anyway
12:53:04 kwijibo: I'm just looking at how to make LOAD scalable with the streaming parsing, like it is in the arc store
12:53:20 kwijibo: that will be nice
12:54:10 kwijibo: one of the pains of using a talis store is getting medium-large amounts of data in there - it has to be chunked
12:54:30 bengee: ah, ok
12:54:31 kwijibo: and you've already solved that
12:54:40 kwijibo: :)
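
The chunking kwijibo describes, lumping triples into documents small enough to send over HTTP, can be sketched like this. It is an illustration under stated assumptions rather than the Talis_Store code: `chunked`, `import_in_chunks`, the `send` callback, and the batch size of 1000 are all hypothetical, and a real import would serialize each batch and POST it.

```python
from itertools import islice


def chunked(triples, size):
    """Yield lists of at most `size` triples from any iterable."""
    it = iter(triples)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_in_chunks(triples, send, size=1000):
    """Pass each batch to `send` and return the number of triples sent."""
    sent = 0
    for batch in chunked(triples, size):
        send(batch)  # stand-in for serializing and sending one document
        sent += len(batch)
    return sent


# Usage: collect the batches instead of sending them anywhere.
collected = []
source = (("s%d" % i, "p", "o") for i in range(2500))
total = import_in_chunks(source, collected.append, size=1000)
batch_sizes = [len(b) for b in collected]
```

Because the source is consumed lazily, only one batch is held in memory at a time, which is the property the import loop needs.
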
12:59:56 kwijibo: that's how I managed to get wordnet into the schema-cache store - imported it into arc, and imported into the platform from my arc store
13:00:43 bengee: heh
13:00:56 bengee: how many triples was that?
13:01:21 kwijibo: 1.3 million in the biggest file
13:01:36 kwijibo: which was 95mb
13:01:41 kwijibo: I already had the others in
13:01:59 bengee: interesting
17:25:42 Rabur: Hello