Channel #semsol: Logs

This is a public chat log generated from the #semsol IRC channel.

07:54:51 mmmmmrob: mmmmmrob waves to bengee
07:55:02 bengee: heya
07:55:29 mmmmmrob: just been profiling memory usage on our app
07:55:45 mmmmmrob: as we're hitting 16Mb with fairly small amounts of rdf/xml
07:56:46 mmmmmrob: I managed to shave a couple of seconds off getSimpleIndex, but not any memory (well, 100k maybe)
07:56:56 mmmmmrob: s/100k/100 bytes/
07:57:38 mmmmmrob: bengee: the print_r seems to be expensive
07:58:48 bengee: ah, interesting
07:58:56 bengee: could replace it
07:59:13 mmmmmrob: I shaved 2 seconds off a page generation of 11 seconds
07:59:20 mmmmmrob: --- ARC2.php (revision 30236)
07:59:21 mmmmmrob: +++ ARC2.php (working copy)
07:59:22 mmmmmrob: @@ -124,7 +124,6 @@
07:59:24 mmmmmrob:
07:59:25 mmmmmrob: function getSimpleIndex($triples, $flatten_objects = 1, $vals = '') {
07:59:26 mmmmmrob: $r = array();
07:59:27 mmmmmrob: - $added = array();
07:59:29 mmmmmrob: foreach ($triples as $t) {
07:59:30 mmmmmrob: $skip_t = 0;
07:59:31 mmmmmrob: foreach (array('s', 'p', 'o') as $term) {
07:59:32 mmmmmrob: @@ -163,10 +162,19 @@
07:59:34 mmmmmrob: $o[$suffix] = $t['o ' . $suffix];
07:59:35 mmmmmrob: }
07:59:37 mmmmmrob: }
07:59:39 mmmmmrob: - $id = $s . ' ' . $p . ' ' . print_r($o, 1);
07:59:41 mmmmmrob: - if (!isset($added[$id])) {
07:59:43 mmmmmrob: + if (!isset($r[$s][$p])) {
07:59:45 mmmmmrob: $r[$s][$p][] = $o;
07:59:47 mmmmmrob: - $added[$id] = 1;
07:59:49 mmmmmrob: + } else {
07:59:51 mmmmmrob: + $in_already = false;
07:59:53 mmmmmrob: + foreach($r[$s][$p] as $o_in_already) {
07:59:55 mmmmmrob: + if ($o == $o_in_already) {
07:59:57 mmmmmrob: + $in_already = true;
07:59:59 mmmmmrob: + break;
08:00:01 mmmmmrob: + }
08:00:03 mmmmmrob: + }
08:00:05 mmmmmrob: + if (!$in_already) {
08:00:07 mmmmmrob: + $r[$s][$p][] = $o;
08:00:09 mmmmmrob: + }
08:00:11 mmmmmrob: }
08:00:13 mmmmmrob: }
08:00:17 mmmmmrob: }
08:00:19 mmmmmrob: though my changes may break things :-(
08:00:30 mmmmmrob: but the memory delta is what's confusing me
08:01:42 bengee: the added[id] is clearly getting quite large for large object literals
08:01:44 mmmmmrob: getSimpleIndex on less than 1Mb of rdf/xml jumps the memory up by 4Mb
08:02:15 bengee: I should hash that, or simply skip dupe checks
08:02:18 mmmmmrob: that's just the getSimpleIndex, not the parse
08:02:53 mmmmmrob: bengee: can you not use in_array or something like above to do the dupe checking on $r?
08:03:26 bengee: just checking
08:03:55 bengee: didn't know that needle can be an array, tbh
08:04:15 mmmmmrob: bengee: it may not be able to be
08:04:21 mmmmmrob: mmmmmrob is not sure
08:04:33 bengee: php 4.2.0
08:04:48 bengee: that's deployed enough, I guess
08:05:00 mmmmmrob: we're on 5
08:05:38 bengee: arc has 4.3 as minimum in the requirements
08:05:47 bengee: so, I can change the code
08:05:51 bengee: cheers :)
08:06:59 mmmmmrob: bengee: no probs
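
The in_array() idea discussed above can be sketched roughly as follows. This is a minimal illustration, not the change that went into ARC2: the helper name addObjectOnce and the example URIs are made up, and the comparison is loose (==), matching the hand-written loop in the pasted patch.

```php
<?php
// Hypothetical helper, not ARC2 code: append $o to $r[$s][$p]
// unless an equal object array is already stored there.
function addObjectOnce(array &$r, $s, $p, array $o) {
    if (!isset($r[$s][$p])) {
        $r[$s][$p] = array();
    }
    // in_array() accepts an array as needle (PHP >= 4.2.0, as noted in the log).
    // Without the third argument the comparison is loose, like == in the patch.
    if (!in_array($o, $r[$s][$p])) {
        $r[$s][$p][] = $o;
        return true;
    }
    return false;
}

$r = array();
$s = 'http://example.org/s';
$p = 'http://example.org/p';
addObjectOnce($r, $s, $p, array('value' => 'a', 'type' => 'literal'));
addObjectOnce($r, $s, $p, array('value' => 'a', 'type' => 'literal')); // duplicate, skipped
addObjectOnce($r, $s, $p, array('value' => 'b', 'type' => 'literal'));
echo count($r[$s][$p]) . " distinct objects\n";
```

This drops the separate $added array, so no print_r() strings are kept in memory, at the cost of a linear scan over the objects already stored for each subject/predicate pair. Passing true as the third argument to in_array() would make the comparison strict.
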
08:07:16 mmmmmrob: bengee: what can I do to work out more about the memory usage?
08:08:42 bengee: I used to use xdebug, not sure if it's helpful wrt memory profiling, though
08:09:01 mmmmmrob: I have xdebug doing traces and reporting memory deltas
08:09:37 mmmmmrob: 8.8782 9202508 +0 -> ARC2_RDFParser->getSimpleIndex() /projects/zephyr/src/views/list.php:55
08:09:39 mmmmmrob: 8.8783 9202508 +0 -> ARC2_RDFXMLParser->getTriples() /projects/zephyr/lib/arc/parsers/ARC2_RDFParser.php:88
08:09:40 mmmmmrob: 8.8783 9202508 +0 -> ARC2_Class->v() /projects/zephyr/lib/arc/parsers/ARC2_RDFXMLParser.php:113
08:09:41 mmmmmrob: 8.8783 9202508 +0 -> is_array() /projects/zephyr/lib/arc/ARC2_Class.php:35
08:09:43 mmmmmrob: 8.8837 9593200 +390692 -> ARC2->getSimpleIndex() /projects/zephyr/lib/arc/parsers/ARC2_RDFParser.php:88
08:09:44 mmmmmrob: 9.0715 13259808 +3666608 -> memory_get_usage() /projects/zephyr/src/views/list.php:56
08:09:45 mmmmmrob: 9.0719 13259848 +40 -> number_format() /projects/zephyr/src/views/list.php:56
08:09:50 mmmmmrob: oooh, sorry, that wasn't attractive
08:10:31 mmmmmrob: the increase on getSimpleIndex appears to be 3.6Mb + 400Kb
08:10:50 mmmmmrob: but having read the code I don't see where that's going.
08:11:56 mmmmmrob: do you know any way of dumping some kind of memory map? so I could see what variables are using the memory?
08:15:39 bengee: bengee tries getrusage(), can't make much sense of it
08:16:54 mmmmmrob: oooh
08:16:59 mmmmmrob: mmmmmrob goes to look
08:18:51 mustang: hello everyone
08:20:03 bengee: hi mustang
08:21:22 bengee: mmmmmrob, I guess ARC is not passing a triples array reference to getSimpleIndex, i.e. at least the triples array is duplicated
08:23:38 mmmmmrob: bengee: yeah, I wondered about that too
08:24:35 bengee: although the triples copy shouldn't stay in mem once the index is built and returned
08:24:55 mmmmmrob: bengee: changing that to a reference drops about 200k off the total, so nothing massive just as you'd expect
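
Without xdebug, the same kind of before/after memory delta can be taken by hand with memory_get_usage(). The sketch below is an assumption-laden illustration: memoryDelta and buildStrings are hypothetical names, and buildStrings merely stands in for a real call such as array($parser, 'getSimpleIndex').

```php
<?php
// Hypothetical helper (not part of ARC2 or xdebug): run a callable and
// report how much memory_get_usage() grew across the call.
function memoryDelta($callback, array $args = array()) {
    $before = memory_get_usage();
    $result = call_user_func_array($callback, $args);
    $after = memory_get_usage();
    // $result is still referenced here, so its memory is part of the delta.
    return array('result' => $result, 'delta' => $after - $before);
}

// Stand-in workload; swap in e.g. array($parser, 'getSimpleIndex').
function buildStrings($n) {
    $out = array();
    for ($i = 0; $i < $n; $i++) {
        $out[] = 'http://example.org/resource/' . $i;
    }
    return $out;
}

$m = memoryDelta('buildStrings', array(10000));
echo number_format($m['delta']) . " bytes retained after the call\n";
```

The delta only shows what is still allocated once the call returns. Temporary copies made and freed inside the call do not appear in it; memory_get_peak_usage() (PHP 5.2 and later) may help with those.
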
08:37:00 danja: hey bengee - thanks a million for the parcel! Much appreciated.
08:37:51 bengee: ah, np :)
08:39:31 bengee: almost kept it for myself, I liked the reflecting cube ;)
08:56:42 danja: me too - definitely going in a video that one ;-)
09:35:51 danbri: anyone know a Triple - Object mapper in PHP? (to work with ARC)
09:36:27 bengee: heh, mustang just asked for that, too
09:46:16 danbri: asked me too, offchannel ... i suggested better to ask in public
10:15:04 kwijibo: mmmmmrob did you write one?
11:23:09 mmmmmrob: kwijibo: what? a triple to object mapper?
11:23:45 kwijibo: mmmmmrob yeah
11:24:36 mmmmmrob: kwijibo: someone else here did, but we're not using it anymore
11:24:50 kwijibo: ok, ta
11:25:12 mmmmmrob: kwijibo: I'm not sure I like the approach much, the reasons I originally wanted one were misguided
11:25:50 kwijibo: mmmmmrob: maybe easy to write one, hard to write one you actually like much
11:26:34 mmmmmrob: kwijibo: yeah, exactly - or one that makes sense in an open-world assumption
11:42:04 edsu: there's no way to LOAD a local file, rather than remote URI is there?
11:42:51 bengee: no, you can just use a relative path
11:43:07 bengee: you may have to specify a base
11:44:13 bengee: (where the base is </some/local/path/>)
11:47:10 bengee: I think the endpoint works with http uris/paths only, but the query() method can handle rel/local paths
11:49:31 edsu: edsu tries
11:53:48 edsu: so something like: $store->query('LOAD </home/ed/bzr/lcco/lcco.rdf>'); ?
11:54:05 bengee: yes, that should work
11:54:56 edsu: :-( it doesn't seem to work for me, just downloaded arc2 preview as well
11:55:06 edsu: not a big deal though
11:57:44 bengee: I think you need 'BASE </home/ed/bzr/lcco> LOAD <lcco.rdf>'
11:58:56 bengee: needs improvement, esp. security-wise
11:59:43 bengee: although I don't know if the BASE trick works through the endpoint
12:05:21 edsu: yup that works great, it's just a command line script to set up a store based on some existing rdf data
12:07:37 edsu: cool 'BASE <.> LOAD <lcco.rdf>' works too :-)
12:09:50 bengee: funny, when your code does stuff you didn't plan it to do ;)
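
The BASE trick above can be wrapped in a small helper for command line setup scripts. This is only a sketch: buildLocalLoadQuery is a hypothetical function, and the character check is a cautious assumption about what should be kept out of an IRI reference, not something ARC2 is known to enforce.

```php
<?php
// Hypothetical helper: build the 'BASE <...> LOAD <...>' query string
// used in the log for loading a local file.
function buildLocalLoadQuery($base, $file) {
    foreach (array($base, $file) as $iri) {
        // Refuse characters that could break out of the <...> delimiters.
        if (preg_match('/[<>"{}|\\\\^`\s]/', $iri)) {
            throw new InvalidArgumentException('Unsafe character in path: ' . $iri);
        }
    }
    return 'BASE <' . $base . '> LOAD <' . $file . '>';
}

$q = buildLocalLoadQuery('.', 'lcco.rdf');
echo $q . "\n";

// With ARC2 loaded and a configured store, the query would then be run as:
//   $store = ARC2::getStore($config);
//   $store->query($q);
```

As noted in the log, this was only confirmed through the query() method in a local script. Whether it also works through the endpoint is left open.
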
13:32:19 edsu: anyone know if I can bind php variables to variables in the SPARQL query when doing a query?
13:33:33 bengee: don't think so
13:34:33 bengee: sparqlscript has access to GET and POST and can set variables that can be reused in php, but that's all still very experimental
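
Since there is no variable binding, one workaround is to escape the PHP value and splice it into the query string. The following is a minimal sketch, not an ARC2 feature: sparqlStringLiteral is a hypothetical helper that applies the standard SPARQL string escapes, and whether ARC2's parser honours every one of them has not been checked here.

```php
<?php
// Hypothetical helper: turn a PHP string into a double-quoted SPARQL literal.
function sparqlStringLiteral($value) {
    // strtr() with an array replaces in a single pass, so the backslashes
    // added for quotes are not escaped a second time.
    $escaped = strtr($value, array(
        '\\' => '\\\\',
        '"'  => '\\"',
        "\n" => '\\n',
        "\r" => '\\r',
        "\t" => '\\t',
    ));
    return '"' . $escaped . '"';
}

$title = 'say "hi"'; // e.g. a value taken from user input
$query = 'SELECT ?s WHERE { ?s <http://purl.org/dc/elements/1.1/title> '
       . sparqlStringLiteral($title) . ' }';
echo $query . "\n";
```
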
13:34:44 mmmmmrob: bengee: hey
13:35:00 mmmmmrob: have profiled memory usage parsing rdf/xml on different sizes of docs
13:35:19 mmmmmrob: good news is the graph of memory usage is linear
13:35:21 mmmmmrob: :-)
13:35:26 bengee: oh, cool.
13:35:47 mmmmmrob: bengee: bad news is that it uses, on average, 14 times the size of the data
13:36:01 bengee: then it's some php ting?
13:36:04 bengee: thing
13:36:20 mmmmmrob: :-(
13:36:41 mmmmmrob: I don't see anything in what ARC2 is doing that would explain such a factor
13:36:45 bengee: could be interesting to compare php4 to php5
13:37:12 mmmmmrob: bengee: would you like the simple test I just wrote ?
13:37:16 bengee: i.e. if it's the references that need memory for internal pointers
13:37:28 bengee: yeah, could be handy
13:37:40 mmmmmrob: where would you like me to send the tarball?
13:38:00 bengee: bnowack at semsol dot com would be great
13:38:07 mmmmmrob: cool
13:38:30 bengee: thanks for taking it so far
13:43:31 mmmmmrob: bengee: little command line app on its way to you
13:43:43 bengee: cool, thx
13:43:58 mmmmmrob: bengee: I can spend a day or two looking at this this week if you can give me some pointers as to where to dig next.
13:44:33 bengee: not sure if I'll find the time, but the script will be a great starting point
13:44:51 bengee: you are sure it's happening during getSimpleIndex?
13:45:28 mmmmmrob: bengee: there are two big jumps
13:45:32 mmmmmrob: the parse
13:45:36 mmmmmrob: is the first
13:45:50 mmmmmrob: getSimpleIndex also uses a lot of memory
13:46:03 mmmmmrob: I'll write a test for that as well
16:08:53 mmmmmrob: mmmmmrob waves to bengee
16:09:06 bengee: heh, heya
16:09:09 mmmmmrob: mmmmmrob hopes bengee isn't around to hear bad news
16:09:15 mmmmmrob: damn, you're still here
16:09:28 bengee: np
16:09:38 mmmmmrob: bengee: I think I have pinned down the memory jump in parse
16:10:16 mmmmmrob: I parsed a doc, 615,542 bytes of rdf/xml
16:10:39 mmmmmrob: then did a print_r on the parser (an ARC2_RDFXMLParser)
16:10:50 mmmmmrob: I picked out the $triples array
16:11:05 mmmmmrob: and re-formatted it as an array declaration
16:11:10 mmmmmrob: and put it in an include
16:11:39 mmmmmrob: php loading that array consumes 12Mb of memory
16:12:14 bengee: interesting, how many triples were that approx?
16:12:53 mmmmmrob: mmmmmrob goes to count them...
16:13:16 mmmmmrob: bengee: 5970
16:14:25 mmmmmrob: that's the number of entries in the $triples array
16:14:50 bengee: hmm, that's 2k per triple, right?
16:16:16 mmmmmrob: bengee: hmmm, yep, 2k per triple
16:16:38 mmmmmrob: bengee: put like that it doesn't sound so much
16:17:37 bengee: are you on php 5.2?
16:17:56 mmmmmrob: PHP 5.2.4-2ubuntu5.1 with Suhosin-Patch 0.9.6.2 (cli) (built: May 9 2008 16:34:16)
16:18:00 mmmmmrob: so, yes
16:18:13 bengee: there was a mem bug, says google
16:19:31 bengee: http://bugs.php.net/bug.php?id=39438
16:20:15 bengee: but it was "Fixed in CVS HEAD and PHP_5_2" in dec 2006
16:20:19 mmmmmrob: yeah, just reading that
16:20:57 mmmmmrob: bengee: which php are you running?
16:21:22 bengee: bengee checks
16:22:08 bengee: http://bugs.php.net/bug.php?id=41053 is interesting, too
16:22:25 mmmmmrob: the array file is 2.5Mb, but loaded uses 12Mb of memory, that's some overhead
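
The "2k per triple" estimate can be checked from the figures above, and a similar per-entry cost can be measured on any PHP build with synthetic data. In this sketch "12Mb" is read as 12 * 1024 * 1024 bytes (an assumption), which gives roughly 2.1k per triple. The function bytesPerSyntheticTriple and the key names in the synthetic triples are illustrative and not taken from ARC2.

```php
<?php
// Figures quoted in the log; "12Mb" is read as mebibytes here (an assumption).
$reportedBytes = 12 * 1024 * 1024;
$tripleCount = 5970;
$perTriple = $reportedBytes / $tripleCount;
echo round($perTriple) . " bytes per triple, from the log's figures\n";

// Hypothetical measurement: build $count synthetic triples (nested arrays of
// strings) and return the average memory growth per triple.
function bytesPerSyntheticTriple($count) {
    $before = memory_get_usage();
    $triples = array();
    for ($i = 0; $i < $count; $i++) {
        $triples[] = array(
            's' => 'http://example.org/s/' . $i,
            'p' => 'http://example.org/p/' . ($i % 10),
            'o' => 'literal value number ' . $i,
            's_type' => 'uri',
            'o_type' => 'literal',
        );
    }
    $after = memory_get_usage();
    return ($after - $before) / $count;
}

echo round(bytesPerSyntheticTriple(5000)) . " bytes per synthetic triple on this PHP build\n";
```

The second number depends heavily on the PHP version and on string lengths, so it is only useful for comparing builds (for example php4 against php5) on the same machine.
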
16:23:23 bengee: bengee has 5.2.5
16:23:47 bengee: on a macbook
16:25:35 mmmmmrob: bengee: can I send you the couple of files to try it?
16:26:03 bengee: do they work w/o xdebug?
16:26:16 bengee: I don't have that installed atm
16:26:42 bengee: bengee didn't know memory_get_usage()
16:27:23 mmmmmrob: bengee: they should work fine w/o xdebug
16:28:03 bengee: then I can try it
16:28:13 bengee: I can also switch between php4 and php5
16:28:35 bengee: (on the same machine, that is. might give hints)
16:28:37 mmmmmrob: I have mailed you two files, the loader and include (array.php)
16:28:45 mmmmmrob: bengee: cool
16:29:17 mmmmmrob: bengee: I expect it is the same array overhead concern affecting the getSimpleIndex
16:37:55 mmmmmrob: hey bengee, I have to go now, but will be around tomorrow if you want to catch up. If there's anything I can do to help then let me know.
16:38:04 bengee: ok, cheers