This is a public chat log generated from the #semsol IRC channel.
07:54:51
mmmmmrob waves to bengee
07:55:02
heya
07:55:29
just been profiling memory usage on our app
07:55:45
as we're hitting 16Mb with fairly small amounts of rdf/xml
07:56:46
I managed to shave a couple of seconds off getSimpleIndex, but not any memory (well, 100k maybe)
07:56:56
s/100k/100 bytes/
07:57:38
bengee: the print_r seems to be expensive
07:58:48
ah, interesting
07:58:56
could replace it
07:59:13
I shaved 2 seconds off a page generation of 11 seconds
07:59:20
--- ARC2.php (revision 30236)
07:59:21
+++ ARC2.php (working copy)
07:59:22
@@ -124,7 +124,6 @@
07:59:25
function getSimpleIndex($triples, $flatten_objects = 1, $vals = '') {
07:59:26
$r = array();
07:59:27
- $added = array();
07:59:29
foreach ($triples as $t) {
07:59:30
$skip_t = 0;
07:59:31
foreach (array('s', 'p', 'o') as $term) {
07:59:32
@@ -163,10 +162,19 @@
07:59:34
$o[$suffix] = $t['o ' . $suffix];
07:59:35
}
07:59:37
}
07:59:39
- $id = $s . ' ' . $p . ' ' . print_r($o, 1);
07:59:41
- if (!isset($added[$id])) {
07:59:43
+ if (!isset($r[$s][$p])) {
07:59:45
$r[$s][$p][] = $o;
07:59:47
- $added[$id] = 1;
07:59:49
+ } else {
07:59:51
+ $in_already = false;
07:59:53
+ foreach($r[$s][$p] as $o_in_already) {
07:59:55
+ if ($o == $o_in_already) {
07:59:57
+ $in_already = true;
07:59:59
+ break;
08:00:01
+ }
08:00:03
+ }
08:00:05
+ if (!$in_already) {
08:00:07
+ $r[$s][$p][] = $o;
08:00:09
+ }
08:00:11
}
08:00:13
}
08:00:17
}
08:00:19
though my changes may break things :-(
08:00:30
but the memory delta is what's confusing me
08:01:42
the added[id] is clearly getting quite large for large object literals
08:01:44
getSimpleIndex on less than 1Mb of rdf/xml jumps the memory up by 4Mb
08:02:15
I should hash that, or simply skip dupe checks
08:02:18
that's just the getSimpleIndex, not the parse
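A minimal sketch of the "hash that" idea: keep the dedupe map, but key it on a fixed-size md5 of the serialized triple instead of the full print_r text, so large object literals no longer inflate $added. The function name and the simplified triple shape (keys s, p, o) are assumptions for illustration, not ARC2's actual code.

```php
<?php
// Hypothetical stand-in for the dedupe step in getSimpleIndex().
// Assumes each triple looks like array('s' => ..., 'p' => ..., 'o' => array(...)).
function build_index_with_hashed_dedupe(array $triples) {
    $r = array();
    $added = array();
    foreach ($triples as $t) {
        // md5() always yields 32 hex characters, however large the object is.
        $id = md5($t['s'] . ' ' . $t['p'] . ' ' . serialize($t['o']));
        if (!isset($added[$id])) {
            $r[$t['s']][$t['p']][] = $t['o'];
            $added[$id] = 1;
        }
    }
    return $r;
}
```

The trade-off is one md5 call per triple and a theoretical chance of hash collisions. Note also that serialize() is sensitive to key order, so two objects only count as duplicates if they were built the same way.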
08:02:53
bengee: can you not use in_array or something like above to do the dupe checking on $r?
08:03:26
just checking
08:03:55
didn't know that needle can be an array, tbh
08:04:15
bengee: it may not be able to be
08:04:21
mmmmmrob is not sure
08:04:33
php 4.2.0
08:04:48
that's deployed enough, I guess
08:05:00
we're on 5
08:05:38
arc has 4.3 as minimum in the requirements
08:05:47
so, I can change the code
08:05:51
cheers :)
08:06:59
bengee: no probs
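The in_array suggestion above could look roughly like this. It is a sketch only: add_object_once is a hypothetical helper, and the object shape is assumed. It uses strict mode, since a loose comparison would treat numeric-looking literals such as "1" and "01" as equal.

```php
<?php
// Hypothetical helper: append $o under $index[$s][$p] unless an identical
// object array is already there. The third argument to in_array() turns on
// strict (===) comparison.
function add_object_once(array &$index, $s, $p, array $o) {
    if (!isset($index[$s][$p]) || !in_array($o, $index[$s][$p], true)) {
        $index[$s][$p][] = $o;
    }
}

$index = array();
add_object_once($index, 'ex:s', 'ex:p', array('value' => '1', 'type' => 'literal'));
add_object_once($index, 'ex:s', 'ex:p', array('value' => '1', 'type' => 'literal'));
add_object_once($index, 'ex:s', 'ex:p', array('value' => '01', 'type' => 'literal'));
echo count($index['ex:s']['ex:p']), "\n";
```

This drops the separate $added map entirely, at the cost of a linear scan over the objects already stored for each subject/predicate pair.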
08:07:16
bengee: what can I do to work out more about the memory usage?
08:08:42
I used to use xdebug, not sure if it's helpful wrt memory profiling, though
08:09:01
I have xdebug doing traces and reporting memory deltas
08:09:37
8.8782 9202508 +0 -> ARC2_RDFParser->getSimpleIndex() /projects/zephyr/src/views/list.php:55
08:09:39
8.8783 9202508 +0 -> ARC2_RDFXMLParser->getTriples() /projects/zephyr/lib/arc/parsers/ARC2_RDFParser.php:88
08:09:40
8.8783 9202508 +0 -> ARC2_Class->v() /projects/zephyr/lib/arc/parsers/ARC2_RDFXMLParser.php:113
08:09:41
8.8783 9202508 +0 -> is_array() /projects/zephyr/lib/arc/ARC2_Class.php:35
08:09:43
8.8837 9593200 +390692 -> ARC2->getSimpleIndex() /projects/zephyr/lib/arc/parsers/ARC2_RDFParser.php:88
08:09:44
9.0715 13259808 +3666608 -> memory_get_usage() /projects/zephyr/src/views/list.php:56
08:09:45
9.0719 13259848 +40 -> number_format() /projects/zephyr/src/views/list.php:56
08:09:50
oooh, sorry, that wasn't attractive
08:10:31
the increase on getSimpleIndex appears to be 3.6Mb + 400Kb
08:10:50
but having read the code I don't see where that's going.
08:11:56
do you know any way of dumping some kind of memory map? so I could see what variables are using the memory?
08:15:39
bengee tries getrusage(), can't make much sense of it
08:16:54
oooh
08:16:59
mmmmmrob goes to look
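Without xdebug, a rough way to get the same kind of delta is to bracket a call with memory_get_usage(). This is a sketch, not a per-variable memory map: measure_memory_delta and build_sample_triples are hypothetical names, and the sample data is made up.

```php
<?php
// Hypothetical helper: run a named function and report how much memory
// memory_get_usage() says was gained while its result is still held.
function measure_memory_delta($function_name) {
    $before = memory_get_usage();
    $result = call_user_func($function_name);
    $after = memory_get_usage();
    return array('bytes' => $after - $before, 'result' => $result);
}

// Sample workload (an assumption): 1000 small triple-like arrays.
function build_sample_triples() {
    $triples = array();
    for ($i = 0; $i < 1000; $i++) {
        $triples[] = array(
            's' => 'http://example.org/s/' . $i,
            'p' => 'http://example.org/p',
            'o' => 'value ' . $i,
        );
    }
    return $triples;
}

$m = measure_memory_delta('build_sample_triples');
printf("%d triples, %s bytes\n", count($m['result']), number_format($m['bytes']));
```

Because the result is still referenced when the second reading is taken, the delta mostly reflects the size of the returned structure, not temporary allocations freed along the way.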
08:18:51
hello everyone
08:20:03
hi mustang
08:21:22
mmmmmrob, I guess ARC is not passing a triples array reference to getSimpleIndex, i.e. at least the triples array is duplicated
08:23:38
bengee: yeah, I wondered about that too
08:24:35
although the triples copy shouldn't stay in mem once the index is built and returned
08:24:55
bengee: changing that to a reference drops about 200k off the total, so nothing massive just as you'd expect
08:37:00
hey bengee - thanks a million for the parcel! Much appreciated.
08:37:51
ah, np :)
08:39:31
almost kept it for myself, I liked the reflecting cube ;)
08:56:42
me too - definitely going in a video that one ;-)
09:35:51
anyone know a Triple - Object mapper in PHP? (to work with ARC)
09:36:27
heh, mustang just asked for that, too
09:46:16
asked me too, offchannel ... i suggested better to ask in public
10:15:04
mmmmmrob did you write one?
11:23:09
kwijibo: what? a triple to object mapper?
11:23:45
mmmmmrob yeah
11:24:36
kwijibo: someone else here did, but we're not using it anymore
11:24:50
ok, ta
11:25:12
kwijibo: I'm not sure I like the approach much, the reasons I originally wanted one were misguided
11:25:50
mmmmmrob: maybe easy to write one, hard to write one you actually like much
11:26:34
kwijibo: yeah, exactly - or one that makes sense in an open-world assumption
11:42:04
there's no way to LOAD a local file, rather than remote URI is there?
11:42:51
no, you can just use a relative path
11:43:07
you may have to specify a base
11:44:13
(where the base is </some/local/path/>)
11:47:10
I think the endpoint works with http uris/paths only, but the query() method can handle rel/local paths
11:49:31
edsu tries
11:53:48
so something like: $store->query('LOAD </home/ed/bzr/lcco/lcco.rdf>'); ?
11:54:05
yes, that should work
11:54:56
:-( it doesn't seem to work for me, just downloaded arc2 preview as well
11:55:06
not a big deal though
11:57:44
I think you need 'BASE </home/ed/bzr/lcco> LOAD <lcco.rdf>'
11:58:56
needs improvement, esp. security-wise
11:59:43
although I don't know if the BASE trick works through the endpoint
12:05:21
yup that works great, it's just a command line script to set up a store based on some existing rdf data
12:07:37
cool 'BASE <.> LOAD <lcco.rdf>' works too :-)
12:09:50
funny, when your code does stuff you didn't plan it to do ;)
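For a command line setup script like the one described, the BASE/LOAD string can be derived from a local path. This is a sketch under assumptions: local_load_query is a hypothetical helper, it uses the trailing-slash base form mentioned earlier, and it only builds the string that would then be handed to $store->query(). Given the security remark above, it refuses paths containing angle brackets or whitespace, and the path should not come from untrusted input.

```php
<?php
// Hypothetical helper: turn a local file path into a
// 'BASE <dir/> LOAD <file>' string for the store's query() method.
function local_load_query($path) {
    // Reject characters that could break out of the <...> delimiters.
    if (preg_match('/[<>\s]/', $path)) {
        throw new InvalidArgumentException('Unsafe characters in path');
    }
    $dir = rtrim(dirname($path), '/') . '/';
    return 'BASE <' . $dir . '> LOAD <' . basename($path) . '>';
}

$q = local_load_query('/home/ed/bzr/lcco/lcco.rdf');
echo $q, "\n";
// The string would then be passed on as: $store->query($q);
```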
13:32:19
anyone know if I can bind php variables to variables in the SPARQL query when doing a query?
13:33:33
don't think so
13:34:33
sparqlscript has access to GET and POST and can set variables that can be reused in php, but that's all still very experimental
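Since there is no binding mechanism, one workaround is to build the query string in PHP and escape the value first. This is a sketch, assuming plain string literals and the standard SPARQL escape sequences; sparql_string_literal is a hypothetical helper, and whether ARC's parser accepts every escape is not confirmed here.

```php
<?php
// Hypothetical helper: wrap a PHP string as a double-quoted SPARQL literal.
// strtr() with an array replaces in a single pass, so the backslashes it
// adds are never escaped a second time.
function sparql_string_literal($value) {
    $escaped = strtr($value, array(
        "\\" => "\\\\",
        '"'  => '\\"',
        "\n" => '\\n',
        "\r" => '\\r',
        "\t" => '\\t',
    ));
    return '"' . $escaped . '"';
}

$title = 'say "hi"';
$query = 'SELECT ?s WHERE { ?s <http://purl.org/dc/elements/1.1/title> '
       . sparql_string_literal($title) . ' }';
echo $query, "\n";
```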
13:34:44
bengee: hey
13:35:00
have profiled memory usage parsing rdf/xml on different sizes of docs
13:35:19
good news is the graph of memory usage is linear
13:35:21
:-)
13:35:26
oh, cool.
13:35:47
bengee: bad news is that it uses, on average, 14 times the size of the data
13:36:01
then it's some php ting?
13:36:04
thing
13:36:20
:-(
13:36:41
I don't see anything in what ARC2 is doing that would explain such a factor
13:36:45
could be interesting to compare php4 to php5
13:37:12
bengee: would you like the simple test I just wrote ?
13:37:16
i.e. if it's the references that need memory for internal pointers
13:37:28
yeah, could be handy
13:37:40
where would you like me to send the tarball?
13:38:00
bnowack at semsol dot com would be great
13:38:07
cool
13:38:30
thanks for taking it so far
13:43:31
bengee: little command line app on its way to you
13:43:43
cool, thx
13:43:58
bengee: I can spend a day or two looking at this this week if you can give me some pointers as to where to dig next.
13:44:33
not sure if I'll find the time, but the script will be a great starting point
13:44:51
you are sure it's happening during getSimpleIndex?
13:45:28
bengee: there are two big jumps
13:45:32
the parse
13:45:36
is the first
13:45:50
getSimpleIndex also uses a lot of memory
13:46:03
I'll write a test for that as well
16:08:53
mmmmmrob waves to bengee
16:09:06
heh, heya
16:09:09
mmmmmrob hopes bengee isn't around to hear bad news
16:09:15
damn, you're still here
16:09:28
np
16:09:38
bengee: I think I have pinned down the memory jump in parse
16:10:16
I parsed a doc, 615,542 bytes of rdf/xml
16:10:39
then did a print_r on the parser (an ARC2_RDFXMLParser)
16:10:50
I picked out the $triples array
16:11:05
and re-formatted it as an array declaration
16:11:10
and put it in an include
16:11:39
php loading that array consumes 12Mb of memory
16:12:14
interesting, how many triples was that, approx?
16:12:53
mmmmmrob goes to count them...
16:13:16
bengee: 5970
16:14:25
that's the number of entries in the $triples array
16:14:50
hmm, that's 2k per triple, right?
16:16:16
bengee: hmmm, yep, 2k per triple
16:16:38
bengee: put like that it doesn't sound so much
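The arithmetic behind "2k per triple", as a quick check. It assumes "12Mb" means 12 * 1024 * 1024 bytes; the other two figures are taken from the messages above. The function name is made up for the example.

```php
<?php
// Hypothetical helper: per-triple cost and expansion factor relative to
// the size of the source document.
function memory_overhead($loaded_bytes, $source_bytes, $triple_count) {
    return array(
        'bytes_per_triple' => $loaded_bytes / $triple_count,
        'expansion_factor' => $loaded_bytes / $source_bytes,
    );
}

$stats = memory_overhead(12 * 1024 * 1024, 615542, 5970);
printf("%.0f bytes per triple, %.1fx the rdf/xml size\n",
    $stats['bytes_per_triple'], $stats['expansion_factor']);
```

Under that assumption this comes to roughly 2.1k per triple and about 20 times the size of the rdf/xml for this particular document.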
16:17:37
are you on php 5.2?
16:17:56
PHP 5.2.4-2ubuntu5.1 with Suhosin-Patch 0.9.6.2 (cli) (built: May 9 2008 16:34:16)
16:18:00
so, yes
16:18:13
there was a mem bug, says google
16:19:31
http://bugs.php.net/bug.php?id=39438
16:20:15
but it was "Fixed in CVS HEAD and PHP_5_2" in dec 2006
16:20:19
yeah, just reading that
16:20:57
bengee: which php are you running?
16:21:22
bengee checks
16:22:08
http://bugs.php.net/bug.php?id=41053 is interesting, too
16:22:25
the array file is 2.5Mb, but loaded uses 12Mb of memory, that's some overhead
16:23:23
bengee has 5.2.5
16:23:47
on a macbook
16:25:35
bengee: can I send you the couple of files to try it?
16:26:03
do they work w/o xdebug?
16:26:16
I don't have that installed atm
16:26:42
bengee didn't know memory_get_usage()
16:27:23
bengee: they should work fine w/o xdebug
16:28:03
then I can try it
16:28:13
I can also switch between php4 and php5
16:28:35
(on the same machine, that is. might give hints)
16:28:37
I have mailed you two files, the loader and include (array.php)
16:28:45
bengee: cool
16:29:17
bengee: I expect it is the same array overhead concern affecting getSimpleIndex
16:37:55
hey bengee, I have to go now, but will be around tomorrow if you want to catch up. If there's anything I can do to help then let me know.
16:38:04
ok, cheers
