Mailing list ARC-DEV: Archives

Re: [arc-dev] SPARQL OPTIONAL Odd Behaviour?

From: Will Daniels 
Subject: Re: [arc-dev] SPARQL OPTIONAL Odd Behaviour?
Date: Tue, 31 Mar 2009 02:48:27 +0300


Hi Benji,

I just thought I would write to update you on what happened with this 
patch. Thanks for sending the test handler; it's extremely useful and I 
found the results very interesting. I was pleased to see that I had not 
broken anything, and that the main DAWG test case for this (Nested 
Optionals - 2) passed. Unfortunately, as soon as I started to look at it 
again on Saturday, it finally occurred to me that we can't simply 
COALESCE bindings from different OPTIONAL patterns into a single field 
in the intermediate results, because the ids could then potentially 
relate to completely different 2val tables! :D
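
To make the problem concrete, here is a minimal sketch in Python with 
SQLite rather than ARC's own PHP/MySQL code. The table names (s2val, 
o2val, tmp), the column names and the sample values are all made up for 
illustration and are not taken from ARC's actual schema. It only shows 
that once two OPTIONAL bindings are merged into one id column, the bare 
id no longer says which value table it should be resolved against:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical value tables that happen to reuse the same id.
    CREATE TABLE s2val (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE o2val (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO s2val VALUES (7, 'http://example.org/thing');
    INSERT INTO o2val VALUES (7, '2007-06-16');

    -- One intermediate row: the first OPTIONAL is unbound, the second
    -- is bound to id 7, which (by assumption) came from o2val.
    CREATE TABLE tmp (opt1_id INTEGER, opt2_id INTEGER);
    INSERT INTO tmp VALUES (NULL, 7);
""")

# Merging the two bindings keeps the id but loses its origin.
merged_id = conn.execute(
    "SELECT COALESCE(opt1_id, opt2_id) FROM tmp"
).fetchone()[0]

# The same id resolves to a different value in each table.
via_s2val = conn.execute(
    "SELECT val FROM s2val WHERE id = ?", (merged_id,)
).fetchone()[0]
via_o2val = conn.execute(
    "SELECT val FROM o2val WHERE id = ?", (merged_id,)
).fetchone()[0]

print(merged_id, via_s2val, via_o2val)
```

Both lookups succeed, so nothing in the merged column tells the final 
query which of the two values is the right one.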

So I have had to do a very ugly hack for now: I promote the relevant 
columns in the temporary table to INT and use the high-order bits to 
track which value table each id belongs to. The 2val joins in the final 
result SQL are then restricted to their relevant ids by testing these 
bits in the join condition. As nasty as this is (and I'm sure it won't 
perform well), it does work and is only used when it's actually 
necessary. No doubt you will want a cleaner solution, though, and I 
think that would require your input on ARC's design goals and priorities.
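
Roughly, the idea looks like the sketch below. This is Python with 
SQLite for the sake of a self-contained example, not the patch itself. 
The 28-bit shift, the tag numbers, the tag_id helper and the table and 
column names are all assumptions chosen for illustration:

```python
import sqlite3

TAG_SHIFT = 28                   # assumption: raw ids fit in the low 28 bits
ID_MASK = (1 << TAG_SHIFT) - 1   # mask that recovers the raw id
S2VAL, O2VAL = 0, 1              # hypothetical tags, one per value table


def tag_id(table_tag, raw_id):
    """Pack a value-table tag into the high-order bits of an id."""
    return (table_tag << TAG_SHIFT) | raw_id


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s2val (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE o2val (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO s2val VALUES (7, 'http://example.org/thing');
    INSERT INTO o2val VALUES (7, '2007-06-16');
    CREATE TABLE tmp (version INTEGER);
""")

# The coalesced column stores the id together with its origin (o2val).
conn.execute("INSERT INTO tmp VALUES (?)", (tag_id(O2VAL, 7),))

# Each value-table join tests the tag bits before comparing the raw id,
# so only the table the id really came from can match.
rows = conn.execute("""
    SELECT COALESCE(s.val, o.val)
    FROM tmp
    LEFT JOIN s2val AS s
      ON (tmp.version >> :shift) = :s_tag AND (tmp.version & :mask) = s.id
    LEFT JOIN o2val AS o
      ON (tmp.version >> :shift) = :o_tag AND (tmp.version & :mask) = o.id
""", {"shift": TAG_SHIFT, "mask": ID_MASK,
      "s_tag": S2VAL, "o_tag": O2VAL}).fetchall()

print(rows)
```

The bit tests in the join conditions are likely to stop the database 
from using a plain index lookup on the id, which is one reason to expect 
this to perform poorly.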

Anyway, the DAWG test cases highlight a number of other issues where 
ARC's logic in the relational mapping is flawed. Though many of the DAWG 
cases are of course (very) marginal, there are still a fair few that 
apply to perfectly valid, natural queries, and I'd like to see about 
sorting them out. What I thought I would do is carry on with the code I 
have and fix as much as possible; then perhaps we could look at the 
results together to see what problems remain and the degree/form of 
changes needed to solve each of them, since the errors and their 
solutions share a few common factors and some are considerably more 
marginal than others.

I understand that your focus is elsewhere just now, which is partly why 
I think it's best for me to continue my investigations in the background 
and present a consolidated list of issues (hopefully also with 
solutions) in a week or two. I'm quite enjoying getting into all this, 
so I don't mind hacking away at ARC by myself for a little while, even 
if it turns out that you don't like what I come up with. What is 
important to me, and what I really do appreciate, is that you have been 
responsive and open to assistance, which is very different to my 
experience with RAP :P And of course, it's always useful to know exactly 
which SPARQL patterns ARC cannot handle and why...

Regards,
Will




Benjamin Nowack wrote:
> Hi Will,
>
> Wow, that sounds great! There is an ugly test runner (attached) 
> which can be used to check ARC against the DAWG test suite. Add 
> the TestRunner class to your ARC directory, and put the t_* files 
> into some web-accessible directory. In that dir, you'll need
> two write-enabled sub-dirs: "tmp" and "earl". The database
> settings in the t_* files have to be filled in (You'll need 2
> rdf stores, one for the test files, and one for the test data).
> The "t_load_dawg_tests" only has to be run once; it'll import
> the test suite into the first store. The "t_run_dawg_tests" will
> then let you run individual tests or the whole beast. The current
> implementation fails on 80 (or 81) of the 400-something test 
> cases.
>
> I'm totally swamped with a soon-ish product launch, sorry for
> the poor support, but I'm totally looking forward to checking
> out your patch once I'm done with the other stuff here. Getting
> the OPTIONALs right has always been a little above my head.
> Ideas and help in this area would be awesome!
>
> TA,
> Benji
>
> --
> Benjamin Nowack
> http://bnode.org/
> http://semsol.com/
>
> On 26.03.2009 02:10:19, Will Daniels wrote:
>   
>> Well, I have a patch for this now that seems to be holding up in at 
>> least the relatively simple cases I've tested so far, and appears to 
>> produce correct SQL to coalesce sibling optionals for selection, 
>> aggregation, grouping, ordering and filtering.
>>
>> However, I've been struggling to think up good test cases and started 
>> looking at the DAWG Test Cases [1], which got me wondering about whether 
>> there is some system already in place to run these (or other tests) 
>> against ARC automatically?
>>
>> In any case, more extensive testing will probably have to wait until the 
>> weekend now, so unless anybody is particularly keen to get hold of the 
>> changes sooner (in which case let me know) I will send the patch then.
>>
>> Regards,
>> Will
>>
>> [1] http://www.w3.org/2001/sw/DataAccess/tests/r2
>>
>>
>> Will Daniels wrote:
>>     
>>> O_o patches! Now there's an invitation I can never refuse ;)
>>>
>>> It just so happens I have a day off today too...if I can just get ARC 
>>> to COALESCE bindings on the same variable from sibling OPTIONALs, I 
>>> guess that would suit? I hope it doesn't turn out to be more 
>>> complicated than that because my head is quite fuzzy today :D
>>>
>>> Cheers,
>>> Will
>>>
>>>
>>> Benjamin Nowack wrote:
>>>       
>>>> Hi Will,
>>>>
>>>> Yes, you're right, sibling optionals should be fixed, although ARC's 
>>>> approach of de-normalizing graphs from the triples makes the SQL 
>>>> generation often trickier than I would have hoped it'd be. Patches are
>>>> welcome ;)
>>>>
>>>> I'm not sure if it works as expected, but for the time being, you 
>>>> *could* perhaps try something along:
>>>>
>>>> SELECT * FROM <urn:/test/optional> WHERE {
>>>>   ?id a owl:Ontology . 
>>>>   OPTIONAL { 
>>>>     ?id ?version_p ?version . 
>>>>     FILTER(?version_p = owl:versionInfo || ?version_p = dc:date)
>>>>   }
>>>> }
>>>>
>>>> This would at least decrease the LEFT JOIN dependencies that ARC often
>>>> gets wrong.
>>>>
>>>> HTH, and thx for the feedback,
>>>> Benji
>>>>
>>>> --
>>>> Benjamin Nowack
>>>> http://bnode.org/
>>>> http://semsol.com/
>>>>
>>>> On 25.03.2009 01:27:24, Will Daniels wrote:
>>>>   
>>>>         
>>>>> Hi Benji,
>>>>>
>>>>> Thanks for the prompt reply :) I think the relevant part of the spec
>>>>> is 6.1:
>>>>>
>>>>> "In an optional match, either the optional graph pattern matches a 
>>>>> graph, thereby defining and adding bindings to one or more solutions, or 
>>>>> it leaves a solution unchanged without adding any additional bindings."
>>>>>
>>>>> To my mind, "either" does not permit doing both here, and this also 
>>>>> seems most logical to me. But I'll certainly raise it on the W3C list 
>>>>> for clarification, since it does not say explicitly that the OPTIONAL 
>>>>> pattern can/should not add additional *solutions* :P
>>>>>
>>>>> Anyway, I started digging into ARC to see what it is doing, and I 
>>>>> started to see what you mean about the difficulty of implementing this 
>>>>> in a single query:
>>>>>
>>>>> SELECT ...vars... FROM jos_rdf_triple
>>>>> JOIN jos_rdf_g2t ...named graph...
>>>>> LEFT JOIN jos_rdf_triple ...optional dc:date...
>>>>> LEFT JOIN jos_rdf_g2t ...optional named graph...
>>>>> WHERE ...a owl:Ontology...
>>>>>
>>>>> UNION ALL
>>>>>
>>>>> SELECT ...vars... FROM jos_rdf_triple
>>>>> JOIN jos_rdf_g2t ...named graph...
>>>>> LEFT JOIN jos_rdf_triple ...optional owl:versionInfo...
>>>>> LEFT JOIN jos_rdf_g2t ...optional named graph...
>>>>> WHERE ...a owl:Ontology...
>>>>>
>>>>> My immediate reaction was that perhaps a better way of doing UNIONs 
>>>>> would be to map them into the join condition (as ORs) in a single [LEFT] 
>>>>> JOIN for the group graph pattern. But then I realised that it would not 
>>>>> work with all the other stuff like GraphGraphPatterns that are allowed 
>>>>> in GroupGraphPattern needing to join g2t...so without using nested 
>>>>> SELECTs I think you are right, that it would have to be a "Won't Fix" in 
>>>>> the case that this deviates from the spec :(
>>>>>
>>>>> However, I almost forgot why I had written that query in the first 
>>>>> place. I was actually going for, in the first instance, something more
>>>>> like:
>>>>>
>>>>> SELECT * FROM <urn:/test/optional> WHERE
>>>>> { ?id a owl:Ontology . OPTIONAL { ?id owl:versionInfo ?version } . 
>>>>> OPTIONAL { ?id dc:date ?version } }
>>>>>
>>>>> And that was the first error that I found (this one is definitely wrong) 
>>>>> in ARC's mapping of SPARQL to the regular relational algebra of SQL, in 
>>>>> that an unbound variable in the first OPTIONAL pattern here results in 
>>>>> an unbound variable in the solution, rather than the correct RDF 
>>>>> relational semantics whereby only a join *conflict* to the left prevents 
>>>>> the second (or subsequent right-side OPTIONAL patterns) from binding 
>>>>> ?version in the result. This one we should be able to fix I think!?
>>>>>
>>>>> Best regards,
>>>>> Will
>>>>>
>>>>>
>>>>>
>>>>> Benjamin Nowack wrote:
>>>>>     
>>>>>           
>>>>>> Hi Will,
>>>>>>
>>>>>> To be honest, I'm not sure if it's wrong or right. ARC tries to map 
>>>>>> SPARQL to a single SQL query based on (My)SQL's relational algebra.
>>>>>> This is not always possible, and may sometimes lead to unexpected 
>>>>>> results. Putting a UNION into an OPTIONAL sounds like a good candidate
>>>>>> for fuzzy results. Might be worth asking on public-sparql-dev [1] what
>>>>>> the correct results should look like, I'd be interested as well. If
>>>>>> it's wrong, however, it'll most likely be a "Won't fix" in ARC's 
>>>>>> SQL-based processor where OPTIONALs are simply translated to LEFT 
>>>>>> JOINs.
>>>>>>
>>>>>> Regards,
>>>>>> Benji
>>>>>>
>>>>>> [1] http://lists.w3.org/Archives/Public/public-sparql-dev/
>>>>>>
>>>>>> --
>>>>>> Benjamin Nowack
>>>>>> http://bnode.org/
>>>>>> http://semsol.com/
>>>>>>
>>>>>>
>>>>>> On 24.03.2009 01:54:57, Will Daniels wrote:
>>>>>>   
>>>>>>       
>>>>>>             
>>>>>>> Hello!
>>>>>>>
>>>>>>> I'm finding some behaviour in ARC2's SPARQL implementation that doesn't 
>>>>>>> look quite right to me. In certain formulations, an OPTIONAL pattern 
>>>>>>> appears to cause duplication in the results such that where the optional 
>>>>>>> pattern matches, I get two solutions, one extended with the optional 
>>>>>>> variable, and one without it.
>>>>>>>
>>>>>>> Take for example:
>>>>>>>
>>>>>>>  LOAD <http://xmlns.com/foaf/0.1/> INTO <urn:/test/optional>
>>>>>>>
>>>>>>> Then run the query:
>>>>>>>
>>>>>>>  PREFIX dc: <http://purl.org/dc/elements/1.1/>
>>>>>>>  PREFIX owl: <http://www.w3.org/2002/07/owl#>
>>>>>>>
>>>>>>>  SELECT * FROM <urn:/test/optional> WHERE
>>>>>>>  { ?id a owl:Ontology .
>>>>>>>    OPTIONAL { { ?id dc:date ?version } UNION
>>>>>>>               { ?id owl:versionInfo ?version } } }
>>>>>>>
>>>>>>> You get:
>>>>>>>
>>>>>>>  0 =>
>>>>>>>    array (
>>>>>>>      'id' => 'http://xmlns.com/foaf/0.1/',
>>>>>>>      'id type' => 'uri',
>>>>>>>      'version' => '$Date: 2007-06-16 23:18:26 $',
>>>>>>>      'version type' => 'literal',
>>>>>>>    ),
>>>>>>>  1 =>
>>>>>>>    array (
>>>>>>>      'id' => 'http://xmlns.com/foaf/0.1/',
>>>>>>>      'id type' => 'uri',
>>>>>>>    ),
>>>>>>>
>>>>>>> It seems that the unbound solution { ?id owl:versionInfo ?version } from 
>>>>>>> the alternative UNION pattern is still being used to extend the 
>>>>>>> solution, which to my interpretation of the spec is not right. I tried 
>>>>>>> this out in Virtuoso before raising the issue here, and Virtuoso seems 
>>>>>>> to agree with me...I only get the one solution.
>>>>>>>
>>>>>>> Or is there something I have misunderstood about it all?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Will
>>>>>>>