Re: [arc-dev] erdf parser
From: Robert Goené
Subject: Re: [arc-dev] erdf parser
Date: Sat, 17 May 2008 18:48:59 +0200
Hi!
I can't seem to get the code right. Could you resend the PHP file? It
seems to be in some non-text format.
Thanks!
On 16 May 2008, at 16:13, Benjamin Nowack wrote:
> On 16.05.2008 15:17:29, Robert Goené wrote:
>
>> Hi Benjamin,
>>
>> Thanks! I think the 'or' rule should be implemented, as the spec
>> suggests: if there is a label, use it as an object, else use the
>> content value.
>
> yes, the extractor does that already (in the "getCurrentObjectLiteral"
> method), but it skipped <a> tags if they didn't have an explicit
> predicate.
>
>> Could you tell me how you should implement it? I would like to
>> understand the parser a bit better and use it right away. I was
>> thinking of adjusting the following function:
>
> The added snippet is indeed just like (and appended to) the
> "img" section. (The tweaked class file is attached.):
>
> [[
> /* anchors */
> if ($n['tag'] == 'a') {
>   if (($s = $this->v('href iri', '', $n['a'])) && $ct['cur_obj_literal']['value']) {
>     $t = array(
>       's' => $s,
>       's_type' => 'iri',
>       'p' => $ct['ns']['rdfs'] . 'label',
>       'o' => $ct['cur_obj_literal']['value'],
>       'o_type' => 'literal',
>       'o_lang' => $ct['cur_obj_literal']['datatype'] ? '' : $ct['cur_obj_literal']['lang'],
>       'o_datatype' => $ct['cur_obj_literal']['datatype'],
>     );
>     $this->addT($t);
>   }
> }
> ]]
>
> Best,
> Benji
>
>> /* imgs */
>> if ($n['tag'] == 'img') {
>>   if (($s = $this->v('src iri', '', $n['a'])) && $ct['cur_obj_literal']['val']) {
>>     $t = array(
>>       's' => $s,
>>       's_type' => 'iri',
>>       'p' => $ct['ns']['rdfs'] . 'label',
>>       'o' => $ct['cur_obj_literal']['val'],
>>       'o_type' => 'literal',
>>       'o_lang' => $ct['cur_obj_literal']['dt'] ? '' : $ct['cur_obj_literal']['lang'],
>>       'o_dt' => $ct['cur_obj_literal']['dt'],
>>     );
>>     $this->addT($t);
>>   }
>> }
>>
>> Thanks in advance!
>>
>
>> On 16 May 2008, at 11:10, Benjamin Nowack wrote:
>>
>>> Hi Robert,
>>>
>>> I *think* I had support for anchors in an earlier stand-alone
>>> eRDF parser, but forgot to implement them when I switched to the
>>> extractor approach. Anyway, I'll add the label generation in the
>>> next rev. Generating two triples per anchor would need more work
>>> as the "current literal value" is generated in a separate method
>>> that prioritizes @title over plain node content. (And some people
>>> would possibly complain about triple bloat.) You'd probably have
>>> to write a dedicated (@title-ignoring) extractor for anchors.
>>>
>>> Thanks for spotting this!
>>>
>>> Cheers,
>>> Benji
>>>
>>> --
>>> Benjamin Nowack
>>> http://bnode.org/
>>>
>
>>> On 15.05.2008 23:37:13, Robert Goené wrote:
>>>
>>>> Hi!
>>>>
>>>> I am using ARC2's eRDF parser extensively and keep on discovering new
>>>> and useful ways of using embedded RDF in plain HTML all the time.
>>>>
>>>> I keep on running into the following issue: the parsing of the anchor
>>>> elements is not in conformance with the specification. The title
>>>> attribute or the element's content should be added as rdfs labels.
>>>> Without this feature, we cannot represent our everyday use of links
>>>> with RDF triples.
>>>>
>>>> The eRDF summary states the following:
>>>>
>>>> "In addition, anchors generate triples with:
>>>>
>>>> * a subject URI derived from the href attribute
>>>> * a predicate of rdfs:label
>>>> * a literal value equal to the value of the 'title' attribute
>>>>   if present, or the string-value of the anchor element's content if
>>>>   not." http://research.talis.com/2005/erdf/wiki/Main/SummaryOfTripleProductionRules
>>>>
>>>> I would even say that the title and the element's content should BOTH
>>>> produce rdfs labels, as they both are ways of describing the
>>>> hyperlink.
>>>>
>>>> WYT?
>>>>
>>>> Regards, Robert Goené
>>>>
>
> <ARC2_ErdfExtractor.php>
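
For reference, the anchor rule quoted from the eRDF summary in this thread (title if present, otherwise the element's content) can be sketched in plain PHP, independent of ARC2. This is only an illustration under stated assumptions: the function name anchor_label_triples, the RDFS_LABEL constant and the $both switch are hypothetical and not part of ARC2; the array keys simply mirror the snippets quoted above; strip_tags() is a rough stand-in for the string-value of the content; and relative href values are not resolved.

```php
<?php
// Illustrative sketch only, not ARC2 code. All names here are assumptions.

const RDFS_LABEL = 'http://www.w3.org/2000/01/rdf-schema#label';

/**
 * Build rdfs:label triples for a single anchor.
 * $both = false: the quoted rule (title if present, else content).
 * $both = true:  one label per non-empty source, as proposed in the thread.
 */
function anchor_label_triples($href, $title, $content, $both = false) {
  $href = trim((string) $href);
  if ($href === '') {
    return array(); // no href means no subject, so no triple
  }
  $title = trim((string) $title);
  // Rough approximation of the element's string-value.
  $content = trim(strip_tags((string) $content));

  $labels = array();
  if ($title !== '') {
    $labels[] = $title;
  }
  // Use the content as a fallback, or in addition when $both is set;
  // skip it when it would only duplicate the title.
  if ($content !== '' && ($both || !$labels) && !in_array($content, $labels, true)) {
    $labels[] = $content;
  }

  $triples = array();
  foreach ($labels as $label) {
    $triples[] = array(
      's' => $href,
      's_type' => 'iri',
      'p' => RDFS_LABEL,
      'o' => $label,
      'o_type' => 'literal',
    );
  }
  return $triples;
}
```

Calling anchor_label_triples('http://example.org/', 'Home', 'Start') yields a single triple labelled with the title; passing true as the fourth argument yields one triple per label, which is the "triple bloat" trade-off mentioned above.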
""" ;
ns1:returnPath "<robert@goene.nl>" ;
ns1:xOriginalTo "arc-dev@semsol.org" ;
ns1:deliveredTo "web11p1@p15192371.pureserver.info" ;
ns1:received """from ?10.0.0.54? ( [84.87.3.38])
by mx.google.com with ESMTPS id e20sm6388279fga.7.2008.05.17.09.49.02
(version=TLSv1/SSLv3 cipher=RC4-MD5);
Sat, 17 May 2008 09:49:03 -0700 (PDT)""" ;
ns1:mimeVersion "1.0 (Apple Message framework v753)" ;
ns1:inReplyTo "<PM-GA.20080516161344.C9696.2.1D@semsol.com>" ;
ns1:references "<6C8D2235-D8FB-4A62-9C9F-0A2D259B6D46@goene.nl> <PM-GA.20080516111053.D178C.1.1D@semsol.com> <02A0AC59-A439-48EC-8442-EF4365736EF1@goene.nl> <PM-GA.20080516161344.C9696.2.1D@semsol.com>" ;
ns1:contentType "text/plain; charset=US-ASCII; delsp=yes; format=flowed" ;
ns1:messageId "<7C640BA7-F96A-4284-80F2-0D39D631B0F6@goene.nl>" ;
ns1:contentTransferEncoding "7bit" ;
ns1:from "=?ISO-8859-1?Q?Robert_Goen=E9?= <robert@goene.nl>" ;
ns1:subject "Re: [arc-dev] erdf parser" ;
ns1:date "Sat, 17 May 2008 18:48:59 +0200" ;
ns1:to '"arc-dev" <arc-dev@semsol.org>' ;
ns1:xMailer "Apple Mail (2.753)" ;
ns1:xSpamCheckerVersion """SpamAssassin 2.64 (2004-01-11) on
p15192371.pureserver.info""" ;
ns1:xSpamLevel "" ;
ns1:xSpamStatus """No, hits=-4.8 required=5.0 tests=BAYES_00,HTML_MESSAGE
autolearn=ham version=2.64""" ;
ns1:xUIDL """_'k"!$VI"!4:0!!\nN!!