<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>eric.jain.name &#187; Semantic Web</title>
	<atom:link href="http://eric.jain.name/tags/semantic-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://eric.jain.name</link>
	<description>Eric Jain&#039;s Blog</description>
	<lastBuildDate>Mon, 26 Apr 2010 18:40:53 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Hiring</title>
		<link>http://eric.jain.name/2007/06/26/hiring/</link>
		<comments>http://eric.jain.name/2007/06/26/hiring/#comments</comments>
		<pubDate>Tue, 26 Jun 2007 06:44:22 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Life Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2007/06/26/hiring/</guid>
		<description><![CDATA[In the latest effort to make myself obsolete by the end of this year, we are looking for a software developer to help us better make our data available to both humans and machines. The main responsibilities of this position will be the further development of the UniProt web site and the UniProt RDF distribution.

If [...]]]></description>
			<content:encoded><![CDATA[<p>In the latest effort to make myself obsolete by the end of this year, <a href="http://expasy.org/people/swissprot.html">we</a> are looking for a software developer to help us better make our data available to both humans and machines. The main responsibilities of this position will be the further development of the <a href="http://beta.uniprot.org/">UniProt web site</a> and the <a href="http://dev.isb-sib.ch/projects/uniprot-rdf/">UniProt RDF distribution</a>.</p>
<p><span id="more-49"></span></p>
<p>If you have some experience coding in Java, have a strong interest in science and the technologies we use, thrive in <span style="text-decoration:line-through">chaotic</span> open environments, and would like to work in one of the world&#8217;s <a href="http://www.mercerhr.com/referencecontent.jhtml?idContent=1128060">best cities to live</a>, here are the <a href="http://www.isb-sib.ch/infos/careers_070625.htm">instructions for applying</a>.</p>
<p>You can meet me at <a href="http://open-bio.org/wiki/BOSC_2007">BOSC</a> and <a href="http://www.iscb.org/ismbeccb2007/">ISMB</a> next month. If you have any questions, <a href="mailto:Eric.Jain@isb-sib.ch">contact me</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2007/06/26/hiring/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Compare Ontology Versions</title>
		<link>http://eric.jain.name/2007/06/07/compare-ontology-versions/</link>
		<comments>http://eric.jain.name/2007/06/07/compare-ontology-versions/#comments</comments>
		<pubDate>Thu, 07 Jun 2007 16:09:31 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2007/06/07/compare-ontology-versions/</guid>
		<description><![CDATA[There is a rather useful (but perhaps somewhat hidden) plug-in for Protege that can be used to compare two versions of an OWL file: PROMPT.

The plug-in is installed by default in Protege, but needs to be enabled via Project > Configure > Tab Widgets > PromptTab.
Here&#8217;s a screenshot showing differences in the UniProt &#8220;core&#8221; ontology [...]]]></description>
			<content:encoded><![CDATA[<p>There is a rather useful (but perhaps somewhat hidden) plug-in for <a href="http://protege.stanford.edu/">Protege</a> that can be used to compare two versions of an OWL file: <a href="http://protege.cim3.net/cgi-bin/wiki.pl?Prompt">PROMPT</a>.</p>
<p><span id="more-47"></span></p>
<p>The plug-in is installed by default in Protege, but needs to be enabled via Project > Configure > Tab Widgets > PromptTab.</p>
<p>Here&#8217;s a screenshot showing differences in the <a href="http://dev.isb-sib.ch/projects/uniprot-rdf/owl/">UniProt &#8220;core&#8221; ontology</a> from release <a href="http://beta.uniprot.org/news/2007/05/15/release">10.5</a> to <a href="http://beta.uniprot.org/news/2007/05/29/release">11.0</a>.</p>
<p><img src="./screenshot.png" alt=""/></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2007/06/07/compare-ontology-versions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>N-Triple Converter Comparison</title>
		<link>http://eric.jain.name/2007/03/12/n-triple-converter-comparison/</link>
		<comments>http://eric.jain.name/2007/03/12/n-triple-converter-comparison/#comments</comments>
		<pubDate>Mon, 12 Mar 2007 15:34:26 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2007/03/12/n-triple-converter-comparison/</guid>
		<description><![CDATA[In order to bulk-load RDF data into Oracle (Spatial) 11g, the data needs to be converted to N-Triples first. If the data set is large, this step can add quite a bit of overhead, which is why I decided to benchmark and compare several options.

For the comparison, the taxonomy.rdf.gz file from UniProt release 10.0 was [...]]]></description>
			<content:encoded><![CDATA[<p>In order to bulk-load RDF data into Oracle (Spatial) 11g, the data needs to be converted to <a href="http://www.w3.org/2001/sw/RDFCore/ntriples/">N-Triples</a> first. If the data set is large, this step can add quite a bit of overhead, which is why I decided to benchmark and compare several options.</p>
<p><span id="more-42"></span></p>
<p>For the comparison, the <code><a href="ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/taxonomy.rdf.gz">taxonomy.rdf.gz</a></code> file from UniProt release 10.0 was used. This file is about 127 MB uncompressed. The comparison was run on a (slightly obsolete) Itanium machine with plenty of RAM.</p>
<h3>Raptor</h3>
<p>The first tool I tried was <a href="http://librdf.org/raptor/">Raptor</a> (1.4.14). After the usual <code>configure</code> &#8212; <code>make</code> &#8212; <code>make install</code> the conversion can be run like so:</p>
<pre>
zcat taxonomy.rdf.gz | rapper -e -o ntriples - file://taxonomy.rdf.gz# > taxonomy.nt
</pre>
<p>This completed in 38.9, 38.8 and 38.9 seconds over three consecutive runs.</p>
<p>The <code>-e</code> flag turns off validation. This doesn&#8217;t seem to have a measurable impact on performance, but is necessary to avoid erroneous &#8220;Duplicated rdf:ID value&#8221; errors (at least in another, larger file).</p>
<p>The data is decompressed on the fly to save time (and disk space).</p>
<h3>Jena</h3>
<p>Next I tried <a href="http://jena.sourceforge.net/">Jena</a> (2.5.2). After adding all the jars to the classpath, the conversion was run like so:</p>
<pre>
zcat taxonomy.rdf.gz | java jena.rdfparse -b file://taxonomy.rdf.gz# -x - > taxonomy.nt
</pre>
<p>This completed in 2:38, 2:40 and 2:38 min.</p>
<p>This would be too slow for converting the entire <a href="http://dev.isb-sib.ch/projects/uniprot-rdf/">UniProt RDF data set</a> within a reasonable time, but at least I got a (correct) warning about a bad URI that I hadn&#8217;t been aware of&#8230;</p>
<p>The JVM is <a href="http://dev2dev.bea.com/jrockit/">JRockit</a> (5.0 R27.1) with default parameters (I tried some variations such as adding <code>-Xgcprio:throughput</code>, but didn&#8217;t see any significant change).</p>
<h3>Rio</h3>
<p>Last, I tried <a href="http://www.openrdf.org/">Rio</a> (1.0.9), another Java parser. Rio doesn&#8217;t seem to include a command line tool for conversion, but it&#8217;s not a lot of code:</p>
<pre>
import java.io.IOException;
import org.openrdf.model.Resource;
import org.openrdf.model.URI;
import org.openrdf.model.Value;
import org.openrdf.rio.Parser;
import org.openrdf.rio.StatementHandler;
import org.openrdf.rio.StatementHandlerException;
import org.openrdf.rio.ntriples.NTriplesWriter;
import org.openrdf.rio.rdfxml.RdfXmlParser;

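// Converts RDF/XML read from standard input to N-Triples on standard output.
// args[0] is the base URI passed to the parser.
public class Converter
{
	public static void main(String[] args)
		throws Exception
	{
		Parser parser = new RdfXmlParser();
		final NTriplesWriter writer = new NTriplesWriter(System.out);
		writer.startDocument();
		// Pass each statement to the writer as soon as the parser reports it.
		parser.setStatementHandler(new StatementHandler()
		{
			public void handleStatement(Resource s, URI p, Value o)
				throws StatementHandlerException
			{
				try
				{
					writer.writeStatement(s, p, o);
				}
				catch (IOException e)
				{
					// handleStatement only declares StatementHandlerException, so wrap the I/O error.
					throw new StatementHandlerException(e);
				}
			}
		});
		parser.parse(System.in, args[0]);
		writer.endDocument();
	}
}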
</pre>
<pre>
zcat taxonomy.rdf.gz | java Converter file://taxonomy.rdf.gz# > taxonomy.nt
</pre>
<p>This ran in 49.8, 49.5 and 50.0 seconds.</p>
<p>Using buffered readers or writers seemed to decrease performance slightly, so I assume the streams are already being buffered.</p>
<h3>Conclusion</h3>
<p>The conversion can be done fastest with Raptor. Rio is the best choice if you need to set up a platform-independent procedure (e.g. integrated into an Ant build). Jena is best if you also need to check the data :-)</p>
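<p>To put the three sets of timings side by side, here is a small sketch that averages the runs reported above and expresses each tool&#8217;s mean time relative to Raptor. It is not part of the original benchmark: the class name <code>TimingSummary</code> and its helper methods are made up for illustration, and the Jena times (2:38, 2:40 and 2:38 min) are entered as 158, 160 and 158 seconds.</p>

```java
import java.util.Locale;

public class TimingSummary {
    // Arithmetic mean of the wall-clock times (in seconds) of repeated runs.
    public static double mean(double[] seconds) {
        double sum = 0.0;
        for (double s : seconds) {
            sum += s;
        }
        return sum / seconds.length;
    }

    // One report line: tool name, mean time, and slowdown relative to the baseline mean.
    public static String line(String tool, double[] seconds, double baseline) {
        double m = mean(seconds);
        return String.format(Locale.ROOT, "%-6s %6.1f s  %.1fx", tool, m, m / baseline);
    }

    public static void main(String[] args) {
        // Timings copied from the sections above, converted to seconds.
        double[] raptor = {38.9, 38.8, 38.9};
        double[] jena = {158.0, 160.0, 158.0}; // 2:38, 2:40, 2:38 min
        double[] rio = {49.8, 49.5, 50.0};

        // Raptor was the fastest tool, so it serves as the baseline.
        double baseline = mean(raptor);
        System.out.println(line("Raptor", raptor, baseline));
        System.out.println(line("Jena", jena, baseline));
        System.out.println(line("Rio", rio, baseline));
    }
}
```

<p>With these numbers, Jena takes roughly four times as long as Raptor, and Rio roughly 1.3 times as long.</p>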
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2007/03/12/n-triple-converter-comparison/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Metalink for UniProt RDF</title>
		<link>http://eric.jain.name/2007/03/07/metalink-for-uniprot-rdf/</link>
		<comments>http://eric.jain.name/2007/03/07/metalink-for-uniprot-rdf/#comments</comments>
		<pubDate>Wed, 07 Mar 2007 01:37:01 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Life Science]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2007/03/07/metalink-for-uniprot-rdf/</guid>
		<description><![CDATA[The UniProt RDF distribution is over 5 GB in size. To help people retrieve the data more efficiently, we now mirror the data and provide a Metalink file that describes all the file locations.

Using aria2 &#8212; a command line download client that supports the Metalink standard &#8212; you can do:

aria2c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/RELEASE.metalink

This will retrieve the data from all [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://dev.isb-sib.ch/projects/uniprot-rdf/">UniProt RDF</a> distribution is over 5 GB in size. To help people retrieve the data more efficiently, we now mirror the data and provide a <a href="ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/RELEASE.metalink">Metalink</a> file that describes all the file locations.</p>
<p><span id="more-41"></span></p>
<p>Using <a href="http://aria2.sourceforge.net/">aria2</a> &#8212; a command line download client that supports the <a href="http://www.metalinker.org/">Metalink standard</a> &#8212; you can do:</p>
<pre>
aria2c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/RELEASE.metalink
</pre>
<p>This will retrieve the data from all available mirrors in parallel, piece the files together, and verify the transferred data with checksums.</p>
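<p>If you fetch a file some other way and want to repeat the checksum step by hand, something along these lines should do. This is only a sketch: it assumes a SHA-1 digest (the hash type actually listed in the Metalink file may differ), and the class name <code>ChecksumCheck</code> and its two command line arguments are hypothetical.</p>

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class ChecksumCheck {
    // Streams the file through the digest so that large files need not fit in memory,
    // then returns the digest as a lowercase hex string.
    public static String hexDigest(Path file, String algorithm) throws Exception {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path of the downloaded file; args[1]: expected hex digest (assumed inputs).
        String actual = hexDigest(Paths.get(args[0]), "SHA-1");
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
    }
}
```

<p>Run it as <code>java ChecksumCheck</code> followed by the file name and the expected digest.</p>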
<p>If you prefer graphical user interfaces, there is another tool called <a href="http://dfast.sourceforge.net/">wxDownload Fast</a>. Be sure to use the latest version (0.5.5) &#8212; previous versions wouldn&#8217;t do parallel downloads from some servers due to a small <a href="http://sourceforge.net/tracker/index.php?func=detail&#038;aid=1674258&#038;group_id=106901&#038;atid=645951">bug</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2007/03/07/metalink-for-uniprot-rdf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RDFRoom</title>
		<link>http://eric.jain.name/2006/04/01/rdfroom/</link>
		<comments>http://eric.jain.name/2006/04/01/rdfroom/#comments</comments>
		<pubDate>Sat, 01 Apr 2006 13:57:27 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Humor]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2006/04/01/rdfroom/</guid>
		<description><![CDATA[We&#8217;ve been looking for a decent ontology editor for a while now. The problem is that most editors are either too technical or too cumbersome to use for entering a lot of data. But it looks like we have finally found something suitable!
 RDFRoom is a graphical RDF tool. It seems to work quite well [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve been looking for a decent ontology editor for a while now. The problem is that most editors are either too technical or too cumbersome to use for entering a lot of data. But it looks like we have finally found something suitable!</p>
<p><span id="more-21"></span> <a href="http://www.dfki.uni-kl.de/~grimnes/2006/03/RDFRoom/">RDFRoom</a> is a graphical RDF tool. It seems to work quite well (it&#8217;s written in Python) and performance is acceptable, at least for smaller graphs. Here&#8217;s a screenshot:</p>
<blockquote><p><a class="imagelink" title="RDFRoom" href="screenshot.png"><img width="128" height="66" id="image20" alt="RDFRoom" src="screenshot.thumbnail.png" /></a></p></blockquote>
<p>As you can see, I&#8217;m just deleting some outdated data&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2006/04/01/rdfroom/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data and Reality</title>
		<link>http://eric.jain.name/2006/02/06/data-and-reality/</link>
		<comments>http://eric.jain.name/2006/02/06/data-and-reality/#comments</comments>
		<pubDate>Mon, 06 Feb 2006 18:08:57 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Review]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2006/02/06/data-and-reality/</guid>
		<description><![CDATA[Brief review of Data and Reality by William Kent. This book was written in 1978, but is still remarkably relevant in many ways.

After going through all the different terms that people use when talking about data (some of these terms have fallen out of fashion, others are still in use), Kent points out inconsistencies and [...]]]></description>
			<content:encoded><![CDATA[<p>Brief review of <a href="http://www.amazon.com/gp/product/1585009709">Data and Reality</a> by William Kent. This book was written in 1978, but is still remarkably relevant in many ways.</p>
<p><span id="more-15"></span></p>
<p>After going through all the different terms that people use when talking about data (some of these terms have fallen out of fashion, others are still in use), Kent points out inconsistencies and limitations of the relational and hierarchical data models:</p>
<blockquote><p>the data processing community has evolved a number of models in which to express descriptions of reality. These models are highly structured, rigid, and simplistic, being amenable to economic processing by computer. [...] Some members of that community have been so overwhelmed by the success of a certain technology for processing data that they have confused this technology with the natural semantics of information.</p></blockquote>
<p>Of course Kent is full of understanding for those poor, misguided souls:</p>
<blockquote><p>The builders and users of today&#8217;s commercial systems quite justifiably want to avoid cluttering their systems with anything that may impair efficiency and productivity. The argument that this new approach will make the overall management of data more productive in the long run has yet to be convincingly demonstrated to them.</p></blockquote>
<p>He predicts that data integration will be the killer application for a more sophisticated data model:</p>
<blockquote><p>The need for a more descriptive model will only gradually achieve general recognition. It will come from the headaches of trying to crunch together the diverse formats and data structures used by growing families of applications operating on the same integrated data base.</p></blockquote>
<p>Kent then goes on to outline an ideal data model (graph-based). Nevertheless he manages to remain realistic:</p>
<blockquote><p>Perhaps it is inevitable that tools and theories never quite match. There are some opposite qualities inherent in them. [...] Thus the truth of things may be this: useful things get done by tools which are an amalgam of fragments and theories. Those are the kinds of tools whose production and maintenance expense can be justified. Theories are helpful to gain understanding, which may lead to the better design of better tools.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2006/02/06/data-and-reality/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Call for Better Information Retrieval Systems</title>
		<link>http://eric.jain.name/2006/01/27/call-for-better-information-retrieval-systems/</link>
		<comments>http://eric.jain.name/2006/01/27/call-for-better-information-retrieval-systems/#comments</comments>
		<pubDate>Fri, 27 Jan 2006 03:45:00 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Usability]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2006/01/27/call-for-better-information-retrieval-systems/</guid>
		<description><![CDATA[From a recent review article in Nature Reviews Genetics:
[...] current ad hoc IR systems are not able to retrieve our example sentence when they are given the query &#8216;yeast cell cycle&#8217;. Instead, this could be achieved by realizing that &#8216;yeast&#8217; is a synonym for S. cerevisiae, that &#8216;cell cycle&#8217; is a Gene Ontology term, that the [...]]]></description>
			<content:encoded><![CDATA[<p>From a recent <a href="http://dx.doi.org/10.1038/nrg1768">review article</a> in Nature Reviews Genetics:</p>
<blockquote><p>[...] current ad hoc IR systems are not able to retrieve our example sentence when they are given the query &#8216;yeast cell cycle&#8217;. Instead, this could be achieved by realizing that &#8216;yeast&#8217; is a synonym for S. cerevisiae, that &#8216;cell cycle&#8217; is a Gene Ontology term, that the word &#8216;Cdc28&#8217; refers to an S. cerevisiae protein and finally, by looking up the Gene Ontology terms that relate to Cdc28 to connect it to the yeast cell cycle. Although this will not be easy, we see this form of query expansion as the next logical step for ad hoc IR.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2006/01/27/call-for-better-information-retrieval-systems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vivisimo</title>
		<link>http://eric.jain.name/2006/01/07/vivisimo/</link>
		<comments>http://eric.jain.name/2006/01/07/vivisimo/#comments</comments>
		<pubDate>Sat, 07 Jan 2006 03:59:22 +0000</pubDate>
		<dc:creator>Eric Jain</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Usability]]></category>

		<guid isPermaLink="false">http://eric.jain.name/2006/02/05/vivisimo/</guid>
		<description><![CDATA[Vivisimo has set up a new site for searching content from life-science-related journals and databases &#8211; though none of ours so far.

One of the interesting features of this search engine is that it attempts to cluster results by topic. This seems to work quite well for medical conditions (e.g. &#8220;stroke&#8221;), but is less suitable for [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://vivisimo.com/">Vivisimo</a> has set up a <a href="http://biometacluster.com/">new site</a> for searching content from life-science-related journals and databases &ndash; though none of ours so far.</p>
<p><span id="more-11"></span></p>
<p>One of the interesting features of this search engine is that it attempts to cluster results by topic. This seems to work quite well for medical conditions (e.g. &#8220;stroke&#8221;), but is less suitable for distinguishing words used in different contexts (e.g. gene names from words). The ranking also needs some work, and a search for &#8220;uniprot&#8221; results in &#8220;Did you mean uniprost&#8221;&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.jain.name/2006/01/07/vivisimo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
