In order to bulk-load RDF data into Oracle (Spatial) 11g, the data needs to be converted to N-Triples first. If the data set is large, this step can add quite a bit of overhead, which is why I decided to benchmark and compare several options.
For the comparison, the <a href="ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/taxonomy.rdf.gz">taxonomy.rdf.gz</a>
file from UniProt release 10.0 was used. This file is about 127M large (uncompressed). The machine on which the comparison was run is a (slightly obsolete) Itanium machine, with plenty of RAM.
Raptor
The first tool I tried was Raptor (1.4.14). After the usual configure
– make
– make install
the conversion can be run like so:
zcat taxonomy.rdf.gz | rapper -e -o ntriples - file://taxonomy.rdf.gz# > taxonomy.nt
This completed in 38.9, 38.8 and 38.9 seconds (in subsequent runs).
The -e
flag turns off validation. This doesn’t seem to have a measurable impact on performance, but is necessary to avoid erroneous “Duplicated rdf:ID value” errors (at least in another, larger file).
The data is decompressed on the fly to save time (and disk space).
Jena
Next I tried Jena (2.5.2). After adding all the jars to the classpath, the conversion was run like so:
zcat taxonomy.rdf.gz | java jena.rdfparse -b file://taxonomy.rdf.gz# -x - > taxonomy.nt
This completed in 2:38, 2:40 and 2:38 min.
This is too slow if I wanted to convert the entire UniProt RDF data set within reasonable time, but at least I got a (correct) warning about a bad URI that I hadn’t been aware of…
The JVM is JRockit (5.0 R27.1) with default parameters (I tried some variations such as adding -Xgcprio:throughput
, but didn’t see any significant change).
Rio
Last, I tried Rio (1.0.9), another Java parser. Rio doesn’t seem to include a command line tool for conversion, but it’s not a lot of code:
import java.io.IOException;
import org.openrdf.model.Resource;
import org.openrdf.model.URI;
import org.openrdf.model.Value;
import org.openrdf.rio.Parser;
import org.openrdf.rio.StatementHandler;
import org.openrdf.rio.StatementHandlerException;
import org.openrdf.rio.ntriples.NTriplesWriter;
import org.openrdf.rio.rdfxml.RdfXmlParser;
public class Converter
{
public static void main(String[] args)
throws Exception
{
Parser parser = new RdfXmlParser();
final NTriplesWriter writer = new NTriplesWriter(System.out);
writer.startDocument();
parser.setStatementHandler(new StatementHandler()
{
public void handleStatement(Resource s, URI p, Value o)
throws StatementHandlerException
{
try
{
writer.writeStatement(s, p, o);
}
catch (IOException e)
{
throw new StatementHandlerException(e);
}
}
});
parser.parse(System.in, args[0]);
writer.endDocument();
}
}
zcat taxonomy.rdf.gz | java Converter file://taxonomy.rdf.gz# > taxonomy.nt
This ran in 49.8, 49.5 and 50.0 seconds.
Using buffered readers or writers seemed to decrease performance slightly, so I assume the streams are already being buffered.
Conclusion
The conversion can be done fastest with Raptor. Rio is the best choice if you need to set up a platform-independent procedure (e.g. integrated into an Ant build). Jena is best if you also need to check the data :-)