Mulgara | Semantic Store - Content Wrappers

Content Wrappers

Once the factory that creates resolver objects is in place, write an implementation of the Content interface that can be passed to a ContentHandler object. The resolver uses it to encapsulate the content of files being resolved under the specific protocol, giving the handler a common format to read data from without concern for the protocol. It also provides the URI of the original resource, if required, and a map of blank nodes to their respective values for use across constraints on the same resource.

Content implementations do not require any knowledge about the file types that are to be used with the resolver. The two main considerations are converting the content object's URI source to an input stream and the management of blank node mappings.
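The three responsibilities described above (a URI, an input stream, and a blank node map) can be sketched without any protocol handling at all. The following is a minimal illustration only: SketchContent is a hypothetical stand-in for org.mulgara.content.Content, which declares further methods such as getContentType(), and InMemoryContent and ContentSketch are invented names that are not part of Mulgara.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.mulgara.content.Content (assumption:
// the real interface declares more than these three methods).
interface SketchContent {
  URI getURI();
  InputStream newInputStream() throws IOException;
  Map<Object, Object> getBlankNodeMap();
}

// Wraps text held in memory, so no protocol handling is involved.
class InMemoryContent implements SketchContent {
  private final URI uri;
  private final byte[] data;
  private final Map<Object, Object> blankNodeMap = new HashMap<>();

  InMemoryContent(URI uri, String text) {
    this.uri = uri;
    this.data = text.getBytes(StandardCharsets.UTF_8);
  }

  public URI getURI() {
    return uri;
  }

  // Each call returns a fresh stream positioned at the start of the data.
  public InputStream newInputStream() {
    return new ByteArrayInputStream(data);
  }

  public Map<Object, Object> getBlankNodeMap() {
    return blankNodeMap;
  }
}

// A consumer that only sees the interface, never the protocol.
final class ContentSketch {
  static String readAll(SketchContent content) {
    try (InputStream in = content.newInputStream()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The consumer in ContentSketch reads through the interface alone, which is the property that lets one handler serve several protocols.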

Implementing the Interface

Every protocol resolver requires a Content implementation. Without one, the resolver cannot give a handler access to the data that needs converting to Statements. Normally the connection methods of the URL class are enough to obtain an input stream from the source content, but this resolver must also manage the https protocol, which is more involved. To handle this, the HttpConnection class of the Apache Commons HttpClient library is used to create and maintain a connection to the content source. The implementation looks something like the following (extracted from HttpContent.java):

package org.mulgara.resolver.http;

//Local packages
import org.mulgara.content.Content;

// Java 2 standard packages
import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.MalformedURLException;
import java.net.UnknownHostException;
import java.util.*;

// Apache HTTP Client
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.protocol.Protocol;
import org.apache.commons.httpclient.methods.*;

// Java 2 enterprise packages
import javax.activation.MimeType;
import javax.activation.MimeTypeParseException;

//Third party packages
import org.apache.log4j.Logger;

public class HttpContent implements Content {

/** Logger. */
private final static Logger logger = Logger.getLogger(HttpContent.class
.getName());

/** The URI version of the URL */
private URI httpUri;

/**
* A map containing any format-specific blank node mappings from previous
* parses of this file.
*/
private Map blankNodeMap = new HashMap();

/**
* Connection host
*/
private String host;

/**
* port to make connection to
*/
private int port;

/**
* Connection scheme
*/
private String schema;

/**
* A container for HTTP attributes that may persist from request to request
*/
private HttpState state = new HttpState();

/**
* Http connection
*/
private HttpConnection connection = null;

/**
* To obtain the http headers only
*/
private static final int HEAD = 1;

/**
* To obtain the response body
*/
private static final int GET = 2;

/**
* Max. number of redirects
*/
private static final int MAX_NO_REDIRECTS = 10;

public HttpContent(URI uri) throws URISyntaxException, MalformedURLException {
this(uri.toURL());
}

/**
* Constructor.
*
* @param url The URL this object will be representing
* the content of
*/
public HttpContent(URL url) throws URISyntaxException {

// Validate "url" parameter
if (url == null) {

throw new IllegalArgumentException("Null \"url\" parameter");
}

initialiseSettings(url);
}

/**
* Initialise the basic settings for a connection
*
* @param url
* location of source
* @throws URISyntaxException
* invalid URI
*/
private void initialiseSettings(URL url) throws URISyntaxException {

// Convert the URL to a Uri
httpUri = new URI(url.toExternalForm());

// obtain basic details for connections
host = httpUri.getHost();
port = httpUri.getPort();
schema = httpUri.getScheme();

}

/**
* Retrieves the node map used to ensure that blank nodes are consistent.
*
* @return The node map used to ensure that blank nodes are consistent
*/
public Map getBlankNodeMap() {

return blankNodeMap;
}

/**
* Obtain the appropriate connection method
*
* @param methodType
* can be HEAD or GET
* @return the configured HttpMethod
*/
private HttpMethod getConnectionMethod(int methodType) {

if (methodType != GET && methodType != HEAD) {
throw new IllegalArgumentException(
"Invalid method base supplied for connection");
}

Protocol protocol = Protocol.getProtocol(schema);

connection = new HttpConnection(host, port, protocol);

String proxyHost = System.getProperty("mulgara.httpcontent.proxyHost");

if (proxyHost != null && proxyHost.length() > 0) {
connection.setProxyHost(proxyHost);
}

String proxyPort = System.getProperty("mulgara.httpcontent.proxyPort");
if (proxyPort != null && proxyPort.length() > 0) {
connection.setProxyPort(Integer.parseInt(proxyPort));
}

// default timeout to 30 seconds
connection.setConnectionTimeout(Integer.parseInt(System.getProperty(
"mulgara.httpcontent.timeout", "30000")));

String proxyUserName = System
.getProperty("mulgara.httpcontent.proxyUserName");
if (proxyUserName != null) {
state.setCredentials(System.getProperty("mulgara.httpcontent.proxyRealm"),
System.getProperty("mulgara.httpcontent.proxyRealmHost"),
new UsernamePasswordCredentials(proxyUserName, System
.getProperty("mulgara.httpcontent.proxyPassword")));
}

HttpMethod method = null;
if (methodType == HEAD) {
method = new HeadMethod(httpUri.toString());
}
else {
method = new GetMethod(httpUri.toString());
}

if (connection.isProxied() && connection.isSecure()) {
method = new ConnectMethod(method);
}

// manually follow redirects due to the
// strictness of http client implementation

method.setFollowRedirects(false);

return method;
}

/**
* Obtain a valid connection and follow redirects if necessary
*
* @param methodType
* request the headers (HEAD) or body (GET)
* @return valid connection method. Can be null.
* @throws IOException
*/
private HttpMethod establishConnection(int methodType) throws IOException {

HttpMethod method = this.getConnectionMethod(methodType);
Header header = null;

if (method != null) {
method.execute(state, connection);
if (!isValidStatusCode(method.getStatusCode())) {
throw new UnknownHostException("Unable to obtain connection to "
+ httpUri + ". Returned status code " + method.getStatusCode());
}
else {
// has a redirection been issued
int numberOfRedirection = 0;
while (isRedirected(method.getStatusCode())
&& numberOfRedirection <= MAX_NO_REDIRECTS) {

// release the existing connection
method.releaseConnection();

//attempt to follow the redirects
numberOfRedirection++;

// obtain the new location
header = method.getResponseHeader("location");
if (header != null) {
try {
initialiseSettings(new URL(header.getValue()));
if (logger.isInfoEnabled()) {
logger.info("Redirecting to " + header.getValue());
}

// attempt a new connection to this location
method = this.getConnectionMethod(methodType);
method.execute(state, connection);
if (!isValidStatusCode(method.getStatusCode())) {
throw new UnknownHostException(
"Unable to obtain connection to " + " the redirected site "
+ httpUri + ". Returned status code "
+ method.getStatusCode());
}
}
catch (URISyntaxException ex) {
throw new IOException("Unable to follow redirection to "
+ header.getValue() + " Not a valid URI");
}
}
else {
throw new IOException("Unable to obtain redirecting details from "
+ httpUri);
}
}
}
}
return method;
}

/*
* @see org.mulgara.content.Content#getContentType()
*/
public MimeType getContentType() {

MimeType mimeType = null;
HeadMethod method = null;
String contentType = null;

try {

// obtain connection and retrieve the headers
method = (HeadMethod) establishConnection(HEAD);
Header header = method.getResponseHeader("Content-Type");
if (header != null) {
contentType = header.getValue();
mimeType = new MimeType(contentType);
if (logger.isInfoEnabled()) {
logger.info("Obtain content type " + mimeType + " from " + httpUri);
}
}
}
catch (MimeTypeParseException e) {
logger.warn("Unable to parse " + contentType + " as a content type for "
+ httpUri);
}
catch (IOException e) {
logger.info("Unable to obtain content type for " + httpUri);
}
catch (java.lang.IllegalStateException e) {
logger.info("Unable to obtain content type for " + httpUri);
}
finally {
if (method != null) {
method.releaseConnection();
}
if (connection != null) {
connection.close();
}
}
return mimeType;
}

/**
* Retrieves the URI for the actual content.
*
* @return The URI for the actual content
*/
public URI getURI() {

return httpUri;
}

/**
* Creates an input stream to the resource whose content we are representing.
*
* @return An input stream to the resource whose content we are representing
* @throws IOException
*/
public InputStream newInputStream() throws IOException {

// Create an input stream by opening the URL's input stream
GetMethod method = null;
InputStream inputStream = null;

// obtain connection and retrieve the response body
method = (GetMethod) establishConnection(GET);
inputStream = method.getResponseBodyAsStream();
if (inputStream == null) {
throw new IOException("Unable to obtain inputstream from " + httpUri);
}
return inputStream;
}

private boolean isValidStatusCode(int status) {
return (status == HttpStatus.SC_OK || isRedirected(status));
}

private boolean isRedirected(int status) {
return (status == HttpStatus.SC_TEMPORARY_REDIRECT
|| status == HttpStatus.SC_MOVED_TEMPORARILY
|| status == HttpStatus.SC_MOVED_PERMANENTLY || status == HttpStatus.SC_SEE_OTHER);
}

}

An analysis of the class is as follows:

The content implementation is not required to live in any particular package. However, for neatness and ease of coding, it is recommended that it be in the same package as your resolver implementation. The org.mulgara.content.Content interface must be imported, along with the classes java.io.InputStream, java.io.IOException, java.util.Map, javax.activation.MimeType and java.net.URI, which are used in the interface. Any other imports depend on your implementation of the interface.

public class HttpContent implements Content {

All Content implementation classes must implement the Content interface. If you are extending another content class, you do not need to declare the interface again, as long as the superclass implements it.

/**
* Constructor.
*
* @param url The URL this object will be representing
* the content of
*/
public HttpContent(URL url) throws URISyntaxException {

// Validate "url" parameter
if (url == null) {

throw new IllegalArgumentException("Null \"url\" parameter");
}

initialiseSettings(url);
}

Constructors for Content implementations have no specific requirements, as all instantiation is handled by the resolver. As long as the resolver knows how to create an instance, the signature is up to you. The main consideration is the form in which resource locations are passed in, since this determines how you create an input stream and how you transform the location into a URI. For the http and https protocols, you can use URLs, which are already in a natural URI format. Since the getURI() method cannot throw exceptions, the URL to URI conversion is performed during construction (here, in initialiseSettings()). The blank node map is initialized as a HashMap in its field declaration so that there is always a valid object when getBlankNodeMap() is called.
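The reason for converting during construction can be illustrated with a small sketch that uses only the JDK. EagerUriContent and tryCreate are hypothetical names introduced for this example. The point is that java.net.URL accepts some locations that java.net.URI rejects, so the failure is better surfaced when the object is built than when getURI() is called.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Optional;

// Hypothetical class for illustration; not part of Mulgara.
class EagerUriContent {
  private final URI uri;

  // The conversion that may fail happens here, once, at construction time.
  EagerUriContent(URL url) throws URISyntaxException {
    if (url == null) {
      throw new IllegalArgumentException("Null \"url\" parameter");
    }
    this.uri = new URI(url.toExternalForm());
  }

  // Cannot fail: it only returns the value stored by the constructor.
  URI getURI() {
    return uri;
  }

  // Convenience for the example: empty if the location cannot be converted.
  static Optional<URI> tryCreate(String location) {
    try {
      return Optional.of(new EagerUriContent(new URL(location)).getURI());
    } catch (MalformedURLException | URISyntaxException e) {
      return Optional.empty();
    }
  }
}
```

For instance, a location containing an unescaped space in its path is accepted by the URL constructor but rejected by the URI constructor, so tryCreate returns an empty result for it.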

/**
* Retrieves the node map used to ensure that blank nodes are consistent.
*
* @return The node map used to ensure that blank nodes are consistent
*/
public Map getBlankNodeMap() {

return blankNodeMap;
}

The getBlankNodeMap() method usually returns the variable containing our map object, unless some pre-processing is required. In the case of http content, you can get away with just returning the HashMap containing the mappings.
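How a parser might use such a map to keep blank nodes consistent across repeated parses can be sketched as follows. This is an assumption-laden illustration: the key type (a blank node label string), the value type (a numeric node id) and the BlankNodeSketch class are all invented for the example, and the types Mulgara actually stores in the map may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Mulgara.
final class BlankNodeSketch {

  // Returns the id already recorded for the label, or records a new one.
  // Ids are derived from the map size, so they stay unique for as long as
  // entries are only ever added through this method.
  static long nodeFor(Map<String, Long> blankNodeMap, String label) {
    Long existing = blankNodeMap.get(label);
    if (existing != null) {
      return existing;
    }
    long id = blankNodeMap.size() + 1L;
    blankNodeMap.put(label, id);
    return id;
  }

  // Builds the kind of map a content object would hold and hand out.
  static Map<String, Long> newMap() {
    return new HashMap<>();
  }
}
```

Because the map lives in the content object rather than in the parser, a second parse of the same resource sees the labels recorded by the first and resolves them to the same ids.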

/**
* Retrieves the URI for the actual content.
*
* @return The URI for the actual content
*/
public URI getURI() {

return httpUri;
}

Content objects allow access to the original URI and should not throw any exceptions when the getURI() method is called. If converting the resource's source object (for example, java.net.URL) to a URI can throw an exception, perform the conversion in the constructor and have this method return the stored instance field.

/*
* @see org.mulgara.content.Content#getContentType()
*/
public MimeType getContentType() {

MimeType mimeType = null;
HeadMethod method = null;
String contentType = null;

try {

// obtain connection and retrieve the headers
method = (HeadMethod) establishConnection(HEAD);
Header header = method.getResponseHeader("Content-Type");
if (header != null) {
contentType = header.getValue();
mimeType = new MimeType(contentType);
if (logger.isInfoEnabled()) {
logger.info("Obtain content type " + mimeType + " from " + httpUri);
}
}
}
catch (MimeTypeParseException e) {
logger.warn("Unable to parse " + contentType + " as a content type for "
+ httpUri);
}
catch (IOException e) {
logger.info("Unable to obtain content type for " + httpUri);
}
catch (java.lang.IllegalStateException e) {
logger.info("Unable to obtain content type for " + httpUri);
}
finally {
if (method != null) {
method.releaseConnection();
}
if (connection != null) {
connection.close();
}
}
return mimeType;
}

Most content handlers are written to handle content of a certain type, which has a MIME type associated with it. When the resolver receives a model to resolve, it cycles through the list of registered content handlers and, when it finds one that can parse the data, hands over the content object for parsing. To find out whether a handler can process a content object, the getContentType() method can be invoked to retrieve the resource's MIME type. For http resources, the Content-Type response header supplies the MIME type; the implementation above requests it with a HEAD method and returns null if it cannot be obtained or parsed.
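The selection step described above can be sketched roughly as follows. SketchHandler, FixedTypeHandler and HandlerRegistry are hypothetical names, not Mulgara's ContentHandler API, and the content type is passed as a plain string rather than a javax.activation.MimeType so that the example depends only on the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical handler abstraction; not Mulgara's ContentHandler interface.
interface SketchHandler {
  boolean canParse(String baseType);
  String name();
}

// A handler that accepts exactly one MIME base type.
class FixedTypeHandler implements SketchHandler {
  private final String name;
  private final String mimeType;

  FixedTypeHandler(String name, String mimeType) {
    this.name = name;
    this.mimeType = mimeType;
  }

  public boolean canParse(String baseType) {
    return mimeType.equals(baseType);
  }

  public String name() {
    return name;
  }
}

class HandlerRegistry {
  private final List<SketchHandler> handlers = new ArrayList<>();

  void register(SketchHandler handler) {
    handlers.add(handler);
  }

  // Drops parameters such as "; charset=UTF-8" and lower-cases the rest.
  static String baseType(String contentType) {
    int semicolon = contentType.indexOf(';');
    String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    return base.trim().toLowerCase(Locale.ROOT);
  }

  // Returns the first registered handler that accepts the type, if any.
  // A null content type (none could be determined) matches no handler.
  Optional<SketchHandler> select(String contentType) {
    if (contentType == null) {
      return Optional.empty();
    }
    String base = baseType(contentType);
    for (SketchHandler handler : handlers) {
      if (handler.canParse(base)) {
        return Optional.of(handler);
      }
    }
    return Optional.empty();
  }
}
```

Registration order matters in this sketch: the first handler that accepts the type wins, so more specific handlers would need to be registered first.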

/**
* Creates an input stream to the resource whose content we are representing.
*
* @return An input stream to the resource whose content we are representing
* @throws IOException
*/
public InputStream newInputStream() throws IOException {

// Create an input stream by opening the URL's input stream
GetMethod method = null;
InputStream inputStream = null;

// obtain connection and retrieve the response body
method = (GetMethod) establishConnection(GET);
inputStream = method.getResponseBodyAsStream();
if (inputStream == null) {
throw new IOException("Unable to obtain inputstream from " + httpUri);
}
return inputStream;
}

ContentHandler implementations need access to the actual content of a resource in order to convert it to RDF triples. This access is provided as an InputStream, so the resource pointer that the Content object wraps must be able to produce an InputStream in some way. As previously stated, you could normally use the openStream() method of the URL class to create an input stream from the resource, but https support requires a more complex approach. Using the HttpConnection class from the Apache Commons HttpClient jar, a GET method call to the server streams the data to the handler.
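From the handler's side, the stream is all that matters. The sketch below shows a consumer that reads an InputStream line by line and counts the lines that look like statements. It is an illustration only: LineCountingSketch is a hypothetical name, and treating every non-blank, non-comment line as one statement is a simplifying assumption, not how a real RDF parser works.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical consumer for illustration; not a Mulgara ContentHandler.
final class LineCountingSketch {

  // Counts lines that are neither blank nor comments starting with '#'.
  // The stream is closed when reading finishes.
  static int countStatements(InputStream in) throws IOException {
    int count = 0;
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String trimmed = line.trim();
        if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
          count++;
        }
      }
    }
    return count;
  }

  // Convenience overload that feeds in-memory text through the same path.
  static int countStatements(String text) {
    try {
      return countStatements(
          new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The same method works whether the stream comes from memory, a file, or the response body of a GET request, which is why the content object only has to supply the stream.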

The other methods in the class not outlined here are all unique to the http content object and irrelevant to other content types.
