Published in GeoWeb Standards – Discoverability
We have a rich history of geography, cartography and GIS that is currently tucked away in top drawers, intranets, and repositories that may not stay online when we most need the data. How do we expose these huge troves of data in a way that can be utilized across various domains. The GeoWeb is all part of the same web, semantic, sensor, social, (and interplanetary). So it is vital at the GeoWeb align itself with the web and the multitude of sources and endpoints that the web is reaching into.
There are many possible solutions, and a few that are within easy grasp that we can build our tools to encompass, and develop practices that encourage utilization of these solutions while still moving forward onto better ones as the GeoWeb matures. So we’ll take a few articles to look at specific solutions.
Perhaps the most prevalent issue, and the one that is most easily addressable, is the findability and discovery of geodata on the web. Mano Marks reflected this same sentiment in his blog post on standards.
In thinking about discoverability, there are several primary use cases to consider: Machine crawling, Human discovery, and Tool discovery. Providing data via just a single mechanism means that it doesn’t get utilized and consumed to it’s potential and so somewhere along the chain of utilization it will be a burden to actually incorporate into a workflow.
Think of the machines
Machine crawling is the ability for any spider to walk links, find data, metadata, and formats automatically. It’s what Google, or GeoNetwork does to find and register data sources.
There was recently a discussion on auto-discovery in the GeoWeb suggesting the use of robots.txt, sitemaps, or embedded META tags in HTML pages.
Consider how a spider would get to a site: it follows a link to a geospatial portal from some blog, resource, or directly entered as a good place to get data. It does a GET request on the root homepage, “/” which most likely returns the index.html equivalent. The program then parses through that for links or additional information.
If the spider knows about them, then it may ask for a sitemap.xml or robots.txt. But nothing in the original page request noted that this potentially very complete listing of data was there. This problem is the equivalent of an application having to know that it needs to ask for a GetCapabilities or other method to even discover what is available. Too much implicit knowledge of the specification is required for a program to easily discover new data and services.
What the program does see are these links that can contain information such as a link to a list of available resources. The simplest is a link to the Atom or RSS feed that can simply be a paginated list of all the resources available to the application. Within Atom, there is then the ability to link to various representations of that data in different formats. So applications are able to take the most appropriate format based on what they can consume.
Several years ago I first proposed how KML and GeoRSS could easily support one another via cross-links and with HTML documents. Atom has very nice
type attributes that allow for linking to all sorts of different representations. You can even link to OGC services like WMS and WFS using atom links.
Of particular interest here are looking at the currently approved list of Atom link relation types that provide basic semantics for telling you how what this link means. Is it another page? just related? It’s a limited set, but one that covers an approachable majority case for developers to begin using.
For example, mechanisms like OpenSearch, specified in a
rel="search", simply notify the application that here is a service that it can query to get at additional resources. And with OpenSearch-Geo, a geoweb crawler can query information within a specific location or bounding area.
Humans need data too
Crawlers are great, they provide a say to pull together information into various other sites and tools to provide customized interfaces to users. However, within any site or tool, how should we expose geodata in a way that humans can easily use for whatever purposes the may have.
Again, links have become a very well understood concept on the Web. That underlined blue line states “beyond me lies an unspecified amount of information about this topic“. However, these links typically imply that they will open another human readable HTML page in the browser. A problem caused by links to media such as geospatial data is that the content behind a link may not be just text, it could be an image, audio, movie, KML, database, or a service. Clicking on that link relies on the browser interpreting the MIME-type (remember the point about how vital mime-types are?) and opening the application the user has specified, or left as default.
So consider what this means for generic media. Clicking open a link to an image probably just opens the image in your browser, or opening a movie loads an embedded video player. Geodata browsers, however, probably doesn’t have the same install base as say, Quicktime. Except perhaps GoogleEarth. The Web has become much more comfortable with clicking on a KML link and seeing Google Earth open up and show the data on a globe.
But something very vital often exists with a link to KML data – a recognizable icon that notifies the user (as they learn) that it is a file that will open in Google Earth, or another KML viewer. This is the same as the very widely used RSS icon.
So what we need for GeoWeb standards are some visual representation to people that they are can clink on this link and open a spatial relational database, or an OGC service, and perhaps have some confidence that there is an application that will provide them a useful way to access the data. (and I’m still waiting for Sean Gillies’ ISO and Dublin Core icons)
Of course, we should also employ emergent interfaces that show users the type of data links that are appropriate for them based on their profile or registered MIME-type handlers.
So we have discovery links for machine crawlers to register and harvest geodata, and links for humans to click on to follow to data and within data. However, this can easily become overwhelming to need to click through to every link. Imagine if browsing Flickr through lynx.
Browsers already do a lot to assist users in finding relevant extra pieces of data in a page. RSS autodisovery links show up in URL bars notifying our feed readers that we can subscribe to this page. OpenSearch allows someone to embed this search into their browsers (most of them at least) to easily search the repository later.
The decreasing cost of links
These various approaches for different needs and use cases are all very well aligned. They don’t rely on additional external files that we need to make sure stay up to date or that tools are built to just know that the file can be found at a pre-defined location. Links cost next to nothing, mostly measured in bandwidth sizes, but provide a wealth of accessibility and discovery of geospatial data. Especially data in formats that make sense depending on the tools and use cases for different problems.
Of course, links alone don’t address all the needs of the evolving GeoWeb, they merely provide for the integration of geospatial data with the rest of the web. An important, necessary, but not entirely sufficient first step. We need to consider the actual uses and interfaces of these standards, archival, synchronization, conflation and more.