Dominic J. Thoreau

Web Mapping

Web Mapping Constructs.

Dominic J. Thoreau

8th September 1998

Mapping complex, non-linear hypertext information structures.

As far as I know, there is no software readily available to map complex structures. Existing software seems to have failings in some facets. Many published "site maps" don't acknowledge (or possibly don't have) any link other than the strictly hierarchical. Facilities for graphically viewing web sites are few, and those that exist don't seem to recognise inter-relationships.

Tim Berners-Lee's "Enquire-Within-Upon-Everything" (CERN, December 1980)
The whole point of the web was that documents can be referenced arbitrarily, without insisting that all links be hierarchical in nature. Many sites (including Yahoo, AltaVista, WebCrawler etc.) exist solely to help navigate these complex structures. Unfortunately, all the site map management tools seem to ignore this. While many sites seem to have tree structures, others do not (e.g. Yahoo, while on the surface appearing to be purely tree-based, in fact has "aliases" that link across the tree).

Apart from trees themselves, in my experience there are more relationships mis-expressed as trees than true trees. So-called family trees tend to grow into rambling structures and interconnect (and not just among the low IQ; there is substantial inter-breeding in the royal houses of Europe).

Additionally, when composing large structures, or making databases browsable via CGI or canned HTML, moderately sized databases can become large, unwieldy structures (imagine drawing a single map for Microsoft's corporate site!). In the case of the Lepidoptera project, 475 KB of base database became, with indices and lookup tables, 2 MB of source data, which in turn became 4,140 HTML files totalling 12 MB, in addition to the 1,868 image files. The concept of a map quickly became irrelevant.

"Meta" anything in a computing context is basically self-describing: meta-data is data describing data (this can be definitions of fields and tables, or more interesting descriptors such as accuracy and source comments). Meta-maps are maps that describe maps. The "index maps" present in map books that identify the relationships between maps are one common example of meta-maps.

Instead, a rough "meta-map" concept was used: rather than mapping individual pages, a map was made of the relationships between types of pages. (A diagram was constructed, albeit one using a demonstration program from a programming tool. The deficiencies in this particular diagram were immediately obvious, but there was neither the time nor the tools to produce software that would produce such a document.) The type definitions used, in hindsight, conform rather closely to the classes/entities that would have resulted from a proper analysis using standard methodologies (say Codd and/or Booch), although code reuse is not really an issue.

With good naming conventions, such a map not only assists programmers with developing and managing the document space but also helps explain the document to users and project management. (The scheme originally chosen did not support spaces or hyphens in entity names, but underscores in their place helped.)
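The meta-map idea can be sketched in a few lines of Python. This is only an illustration, not the method used on the project: the page names, the type names and the `meta_map` function are all hypothetical. It collapses a list of page-to-page links into a count of links between page types.

```python
from collections import Counter

# Hypothetical page-level links: (source page, target page).
page_links = [
    ("index.html", "author_a.html"),
    ("index.html", "author_b.html"),
    ("author_a.html", "specimen_1.html"),
    ("author_a.html", "specimen_2.html"),
    ("author_b.html", "specimen_3.html"),
    ("specimen_1.html", "index.html"),
]

# Hypothetical assignment of each page to a page type (entity).
page_type = {
    "index.html": "Home",
    "author_a.html": "Author_Index",
    "author_b.html": "Author_Index",
    "specimen_1.html": "Specimen",
    "specimen_2.html": "Specimen",
    "specimen_3.html": "Specimen",
}


def meta_map(links, types):
    """Collapse page-to-page links into type-to-type links with counts."""
    edges = Counter()
    for src, dst in links:
        edges[(types[src], types[dst])] += 1
    return edges


for (src_type, dst_type), count in sorted(meta_map(page_links, page_type).items()):
    print(f"{src_type} -> {dst_type}: {count} link(s)")
```

Six page-level links reduce to three type-level links here. On a site with thousands of generated pages the type-level map stays about the same size.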

Case: Yahoo

A personal, black-box reverse engineering of the perceived concepts behind Yahoo(TM). I have no knowledge of the actual technology used. This is my best guess, and how I would do it.
Yahoo, one of the earliest "web catalogues" (or more recently "portals"), while having a large amount of data, actually appears to have a simple internal structure. There are only two types of entity required: the "Category" and the "Web site". A category can contain other objects such as sub-categories or web sites, as well as being linked to all its ancestor categories. This would have been a simplistic, pure tree structure except for the (very useful) aliasing to related, but not entirely similar, categories across the structure. For example, the Category page for Entertainment : Music : Software contains aliases to Business_and_Economy : Companies : Music : Software and Computers_and_Internet : Software : Shareware : Microsoft_Windows : Desktop_Themes : Individual_Themes : Music, two related categories that do not qualify to be merged.
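As a rough sketch of the guessed structure (not Yahoo's actual implementation, which I have no knowledge of), a category could be modelled as below. The `Category` class, the `make_chain` helper and the example site URL are all hypothetical. Each category holds its parent, its sub-categories and its web sites, and a separate alias list carries the links that cut across the tree.

```python
class Category:
    """A catalogue category: a tree node plus cross-tree aliases."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # link to the ancestor chain
        self.children = []        # sub-categories
        self.sites = []           # web sites listed directly here
        self.aliases = []         # related categories elsewhere in the tree
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the full path from the top-level category down to this one."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " : ".join(reversed(parts))


def make_chain(*names):
    """Hypothetical helper: build nested categories, return the deepest."""
    node = None
    for name in names:
        node = Category(name, parent=node)
    return node


entertainment_sw = make_chain("Entertainment", "Music", "Software")
business_sw = make_chain("Business_and_Economy", "Companies", "Music", "Software")

# The alias is what stops the structure being a pure tree.
entertainment_sw.aliases.append(business_sw)
entertainment_sw.sites.append("http://example.com/")  # placeholder URL

print(entertainment_sw.path())
for alias in entertainment_sw.aliases:
    print("  alias ->", alias.path())
```

Keeping aliases in their own list, rather than as extra children, means the tree can still be walked without visiting a category twice.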

Case: Everything

Everything is © Rob Malda/Nate Oostendorp.
Rob Malda's "Everything" project is one of the purer proofs that relationships between items can be other than strictly hierarchical. The concept is basically one of user-created "nodes", filled with descriptive text, that can be linked to any other node in the system. A linked node need not exist to be linked to. Instead, anyone following a "null" link will be invited to create that node. An attempt is made to impose a limited natural selection mechanism on node content by providing an alternate data set for each node and asking those viewing to judge which is superior. Presumably at some point the lesser would be deleted.
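The "link first, create later" behaviour can be illustrated with a small sketch. This is not Everything's code. The `NodeStore` class and its method names are hypothetical, and the natural selection mechanism is left out.

```python
class NodeStore:
    """A flat store of titled nodes that may link to nodes not yet written."""

    def __init__(self):
        self.nodes = {}  # title -> {"text": str, "links": set of titles}

    def create(self, title, text, links=()):
        """Create (or overwrite) a node; link targets need not exist."""
        self.nodes[title] = {"text": text, "links": set(links)}

    def follow(self, title):
        """Return the node text, or an invitation if the link is "null"."""
        node = self.nodes.get(title)
        if node is None:
            return f"No node called {title!r} yet. Create it?"
        return node["text"]

    def null_links(self):
        """List every link target that has no node behind it."""
        targets = set()
        for node in self.nodes.values():
            targets |= node["links"]
        return sorted(t for t in targets if t not in self.nodes)


store = NodeStore()
store.create("web", "A non-linear document space.", links=["hypertext", "site map"])
store.create("hypertext", "Text with links.", links=["web"])

print(store.follow("hypertext"))
print(store.follow("site map"))
print(store.null_links())
```

There is no parent or child anywhere in the structure. Every node is a peer, and a link is just a title that may or may not resolve.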

Case: NZ Lepidoptera Project

Images of New Zealand Lepidoptera. Crosby, Dugdale, Thoreau. 1998 Manaaki Whenua Press (currently in press)
The NZ Lepidoptera Project is an effort to publish electronically the type specimen details of the indigenous Lepidoptera (moths and butterflies) of New Zealand. While some of the complexities of taxonomic classification were ignored for this project, the structure was still complex because of the need for multiple access paths to the main information page. In addition to three variants on taxonomic identity, access paths were added by Collector, Author, Location collected, and institution currently storing the specimen. All paths lead to the same type of page (1 of ~2200). The current build has ~4,500 pages and ~120,000 hyperlinks.
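A minimal sketch of how several access paths can converge on one type of page follows. It is not the project's build code. The record fields, the placeholder values and the `build_indices` function are all hypothetical. Each access path is an index from a field value to the specimen pages it leads to.

```python
from collections import defaultdict

# Placeholder records; none of these values come from the real database.
specimens = [
    {"id": 1, "collector": "Collector_X", "author": "Author_P",
     "location": "Location_1", "institution": "Institution_M"},
    {"id": 2, "collector": "Collector_X", "author": "Author_Q",
     "location": "Location_2", "institution": "Institution_M"},
    {"id": 3, "collector": "Collector_Y", "author": "Author_P",
     "location": "Location_1", "institution": "Institution_N"},
]

ACCESS_FIELDS = ("collector", "author", "location", "institution")


def build_indices(records, fields=ACCESS_FIELDS):
    """Build one index per access path: field value -> list of specimen ids."""
    indices = {name: defaultdict(list) for name in fields}
    for record in records:
        for name in fields:
            indices[name][record[name]].append(record["id"])
    return indices


indices = build_indices(specimens)
for name in ACCESS_FIELDS:
    for value, ids in sorted(indices[name].items()):
        print(f"{name}: {value} -> specimen pages {ids}")
```

Every specimen page is reachable from each of the four indices, so the page-level link count grows much faster than the page count.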

Details of scheme used.

The tool used need not be expensive. For the purposes of the previously mentioned project, the tool used was a demonstration applet from Sun Microsystems' Java Development Kit, with shadows then added in a generic paint program.
The diagramming tool used needs to be able to construct entities and link them together.
  • An entity is simply a box, and links between them simply lines.
  • Arrow-heads are added to show the directions of traffic available.
  • If travel in both directions is possible, two arrow heads should be drawn, facing away from each other.
  • By adding a "shadow" under entities that map to a large number of pages, it is possible to give some broad idea of scale. (The arbitrary rule used here was that a level of shadow should be added for a link where the average ratio was at least 1:20. This gave at maximum 3 shadows under one entity.)
  • For ease of readability, the first entity/page presented to the user is displayed in the top-left of the document.
  • For clarity, some entities may be omitted. Toolbars, advertisements etc. are not shown. These should be described in text attached to the project.
  • Each Entity relates to either a single handcrafted page or a single script that produces one type of page.
  • Don't display duplicate structures; the version that is restricted to 8.3 character filenames is not drawn separately.
  • Don't display un-related structures.
  • Bear in mind that there are other relationships that are important, including run order, file dependencies etc.
  • A single entity only represents a single type of page, and should be simple enough to explain in a single sentence, in the context of the whole project.
  • One entity, one script.
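The scheme above could be sketched as a small generator that emits Graphviz DOT text. This is an assumption on my part: the original diagram was drawn with a Java demonstration applet and a paint program, not Graphviz. The entity names, fan-out figures and function names are hypothetical. The shadow rule is read here as "one shadow on the target entity for each incoming link with an average fan-out of at least 20", which is only one possible reading. Since DOT has no shadows, extra box outlines (`peripheries`) stand in for them.

```python
SHADOW_RATIO = 20  # a link averaging 1:20 or more adds one shadow level

# Hypothetical links: (source entity, target entity, average fan-out, two-way?)
example_links = [
    ("Home", "Collector_Index", 1, True),
    ("Home", "Author_Index", 1, True),
    ("Collector_Index", "Specimen", 25, False),
    ("Author_Index", "Specimen", 30, False),
]


def shadow_levels(links):
    """Count, per entity, the incoming links that meet the shadow ratio."""
    levels = {}
    for src, dst, fan_out, _two_way in links:
        levels.setdefault(src, 0)
        levels.setdefault(dst, 0)
        if fan_out >= SHADOW_RATIO:
            levels[dst] += 1
    return levels


def to_dot(links, start):
    """Emit DOT text: boxes for entities, arrows for links, start entity first."""
    levels = shadow_levels(links)
    lines = ["digraph site {", "  node [shape=box];"]
    ordered = [start] + sorted(name for name in levels if name != start)
    for name in ordered:
        # Extra outlines stand in for the hand-drawn shadows.
        lines.append(f'  "{name}" [peripheries={1 + levels[name]}];')
    for src, dst, _fan_out, two_way in links:
        attrs = " [dir=both]" if two_way else ""
        lines.append(f'  "{src}" -> "{dst}"{attrs};')
    lines.append("}")
    return "\n".join(lines)


print(to_dot(example_links, start="Home"))
```

Listing the start entity first does not place it top-left. Layout is left to whatever tool renders the DOT text.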

Further thoughts

None of the ideas here are being actively pursued. My job is in the field of Bioinformatics/Biodiversity, so more general software engineering is not a priority. However, I am still interested in them. If you are too, contact me. The philosophical direction I come from suggests that any code created here that isn't specifically owned by anybody else (like my employers) would ideally become Open Source software.
  1. Drawing simple flow diagrams is a simple task that doesn't need a tool any more complex than a generic diagramming tool. However, extending the code slightly to keep related attributes associated with their respective entities would be greatly beneficial to users. It's only a small step from there to code generation for at least skeleton code. Unfortunately, it's not a priority for me to write small-scale CASE tools.
  2. The above is a very rough idea shaped by the tools I had at my disposal. I am interested in expanding the ideas used. I would probably write a small tool for this if I had access to a good C++ compiler/GUI (I learnt Borland (now Inprise)'s OWL toolkit in tertiary education).
  3. Other ideas to be shaped into a standard:
    • Numeric page counters: either fan-out or total
    • Differentiate between script-generated and hand-coded pages
    • Optionality: the pages generated by script x lead to either page type y or page type z
    • Ownership of individual scripts. Who will your changes impact?
    • Sub- and super-diagrams; information complexity hiding
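To make a few of these ideas concrete, here is a hedged sketch of an entity record carrying such attributes, with a query answering "who will your changes impact?". Nothing here exists as a tool. The `Entity` fields, the owner names and the `impacted_owners` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    generated: bool   # True if produced by a script, False if hand coded
    page_count: int   # total number of pages of this type
    owner: str        # person responsible for the script or page


# Placeholder entities and links, for illustration only.
entities = [
    Entity("Home", generated=False, page_count=1, owner="alice"),
    Entity("Author_Index", generated=True, page_count=40, owner="bob"),
    Entity("Specimen", generated=True, page_count=2200, owner="carol"),
]
entity_links = [("Home", "Author_Index"), ("Author_Index", "Specimen")]


def impacted_owners(changed, all_entities, links):
    """Owners of entities linked to or from the changed entity."""
    by_name = {entity.name: entity for entity in all_entities}
    neighbours = {src for src, dst in links if dst == changed}
    neighbours |= {dst for src, dst in links if src == changed}
    owners = {by_name[name].owner for name in neighbours}
    return sorted(owners - {by_name[changed].owner})


print(impacted_owners("Author_Index", entities, entity_links))
```

The same record could later feed skeleton code generation, since it already says which entities are script-generated.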

© Landcare Research New Zealand Ltd 1998
Dominic J. Thoreau

© 2004 Dominic J. Thoreau - this is
Updated and uploaded Fri Dec 29 11:45:07 2006