Difference between revisions of "User:Martind/Document Log Discovery Platform"

From London Hackspace Wiki
Line 17:
  * Additional presentation information, e.g. reading offset

- * WikiLeaks Iraq War Logs
- ** Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ ([http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ archive.org])
- ** Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
- * WikiLeaks Embassy Cables
- ** Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
- ** Document ID: 10COPENHAGEN69
- ** Browsers: [http://cablesearch.org/ cablesearch.org],
- * http://spacelog.org/
- ** Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
- ** Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
- ** Document ID: 01:06:43:11
- ** Reading offset ID: #log-line-110591 ("log-line-110591" alone doesn't seem to suffice to construct a link)
+ WikiLeaks Iraq War Logs
+ * Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ ([http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ archive.org])
+ * Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
+
+ WikiLeaks Embassy Cables
+ * Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
+ * Document ID: 10COPENHAGEN69
+ * Browsers: [http://cablesearch.org/ cablesearch.org],
+
+ http://spacelog.org/
+ * Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
+ * Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
+ * Document ID: 01:06:43:11
+ * Reading offset ID: #log-line-110591 ("log-line-110591" alone doesn't seem to suffice to construct a link)
+ * has rel="canonical"

  == Observations ==

Revision as of 17:29, 18 December 2010

Problem Statement

We're seeing an increase in the publication of vast corpora of data logs, often in the form of message archives, usually in a structured message format. They're all quite overwhelming: how do we make sense of such a vast amount of text? How do we identify the sections that are relevant?

  • Can we allow a large number of interested parties (anyone really) to annotate these documents?
    • What kinds of annotations do we want to make? (Information structure)
    • How can we make that easy? (Tools)
    • Can we identify good conventions and techniques for the above that are more generally applicable? (Patterns of use)
  • Finally, can we think of these functions as a layer on top of mere archives, and construct them as a physically separate service?

Exemplary Publications

Look out for:

  • Canonical URLs (implicit, or explicit via rel="canonical")
  • Document IDs
  • Document ranges for timeline browsers
  • Additional presentation information, e.g. reading offset
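
An explicit canonical URL can be read straight from a page's markup. The following is a minimal sketch using only Python's standard library; the class and function names and the sample page are made up for illustration and are not taken from any of the archives listed here.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first canonical one.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, in any letter case.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical_url(html: str) -> Optional[str]:
    """Return the explicit canonical URL of a page, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Hypothetical sample page, not real archive markup.
sample_page = (
    '<html><head><link rel="canonical" href="http://example.org/doc/42">'
    "</head><body>...</body></html>"
)
print(find_canonical_url(sample_page))
```

When no such element exists the function returns None, and the canonical URL is then only implicit in the link structure of the archive.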

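WikiLeaks Iraq War Logs

  • Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ (archive.org: http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/)
  • Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720

WikiLeaks Embassy Cables

  • Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
  • Document ID: 10COPENHAGEN69
  • Browsers: cablesearch.org (http://cablesearch.org/),

http://spacelog.org/

  • Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
  • Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
  • Document ID: 01:06:43:11
  • Reading offset ID: #log-line-110591 ("log-line-110591" alone doesn't seem to suffice to construct a link)
  • has rel="canonical"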

Observations

Editorial Functions

  • It seems useful to be able to link/group individual messages
  • It seems useful to be able to annotate content (with text, links)
  • It seems useful to be able to contribute anonymously
  • It seems useful to be able to annotate/qualify editorial contributions by others
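
One possible information structure for these editorial functions is sketched below. It is only an illustration: the record types, field names and sample values are all assumptions, not an agreed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    """A note attached to a document, or to another annotation."""
    annotation_id: str
    target: str                       # document address or annotation ID
    text: str = ""
    links: List[str] = field(default_factory=list)
    author: Optional[str] = None      # None models an anonymous contribution

    @property
    def is_anonymous(self) -> bool:
        return self.author is None


@dataclass
class MessageGroup:
    """Links individual messages together under a common title."""
    group_id: str
    title: str
    document_addresses: List[str] = field(default_factory=list)


# Hypothetical usage: an anonymous note on a cable, then a second
# contribution that qualifies the first one by targeting its ID.
note = Annotation("a1", "10COPENHAGEN69", text="Relates to news story X")
reply = Annotation("a2", note.annotation_id, text="Source for this?", author="editor-7")
group = MessageGroup("g1", "Example group", ["10COPENHAGEN69"])
```

Letting an annotation target either a document or another annotation keeps the fourth function (qualifying contributions by others) from needing a record type of its own.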

Interestingness, Popularity

  • What constitutes an "interesting" section of a document is a matter of perspective.
    • Such annotations become more useful if they're linked to a context (e.g. "this cable relates to news story X")
  • Relationship between "popular" and "interesting" items:
    • Much easier to establish "popularity" via simple (implicit, explicit) voting mechanisms: Q&A sites, collaborative news sites, click tracking etc.
    • "Interestingness" requires more work, since it is the result of an editorial process. This makes it slower, potentially tedious to demonstrate, and error-prone.
    • The latter could however feed into the former: items that are widely perceived as interesting are likely to become popular
      • To best accomplish this we should attempt to simplify the workflow of an editorial process.
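
As a rough illustration of how cheaply "popularity" can be established, the sketch below tallies implicit and explicit signals per document. The event kinds and their weights are invented for this example and would need tuning in practice.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical weights: an explicit vote counts more than an implicit click,
# and an editorial highlight (the "interesting" signal) counts most.
EVENT_WEIGHTS = {"click": 1, "vote": 5, "editorial_highlight": 10}


def popularity(events: Iterable[Tuple[str, str]]) -> Counter:
    """Sum weighted events per document ID; unknown event kinds score 0."""
    scores: Counter = Counter()
    for document_id, kind in events:
        scores[document_id] += EVENT_WEIGHTS.get(kind, 0)
    return scores


# Made-up event stream using document IDs that appear on this page.
events = [
    ("10COPENHAGEN69", "click"),
    ("10COPENHAGEN69", "vote"),
    ("01:06:43:11", "click"),
    ("01:06:43:11", "editorial_highlight"),
]
scores = popularity(events)
```

Treating an editorial highlight as just another weighted event is one simple way to let "interesting" items feed into the popularity ranking.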

Interoperability

  • Many parties are already building browsers for data log archives, with varying ways of navigating such content.
    • We don't need to duplicate those efforts, but we should integrate with them.

Addressing Schemes for Archives

  • Need a shared addressing scheme that works across archives and archive browsers
    • Based on permalinks?
    • Definite goal: to place links. Ability to send reader to the source material
    • Optional goal: simple API. Ability to load source material by forming an address. (This places more requirements on the nature of the archive of the respective source material.)
  • Alternatively: need a method of translating between different addressing schemes
    • Start with a review of link structures of a wide spectrum of archives
  • Should publish best practices for a good addressing scheme
    • Document the structure of individual addressing schemes
    • Publish recommendations for addressing schemes, terminology used: common conventions
  • ...
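
Translating between addressing schemes could start from a small table of per-archive URL patterns. The sketch below is derived only from the two WikiLeaks example URLs listed on this page; the archive names, patterns and template are assumptions and may not hold for other documents in those archives.

```python
import re
from typing import NamedTuple, Optional


class DocumentAddress(NamedTuple):
    archive: str       # hypothetical short name for the archive
    document_id: str


# Patterns inferred from one example URL per archive (an assumption).
URL_PATTERNS = {
    "iraq-war-logs": re.compile(
        r"^http://warlogs\.wikileaks\.org/id/(?P<id>[0-9A-F-]+)/?$"
    ),
    "embassy-cables": re.compile(
        r"/cable/\d{4}/\d{2}/(?P<id>[0-9A-Z]+)\.html$"
    ),
}

# Only archives whose URL can be formed from the ID alone get a template.
URL_TEMPLATES = {
    "iraq-war-logs": "http://warlogs.wikileaks.org/id/{id}/",
}


def parse_url(url: str) -> Optional[DocumentAddress]:
    """Map a document URL to (archive, document ID), if a pattern matches."""
    for archive, pattern in URL_PATTERNS.items():
        match = pattern.search(url)
        if match:
            return DocumentAddress(archive, match.group("id"))
    return None


def build_url(address: DocumentAddress) -> Optional[str]:
    """Form a link from an address, or None if the ID alone is not enough."""
    template = URL_TEMPLATES.get(address.archive)
    return template.format(id=address.document_id) if template else None
```

The cable URL also contains a year and month that the document ID does not fully encode, so `build_url` returns None for it. That is one concrete case of the optional "simple API" goal placing more requirements on the archive than the "place links" goal does.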

Spacelog highlights:

  • Addressing schemes become more complex for timeline browsers
  • This data log is not just a collection of independent documents, but a sequence of directly related events
  • The primary reading mode is in context: items are always presented within a timeline
    • Want to show documents leading up to and succeeding the highlighted documents
  • As a consequence, an address is a tuple of a) a document ID or pair of document IDs (start, end) and b) a visual offset that anchors the start of the reading position
  • This is mostly an artefact of timeline presentation within a browser
    • Want to anchor reading position at the first highlighted document. The reader can then scroll up to get context
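
The address tuple described above can be written down directly. This is a minimal sketch; the class name, field names and URL layout are assumptions inferred from the two Spacelog example URLs on this page, not from Spacelog's own code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimelineAddress:
    """A document ID (or start/end pair) plus an optional reading offset."""
    start_id: str
    end_id: Optional[str] = None     # set only for document ranges
    offset: Optional[str] = None     # fragment identifier without the '#'

    def to_url(self, base: str) -> str:
        """Build a link in the layout seen in the Spacelog example URLs."""
        path = f"{self.start_id}/"
        if self.end_id is not None:
            path += f"{self.end_id}/"
        fragment = f"#{self.offset}" if self.offset else ""
        return f"{base.rstrip('/')}/{path}{fragment}"


# Usage with the IDs from the Spacelog example.
single = TimelineAddress("01:06:43:11", offset="log-line-110591")
ranged = TimelineAddress("01:06:43:11", "01:06:43:30", "log-line-110591")
print(single.to_url("http://apollo13.spacelog.org"))
print(ranged.to_url("http://apollo13.spacelog.org"))
```

Keeping the offset separate from the document IDs reflects the observation that it is a presentation detail rather than part of the document's identity.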

Content licenses

  • A good discovery platform may want to republish content from linked archives to be able to present them in a coherent manner
    • In the case of data published by governments and NGOs the data may either be in the public domain, or will have an explicit license
    • In the case of document leaks the legal status of this data may not be clear, and there may not be an explicit license

Links