Difference between revisions of "User:Martind/Document Log Discovery Platform"

Revision as of 17:29, 18 December 2010

Problem Statement

We're seeing an increase in the publication of vast corpuses of data logs, often in the form of message archives, usually in a structured message format. They're all quite overwhelming: how to make sense of such a vast amount of text? How to identify sections that are relevant?

Can we allow a large number of interested parties (anyone, really) to annotate these documents?
- What kinds of annotations do we want to make? (Information structure)
- How can we make that easy? (Tools)
- Can we identify good conventions and techniques for the above that are more generally applicable? (Patterns of use)
Finally, can we think of these functions as a layer on top of mere archives, and construct them as a physically separate service?

Exemplary Publications

Look out for:

Canonical URLs (implicit, or explicit via rel="canonical")
Document IDs
Document ranges for timeline browsers
Additional presentation information, e.g. reading offset
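
As a rough illustration of the first point, an explicit canonical URL could be read from a page with Python's standard-library HTML parser. This is only a sketch: the class and function names, and the sample markup, are made up for this example and are not taken from any of the archives listed below.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical_url(html: str) -> Optional[str]:
    """Return the explicit canonical URL of a page, or None if it has none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Hypothetical page fragment, invented for this example.
SAMPLE = (
    '<html><head>'
    '<link rel="canonical" href="http://example.org/doc/42/">'
    '</head><body></body></html>'
)

if __name__ == "__main__":
    print(find_canonical_url(SAMPLE))
```

When a page has no such element, the function returns None and the canonical URL has to be treated as implicit, i.e. derived from the address the page was loaded from.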

WikiLeaks Iraq War Logs

Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ (archive.org)
Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720

WikiLeaks Embassy Cables

Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
Document ID: 10COPENHAGEN69
Browsers: cablesearch.org,

http://spacelog.org/

Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
Document ID: 01:06:43:11
Reading offset ID: #log-line-110591 ("log-line-110591" alone doesn't seem to suffice to construct a link)
has rel="canonical"
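
The document IDs above can be pulled out of the example URLs with one pattern per archive. The following is a minimal sketch: the patterns are inferred from these three example URLs only, the archive labels ("warlogs", "cables", "spacelog") are invented here, and real archives may well use other URL shapes that these patterns miss.

```python
import re
from typing import Optional, Tuple

# One pattern per archive, inferred from the example URLs only (assumption).
ID_PATTERNS = {
    "warlogs": re.compile(r"^https?://warlogs\.wikileaks\.org/id/([0-9A-F-]+)/?$"),
    "cables": re.compile(r"^https?://[^/]+/cable/\d{4}/\d{2}/([0-9A-Z]+)\.html$"),
    "spacelog": re.compile(
        r"^https?://[a-z0-9]+\.spacelog\.org/(\d{2}:\d{2}:\d{2}:\d{2})/"
    ),
}


def extract_document_id(url: str) -> Optional[Tuple[str, str]]:
    """Return (archive label, document ID), or None if no pattern matches."""
    # The fragment carries the reading offset, not the document ID.
    base = url.split("#", 1)[0]
    for archive, pattern in ID_PATTERNS.items():
        match = pattern.match(base)
        if match:
            return archive, match.group(1)
    return None


if __name__ == "__main__":
    for example in (
        "http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/",
        "http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html",
        "http://apollo13.spacelog.org/01:06:43:11/#log-line-110591",
    ):
        print(extract_document_id(example))
```

For a Spacelog range URL this returns only the first ID of the range, since the pattern stops after the first path segment.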

Observations

Editorial Functions

It seems useful to be able to link/group individual messages
It seems useful to be able to annotate content (with text, links)
It seems useful to be able to contribute anonymously
It seems useful to be able to annotate/qualify editorial contributions by others
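
One possible information structure covering these four functions is sketched below. It is an assumption for discussion, not a settled design: all field names are hypothetical, and the annotation text and news link are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    """A single editorial contribution (hypothetical structure)."""

    annotation_id: str
    # Addresses of the annotated messages; more than one target groups them.
    targets: List[str]
    text: str = ""
    links: List[str] = field(default_factory=list)
    # None means the contribution was made anonymously.
    author: Optional[str] = None
    # ID of another annotation that this one qualifies, if any.
    about: Optional[str] = None

    def is_anonymous(self) -> bool:
        return self.author is None


# An anonymous note on one cable, with a link to a (made-up) news story.
note = Annotation(
    annotation_id="a1",
    targets=["http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html"],
    text="Relates to news story X",
    links=["http://example.org/news/story-x"],
)

# A second contribution that qualifies the first one rather than a document.
review = Annotation(
    annotation_id="a2",
    targets=[],
    text="The linked story covers a different cable",
    author="reviewer",
    about=note.annotation_id,
)

if __name__ == "__main__":
    print(note.is_anonymous(), review.about)
```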

Interestingness, Popularity

What constitutes an "interesting" section of a document is a matter of perspective.
- Such annotations become more useful if they're linked to a context (e.g. "this cable relates to news story X")
Relationship between "popular" and "interesting" items:
- Much easier to establish "popularity" via simple (implicit, explicit) voting mechanisms: Q&A sites, collaborative news sites, click tracking etc.
- "Interestingness" requires more work, since it is the result of an editorial process. This makes it slower, potentially tedious to demonstrate, and error prone.
- The latter could however feed into the former: items that are widely perceived as interesting are likely to become popular
  - To best accomplish this we should attempt to simplify the workflow of an editorial process.

Interoperability

Many parties are already building browsers for data log archives, with varying ways of navigating such content.
- We don't need to duplicate those efforts, but we should integrate with them.

Addressing Schemes for Archives

Need a shared addressing scheme that works across archives and archive browsers
- Based on permalinks?
- Definite goal: to place links. Ability to send reader to the source material
- Optional goal: simple API. Ability to load source material by forming an address. (This places more requirements on the nature of the archive of the respective source material.)
Alternatively: need a method of translating between different addressing schemes
- Start with a review of link structures of a wide spectrum of archives
Should publish best practices for a good addressing scheme
- Document the structure of individual addressing schemes
- Publish recommendations for addressing schemes, terminology used: common conventions
...
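
The "optional goal" of forming an address from its parts could look roughly like the sketch below: one URL template per archive, filled in from named fields. The templates are copied from the example URLs in this page; the archive labels, the field names and the `resolve` function are assumptions made up for illustration.

```python
from typing import Dict

# URL templates taken from the example document URLs; labels are invented.
URL_TEMPLATES: Dict[str, str] = {
    "warlogs": "http://warlogs.wikileaks.org/id/{doc_id}/",
    "cables": "http://213.251.145.96/cable/{year}/{month}/{doc_id}.html",
    "apollo13": "http://apollo13.spacelog.org/{doc_id}/",
}


def resolve(archive: str, **fields: str) -> str:
    """Build a document URL from an archive label and its address fields.

    Raises KeyError for an unknown archive or a missing field.
    """
    template = URL_TEMPLATES[archive]
    return template.format(**fields)


if __name__ == "__main__":
    print(resolve("warlogs", doc_id="BCD499A0-F0A3-2B1D-B27A2F1D750FE720"))
    print(resolve("cables", year="2010", month="02", doc_id="10COPENHAGEN69"))
    print(resolve("apollo13", doc_id="01:06:43:11"))
```

Note that the cable URL cannot be formed from the document ID alone: the example URL also contains a year and a month. This is one concrete case of an addressing scheme placing extra requirements on whoever forms the address.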

Spacelog highlights:

Addressing schemes become more complex for timeline browsers
This data log is not just a collection of independent documents, but a sequence of directly related events
The primary reading mode is in context: items are always presented within a timeline
- Want to show documents leading up to and succeeding the highlighted documents
As a consequence, an address is a tuple of a) a document ID or pair of document IDs (start, end) and b) a visual offset that anchors the start of the reading position
This is mostly an artefact of timeline presentation within a browser
- Want to anchor reading position at the first highlighted document. The reader can then scroll up to get context
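
The address tuple described above can be sketched as a small value type that converts to and from the Spacelog URL form shown earlier. The class and its field names are hypothetical; the URL layout is inferred from the two Apollo 13 example URLs only.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class TimelineAddress:
    """A document ID or (start, end) pair, plus an optional reading offset."""

    start: str
    end: Optional[str] = None
    offset: Optional[str] = None  # anchor name, e.g. "log-line-110591"

    def to_url(self, base: str = "http://apollo13.spacelog.org") -> str:
        path = f"/{self.start}/"
        if self.end is not None:
            path += f"{self.end}/"
        fragment = f"#{self.offset}" if self.offset else ""
        return base + path + fragment

    @classmethod
    def from_url(cls, url: str) -> "TimelineAddress":
        parts = urlsplit(url)
        # Path segments are the document IDs; the fragment is the offset.
        ids = [segment for segment in parts.path.split("/") if segment]
        if not ids:
            raise ValueError(f"no document ID in {url!r}")
        end = ids[1] if len(ids) > 1 else None
        return cls(start=ids[0], end=end, offset=parts.fragment or None)


if __name__ == "__main__":
    address = TimelineAddress("01:06:43:11", "01:06:43:30", "log-line-110591")
    print(address.to_url())
    print(TimelineAddress.from_url(address.to_url()) == address)
```

Keeping the offset separate from the document IDs reflects the point that it is a presentation detail: the same range can be addressed with or without a reading position.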

Content licenses

A good discovery platform may want to republish content from linked archives to be able to present them in a coherent manner
- In the case of data published by governments and NGOs the data may either be in the public domain, or will have an explicit license
- In the case of document leaks the legal status of this data may not be clear, and there may not be an explicit license

Links

http://booktwo.org/notebook/openbookmarks/ check these sites for bookmark/annotation conventions
- http://www.openbookmarks.org/ (focused on ebooks)
  - http://wiki.openbookmarks.org/Bookmark_Exchange_Format

@@ Line 17: / Line 17: @@
 * Additional presentation information, e.g. reading offset
-* WikiLeaks Iraq War Logs
+WikiLeaks Iraq War Logs
-** Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ ([http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ archive.org])
+* Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ ([http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ archive.org])
-** Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
+* Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
-* WikiLeaks Embassy Cables
-** Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
+WikiLeaks Embassy Cables
-** Document ID: 10COPENHAGEN69
+* Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
-** Browsers: [http://cablesearch.org/ cablesearch.org],
+* Document ID: 10COPENHAGEN69
-* http://spacelog.org/
+* Browsers: [http://cablesearch.org/ cablesearch.org],
-** Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
-** Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
+http://spacelog.org/
-** Document ID: 01:06:43:11
+* Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
-** Reading offset ID: #log-line-110591 ("log-line-110591" alone doesn't seem to suffice to construct a link)
+* Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
+* Document ID: 01:06:43:11
+* Reading offset ID: #log-line-110591 ("log-line-110591" alone doesn't seem to suffice to construct a link)
+* has rel="canonical"
 == Observations ==