User:Martind/Document Log Discovery Platform

From London Hackspace Wiki

Latest revision as of 16:53, 26 December 2010

Problem Statement

We're seeing an increase in the publication of vast corpuses of document logs, often in the form of message archives, usually in a structured message format. They're all quite overwhelming: how to make sense of such a vast amount of text? How to identify sections that are relevant? When confronted with a new corpus, how can I see what other people have already found? How can I make my own findings available to others?

  • Can we allow a large number of interested parties to annotate these documents?
    • What kinds of annotations do we want to make? (Information structure)
    • How can we make that easy? (Tools)
    • Can we identify good conventions and techniques for the above that are more generally applicable? (Patterns of use)
  • Finally, can we think of these functions as a layer on top of mere archives, and construct them as a physically separate service?
    • How would such a service integrate with layers below (archives) and above (editorial, exploration)?

Note:

  • These notes are limited to text document corpuses, and won't attempt to incorporate numerical/statistical/other data repositories.
  • Specifically no attempt is made to address information within a document, or to address information aggregated across documents, if such derivative forms don't already exist.

Exemplary Publications

Look out for:

  • Canonical document URLs (implicit, or explicit via rel="canonical")
  • Corpus IDs
  • Document IDs
  • Document ranges for timeline browsers
  • Additional presentation information, e.g. reading offset
  • How to construct canonical URLs from document IDs
  • Existing services that interact with this archive

WikiLeaks Iraq War Logs

  • Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ (archive.org)
  • Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
  • Document also has a "Tracking number": 20091223210038SMB
  • To construct a canonical document URL: Base URL + document ID
  • Other services:
    • URLs are actively being shared (and annotated) on Twitter
    • This data set is present in Google Fusion Tables, though seemingly without further annotations
  • Other observations:
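
The "Base URL + document ID" rule above can be sketched in Python as follows. This is only an illustration: the function names are made up for this note, and the ID pattern is an assumption inferred from the single example ID listed here, not from any published specification.

```python
import re
from typing import Optional

# Base URL taken from the example document URL above.
BASE_URL = "http://warlogs.wikileaks.org/id/"

# Assumed ID shape (8-4-4-16 hex digits), inferred from the one example ID.
ID_PATTERN = re.compile(r"[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{16}")


def document_url(document_id: str) -> str:
    """Build a canonical document URL as base URL + document ID."""
    if not ID_PATTERN.fullmatch(document_id):
        raise ValueError(f"unexpected document ID: {document_id!r}")
    return f"{BASE_URL}{document_id}/"


def document_id_from_url(url: str) -> Optional[str]:
    """Recover the document ID from a URL, or None if none is found."""
    match = ID_PATTERN.search(url)
    return match.group(0) if match else None
```

Because the ID is searched for anywhere in the URL, the same helper should also recover the ID from a copy of the page held by a web archive, as long as the original URL is embedded in the archive URL.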

WikiLeaks Afghan War Diary

  • Document URL:
  • Document ID: D92871CA-D217-4124-B8FB-89B9A2CFFCB4
  • Document also has a "Tracking number": 2007-033-004042-0756
  • To construct a canonical document URL:
  • Other services:
    • URLs are actively being shared (and annotated) on Twitter
    • This data set is present in Google Fusion Tables, though seemingly without further annotations
  • Other observations:

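WikiLeaks Embassy Cables

  • Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
  • Document ID: 10COPENHAGEN69
  • Browsers: cablesearch.org
  • To construct a canonical document URL: Base URL + document ID
  • Other services:
    • URLs/identifiers are actively being shared (and annotated) on Twitter, e.g. http://twitter.com/bennohansen/statuses/14367684422533120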

SpaceLog

  • Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
  • Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
  • Corpus ID: apollo13
  • Document ID: 01:06:43:11
  • Presentation parameters:
    • Reading offset ID: #log-line-110591
  • To construct canonical URLs:
    • Document URL: URL template, corpus ID, document ID, reading offset
    • Document range URL: URL template, corpus ID, document IDs, reading offset
    • No means to query corpus ID, reading offset for a document ID
  • Other services:
    • URLs are actively being shared (and annotated) on Twitter
  • Other observations:
    • Has rel="canonical"
    • Document IDs are actually timestamps.
    • There are ID collisions within a corpus. This does not greatly affect presentation, but may affect integration with other services
    • There are collisions across corpuses (concurrent space missions, e.g. Gemini 6A and 7)
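
The two SpaceLog URL forms above can be produced from one template. The following Python sketch assumes the template shown in the code, which is inferred from the two example URLs only; the function and parameter names are hypothetical.

```python
from typing import Optional

# URL template inferred from the two example URLs above (an assumption).
URL_TEMPLATE = "http://{corpus}.spacelog.org/{path}/"


def spacelog_url(corpus_id: str, start: str, end: Optional[str] = None,
                 log_line: Optional[int] = None) -> str:
    """Build a document URL, or a document range URL when `end` is given."""
    # A range is expressed as two document IDs (timestamps) in the path.
    path = start if end is None else f"{start}/{end}"
    url = URL_TEMPLATE.format(corpus=corpus_id, path=path)
    # The reading offset is a presentation parameter carried in the fragment.
    if log_line is not None:
        url += f"#log-line-{log_line}"
    return url
```

Note that the corpus ID is a required argument: as observed above, a document ID alone is not enough to form a link.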

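Twitter

  • Document URL:
    • http://twitter.com/#!/wikileaks/statuses/15975805188317184
  • Corpus ID: wikileaks
  • Document ID: 15975805188317184
  • To construct canonical document URL: base URL + corpus ID + document ID
    • Can query corpus ID (username) via e.g. http://dev.twitter.com/doc/get/statuses/show/:id
  • Other services:
    • ExquisiteTweets allows grouping tweets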

Eur-Lex

  • Document URL: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:333:0001:0005:EN:PDF
  • Corpus ID: OJ:L (Official Journal, legislation series)
  • Document ID: 2010:333:0001:0005
  • Presentation parameters:
    • Language: EN
    • Format: PDF
  • To construct canonical document URL: URL template + corpus ID + document ID + presentation parameters
  • Other observations:
    • Corpus ID could be understood as part of the document ID; we may not need to treat them separately
  • Other services:
    • TODO. They are likely to exist.
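
As a rough illustration of the breakdown above, the `uri` query parameter of the example URL can be split into corpus ID, document ID and presentation parameters. The split positions are an assumption based on this one example URL; the class and function names are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlsplit

BASE = "http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri="


class EurLexAddress(NamedTuple):
    corpus_id: str    # e.g. "OJ:L"
    document_id: str  # e.g. "2010:333:0001:0005"
    language: str     # presentation parameter, e.g. "EN"
    fmt: str          # presentation parameter, e.g. "PDF"


def parse_eurlex_url(url: str) -> EurLexAddress:
    """Split the `uri` parameter into its assumed eight colon-separated parts."""
    values = parse_qs(urlsplit(url).query).get("uri")
    if not values:
        raise ValueError("URL has no 'uri' parameter")
    parts = values[0].split(":")
    if len(parts) != 8:
        raise ValueError(f"unexpected uri value: {values[0]!r}")
    return EurLexAddress(":".join(parts[:2]), ":".join(parts[2:6]),
                         parts[6], parts[7])


def build_eurlex_url(address: EurLexAddress) -> str:
    """URL template + corpus ID + document ID + presentation parameters."""
    return BASE + ":".join(address)
```

Joining the four fields with ":" reproduces the original parameter, which also shows why the corpus ID could simply be treated as a prefix of the document ID.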

TODO

  • Patent databases
  • Databases of law
  • ...

Observations

Presentation

  • Would like to be able to extract document content (excerpt or in full) to present it along with editorial information and other context
    • Though don't make it too complex. Core use case: WikiLeaks text archives
  • Would like to offer editorial tools in a manner that they can easily be integrated on other sites, e.g. as part of an editorial publication
    • This infrastructure is plumbing, not a product in itself. It only becomes useful by feeding into and strengthening existing editorial processes.

Editorial Functions

  • It seems useful to be able to link/group individual messages
  • It seems useful to be able to annotate content (with text, links)
  • It seems useful to be able to contribute anonymously
  • It seems useful to be able to annotate/qualify editorial contributions by others

Interestingness, Popularity

  • What constitutes an "interesting" section of a document is a matter of perspective.
    • Such annotations become more useful if they're linked to a context (e.g. "this cable relates to news story X")
  • Relationship between "popular" and "interesting" items:
    • Much easier to establish "popularity" via simple (implicit, explicit) voting mechanisms: Q&A sites, collaborative news sites, click tracking etc.
    • "Interestingness" requires more work, since it is the result of an editorial process. This makes it slower, potentially tedious to demonstrate, and error prone.
    • The latter could however feed into the former: items that are widely perceived as interesting may also become popular
      • To best accomplish this we should attempt to simplify the workflow of an editorial process.

Interoperability

  • Many parties are already building browsers for document log archives, with varying ways of navigating such content.
    • We don't need to duplicate those efforts, but we should integrate with them.

Addressing Schemes for Archives

Addressability is a first technical barrier:

  • Being able to construct links between multiple representations of the same documents
    • Across mirrors (same content, same addressing scheme, different location)
    • Across types of archive browsers (some may have further annotations, all may have different addressing schemes, all will have different locations)
  • Being able to identify existing references/annotations by detecting such links
    • E.g. via Twitter/Google search
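
Detecting existing references could start with a small registry of URL patterns that is run over free text such as tweets or search results. The sketch below is an assumption-laden illustration: the registry, its archive names and the patterns are hypothetical, the patterns are inferred from the example URLs in these notes, and it is assumed that mirrors keep the same path structure under a different host.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical registry: archive name -> URL pattern with a named "doc" group
# and an optional "corpus" group. Patterns are inferred from example URLs.
ARCHIVE_PATTERNS = {
    # Any host is accepted, on the assumption that mirrors keep the /id/ path.
    "iraq-war-logs": re.compile(
        r"https?://[^/\s]+/id/"
        r"(?P<doc>[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{16})"
    ),
    "spacelog": re.compile(
        r"https?://(?P<corpus>[a-z0-9]+)\.spacelog\.org/"
        r"(?P<doc>\d{2}:\d{2}:\d{2}:\d{2})/"
    ),
}

Reference = Tuple[str, Optional[str], str]  # (archive, corpus ID, document ID)


def find_references(text: str) -> List[Reference]:
    """Return every archive reference found in a piece of text."""
    found: List[Reference] = []
    for archive, pattern in ARCHIVE_PATTERNS.items():
        for match in pattern.finditer(text):
            groups = match.groupdict()
            found.append((archive, groups.get("corpus"), groups["doc"]))
    return found
```

The result is independent of where a document was linked from, so references to a mirror and to the original location end up under the same document ID.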

Need a shared addressing scheme that works across archives, archive browsers

  • Based on permalinks?
  • Definite goal: to place links. Ability to send reader to the source material
  • Optional goal: simple API. Ability to load source material by forming an address. (This places more requirements on the nature of the archive of the respective source material.)

Alternatively: need a method of translating between different addressing schemes

  • Start with a review of link structures of a wide spectrum of archives
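
One possible shape for such a translation step: each addressing scheme is described by a pattern (to read an address) and a template (to write one), and URLs are translated via the document ID they share. In this Python sketch the scheme registry and the second scheme, "example-browser" at browser.example.org, are entirely hypothetical and exist only to show the mechanism.

```python
import re
from typing import Optional

# Hypothetical scheme registry: name -> (archive, pattern, template).
# "example-browser" is invented for illustration; it is not a real service.
SCHEMES = {
    "wikileaks": (
        "iraq-war-logs",
        re.compile(r"^http://warlogs\.wikileaks\.org/id/(?P<doc>[^/]+)/?$"),
        "http://warlogs.wikileaks.org/id/{doc}/",
    ),
    "example-browser": (
        "iraq-war-logs",
        re.compile(r"^http://browser\.example\.org/iraq/\?id=(?P<doc>[^&]+)$"),
        "http://browser.example.org/iraq/?id={doc}",
    ),
}


def translate(url: str, target: str) -> Optional[str]:
    """Rewrite a URL from any known scheme into the target scheme.

    Returns None if the URL is not recognised or belongs to another archive.
    """
    target_archive, _, template = SCHEMES[target]
    for archive, pattern, _ in SCHEMES.values():
        match = pattern.match(url)
        if match and archive == target_archive:
            return template.format(doc=match.group("doc"))
    return None
```

A review of real link structures would replace the two entries here; the point is only that translation needs no shared scheme as long as every scheme can be read and written.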

Should publish best practices for a good addressing scheme

  • Document the structure of individual addressing schemes
  • Publish recommendations for addressing schemes, terminology used: common conventions

Spacelog highlights:

  • Addressing schemes become more complex for timeline browsers
  • This document log is not just a collection of independent documents, but a sequence of directly related events
  • The primary reading mode is in context: items are always presented within a timeline
    • Want to show documents leading up to and succeeding the highlighted documents
  • As a consequence, an address is a tuple of a) a document ID or pair of document IDs (start, end) and b) a visual offset that anchors the start of the reading position
  • This is mostly an artefact of timeline presentation within a browser
    • Want to anchor reading position at the first highlighted document. The reader can then scroll up to get context
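
The address tuple described above could be modelled as a small record. The sketch below parses a SpaceLog-style URL into such a record; the class name, field names and the assumption that the corpus ID is the first host label are all illustrative and derived only from the Apollo 13 example URLs in these notes.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

_TIMESTAMP = r"\d{2}:\d{2}:\d{2}:\d{2}"
# Path holds one document ID, or a (start, end) pair of document IDs.
_PATH = re.compile(rf"^/(?P<start>{_TIMESTAMP})/(?:(?P<end>{_TIMESTAMP})/)?$")


@dataclass(frozen=True)
class TimelineAddress:
    corpus_id: str
    start: str                            # first highlighted document
    end: Optional[str] = None             # last highlighted document, if a range
    reading_offset: Optional[str] = None  # anchor for the reading position


def parse_timeline_url(url: str) -> TimelineAddress:
    """Split a timeline URL into corpus ID, document ID(s) and reading offset."""
    parts = urlsplit(url)
    match = _PATH.match(parts.path)
    if match is None or not parts.hostname:
        raise ValueError(f"not a recognised timeline URL: {url!r}")
    corpus_id = parts.hostname.split(".")[0]  # assumed: first host label
    return TimelineAddress(corpus_id, match.group("start"),
                           match.group("end"), parts.fragment or None)
```

Keeping the reading offset as a separate optional field reflects that it is a presentation detail rather than part of the document's identity.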

Content licenses

  • A good discovery platform may want to republish content from linked archives to be able to present them in a coherent manner
    • In the case of documents published by governments and NGOs the data may either be in the public domain or carry an explicit license
    • In the case of document leaks the legal status of the data may not be clear, and there may not be an explicit license

Workflows

Hunt & Gather

  • Browse documents for a supported archive on any of its mirrors
  • Click "annotate" button
    • Service detects repository, corpus, document ID
    • Service allows amending existing editorial work for this corpus or document (yours, everybody's), or starting a new context
    • Service presents editorial tools (annotate, link, publish, ...)
  • Repeat

Review

  • Browse: service has lists of editorial work (search, recent, my, popular, ...)
  • Open one editorial context (which belongs either to a person or a group, or represents the total/global work for this context)
    • Service presents source material, all editorial information, links to additional services relating to the referenced archive(s)
    • Service allows feedback (voting, amendments, comments)
  • Repeat

Shaping Contexts

  • Allow users to "clone" public and personal editorial contexts to a new context in their personal namespace
  • This merges into an existing target context (or creates a blank one first)
  • Allow users to withdraw their own contributions (these should probably be archived rather than just deleted)
  • All of these merely operate on references (to archives, contributors, comments, ...)
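
A minimal sketch of these operations, assuming a context is simply a list of entries that each reference a contributor and a document. All class, field and function names here are hypothetical; the notes do not prescribe a data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    contributor: str
    document_ref: str  # a reference to an archived document, never a copy
    note: str


@dataclass
class Context:
    owner: str
    entries: List[Entry] = field(default_factory=list)
    archived: List[Entry] = field(default_factory=list)

    def merge_from(self, source: "Context") -> None:
        """Add the source's entries by reference, skipping duplicates."""
        for entry in source.entries:
            if entry not in self.entries:
                self.entries.append(entry)

    def withdraw(self, contributor: str) -> None:
        """Move a contributor's entries to the archive instead of deleting."""
        self.archived.extend(e for e in self.entries if e.contributor == contributor)
        self.entries = [e for e in self.entries if e.contributor != contributor]


def clone(source: Context, owner: str, target: Optional[Context] = None) -> Context:
    """Merge into an existing target context, or create a blank one first."""
    if target is None:
        target = Context(owner)
    target.merge_from(source)
    return target
```

Cloning copies no document content: the new context holds the same `Entry` objects as the source, and withdrawing from one context leaves the other untouched.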

Links