User:Martind/Document Log Discovery Platform
== Problem Statement ==

We're seeing an increase in the publication of vast corpuses of document logs, often in the form of message archives, usually in a structured message format. They're all quite overwhelming: how to make sense of such a vast amount of text? How to identify sections that are relevant? When faced with a new corpus, how can I see what other people have already found? How can I make my own findings available to others?

* Can we allow a large number of interested parties to annotate these documents?
** What kinds of annotations do we want to make? (Information structure)
** How can we make that easy? (Tools)
** Can we identify good conventions and techniques for the above that are more generally applicable? (Patterns of use)
* Finally, can we think of these functions as a layer on top of mere archives, and construct them as a physically separate service?
** How would such a service integrate with layers below (archives) and above (editorial, exploration)?

Note:
* These notes are limited to text document corpuses, and won't attempt to incorporate numerical/statistical/other data repositories.
* Specifically, no attempt is made to address information within a document, or information aggregated across documents, where such derivative forms don't already exist.


== Exemplary Publications ==

Look out for:
* Canonical document URLs (implicit, or explicit via rel="canonical"; see the detection sketch after this list)
* Corpus IDs
* Document IDs
* Document ranges for timeline browsers
* Additional presentation information, e.g. reading offset
* How to construct canonical URLs from document IDs
* Existing services that interact with this archive
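
A minimal sketch of checking for an explicit rel="canonical" link on an archive page, using only the Python standard library. The example URL is taken from the SpaceLog section below; multi-valued rel attributes are ignored for brevity:

<pre>
from html.parser import HTMLParser
from urllib.request import urlopen


class CanonicalLinkParser(HTMLParser):
    """Collects the href of any <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Simple check; multi-valued rel attributes are not handled.
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical_url = attrs.get("href")


def find_canonical(url):
    """Fetch a page and return its explicit canonical URL, or None."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical_url


# Example (SpaceLog pages declare rel="canonical"):
# find_canonical("http://apollo13.spacelog.org/01:06:43:11/")
</pre>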


=== WikiLeaks Iraq War Logs ===
* Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ ([http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ archive.org])
* Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
* Document also has a "Tracking number": 20091223210038SMB
* To construct a canonical document URL: Base URL + document ID (see the sketch below)
* Other services:
** URLs are actively being shared (and annotated) on Twitter
** This data set is present in Google Fusion Tables, though seemingly without further annotations
* Other observations:
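
A minimal sketch of the "Base URL + document ID" construction above; the base URL and trailing slash are inferred from the published example URL rather than from any documented interface:

<pre>
# Canonical URL construction for the Iraq War Logs archive:
# base URL + document ID (trailing slash as in the published example).
WARLOGS_BASE_URL = "http://warlogs.wikileaks.org/id/"


def warlogs_document_url(document_id):
    """Build a canonical document URL from a document ID."""
    return WARLOGS_BASE_URL + document_id + "/"


# warlogs_document_url("BCD499A0-F0A3-2B1D-B27A2F1D750FE720")
# -> "http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/"
</pre>
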
===WikiLeaks Afghan War Diary===
* Document URL:
* Document ID: D92871CA-D217-4124-B8FB-89B9A2CFFCB4
* Document also has a "Tracking number": 2007-033-004042-0756
* To construct a canonical document URL:
* Other services:
** URLs are actively being shared (and annotated) on Twitter
** This data set is present in Google Fusion Tables, though seemingly without further annotations
* Other observations:


===WikiLeaks Embassy Cables===
* Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
* Document ID: 10COPENHAGEN69
* Browsers: [http://cablesearch.org/ cablesearch.org]
* To construct a canonical document URL: Base URL + document ID
* Other services:
** URLs/identifiers are actively being shared (and annotated) on Twitter, e.g. http://twitter.com/bennohansen/statuses/14367684422533120


===SpaceLog===
* Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
* Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
* Corpus ID: apollo13
* Document ID: 01:06:43:11
* Presentation parameters:
** Reading offset ID: #log-line-110591
* To construct canonical URLs (see the sketch below):
** Document URL: URL template, corpus ID, document ID, reading offset
** Document range URL: URL template, corpus ID, document IDs, reading offset
** No means to query corpus ID, reading offset for a document ID
* Other services:
** URLs are actively being shared (and annotated) on Twitter
* Other observations:
** Has rel="canonical"
** Document IDs are actually timestamps.
** There are collisions within a corpus, which does not greatly affect presentation, but may affect integration with other services
** There are collisions across corpuses (concurrent space missions, e.g. Gemini 6A and 7)
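
A sketch of the URL constructions above; the URL templates are inferred purely from the two example URLs and are an assumption, not a documented SpaceLog interface:

<pre>
def spacelog_url(corpus_id, start_id, end_id=None, offset=None):
    """Build a SpaceLog document or document-range URL from its parts.

    Template inferred from the example URLs above; this is an assumption,
    not a documented SpaceLog interface.
    """
    url = f"http://{corpus_id}.spacelog.org/{start_id}/"
    if end_id is not None:
        url += f"{end_id}/"
    if offset is not None:
        url += f"#log-line-{offset}"
    return url


# spacelog_url("apollo13", "01:06:43:11", "01:06:43:30", 110591)
# -> "http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591"
</pre>
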


===Twitter===
* Document URL:
** http://twitter.com/#!/wikileaks/statuses/15975805188317184
* To construct canonical document URL: base URL + corpus ID + document ID (see the sketch below)
** Can query corpus ID (username) via e.g. http://dev.twitter.com/doc/get/statuses/show/:id
* Other services:
** ExquisiteTweets allows grouping tweets
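
A sketch of the corpus ID lookup above. The REST endpoint, response shape and lack of authentication are assumptions based on the statuses/show documentation linked above as it stood at the time of writing; the current Twitter API differs:

<pre>
import json
from urllib.request import urlopen


def twitter_corpus_id(status_id):
    """Look up the corpus ID (author's username) for a given status ID.

    Assumes the statuses/show endpoint returns JSON with a user.screen_name
    field, as documented when these notes were written; this is an
    assumption and the current API requires authentication.
    """
    url = f"http://api.twitter.com/1/statuses/show/{status_id}.json"
    with urlopen(url) as response:
        status = json.loads(response.read().decode("utf-8"))
    return status["user"]["screen_name"]


def twitter_canonical_url(status_id):
    """Base URL + corpus ID + document ID, as described above."""
    return f"http://twitter.com/#!/{twitter_corpus_id(status_id)}/statuses/{status_id}"


# twitter_canonical_url("15975805188317184")
# -> "http://twitter.com/#!/wikileaks/statuses/15975805188317184"
</pre>
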


===Eur-Lex===
* Document URL: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:333:0001:0005:EN:PDF
* Corpus ID: OJ:L (Official Journal, legislation series)
* Document ID: 2010:333:0001:0005
* Presentation parameters:
** Language: EN
** Format: PDF
* To construct canonical document URL: URL template + corpus ID + document ID + presentation parameters (see the sketch below)
* Other observations:
** Corpus ID could be understood as part of the document ID; we may not need to treat them separately
* Other services:
** TODO. They are likely to exist.
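
A sketch of the construction above; the URL template and the field order (corpus ID, document ID, language, format) are inferred from the single example URL:

<pre>
def eurlex_url(corpus_id, document_id, language="EN", fmt="PDF"):
    """Build a Eur-Lex document URL from corpus ID, document ID and
    presentation parameters (template inferred from the example above)."""
    uri = f"{corpus_id}:{document_id}:{language}:{fmt}"
    return f"http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri={uri}"


# eurlex_url("OJ:L", "2010:333:0001:0005")
# -> ".../LexUriServ.do?uri=OJ:L:2010:333:0001:0005:EN:PDF"
</pre>
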


===TODO===
* Patent databases
* Databases of law
* ...


== Observations ==
=== Presentation ===
* Would like to be able to extract document content (excerpt or in full) to present it along with editorial information and other context
** Though don't make it too complex. Core use case: WikiLeaks text archives
* Would like to offer editorial tools in a way that lets them be easily integrated into other sites, e.g. as part of an editorial publication
** This infrastructure is plumbing, not a product in itself. It only becomes useful by feeding into and strengthening existing editorial processes.


=== Editorial Functions ===
* It seems useful to be able to link/group individual messages
* It seems useful to be able to annotate content (with text, links)
* It seems useful to be able to contribute anonymously
* It seems useful to be able to annotate/qualify editorial contributions by others


=== Interestingness, Popularity ===
* What constitutes an "interesting" section of a document is a matter of perspective.
** Such annotations become more useful if they're linked to a context (e.g. "this cable relates to news story X")
* Relationship between "popular" and "interesting" items:
** Much easier to establish "popularity" via simple (implicit, explicit) voting mechanisms: Q&A sites, collaborative news sites, click tracking etc.
** "Interestingness" requires more work, since it is the result of an editorial process. This makes it slower, potentially tedious to demonstrate, and error prone.
** The latter could however feed into the former: items that are widely perceived as interesting
*** To best accomplish this we should attempt to simplify the workflow of an editorial process.


=== Interoperability ===
* Many parties will build their own browsers for document log archives, with varying ways of navigating such content.
** We don't need to duplicate those efforts, but we should integrate with them.


=== Addressing Schemes for Archives ===

Addressability is a first technical barrier:
* Being able to construct links between multiple representations of the same documents
** Across mirrors (same content, same addressing scheme, different location)
** Across types of archive browsers (some may have further annotations, all may have different addressing schemes, all will have different locations)
* Being able to identify existing references/annotations by detecting such links
** E.g. via Twitter/Google search

Need a shared addressing scheme that works across archives and archive browsers:
* Based on permalinks?
* Definite goal: to place links. Ability to send the reader to the source material
* Optional goal: simple API. Ability to load source material by forming an address. (This places more requirements on the nature of the archive of the respective source material.)

Alternatively: need a method of translating between different addressing schemes (a sketch follows at the end of this section)
* Start with a review of link structures of a wide spectrum of archives

Should publish best practices for a good addressing scheme:
* Document the structure of individual addressing schemes
* Publish recommendations for addressing schemes, terminology used: common conventions
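
A minimal sketch of what this plumbing could look like: a registry of per-archive URL templates keyed by a shared (archive, corpus ID, document ID) reference, so the same reference can be rendered for different browsers or mirrors. The warlogs and spacelog templates are inferred from example URLs above; the registry keys and the mirror entry are invented placeholders:

<pre>
from collections import namedtuple

# A document reference that is independent of any particular archive browser.
DocRef = namedtuple("DocRef", ["archive", "corpus_id", "document_id"])

# Hypothetical registry: archive/browser name -> URL template.
# The "warlogs-mirror.example.org" entry is an invented placeholder.
URL_TEMPLATES = {
    "warlogs.wikileaks.org": "http://warlogs.wikileaks.org/id/{document_id}/",
    "warlogs-mirror.example.org": "http://warlogs-mirror.example.org/id/{document_id}/",
    "spacelog.org": "http://{corpus_id}.spacelog.org/{document_id}/",
}


def format_url(ref, browser):
    """Render a document reference as a URL for one specific archive browser."""
    return URL_TEMPLATES[browser].format(
        corpus_id=ref.corpus_id, document_id=ref.document_id)


# Translating between representations of the same document:
ref = DocRef("iraq-war-logs", None, "BCD499A0-F0A3-2B1D-B27A2F1D750FE720")
# format_url(ref, "warlogs.wikileaks.org")
# format_url(ref, "warlogs-mirror.example.org")
</pre>
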


Spacelog highlights:
* Addressing schemes become more complex for timeline browsers
* This document log is not just a collection of independent documents, but a sequence of directly related events
* The primary reading mode is in context: items are always presented within a timeline
** Want to show documents leading up to and succeeding the highlighted documents
* As a consequence, an address is a tuple of a) a document ID or pair of document IDs (start, end) and b) a visual offset that anchors the start of the reading position
* This is mostly an artefact of timeline presentation within a browser
** Want to anchor the reading position at the first highlighted document. The reader can then scroll up to get context
=== Content licenses ===
* A good discovery platform may want to republish content from linked archives to be able to present it in a coherent manner
** In the case of documents published by governments and NGOs, the data may either be in the public domain or have an explicit license
** In the case of document leaks, the legal status of the data may not be obvious, and there may not be an explicit license

== Workflows ==
=== Hunt & Gather ===
* Browse documents for a supported archive on any of its mirrors
* Click "annotate" button
** Service detects repository, corpus, document ID (see the sketch after this list)
** Service allows amending existing editorial work for this corpus or document (yours, everybody's), or starting a new context
** Service presents editorial tools (annotate, link, publish, ...)
* Repeat
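
A sketch of the detection step above: match the current page URL against per-archive patterns to recover repository, corpus ID and document ID. The patterns are inferred from the example URLs in the sections above and only cover those archives:

<pre>
import re

# URL patterns inferred from the example URLs above; named groups carry
# the corpus ID (where present) and the document ID.
ARCHIVE_PATTERNS = {
    "wikileaks-warlogs": re.compile(
        r"^http://warlogs\.wikileaks\.org/id/(?P<document_id>[A-F0-9-]+)/"),
    "spacelog": re.compile(
        r"^http://(?P<corpus_id>\w+)\.spacelog\.org/(?P<document_id>[0-9:]+)/"),
    "twitter": re.compile(
        r"^http://twitter\.com/(?:#!/)?(?P<corpus_id>\w+)/statuses/(?P<document_id>\d+)"),
}


def detect(url):
    """Return (repository, corpus ID, document ID) for a known archive URL."""
    for repository, pattern in ARCHIVE_PATTERNS.items():
        match = pattern.match(url)
        if match:
            groups = match.groupdict()
            return repository, groups.get("corpus_id"), groups["document_id"]
    return None


# detect("http://apollo13.spacelog.org/01:06:43:11/#log-line-110591")
# -> ("spacelog", "apollo13", "01:06:43:11")
</pre>
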
=== Review ===
* Browse: service has lists of editorial work (search, recent, my, popular, ...)
* Open one editorial context (which belongs to a person or group, or represents the total/global work for this context)
** Service presents source material, all editorial information, links to additional services relating to the referenced archive(s)
** Service allows feedback (voting, amendments, comments)
* Repeat
=== Shaping Contexts ===
* Allow to "clone" public and personal editorial contexts to a new context in your personal namespace
* This merges into an existing target context (or creates a blank one first)
* Allow to withdraw your own contributions (should probably archive it instead of just deleting it)
* All of these merely operate on references (to archives, contributors, comments, ...)
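
A small sketch of that last point: an editorial context as a named set of references, with clone/merge/withdraw as operations on those references and no archive content copied. All class and field names are hypothetical:

<pre>
from dataclasses import dataclass, field


@dataclass
class EditorialContext:
    """A named collection of references; no archive content is copied."""
    owner: str
    name: str
    references: set = field(default_factory=set)   # e.g. (repository, corpus, doc ID) tuples
    withdrawn: set = field(default_factory=set)    # archived rather than deleted

    def clone(self, new_owner):
        """Clone a public or personal context into another namespace."""
        return EditorialContext(new_owner, self.name, set(self.references))

    def merge(self, other):
        """Merge another context into this one (references only)."""
        self.references |= other.references

    def withdraw(self, reference):
        """Withdraw a contribution: archive the reference instead of deleting it."""
        self.references.discard(reference)
        self.withdrawn.add(reference)
</pre>
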


== Links ==
