User:Martind/Document Log Discovery Platform: Difference between revisions
From London Hackspace Wiki
Latest revision as of 16:53, 26 December 2010
Problem Statement
We're seeing an increase in the publication of vast corpuses of document logs, often in the form of message archives, usually in a structured message format. They're all quite overwhelming: how can we make sense of such a vast amount of text? How can we identify the sections that are relevant? When confronted with a new corpus, how can I see what other people have already found? How can I make my own findings available to others?
- Can we allow a large number of interested parties to annotate these documents?
- What kinds of annotations do we want to make? (Information structure)
- How can we make that easy? (Tools)
- Can we identify good conventions and techniques for the above that are more generally applicable? (Patterns of use)
- Finally, can we think of these functions as a layer on top of mere archives, and construct them as a physically separate service?
- How would such a service integrate with layers below (archives) and above (editorial, exploration)?
Note:
- These notes are limited to text document corpuses, and won't attempt to incorporate numerical/statistical/other data repositories.
- Specifically no attempt is made to address information within a document, or to address information aggregated across documents, if such derivative forms don't already exist.
Exemplary Publications
Look out for:
- Canonical document URLs (implicit, or explicit via rel="canonical")
- Corpus IDs
- Document IDs
- Document ranges for timeline browsers
- Additional presentation information, e.g. reading offset
- How to construct canonical URLs from document IDs
- Existing services that interact with this archive
WikiLeaks Iraq War Logs
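- Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ (archive.org copy: http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/)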
- Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
- Document also has a "Tracking number": 20091223210038SMB
- To construct a canonical document URL: Base URL + document ID
- Other services:
- URLs are actively being shared (and annotated) on Twitter
- This data set is present in Google Fusion Tables, though seemingly without further annotations
- Other observations:
WikiLeaks Afghan War Diary
- Document URL:
- Document ID: D92871CA-D217-4124-B8FB-89B9A2CFFCB4
- Document also has a "Tracking number": 2007-033-004042-0756
- To construct a canonical document URL:
- Other services:
- URLs are actively being shared (and annotated) on Twitter
- This data set is present in Google Fusion Tables, though seemingly without further annotations
- Other observations:
WikiLeaks Embassy Cables
- Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
- Document ID: 10COPENHAGEN69
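- Browsers: cablesearch.org (http://cablesearch.org/)
- To construct a canonical document URL: Base URL + document ID (judging from the example URL above, the path also carries a year/month prefix and an ".html" suffix)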
- Other services:
- URLs/identifiers are actively being shared (and annotated) on Twitter, e.g. http://twitter.com/bennohansen/statuses/14367684422533120
SpaceLog
- Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
- Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
- Corpus ID: apollo13
- Document ID: 01:06:43:11
- Presentation parameters:
- Reading offset ID: #log-line-110591
- To construct canonical URLs:
- Document URL: URL template, corpus ID, document ID, reading offset
- Document range URL: URL template, corpus ID, document IDs, reading offset
- No means to query corpus ID, reading offset for a document ID
- Other services:
- URLs are actively being shared (and annotated) on Twitter
- Other observations:
- Has rel="canonical"
- Document IDs are actually timestamps.
- There are collisions within a corpus, which does not greatly affect presentation, but may affect integration with other services
- There are collisions across corpuses (concurrent space missions, e.g. Gemini 6A and 7)
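The SpaceLog construction rule above can be sketched as follows. This is a minimal sketch, not SpaceLog's own code: the URL template is an assumption inferred from the two example URLs, and the function name `spacelog_url` is hypothetical.

```python
from typing import Optional

# Assumed template, inferred from the example URLs above (not an official spec).
SPACELOG_TEMPLATE = "http://{corpus}.spacelog.org/{path}/"


def spacelog_url(corpus: str, start: str, end: Optional[str] = None,
                 offset: Optional[str] = None) -> str:
    """Build a document URL (start only) or a document range URL (start and end)."""
    path = start if end is None else f"{start}/{end}"
    url = SPACELOG_TEMPLATE.format(corpus=corpus, path=path)
    if offset:
        # The reading offset is a presentation parameter carried in the fragment.
        url += f"#{offset}"
    return url


print(spacelog_url("apollo13", "01:06:43:11", offset="log-line-110591"))
print(spacelog_url("apollo13", "01:06:43:11", "01:06:43:30", "log-line-110591"))
```

Note that the corpus ID and reading offset have to be supplied by the caller, since there is no means to query them for a given document ID.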
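Twitter
- Document URL:
- http://twitter.com/#!/wikileaks/statuses/15975805188317184
- http://twitter.com/wikileaks/statuses/15975805188317184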
- Corpus ID: wikileaks
- Document ID: 15975805188317184
- To construct canonical document URL: base URL + corpus ID + document ID
- Can query corpus ID (username) via e.g. http://dev.twitter.com/doc/get/statuses/show/:id
- Other services:
- ExquisiteTweets allows tweets to be grouped
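Twitter status URLs appear both with and without the "#!" prefix, so a service would need to map both forms to one corpus ID and document ID. A minimal sketch, assuming only the URL shape documented above; the function names and the regular expression are hypothetical.

```python
import re
from typing import Optional, Tuple

# Accepts both the "#!" form and the plain form of a status URL.
_STATUS_URL = re.compile(
    r"^https?://twitter\.com/(?:#!/)?(?P<user>[^/]+)/statuses/(?P<id>\d+)/?$"
)


def parse_status_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (corpus ID, document ID), or None if the URL is not a status URL."""
    match = _STATUS_URL.match(url)
    return (match.group("user"), match.group("id")) if match else None


def canonical_status_url(user: str, status_id: str) -> str:
    """Base URL + corpus ID + document ID."""
    return f"http://twitter.com/{user}/statuses/{status_id}"


ref = parse_status_url("http://twitter.com/#!/wikileaks/statuses/15975805188317184")
print(ref, canonical_status_url(*ref))
```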
Eur-Lex
- Document URL: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:333:0001:0005:EN:PDF
- Corpus ID: OJ:L (Official Journal, legislation series)
- Document ID: 2010:333:0001:0005
- Presentation parameters:
- Language: EN
- Format: PDF
- To construct canonical document URL: URL template + corpus ID + document ID + presentation parameters
- Other observations:
- The corpus ID could be understood as part of the document ID; we may not need to treat them separately
- Other services:
- TODO. They are likely to exist.
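The Eur-Lex construction rule can be sketched in the same way. The template below is an assumption taken from the single example URL above, and `eurlex_url` is a hypothetical helper; other document types may well follow a different pattern.

```python
# Assumed template, copied from the example URL above.
EURLEX_TEMPLATE = "http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri={uri}"


def eurlex_url(corpus_id: str, document_id: str,
               language: str = "EN", fmt: str = "PDF") -> str:
    """URL template + corpus ID + document ID + presentation parameters."""
    # All parts are colon-separated, so the corpus ID simply reads as a prefix
    # of the document ID.
    uri = ":".join([corpus_id, document_id, language, fmt])
    return EURLEX_TEMPLATE.format(uri=uri)


print(eurlex_url("OJ:L", "2010:333:0001:0005"))
```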
TODO
- Patent databases
- Databases of law
- ...
Observations
Presentation
- Would like to be able to extract document content (excerpt or in full) to present it along with editorial information and other context
- Though don't make it too complex. Core use case: WikiLeaks text archives
- Would like to offer editorial tools in a manner that they can easily be integrated on other sites, e.g. as part of an editorial publication
- This infrastructure is plumbing, not a product in itself. It only becomes useful by feeding into and strengthening existing editorial processes.
Editorial Functions
- It seems useful to be able to link/group individual messages
- It seems useful to be able to annotate content (with text, links)
- It seems useful to be able to contribute anonymously
- It seems useful to be able to annotate/qualify editorial contributions by others
Interestingness, Popularity
- What constitutes an "interesting" section of a document is a matter of perspective.
- Such annotations become more useful if they're linked to a context (e.g. "this cable relates to news story X")
- Relationship between "popular" and "interesting" items:
- Much easier to establish "popularity" via simple (implicit, explicit) voting mechanisms: Q&A sites, collaborative news sites, click tracking etc.
- "Interestingness" requires more work, since it is the result of an editorial process. This makes it slower, potentially tedious to demonstrate, and error prone.
- The latter could however feed into the former: items that are widely perceived as interesting are likely to become popular
- To best accomplish this we should attempt to simplify the workflow of an editorial process.
Interoperability
- Many parties will already build browsers for document log archives, with varying ways of navigating such content.
- We don't need to duplicate those efforts, but we should integrate with them.
Addressing Schemes for Archives
Addressability is a first technical barrier:
- Being able to construct links between multiple representations of the same documents
- Across mirrors (same content, same addressing scheme, different location)
- Across types of archive browsers (some may have further annotations, all may have different addressing schemes, all will have different locations)
- Being able to identify existing references/annotations by detecting such links
- E.g. via Twitter/Google search
Need a shared addressing scheme that works across archives, archive browsers
- Based on permalinks?
- Definite goal: to place links. Ability to send reader to the source material
- Optional goal: simple API. Ability to load source material by forming an address. (This places more requirements on the nature of the archive of the respective source material.)
Alternatively: need a method of translating between different addressing schemes
- Start with a review of link structures of a wide spectrum of archives
Should publish best practices for a good addressing scheme
- Document the structure of individual addressing schemes
- Publish recommendations for addressing schemes, terminology used: common conventions
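One way to approach translation between addressing schemes is a small registry of URL patterns that maps any known URL to an (archive, corpus ID, document ID) triple. The sketch below is illustrative only: the archive names, the `DocRef` type and the regular expressions are assumptions derived from the example URLs in these notes, not a reviewed catalogue of link structures.

```python
import re
from typing import NamedTuple, Optional


class DocRef(NamedTuple):
    archive: str
    corpus: Optional[str]
    document_id: str


# Hypothetical registry; each pattern is inferred from one example URL.
PATTERNS = [
    ("iraq-war-logs",
     re.compile(r"^https?://warlogs\.wikileaks\.org/id/(?P<doc>[0-9A-F-]+)/?$")),
    # The host is left open so that mirrors resolve to the same reference.
    ("embassy-cables",
     re.compile(r"^https?://[^/]+/cable/\d{4}/\d{2}/(?P<doc>[0-9A-Z]+)\.html$")),
    ("spacelog",
     re.compile(r"^https?://(?P<corpus>[a-z0-9]+)\.spacelog\.org/"
                r"(?P<doc>\d\d:\d\d:\d\d:\d\d)/(?:#.*)?$")),
]


def detect(url: str) -> Optional[DocRef]:
    """Return the first matching reference, or None for unknown URLs."""
    for archive, pattern in PATTERNS:
        match = pattern.match(url)
        if match:
            groups = match.groupdict()
            return DocRef(archive, groups.get("corpus"), groups["doc"])
    return None


print(detect("http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html"))
```

Two URLs that yield the same triple can then be treated as links to the same document, which is also what detecting existing references via Twitter or Google search would rely on.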
Spacelog highlights:
- Addressing schemes become more complex for timeline browsers
- This document log is not just a collection of independent documents, but a sequence of directly related events
- The primary reading mode is in context: items are always presented within a timeline
- Want to show documents leading up to and succeeding the highlighted documents
- As a consequence, an address is a tuple of a) a document ID or pair of document IDs (start, end) and b) a visual offset that anchors the start of the reading position
- This is mostly an artefact of timeline presentation within a browser
- Want to anchor reading position at the first highlighted document. The reader can then scroll up to get context
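The address tuple described above could be represented as a small value type. This is a sketch under the assumption that the document IDs sit in the URL path and the visual offset in the fragment, as in the SpaceLog example URLs; `TimelineAddress` and `parse_timeline_url` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class TimelineAddress:
    start: str                    # first highlighted document
    end: Optional[str] = None     # last highlighted document, if a range
    offset: Optional[str] = None  # visual anchor for the reading position


def parse_timeline_url(url: str) -> TimelineAddress:
    """Split a timeline URL into document ID(s) and visual offset."""
    parts = urlsplit(url)
    ids = [segment for segment in parts.path.split("/") if segment]
    if not 1 <= len(ids) <= 2:
        raise ValueError(f"expected one or two document IDs in {url!r}")
    end = ids[1] if len(ids) == 2 else None
    return TimelineAddress(ids[0], end, parts.fragment or None)


print(parse_timeline_url(
    "http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591"))
```

Keeping the offset separate from the document IDs reflects that it is a presentation artefact: two addresses with the same IDs but different offsets still refer to the same documents.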
Content licenses
- A good discovery platform may want to republish content from linked archives to be able to present them in a coherent manner
- In the case of documents published by governments and NGOs the data may either be in the public domain, or will have an explicit license
- In the case of document leaks the legal status of this data may not be clear, and there may not be an explicit license
Workflows
Hunt & Gather
- Browse documents for a supported archive on any of its mirrors
- Click "annotate" button
- Service detects repository, corpus, document ID
- Service allows amending existing editorial work for this corpus or document (yours, everybody's), or starting a new context
- Service presents editorial tools (annotate, link, publish, ...)
- Repeat
Review
- Browse: service has lists of editorial work (search, recent, my, popular, ...)
- Open one editorial context (which belongs to a person or group, or represents the total/global work for this context)
- Service presents source material, all editorial information, links to additional services relating to the referenced archive(s)
- Service allows feedback (voting, amendments, comments)
- Repeat
Shaping Contexts
- Allow users to "clone" public and personal editorial contexts to a new context in their personal namespace
- This merges into an existing target context (or creates a blank one first)
- Allow users to withdraw their own contributions (should probably archive them instead of just deleting them)
- All of these merely operate on references (to archives, contributors, comments, ...)
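The operations above can be sketched over plain references. This is a hypothetical data model, not a settled design: the `Context` and `Contribution` types and their fields are assumptions made for illustration, and a contribution holds only a reference to a document, never its content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Contribution:
    contributor: str
    document_ref: str  # reference into an archive, e.g. a document ID
    note: str


@dataclass
class Context:
    owner: str
    name: str
    contributions: List[Contribution] = field(default_factory=list)
    archived: List[Contribution] = field(default_factory=list)


def clone_into(source: Context, target: Context) -> Context:
    """Merge the source's contributions into the target, skipping duplicates."""
    for contribution in source.contributions:
        if contribution not in target.contributions:
            target.contributions.append(contribution)
    return target


def withdraw(context: Context, contributor: str) -> List[Contribution]:
    """Move a contributor's items to the archive instead of deleting them."""
    removed = [c for c in context.contributions if c.contributor == contributor]
    context.contributions = [
        c for c in context.contributions if c.contributor != contributor
    ]
    context.archived.extend(removed)
    return removed


public = Context("everybody", "cables", [
    Contribution("alice", "10COPENHAGEN69", "relates to news story X"),
])
mine = clone_into(public, Context("bob", "my-notes"))  # blank target created first
print(len(mine.contributions), len(withdraw(mine, "alice")), len(mine.archived))
```

Because cloning copies references rather than documents, the source context is left unchanged and the same contribution can appear in many contexts.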
Links
- http://booktwo.org/notebook/openbookmarks/ check these sites for bookmark/annotation conventions
- http://www.openbookmarks.org/ (focused on ebooks)