User:Martind/Document Log Discovery Platform: Difference between revisions
From London Hackspace Wiki
Revision as of 18:16, 18 December 2010
Problem Statement
We're seeing an increase in the publication of vast corpuses of document logs, often in the form of message archives, usually in a structured message format. They're all quite overwhelming: how to make sense of such a vast amount of text? How to identify sections that are relevant?
- Can we allow a large number of interested parties (anyone, really) to annotate these documents?
- What kinds of annotations do we want to make? (Information structure)
- How can we make that easy? (Tools)
- Can we identify good conventions and techniques for the above that are more generally applicable? (Patterns of use)
- Finally, can we think of these functions as a layer on top of mere archives, and construct them as a physically separate service?
Note:
- These notes are limited to text document corpuses, and won't attempt to incorporate numerical/statistical/other data repositories.
Exemplary Publications
Look out for:
- Canonical document URLs (implicit, or explicit via rel="canonical")
- Corpus IDs
- Document IDs
- Document ranges for timeline browsers
- Additional presentation information, e.g. reading offset
- How to construct canonical URLs from document IDs
WikiLeaks Iraq War Logs
- Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ (archive.org)
- Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
- To construct a canonical document URL: Base URL + document ID
WikiLeaks Embassy Cables
- Document URL: http://213.251.145.96/cable/2010/02/10COPENHAGEN69.html
- Document ID: 10COPENHAGEN69
- Browsers: cablesearch.org
- To construct a canonical document URL: Base URL + document ID (the example URL above also includes a year/month path and a .html suffix)
SpaceLog
- Document URL: http://apollo13.spacelog.org/01:06:43:11/#log-line-110591
- Document range URL: http://apollo13.spacelog.org/01:06:43:11/01:06:43:30/#log-line-110591
- Corpus ID: apollo13
- Document ID: 01:06:43:11
- Reading offset ID: #log-line-110591
- has rel="canonical"
- To construct canonical URLs:
- Document URL: URL template, corpus ID, document ID, reading offset
- Document range URL: URL template, corpus ID, document IDs, reading offset
- No means to query corpus ID, reading offset for a document ID
- Other observations:
- Document IDs are actually timestamps. There may be collisions, which does not greatly affect presentation, but may affect integration with other services
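The SpaceLog scheme above can be sketched in Python as follows. This is a minimal sketch: the URL layout is inferred only from the two example URLs listed above, and the names `SpacelogAddress`, `spacelog_url` and `parse_spacelog_url` are made up for illustration and are not part of SpaceLog.

```python
import re
from typing import NamedTuple, Optional


class SpacelogAddress(NamedTuple):
    """Hypothetical container for the parts of a SpaceLog URL."""
    corpus_id: str                 # e.g. "apollo13"
    start_id: str                  # document ID (a timestamp)
    end_id: Optional[str] = None   # set only for document range URLs
    offset: Optional[str] = None   # reading offset, e.g. "log-line-110591"


def spacelog_url(addr: SpacelogAddress) -> str:
    """Build a document URL or a document range URL from its parts."""
    url = f"http://{addr.corpus_id}.spacelog.org/{addr.start_id}/"
    if addr.end_id:
        url += f"{addr.end_id}/"
    if addr.offset:
        url += f"#{addr.offset}"
    return url


# Assumed layout: http://<corpus>.spacelog.org/<start>/[<end>/][#<offset>]
_SPACELOG_URL = re.compile(
    r"^http://(?P<corpus>[^./]+)\.spacelog\.org/"
    r"(?P<start>[^/#]+)/(?:(?P<end>[^/#]+)/)?(?:#(?P<offset>.+))?$"
)


def parse_spacelog_url(url: str) -> SpacelogAddress:
    """Split a SpaceLog URL back into corpus ID, document IDs and offset."""
    match = _SPACELOG_URL.match(url)
    if match is None:
        raise ValueError(f"not a recognised SpaceLog URL: {url}")
    return SpacelogAddress(
        match["corpus"], match["start"], match["end"], match["offset"]
    )
```

Parsing works because the corpus ID and reading offset are carried in the URL itself. It does not remove the limitation noted above: given only a document ID, there is still nothing to look them up with.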
Twitter
- Document URL:
- Corpus ID: wikileaks
- Document ID: 15975805188317184
- To construct a canonical document URL: Base URL + corpus ID + document ID
- Can query corpus ID (username) via e.g. http://dev.twitter.com/doc/get/statuses/show/:id
Observations
Editorial Functions
- It seems useful to be able to link/group individual messages
- It seems useful to be able to annotate content (with text, links)
- It seems useful to be able to contribute anonymously
- It seems useful to be able to annotate/qualify editorial contributions by others
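One possible information structure for these editorial functions, as a hedged Python sketch. All names (`Annotation`, `AnnotationStore`, `DocumentRef`) are hypothetical. The sketch assumes that a document can be referenced by a (corpus ID, document ID) pair, and that an anonymous contribution is simply one without an author.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

DocumentRef = Tuple[str, str]  # (corpus ID, document ID); an assumed convention


@dataclass
class Annotation:
    annotation_id: int
    text: str
    documents: List[DocumentRef] = field(default_factory=list)  # linked/grouped messages
    links: List[str] = field(default_factory=list)              # external context
    author: Optional[str] = None   # None means an anonymous contribution
    about: Optional[int] = None    # ID of another annotation being qualified


class AnnotationStore:
    """In-memory store; a real service would persist this somewhere."""

    def __init__(self) -> None:
        self._items: Dict[int, Annotation] = {}

    def add(self, text: str, documents: Iterable[DocumentRef] = (),
            links: Iterable[str] = (), author: Optional[str] = None,
            about: Optional[int] = None) -> Annotation:
        if about is not None and about not in self._items:
            raise KeyError(f"unknown annotation: {about}")
        annotation = Annotation(len(self._items) + 1, text,
                                list(documents), list(links), author, about)
        self._items[annotation.annotation_id] = annotation
        return annotation

    def for_document(self, ref: DocumentRef) -> List[Annotation]:
        """All annotations that link to the given document."""
        return [a for a in self._items.values() if ref in a.documents]

    def replies_to(self, annotation_id: int) -> List[Annotation]:
        """All annotations that qualify another contribution."""
        return [a for a in self._items.values() if a.about == annotation_id]
```

Grouping messages and annotating a single message use the same record here: a group is just an annotation that references more than one document.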
Interestingness, Popularity
- What constitutes an "interesting" section of a document is a matter of perspective.
- Such annotations become more useful if they're linked to a context (e.g. "this cable relates to news story X")
- Relationship between "popular" and "interesting" items:
- Much easier to establish "popularity" via simple (implicit, explicit) voting mechanisms: Q&A sites, collaborative news sites, click tracking etc.
- "Interestingness" requires more work, since it is the result of an editorial process. This makes it slower, potentially tedious to demonstrate, and error-prone.
- The latter could, however, feed into the former: items that are widely perceived as interesting may in turn become popular
- To best accomplish this we should attempt to simplify the workflow of an editorial process.
Interoperability
- Many parties will already build browsers for data log archives, with varying ways of navigating such content.
- We don't need to duplicate those efforts, but we should integrate with them.
Addressing Schemes for Archives
- Need a shared addressing scheme that works across archives and archive browsers
- Based on permalinks?
- Definite goal: to place links. Ability to send reader to the source material
- Optional goal: simple API. Ability to load source material by forming an address. (This places more requirements on the nature of the archive of the respective source material.)
- Alternatively: need a method of translating between different addressing schemes
- Start with a review of link structures of a wide spectrum of archives
- Should publish best practices for a good addressing scheme
- Document the structure of individual addressing schemes
- Publish recommendations for addressing schemes, terminology used: common conventions
- ...
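As an illustration of the "definite goal" above (placing links), a shared address could be translated into archive-specific URLs through a table of documented URL templates. This is a sketch under assumptions, not a proposed standard: the `Address` tuple, the archive keys and the `to_url` function are invented here, and the two templates are inferred from the example URLs under "Exemplary Publications".

```python
from typing import Dict, NamedTuple


class Address(NamedTuple):
    """Hypothetical shared address: which archive, which corpus, which document."""
    archive: str
    corpus_id: str    # may be empty for archives that hold a single corpus
    document_id: str


# One documented URL template per archive (templates inferred from examples).
URL_TEMPLATES: Dict[str, str] = {
    "warlogs": "http://warlogs.wikileaks.org/id/{document_id}/",
    "spacelog": "http://{corpus_id}.spacelog.org/{document_id}/",
}


def to_url(address: Address) -> str:
    """Translate a shared address into a link to the source material."""
    try:
        template = URL_TEMPLATES[address.archive]
    except KeyError:
        raise ValueError(
            f"no addressing scheme documented for {address.archive!r}"
        ) from None
    return template.format(corpus_id=address.corpus_id,
                           document_id=address.document_id)
```

Archives whose URLs need more than a corpus ID and a document ID would require extra template fields, which is one reason to document each scheme individually.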
Spacelog highlights:
- Addressing schemes become more complex for timeline browsers
- This data log is not just a collection of independent documents, but a sequence of directly related events
- The primary reading mode is in context: items are always presented within a timeline
- Want to show documents leading up to and succeeding the highlighted documents
- As a consequence, an address is a tuple of a) a document ID or pair of document IDs (start, end) and b) a visual offset that anchors the start of the reading position
- This is mostly an artefact of timeline presentation within a browser
- Want to anchor reading position at the first highlighted document. The reader can then scroll up to get context
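The timeline behaviour described above could be sketched like this in Python. The function name `timeline_window` and the `context` parameter are assumptions for illustration. The sketch also assumes document IDs are sorted and compare in timeline order, which holds for fixed-width, zero-padded timestamps such as `01:06:43:11`.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Sequence, Tuple


def timeline_window(doc_ids: Sequence[str], start: str,
                    end: Optional[str] = None,
                    context: int = 2) -> Tuple[List[str], int]:
    """Return (visible_ids, anchor_index) for a highlighted document or range.

    visible_ids holds the highlighted documents plus up to `context`
    documents before and after them. anchor_index is the position of the
    first highlighted document within visible_ids, i.e. where the reading
    position should be anchored.
    """
    end = start if end is None else end
    first = bisect_left(doc_ids, start)   # first highlighted document
    last = bisect_right(doc_ids, end)     # one past the last highlighted one
    lo = max(0, first - context)
    hi = min(len(doc_ids), last + context)
    return list(doc_ids[lo:hi]), first - lo
```

The reader lands on the anchored document and can scroll up into the preceding context. Apart from the two IDs taken from the example URLs, any timestamps used to try this out would be made-up sample data.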
Content licenses
- A good discovery platform may want to republish content from linked archives to be able to present them in a coherent manner
- In the case of data published by governments and NGOs the data may either be in the public domain, or will have an explicit license
- In the case of document leaks, the legal status of this data may not be clear, and there may not be an explicit license
Links
- http://booktwo.org/notebook/openbookmarks/ check these sites for bookmark/annotation conventions
- http://www.openbookmarks.org/ (focused on ebooks)