User:Martind/Document Log Discovery Platform: Difference between revisions

From London Hackspace Wiki

Line 11: Line 11:


Note:
* These notes are limited to text document corpora and won't attempt to incorporate numerical, statistical, or other data repositories.
* Specifically, no attempt is made to address information within a document, or information aggregated across documents, unless such derivative forms already exist.


== Exemplary Publications ==


The ultimate goal:
* Being able to construct links between multiple representations of the same documents
** Across mirrors (same content, same addressing scheme, different location)
** Across types of archive browsers (some may have further annotations, all may have different addressing schemes, all will have different locations)
* Being able to identify existing references by detecting such links
** E.g. via Twitter/Google search
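Detecting such links could be sketched as below. This is only an illustration, not an existing tool: the registry <code>ARCHIVE_PATTERNS</code>, the corpus label <code>iraq-war-logs</code> and both function names are hypothetical. The single pattern is derived from the WikiLeaks Iraq War Logs document URL listed further down, and assumes all document IDs in that archive share the shape of that one example.

```python
import re

# Hypothetical registry: one URL pattern per known archive browser.
# The pattern is inferred from a single example URL in these notes and
# assumes every ID has the same 8-4-4-16 hexadecimal shape.
ARCHIVE_PATTERNS = {
    "iraq-war-logs": re.compile(
        r"warlogs\.wikileaks\.org/id/"
        r"([0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{16})",
        re.IGNORECASE,
    ),
}


def identify_document(url):
    """Return (corpus, document_id) for a recognised URL, else None."""
    for corpus, pattern in ARCHIVE_PATTERNS.items():
        match = pattern.search(url)
        if match:
            # Normalise case so differently cased URLs compare equal.
            return corpus, match.group(1).upper()
    return None


def same_document(url_a, url_b):
    """True if both URLs resolve to the same (corpus, document_id) pair."""
    a = identify_document(url_a)
    b = identify_document(url_b)
    return a is not None and a == b
```

Because the pattern is searched for anywhere in the URL, an archive.org snapshot that embeds the original address resolves to the same pair as the original. URLs found via a Twitter or Google search could be passed through <code>identify_document</code> to collect existing references.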


Look out for:
Line 27: Line 30:
* Additional presentation information, e.g. reading offset
* How to construct canonical URLs from document IDs
* Existing services that interact with this archive


'''WikiLeaks Iraq War Logs'''
* Document URL: http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ ([http://web.archive.org/web/20101030053024/http://warlogs.wikileaks.org/id/BCD499A0-F0A3-2B1D-B27A2F1D750FE720/ archive.org])
* Document ID: BCD499A0-F0A3-2B1D-B27A2F1D750FE720
* Document also has a "Tracking number": 20091223210038SMB
* To construct a canonical document URL: Base URL + document ID
* Other services:
** URLs are actively being shared (and annotated) on Twitter
* Other observations:
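The "Base URL + document ID" rule for this archive can be sketched as follows. The names <code>WARLOGS_BASE_URL</code> and <code>canonical_url</code> are made up for illustration, and the base URL is simply the example document URL above with the ID removed, which is an assumption about how the site is laid out.

```python
# Assumed base URL: the example document URL above minus the document ID.
WARLOGS_BASE_URL = "http://warlogs.wikileaks.org/id/"


def canonical_url(base_url, document_id):
    """Join base URL and document ID, keeping the trailing slash used above."""
    return base_url.rstrip("/") + "/" + document_id.strip("/") + "/"


print(canonical_url(WARLOGS_BASE_URL, "BCD499A0-F0A3-2B1D-B27A2F1D750FE720"))
```

Stripping and re-adding the slashes keeps the result stable whether or not the caller includes them, so the same ID always yields one URL string.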


'''WikiLeaks Embassy Cables'''
Line 38: Line 46:
* Browsers: [http://cablesearch.org/ cablesearch.org],
* To construct a canonical document URL: Base URL + document ID
* Other services:
** URLs are actively being shared (and annotated) on Twitter


'''SpaceLog'''
Line 50: Line 60:
** Document range URL: URL template, corpus ID, document IDs, reading offset
** No means to query corpus ID, reading offset for a document ID
* Other services:
** URLs are actively being shared (and annotated) on Twitter
* Other observations:
** Has rel="canonical"
Line 64: Line 76:
* To construct canonical document URL: base URL + corpus ID + document ID
** Can query corpus ID (username) via e.g. http://dev.twitter.com/doc/get/statuses/show/:id
* Other services:
** ExquisiteTweets allows tweets to be grouped
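A three-part URL of this kind (base URL + corpus ID + document ID) might be built as sketched below. The <code>/status/</code> path layout is an assumption about Twitter's URL scheme and is not stated in these notes. <code>lookup_username</code> is a hypothetical callable standing in for a query against the statuses/show endpoint mentioned above; no real API call is made here.

```python
# Assumed base URL and path layout; neither is given in these notes.
TWITTER_BASE_URL = "http://twitter.com/"


def tweet_url(username, status_id, base_url=TWITTER_BASE_URL):
    """Build a URL from base URL + corpus ID (username) + document ID."""
    return "{}/{}/status/{}".format(base_url.rstrip("/"), username, status_id)


def tweet_url_from_id(status_id, lookup_username):
    """Resolve the corpus ID first when only the document ID is known.

    lookup_username is a caller-supplied function (hypothetical) that maps
    a status ID to the author's username, e.g. by querying the API.
    """
    return tweet_url(lookup_username(status_id), status_id)
```

Passing the lookup in as a function keeps the URL logic testable without network access; a dictionary's <code>get</code> method or a cached API client would both fit.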


'''Eur-Lex'''
Line 75: Line 89:
* Other observations:
** Corpus ID could be understood as part of the document ID, we may not need to treat them separately
* Other services:
** TODO. They are likely to exist.


'''TODO'''