NTCIR-11 Temporalia uses a web corpus called the
"LivingKnowledge news and blogs annotated subcollection", constructed by the
LivingKnowledge project and distributed by
The collection is about 20 GB uncompressed and over 5 GB zipped.
It spans from May 2011 to March 2013 and contains around 3.8M documents collected
from about 1500 different blogs and news sources.
The data is split into 970 files, each named after the date it covers plus some information
about its sources (there may be more than one file per day).
Each file contains a number of text documents.
For each document, the following information is available:
<tag name="title">EU Widens LIBOR Scandal Investigation and Threatens Heavy Fines</tag>
The "doc id" is a unique identifier for the document within the collection.
The "host" is the hostname the text was pulled from,
the "date" is the publication date of the document, the "url" is the URL the text was pulled from,
the "sourcerss" is the RSS feed that was accessed to retrieve the page, and
the "title" is the title of the page.
The content of the page appears between the <text> tags.
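As a rough illustration of how the metadata fields described above could be read, here is a minimal Python sketch. It assumes every field is stored on a line shaped like the one sample shown (<tag name="FIELD">VALUE</tag>); the function name parse_meta is hypothetical and not part of the collection's tooling.

```python
import re

# Assumed format, inferred from the single sample line in this description:
#   <tag name="FIELD">VALUE</tag>
TAG_RE = re.compile(r'<tag name="([^"]+)">(.*?)</tag>', re.DOTALL)


def parse_meta(header):
    """Return a {field name: value} dict for every <tag> element in `header`."""
    return {name: value.strip() for name, value in TAG_RE.findall(header)}


sample = ('<tag name="title">EU Widens LIBOR Scandal Investigation '
          'and Threatens Heavy Fines</tag>')
meta = parse_meta(sample)
print(meta["title"])
```

If the real files differ (for example, extra attributes on the tag), the pattern would need adjusting.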
This collection also provides three kinds of annotations: Sentence splitting,
Named Entities, and Time annotations.
Each sentence in the content of the page is surrounded by <SE> tags.
Each identified named entity is surrounded by <E> tags.
The type of the entity is included inside the tag, for instance
Each time reference identified in the text is surrounded by <T> tags, for instance
<T val="2012">the end of 2012</T>. The tag carries a "val" attribute
giving the estimated point in time that the annotated expression refers to.
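For participants who want to use the time annotations directly, the following Python sketch collects each <T> annotation together with its "val" value. It assumes <T> tags are not nested and always look like the example above; the name time_refs is hypothetical.

```python
import re

# Assumes non-nested annotations of the form <T val="...">surface text</T>.
T_RE = re.compile(r'<T val="([^"]*)">(.*?)</T>', re.DOTALL)


def time_refs(text):
    """Return a list of (val, surface text) pairs, in document order."""
    return T_RE.findall(text)


refs = time_refs('<T val="2012">the end of 2012</T>')
print(refs)
```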
An example document can be found below.
We will provide a script to remove these annotations and leave only the textual
part of each page, if the participants wish to only use that.
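Until that script is at hand, the idea can be sketched in a few lines of Python. This is not the organizers' script, only an illustration under the assumption that the annotation tags are exactly <SE>, <E>, and <T>, possibly carrying extra information inside the opening tag. The function name strip_annotations and the sample sentence are made up for the example.

```python
import re

# Matches opening and closing <SE>, <E>, and <T> tags, including anything
# that follows the tag name inside the angle brackets (e.g. val="2012").
# The lowercase <text> and <tag> elements are left untouched.
ANNOTATION_RE = re.compile(r'</?(?:SE|E|T)\b[^>]*>')


def strip_annotations(text):
    """Remove the annotation tags but keep the text they enclose."""
    return ANNOTATION_RE.sub("", text)


# Hypothetical sentence built around the <T> example from this page.
annotated = '<SE>Fines are expected by <T val="2012">the end of 2012</T>.</SE>'
print(strip_annotations(annotated))
```

The pattern is case-sensitive on purpose, so that <text> is not mistaken for a <T> annotation.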
How to get a copy of the document collection
Please contact us via
or email (tc4fia at googlegroups dot com)
first with your name, affiliation, and registered group id.
We will then let you know the contact address of Leïla Medjkoune, Head of Web Archiving,
who manages the distribution of the document collection.
How to remove tags and correct markup inconsistency
Please read the HOWTO and FAQ.