NTCIR Temporal Information Access (Temporalia) Task

Temporalia-2 at NTCIR-12 offers two subtasks, each run in English and Chinese, to address temporal information access technologies. Interested researchers and research groups can participate in either or both of the subtasks, in any combination of languages.



Temporal Intent Disambiguation (TID) Subtask

The TID subtask asks participants to estimate a distribution over four temporal intent classes (Atemporal, Past, Recency, or Future) for a given query. It is an upgraded version of the Temporal Query Intent Classification (TQIC) subtask at NTCIR-11, in which participants were asked to estimate only the single best temporal intent class for a given query. As in TQIC, participants will receive a set of query strings together with a submission date, and must develop a system that estimates a distribution over the four temporal intent classes. TID will employ test queries that are more likely to be temporally ambiguous than those used in TQIC. The gold-standard distribution is estimated from the votes of crowd workers.

Participants are allowed to use any external resources to complete the TID subtask as long as the details of external resource usage are reported. This subtask does not necessarily require indexing our document collection.

Sample query
<query>
  <id>033</id>
  <query_string>weather in London</query_string>
  <query_issue_time>May 1, 2013 GMT+0</query_issue_time>
  <probabilities>
    <Past>0.0</Past>
    <Recency>0.9</Recency>
    <Future>0.1</Future>
    <Atemporal>0.0</Atemporal>
  </probabilities>
</query>
<query>
  <id>035</id>
  <query_string>value of silver dollars 1976</query_string>
  <query_issue_time>May 1, 2013 GMT+0</query_issue_time>
  <probabilities>
    <Past>0.727</Past>
    <Recency>0.273</Recency>
    <Future>0.0</Future>
    <Atemporal>0.0</Atemporal>
  </probabilities>
</query>
		...
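
For illustration, a minimal Python sketch for reading such a query file is given below. It assumes the <query> elements are wrapped in a single root element (hypothetically named <queries>) so that the file parses as well-formed XML; the actual file layout may differ.

import xml.etree.ElementTree as ET

CLASSES = ["Past", "Recency", "Future", "Atemporal"]

def read_queries(path):
    # Assumes the <query> elements sit under a hypothetical <queries>
    # root element so that the file is well-formed XML.
    root = ET.parse(path).getroot()
    for q in root.iter("query"):
        probs = q.find("probabilities")
        dist = {c: float(probs.find(c).text) for c in CLASSES}
        yield (q.findtext("id"), q.findtext("query_string"),
               q.findtext("query_issue_time"), dist)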

Query size for Dry Run (Training/Testing)
  • English: 93 (73/20)
  • Chinese: 52 (34/18)

TID Evaluation
For a specific query \(q\), let \(P = \{p_1, p_2, p_3, p_4\}\) denote its gold-standard temporal class distribution, and \(W = \{w_1, w_2, w_3, w_4\}\) denote the temporal class distribution submitted by a participant. The classification loss for a single query will be measured in the following two ways.

Metric-1: Averaged per-class absolute loss, i.e.,

\(\frac{1}{4}\sum_{i=1}^4|w_i-p_i|\)

Metric-2: Cosine similarity between the two probability vectors \(P\) and \(W\), i.e.,

\(\cos\theta = \frac{P \cdot W}{\|P\|\,\|W\|} = \frac{\sum_{i=1}^4 p_i w_i}{\sqrt{\sum_{i=1}^4 p_i^2}\,\sqrt{\sum_{i=1}^4 w_i^2}}\)

For example, suppose \(P=\{0.50,0.50,0.00,0.00\}\) and \(W=\{0.00,0.00,0.50,0.50\}\). Then Metric-1 gives \((0.5+0.5+0.5+0.5)/4 = 0.5\), and Metric-2 gives 0, since the dot product \(P \cdot W\) is 0.

The final performance of a submitted run is the average metric value across all test queries.
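
For reference, both metrics can be computed with a few lines of plain Python; the sketch below reproduces the worked example above.

import math

def absolute_loss(p, w):
    # Metric-1: averaged per-class absolute loss.
    return sum(abs(wi - pi) for pi, wi in zip(p, w)) / len(p)

def cosine_similarity(p, w):
    # Metric-2: cosine similarity between the two probability vectors.
    dot = sum(pi * wi for pi, wi in zip(p, w))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_p * norm_w)

P = [0.50, 0.50, 0.00, 0.00]
W = [0.00, 0.00, 0.50, 0.50]
print(absolute_loss(P, W))      # 0.5
print(cosine_similarity(P, W))  # 0.0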




Temporally Diversified Retrieval (TDR) Subtask

The TDR subtask requires participants to retrieve a set of documents relevant to each of the four temporal intent classes for a given topic description. Participants are also asked to return a set of documents that is temporally diversified for the same topic. Participants will receive a set of topic descriptions, a query issuing time, and an indicative search question for each of the temporal classes (Past, Recency, Future, and Atemporal). Note that an indicative search question shows only one possible subtopic under a particular temporal class, as a reference; participants are therefore expected to mine other potential subtopics relevant to each of the temporal classes.

In summary, participants are asked to develop a system that can produce a total of five search results per topic (Past, Recency, Future, Atemporal, and Diversified). The first four results are similar to Temporalia-1, where ranking should be optimised for a particular temporal class, while the fifth is the new element in Temporalia-2, where ranking should be temporally diversified by considering all four temporal classes.
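
The task does not prescribe how the diversified result should be produced. As one illustrative baseline (not the official method), the four per-class rankings could simply be interleaved round-robin, skipping duplicates:

def round_robin_diversify(past, recency, future, atemporal, k=100):
    # A simple illustrative baseline: take one document from each
    # per-class ranking in turn, skipping documents already emitted.
    rankings = [past, recency, future, atemporal]
    merged, seen = [], set()
    pos = 0
    while len(merged) < k and any(pos < len(r) for r in rankings):
        for r in rankings:
            if pos < len(r) and r[pos] not in seen:
                seen.add(r[pos])
                merged.append(r[pos])
                if len(merged) == k:
                    return merged
        pos += 1
    return merged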

Participants are allowed to use any external resources to complete the TDR subtask as long as the details of external resource usage are reported. This subtask requires indexing our document collections.

Sample Topic
<topic>
  <id>002</id>
  <title>Junk food health effect</title>
  <description>I am concerned about the health effects of junk food in general. I need to know more about their ingredients, impact on health, history, current scientific discoveries and any prognoses.</description>
  <query_issue_time>Mar 29, 2013 GMT+0:00</query_issue_time>
  <subtopics>
    <subtopic id="002a" type="atemporal">How junk foods are defined?</subtopic>
    <subtopic id="002p" type="past">When did junk foods become popular?</subtopic>
    <subtopic id="002r" type="recency">What are the latest studies on the effect of junk foods on our health?</subtopic>
    <subtopic id="002f" type="future">Will junk food continue to be popular in the future?</subtopic>
  </subtopics>
</topic>

Topic Size for Dry Run
  • English: 10
  • Chinese: 10
Topic Size for Formal Run
  • English: 50
  • Chinese: 50

Topic fields to use for your ranking

It is up to you which fields of the topic descriptions are used as system inputs. However, please make sure to report the input fields when you submit a run and when you write your participant report. Typical combinations might be:

  • Title and Subtopic
  • Title, Description, and Subtopic

Please do not use the subtopic class information. For convenience of data management, TDR subtopic IDs contain a temporal class token, such as 001p for a past subtopic and 001f for a future subtopic. However, please do not use this token as an input to your system.
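
As an illustration, the sketch below builds a system input from a topic using only the title, description, and subtopic question texts, deliberately ignoring the type attribute and the class token in the subtopic ID (topic_to_input is a hypothetical helper, not part of the released data or tools):

import xml.etree.ElementTree as ET

def topic_to_input(topic_xml):
    # Use only title, description, and subtopic question texts; the
    # 'type' attribute and the class token in the subtopic ID (e.g.
    # the trailing 'p' in '002p') are deliberately ignored, as
    # required by the task.
    topic = ET.fromstring(topic_xml)
    parts = [topic.findtext("title"), topic.findtext("description")]
    parts += [s.text for s in topic.find("subtopics")]
    return " ".join(parts)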


TDR Evaluation

For the evaluation, we will use the standard Cranfield methodology. In particular, a pool of possibly relevant documents is created from the top-ranked documents of participants' submitted runs. Each document in the pool will then be assessed (e.g., through online crowdsourcing) and assigned a relevance grade.

For a ranked list generated for a specific temporal subtopic, performance will be evaluated using nDCG (cf. [1]).

For a diversified ranked list generated to satisfy all four temporal classes, performance will be evaluated using α-nDCG (cf. [2]) and D#-nDCG (cf. [3]).
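
For reference, a minimal sketch of nDCG is given below. It assumes one common formulation (linear gains with log2 discounting); the exact gain and discount settings, as well as the cutoff, will be determined by the organisers.

import math

def dcg(gains):
    # Discounted cumulative gain with log2 discounting; rank 1 gets
    # discount log2(2) = 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(run_gains, pool_gains, k=20):
    # nDCG@k: DCG of the run divided by DCG of the ideal ranking
    # built from all judged documents in the pool.
    ideal = dcg(sorted(pool_gains, reverse=True)[:k])
    return dcg(run_gains[:k]) / ideal if ideal > 0 else 0.0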

[1] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.

[2] Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st SIGIR, pages 659–666, 2008.

[3] Tetsuya Sakai and Ruihua Song. Evaluating diversified search results using per-intent graded relevance. In Proceedings of the 34th SIGIR, pages 1043–1052, 2011.


Feedback or suggestions?

NTCIR-12 Temporalia welcomes any feedback or suggestions on our task design. Please feel free to contact us via tc4fia at googlegroups dot com.