Indri hints by Trevor Strohman

Contents

1 About Indri
2 Getting Help
3 About the Indri Applications
4 Using the Indri API
5 Tweaking the Indexing Process
6 Manipulating Retrieval
7 Using Indri with a Cluster of Machines
8 About Indri Repositories
9 Using Parameters
10 Using Explicitly Numbered Queries
11 More Complicated Build Parameters

About Indri

Indri is a text search engine developed at UMass. It is a part of the Lemur project.

From an academic perspective, Indri is interesting because it combines inference networks with language modeling. The query language, which is reminiscent of the Inquery query language, allows researchers to experiment with proximity, document structure, text passages, and other document features without writing code. Like other academic engines, Indri can parse TREC newswire and web collections, and it can return results in the TREC standard format.

From an industrial perspective, Indri is interesting because it is efficient, supported, and easy to integrate. Indri is freely available from UMass with a flexible BSD-inspired license, but a commercially supported version will soon be available from Lexalytics. Indri includes an API that is accessible from C++, Java, C# and PHP. Indri also can be distributed across a cluster of nodes for high speed query performance. In version 2.0, Indri adds true multithreaded operation, so documents can be added, queried and deleted concurrently. More information about these features can be found in this paper.

Getting Help

I suggest these resources for finding out more about Indri:

the website,
the tutorials,
the forum,
the README file in the distribution,
the code documentation,
and the query language documentation.

If you have questions about Indri, you're likely to get the fastest response by posting to the forum, since many people watch it regularly. You may find the answer to your question by searching through the forum archives. Many of the forum responses that I write will eventually end up on this page as well.

About the Indri Applications

The buildindex application can build Indri repositories from TREC formatted documents, HTML documents, text documents, and PDF files. Additionally, on Windows it can index Word and PowerPoint documents. buildindex understands tags in HTML/XML documents, and it can be instructed to index them as well.

The runquery application evaluates queries against one or more Indri repositories, and returns the results in a ranked list of documents. runquery can be instructed to print the document text as well, or the text of passages if the query is a passage retrieval query.

The indrid application is a repository server. It waits for connections from runquery (or from other applications using the QueryEnvironment interface) and processes queries from network requests. One copy of runquery can connect to many indrid instances at once, making retrieval using a cluster of machines possible.

Using the Indri API

Indri provides the QueryEnvironment and IndexEnvironment classes, which can be used from C++, Java, C# or PHP (although indexing is not supported from PHP). The buildindex and runquery applications use these classes exclusively. Please keep in mind that we reserve the right to change any classes within Indri that are not in the indri::api namespace. If you write your code to use only indri::api classes, we will do our best to make sure they still work in future versions of Indri.

IndexEnvironment understands many different file types. However, you can create your own file type, as long as it is XML-like, and tell IndexEnvironment how to index it. Then, using the addFile method, IndexEnvironment can index your document(s). If you want to do more complex processing on your data, or if your data is arriving in real time, you may parse your document into a ParsedDocument structure. The IndexEnvironment object can index these structures directly.

QueryEnvironment allows you to run queries and retrieve a ranked list of results. You can use runAnnotatedQuery to retrieve match information (annotations), which is useful for highlighting matched words in documents. By using the addIndex method with an instance of IndexEnvironment, you can evaluate queries on an index that is currently being built. The addServer method allows you to connect to indrid processes for distributed retrieval.

How do I use the Indri API from Java?

First, you need to build Indri including the Java wrappers. On Unix, you do this by adding the --enable-swig option when running the configure script. The script should find your Java installation automatically, but if it doesn't, you can show it where to find Java by using the --with-javahome parameter. If you are using Windows, use the swig project file from Visual Studio to build the Java API. You may need to change the include path on the project to point to your Java installation.

Once that's built, indri.jar and libindri.so should be in your indri/swig/obj/java directory. If you are using Mac OS X, libindri.so will be called libindri.jnilib. The indri.jar file contains all of the Java support files for Indri, while libindri.so contains the Indri C++ code.

If you run an application that uses the indri.jar file, it will attempt to load the libindri.so file automatically. For this to work, you need to set the java.library.path variable correctly. You can do this on the java command line:

java -cp indri.jar \
     -Djava.library.path=indri/swig/obj/java \
     MyIndriApplication

Tweaking the Indexing Process

It is our hope that for most indexing needs (especially for research purposes), buildindex should be sufficient. If you think buildindex is missing a critical feature, please let us know. If you want something easier to use than the buildindex tool, consider using the Java interface.

buildindex understands the following file types:

html	HTML formatted data, one document per file
xml	XML formatted data, one document per file (same as html, but without link processing)
trecweb	TREC web collections, such as WT10G or GOV2, with many documents per file
trectext	TREC newswire collections, such as AP89, with many documents per file
mbox	Unix mailbox files (this may not be in the distribution yet)
doc	Microsoft Word documents (Windows only, requires Microsoft Office)
ppt	Microsoft PowerPoint documents (Windows only, requires Microsoft Office)
pdf	Adobe PDF
txt	Text documents

If you don't specify a corpus type (using the corpus.class parameter), Indri will index files based on their extensions. Any file that doesn't use a known extension will be skipped. If you do include a corpus.class parameter, Indri assumes all files in the directory are of that type.

Many tasks that users want to do at index time are probably best done by using pre-processing scripts or programs to add tags to your corpus before Indri indexes it. A team at Carnegie Mellon University is currently implementing code to make Indri more flexible at handling heavily marked up content, or content that has been tagged by many different applications.

There are at least two broad categories of tasks where buildindex won't work for you:

You want to index structured documents that don't look anything like SGML/XML/HTML documents, or
you want to index documents within another application (like a desktop search tool)

In the first case, you'll need to write your own parser. Make a parser that can output a ParsedDocument structure, then call IndexEnvironment::addParsedDocument() to add your document to the index. In the second case, you can use the IndexEnvironment::addDocument() or IndexEnvironment::addString() calls, and let Indri do the parsing for you.

What does a trectext file look like?

A trectext file contains one or more documents, separated by <DOC> tags. Each document has a unique document number, specified by the <DOCNO> tag, which comes right after the opening <DOC> tag. The text of the document is contained within <TEXT> tags. Here is an example document:

<DOC>
<DOCNO> AP890101-0005 </DOCNO>
<TEXT>
   The Associated Press reported erroneously on
Dec. 29 that Sen. James Sasser, D-Tenn., wrote a letter to the
chairman of the Federal Home Loan Bank Board, M. Danny Wall, that
questioned the bailouts of insolvent savings and loan associations.
The letter was written by Sen. Timothy Wirth, D-Colo.
</TEXT>
</DOC>
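
If you are writing a pre-processing program that emits trectext, the layout above can be produced with a few string concatenations. The following is only a minimal sketch: the function name toTrecText is made up for this example, and it assumes the document text contains no stray <DOC> or </TEXT> markers of its own.

```cpp
#include <string>

// Wrap one document in the trectext layout shown above.
// toTrecText is a hypothetical helper, not part of Indri.
// docno must be unique within the collection.
std::string toTrecText(const std::string& docno, const std::string& text) {
  std::string out;
  out += "<DOC>\n";
  out += "<DOCNO> " + docno + " </DOCNO>\n";
  out += "<TEXT>\n";
  out += text + "\n";
  out += "</TEXT>\n";
  out += "</DOC>\n";
  return out;
}
```

Writing the result of several calls into one file gives a file with many documents, which is what the trectext class expects.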

What does a trecweb file look like?

A trecweb file is similar to a trectext file, except for the additional DOCHDR section, and the missing TEXT tags.

A trecweb file contains one or more documents, separated by <DOC> tags. Each document has a unique document number, specified by the <DOCNO> tag, which comes right after the opening <DOC> tag. After a few optional tags, the <DOCHDR> section contains the HTTP request information. Indri uses the <DOCHDR> section to extract the URL, and throws away the rest. Immediately following the <DOCHDR> section comes the HTML text of the document. The </DOC> tag signifies the end of the document.

<DOC>
<DOCNO>WTX001-B01-10</DOCNO>
<DOCOLDNO>IA001-000000-B008-97</DOCOLDNO>
<DOCHDR>
http://sd48.mountain-inter.net:80/hss/teachers/Prothero.html 204.244.59.33 19970101013145 text/html 440
HTTP/1.0 200 OK
Date: Wed, 01 Jan 1997 01:21:13 GMT
Server: Apache/1.0.3
Content-type: text/html
Content-length: 270
Last-modified: Mon, 25 Nov 1996 05:31:24 GMT
</DOCHDR>
<a href="teachers.html">Back to Teachers' Home Page</a>
</BODY></HTML>
</DOC>
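
To make the role of the <DOCHDR> section concrete, here is a small sketch that pulls the URL out of a trecweb document. It assumes, based on the example above, that the URL is the first whitespace-delimited token inside the section. The function name dochdrUrl is invented for this example, and this is not Indri's own parser.

```cpp
#include <string>

// Return the first whitespace-delimited token inside <DOCHDR>...</DOCHDR>,
// or an empty string if the section is missing or empty.
// Hypothetical helper for illustration; not part of Indri.
std::string dochdrUrl(const std::string& doc) {
  const std::string open = "<DOCHDR>";
  const std::string close = "</DOCHDR>";
  const std::string::size_type npos = std::string::npos;

  std::string::size_type start = doc.find(open);
  if (start == npos) return "";
  start += open.size();

  const std::string::size_type end = doc.find(close, start);
  if (end == npos) return "";

  // Skip the newline (or other whitespace) that follows the opening tag.
  const std::string::size_type urlBegin = doc.find_first_not_of(" \t\r\n", start);
  if (urlBegin == npos || urlBegin >= end) return "";

  // The URL runs up to the next whitespace character, or to </DOCHDR>.
  std::string::size_type urlEnd = doc.find_first_of(" \t\r\n", urlBegin);
  if (urlEnd == npos || urlEnd > end) urlEnd = end;

  return doc.substr(urlBegin, urlEnd - urlBegin);
}
```

Applied to the sample document above, this sketch would return http://sd48.mountain-inter.net:80/hss/teachers/Prothero.html.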

Manipulating Retrieval

How do I run a query with the Indri API?

First, create a QueryEnvironment object, and use the addIndex call to open your index. Use the runQuery call to get query results. To get the names of the documents, use the documentMetadata call. A short sample program looks like this: (Thanks to David Fisher for verifying that this program works)

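#include <iostream>
#include <string>
#include <vector>

#include "indri/QueryEnvironment.hpp"
#include "lemur/Exception.hpp"

using namespace indri::api;
using namespace lemur::api;

int main( int argc, char** argv ) {
  // expect the repository path and the query text on the command line
  if( argc < 3 ) {
    std::cerr << "usage: " << argv[0] << " <index> <query>" << std::endl;
    return 1;
  }

  try {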
    QueryEnvironment env;
    std::string myIndex = argv[1];
    std::string myQuery = argv[2];
    std::vector<ScoredExtentResult> results;
    std::vector<std::string> names;

    // open an Indri repository
    env.addIndex( myIndex );

    // run an Indri query, returning 10 results
    results = env.runQuery( myQuery, 10 );

    // fetch the names of the retrieved documents
    names = env.documentMetadata( results, "docno" );

    // print the results, including document score,
    // first and last word position, and document name
    for( size_t i=0; i<results.size(); i++ ) {
      std::cout << names[i] << " "
                << results[i].score << " "
                << results[i].begin << " "
                << results[i].end
                << std::endl;
    }

    env.close();
  } catch( Exception& e ) {
    LEMUR_ABORT(e);
  }

  return 0;
}

How about the same thing, but in Java?

import lemurproject.indri.*;

public class RunQuery {
  public static void main( String[] args ) throws Exception {
    QueryEnvironment env = new QueryEnvironment();
    // unlike C++, the Java args array does not include the program name
    String myIndex = args[0];
    String myQuery = args[1];
    ScoredExtentResult[] results;
    String[] names;

    // open an Indri repository
    env.addIndex( myIndex );

    // run an Indri query, returning 10 results
    results = env.runQuery( myQuery, 10 );

    // fetch the names of the retrieved documents
    names = env.documentMetadata( results, "docno" );

    for( int i=0; i<results.length; i++ ) {
      System.out.println( names[i] + " " +
                          results[i].score + " " +
                          results[i].begin + " " +
                          results[i].end );
    }

    env.close();
  }
}

How do I get the full text of a document back from Indri?

Use the QueryEnvironment::documents call to get an array of ParsedDocument structures back. The text string contains the full text of the document, while the positions array contains the start and end positions of every indexed term in the document. The metadata array contains all the metadata associated with this document. Note that the terms and tags entries of this structure will be empty.

How can I print just the passage returned by an Indri passage query?

This code is adapted from runquery. documents is an array of ParsedDocument structures returned from the QueryEnvironment::documents() call, while results is an array of ScoredExtentResults returned from runQuery.

 int passageBegin = results[i].begin;
 int passageEnd = results[i].end;

 int byteBegin = documents[i]->positions[ passageBegin ].begin;
 int byteEnd = documents[i]->positions[ passageEnd-1 ].end;

 const char* startText = documents[i]->text + byteBegin;
 int byteLength = byteEnd - byteBegin;

 std::cout.write( startText, byteLength );
 std::cout << std::endl;
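
The arithmetic in this snippet is easier to see without the Indri types. Below is a self-contained sketch of the same idea. TermExtent and passageText are names invented for this example. They stand in for the entries of the positions array and for the write call, and they assume that a passage covers term positions from begin up to, but not including, end.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the positions array:
// the byte range that one indexed term occupies in the document text.
struct TermExtent {
  int begin;  // byte offset of the first character of the term
  int end;    // byte offset one past the last character of the term
};

// Return the text covering term positions [passageBegin, passageEnd).
// An empty string is returned for an empty or out-of-range passage.
std::string passageText(const std::string& text,
                        const std::vector<TermExtent>& positions,
                        int passageBegin, int passageEnd) {
  const int count = static_cast<int>(positions.size());
  if (passageBegin < 0 || passageEnd > count || passageBegin >= passageEnd) {
    return "";
  }
  // First byte of the first term, and one past the last byte of the last term.
  const int byteBegin = positions[passageBegin].begin;
  const int byteEnd = positions[passageEnd - 1].end;
  return text.substr(byteBegin, byteEnd - byteBegin);
}
```

Note that the last term of the passage is at index passageEnd - 1, which is why the original snippet subtracts one before reading the end offset.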

How can I get the terms used in a document, or the tags in a document?

Use the QueryEnvironment::documentVectors call. You'll get back an array of DocumentVector structures.

Each DocumentVector contains three arrays: positions, stems and fields. stems is an array that contains every unique word in the document. positions maps word positions in the document to words in the stems array. The fields array contains the tags, each with a begin position, an end position, a tag name, and a number element (for numeric fields).

Since the stems and positions arrays are often confusing at first, here is an example. Suppose we have a document that looks like this:

The cats are in the hat.

We run this document through a stemmer during indexing, which removes the 's' from 'cats'. The DocumentVector for this document would then be:

stems	"[OOV]", "the", "cat", "are", "in", "hat"
positions	1, 2, 3, 4, 1, 5

Notice that since 'the' is used twice in the document, it occurs only once in the 'stems' array, but is referenced twice by the 'positions' array. We can print this text using the following C++ snippet:

for( int i=0; i<positions.size(); i++ ) {
  std::cout << stems[ positions[i] ] << " ";
}

This code would print:

the cat are in the hat

If we had removed stopwords at index time, the DocumentVector would be:

stems	"[OOV]", "cat", "hat"
positions	0, 1, 0, 0, 0, 2

and the code above would print:

[OOV] cat [OOV] [OOV] [OOV] hat

The special [OOV] term stands for 'out of vocabulary', and is used to indicate that a stopword is missing.
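
If it helps to see the encoding from the other direction, here is a small sketch that builds the two arrays from a list of already-stemmed words. TermVector and buildTermVector are invented names, and the real DocumentVector is filled in by Indri from the index. This only illustrates the convention that index 0 is reserved for [OOV].

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified version of the stems/positions pair.
struct TermVector {
  std::vector<std::string> stems;  // stems[0] is always "[OOV]"
  std::vector<int> positions;      // one index into stems per word position
};

// Build the vector from already-stemmed words. Words found in the
// stopword set are recorded as position 0, i.e. "[OOV]".
TermVector buildTermVector(const std::vector<std::string>& words,
                           const std::set<std::string>& stopwords) {
  TermVector v;
  v.stems.push_back("[OOV]");
  std::map<std::string, int> seen;  // word -> index in v.stems

  for (const std::string& word : words) {
    if (stopwords.count(word) > 0) {
      v.positions.push_back(0);
      continue;
    }
    std::map<std::string, int>::iterator it = seen.find(word);
    if (it == seen.end()) {
      // First occurrence: the new stem goes at the end of the stems array.
      const int index = static_cast<int>(v.stems.size());
      v.stems.push_back(word);
      it = seen.insert(std::make_pair(word, index)).first;
    }
    v.positions.push_back(it->second);
  }
  return v;
}
```

With the words the, cat, are, in, the, hat and an empty stopword set, this produces the first pair of arrays shown above. With the, are and in as stopwords, it produces the second.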

How do I put an Indri query interface on the web?

A full web interface for Indri is included with Indri 2.2. Look in the lang/php directory. You'll need to edit config.php to get it running, but hopefully the documentation is good enough for you to get started. You'll also need to build Indri with PHP support enabled (watch the configure script as you run it to make sure it finds your PHP installation).

I also have some Java/JSP code lying around, but it isn't clean enough for official distribution. E-mail me and ask me about it if you'd like it.

Do you have an Indri logo/button that can be added to Indri search result pages?

Yes, we have two.

How can I build snippets of text (like on Yahoo or Google), with query terms highlighted?

You need to use the QueryEnvironment::runAnnotatedQuery() method when running queries. This call returns a structure that will show you where all the query terms appeared in each retrieved document.

Once you have these QueryAnnotation structures, you can fetch the ParsedDocument structures to get the full text of the document. With these two pieces of information, you can build retrieval snippets.

Both the PHP and Java web front-ends to Indri include snippet generation code.

How do I use the QueryAnnotation structure?

The QueryAnnotation object returned from runAnnotatedQuery has three methods: getResults, getQueryTree, and getAnnotations.

getResults() returns the same results as you would get if you had used the runQuery method. It returns a vector of ScoredExtentResults.

getQueryTree() returns a tree of QueryAnnotationNode objects. Each node has four fields. name is an identifier that is meaningless, except to Indri; an example name might be "493ab4c". type is the node type, like "CombineNode", "ExtentRestrictionNode", "IndexTerm", "OrderedWindowNode", etc. queryText is an Indri query representation of the subtree rooted at this node. Finally, children is a list of the child nodes for this node.

As an example, for the following query:

#combine[title] (apple banana orange #4(keep away))

the query tree might have these nodes:

name=1
type=ExtentRestrictionNode
queryText=#combine[title] (apple banana orange #4(keep away)) 
children=2

name=2
type=CombineNode
queryText=#combine (apple banana orange #4(keep away)) 
children=3,4,5,6

name=3
type=IndexTerm
queryText=apple

name=4
type=IndexTerm
queryText=banana

name=5
type=IndexTerm
queryText=orange

name=6
type=OrderedWindowNode
queryText=#ow4(keep away)
children=7,8

name=7
type=IndexTerm
queryText=keep

name=8
type=IndexTerm
queryText=away
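
A tree like this is usually walked recursively. The sketch below stores the eight example nodes in a map keyed by name and collects the names of all nodes of a given type. Node, QueryTree, exampleTree and collectByType are invented for this illustration. The real QueryAnnotationNode holds its children directly rather than by name.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flat copy of the listing above: each node stores
// the names of its children instead of the child nodes themselves.
struct Node {
  std::string type;
  std::string queryText;
  std::vector<std::string> children;
};
typedef std::map<std::string, Node> QueryTree;

// Build the eight nodes from the example listing.
QueryTree exampleTree() {
  QueryTree t;
  t["1"] = Node{"ExtentRestrictionNode",
                "#combine[title] (apple banana orange #4(keep away))", {"2"}};
  t["2"] = Node{"CombineNode",
                "#combine (apple banana orange #4(keep away))",
                {"3", "4", "5", "6"}};
  t["3"] = Node{"IndexTerm", "apple", {}};
  t["4"] = Node{"IndexTerm", "banana", {}};
  t["5"] = Node{"IndexTerm", "orange", {}};
  t["6"] = Node{"OrderedWindowNode", "#ow4(keep away)", {"7", "8"}};
  t["7"] = Node{"IndexTerm", "keep", {}};
  t["8"] = Node{"IndexTerm", "away", {}};
  return t;
}

// Depth-first walk starting at `name`, appending the names of all
// nodes whose type equals `type` to `out`.
void collectByType(const QueryTree& tree, const std::string& name,
                   const std::string& type, std::vector<std::string>& out) {
  QueryTree::const_iterator it = tree.find(name);
  if (it == tree.end()) return;
  if (it->second.type == type) out.push_back(name);
  for (const std::string& child : it->second.children) {
    collectByType(tree, child, type, out);
  }
}
```

Starting at node "1" and asking for "IndexTerm" yields the names 3, 4, 5, 7 and 8. Node names like these are the ones used to look up match locations.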

Finally, the getAnnotations method gives you the locations of the query terms (actually, it gives you the locations of the query nodes). For instance, you might write code like this:

using namespace std;
using namespace indri::api;

const map<string, vector<ScoredExtentResult> >& matches =
                 annotation->getAnnotations();
// look up the matches for node "7"; this assumes that node exists in the map
const vector<ScoredExtentResult>& keepMatches = matches.find("7")->second;

Since "7" is the name of the node for 'keep', keepMatches now holds the positions of 'keep' in your results. The ScoredExtentResults contain a documentID, a begin position, and an end position. For your query, one of the matches might be:

document=159 begin=17 end=18 score=0

That means that the word 'keep' appeared in document 159, starting at position 17 and ending at position 18.

From this data, you can find matching query terms, like 'keep', and matching operators, like #4(keep away).
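
As a rough illustration of how match positions can drive highlighting, the sketch below wraps matched words in <b> tags. Extent and highlight are invented names. The sketch assumes the document has already been split into words whose indexes line up with the begin and end positions, and that end is exclusive, as in the example match above.

```cpp
#include <string>
#include <vector>

// Hypothetical match record: word positions [begin, end).
struct Extent {
  int begin;
  int end;
};

// Join the words with spaces, wrapping every word that falls inside
// at least one match in <b>...</b>.
std::string highlight(const std::vector<std::string>& words,
                      const std::vector<Extent>& matches) {
  const int count = static_cast<int>(words.size());
  std::vector<bool> hit(words.size(), false);

  // Mark each word position covered by a match, ignoring
  // positions that fall outside the document.
  for (const Extent& m : matches) {
    for (int p = m.begin; p < m.end; ++p) {
      if (p >= 0 && p < count) hit[p] = true;
    }
  }

  std::string out;
  for (int i = 0; i < count; ++i) {
    if (i > 0) out += " ";
    if (hit[i]) {
      out += "<b>" + words[i] + "</b>";
    } else {
      out += words[i];
    }
  }
  return out;
}
```

A real snippet generator would also pick a window of text around the matches instead of printing the whole document. In practice the byte offsets from the ParsedDocument positions array would be used to mark up the original text instead of a word list.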

How do I make relevance models with Indri?

I have a tool that does this. Again, this will probably make it into the distribution eventually.

How do I cluster documents with Indri?

You may want to look into doing clustering with the tools in the Lemur toolkit. You should be able to index your collection with the Indri tools, then do clustering with your index with the standard Lemur tools.

I am working on a k-nearest-neighbor application (approximately computes the k nearest neighbors of every document in the collection based on some distance metric). Once that's done, you should be able to do some interesting cluster-type things with that data. I don't know when that application will be done.

Using Indri with a Cluster of Machines

Indri can be used with many computers in order to increase retrieval speed.

Here's a general outline explaining how this is done:

Split the documents in your collection into n pieces, one piece for each of your n machines.
Run buildindex on each collection piece separately, so that you build n separate Indri repositories. You can build each index on a different machine in order to increase performance.
Start an Indri daemon (indrid) on each machine in your cluster, with each indrid acting as a server for a different index.
Use runquery to run your query set, using the server option to connect to the indrid processes.

The indrid process can be started like this:

indrid -index=index/part2

If you need to run more than one indrid on a single machine, you need to make sure that each one uses a different port:

indrid -index=index/part2 -port=5600

To use runquery to connect to your indrid processes, make a parameter file specifying which indrid machines to use. Suppose we are using a cluster of machines called nodek.cs.umass.edu, where k runs from 0 to 9. We can make a file (called serverList in this example) with the following text:

<parameters>
  <server>node0.cs.umass.edu</server>
  <server>node1.cs.umass.edu</server>
  <server>node2.cs.umass.edu</server>
  <server>node3.cs.umass.edu</server>
  <server>node4.cs.umass.edu</server>
  <server>node5.cs.umass.edu</server>
  <server>node6.cs.umass.edu</server>
  <server>node7.cs.umass.edu</server>
  <server>node8.cs.umass.edu</server>
  <server>node9.cs.umass.edu</server>
</parameters>

Now, to run runquery, you might use the following syntax:

runquery serverList queryFile
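
For a larger cluster you may prefer to generate the server list rather than type it. This sketch produces the same file contents as the example above. The helper names nodeHosts and serverParameters are made up here, and the host naming scheme is simply the one used in the example.

```cpp
#include <string>
#include <vector>

// Host names node0.cs.umass.edu ... node<count-1>.cs.umass.edu,
// following the naming scheme of the example cluster.
std::vector<std::string> nodeHosts(int count) {
  std::vector<std::string> hosts;
  for (int k = 0; k < count; ++k) {
    hosts.push_back("node" + std::to_string(k) + ".cs.umass.edu");
  }
  return hosts;
}

// Render a parameter file with one <server> entry per host.
std::string serverParameters(const std::vector<std::string>& hosts) {
  std::string out = "<parameters>\n";
  for (const std::string& host : hosts) {
    out += "  <server>" + host + "</server>\n";
  }
  out += "</parameters>\n";
  return out;
}
```

Writing serverParameters(nodeHosts(10)) to a file named serverList reproduces the file shown above.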

To learn more about parameter files in general, scroll down this page to the section entitled "Using Parameters".

About Indri Repositories

Lemur users may be confused by the layout of Indri repositories. In Lemur, an index is a collection of files, with one main table of contents file. The index is accessed by specifying the name of the table of contents file. In Indri, the repository is stored in a directory, and the repository name is the directory name.

For instance, if you make an index called '10g', Indri will make a directory called '10g' and put its files inside. If a directory called '10g' already exists, Indri will look to see if it is a valid Indri index. If it is, Indri will attempt to add documents to it. If it is not, Indri will delete it and make a new, empty directory.

Indri repositories contain one or more indexes, along with a compressed collection. The compressed collection stores the full text of each indexed document, including document metadata and byte offsets to each indexed term in the document.

To find out more about the structure of Indri repositories, read the appendix of this paper.

Using Parameters

While the documentation for Indri applications typically refers to command line options, all Indri applications can take either command line options, parameter files, or a combination of both.

For any Indri application (except 'dumpindex'), any item on the command line that starts with a hyphen ('-') is assumed to be a command line argument. All such arguments are of the form -option=value. For example, this command builds an Indri index of the WT10G collection:

buildindex -corpus.path=collections/10g \
           -corpus.class=trecweb \
           -memory=1g \
           -index=index/10g

Notice that the memory parameter is '1g', which corresponds to 1 gigabyte. Indri parameter files can use the suffixes 'k', 'm' and 'g' to refer to multipliers of a thousand, a million, and a billion respectively.
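
To spell out the suffix rule, here is a small sketch that converts such a value to a plain number. The function name parseSize is invented for this example. It accepts only lowercase suffixes and returns -1 for anything it does not recognise, which is a choice made for this sketch and not a description of how Indri's own parameter code behaves.

```cpp
#include <cstdlib>
#include <string>

// Convert values such as "512k", "1000m" or "1g" to a number, using
// multipliers of a thousand, a million and a billion.
// Returns -1 if the value is empty or is not digits plus an optional suffix.
long long parseSize(const std::string& value) {
  if (value.empty()) return -1;

  std::string digits = value;
  long long multiplier = 1;
  switch (digits[digits.size() - 1]) {
    case 'k': multiplier = 1000LL; break;
    case 'm': multiplier = 1000000LL; break;
    case 'g': multiplier = 1000000000LL; break;
    default: break;
  }
  // Drop the suffix character if one was recognised.
  if (multiplier != 1) digits.erase(digits.size() - 1);
  if (digits.empty()) return -1;

  for (char c : digits) {
    if (c < '0' || c > '9') return -1;
  }
  return std::strtoll(digits.c_str(), nullptr, 10) * multiplier;
}
```

Under this rule, 1g and 1000m both come out as 1,000,000,000.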

The buildindex command above could just as easily have been executed with an Indri parameter file. The command line would have looked like this:

buildindex parameterFile

while the contents of parameterFile would have looked like this:

<parameters>
  <corpus>
    <path>collections/10g</path>
    <class>trecweb</class>
  </corpus>
  <memory>1g</memory>
  <index>index/10g</index>
</parameters>

Notice that the tag names in the parameter file correspond to the option names on the command line, where a period in an option name, as in 'corpus.class', corresponds to nesting in the parameter file. You can mix command line parameters with parameter files, as in this example:

buildindex parameterFile -stemmer.name=krovetz

This command adds Krovetz stemming to the options in the parameterFile.
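
The correspondence between dotted option names and nested tags can be written down in a few lines. The sketch below turns a single -option=value argument into the equivalent XML fragment. optionToXml is an invented name, and this is only an illustration of the naming rule, not the code Indri uses to read its arguments.

```cpp
#include <string>

// Turn "-corpus.class=trecweb" into
// "<corpus><class>trecweb</class></corpus>".
// Returns an empty string if the argument is not of the form -option=value.
std::string optionToXml(const std::string& arg) {
  const std::string::size_type npos = std::string::npos;
  if (arg.size() < 2 || arg[0] != '-') return "";

  const std::string::size_type eq = arg.find('=');
  if (eq == npos || eq == 1) return "";

  const std::string path = arg.substr(1, eq - 1);  // e.g. "corpus.class"
  const std::string value = arg.substr(eq + 1);    // e.g. "trecweb"

  // Each dot-separated name opens one level of nesting.
  std::string open, close;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type dot = path.find('.', start);
    const std::string name =
        path.substr(start, dot == npos ? npos : dot - start);
    open += "<" + name + ">";
    close = "</" + name + ">" + close;
    if (dot == npos) break;
    start = dot + 1;
  }
  return open + value + close;
}
```

For example, -stemmer.name=krovetz maps to <stemmer><name>krovetz</name></stemmer>, which is the form that would appear inside the <parameters> element of a parameter file.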

When using runquery, you may want to use many parameter files at once. As an example, you may want to put your queries in one file, like this:

<parameters>
  <query>oil industry history</query>
  <query>lava lamps</query>
  <query>indri search engine</query>
  <query>#1(don metzler) dependence models</query>
</parameters>

and then, in another file, you might include query stopwords:

<parameters>
  <stopper>
    <word>a</word>
    <word>and</word>
    <word>the</word>
  </stopper>
</parameters>

and, in another file, your standard query options:

<parameters>
  <trecFormat>true</trecFormat>
  <index>index/10g</index>
  <count>100</count>
</parameters>

You can then use all three files together:

runquery queryFile stopwordFile queryOptions

If you would like to write an application that uses Indri command options, look at the indri::api::Parameters object documentation.

Using Explicitly Numbered Queries

In TREC-style experiments, you may wish to run queries that have particular query numbers. Indri allows you to specify those query numbers in your query file, like this:

<parameters>
  <query>
      <number>701</number>
      <text>oil industry history</text>
  </query>
</parameters>

More Complicated Build Parameters

The following example shows how to use the fields and metadata features of the indexer.

In this parameter file, the many <field> tags indicate the fields in the corpus that should be indexed. These fields can be used in query language expressions, such as "#1(george bush).person".

The <metadata> section tells the indexer how to treat the metadata fields in the document. Metadata fields can't be referenced in queries, but they can be retrieved by using the QueryEnvironment::documentMetadata and QueryEnvironment::documentsFromMetadata calls.

The <forward> tags indicate that a particular field should be stored in a B-Tree so it can be looked up quickly. The <forward> tags are strictly optimizations; if they aren't present, the whole document has to be fetched and decompressed in order to retrieve a single metadata field. This only affects the speed of the QueryEnvironment::documentMetadata call.

The <backward> tags create B-Trees that are the inverse of those created by the forward tags. These B-Trees map from a particular field value to one or more documents. For instance, these backward lookups could be used to find which document has a particular URL. Unlike the <forward> tags, these tags are not optional; if they are not in place at index build time, the QueryEnvironment::documentsFromMetadata call will not work.

<parameters>
   <memory>1000m</memory>
   <index>index/bignews</index>
   <corpus>
      <path>/usr/ind1/tmp1/indri/collections/bignews/indri/</path>
      <class>trectext</class>
   </corpus>
   <metadata>
      <forward>docno</forward>
      <backward>docno</backward>
      <forward>numsentences</forward>
      <forward>ciirdate</forward>
   </metadata>
   <stemmer>
      <name>Porter</name>
   </stemmer>
   <field>
      <name>sentence</name>
   </field>
   <field>
      <name>person</name>
   </field>
   <field>
      <name>location</name>
   </field>
   <field>
      <name>organization</name>
   </field>
   <field>
      <name>time</name>
   </field>
   <field>
      <name>date</name>
   </field>
   <field>
      <name>percent</name>
   </field>
   <field>
      <name>money</name>
   </field>
</parameters>
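
The difference between forward and backward lookups may be clearer as a toy model. In the sketch below, two std::map objects stand in for the on-disk B-Trees: one maps a document to its field value, the other maps a field value back to the documents that have it. MetadataIndex and its methods are invented for this illustration and are not part of the Indri API.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy in-memory model of one metadata field with both lookups enabled.
class MetadataIndex {
 public:
  // Record that `documentID` has `value` for this field.
  void add(int documentID, const std::string& value) {
    forward_[documentID] = value;
    backward_[value].push_back(documentID);
  }

  // Forward lookup: document -> field value (empty string if unknown).
  std::string value(int documentID) const {
    std::map<int, std::string>::const_iterator it = forward_.find(documentID);
    if (it == forward_.end()) return std::string();
    return it->second;
  }

  // Backward lookup: field value -> documents (empty vector if none).
  std::vector<int> documents(const std::string& value) const {
    std::map<std::string, std::vector<int> >::const_iterator it =
        backward_.find(value);
    if (it == backward_.end()) return std::vector<int>();
    return it->second;
  }

 private:
  std::map<int, std::string> forward_;
  std::map<std::string, std::vector<int> > backward_;
};
```

In this model, value() plays the role of documentMetadata and documents() plays the role of documentsFromMetadata. In the parameter file above, only docno has a <backward> entry, so only docno values could be used to find documents.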
