Adding Namazu search engine to Mailman archive with user authentication

A case study by Gábor Kiss, NIIF Institute

Introduction

Mailman is one of the most popular mailing list managers. Unfortunately, its archiver has no built-in search capability.

Namazu is a full text search engine by Satoru Takabayashi et al.

Tom Morrison has successfully added Namazu to Mailman making archives searchable:
http://mail.python.org/pipermail/mailman-users/2004-June/037584.html .

Lindsay Haisley improved on Tom's work and created Nmzproc.

Nmzproc works well, but it lets unauthorized persons look into non-public lists. I installed Nmzproc on our list server and added basic authorization as well as some I18N code, so that archives of private lists can be searched by members only. This case study describes that work step by step.

Preparation

Our list server runs Debian Sarge with the mailman package already installed.

I installed the namazu2, namazu2-common and namazu2-index-tools packages from http://www.namazu.org/#download.

I created a new user called namazu that also belongs to the list group, so the Mailman archives are readable by namazu. The indexing processes and the resulting index files are owned by namazu.

Layout

The figure below shows the location of existing and new files and directories. (It is far from complete; most components of Mailman are omitted for clarity.) The sample lists are called foo and bar. Each item is explained in detail in the next section.
/--+--home/namazu/--+--bin/--+--mailman_index
   |                |        |
   |                |        +--nmzproc
   |                |        |
   |                |        +--search.py
   |                |
   |                +--etc/templates/
   |
   +--usr/--+--lib/--+--cgi-bin/--+--namazu.cgi
   |        |        |            |
   |        |        |            +--mailman/search
   |        |        |
   |        |        +--mailman/Mailman/Cgi/search.py
   |        |
   |        +--share/namazu/template/
   |
   +--etc/mailman/*/--+--archtoc.html
   |                  |
   |                  +--archtocnombox.html
   |
   +--var/lib/--+--namazu/mailman/--+--foo/--+--namazurc
                |                   |        |
                |                   |        +--mknmzrc
                |                   |        |
                |                   |        +--NMZ.*
                |                   |
                |                   +--bar/--+--namazurc
                |                   |        |
                |                   |        +--mknmzrc
                |                   |        |
                |                   |        +--NMZ.*
                |                  ...
                |
                +--mailman/archives/private/--+--foo/
                                              |
                                              +--bar/
                                              |
                                             ...

Items explained

/home/namazu/bin/mailman_index [Download]
A new script that refreshes the Namazu index files. Put it in namazu's crontab, e.g.:
44 23 * * *	ls /var/lib/namazu/mailman | xargs $HOME/bin/mailman_index
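The crontab entry above lists the per-list directories under /var/lib/namazu/mailman and hands their names to mailman_index as arguments. A minimal sketch of that step in Python follows; the helper name build_index_command is hypothetical and is not part of the downloadable scripts.

```python
import os


def build_index_command(index_root, script):
    """Return the argv that `ls index_root | xargs script` would run.

    Every subdirectory of index_root is taken to be one indexed list.
    Unlike plain `ls`, stray regular files are skipped (an assumption
    made here for safety, not something the crontab line does).
    """
    lists = sorted(
        name for name in os.listdir(index_root)
        if os.path.isdir(os.path.join(index_root, name))
    )
    return [script] + lists
```

The returned list could be passed to subprocess.call() by a job running as the namazu user, with /var/lib/namazu/mailman and /home/namazu/bin/mailman_index as arguments.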

/home/namazu/bin/nmzproc [Download][Diff]
A Python script written by Lindsay Haisley and modified by me. It adds a new Mailman list to the search engine. Run it manually as the namazu user:
namazu@myhost:~/bin$ ./nmzproc --uselower foo

/usr/lib/cgi-bin/mailman/search
A setgid list wrapper that calls /usr/lib/mailman/Mailman/Cgi/search.py.
Create it yourself from rmlist (or from any other wrapper in this directory whose name is six characters long, so that the substitution leaves the length of the file unchanged):
root@myhost:/usr/lib/cgi-bin/mailman# perl -p -e 's/rmlist/search/g' rmlist > search
root@myhost:/usr/lib/cgi-bin/mailman# chown root.list search
root@myhost:/usr/lib/cgi-bin/mailman# chmod 2755 search
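The perl one-liner above patches the script name inside an existing wrapper. The same idea can be sketched in Python, assuming (as the six-character remark suggests) that both names must be equally long so that nothing inside the file shifts. The function names below are hypothetical.

```python
def rename_in_wrapper(data, old=b"rmlist", new=b"search"):
    """Replace every occurrence of `old` with `new` in the wrapper's bytes.

    Both names must have the same length so that the size of the file,
    and hence every offset inside it, stays unchanged.
    """
    if len(old) != len(new):
        raise ValueError("names must have the same length")
    return data.replace(old, new)


def clone_wrapper(src, dst, old=b"rmlist", new=b"search"):
    """Write a copy of the wrapper `src` to `dst` with the name substituted."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(rename_in_wrapper(data, old, new))
```

Ownership and mode still have to be set afterwards with the chown and chmod commands shown above; the sketch only covers the substitution.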

/usr/lib/mailman/Mailman/Cgi/search.py
A symlink to /home/namazu/bin/search.py. Create it yourself.

/home/namazu/bin/search.py [Download][Diff]
This wrapper script performs authorization and sets up the environment of the search engine, then calls /usr/lib/cgi-bin/namazu.cgi. It is a modified version of Lindsay Haisley's nmz_wrapper.cgi.

/usr/lib/cgi-bin/namazu.cgi
Off-the-shelf search engine as installed from the namazu2 package.

/usr/share/namazu/template/
Directory of original HTML templates as installed from the Debian package. These are currently unused. Listed for completeness.

/home/namazu/etc/templates/ [Download dir content]
Directory of HTML templates. nmzproc copies NMZ.* files from here to /var/lib/namazu/mailman/foo/.

/var/lib/namazu/mailman/
A new directory writable by the namazu user. Create it yourself. The mailman_index and nmzproc scripts put their output here.

/var/lib/namazu/mailman/foo/namazurc
Search configuration file for mailing list foo. Created by mailman_index.

/var/lib/namazu/mailman/foo/mknmzrc
Indexing (mknmz) configuration file for mailing list foo. Created by nmzproc.

/var/lib/namazu/mailman/foo/NMZ.*
Two kinds of files are mixed here.
NMZ.head*, NMZ.body*, NMZ.foot*, NMZ.result* and NMZ.tips* are customized, language-dependent HTML snippets created once by nmzproc from the templates in /home/namazu/etc/templates/.
The rest of the NMZ.* files are the indices of the search engine. They are created and refreshed periodically by mailman_index.
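To make the distinction concrete, here is a small sketch that separates the two kinds of NMZ.* files by name. It relies only on the prefixes listed above; the function name is hypothetical.

```python
# Prefixes of the HTML snippets that nmzproc copies from the templates.
TEMPLATE_PREFIXES = ("NMZ.head", "NMZ.body", "NMZ.foot", "NMZ.result", "NMZ.tips")


def split_nmz_files(names):
    """Split file names into (templates, indices), keeping input order."""
    templates, indices = [], []
    for name in names:
        if not name.startswith("NMZ."):
            continue  # e.g. namazurc or mknmzrc: not an NMZ.* file
        if name.startswith(TEMPLATE_PREFIXES):
            templates.append(name)
        else:
            indices.append(name)
    return templates, indices
```

Such a split can be handy when backing up a list's directory: the templates are written once and may be hand-edited, whereas the indices can be rebuilt by mailman_index.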

/var/lib/mailman/archives/private/foo
Archive of the Mailman list foo. It must be readable by mailman_index when started by the namazu user.

/etc/mailman/*/archtoc.html [en version] [hu version]
/etc/mailman/*/archtocnombox.html [en version] [hu version]
Language-dependent HTML templates of Mailman. A search form is added, as done by Tom and Lindsay. Edit your templates manually as needed.

Operations

Adding a new list

Run nmzproc (see above) once for each list you want to make searchable.
The modified script looks into the Mailman configuration to retrieve all languages enabled for the list foo, then creates the necessary /var/lib/namazu/mailman/foo/NMZ.* files as well as /var/lib/namazu/mailman/foo/mknmzrc.

Indexing

Run mailman_index for each list to be indexed, a few times a day, once an hour, or as often as you wish. The script ends by calling mknmz, which reads the mails archived since the last indexing and updates the /var/lib/namazu/mailman/foo/NMZ.* files.
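As a rough illustration of that final step, the sketch below assembles a mknmz command line for one list from the paths used in this case study. It is an assumption about what such a call could look like, not the actual content of mailman_index, which may pass further options; only mknmz's -f (config file) and -O (output directory) options are used.

```python
import os

INDEX_ROOT = "/var/lib/namazu/mailman"
ARCHIVE_ROOT = "/var/lib/mailman/archives/private"


def build_mknmz_command(listname):
    """Return an argv for mknmz that indexes the archive of one list."""
    index_dir = os.path.join(INDEX_ROOT, listname)
    return [
        "mknmz",
        "-f", os.path.join(index_dir, "mknmzrc"),  # per-list config written by nmzproc
        "-O", index_dir,                           # directory holding the NMZ.* files
        os.path.join(ARCHIVE_ROOT, listname),      # archive to be indexed
    ]
```

Calling build_mknmz_command("foo") yields a list that could be handed to subprocess.call() by a process running as the namazu user.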

Search

When the user fills in the search form on a Mailman archive web page and clicks the Submit button, the web server, typically running with UID www-data, starts the CGI program /usr/lib/cgi-bin/mailman/search. The search engine must read the archives and index files, so this program is just a setgid list wrapper that calls /usr/lib/mailman/Mailman/Cgi/search.py, which in turn is a symlink to /home/namazu/bin/search.py. For private lists, search.py checks whether the user is authorized to access the archive content. (Note: authorization is based on regular membership only. Server and list administrator access rights are disregarded. Enabling admin staff to search archives may be a subject of further development.)
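The access rule described above can be modelled as a small decision function. This is only a sketch of the rule, not the code of search.py, which relies on Mailman's own authentication. The function name, its parameters and the case-insensitive address comparison are assumptions made for illustration.

```python
def may_search(archive_private, members, user):
    """Decide whether `user` may search the archive of a list.

    archive_private -- True if the archive of the list is private
    members         -- iterable of subscribed addresses
    user            -- authenticated address, or None if nobody is logged in
    """
    if not archive_private:
        return True   # public archives are searchable by anyone
    if user is None:
        return False  # private archive and no authenticated user
    # Regular membership is the only criterion; admin rights are ignored.
    return user.lower() in {m.lower() for m in members}
```

Note that a list administrator who is not subscribed is refused, which matches the limitation mentioned in the note above.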

A live example

Bonetools is an English-language mailing list of archaeologists. Its archive can be found here.
Go to the bottom of the page and search for "bone".

License

All new programs, as well as the modifications of existing ones, are licensed under the GNU GPL.

Contact

If I was too terse, if you did not understand something, or if you had any problem with the installation, send a mail to <kissg@ssg.ki.iif.hu>. I hope I can help.

Gábor