User Guide

OnSearch

World Wide Web Searches

Table of Contents

  1. Starting OnSearch
  2. Searches
  3. Viewing Results
  4. Adding Files and Web Pages To Be Searched
  5. Volume Filters
  6. OnSearch Administration
  7. Copyright

Starting OnSearch

To open the OnSearch search page, enter the following URL in your World Wide Web browser.
http://server-name/onsearch/index.shtml
Replace server-name with the name of the Web server where OnSearch is installed.

Searches

Figure 1 shows the OnSearch search page. From this page, you can enter search terms, define options for the search, and begin searching.

Figure 1. The OnSearch search page.

Enter the search terms and click "Search." Search terms are words composed of letters, numbers, and underscores ('_'). Any other character, such as punctuation or white space, separates one word from the next.

The phrase "To be, or not to be," for example, is treated as six separate words: "To," "be," "or," "not," "to," "be."

Similarly, a technical term like "Tk::MainWindow," the name of a Perl library module, is treated as two words: "Tk" and "MainWindow."
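
The word rule described above can be sketched in a few lines of Python. This is only an illustration that assumes a word is a run of ASCII letters, digits, and underscores. The function name split_terms is hypothetical, and OnSearch's own tokenizer may differ in detail.

```python
import re

# Assumption: a word is a run of ASCII letters, digits, and underscores.
# Everything else (punctuation, white space, "::") separates words.
WORD = re.compile(r"[A-Za-z0-9_]+")

def split_terms(text):
    """Return the words found in text, in their original order."""
    return WORD.findall(text)

print(split_terms("To be, or not to be,"))  # ['To', 'be', 'or', 'not', 'to', 'be']
print(split_terms("Tk::MainWindow"))        # ['Tk', 'MainWindow']
```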

The search options determine whether OnSearch searches for documents that contain any of the search terms, all of the search terms, or the exact phrase. OnSearch can match words in the exact case of the search terms, or match without regard to case. It can also search for partial word matches. For example, "be" can also match "best," and "not" can match "knot."
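
The following Python sketch shows one way the options above could combine. It is not OnSearch's implementation. The names document_matches and term_matches, and the mode values "any," "all," and "phrase," are assumptions made for the example.

```python
import re

def words(text):
    """Split text into words of letters, digits, and underscores."""
    return re.findall(r"[A-Za-z0-9_]+", text)

def term_matches(term, word, ignore_case=False, partial=False):
    """Compare one search term with one document word."""
    if ignore_case:
        term, word = term.lower(), word.lower()
    # A partial match accepts the term anywhere inside the word.
    return term in word if partial else term == word

def document_matches(terms, text, mode="any", ignore_case=False, partial=False):
    """Return True if text satisfies the search in the given mode."""
    doc = words(text)
    if mode == "phrase":
        # The terms must match consecutive words of the document.
        n = len(terms)
        return any(
            all(term_matches(t, w, ignore_case, partial)
                for t, w in zip(terms, doc[i:i + n]))
            for i in range(len(doc) - n + 1)
        )
    hits = [any(term_matches(t, w, ignore_case, partial) for w in doc)
            for t in terms]
    return all(hits) if mode == "all" else any(hits)
```

With this sketch, document_matches(["be"], "the best knot") is false, but the same call with partial=True is true, because "be" occurs within "best."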

Speeding Up Searches

Exact search terms allow OnSearch to perform its searches more quickly. Searches for documents that contain exact phrases like "To be, or not to be," or "X Window System," require only slightly more time than searches for single words.

Selecting "Match text within words" causes searches to take much longer than exact word searches. The inexact terms match more words in each document, the process of determining each inexact match takes longer, and OnSearch has more results to handle.

Searches for words consisting of a single letter, when you have selected the option "Match text within words," are the least efficient of all search types. In these cases, OnSearch matches every indexed word that contains the letter.

Repeating a search, however, requires less time if its results are already cached.

Search Preferences

OnSearch records search preferences, including the type of match, whether to match both upper and lower case letters, whether to match text within words, the number of matching terms to display from each document, and the number of documents to display on each page of results.

OnSearch also records whether users have selected a specific volume or volumes (see Volume Filters), and preferences for indexing Web pages from other sites.

The Web browser must allow cookies from the OnSearch Web site in order to record the search information.

Once OnSearch begins searching, it displays the first page of matching documents as soon as they are retrieved. The next section, Viewing Results, describes how to view matching documents.

Viewing Results

Figure 2 shows a typical page of results.

Figure 2. An OnSearch results page.

You can page backward and forward through the results by clicking on the page numbers at the top and bottom of the page. While a search is still in progress, the total number of documents matched is the number found so far, and it can change as you view the results.

Adding Files and Web Pages To Be Searched

Clicking "Archive" on the menu takes you to the Archive page, shown in Figure 3. There you can upload documents, or retrieve individual Web pages or the pages of entire Web sites, to be indexed and then searched by OnSearch.

Figure 3. The archive upload and indexing page.

Adding Web pages. You can retrieve either a single Web page or all of the Web pages from another site. When you click "Update," OnSearch sends the request to the Web indexer, which begins to process it as a background task, and then returns you to the Archive page.

The pages are searchable as soon as OnSearch has indexed them.

OnSearch is able to handle URL redirection and content negotiation. So, for example, to retrieve the English language home page of a site that offers multiple languages, it's possible to enter just the Web server name without the index.html. The other Web server should send the actual English language page, which will be archived by OnSearch as, for example, index.html.en.

OnSearch does not retrieve Web pages from Web servers other than the one you entered, unless the OnSearch administrator enables that capability. Nor does it index Web pages that the remote site has asked to be excluded from indexing.
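
As a rough illustration of these two rules, the Python sketch below checks a linked URL against the server the user entered and against a set of robots.txt rules. It assumes that the remote site expresses its exclusions in robots.txt, which this guide does not state. The function names and the allow_other_servers parameter are hypothetical and are not OnSearch options.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def same_server(start_url, link_url):
    """True if both URLs name the same host."""
    return urlsplit(start_url).hostname == urlsplit(link_url).hostname

def may_retrieve(start_url, link_url, robots_lines,
                 allow_other_servers=False, agent="*"):
    """Decide whether link_url may be retrieved.

    robots_lines holds the lines of the robots.txt file of the
    server that hosts link_url (an assumption of this sketch).
    """
    if not allow_other_servers and not same_server(start_url, link_url):
        return False
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return rules.can_fetch(agent, link_url)

robots = ["User-agent: *", "Disallow: /private/"]
print(may_retrieve("http://www.example.org/",
                   "http://www.example.org/docs/index.html", robots))
```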

OnSearch records the progress of the Web page retrieval process in its Web log if the administrator enables this option.

Uploading files. When you upload a file, a file selection dialog opens. Choose the file to upload. Then click "Update," and OnSearch processes the request in the background.

The file will be searchable and viewable as soon as OnSearch indexes it.

OnSearch can index document types other than text and HTML. For example, if the administrator has enabled filters for PostScript and PDF documents, you can also upload documents of those types to be searched and retrieved. See "OnSearch Administration," or consult the system administrator.

Volume Filters

Without any further configuration, OnSearch searches all of the documents on your Web site, and the name of the search volume, "Default," is visible on the Search page. However, if the administrator has configured alternate search volumes for the site, you can select which parts of the Web site you want to search.

For example, if the Web site's technical documentation is in a subdirectory called "docs," the administrator can create a volume called "Documentation" that specifies this subdirectory. When searching, if you select only the volume "Documentation," OnSearch searches only those documents.
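
The effect of selecting a volume can be pictured as a test on each document's path. The Python sketch below is only an illustration under that assumption. The function in_volume is hypothetical, and the example paths assume a DocumentRoot of /usr/local/apache2/htdocs.

```python
import os.path

def in_volume(path, volume_dir):
    """True if path lies inside volume_dir. Both paths must be absolute."""
    path = os.path.normpath(path)
    volume_dir = os.path.normpath(volume_dir)
    # commonpath compares whole path components, so "docs2" is not in "docs".
    return os.path.commonpath([path, volume_dir]) == volume_dir

docs = "/usr/local/apache2/htdocs/docs"
print(in_volume("/usr/local/apache2/htdocs/docs/install.html", docs))
print(in_volume("/usr/local/apache2/htdocs/news/index.html", docs))
```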

OnSearch records your volume preferences. As with other preferences, your Web browser must allow cookies from the OnSearch Web site for OnSearch to save them.

OnSearch Administration

Administration Table of Contents

  1. Indexing
  2. Document Permissions
  3. The OnSearch Web Log
  4. Administrator User IDs and Passwords
  5. Onsearch.cfg Configuration

Indexing

The onindex program performs the work of indexing the Web site's documents. Onindex runs as a background process, and indexes files at the interval given by IndexInterval in onsearch.cfg, normally every four hours.

Onindex indexes only those directories that have been updated since the previous indexing task. Although the first time that you index a Web site might take minutes or even hours for a large site, later indexing is much faster.
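
The idea of indexing only what has changed can be sketched in Python as follows. This is not how onindex is implemented. It assumes that a change is detected by comparing modification times with the time of the previous indexing run, and the function name directories_to_reindex is hypothetical.

```python
import os

def directories_to_reindex(root, last_indexed):
    """Yield each directory under root that changed after last_indexed.

    last_indexed is a time in seconds since the epoch. A directory
    counts as changed if it, or any file directly inside it, has a
    later modification time (an assumption of this sketch).
    """
    for dirpath, dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        if any(os.stat(p).st_mtime > last_indexed for p in paths):
            yield dirpath
```

On a first run, with last_indexed set to 0, every directory is reported, which corresponds to the longer initial indexing described above.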

Before indexing, make sure that you have edited onsearch.cfg with the ExcludeDir and ExcludeGlob values that will prevent onindex from indexing anything that you don't need indexed, like PHP and Java library files, or OnSearch's cache directory. See Onsearch.cfg Configuration.

OnSearch does not start onindex automatically. You can start it from the Admin page by checking the "Index Now" box, or as the system administrator from the command line using either of these shell commands.

# /usr/local/sbin/onindex
# /usr/local/etc/init.d/onindex start

If onindex is idle, the system administrator can start indexing immediately with the following shell command.

# /usr/local/etc/init.d/onindex index

To stop onindex, the system administrator can use the following command.

# /usr/local/etc/init.d/onindex stop

Onindex records its activity in the OnSearch Web log. For testing or monitoring, the administrator can tell onindex to provide more detailed information with the following shell commands.

# /usr/local/sbin/onindex -v
Or,
# /usr/local/sbin/onindex -vv

Note: To run onindex from the "Admin" page, the Web server must have its own unique user and group ID (for example, user apache and group apache), so that the program has the necessary permissions to start as a background process.

The onindex(8) manual page contains detailed information about onindex options and run-time files.

Document Permissions

Onindex writes indexes in each document's directory. In order to do this, it is necessary that the site's documents and document directories be writable by the Web server.

The Web server's httpd.conf file contains the user and group information as the values of the "User" and "Group" directives.

The system administrator needs to set the ownership of the site's documents to these values. Assuming the administrator has configured the Web server to run as user apache and group apache, the following shell commands change the ownership of the Web site's documents and document directories.

# chown -R apache /usr/local/apache2/htdocs
# chgrp -R apache /usr/local/apache2/htdocs

Note: Giving the Web server the ability to write to the document directories can be a security risk. The Apache manual document, manual/misc/security_tips.xml, discusses these issues.

The OnSearch Web Log

OnSearch and onindex record their activity, and any warnings or errors, in their Web log, which is normally /usr/local/spool/onsearch/onsearch.log. The log uses a format similar to Apache's default Common Log Format, so you can parse the log for information as you would one of Apache's logs.

OnSearch's administrator can configure OnSearch and onindex to log extra information about local and remote indexing activity, and caching. Onsearch.cfg contains these options, which are described in Onsearch.cfg Configuration.

The system administrator can rotate the OnSearch log with the rotatelogs(8) utility that is distributed with the Apache server.

Administrator User IDs and Passwords

In a normal installation, OnSearch maintains its administrator user IDs, passwords, and groups in the following files.
/usr/local/etc/onpasswd
/usr/local/etc/ongroup

Refer to the htpasswd(1) manual page for information about adding users and passwords, and to the Apache documentation for mod_auth for information about configuring authorization with Apache.

Note: The OnSearch administrator's user name at installation is "onsearch," and the password is "onsearch." You should change these values after installation.

Onsearch.cfg Configuration

This section describes the onsearch.cfg options.
BackupIndexes
If this value is non-zero, OnSearch makes backup copies of the indexes.
BinDir
This is the directory where onindex is located. Its value is normally /usr/local/sbin.
CacheResults
If non-zero, OnSearch caches the results of each search. This option should be enabled for normal use. Caching should only be disabled when testing OnSearch.
CacheReports
If set to a non-zero value, OnSearch logs extra information about cache activity.
DataDir
This is the directory where OnSearch and onindex store their run time information. In OnSearch's normal configuration, its value is /usr/local/var/run/onsearch.
DigitsOnly
This option, if non-zero, tells onindex to also index words that contain only digits.
ExcludeDir
The value of each ExcludeDir entry is a directory that you don't want to be indexed or searched. Onsearch.cfg can have as many ExcludeDir entries as necessary.

At the least, you should exclude OnSearch's cache directory from searching, as in this example.

ExcludeDir /usr/local/apache2/htdocs/onsearch/cache

If you add or delete ExcludeDir entries, the index contents could become invalid, and you should reindex the site. The onindex(8) manual page describes how to delete old indexes.

ExcludeGlob
The value of each ExcludeGlob directive is a wild-card pattern. OnSearch excludes files with names that match the entry's pattern from indexing and searching.

For example, because GIF image files contain no text, OnSearch excludes them with the following entries.

ExcludeGlob *.gif
ExcludeGlob *.GIF

If the operating system distinguishes between upper and lower case file names, you should include ExcludeGlob entries for both.
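
The case sensitivity of these patterns can be illustrated with Python's fnmatch module. This sketch assumes that the patterns are shell-style wild cards compared with the file name. The function is_excluded is hypothetical, and OnSearch's own matching may differ.

```python
from fnmatch import fnmatchcase

# The two example entries from above.
EXCLUDE_GLOBS = ["*.gif", "*.GIF"]

def is_excluded(filename, patterns=EXCLUDE_GLOBS):
    """True if filename matches any pattern, comparing case exactly."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)

print(is_excluded("logo.gif"))    # True
print(is_excluded("LOGO.GIF"))    # True
print(is_excluded("logo.Gif"))    # False: neither pattern covers mixed case
print(is_excluded("index.html"))  # False
```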

OnSearch's normal configuration excludes PostScript and PDF documents from indexing and searching, because they can significantly slow the search process. If you want to index and search documents of these types, comment out or remove these ExcludeGlob entries, as in the following example.

# ExcludeGlob  *.ps
# ExcludeGlob  *.PS
# ExcludeGlob  *.pdf
# ExcludeGlob  *.PDF

You can also determine which files OnSearch indexes and searches based on the files' MIME types. See PlugIn.

If you add or delete ExcludeGlob entries, the index contents become invalid, and it is necessary to reindex the site. The onindex(8) manual page contains instructions for deleting old indexes.

HasSymLinks
Set this value to non-zero if the operating system provides symlinks. It should not be necessary to change this value.
IndexInterval
The interval, in seconds, between the times that onindex reindexes the files and directories that have changed. In a normal configuration, its value is 14400 seconds, or four hours.
IndexNontargetURLs
If set to non-zero, OnSearch also indexes remote site URLs that are linked to by the site being indexed.

Warning: You should use this option with extreme caution, because the Web indexer can retrieve many more Web pages (and Web sites) than the site the user requested.

OnSearchDir
The value of OnSearchDir is the name of the subdirectory where OnSearch is installed. It is normally a subdirectory of the Web server's DocumentRoot directory, as in this example.
OnSearchDir /usr/local/apache2/htdocs/onsearch

In normal use, you should not need to change this value.

PageSize
The default number of documents to display on each page of results. This value is normally saved as a user preference.
PartialWordMatch
This is the default value of the "Partial Word Match" search option, which is saved as a user preference.
PlugIn
Each PlugIn entry assigns a plugin filter to a MIME type. The plugin extracts the searchable text from documents of that type.

Plugins are filter programs that run helper applications, like xpdf, ghostscript, and gzip, reading the document to be indexed as standard input and writing the searchable text to standard output.
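
The filter pattern described above, document in on standard input and searchable text out on standard output, can be sketched in Python. The example decompresses gzip data with Python's own gzip module instead of running a helper application. It is not one of the plugins distributed with OnSearch, and the function name filter_stream is hypothetical.

```python
import gzip
import io

def filter_stream(source, sink):
    """Read gzip data from source and write the unpacked bytes to sink."""
    with gzip.GzipFile(fileobj=source, mode="rb") as unpacked:
        # Copy in chunks so that large documents are not held in memory.
        for chunk in iter(lambda: unpacked.read(65536), b""):
            sink.write(chunk)

# Demonstration with in-memory streams in place of stdin and stdout.
document = io.BytesIO(gzip.compress(b"searchable text\n"))
output = io.BytesIO()
filter_stream(document, output)
print(output.getvalue())
```

Used as a real filter, the same function would be called with sys.stdin.buffer and sys.stdout.buffer as its two arguments.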

Plugins are located in the plugins subdirectory of OnSearchDir.

OnSearch includes plugins for the following MIME types.

Each PlugIn entry contains the MIME type of the document and the name of the plugin. If a file does not contain text, or you don't want to index a certain file type, you can specify the null plugin as the second argument.

You can also exclude document types from searching based on their file extension. See ExcludeGlob.

After adding or deleting PlugIn entries, the index contents could become invalid, and you should reindex the site. The onindex(8) manual page contains instructions for deleting old indexes.

ResultsPerFile
This is the default value of the "Results per file" user option, which is saved as a user preference.
ResultsPersist
The value is the amount of time, in seconds, that OnSearch keeps search results.
SearchContext
The value is the number of characters to display on either side of a matched word or phrase.
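
A minimal Python sketch of what such a context window means follows. The function context_snippet and its parameters are hypothetical, and OnSearch may trim or format its results differently.

```python
def context_snippet(text, start, end, search_context):
    """Return the match at text[start:end] with up to search_context
    characters of surrounding text on each side."""
    left = max(0, start - search_context)
    right = min(len(text), end + search_context)
    return text[left:right]

text = "To be, or not to be, that is the question"
start = text.find("not")
print(context_snippet(text, start, start + len("not"), 5))  # ', or not to b'
```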
SearchRoot
If there is only one SearchRoot entry, its value is normally the Web server's document directory, the value of DocumentRoot in httpd.conf.

If the administrator has not configured any volumes, it is also the value of the volume "Default." See Volume.

You can change this value to a subdirectory of DocumentRoot, if necessary, to limit searching to only a part of the site. In that case, you should also add SearchRoot entries for OnSearch's websites and uploads subdirectories.

ServerName
The fully qualified domain name of the Web server, or its numeric IP address. The value is the same as the value of ServerName in httpd.conf.
SingleLetterWords
If set to non-zero, this option instructs onindex to also index single letter words.

You should leave this option enabled, because OnSearch uses several sets of rules to try to avoid retrieving every instance of the letter "x" or "c" when searching for phrases like "X Window System" and "C Language."

User
The user name of the Web server process owner. This is the same as the value of User in httpd.conf.
VerboseWebIndexer
If the value is non-zero, OnSearch records information about individual URLs that users have requested to be retrieved and indexed.

VerboseWebIndexer also enables the logging of HTTP redirects and content negotiation, and notes URLs that are disallowed from retrieval.

Volume
The value of each Volume entry is the label of a volume and the subdirectory of the Web site that the entry refers to.

If the administrator has not configured any volumes, the volume "Default" refers to the SearchRoot directory.

OnSearch saves each user's volume selections as part of that user's preferences.

WebLogDir
The value is the subdirectory where OnSearch keeps its Web log. Its value is normally /usr/local/spool/onsearch.

Copyright

OnSearch is copyright © 2004-2005 by Robert Kiesling, and is distributed under the terms of the Perl Artistic License. Refer to the file "Artistic" for information.

Credits
  • The name "OnSearch," its index file format, and much information about search strategies in general are from the articles "On Search" by Tim Bray.
  • The vector search subroutines use an adaptation of the algorithms described in the Perl.com article "Building a Vector Space Search Engine in Perl" by Maciej Ceglowski.
  • The string search libraries use strategies of the Boyer-Moore search algorithm, as described in Michael Abrash's Graphics Programming Black Book.