Figure 1. The OnSearch search page.
Enter the search terms and click "Search." Search
terms are words composed of letters, digits, and
underscores ('_').
The phrase "To be, or not to be," for example, is treated as
six separate words: "To," "be," "or," "not," "to," and "be."
Similarly, a technical term like "Tk::MainWindow," the name
of a Perl library module, is treated as two words: "Tk" and
"MainWindow."
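The word-splitting rule described above can be illustrated with a short Python sketch. This is not OnSearch's own code: the function name split_terms is hypothetical, and the sketch assumes that "letters" means ASCII letters only.

```python
import re

# A search term is a run of letters, digits, and underscores;
# anything else (spaces, commas, "::") separates terms.
# Assumption: only ASCII letters count as letters.
TERM = re.compile(r"[A-Za-z0-9_]+")


def split_terms(query):
    """Return the list of words found in a query string."""
    return TERM.findall(query)


print(split_terms("To be, or not to be,"))
print(split_terms("Tk::MainWindow"))
```

The first call prints the six words of the phrase, and the second prints ['Tk', 'MainWindow'].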
The search options determine whether OnSearch
searches for documents that contain any or all of the search
terms, or the exact phrase. OnSearch can match only words
with the same case as the search terms, or
search for matches without regard for case. It can also
search for partial word matches. For example, "be" can
also match "best," and "not" can match "knot."
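The difference between whole-word, partial-word, and case-sensitive matching can be sketched in Python as follows. The function matches and its parameter names are hypothetical and only illustrate the behavior described above; they say nothing about how OnSearch implements its search.

```python
import re


def matches(term, text, whole_word=True, match_case=False):
    """Report whether `term` occurs in `text` under the given options."""
    pattern = re.escape(term)
    if whole_word:
        # (?<!\w) and (?!\w) require a non-word character, or the
        # start or end of the text, on each side of the term.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    flags = 0 if match_case else re.IGNORECASE
    return re.search(pattern, text, flags) is not None


print(matches("be", "the best"))                    # whole words only
print(matches("be", "the best", whole_word=False))  # text within words
print(matches("To", "to be", match_case=True))      # exact case
```

With whole-word matching, "be" does not match "best"; with matching inside words, it does, which is why that option produces many more candidate matches.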
Selecting "Match text within words" causes searches to
take much longer than exact word searches: the inexact
terms match more words in each document, determining
each inexact match takes longer, and OnSearch
has more results to handle.
Searches for words consisting of a single letter, with the
option "Match text within words" selected, are the
least efficient of all search types. In these cases, OnSearch
matches documents with every indexed word that contains the
letter.
Later searches, however, require
less time if the results are already cached.
OnSearch also records whether
users have selected a specific volume or volumes (see
Volumes), and preferences for
indexing Web pages from other sites.
The Web browser
must allow cookies from the OnSearch Web site in order to
record the search information.
Once OnSearch begins
searching, it displays the first page of
matching documents as soon as they are retrieved. The next
section, Viewing Results, describes
how to view matching documents.
Figure 2. An OnSearch results page.
You can page backward and forward through the results by
clicking on the page numbers at the top and bottom of the
page. The total number of documents matched is the total
found by a search in progress and could change as you
view the results.
Figure 3. The archive upload and indexing page.
Adding Web pages. You can add either a single Web
page or all of the Web pages from another site. When you
click "Update," OnSearch sends the request to the Web
indexer, which processes it as a background task, and
OnSearch then returns to the Archive page.
The pages are searchable as soon as OnSearch has indexed them.
OnSearch can handle URL
redirection and content negotiation. For example, to retrieve
the English language home page of a site that offers
multiple languages, you can enter just the
Web server name, without index.html. The
remote Web server should send the actual English language page,
which OnSearch archives as, for example,
index.html.en.
OnSearch does not retrieve Web
pages from Web servers other than the one
you entered, unless the OnSearch administrator enables that
capability. Nor does it index Web pages that the remote site
has asked to be excluded from indexing.
OnSearch records the
progress of the Web page retrieval process in its Web log
if the administrator enables this option.
Uploading files. When you upload a file, OnSearch opens a
file selection dialog. Choose the file to upload, then click
"Update," and OnSearch processes the request in
the background.
The file is searchable and viewable as
soon as OnSearch indexes it.
OnSearch can index document types
other than text and HTML. For example, if the administrator
has enabled filters for PostScript and PDF documents, you
can also upload documents of those types to be searched and
retrieved. See "OnSearch
Administration," or consult the system administrator.
For example, if the Web site's technical documentation is in
a subdirectory called "docs," the administrator
can create a volume called "Documentation" that specifies
this subdirectory. When searching, if you select only the
volume "Documentation," OnSearch searches only those
documents.
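The effect of selecting a volume can be pictured as a filter on document paths. The Python sketch below is an illustration only: the function in_volume and the file names are hypothetical, and OnSearch may select a volume's documents in a different way.

```python
from pathlib import PurePosixPath


def in_volume(document, volume_dir):
    """Return True if `document` lies below the directory `volume_dir`."""
    doc = PurePosixPath(document)
    vol = PurePosixPath(volume_dir)
    return vol in doc.parents


# Hypothetical "Documentation" volume and file names.
docs_volume = "/usr/local/apache2/htdocs/docs"
files = [
    "/usr/local/apache2/htdocs/index.html",
    "/usr/local/apache2/htdocs/docs/install.html",
    "/usr/local/apache2/htdocs/docs/api/search.html",
]
selected = [f for f in files if in_volume(f, docs_volume)]
print(selected)
```

Only the two files under the docs subdirectory are selected; the site's top-level index.html is left out.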
OnSearch records your volume preferences. As with other
preferences, your Web browser must allow cookies from the
OnSearch Web site for OnSearch to save your personal data.
Onindex indexes only those directories that have
been updated since the previous indexing task. Although
the first time that you index a Web site might take minutes or
even hours for a large site, later indexing is much faster.
Before indexing, make sure that you have edited
onsearch.cfg with the ExcludeDir and
ExcludeGlob values that will prevent onindex
from indexing anything that you don't need indexed, like
PHP and Java library files, or OnSearch's cache directory.
See onsearch.cfg
Configuration.
OnSearch does not run onindex automatically. You
can start it from the Admin page by
checking the "Index Now" box, or, as the system
administrator, from the command line using either of these
shell commands.
If onindex is idle, the system administrator can start indexing
immediately with the following shell command.
To stop onindex, the system administrator can use
the following command.
Onindex records its activity in the
OnSearch Web log. For testing or
monitoring, the administrator can tell
onindex to provide more detailed information with
the following shell commands.
Note: To run onindex from the
"Admin" page, the Web server must have its
own unique user and group ID (for example, user
apache, group apache), so that the
program has the necessary permissions to start as a
background process.
The onindex(8) manual page contains detailed
information about onindex options and run-time
files.
The Web server's httpd.conf file contains the user
and group information as the values of the "User"
and "Group" directives.
The system administrator needs to set the ownership of the
site's documents to these values. The following shell
commands set the initial ownership of the Web site's
documents and document directories, assuming the
administrator has configured the Web server to run as user
apache, group apache.
OnSearch's administrator can configure OnSearch and
onindex to log extra information about local and
remote indexing activity, and caching.
Onsearch.cfg contains these options, which are
described in Onsearch.cfg
Configuration.
The system administrator can rotate the OnSearch log with the
rotatelogs(8) utility that is distributed with the
Apache server.
Refer to the htpasswd(8) manual page for
information about adding users and passwords, and to the
Apache documentation for mod_auth for information
about configuring authorization with Apache.
Note: At installation, the user name of the
OnSearch administrator is "onsearch," with the password
"onsearch." You should change these values after installation.
At the least, you should exclude OnSearch's cache directory
from searching, as in this example.
If you add or delete ExcludeDir entries, the index
contents could become invalid, and you should
reindex the site. The onindex(8) manual page
describes how to delete old indexes.
For example, because GIF image files contain no text,
OnSearch excludes them with
the following entries.
If the operating system distinguishes between upper and
lower case file names, you should include
ExcludeGlob entries for both.
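The reason for listing both cases can be seen with Python's fnmatch module, which offers a case-sensitive wildcard comparison. This is a sketch under assumptions, not OnSearch's implementation: the function is_excluded is hypothetical, and the sketch assumes that patterns are compared with a file's base name.

```python
from fnmatch import fnmatchcase

# Patterns from the GIF example. Both cases are listed because
# fnmatchcase() compares names case-sensitively.
EXCLUDE_GLOBS = ["*.gif", "*.GIF"]


def is_excluded(filename, patterns=EXCLUDE_GLOBS):
    """Return True if `filename` matches any exclusion pattern."""
    return any(fnmatchcase(filename, p) for p in patterns)


print(is_excluded("logo.gif"))              # matched by *.gif
print(is_excluded("LOGO.GIF"))              # matched by *.GIF
print(is_excluded("LOGO.GIF", ["*.gif"]))   # lower case pattern alone
```

With only the lower case pattern, a file named LOGO.GIF would not be excluded.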
OnSearch's normal configuration excludes PostScript and PDF
documents from indexing and searching, because they can
significantly slow the search process. If you want to index
and search documents of these types, comment out or remove
their ExcludeGlob entries, as in the following example.
You can also determine which files OnSearch
indexes and searches based on the files' MIME types.
See PlugIn.
After you add or delete ExcludeGlob entries, the
index contents are invalid, and you need to
reindex the site. The onindex(8) manual page
contains instructions for deleting old indexes.
Warning: You should use this option with
extreme caution, because the Web indexer can retrieve many
more Web pages (and Web sites) than the site the user requested.
In normal use, you should not need to change this value.
Plugins are filter programs that run helper applications,
like xpdf, ghostscript, and gzip,
reading the document to be indexed as standard input and
writing the searchable text to standard output.
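The filter convention (document in, searchable text out) can be sketched in Python. The example below is hypothetical and is not one of the plugins shipped with OnSearch; it only shows the shape of a filter that decompresses gzip data from one stream and writes the text to another.

```python
import gzip
import io


def filter_stream(source, sink):
    """Decompress gzip data from `source` and write it to `sink`."""
    with gzip.GzipFile(fileobj=source, mode="rb") as compressed:
        for chunk in iter(lambda: compressed.read(65536), b""):
            sink.write(chunk)


# Demonstration with in-memory streams. Run as a real filter, the
# call would be filter_stream(sys.stdin.buffer, sys.stdout.buffer).
packed = io.BytesIO(gzip.compress(b"searchable text\n"))
unpacked = io.BytesIO()
filter_stream(packed, unpacked)
print(unpacked.getvalue())
```

Reading and writing in chunks keeps memory use small even for large documents.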
Plugins are located in the plugins subdirectory of
OnSearchDir.
OnSearch includes plugins for the following MIME types.
Each PlugIn entry contains the MIME type of the
document and the name of the plugin. If a file does not
contain text, or you don't want to index a certain file
type, you can specify the null plugin as the second
argument.
You can also exclude document types from searching based on
their file extension. See ExcludeGlob.
After you add or delete PlugIn entries, the
index contents could become invalid, and you should
reindex the site. The onindex(8) manual page
contains instructions for deleting old indexes.
If the administrator has not configured any volumes, it is
also the value of the volume "Default." See Volume.
You can change this value to a subdirectory of
DocumentRoot, if necessary, to limit searching to
only a part of the site. In that case, you should also add
SearchRoot entries for
OnSearch's websites and uploads
subdirectories.
You should leave this option enabled, because OnSearch uses
several sets of rules to try to avoid retrieving, for
example, every instance of the letter "x" or "c" when
searching for phrases like "X Window System" and "C
Language."
VerboseWebIndexer also enables the logging of HTTP
redirects and content negotiation, and notes URLs that are
disallowed from retrieval.
If the administrator has not configured any volumes, the
volume "Default" refers to the SearchRoot
directory.
OnSearch saves each user's volume selections as part of that user's preferences.
Starting OnSearch
You can browse to the OnSearch
search page by entering the following URL in your World Wide
Web browser.
http://server-name/onsearch/index.shtml
Replace server-name with the name of
the Web server where OnSearch is installed.
Searches
Figure 1 shows the OnSearch search page.
From this page,
you can enter search terms, define options for the search,
and begin searching.
Speeding Up Searches
Exact search terms allow OnSearch to perform its searches
more quickly. Searches for documents that contain exact
phrases like "To be, or not to be" or "X Window System"
require only slightly more time than searches for single words.
Search Preferences
OnSearch records search
preferences, including the type of match, whether to match
both upper and lower case letters, text within words, the
number of matching terms to display from each document, and
the number of documents to display on each page of results.
Viewing Results
Figure 2 shows a typical page of results.
Adding Files and Web Pages To Be Searched
Clicking "Archive" on the menu takes you to
the Archive page, as in Figure 3, where you can upload
documents or retrieve individual Web pages or the pages of
Web sites to be indexed and then searched by OnSearch.
Volume Filters
Without any further configuration, OnSearch searches all of
the documents on your Web site, and the name of the search
volume, "Default," is visible on the Search page. If the
administrator has configured alternate search volumes
for the site, however, you can select which parts of the Web
site you want to search.
OnSearch Administration
Indexing
The onindex program performs the work of indexing
the Web site's documents. Onindex runs as a
background process, and indexes files at the interval given
by IndexInterval in onsearch.cfg, normally
every four hours.
# /usr/local/sbin/onindex
# /usr/local/etc/init.d/onindex start
# /usr/local/etc/init.d/onindex index
# /usr/local/etc/init.d/onindex stop
Or,
# /usr/local/sbin/onindex -v
# /usr/local/sbin/onindex -vv
Document Permissions
Onindex writes indexes in each document's
directory. In order to do this, it is necessary that the
site's documents and document directories be writable by
the Web server.
Note: Giving the Web server the ability to
write to the document directories can be a security risk.
The Apache manual document
manual/misc/security_tips.xml discusses these
issues.
# chown -R apache /usr/local/apache2/htdocs
# chgrp -R apache /usr/local/apache2/htdocs
The OnSearch Web Log
OnSearch and onindex
record their activity, and any warnings or errors, in their
Web log, which is normally
/usr/local/spool/onsearch/onsearch.log. The log
uses a format similar to Apache's default Common Log Format,
so you can parse the log for information as you would one of
Apache's logs.
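As a starting point for parsing, the following Python sketch reads a line in Apache's Common Log Format. The pattern follows the Apache format, and the sample line comes from the Apache documentation rather than from an OnSearch log; because OnSearch's format is only similar, the pattern may need adjusting. The name parse_clf is hypothetical.

```python
import re

# Fields of Apache's Common Log Format:
# host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields, or None if the line does not match."""
    m = CLF.match(line)
    return m.groupdict() if m else None


# Sample line from the Apache documentation, not an OnSearch log.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
entry = parse_clf(sample)
print(entry["host"], entry["status"], entry["request"])
```

Lines that do not fit the pattern return None, which makes it easy to spot where the OnSearch log departs from the Apache format.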
Administrator User IDs and Passwords
In a normal installation, OnSearch maintains its
administrator user IDs, passwords, and groups in the
following files.
/usr/local/etc/onpasswd
/usr/local/etc/ongroup
Onsearch.cfg Configuration
This section describes the onsearch.cfg options.
BackupIndexes
If this value is non-zero, OnSearch makes backup copies of
the indexes.
BinDir
This is the directory where onindex is located.
Its value is normally /usr/local/sbin.
CacheResults
If non-zero, OnSearch caches the results of each search.
This option should be enabled for normal use. Caching
should only be disabled when testing OnSearch.
CacheReports
If set to a non-zero value, CacheReports logs extra
information about cache activity.
DataDir
This is the directory where OnSearch and onindex
store their run time information. In OnSearch's normal
configuration, its value is /usr/local/var/run/onsearch.
DigitsOnly
This option, if non-zero, tells onindex to also
index words that contain only digits.
ExcludeDir
The value of each ExcludeDir entry is a directory that you
don't want to be indexed or searched. Onsearch.cfg
can have as many ExcludeDir entries as necessary.
ExcludeDir /usr/local/apache2/htdocs/onsearch/cache
ExcludeGlob
The value of each ExcludeGlob
directive is a wild-card pattern. OnSearch excludes files with
names that match the entry's pattern from indexing and searching.
ExcludeGlob *.gif
ExcludeGlob *.GIF
# ExcludeGlob *.ps
# ExcludeGlob *.PS
# ExcludeGlob *.pdf
# ExcludeGlob *.PDF
HasSymLinks
Set this value to non-zero if the operating system provides
symlinks. It should not be necessary to change this value.
IndexInterval
The interval, in seconds, between the times that
onindex reindexes the files and directories that
have changed. In a normal configuration, its value is
14400 seconds, or four hours.
IndexNontargetURLs
If set to non-zero, OnSearch also indexes remote site URLs that are
linked to by the site being indexed.
OnSearchDir
The value of OnSearchDir is the name of the
subdirectory where OnSearch is installed. It is normally a
subdirectory of the Web server's DocumentRoot
directory, as in this example.
OnSearchDir /usr/local/apache2/htdocs/onsearch
PageSize
The default number of documents to display on each page of
results. This value is normally saved as a user preference.
PartialWordMatch
This is the default value of the "Partial Word Match"
search option, which is saved as a user preference.
PlugIn
Each PlugIn entry assigns a plugin filter to a
MIME type. The filter extracts the searchable text from
files of that type.
ResultsPerFile
This is the default value of the "Results per file" user
option, which is saved as a user preference.
ResultsPersist
The value is the amount of time, in seconds, that OnSearch
keeps search results.
SearchContext
The value is the number of characters to display on either
side of a matched word or phrase.
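The effect of SearchContext can be sketched in Python as a slice around the match. The function context_snippet is hypothetical and only illustrates the idea of showing a fixed number of characters on either side; it is not OnSearch's display code.

```python
def context_snippet(text, start, end, context):
    """Return the match text[start:end] with up to `context`
    characters of surrounding text on each side."""
    left = max(0, start - context)
    right = min(len(text), end + context)
    return text[left:right]


text = "To be, or not to be, that is the question."
start = text.find("not")
snippet = context_snippet(text, start, start + len("not"), 7)
print(snippet)
```

With a context of 7 characters, the match "not" is displayed as "be, or not to be,". Near the start or end of a document, the snippet is simply shorter.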
SearchRoot
If there is only one SearchRoot entry, its value is normally
the Web server's document directory, the
value of DocumentRoot in httpd.conf.
ServerName
The fully qualified domain name
of the Web server, or its numeric IP address. The value is
the same as the value of ServerName in
httpd.conf.
SingleLetterWords
If set to non-zero, this option instructs onindex to also index
single letter words.
User
The user name of the Web server process owner. This is the same
as the value of User in httpd.conf.
VerboseWebIndexer
If the value is non-zero, OnSearch records information about
individual URLs that users have requested to be retrieved
and indexed.
Volume
The value of each Volume entry is the label of a
volume and the subdirectory of the Web site that the entry refers
to.
WebLogDir
The value is the subdirectory where OnSearch keeps its Web
log. Its value is normally /usr/local/spool/onsearch.
Copyright
OnSearch is copyright ©
2004-2005 by Robert Kiesling, and is distributed
under the terms of the Perl Artistic License.
Refer to the file "Artistic" for information.
Credits