About our Robot and Search Engine [ Robot Tech. Specs. ]
This page contains a wide range of general web search-robot information applicable to any search engine. Here you can learn definitions, how to build simple META tags and robots.txt files, how to do Boolean searches, how web crawlers work, and much more.
Our search robot is registered among the top search engines of the world: a sophisticated software "engine" that runs from our servers, reaches out across the web, and fetches content for our SQL database index.
It is programmed to systematically traverse the World Wide Web's hypertext structure: it retrieves a document, then recursively retrieves all documents linked from within that initial target document. "Recursive" here does not limit the definition to any specific traversal algorithm. Even if a robot applies some heuristic rule to the selection and order of the documents it visits, and spaces its requests out over a long span of time, it is still a "robot". By contrast, normal Web browsers are not robots, because they are operated by a person and do not automatically retrieve referenced documents.
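The traversal just described can be sketched in a few lines. This is a hypothetical illustration, not SpiderMonkey's actual code: the link graph is hard-coded in a dictionary standing in for the pages a real robot would fetch over HTTP and parse for links.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of linked URLs.
# A real robot would fetch each page over HTTP and extract its links.
LINKS = {
    "http://example.com/":  ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(seed):
    """Breadth-first traversal: retrieve a page, then every unseen page it links to."""
    seen, order = {seed}, []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        order.append(url)                  # "fetch" the document
        for link in LINKS.get(url, []):    # follow its hyperlinks
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

crawl("http://example.com/")
# ['http://example.com/', 'http://example.com/a',
#  'http://example.com/b', 'http://example.com/c']
```

A breadth-first queue is just one choice; as noted above, a robot that used depth-first order, a relevance heuristic, or long delays between requests would still be a robot.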
Robots are sometimes referred to as Wanderers, Crawlers, or Spiders. Although arguably apropos, for the lay person these names are a little misleading if they give the impression that the software itself moves between sites like a virus; this is not the case. The robot is software, permanently resident on its own computer, which communicates from that computer its requests for website documents to the other computers (the document servers) on which the target sites reside.

A search engine is a software programme, resident on a computer, that searches through a (usually massive) database. In the context of the World Wide Web, the term "search engine" is most often used for search forms that query databases of HTML documents gathered by a robot.
Like most search engine service providers, for both quality and security reasons we store URLs submitted directly by our visitors in a temporary database before they are finally crawled and entered into the main search engine's index. We allow interested visitors viewing access to the temporary database. Use the "Pre-Index" engine here by entering keywords, entering your site name, or leaving the search field blank; press "Pre-Index" and the engine will show you the entire list of recent submissions. You can see how others describe their sites and get some ideas for your own. If you have submitted your site using our Add URL form, you can check here and see how it looks. If you don't like it, remember that the final index entry will be derived from your web page, so spend your time working on your web page and its meta tags instead of resubmitting.
Our crawler (SpiderMonkey) visits and checks URLs during server off-peak load times and feeds the results to the index. All realms of the main database are refreshed no less often than every 30 days. The temporary database is crawled at least twice monthly; although each URL is fetched from the live site, every entry remains here for roughly 60 days so that you can verify when and how it was submitted. Note: URLs submitted to our own Site Submit Service, or submitted remotely by other authorized servers, do not appear in the temporary database but can be found using the SpiderMonkey Search Engine.
SpiderMonkey abides by the Robot Exclusion Standard. Specifically, SpiderMonkey adheres to the 1994 Robots Exclusion Standard (RES); where the 1996 proposed standard supersedes the 1994 standard, the proposed standard is followed.
SpiderMonkey will obey the first record in the robots.txt file with a User-Agent containing "SpiderMonkey". If there is no such record, it will obey the first record with a User-Agent of "*".

Before you submit your site for inclusion in our database (index), are there pages you don't want indexed? If so, put the following in the head of any web page you want excluded. Our crawler (SpiderMonkey) will obey this instruction and skip the document.
<META NAME="robots" CONTENT="noindex,nofollow">

Another way to block a robot's crawl is with a simple file (robots.txt) in the top-level directory of your site (e.g. www.domainname.com/robots.txt). It is important that every web site have a robots.txt file in its root directory, to avoid numerous 404 errors and to make the site more "robot-friendly".
We offer a resource for generating your robots.txt file, but suggest you read and understand the following first.

# EXAMPLE robots.txt
User-agent: * # You can enter a specific user-agent (spider's name) or "*"
Disallow: /cgi-bin/
Disallow: /cgi-win/
Disallow: /tmp/
Disallow: /images/
Disallow: /includes/
Disallow: /public/~specific-user/
- - - - -
# EXAMPLE robots.txt to exclude a single robot
User-agent: ID-of-BadRobot
Disallow: /
- - - - -
# EXAMPLE robots.txt to allow a single robot
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /
- - - - -
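Python's standard library can evaluate rules like the examples above. Here is a minimal sketch using `urllib.robotparser` with the "allow a single robot" file; the site and page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt like the example above: WebCrawler allowed, all others blocked.
ROBOTS_TXT = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

rp.can_fetch("WebCrawler", "http://example.com/page.html")    # True
rp.can_fetch("SpiderMonkey", "http://example.com/page.html")  # False: falls to "*"
```

A well-behaved robot performs exactly this check before fetching each document: it looks for a record matching its own name and, failing that, falls back to the "*" record, just as SpiderMonkey does.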
Do you use meta content tags? There are two extremely important page header tags: the page's title and its description. Set them out as succinctly as possible; if present, the description becomes the introduction to your page in the search results our visitors see. An example follows:
<html>
<head>
<title>My Important Web Page's Name</title>
<meta name="Description" content="Learn, laugh and enjoy at the same time. International Information Technology firm has superb entertainment website for clients, employees and guests.">
</head>
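An indexer reads these tags much the way a browser does. Below is a minimal sketch, using Python's `html.parser`, of pulling the title and description out of a page head; the sample page is adapted from the example above, and none of this is SpiderMonkey's actual code.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Pull the <title> text and meta description out of a page header."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = """<html><head>
<title>My Important Web Page's Name</title>
<meta name="Description" content="Learn, laugh and enjoy at the same time.">
</head></html>"""

parser = MetaExtractor()
parser.feed(page)
parser.title        # "My Important Web Page's Name"
parser.description  # "Learn, laugh and enjoy at the same time."
```

If the description tag is missing, `parser.description` stays empty, and an indexer must fall back to text from the page body; that is why a succinct description tag gives you control over how your listing reads.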
©2004 SpiderMonkey Search
Search Engine Help for Doing Boolean and Simple Searches

Our search engine finds documents throughout the World Wide Web. Here's how it works: you tell our search engine what you're looking for by typing keywords, phrases, or questions into the search box. SpiderMonkey responds with a list of all the Web pages in our crawler's index relating to those topics, with the most relevant content at the top of your results. Most foul language is ignored by SpiderMonkey, so it is not a useful tool for seeking out porn sites.
Type the word or phrase you seek into the text-entry box. When searching, think of a word as a combination of letters and numbers. You can tell SpiderMonkey how to distinguish words and numbers you want treated differently.
Doing Boolean Searches With SpiderMonkey
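Boolean operators map directly onto set operations over an inverted index. The toy sketch below (the documents and their titles are invented) intersects posting sets for AND and subtracts them for NOT:

```python
# A tiny document collection (made-up titles).
DOCS = {
    1: "george bush biography",
    2: "bush gardening tips",
    3: "george washington history",
}

# Inverted index: map each word to the set of document ids containing it.
index = {}
for doc_id, text in DOCS.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search(include=(), exclude=()):
    """AND together the `include` words, then NOT out the `exclude` words."""
    hits = set(DOCS)
    for word in include:
        hits &= index.get(word, set())   # AND: intersection
    for word in exclude:
        hits -= index.get(word, set())   # NOT: difference
    return sorted(hits)

search(include=["bush"])                         # [1, 2]
search(include=["george", "bush"])               # [1]
search(include=["bush"], exclude=["gardening"])  # [1]
```

OR would be the third set operation (union); real engines add ranking on top, so the most relevant of the matching documents appears first.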
What is a Web Crawler's "Index"?

SpiderMonkey's index is a large, growing, organized collection of data: Web pages of various types, their content and location, as well as discussion group pages from around the world. The index becomes larger every day as people send us the addresses of new Web pages and as our administrators search for new material. We own sophisticated technology that crawls the World Wide Web daily, during lower server load periods, looking for links to new pages. When you use the SpiderMonkey search engine, you search the entire collection using keywords or phrases, just like other search engines such as Google, Yahoo or AltaVista.
Some Terminology Related to Search Engines

Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.

Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself.

Full-text index: An index containing every word of every document cataloged, including stop words (defined below).

Fuzzy search: A search that will find matches even when words are only partially spelled or misspelled.

Index: The searchable catalog of documents created by search engine software. Also called "catalog." Index is often used as a synonym for search engine.

Keyword search: A search for documents containing one or more words specified by a user.

Phrase search: A search for documents containing an exact sentence or phrase specified by a user.

Precision: The degree to which the documents a search engine lists actually match the query. The higher the proportion of listed documents that match, the higher the precision. For example, if a search engine lists 80 documents as matching a query but only 20 of them contain the search words, the precision is 25%.

Proximity search: A search in which users specify that the documents returned should have the words near each other.

Query-By-Example: A search in which a user instructs an engine to find more documents similar to a particular document. Also called "find similar."

Recall: Related to precision, this is the degree to which a search engine returns all the matching documents in a collection. There may be 100 matching documents, but a search engine may find only 80 of them. It would then list these 80 and have a recall of 80%.

Relevancy: How well a document provides the information a user is looking for, as measured by the user.

Spider: The software that scans documents and adds them to an index by following links. Spider is often used as a synonym for search engine.
Stemming: The ability of a search to include the "stem" of words. For example, stemming allows a user to enter "swimming" and also get back results for the stem word "swim."

Stop words: Conjunctions, prepositions, articles and other words, such as AND, TO and A, that appear often in documents yet alone carry little meaning.

Thesaurus: A list of synonyms a search engine can use to find matches for particular words when the words themselves don't appear in documents.
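Precision and recall, as defined in the glossary, are simple ratios over sets of document ids. The sketch below is illustrative only; the document-id ranges are invented so that the numbers reproduce the glossary's precision example (80 listed, 20 matching).

```python
def precision(listed, relevant):
    """Fraction of listed documents that actually match the query."""
    listed, relevant = set(listed), set(relevant)
    return len(listed & relevant) / len(listed)

def recall(listed, relevant):
    """Fraction of all matching documents that were listed."""
    listed, relevant = set(listed), set(relevant)
    return len(listed & relevant) / len(relevant)

# The glossary's precision example: 80 documents listed, only 20 of them match.
listed = range(1, 81)       # doc ids 1..80 returned by the engine
relevant = range(61, 161)   # 100 matching docs; ids 61..80 overlap the listing
precision(listed, relevant)  # 0.25 -> 25%
recall(listed, relevant)     # 0.2  -> only 20 of the 100 matches were found
```

The two measures pull against each other: listing everything maximizes recall but ruins precision, while listing only sure matches does the reverse, which is why engines are judged on both.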