Links
Links on this page go to outside sites. There are links to general topics, theory and technology, and to actual projects, usually current ones, although some past projects are included for interest and perspective. If you have links you would like to contribute, please send them in.
General How-Tos and Search Engine Optimisation
- About.Com - Web Searching and Search Engine Optimisation.
- Learn the Net, a guide to the Internet and World Wide Web: The Animated Internet: How search engines work.
- American Society of Indexers - Indexes and Search Engines.
- High Rankings Search Engine Optimization.
Indexing and Searching the Web
- 'Designing a Regional Crawler for Distributed and Centralized Search Engines.'
- Metrics - Internet Domain Survey - The Domain Survey attempts to discover every host on the Internet by doing a complete search of the Domain Name System.
- YAHOO! Directory - Searching the Web.
- Search Engine Showdown - Search Engine Statistics.
- The Anatomy of a Large-Scale Hypertextual Web Search Engine (PDF).
- Indexing, indexing, indexing - from 2001 at www.infomotions.com.
- Indexing the Web.
- Design Issues for the World Wide Web - Tim Berners-Lee - 1998.
- Challenges in Web Search Engines - PDF - 2002.
Societies, Journals and Web Sites on Indexing
- Indexing and Abstracting Society of Canada.
- The British and Irish professional body for indexing.
- The American Society of Indexers.
- Search Engine Watch.
- ResearchBuzz - Obsessed with search engines, databases, etc. since 1998.
Political and Social Issues
- Project Censored - The Mission of Project Censored is to educate people about the role of independent journalism in a democratic society and to tell The News That Didn't Make the News and why.
- The Internet Under Surveillance (Search Engine Watch 2003).
- Search Privacy at Google and Other Search Engines - From Search Engine Watch - April 2003.
- Search Engines and Editorial Integrity.
- Report Shows Confusion Over Paid Listings.
- Proxy Blind - Staying anonymous in the Age of Surveillance.
- Peacefire has instructions for circumventing Internet censorship.
- Computer Professionals for Social Responsibility.
- Google Watch - A look at how Google's monopoly, algorithms, and privacy policies are undermining the Web.
Technologies - Search and Indexing
- Structuring and Indexing the Internet (1996).
- Search Engine Watch's Search Engine Technology Section.
- Indexing the Internet - Human-powered cataloging systems remain essential tools for indexing the Internet.
- Dublin Core metadata is specifically intended to support resource discovery.
Reviews and Compilations
- Comparing Open Source Indexers.
- Indexing and Search Projects at SourceForge.Net.
- Open Directory Project - Open Source: Search Engines.
- SearchTools.Com - Search Tools with Open Source Code (circa 2002 compilation).
Open Source - Development, Links, etc.
- OpenSourceSearch - A resource for open source development of search engines.
- OSDN - Open Source Development Network.
- Open Source Initiative (OSI) is a non-profit corporation dedicated to managing and promoting the Open Source Definition for the good of the community.
Licenses - Copyright and Patents
- Open Source Initiative - Licenses for "OSI Certified Open Source Software."
- Free Software Foundation - Software Licenses.
- Free Software Foundation - Software Documentation Licenses.
- Creative Commons - Use the Copyright Chooser for your music, writings, and other 'creative content.' (They suggest you use OSI or GNU licenses for software-related copyright.)
Projects - Mostly Open Source Spiders and Crawlers and Search Engines - Then and Now
Search Engines and Indexing - A Wide Range - Intranet and Internet
- DataparkSearch Engine is a full-featured open source web-based search engine.
- DSpace digital repository system captures, stores, indexes, preserves, and distributes digital research material.
- ASPseek is a multi-site search engine, written in C++ using the STL. It consists of an indexing robot, a search daemon, and a search frontend (CGI or Apache module).
- Egothor is an Open Source, high-performance, full-featured text search engine written entirely in Java.
- ht://Dig is a system for indexing and searching a finite (not necessarily small) set of sites or an intranet.
- KartOO is a proprietary (but cool!) metasearch engine with visual display interfaces.
- Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
- MKSearch is a research project to develop a metadata search engine.
- Mobilemaps is a nearby engine, which lets users physically locate information.
- mozDex is a search engine based on Nutch and Lucene.
- mnoGoSearch (formerly UDMSearch) is web search engine software.
- Namazu is a full-text search engine intended for easy use. Not only does it work as a small or medium scale Web search engine, but also as a personal search system for email or other files.
- Nutch is an effort to implement an open-source web search engine. Nutch builds on Lucene Java to provide web search application software.
- OpenFTS (Open Source Full Text Search engine) is an advanced PostgreSQL-based search engine that provides online indexing of data and relevance ranking for database searching.
- OpenWebSpider is an Open Source multi-threaded Web Spider (robot, crawler) and search engine.
- PhpDig is a web spider and search engine written in PHP, using a MySQL database and flat file support.
- Red-Piranha is an open source search system that can 'learn' what you are looking for.
- Senas is an open source search engine created from scratch in Perl.
- Swish-e is a fast, flexible, and free open source system for indexing collections of Web pages or other files. Swish-e is ideally suited for collections of a million documents or smaller.
- Swoogle is a crawler-based indexing and retrieval system for Semantic Web documents in RDF or OWL.
- Terrier is software for the rapid development of Web, intranet and desktop search engines.
- Webglimpse and Glimpse: Unix-based search software, website index, intranet search software.
- Xapian is an Open Source Probabilistic Information Retrieval library you can add to your own applications.
- YaCy is a P2P-based distributed Web search engine.
- Zebra is a high-performance, general-purpose structured text indexing and retrieval engine.
- Alvis conducts research in the design, use and interoperability of topic-specific search engines with the goal of developing an open source prototype of a distributed, semantic-based search engine.
Peer to Peer
- The Free Network Project - Freenet!
- Hadoop is a distributed computing platform used by Nutch.
Clustering
- Carrot2 is a research framework for experimenting with automated querying of various data sources (such as search engines), processing search results and visualization. Carrot2 was primarily built with search results clustering in mind, but it can be configured to do other things.
Crawlers and Spiders
- OpenCola Folders - A distributed tool for spidering the Internet and locating new documents.
- Grub is a distributed web crawler. It's alive! It's alive!
- The WIRE project is an effort started by the Center for Web Research to create an information retrieval application designed for use on the Web.
- SearchTools.com - All About Search Indexing Robots and Spiders.