Design for a Public Index

Indexing is a big job. It takes a great deal of computing power and storage space.

An index needs to be fast. It has to be accessible, reliable, and accurate; and it has to be able to reach every resource on the Internet.

A ‘public’ index is one that is owned and operated by the public. It’s a community service. Usually that means there is little money, few resources, and no dedicated infrastructure, and that support is unreliable and limited. How can we design an index that can be run without great expense, using absurdly limited resources?

Distributed

Although we likely wouldn’t have large computers available, we could have many small ones, contributed by interested volunteers and distributed across the community – even across the globe. Perhaps that’s the only way to have a publicly owned and operated index – it certainly seems appropriate.

A distributed system of servers would apportion the tasks of running an index among themselves. This would create a massive system of computers running in parallel, performing tasks as they are required. Costs, too, would be distributed among the servers.

Decentralized

In this design, there is no center to the system, and there is no top or bottom. Although there are servers running all sorts of different tasks, no server is essential and no server is uniquely controlling.

Servers and all of the index’s activities and services would be distributed over the globe like a mist or constellation, each connected to the other, exchanging information and coordinating tasks.

Cooperative

There could be different types of server. Some servers would search for and process data (e.g. web pages), while others would manage searches, act as database servers, or perform other tasks.

Communications between the servers would manage load balancing and task assignment.
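
As a minimal sketch of what that coordination might look like – assuming each node periodically broadcasts a simple load report, with every name here (LoadReport, Node, assign_task) a hypothetical illustration – a node could hand each task to the least-loaded peer advertising the needed role:

    from dataclasses import dataclass, field

    @dataclass
    class LoadReport:
        node_id: str
        role: str           # e.g. "crawler", "query", "database"
        pending_tasks: int  # queue depth the peer last reported

    @dataclass
    class Node:
        node_id: str
        peers: dict = field(default_factory=dict)  # node_id -> LoadReport

        def receive_report(self, report: LoadReport) -> None:
            # Record a peer's latest load report.
            self.peers[report.node_id] = report

        def assign_task(self, role: str) -> str:
            # Hand the task to the least-loaded peer advertising the role.
            candidates = [r for r in self.peers.values() if r.role == role]
            if not candidates:
                raise LookupError(f"no peer advertises role {role!r}")
            return min(candidates, key=lambda r: r.pending_tasks).node_id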

Self-Sufficient

The system could be designed so that any node could take over any aspect of the network.

If the Internet were fractured into many independent and isolated fragments, the surviving servers could reconfigure themselves to form many ‘island’ networks.

When the networks were rejoined, they would integrate into one large system again.
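
A minimal sketch of how that reintegration might work, assuming each index entry carries a timestamp so the fresher copy wins when two islands have both seen the same resource (the record layout is hypothetical):

    def merge_indexes(a: dict, b: dict) -> dict:
        # Merge two url -> (timestamp, entry) stores, keeping newer entries.
        merged = dict(a)
        for url, (ts, entry) in b.items():
            if url not in merged or ts > merged[url][0]:
                merged[url] = (ts, entry)
        return merged

    # Two islands indexed overlapping material while separated...
    island_a = {"example.org/": (1010, "entry crawled by island A")}
    island_b = {"example.org/": (1020, "fresher entry from island B"),
                "example.net/": (1005, "only island B saw this")}
    rejoined = merge_indexes(island_a, island_b)  # B's fresher copy wins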

The result would be a system which is highly resistant to destruction and can provide outstanding coverage of global resources.

Regional Servers?

One idea is that there could be a ‘ring’ of regional caching servers, containing frequently requested information, acting as a buffer between users and the core database servers. Users’ search requests would be directed to their regional servers; those servers would respond with whatever information they had, forwarding the queries on to the core database servers as required.
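
In rough outline – with query_core_servers standing in, hypothetically, for the network call to the core database layer – a regional server might behave like this:

    class RegionalServer:
        def __init__(self, query_core_servers):
            self.cache = {}                   # query -> cached results
            self.query_core = query_core_servers

        def search(self, query: str) -> list:
            if query in self.cache:           # frequently requested: serve locally
                return self.cache[query]
            results = self.query_core(query)  # cache miss: forward to the core
            self.cache[query] = results       # keep a copy for future requests
            return results

A second identical search would then be answered entirely from the regional cache, never touching the core servers.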

A regional or local server may be something a home user, school, or company (such as an ISP) could set up. Given some configuration by its operators, the server would start up and immediately begin building a cache database from standard core material and material selected by the operators or users.

Localized Servers?

If regional servers are truly geographically regional, then their caches could also contain local news and information.

If a corporation or other group set up a regional server, they could establish it as the authoritative indexing server/spider for their own web site, putting indexing and update schedules in the hands of the company or group.

Servers could be configured to specialize in certain subject areas, so that people could benefit from fast and complete responses in their areas of interest.
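
Routing by subject could be as simple as a table mapping subjects to the servers that specialize in them – the taxonomy and server names below are, of course, invented:

    # Hypothetical subject -> specialist-server table for query routing.
    SPECIALISTS = {
        "astronomy": ["node-17", "node-42"],
        "medicine":  ["node-08"],
    }

    def route(subject: str, default: str = "node-00") -> list:
        # Prefer specialists for the subject; fall back to a general node.
        return SPECIALISTS.get(subject, [default])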

Robustness = Redundancy

Robustness would be provided through the redundancy in the distributed network. In this design, the database itself would be spread across many machines throughout the world. Each database server would contain only part of the overall database, with each data store being somewhat different. If a database server were to drop off the network, some data would be lost – but only a small amount relative to the overall store on the network, and it could quickly be restored from redundant copies elsewhere.
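
One way to get that redundancy – a sketch only, using rendezvous (highest-random-weight) hashing to pick a few holders for each record, so any one server can vanish without permanent loss:

    import hashlib

    def place(record_key: str, servers: list, replicas: int = 3) -> list:
        # Rank servers by a hash of (server, key); take the top few.
        ranked = sorted(servers, key=lambda s: hashlib.sha256(
            (s + record_key).encode()).hexdigest())
        return ranked[:replicas]

    servers = [f"db-{i:02d}" for i in range(20)]
    holders = place("example.org/page", servers)  # 3 holders; any one may drop off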

Because of the amount of data stored on them, and the need for fast response times, database and caching servers have to be stable and reliable. Perhaps load balancing could be used to identify the most reliable servers over time and designate them as database servers.

Approaching Infinity – Six Degrees of Separation

A complete search of the index’s database would require querying all of the network’s database servers. Given that there might be thousands of them, each with slightly different information, undergoing constant updating, a complete search would be time-consuming – if it were even possible!

The six degrees of separation principle holds that any one person knows somebody who knows somebody who knows somebody… who knows any other person in the world. Perhaps a search engine could follow a similar trail from one index server to another.
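
A minimal sketch of such a trail-following search, over a toy in-memory ‘network’ (real servers would answer over the wire), stopping after six hops:

    def trail_search(start, query, network, max_hops=6):
        # Follow referrals server-to-server, at most max_hops deep.
        visited, frontier, hits = set(), [start], []
        for _ in range(max_hops):
            if not frontier:
                break
            next_frontier = []
            for server in frontier:
                if server in visited:
                    continue
                visited.add(server)
                data, referrals = network[server]  # (results held, known peers)
                hits.extend(r for r in data if query in r)
                next_frontier.extend(referrals)
            frontier = next_frontier
        return hits

    network = {"A": (["page about comets"], ["B"]),
               "B": (["comets and meteors"], ["C"]),
               "C": ([], [])}
    trail_search("A", "comets", network)  # finds hits two hops away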

More Meta Servers

By designing servers that have some idea of the information they hold and the information other servers hold, it should be possible to reply quickly to a search with results and referrals to other servers. The person doing a search would query a local server and that server would reply with the information it had, and also query other servers it suspected held more information.

Searches would extend out beyond the user’s closest assigned server to other servers, intelligently seeking answers.
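
A sketch of such a reply – here each server keeps a plain set of keywords it believes each peer holds; a real system might use something more compact, such as Bloom filters:

    class MetaServer:
        def __init__(self, local_index, peer_summaries):
            self.local_index = local_index        # keyword -> local results
            self.peer_summaries = peer_summaries  # peer_id -> keywords it claims

        def search(self, keyword):
            results = self.local_index.get(keyword, [])
            referrals = [peer for peer, kws in self.peer_summaries.items()
                         if keyword in kws]
            return results, referrals  # the caller may query referrals next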

Jurisdictions and Censorship

If an index is distributed and uniform, there is no obvious focal point for the imposition of censorship.

As servers would be under the control of the volunteers who set them up, and subject to the same local laws and regulations as those volunteers, it would be prudent to allow the volunteers to limit what kinds of information pass through their servers.

This is undeniably a form of censorship, but only on a local basis – assuming a network spreads across multiple jurisdictions, it would be possible for servers outside a censoring jurisdiction to carry information which is censored elsewhere, making the network uncensored as a whole, yet locally compliant.
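
In practice this might amount to nothing more than a per-node, operator-maintained blocklist – the terms and entries below are placeholders:

    LOCAL_BLOCKLIST = {"locally-restricted-term"}

    def passes_local_policy(entry: str) -> bool:
        # True if this node's operator allows the entry to pass through.
        return not any(term in entry for term in LOCAL_BLOCKLIST)

    incoming = ["an ordinary index entry",
                "an entry containing a locally-restricted-term"]
    outgoing = [e for e in incoming if passes_local_policy(e)]  # local policy only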

Open Source = …Open

If the software used for the network is open-source, then it would be possible for individuals to apply or remove constraints on the software on their own, and continue to link to an established indexing network, or to establish their own. Even in the case of a global imposition of censorship on a system, it would be possible for individuals acting cooperatively or independently to support or overcome that censorship as they see fit.

…But Secure?

Security issues present one of the greater long-term challenges to such a project – an open, volunteer-run network would face continual ‘challenges’ to the system, from malicious or compromised nodes to attempts to skew the index.

And Private?

In tension with privacy, there are legitimate reasons to capture users’ search queries and results: to build caches of frequently requested resource information, and to mine for previously undiscovered resources. How can these pressures be reconciled?

Who Can Do It?

There will always be grassroots interest in Internet search and indexing. Different groups will have their own reasons, whether hobbyist or activist. There is a growing body of available software and other resources, and the speed and capacity of computers and networks keep growing.

There are always groups working on some aspect of the job.

In the Links section you will find external links to projects from the past and to those currently under way – projects that could use your support – as well as a body of resources if you decide to build your own….