Metacrap

November 4th, 2004

A 2001 essay by Cory Doctorow on problems with metadata. We’re lazy, stupid and can’t be trusted. Seems obvious, really.

Metacrap

But you know - it’s true.

So many ideas presume that users will behave themselves, when the rule is that they won’t. And while users may be stupid, dishonest, and lazy, let’s not forget that users will also be smart and deliberately destructive. So it makes sense to design systems on the premise that people will abuse, misuse, and confuse them. If a design depends on the users’ (or operators’) behaviour, then it’s not robust enough.

State control of the internet in China

September 27th, 2004

People’s Republic of China: State control of the Internet in China. - Amnesty International

A reminder not to take freedom for granted: this Amnesty International release calls attention to 33 “prisoners of conscience who have been detained for using the Internet to circulate or download information.”

“The authorities have introduced scores of regulations, closed Internet cafes, blocked e-mails, search engines, foreign news and politically-sensitive websites, and have recently introduced a filtering system for web searches on a list of prohibited key words and terms.”

Frankly, some of China’s rules for use of the Internet make sense. After all, what countries permit the distribution of state secrets on the Internet? Wouldn’t it be nice to live in a place without shit-disturbers? Hell, maybe my life would be better without porn…. Sigh. It would be nice to just let someone else take over and make my life a smooth journey to the end. Soothe away the worries, let me rest a while in the enveloping cocoon of an overseeing government that worries about things so I don’t have to.

Ah, hell, we all know that won’t work: if it’s not a house of cards, it’s a work camp.

Individuals and states live in a precarious balance of individual versus state interests. When states try to manufacture reality, the individual is in peril. Controlling access to information is part of manipulating the individual.

I can see you don’t want recipes for nerve gas being distributed around the Internet, but I don’t know who I’d trust to keep that information for me (or from me).

How do you balance the interests of individuals and states? How can you ensure that an individual has free access to information and at the same time ensure the rest of us are safe from him?

Recently (Google news service in China blocks banned sites), Google dropped sites banned by China in results returned to Chinese searchers. Their argument was that if people can’t get to the banned sites, then there is no point in including them in the results.

That’s considerate on the surface, but it also applies the finishing touch to the censorship - now people will not even know what they cannot know.

As I see it, if an index is distributed among many independent machines, then the operators of those machines can decide what they want to allow. Anyone on any of the other machines can subvert or support that by having his own set of rules. In the end, what’s available is a matter of consensus.

If information is not widely distributed throughout the (distributed) index, then the number of hits will go way down.
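To make the idea concrete, here is a minimal sketch of an index spread across independently operated machines, each applying its own rule about what it will carry. Everything in it is hypothetical and invented for illustration: the `Node` class, the `search_all` function, the per-operator `allows` rule, and the sample documents do not describe any real system. It only shows how a document that few operators accept ends up with fewer hits.

```python
from typing import Callable, Dict, List, Set


class Node:
    """One independently operated machine holding its own copy of the index (hypothetical)."""

    def __init__(self, name: str, allows: Callable[[str], bool]) -> None:
        self.name = name
        self.allows = allows  # the operator's own rule: text -> keep it or not
        self.postings: Dict[str, Set[str]] = {}  # term -> ids of documents containing it

    def add(self, doc_id: str, text: str) -> bool:
        """Index the document only if this operator's rule permits it."""
        if not self.allows(text):
            return False
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        return True

    def search(self, term: str) -> Set[str]:
        return self.postings.get(term.lower(), set())


def search_all(nodes: List[Node], term: str) -> Dict[str, int]:
    """Ask every node and count how many of them return each document."""
    hits: Dict[str, int] = {}
    for node in nodes:
        for doc_id in node.search(term):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return hits


# Three operators: one carries everything, two refuse anything containing "banned".
nodes = [
    Node("a", lambda text: True),
    Node("b", lambda text: "banned" not in text.lower()),
    Node("c", lambda text: "banned" not in text.lower()),
]

docs = {"d1": "weather report", "d2": "banned weather pamphlet"}
for node in nodes:
    for doc_id, text in docs.items():
        node.add(doc_id, text)

hits = search_all(nodes, "weather")
print(hits)  # d1 is returned by all three nodes, d2 only by node "a"
```

In this sketch nobody removes d2 outright. It stays findable as long as one operator carries it, but its hit count reflects how many operators were willing to.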

But that’s not enough. In an unregulated index, all sorts of information will get through, and it’s not always a good thing.

Imagine if the recipe for Ice-9 ever got out on the Internet.

That’s the formula Kurt Vonnegut writes about in Cat’s Cradle which has the potential to freeze all of the earth’s water, and destroy life itself.

Some guy would post it on his web site, and a spider would find it. Recognizing it as a key document, the index would rank it high enough that it would come out at the top of an innocent search for Ice-9 by a nine-year-old girl living in Nebraska and doing an essay on Vonnegut for her school’s winter arts festival.

Her tragic and bizarre death at the kitchen sink while trying to make a batch of Ice-9 would be reported on in the local papers and would be picked up on by 20% of all teenaged boys who would do searches for Ice-9 in the index and post the information on their own web pages along with an infinite variety of misspellings of the word ‘cool’, thus boosting the ranking of “Ice-9” in the index even more.

It would be a very short period of time before we were all dead - all because of a lack of effective regulation on the distribution of information.

So on one hand we have excellent reasons to promote the free exchange of information, and on the other we have equally excellent reasons to restrict it.

And while it’s important to keep certain information from reaching certain people, it’s also important to make sure that bad people never, ever, discover each other and form a club.

What are the individual’s fundamental rights in this case? Individual freedom of speech? The right of association? Is there a limit? Is the Internet a Pandora’s box of horrors?

Is an un-censorable index a good thing? I think the immediate response usually is: “can you trust the people who do the censoring?” History indicates the answer to that one is “no.”

American Civil Liberties Union : The Surveillance-Industrial Complex

August 24th, 2004

The full title is “The Surveillance-Industrial Complex: How the American Government Is Conscripting Businesses and Individuals in the Construction of a Surveillance Society,” and, well, the fourth conclusion is that the ACLU is leading the fight.

The expansion and aggregation of data about citizens by corporations, combined with post-9/11 paranoia, is leading to the development of dossiers on everyone, eagerly consumed by governments, law enforcement agencies, and corporations beyond the constraints laid down by law and constitution.

The technology is in place, and the collections are for sale to anyone. The government wants to use them to fight terrorism by treating everyone as suspects. Is this the end for the concept of the free and private citizen? Is this the route to a police state?

For those concerned about data collections being used to spy on Americans (and Canadians, among others), this is an interesting and disturbing 33-page report.

American Civil Liberties Union : The Surveillance-Industrial Complex

The relevance to indexing and search is that search engine companies can track your searches (at least by IP address), and maintain a database of them. Companies like Google, which offer diverse services such as Froogle, GMail, and Orkut, can track your search terms, correspondence, shopping, and social interactions and combine them with outside data into a huge and comprehensive profile of you and everybody else. The sale of such information is a potentially huge revenue stream for a corporation, and the profile itself is a powerful resource for surveillance, analysis, and control.

How can a public index ensure that privacy is protected?

Doug Cutting Interview

August 15th, 2004

Here’s an interview at Google Blogoscoped with Doug Cutting, who is the principal developer of Lucene and Nutch. He talks about managing spam and distributed search, among other things.

Doug Cutting Interview

KnowItAll

August 14th, 2004

Let’s see if I’ve got this right - KnowItAll, out of the University of Washington, is a search engine that concentrates on extracting lists out of search results:

KnowItAll

There’s a short New Scientist article on it too:

Search engine tackles tricky lists

Is it open source? I don’t know! Although it does or will incorporate Nutch software and shares at least one contributor, nothing indicates it is intended to be open source.