Archive for the 'General' Category

Hadoop Crunches Web-Sized Data — Cloud software — InformationWeek

Monday, November 9th, 2009

Hadoop is a cluster-based software framework for distributed applications that process large volumes (terabytes to petabytes) of distributed data across thousands of nodes.

Hadoop was first developed for the Nutch search engine project, and Yahoo! is a major user and contributor. Yahoo! was at the Cloud Computing Conference and Expo in Santa Clara, California, on November 3rd, explaining how it uses Hadoop to analyze the Web.

According to the article, Yahoo runs Hadoop on thousands of servers to batch-analyze data collected from spidering the web, and uses it to build the indexes behind Yahoo’s search engine. The deployment spans 25,000 servers, organized in clusters of 4,000 machines. The Hadoop file system splits files into chunks of 64 MB or larger.
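To make the chunking concrete, here is a minimal sketch of how a file's size maps onto fixed-size chunks. Only the 64 MB figure comes from the article; the function name `split_into_blocks` and the 200 MB example file are assumptions for illustration, and this is not Hadoop's own API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the minimum chunk size quoted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    mb = 1024 * 1024
    # An assumed 200 MB file, chosen only to show a short final block.
    for offset, length in split_into_blocks(200 * mb):
        print(f"offset {offset // mb:4d} MB, length {length // mb} MB")
```

Because each chunk is an independent unit, different machines in a cluster can hold and process different chunks of the same file.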

Hadoop Crunches Web-Sized Data — Cloud software — InformationWeek.

What is intelligence?

Tuesday, December 23rd, 2008

In 1980 I worked with a mainframe programming language called ‘APL’. You could write little functions and execute them interactively by typing their names at the command line. Over the course of a few months, you could write a lot of little programs – too many to keep track of.

Once, in the early morning hours of a long night, I sleepily typed in the name of a program I wanted to edit, and pressed the ‘Return’ key. To my shock, the computer replied helpfully:

“You probably meant to type <>….”

Well, this was 1980. Mainframe computers generally didn’t converse, but they could already induce terror. I froze, and I still remember the tingle of fear that ran up my spine, and the seconds that passed as I sat back in my chair.

It didn’t take long, after the first few tentative taps on the keyboard, before I realized what had happened….

Months earlier I had written helper programs named for malformed command strings an operator might mistakenly type. They were intended to guide operators in the right direction. The computer was not exhibiting intelligence – I had been frightened by my own application. It had grown too complex for me to know what to expect.
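The trick is easy to reproduce. Here is a rough sketch in Python rather than APL; the program name `edit_report` and the misspellings are invented for illustration, since the original names are long gone.

```python
def edit_report():
    """Stand-in for a real program the operator meant to run."""
    return "editing report"


def make_hint(intended_name):
    """Build a helper that only tells the operator what they probably meant."""
    def hint():
        return f"You probably meant to type {intended_name}"
    return hint


# The workspace maps every name the operator can type to a program.
# The misspellings below are invented examples.
workspace = {"edit_report": edit_report}
for typo in ("edit_reprot", "edti_report", "editreport"):
    workspace[typo] = make_hint("edit_report")


def run(command):
    """Look the typed name up and execute whatever is stored under it."""
    program = workspace.get(command)
    if program is None:
        return f"unknown command: {command}"
    return program()


if __name__ == "__main__":
    print(run("edit_reprot"))
```

Nothing here is clever: the "helpful" reply is just another program that happens to be stored under the wrong name.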

The experience left me a little less awed by my natural intelligence, and a little more inclined to suspect that what we call ‘intelligence’ may just be a complex deterministic system we don’t comprehend. (Intelligence is therefore, paradoxically, that which we are too stupid to understand.)

It leads me to think that if you want an intelligence, you could start by making a lot of really simple elements, and build on that.

You might develop specialized modules analogous to the development of higher brain functions. Could you end up with an ‘intelligence’?

Let’s consider an OpenIndex scenario – you could have a globe full of distributed, networked computers doing information gathering. You could also have programs written to monitor and respond to the information stream.

If you populated the globe with sensors, indexing robots could visit them and collect information, and you’d have a nervous system of sorts. Well, of course the Internet is analogous to that already, but the Internet is just a network.

With a global network capable of responding to the information, you’d have a sort of stimulus-response system.

Does the sum of the programs running on the servers, perturbed by the data stream, with resultant complexity, lead to a form of ‘intelligence’?

An indexing/search organism, fed by data, driven by inquiries, and responding like an animal: if enough people care about a particular subject, that becomes part of its character: its obsession.

It would live in a strange world, buffeted by drifting tides of information trends, seeking out ever more information, breathing in queries, and exhaling responses, building up its corpus of knowledge.

That would be sort of cool. But is it intelligence?

For me, it would be darn close, and I’d argue it’s just a matter of degree.

In related news, Intel wants to build self-powered sensors:

“We could, in fact, litter the planet with these things,” …. “Rather than depend on satellite information, we could literally get instantaneous, near-global indication of the state of the planet.”

Wikiasari forum

Thursday, February 8th, 2007

Jimmy Wales of Wikipedia fame wants to start up an open source, for-profit search engine based on Nutch to compete with Google.

There’s a community forum available:

Forum:Index – search

Common Spidey Sense

Tuesday, December 5th, 2006

We’ve come to accept over time that spiders visit this site as often as humans do. Judging by the number of spiders that seem to live here, you’d think it was a cave.

The majority of the spiders we recognise, and we appreciate the attention. Google and Yahoo! come here all the time. It makes us proud (hi guys!).

But along with those mechanical spiders, we also get visits from a variety of baby-bots and wanna-bots who rummage through the site heavily for a while and move on, and others who strike repeatedly. They don’t say who they are, or what they are doing.

Many of them are right not to advertise, because they’d get banned right off – those include spam e-mail harvesters from Brazil, and self-appointed cyber-cops like Cyveillance, who come to sniff around to see if we might be offending them.

Those we ban as quickly as we can – we don’t like spammers or bullies (bad bullybot!).

But we also get bots run by regular, decent folk who just want to keep up on what’s going on here.

The problem is that some do it hourly for months on end, and if you are a regular reader of this site, you know one thing for sure – they are wasting their time, and our bandwidth, because this site only gets updated about once every three months.

There are web site managers who jealously guard their web sites, and who go through their stats looking for abusers. There are discussions and exchanges about who certain IPs are, and what they are up to – so bad bots can get a reputation.

A bot might be banned if it sticks its head up in the stats for things like bandwidth consumption, number of hits, frequency of visits, and so on.
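For the curious, here is a rough sketch of how that kind of stats check might look. The thresholds, the record layout, and the function name `find_suspects` are all assumptions for illustration; real log formats and sensible limits vary from site to site.

```python
from collections import defaultdict

# Hypothetical limits per reporting period; tune them for your own site.
MAX_HITS = 1000
MAX_BYTES = 50 * 1024 * 1024
MAX_VISITS = 100  # distinct hours in which the client showed up


def find_suspects(records):
    """records: iterable of (ip, hour, bytes_sent) tuples from an access log.

    Returns a dict mapping each flagged IP to the list of limits it exceeded.
    """
    hits = defaultdict(int)
    traffic = defaultdict(int)
    hours = defaultdict(set)
    for ip, hour, bytes_sent in records:
        hits[ip] += 1
        traffic[ip] += bytes_sent
        hours[ip].add(hour)

    suspects = {}
    for ip in hits:
        reasons = []
        if hits[ip] > MAX_HITS:
            reasons.append("hits")
        if traffic[ip] > MAX_BYTES:
            reasons.append("bandwidth")
        if len(hours[ip]) > MAX_VISITS:
            reasons.append("frequency")
        if reasons:
            suspects[ip] = reasons
    return suspects
```

A bot that drops by once an hour for a week would trip the frequency limit here even if each visit is tiny, which is the hourly-visitor problem described above.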

A bot will especially attract notice if it doesn’t respect robots.txt, doesn’t introduce itself, falsifies information, comes from a bad neighbourhood, has a bad reputation, or drags a site down to a crawl.
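For bot authors who would rather stay off the ban list, checking robots.txt takes only a few lines with Python's standard `urllib.robotparser` module. This is a minimal sketch: the rules, the bot name `ExampleBot/1.0`, and the URLs are made up, and a real bot would download the site's own robots.txt instead of using a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Example rules a site owner might publish; invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/notes.html"):
        url = "http://example.com" + path
        print(path, may_fetch("ExampleBot/1.0", url))
```

Sending an honest User-Agent string that says who you are takes care of the "doesn’t introduce itself" complaint as well.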

Bot banning could become a bigger issue in the future as more and more bots are unleashed, and the Internet becomes clogged with spiders, pre-fetchers, harvesters, comment spammers, scrapers, and other critters.

It’s possible that there will come a time when bots are automatically banned at first sight, and the sub-uber bot (come on, say it out loud) will need to beg for an invitation.

Predictions for a Web 2.0 social experience

Saturday, November 18th, 2006

Right on brother!

“The next killer app isn’t an app.

It will be a new networking platform that builds on today’s world-wide web and makes possible new generations of more powerful and useful applications.”

What distributed open source search lacks in storage space and speed, it can make up for in processing power. What to do… what to do….

Predictions for a Web 2.0 social experience