Curation: A History Part V

The Age of Search

This is part 5 of a 7-article series that examines the technological history and theory behind the curation of human thought and discovery.

The Internet, and its subsequent evolution into the Web, led to an explosion of curation efforts on an unthinkable scale. Personal curation devices in the form of cameras, recorders, computers, and smartphones made curation accessible to the masses. The problem was no longer getting your hands on all of this human knowledge; it became a question of how to sort through it all.

For as long as humans have been curating, we have also been indexing: labeling, organizing, and storing information in a way that makes it easy to find the right information at the right time.

It was this need for indexing that spurred the efforts of countless engineers and researchers, efforts that would give us the search experience we rely on today.

And it is the history of those efforts we shall turn to next.

The Early Days

In 1987, the School of Computer Science at Canada’s McGill University launched a project to connect the school to the Internet. While the Internet was already being used by many other universities and research teams around the world, this particular project would lead to the creation of the Archie (short for ‘archive’) search engine, which let users search the file listings of public servers by name.

While the search functionality provided was nothing compared to what most people think of today when it comes to search, this was an important first step. Disparate curated information could be tracked and held in a single location for retrieval at a later point in time. It was one thing for people to be curating en masse, but if that curation couldn’t be easily found it might as well not have been saved at all. Archie provided the early foundation for connecting those with questions to an archive of answers.

Other iterations and spin-offs of Archie would be developed, but it would be the advent of the World Wide Web which would lead to the next big shift in search.

The move from the bare Internet to the Web was a major one. Previously, sharing a file required the curator to set up a server holding their information so that an individual seeking that information could request it. The problem with this model was that you needed to know which server the file was stored on. That wasn’t an issue when curating and sharing information in small groups, but it suffered from the fragmentation that came with the growth of the Internet.

There was far more information out there on servers you (or your search engine) didn’t know about. The Internet was proving to be an amazing tool for archiving knowledge, but finding that knowledge was still prohibitively difficult.

Crawling the Web

The Web helped solve this problem with a technology called Hypertext Transfer Protocol (HTTP). Combined with Hypertext Markup Language (HTML) and the Internet Protocol Suite (TCP/IP), HTTP allowed an individual, through a web browser, to request information from a server, which could then return that information as an HTML web page, with TCP/IP facilitating the entire exchange.

When you visit a website today, this is what happens between clicking a link on Google and loading the page associated with that link. This technology proved to be the catalyst needed to bring together the curation power of computing and the potential the Internet held to provide global access to that information. The functionality provided by the HTTP/HTML system set the stage for a major curation turning point that came in the form of hyperlinks.
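To make that exchange concrete, here is a minimal sketch of the request/response dance using Python’s standard library. The host name is only a placeholder, and a real browser layers far more (caching, TLS, rendering) on top of this.

```python
# A minimal sketch of the HTTP exchange described above (plain HTTP, no TLS).
import socket

HOST = "example.com"   # placeholder server name
PORT = 80              # default port for unencrypted HTTP

# TCP/IP carries the exchange: open a connection to the server.
with socket.create_connection((HOST, PORT)) as conn:
    # HTTP is the language of the request: ask the server for the page at "/".
    request = f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
    conn.sendall(request.encode("ascii"))

    # The server replies with headers followed by the HTML of the page,
    # which a browser would then render for the reader.
    response = b""
    while chunk := conn.recv(4096):
        response += chunk

print(response.decode("utf-8", errors="replace")[:500])  # first part of the reply
```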

Hyperlinks, those links we use to navigate from page to page on the Internet, proved to be the missing piece of the puzzle, uniting a fragmented digital world of human insight. Their inception enabled the proliferation of web crawlers, software that indexes sites on the Web by jumping from hyperlink to hyperlink.

These crawlers would start with a set of popular and well-connected sites known as ‘seed’ pages. As they jumped from link to link, they indexed the pages they visited, building a collection of the sites on the Web. Suddenly, you didn’t need to know what server a particular piece of information was being curated on in order to access it; all you had to do was search the index built by the crawlers. It was from this advancement that we took our first steps toward search as we know it today.
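As a rough illustration of that loop, a toy crawler might look like the sketch below. The seed URL is a stand-in, and real crawlers add politeness rules (robots.txt, rate limits) and far sturdier parsing.

```python
# A toy crawl-and-index loop: start from seed pages, follow hyperlinks, record pages.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl from the seed pages, returning a {url: html} index."""
    index = {}
    frontier = deque(seed_urls)
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        if url in index:
            continue  # already reached via another hyperlink
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page; skip it
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # Jump from hyperlink to hyperlink: queue every outgoing link.
        frontier.extend(urljoin(url, link) for link in parser.links)
    return index

# index = crawl(["https://example.com"])  # "example.com" stands in for a seed page
```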

Thinking with Portals

Web crawlers and indexing had connected the Web and solved the accessibility problem, uniting digitally curated information with anyone who had the inclination to seek it out. But as the Web continued to grow, the sheer amount of curated content became impractical for a human to sift through in search of an answer to a question. Curation was no longer about saving knowledge in an accessible way. Instead, humanity needed to figure out how to find the right information within this seemingly endless web of sites.

Early search algorithms were introduced as a way of connecting users to the right indexed web page, but they had a long way to go before they even resembled the algorithms people rely on today. They still had trouble working out how best to connect users with the most appropriate answers and, more than that, they were easily gamed by spammers and other opportunists. This compounded the issue even further.

Humanity had created a knowledge repository unlike anything the world had ever seen, but it was getting harder and harder to navigate. With an adequate automated search option still years away, the Web fell back on manual curation and human-assisted search.

This manifested in the form of web portals: sites where human editors assembled collections of curated online content found on other sites. These collections resulted in a bundle of hyperlinks that served as a portal to high-quality sites related to that particular collection’s topic.

This solved the problem query-based search suffered from by leading the user down a path of links toward a page best suited to answering their question or meeting their needs. The lingering problem was that it was impractical (if not impossible) to create portals for every potential question. Even so, portals played a vital role in indexing curated human understanding online in the early days of the Web and helped drive mass adoption of this new technology.

But these portal sites were victims of their own success. The portal model couldn’t keep up with humanity’s need to access a diverse spectrum of information. Fortunately, breakthroughs were about to arrive in the form of natural language search and PageRank.

The Era of Modern Search

When most people begin a search online today, they tend to format their query the same way they’d ask a question of another human being, but this wasn’t always the case. Early search engines required you to know the exact name of the page or file you were seeking. Later engines got a little better, but were still restrictive about how your question had to be phrased. This threw a wrench into the accessibility of content provided by indexing. We could now track virtually everything online, but you had to know the right words to get your hands on it. For the average person, this was asking a lot.

Natural language search broke down the communication barrier between humans and computers, enabling the search experience we use every day. Natural language technology allowed search engines to translate queries phrased the way a human would normally speak into search parameters the engine could follow to find indexed pages related to that query. This finally opened the entire Web to the masses, but the method for connecting people to the best indexed pages was still suboptimal.
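No real engine works this simply, but the core idea of turning a conversational query into search parameters can be sketched as follows; the stopword list and the tiny index of pages are invented for the example.

```python
# Toy illustration: reduce a question to keywords, then match against indexed text.
STOPWORDS = {"what", "is", "the", "a", "an", "of", "how", "do", "i", "to", "in"}

def query_to_terms(query: str) -> set[str]:
    """Reduce a natural-language question to its content-bearing keywords."""
    words = query.lower().replace("?", "").split()
    return {w for w in words if w not in STOPWORDS}

def search(query: str, index: dict[str, str]) -> list[str]:
    """Rank indexed pages by how many query terms appear in their text."""
    terms = query_to_terms(query)
    scores = {url: sum(term in text.lower() for term in terms)
              for url, text in index.items()}
    return sorted((url for url, score in scores.items() if score > 0),
                  key=lambda url: scores[url], reverse=True)

# Hypothetical index of page text keyed by URL.
pages = {
    "https://example.com/archie": "Archie was an early search engine for file archives.",
    "https://example.com/http": "HTTP lets a browser request an HTML page from a server.",
}
print(search("What is an early search engine?", pages))  # the Archie page ranks first
```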

However, a solution wasn’t far behind. For years, search companies and research teams had worked tirelessly to understand how best to identify the right indexed page to serve up for a particular query. Natural language solved the query part, but even if the engine understood what it was looking for, it struggled to find the best result. The dots would finally be connected in the late ’90s by Google with its PageRank algorithm.

While not the first attempt to solve the problem of evaluating indexed pages, PageRank was the first solution to provide results with a level of accuracy unheard of at the time of its debut. PageRank (named after Google co-founder Larry Page) ranked indexed pages by assigning numerical weights to the hyperlinks between them in order to evaluate a particular page’s authority (or rank) within the index. That authority (along with other proprietary aspects of the algorithm) would determine which pages the Google search engine displayed for a given natural language query.
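The underlying intuition can be sketched with a simplified version of the idea: each page’s score is spread across the pages it links to, over and over, until the scores settle. The three-page link graph below is invented for illustration, and the production algorithm includes many refinements that were never published.

```python
# Simplified PageRank-style power iteration over a hand-made link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; returns page -> score."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal authority

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no outgoing links shares its score with every page.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
            else:
                # Each hyperlink passes on a share of the source page's authority.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: A and C both link to B, so B ends up ranked highest.
graph = {"A": ["B"], "B": ["C"], "C": ["B"]}
print(pagerank(graph))
```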

The PageRank model would go on to influence several other search algorithms over the years, but it was through PageRank that modern search (as we think of it today) would be born.

Digital Curation Enabled

Curation had reached its local maximum. Web 2.0 gave any individual who could get online the tools to curate any and all ideas and discoveries they could collect. Combined with modern search technology, this made an ever-growing index of human knowledge usable and provided the access to information that was necessary for the age of digital curation to finally begin.

Like What You've Read So Far?

Come Join Us on Gojurn!