Deep Search – How to Explore the Internet More Effectively

Where to Search

For the everyday things, Google and the conventional search engines do a good job. But detailed information is not always easy to find, especially when the engines throw up thousands of pages of results.

Most people rarely venture beyond the first page or two and, after 14 minutes of fruitless looking, even the most determined usually give up.

Understanding how to interrogate any search engine will certainly help. But knowing where to look is often more important.

The regular search engines only index a tiny fraction of the data stored on the Internet. They do this by extracting the ‘visible’ data on websites. This is then searchable with keywords.

But there is another world largely invisible to the conventional search engines. This is the Deep Web, which is not to be confused with the Dark Net, an area of the Deep Web largely concerned with illegal activities.

The information held on the Deep Web is generally contained inside databases and archives, and this content is not indexed by the conventional engines because they are rarely programmed to enter these data stores.

As such, this so-called Deep Web information can only be found by interrogating the database or archive directly through its own search facilities. The archives themselves can usually be found by asking a conventional Surface search engine to find them for you.

For example, suppose a Boeing 767 crashes and you want to look for similar incidents. You would begin your search in the conventional way with Google. But rather than asking Google to find the actual information itself, ask it to find a database dealing with air accidents, such as Plane Crash Info.

The data held within such a site counts as Deep Web because the Surface search engines will not have indexed it, so they won’t know what is inside.

But, once at the database, you can directly enter the make and model of the aircraft, along with a date range, and pull up every accident report for every incident globally, along with all the probable causes. Trying to do this by interrogating a Surface search engine alone would take more time than most people are ever likely to devote.
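
To make the idea of interrogating a database directly more concrete, here is a minimal sketch in Python. It is not how Plane Crash Info or any real archive works internally. The table name, the column names and the sample rows are all invented for illustration, and the rows do not describe real incidents. It simply shows the kind of structured question (aircraft type plus a date range) that a database can answer in one step and a keyword search cannot.

```python
import sqlite3

# Invented placeholder rows; they do not describe real incidents.
SAMPLE_ROWS = [
    ("Boeing 767", "2001-03-10", "placeholder cause A"),
    ("Boeing 767", "2009-07-22", "placeholder cause B"),
    ("Boeing 747", "2005-11-02", "placeholder cause C"),
    ("Boeing 767", "2015-01-30", "placeholder cause D"),
]


def build_database():
    """Create an in-memory database with a hypothetical 'accidents' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accidents (aircraft TEXT, date TEXT, probable_cause TEXT)"
    )
    conn.executemany("INSERT INTO accidents VALUES (?, ?, ?)", SAMPLE_ROWS)
    return conn


def find_accidents(conn, aircraft, start, end):
    """Return rows for one aircraft type within an inclusive ISO date range."""
    cursor = conn.execute(
        "SELECT aircraft, date, probable_cause FROM accidents "
        "WHERE aircraft = ? AND date BETWEEN ? AND ? ORDER BY date",
        (aircraft, start, end),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_database()
    for row in find_accidents(conn, "Boeing 767", "2000-01-01", "2010-12-31"):
        print(row)
```

The search form on a real archive does much the same job: it turns the boxes you fill in into a precise query against records that a Crawler never sees.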

The same thing works for historical documents and quotations. For example, you may come across the line “I am the most unhappy man. I have unwittingly ruined my country” and want to pin it down. Google will certainly provide the apparent quote by Woodrow Wilson – “I am the most unhappy man. I have unwittingly ruined my country. A great industrial nation is now controlled by its system of credit. We are no longer a government by free opinion, no longer a government by conviction and the vote of the majority, but a government by the opinion and duress of a small group of dominant men”.

But what you see is not necessarily the actual quotation and you will also see endless examples of people regurgitating the line without pinning it down to a time and a place or to a specific document. You will also see a lot of debate as to its validity because most people do not know how to search effectively.

But once you know that the line is attributed to Woodrow Wilson, you can ask Google to find the right archive. That takes you to woodrowwilson.org, where you can quickly track part of the line down to a 1912 campaign speech and find the remainder in the full-text version of Wilson’s book The New Freedom, published in 1913, the year he signed the Federal Reserve into existence.

You could spend all day on Google and achieve nothing, as opposed to 20 minutes reading in the right archive.

So it is not that difficult. You just need to know where to look and, of course, how to phrase the right question.

It would be difficult for this book to list every possible archive and database and all the other portals within the Deep Web, but below you will find some of the most useful. For the rest, ask a conventional search engine.

Search Engines – How they Work

Search engines work by storing key information from the webpages that they retrieve using an automated web browser known as a Crawler. This information is extracted from each page’s title, content, headings and meta tags. Results are generally presented in list form and can cover webpages, images and some file types. A few engines also mine data inside databases but coverage is a long way from comprehensive.

Some search engines like Google store all or part of the source page (known as a cache). Others, like AltaVista, store every word of every page they find. When a user enters a query into a search engine, the engine examines its index and provides a listing of best-matching webpages, usually with a short summary containing the document’s title and sometimes some of the text. The engine looks for words or phrases exactly as entered.
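
The indexing and lookup described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used by Google, AltaVista or any other engine. The page addresses and text are made up, and the matching rule (every query word must appear somewhere on the page) is an assumption chosen to keep the example short.

```python
import re
from collections import defaultdict

# Made-up pages standing in for what a Crawler might have retrieved.
PAGES = {
    "example.org/a": "Air accident reports and probable causes",
    "example.org/b": "Campaign speech archive and full-text books",
    "example.org/c": "Accident statistics by aircraft model",
}


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query, sorted by address."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


if __name__ == "__main__":
    index = build_index(PAGES)
    print(search(index, "accident"))
    print(search(index, "accident reports"))
```

Note that the index only knows about text the Crawler actually saw. Anything held behind a database search form never enters it, which is why such content stays out of reach of keyword searches.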

Most search engines rank their results to provide the “best” results first. How a search engine decides which are the best pages tends to vary.

Most search engines are in it for the money and some charge advertisers to have their listings ranked higher. Those which don’t charge make money by placing search-related ads alongside the regular search results and get paid whenever someone clicks on one.

Google Alternatives — Google, along with most search engines, stores detailed information about your interests. Each year, the FBI compels these companies to hand over the personal details of hundreds of users without presenting a court order. There are, however, alternative engines that do not store information on you in the first place:

Deep Web Search — no single engine can search the entire Deep Web and no single directory can cover it all, but these go some way:

InfoMine — built by librarians at the University of California, California State University, the University of Detroit-Mercy, and Wake Forest University.

Librarians’ Internet Index — search engine listing sites deemed trustworthy by librarians.

SurfWax — practical tools for dynamic search and navigation.

BUBL — catalogue of Internet resources.

Pinakes Subject Launch Pad — academic research portal.

Search.com — dozens of topic-based databases from CNet.

OAIster Database — millions of digital resources from thousands of contributors.

Metasearch Engines — a good way to perform a detailed search is to employ a metasearch engine to search multiple search engines simultaneously. These include:

DogPile

Mamma

Kartoo
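
As a rough sketch of what a metasearch engine does, the Python below sends one query to several engines and merges the ranked lists they return. The two engine functions are hypothetical stand-ins that return fixed, made-up results. The merging rule (adding 1/rank for each appearance, so a page listed highly by several engines rises to the top) is one simple approach chosen for illustration and is not claimed to be what DogPile, Mamma or Kartoo use.

```python
from collections import defaultdict


def engine_a(query):
    """Hypothetical engine: returns a fixed, made-up ranked list."""
    return ["example.org/1", "example.org/2", "example.org/3"]


def engine_b(query):
    """Hypothetical engine: returns a different made-up ranked list."""
    return ["example.org/2", "example.org/4"]


def metasearch(query, engines):
    """Query every engine and merge the results into a single ranked list."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            # A higher position in any one list earns a larger share of the score.
            scores[url] += 1.0 / rank
    # Best score first; ties are broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    for url in metasearch("air accident database", [engine_a, engine_b]):
        print(url)
```

Here example.org/2 comes out on top because both engines list it, even though neither engine agrees with the other about what belongs in first place.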

Database Search — there are specialized search engines for finding databases. Arguably the best of the bunch is CompletePlanet, which scours over 70,000 searchable databases and specialty search engines. Other notables include:

Search.com — seeks out databases and allows you to search multiple engines with a single query.

TheInfo.com — search specific engines and databases.

Beaucoup — one of the first specialized search engine guides, listing over 2,500 selected engines, directories and indices.

FinderSeeker — breaks searches down by country and even cities.

Fossick — covers over 3,000 specialized search engines and databases.

Repositories and Gateways

Repositories of Primary Sources — direct links to over 5,000 archives, databases and websites globally.
WWW Virtual Library — first catalogue of the web, started by Tim Berners-Lee in 1991. Run by a loose confederation of volunteers.
Librarians’ Internet Index — compiled by librarians offering a searchable, human-reviewed gateway to quality sites in the Surface and Deep Webs.
Digital Librarian — a librarian’s choice for the best of the web’s databases and research resources.
GPO — US Government Printing Office, access to multiple databases including records, hearings, reports, manuals, court opinions, etc.
Library of Congress Catalogs — gateway to a vast collection of academic institutions, universities, libraries, and miscellaneous databases.
CIA Electronic Reading Room — search for declassified CIA documents.
Project Vote Smart — database of US government officials and candidates.
USPTO — patent full-text and image database.
US Census Bureau International Database — demographics, world population data, etc.
WebLens — portal to academic and scholarly research papers and thousands of useful Internet research tools.
DOAJ — Directory of Open Access Journals, free full-text scientific and scholarly journals, covering numerous subjects and languages.
Geniusfind — directory of thousands of search engines, databases and archives organized into categories and subcategories.
Ask ERIC — Education Resources Information Center.

Open Directories — assembled by human beings who use editorial judgment to make their selections, not by Crawlers running algorithms. A web directory is not a search engine: rather than displaying lists of webpages based on keywords, it divides the web into categories.

The categorization is usually based on the whole website rather than one page or a set of keywords. Most directories are general in scope and list websites across a wide range of subjects, regions and languages. But some niche directories focus on countries, languages, industries, products, etc.

Popular directories include the Yahoo! Directory and the very comprehensive Open Directory Project.

User-Edited Directories — compiled, as the name suggests, by users who are generally experts in their field and who wish to share favorite sites and improve search results. These include IllumiRate and JoeAnt.