The Internet and the World Wide Web

 

1.       Finding Secondary Data on the Internet

     While gathering secondary data is necessary in almost any research project, it has traditionally been a tedious and boring job.  The researcher often had to write to government agencies, trade associations, or other secondary data providers and then wait days or weeks for a reply that might never come.  Often, the researcher made one or more trips to the library, only to find out that the needed reports were checked out or missing.  However, beginning in the late 1990s the rapid development of the Internet and the World Wide Web helped eliminate the drudgery associated with secondary data.  This secondary data source continues to expand, with publicly available documents on the Web doubling every 18 months.

     First, a few definitions:

Internet – a worldwide computer network consisting of smaller, interconnected networks.  A network consists of two or more computers connected to one another for the purpose of sharing communication and resources.  The Internet links both public and private computer systems to allow users to access information and documents from distant sources.

World Wide Web (WWW) (or “Web”) – that portion of the Internet made up of servers that support a graphical interface retrieval system, which organizes information into thousands of interconnected pages or documents called Web pages.  Content providers are the parties that provide information on the WWW in the form of Web sites, which consist of one or more Web pages with related information about a particular topic.  Marketing content providers include companies, direct marketers, electronic retailers (e-tailers) and other organizations that have their own Web sites <Dell.com>, as well as Web sites that contain secondary data.

There are two general ways people search for information on the Internet.  First is directed search/prepurchase search (“shopping”) with purchase of one or more particular products in mind.  Consumers know what they are looking for and usually have some existing information to rely on (e.g., a producer’s name, brand name, or a set of terms that describes the product category).  Here, they usually search for key/search terms in search engines as described below.  Second is browsing or casual search, with no particular immediate purchase in mind.  Here, the user might not have an immediate need or might have a less precise view of the information that might be available.  Browsing relies heavily on hyperlinks between documents, allowing the browser to navigate through cyberspace in a non-sequential manner.

     The Web has been likened to a library lacking a card catalog—there is no central authority that lists all possible sites accessible via the Internet.  This results in surfing—gliding in an unplanned fashion from home page to home page.  There are, however, two organized ways to find information on the WWW—general-purpose search tools such as browsers and search engines, and specially designed tools, such as shopping bots:

(1)    The uniform resource locator (URL) is the Web site address or domain name, which identifies a particular location (the Web server and the file on that server) of the site where the information you need is located.  You can enter the URL for that site in the search window (by clicking “File- Open Window”) of your Web browser, which takes you directly to that site’s home page or start page (the introductory page or opening screen of a Web site).  (The original six top level domains widely used in the U.S. were: .com [commercial or “dot.com”], .edu [education], .net [network operations], .gov [U.S. government], .mil [U.S. military], and .org [organization].  In 2001, .biz, .info, .name, .pro, .aero, .museum, and .coop were added to these.)
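To make the parts of a URL concrete, here is a minimal sketch in Python that splits an address into its Web server, top level domain, and file.  The function name describe_url and the file path /agencies/index.html are made up for illustration; only the host www.fedstats.gov comes from this handout.

```python
from urllib.parse import urlparse

def describe_url(url):
    """Split a URL into the parts discussed above (illustrative helper)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # The top level domain is the last dot-separated label of the host name.
    tld = "." + host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "scheme": parts.scheme,          # e.g. http
        "server": host,                  # the Web server
        "top_level_domain": tld,         # e.g. .gov, .com, .edu
        "file": parts.path or "/",       # "/" stands for the home page
    }

# The path below is a hypothetical example, not a real page.
info = describe_url("http://www.fedstats.gov/agencies/index.html")
print(info)
```

When no file is given (e.g., http://www.dogpile.com), the sketch reports "/" as the file, which corresponds to the site's home page.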

(2)    Use a search engine, a computerized directory that allows users to search the WWW for information in a systematic way.  The user types in keywords.  The big four sites are Google, Yahoo!, MSN, and America Online.  Others include Alta Vista, Lycos, and Ask Jeeves.  Each search engine contains collections of links to documents throughout the world, and each uses its own indexing system to help you locate the information you are looking for.  Some search for titles or headers of documents, while others search words in the documents, and still others search other indexes or directories.  You can also search for directories by typing in the word “database” in the search box.

       All search engines allow you to enter one or more key words or search terms into the text box.  They then return listings of hyperlinks or links (electronic connections from one Web site to another Web site).  Alternatively, you can click on a list of broad topics (art, business, entertainment, etc.) to go to subdirectories or else home pages.  Most search engines use a best match search process and present search output in order ranked by relevance, based on: how many of the search items were found in the document, how often the search items were found in the document, where in the document the search items were found (e.g., URL, meta tags, etc.), and proximity of the terms to one another.
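The four ranking factors just listed can be sketched in a few lines of Python.  This is only an illustration: the weights (10, 1, 5) and the sample pages are assumptions chosen for the example, and real search engines use their own, far more elaborate formulas.

```python
def relevance(query, title, body):
    """Toy best-match score built from the four factors named above."""
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()

    # 1. How many of the search terms were found at all.
    coverage = sum(1 for t in terms if t in body_words or t in title_words)
    # 2. How often the search terms were found in the document body.
    frequency = sum(body_words.count(t) for t in terms)
    # 3. Where they were found: a term in the title earns extra credit.
    in_title = sum(1 for t in terms if t in title_words)
    # 4. Proximity: the smallest gap between two different terms in the body.
    positions = [(i, w) for i, w in enumerate(body_words) if w in terms]
    gaps = [j - i for (i, a), (j, b) in zip(positions, positions[1:]) if a != b]
    proximity = 1.0 / min(gaps) if gaps else 0.0

    # The weights are arbitrary assumptions for this sketch.
    return 10 * coverage + frequency + 5 * in_title + proximity

# Hypothetical pages: name -> (title, body text).
docs = {
    "page_a": ("Snowboard makers", "snowboard makers sell every snowboard online"),
    "page_b": ("Winter sports", "a snowboard is one kind of winter gear"),
    "page_c": ("Vodka brands", "vodka brands and prices"),
}
query = "snowboard makers"
ranked = sorted(docs, key=lambda name: relevance(query, *docs[name]), reverse=True)
print(ranked)  # ['page_a', 'page_b', 'page_c']
```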

      It is not uncommon to find a large number of hits; if this is the case, the rule of thumb is to scan the first 50 hits, and if these don’t provide useful information, to consider redesigning the search strategy.   The outcome of a search (“retrieval set”) usually takes the form of a list of Web pages representing the records retrieved, ranked in order of their potential relevance to the query and presenting a certain number (say, ten) at a time.  Each of these incorporates a hypertext link to the source document. 

There are seven types of search engines:

(a) Hierarchical search engines or directories add value through human intervention in the assignment of subject headings to records in databases.  In addition, all sites are evaluated prior to inclusion.  Such sites contain only submissions from users; they don’t perform a search of the Web.  Hence, they are not comprehensive, omitting a large portion of the information on the Web.  Web site creators may submit their page for inclusion in the evaluation process.  The maintenance of such directories is a labor-intensive process, so such search services are selective in the sites that are included.  However, such selection reduces the amount of garbage one often encounters in an Internet search.

     <Exhibit 17.6> Yahoo! is an example of a search engine built on a hierarchical, subject-oriented guide.  All sites have to fit into a certain category/subject heading and subcategories (e.g., Stolichnaya vodka is indexed as Business and Economy/Companies/Drinks/Alcoholic/Vodka).  Going to Business and Economy/Companies/Sports/Snowboarding/Board Manufacturers gives almost 60 companies that sell snowboards on the Web. <Exhibit 17.7> Searching is via menus of these subject headings and/or through keyword searching. 

(b)    Collection search engines. <Exhibit 17.8> Alta Vista is an example of a search engine that uses a spider, an automated program that crawls around the Web and collects information.  The advantage of these is that they tend to be very comprehensive.  Because there are so many sites, they rank the best matches first. <Exhibit 17.9>
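As a rough illustration of what a spider does, the Python sketch below starts at one page, follows every link it finds, and records which words appear on which pages.  To keep it self-contained, it crawls a tiny made-up “Web” stored in memory (the TOY_WEB dictionary and its .example addresses are hypothetical); a real spider would download pages over the network.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical miniature Web: address -> page contents.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a> boards',
    "http://b.example/": '<a href="http://a.example/">A</a> snowboards for sale',
    "http://c.example/": '<a href="http://d.example/">D</a> vodka',
    "http://d.example/": "no links here",
}

class LinkExtractor(HTMLParser):
    """Collects the hyperlinks and the words found on one page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for name, v in attrs if name == "href" and v)

    def handle_data(self, data):
        self.text.extend(data.lower().split())

def crawl(start_url, web=TOY_WEB):
    index = {}                 # word -> set of pages containing that word
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = web.get(url)
        if html is None:       # link leads outside the toy Web
            continue
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for word in parser.text:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:   # visit each page only once
                seen.add(link)
                queue.append(link)
    return index

index = crawl("http://a.example/")
print(index["snowboards"])  # {'http://b.example/'}
```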

(c)    Concept search engines. Excite is an example of a concept search engine; these use a concept, rather than a word or a phrase, as the basis for the search. <Exhibit 17.10> To narrow the original search, one clicks on one of the sites found in that search to do another search.  The percentage key gives the user an idea of how close a particular site is to his or her concepts.  For example, Ask Jeeves <Exhibit 17.11> allows users to type in natural-language questions.  Concept search engines can be a relatively efficient and focused way of searching.  The disadvantage is that they aren’t as comprehensive as collection search engines.

(d)    Meta-engines/meta search engines/mega-search engines search multiple search engines simultaneously for words and phrases.  They then combine results, remove duplicate entries, and present a single listing.  Examples include MetaCrawler, Dogpile, and Debriefing (the latter is maintained by librarians who are constantly refining and upgrading the site).  Some of these can be found in the list of search engines when you click on the “search” button of your browser, and others are found by typing “www.searchenginename.com” (e.g., www.dogpile.com; www.debriefing.com).  They are a quick way of searching across several search tools, although they might not support some of the more sophisticated search facilities.  There are also specialty search engines that limit searches to specific topic areas such as law, business, and medicine, as well as Web community sites such as www.theglobe.com.
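The combine, de-duplicate, and list steps of a meta-engine can be sketched as follows.  This is an assumption-laden illustration: the function merge_results, the scoring rule (1/rank credit per engine), and the .example addresses are all invented here, and treating addresses that differ only by a trailing slash as the same page is a simplification.

```python
def merge_results(result_lists):
    """Combine ranked lists of addresses from several engines into one list."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            key = url.rstrip("/")   # simplification: ignore a trailing slash
            # Assumed rule: higher-ranked hits earn more credit, and credit
            # adds up when several engines return the same page.
            scores[key] = scores.get(key, 0.0) + 1.0 / (rank + 1)
    # Best score first; ties are broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical results from two engines.
engine_a = ["http://x.example/boards", "http://y.example/"]
engine_b = ["http://y.example", "http://z.example/shop"]
merged = merge_results([engine_a, engine_b])
print(merged)
# ['http://y.example', 'http://x.example/boards', 'http://z.example/shop']
```

The page returned by both engines appears only once and rises to the top of the single listing.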

(e)    Robot search engines/search bots. This newest type of search engine acts like meta search tools and searches many Internet search engines in parallel.  They differ from meta search tools in that they are loaded at the local workstation rather than operating in client-server mode.  Also, they use robots (“Bots”) or intelligent agents to roam the Internet in search of information.  Once a search has been performed, the user needs to assign relevance rankings to the items retrieved.  The intelligent agent uses this information in the next iteration to modify its search operation.  <Exhibit 17.12>  For example, Travelocity.com finds the best deals for your traveling needs, while BargainFinder (www.BargainFinder.com) does so for your music needs.  Some Web retailers have designed their sites to either refuse the robot admission or to confuse the robot, as they wish to avoid a “cheap” image.

     (f) Search engines for specific sites.  E-tailers with large catalogs of products, such as Amazon.com, need a search engine to support users in navigating their way through the cyber-store. 

     Some search engines (e.g., Yahoo! and Lycos) serve as portals (entry/starting points) for Internet exploration.  <Exhibit 17.14> America Online is a well-organized Web portal from which a Web surfer can link/jump to many locations highlighted by AOL.  Commercial Web sites pay AOL to be featured in this way.  Such portals can be vertical (serving one industry or market, such as an ethnic market) or horizontal (serving multiple industries and markets).

     Which search engine should you use?  The best search engines cover about 30 percent of the estimated pages out there.  A 1999 study found that Northern Light, Snap, and Alta Vista index significantly more (16%) of the Web than the other popular search engines.  The most up-to-date search engines were Alta Vista, Excite, and Hotbot.  It is also a good idea to use multiple search engines, or a meta-engine, since there is surprisingly little overlap between the major search engines.  Specialty search engines, also known as niche or “vertical” search engines, search only within a narrow band of interest.  They are sometimes called vortals (a contraction of “vertical” and “portal”), and they might also offer expert reviewers and provide the “best” recommended sites in a given area.

     (g) Blog search engines such as Technorati, Feedster, or Blogdigger.  If you’re looking for very current information (such as today’s buzz), these are useful.

(3)    Shopping bots are specialized search bots designed to locate and compare products.  They take a query, visit shops that might have the sought product, bring the user the results, and present them in a consolidated, compact format that facilitates comparison shopping.  Many also provide access to an order form.

      Searching is on the basis of full text and/or product categories.   Some shopping bots are comprehensive in coverage (e.g., MySimon, NetMarket, and Planet Retail) while others focus on a specific product range (e.g., BargainBot for books, Bargain Finder Agent for Music and CDs, Gift finder for gifts, and Price Scan for computer software and hardware). 

     Most shopping bots claim to eliminate the searching necessary to identify the right product at the best price. 
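The core of what a shopping bot does (take a query, check each shop, and present the offers side by side) can be sketched in Python.  The shop names, products, and prices below are invented for illustration; an actual bot would gather them by visiting the retailers’ sites.

```python
def compare_offers(product, shops):
    """Return (shop, price) pairs for the product, cheapest first."""
    offers = [
        (name, catalog[product])
        for name, catalog in shops.items()
        if product in catalog        # skip shops that don't carry it
    ]
    return sorted(offers, key=lambda offer: offer[1])

# Hypothetical catalogs: shop name -> {product: price in dollars}.
shops = {
    "ShopA": {"dvd player": 89.99, "cd": 12.50},
    "ShopB": {"dvd player": 79.95},
    "ShopC": {"cd": 11.00},
}
best = compare_offers("dvd player", shops)
print(best)  # [('ShopB', 79.95), ('ShopA', 89.99)]
```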

The procedure for searching is:

  1. Get to the search engine, either by clicking on the “search” button in the browser or by typing the search engine’s URL (e.g., www.yahoo.com) in the search window.
  2. Enter your search request (keyword[s]) in the search engine’s search window.  Although different search engines use slightly different rules for controlling the parameters of your search, there are certain rules of Boolean logic/search language (remember the “new math” diagrams that illustrated the inclusivity or exclusivity of sets with AND, OR, and AND NOT?) that can generally be used <Operators>:

·         Use a plus sign (+) in front of a word to indicate that it must appear in each Web page of the query results (e.g., hotels+San+Francisco).  Without the plus sign the word isn’t considered mandatory.

·         Use a minus sign (-) in front of any word that shouldn’t be included in any Web page in the search results (e.g., Cars-Ford).

·         Enclose a multiword phrase in quotation marks to tell the search engine to list only sites that contain those words in that exact order (e.g., “Seattle Preparatory School”).

·         AND works like the plus sign, indicating that all the words joined by AND must appear in the document (e.g., to find documents that contain the words wizard, oz, and movie, enter: wizard AND Oz AND movie).

·         OR joins words, at least one of which must appear in the document (e.g., to find documents that contain the word dog or puppy, type: dog OR puppy).  OR is often used to broaden a search (e.g.: (travel OR tourism OR cruises OR cruising OR vacations OR vacationing OR vacationers) AND (Caribbean OR Bermuda OR Jamaica OR Virgin Islands))

·         AND NOT (or simply NOT) is similar to the minus sign and is used to exclude words that are likely to match your search requirements but have nothing to do with the search topic. (E.g., to find documents that contain the word pets but not the word dogs, enter: pets AND NOT dogs; e.g.: Dolphins NOT NFL).

·         NEAR should be used when words should be near each other (e.g., Moon NEAR River).

·         () Parentheses are used to group portions of Boolean queries together (e.g., to find documents containing the word fruit and either banana or apple, type “fruit AND (banana OR apple)”).

·         Title search allows you to search for titles of Web documents (e.g., “title:Mars” or “t:Mars” will retrieve all documents with the word “Mars” in the title).

·         * Wild card (e.g., eco* will return economy, economics, ecology, etc.)
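The Boolean operators above behave like the set diagrams from the “new math”: each search term stands for the set of documents containing it, AND is the overlap of two sets, OR is their combination, and AND NOT removes one set from another.  The Python sketch below illustrates this with five made-up documents; the helper names (docs_with, AND, OR, AND_NOT) are assumptions for the example, not the syntax of any particular search engine.

```python
from fnmatch import fnmatchcase

# Hypothetical document collection: ID -> text.
DOCS = {
    1: "wizard of oz movie review",
    2: "oz the great wizard book",
    3: "dog and puppy training",
    4: "pets such as dogs and cats",
    5: "pets such as cats and birds",
}

def docs_with(term):
    """IDs of documents containing the word; * acts as a wild card."""
    pattern = term.lower()
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if any(fnmatchcase(word, pattern) for word in text.split())
    }

def AND(a, b):      # every term must appear
    return a & b

def OR(a, b):       # at least one term must appear
    return a | b

def AND_NOT(a, b):  # exclude documents containing the second term
    return a - b

wizard_oz_movie = AND(AND(docs_with("wizard"), docs_with("Oz")), docs_with("movie"))
dog_or_puppy = OR(docs_with("dog"), docs_with("puppy"))
pets_not_dogs = AND_NOT(docs_with("pets"), docs_with("dogs"))

print(wizard_oz_movie)    # {1}
print(dog_or_puppy)       # {3}
print(pets_not_dogs)      # {5}
print(docs_with("dog*"))  # {3, 4}
```

Note that dog OR puppy misses document 4, which contains only “dogs”; the wild card dog* picks it up.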

Some Hints for Searching:

·         Be specific.  Typing in “DVD Players Reviews” will give you a better set of results than the more general “DVD Players.”

·         Add quotation marks.  Keep exact phrases and proper names intact by enclosing them in quotation marks.  Use words most likely to be used (e.g., try “John F. Kennedy” and “born” rather than “John F. Kennedy” and “birth date”).

·         Use the “Advanced Search” feature.  For example, you can scour only certain kinds of documents by excluding pages with certain words.

3.       After typing the search request, click on the search button.  (The search engine then searches the entire Web or a subset of the Web to locate sites meeting your search parameters.)

  4. In a period of several seconds to up to a minute or more, the search engine returns a list of sites containing the information that meets your criteria.  Look at the top or bottom of the current page to see how many sites are listed.  If there are too many, you can either just look at those at the top of the list (click on a link to go to a document) since those should be the best match, or start over and narrow your search (e.g., by typing in additional terms, using AND, or using NOT).

     Web sites are also discovered via word-of-mouth communication as well as checking favorite Web sites on others’ home pages.

     Also, much information, like airline on-time records, is buried in databases and not in Web pages scoured by search engines.  Access such hidden information through www.invisible-web.net. 

 

2.       Finding Federal Government Data on the Internet

     A great source here is the Statistical Universe, created by the Congressional Information Service (CIS).  It is available at www.cispubs.com.  This is the most comprehensive and fully indexed source for federal statistics online.  Entire reports can be downloaded.  Links to the 70 federal agencies recognized by the Office of Management and Budget as issuing statistical data can be found at www.fedstats.gov.

 

3.       Internet Discussion Groups and Special Interest Groups

     Newsgroups are Internet sites devoted to a specific topic where people can read and post messages.  They are a primary means of communication between professionals and among members of special interest groups.  You can visit any newsgroup supported by your Internet service provider (ISP).  If your ISP doesn’t offer these groups or carry the one you are interested in, you can find one of the publicly available newsgroup servers that carries the group you’re interested in.  Both Netscape and Internet Explorer, as well as other browsers, come with newsgroup readers.

 

     Newsgroups are like bulletin boards for a particular topic.  People stop by the newsgroup to read messages left by other people, post responses to others’ questions, and send rebuttals to comments with which they disagree.  There is usually some management to keep the discussion on topic and to remove offensive material.

       Newsgroup messages look like e-mail messages, containing a subject title, author, and message body.  They differ in that they are threaded discussions—any reply to a previous message will appear linked to that message. Therefore, you can follow a discussion by starting at the original message and following the links (threads) to each successive reply. 
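The idea of a threaded discussion can be illustrated with a short Python sketch: each message records which earlier message it replies to, and following those links from the original message reproduces the thread.  The sample messages and the function names are hypothetical, and actual newsreaders store this information in their own formats.

```python
from collections import defaultdict

# Hypothetical messages: (ID, ID of the message replied to, subject).
# A reply-to value of None marks the original message of a new thread.
messages = [
    (1, None, "Best snowboard brand?"),
    (2, 1, "Re: Best snowboard brand?"),
    (3, 1, "Re: Best snowboard brand?"),
    (4, 2, "Re: Re: Best snowboard brand?"),
    (5, None, "Vodka prices"),
]

def build_threads(msgs):
    """Map each message ID to the IDs of its direct replies."""
    replies = defaultdict(list)
    for msg_id, reply_to, _subject in msgs:
        replies[reply_to].append(msg_id)
    return replies

def thread_order(replies, root, depth=0):
    """Follow the thread from the original message through each reply."""
    order = [(depth, root)]
    for child in replies.get(root, []):
        order.extend(thread_order(replies, child, depth + 1))
    return order

replies = build_threads(messages)
outline = thread_order(replies, 1)
print(outline)  # [(0, 1), (1, 2), (2, 4), (1, 3)]
```

Each pair gives the nesting depth and the message ID, so message 4 (a reply to a reply) sits two levels below the original message.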

      Although Stonehill doesn’t have a newsgroup server, the usual procedure is:

       1. Open your newsreader program in your browser.  In Navigator click Communicator-Newsgroups, or in Netscape e-mail click on “News.”

       2. Search for the newsgroups of interest by keywords or topics.

       3. Select the newsgroup of interest.

       4. Begin scanning messages.  The title of each message generally gives an indication of the subject matter.

 

     Usenet is a collection of discussion groups in cyberspace.  People in a Usenet group can read messages on a given topic, post new messages, and respond to existing messages.  For advertisers, a Usenet group can be a highly targeted audience for advertising messages.  Marketing researchers can also use Usenet as a form of unobtrusive observational research.  By visiting them, you can get the latest opinions on products, services, stores, etc.  There are also company-specific sites, such as U-Hell (for U-Haul), Untied.com (for United Airlines), and TheWorst.com (for Sprint PCS), as well as more general complaint sites, such as UgetHeard.com, Bitchaboutit.com, Complain.com, and Epinions.com.  Some of these sites actually send the complaints to the offending companies, who can choose to respond with some sort of restitution.

Unfortunately, some of the complaints are suspected of being fake.

Chat rooms, whether set up by your company on your Web site or run by others, can be monitored for word-of-mouth communication.  These can function as virtual focus groups operating in near-continual session, enabling companies to track consumer buzz as it develops.