Data-mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
Published by: John Wiley & Sons Inc (US)
Published: 12/04/2007
Price: $120.00
Biography
Daniel T. Larose, PhD, is Professor of Statistics in the Department of Mathematical Sciences at Central Connecticut State University. He is the author of three data mining books and a forthcoming textbook in undergraduate statistics. He developed and directs CCSU's DataMining@CCSU programs.
Table of Contents
PART I: WEB STRUCTURE MINING.
1 INFORMATION RETRIEVAL AND WEB SEARCH.
Web Challenges.
Web Search Engines.
Topic Directories.
Semantic Web.
Crawling the Web.
Web Basics.
Web Crawlers.
Indexing and Keyword Search.
Document Representation.
Implementation Considerations.
Relevance Ranking.
Advanced Text Search.
Using the HTML Structure in Keyword Search.
Evaluating Search Quality.
Similarity Search.
Cosine Similarity.
Jaccard Similarity.
Document Resemblance.
References.
Exercises.
2 HYPERLINK-BASED RANKING.
Introduction.
Social Networks Analysis.
PageRank.
Authorities and Hubs.
Link-Based Similarity Search.
Enhanced Techniques for Page Ranking.
References.
Exercises.
PART II: WEB CONTENT MINING.
3 CLUSTERING.
Introduction.
Hierarchical Agglomerative Clustering.
k-Means Clustering.
Probability-Based Clustering.
Finite Mixture Problem.
Classification Problem.
Clustering Problem.
Collaborative Filtering (Recommender Systems).
References.
Exercises.
4 EVALUATING CLUSTERING.
Approaches to Evaluating Clustering.
Similarity-Based Criterion Functions.
Probabilistic Criterion Functions.
MDL-Based Model and Feature Evaluation.
Minimum Description Length Principle.
MDL-Based Model Evaluation.
Feature Selection.
Classes-to-Clusters Evaluation.
Precision, Recall, and F-Measure.
Entropy.
References.
Exercises.
5 CLASSIFICATION.
General Setting and Evaluation Techniques.
Nearest-Neighbor Algorithm.
Feature Selection.
Naive Bayes Algorithm.
Numerical Approaches.
Relational Learning.
References.
Exercises.
PART III: WEB USAGE MINING.
6 INTRODUCTION TO WEB USAGE MINING.
Definition of Web Usage Mining.
Cross-Industry Standard Process for Data Mining.
Clickstream Analysis.
Web Server Log Files.
Remote Host Field.
Date/Time Field.
HTTP Request Field.
Status Code Field.
Transfer Volume (Bytes) Field.
Common Log Format.
Identification Field.
Authuser Field.
Extended Common Log Format.
Referrer Field.
User Agent Field.
Example of a Web Log Record.
Microsoft IIS Log Format.
Auxiliary Information.
References.
Exercises.
7 PREPROCESSING FOR WEB USAGE MINING.
Need for Preprocessing the Data.
Data Cleaning and Filtering.
Page Extension Exploration and Filtering.
De-Spidering the Web Log File.
User Identification.
Session Identification.
Path Completion.
Directories and the Basket Transformation.
Further Data Preprocessing Steps.
References.
Exercises.
8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING.
Introduction.
Number of Visit Actions.
Session Duration.
Relationship between Visit Actions and Session Duration.
Average Time per Page.
Duration for Individual Pages.
References.
Exercises.
9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION.
Introduction.
Modeling Methodology.
Definition of Clustering.
The BIRCH Clustering Algorithm.
Affinity Analysis and the A Priori Algorithm.
Discretizing the Numerical Variables: Binning.
Applying the A Priori Algorithm to the CCSU Web Log Data.
Classification and Regression Trees.
The C4.5 Algorithm.
References.
Exercises.
INDEX.