DATA MINING

What is Data Mining?


Data Mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data. The knowledge discovery process includes data selection, cleaning, coding, the application of different statistical and machine learning techniques, and visualization of the generated structures. The course will cover all these issues and will illustrate the whole process with examples. Special emphasis will be given to Machine Learning methods, as they provide the real knowledge discovery tools. Important related technologies, such as data warehousing and on-line analytical processing (OLAP), will also be discussed. The students will use recent Data Mining software. Enrollment in this course is limited to 15 students.



Data Mining Training Syllabus


Introduction to Data Mining

  • What is data mining?
  • Related technologies – Machine Learning, DBMS, OLAP, Statistics
  • Data Mining Goals
  • Stages of the Data Mining Process
  • Data Mining Techniques
  • Knowledge Representation Methods
  • Applications
  • Example: weather data

Data Warehouse and OLAP

  • Data Warehouse and DBMS
  • Multidimensional data model
  • OLAP operations
  • Example: loan data set

Data preprocessing

  • Data cleaning
  • Data transformation
  • Data reduction
  • Discretization and generating concept hierarchies
  • Installing Weka 3 Data Mining System
  • Experiments with Weka – filters, discretization
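
As a rough illustration of the discretization topic listed above, the following is a minimal Python sketch of equal-width binning. It is not Weka's filter; the function name `equal_width_bins` and the temperature values are hypothetical and chosen only for the example.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width intervals)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value inside the last bin instead of overflowing.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute (e.g. temperature readings).
temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 85]
bins = equal_width_bins(temperatures, 3)
print(bins)
```

Equal-width binning is only one option; equal-frequency or class-aware (supervised) discretization may suit skewed data better.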

Data mining knowledge representation

  • Task relevant data
  • Background knowledge
  • Interestingness measures
  • Representing input data and output knowledge
  • Visualization techniques
  • Experiments with Weka – visualization

Attribute-oriented analysis

  • Attribute generalization
  • Attribute relevance
  • Class comparison
  • Statistical measures
  • Experiments with Weka – using filters and statistics

Data mining algorithms: Association rules

  • Motivation and terminology
  • Example: mining weather data
  • Basic idea: item sets
  • Generating item sets and rules efficiently
  • Correlation analysis
  • Experiments with Weka – mining association rules
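
The item-set terminology above can be sketched in a few lines of Python. This is a brute-force illustration of support and confidence under stated assumptions, not the efficient generation procedure covered in the course and not Weka's implementation. The four transactions and all function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    ante = support(antecedent, transactions)
    return both / ante if ante > 0 else 0.0


def frequent_itemsets(transactions, min_support, max_size=2):
    """Enumerate all item sets up to `max_size` (brute force) and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


# Hypothetical toy transactions written as attribute=value items.
transactions = [
    frozenset({"outlook=sunny", "humidity=high", "play=no"}),
    frozenset({"outlook=sunny", "humidity=normal", "play=yes"}),
    frozenset({"outlook=rainy", "humidity=high", "play=no"}),
    frozenset({"outlook=overcast", "humidity=normal", "play=yes"}),
]

print(support({"humidity=high"}, transactions))
print(confidence({"humidity=high"}, {"play=no"}, transactions))
print(len(frequent_itemsets(transactions, 0.5)))
```

Enumerating every combination grows exponentially with the number of items, which is why efficient item-set generation prunes candidates whose subsets are already infrequent.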

Data mining algorithms: Classification

  • Basic learning/mining tasks
  • Inferring rudimentary rules: 1R algorithm
  • Decision trees
  • Covering rules
  • Experiments with Weka – decision trees, rules
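
To make the 1R topic above concrete, here is a minimal Python sketch for nominal attributes: for each attribute, predict the majority class per attribute value, then keep the attribute whose rule set makes the fewest training errors. The function name `one_r` and the seven toy rows are hypothetical, and handling of numeric attributes and missing values is omitted.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (best_attribute, rule, training_errors) for nominal data."""
    best = None
    for attr in attributes:
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per attribute value: predict the majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that are not in the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy rows in the style of a weather data set.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]

attribute, rule, errors = one_r(rows, ["outlook", "windy"], "play")
print(attribute, rule, errors)
```

On these rows the rule set built on `outlook` misclassifies one training instance, while the one built on `windy` misclassifies two, so `outlook` is selected.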


Data mining algorithms: Prediction

  • The prediction task
  • Statistical (Bayesian) classification
  • Bayesian networks
  • Instance-based methods (nearest neighbor)
  • Linear models
  • Experiments with Weka – Prediction

Evaluating what’s been learned

  • Basic issues
  • Training and testing
  • Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
  • Combining multiple models (bagging, boosting, stacking)
  • Minimum Description Length Principle (MDL)
  • Experiments with Weka – training and testing
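
The accuracy-estimation methods listed above can be sketched as follows. This is a minimal Python illustration of k-fold cross-validation, assuming a caller-supplied `fit`/`predict` pair. All names are hypothetical, and a majority-class baseline stands in for a real learner.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for k folds over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(xs, ys, k, fit, predict):
    """Average test accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Hypothetical baseline learner: always predict the most frequent training label.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


xs = list(range(10))
ys = ["yes"] * 10
print(cross_validate(xs, ys, 5, fit_majority, predict_majority))
```

Setting `k` equal to the number of instances gives leave-one-out, while a single train/test split corresponds to the holdout method.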

Mining real data

  • Preprocessing data from a real medical domain (310 patients with Hepatitis C).
  • Applying various data mining techniques to create a comprehensive and accurate model of the data.

Clustering

  • Basic issues in clustering
  • First conceptual clustering system: Cluster/2
  • Partitioning methods: k-means, expectation maximization (EM)
  • Hierarchical methods: distance-based agglomerative and divisive clustering
  • Conceptual clustering: Cobweb
  • Experiments with Weka – k-means, EM, Cobweb
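
As an illustration of the k-means partitioning method named above, here is a minimal Python sketch (requires Python 3.8+ for `math.dist`). It is not Weka's implementation. The initialization from the first `k` points is a simplifying assumption made for determinism, and the function name and sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=100):
    """Plain k-means on tuples of floats; returns (centroids, assignment)."""
    # Simplifying assumption: use the first k points as initial centroids.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = k_means(points, 2)
print(centroids, assignment)
```

Results depend on the initial centroids, so practical implementations usually use random restarts or a smarter seeding strategy.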

Advanced techniques, Data Mining software and applications

  • Text mining: extracting attributes (keywords), structural approaches (parsing, soft parsing).
  • Bayesian approach to classifying text
  • Web mining: classifying web pages, extracting knowledge from the web
  • Data Mining software and applications