Data Mining

M Sasikumar, NCST, Mumbai - 400 049

Introduction

``Computers have promised us a fountain of wisdom, but delivered a flood of data'', a frustrated MIS executive is reported to have made this remark. The speed with which information is being generated and the technology which makes it available to everyone's finger tips has come a long way. Today's problem is in keeping your head above the flood of information, at the same time not losing touch with what is happening in areas of your interest. This is turning out to be an increasingly difficult task. That, in brief, is the motivation for Data Mining (DM).

The Earth satellites are expected to generate a tera-byte of data every day, mainly as pictures. These are potentially useful for a variety of applications from global weather studies and planning to military uses. But, one person would need many years just to examine all the pictures produced in a day. Manual processing of such data is likely to lead to highly outdated results, if at all.

The US Census is estimated to be about a trillion bytes. Once again the volume of data is mind-boggling; at the same time, such data contain a wealth of information critically important for planning the development of the country.

In most business houses, databases are common. Almost every transaction is recorded in a database. Thus, most houses have a large amount of data about which customer did what transaction of what items when. These are also data mines - a careful analysis of such databases could possibly tell you who are your most loyal customers, who switches brands often, what kind of people prefer a given brand of an item, etc. These are, obviously, immensely valuable items of information for planning your market, organisation of your store, inventory, etc. But once again, the complexity of the task is normally quite daunting.

Wal-mart, a chain of over 2000 retail stores uploads about 20 million point of sale transactions to an AT&T massively parallel system with 483 processors running a centralised database. They want to know the trends down to the last Q-tip.

"Advances in data gathering, storage, distribution technologies have far outpaced computational advances in techniques for understanding and analysing data," says an introduction to this field, pointing to a clear need for a generation of tools for DM.

Announcing its suite of offerings for DM area, IBM stated, "IBM's DM tools and techniques can provide the knowledge necessary to improve a company's market presence and differentiate their products and services in today's global market presence." And this in a way sums up the reason behind the excitement of DM, from a commercial point of view.

In an age of fierce international competition and technological advances, staying in business and retaining your market share is no easy job. Companies are turning to the help of such technologies to analyse and find the market segments to target, to plan suitable marketing strategies for different segments of the market, etc.

In the next section, we explore the motivations of data mining further, in a bit more technical sense. That also opens up other application frontiers for this area. Then we look at what the term, data mining, actually means. Its component tasks and the issues related to implementation are explored. Section 4 explores the issues in DM, in general. The quality of data, directing the mining process, etc., are considered here. Then we move on to look at the other areas that have had a strong impact on DM and outline the contributions that they are making to the field. As we will see, DM is a highly interdisciplinary field.

In Section 6, we take a look at a set of applications of DM, covering a range of domains from health-care to legal records. Section 7 samples the tools being developed, academically and commercially, for this purpose. Section 8 concludes the report with a summary and taking a look at the future.

Why Data Mining?

As detailed in the previous section, the major driving force is the combination of enormous volume of data and the importance of analysing them. Given volumes like tera-bytes, manual analysis is impossible. Computer assistance is essential. There are fairly conventional uses of computer assistance in database processing - find the average yield, the plot of profit against loss on a monthly basis, etc. The main characteristic here is that it requires human driven queries, where the person knows clearly what he is looking for and, where and how to find them.

With use of techniques from AI, etc, other more sophisticated avenues are opening up. Some of these are extensions of conventional applications; others are new arenas.

What is DM?

Data mining is a highly inter-disciplinary area spanning a range of disciplines from artificial intelligence to databases and statistics. Its association with the different areas and its vast scope has led to a proliferation of names for this. Some of these are: knowledge extraction, database exploration, data pattern processing, data archaeology, information harvesting, siftware, and data dredging. Though not all of these are strictly equivalent, the overlap is quite substantial and hence we will not attempt to offer any hair-splitting analysis of the differences.

Very broadly, a DM system "consists of a collection of components that together can efficiently identify and extract interesting and useful new patterns from data stored in real-world databases." [Mathews et al, 1993] Thus a DM process can be viewed as consisting of the following stages:

All these stages may not be present, nor relevant, in all cases. But this gives a general picture of the stages. We will largely concentrate on the Pattern Extraction stage here. A wide range of tools and algorithms exists for this task, depending on the user requirement. As we described in Section 2, the goals of a DM application could range over a wide spectrum. Depending on the goal, the tool class is selected, which includes dependency analysis, deviation detection, predictive modeling, association discovery, classification, concept description, etc.

All the lofty goals notwithstanding, today the technology is not ripe enough to achieve a fully automated solution for DM. The reasons are manifold. On one side, the goals (ie, the gold you are looking for) is not well-defined. We only indicate that it must be interesting and useful. Both these are heavily domain dependent. What is useful for one domain may not be for another. Thus, in most applications, human control and guidance plays a significant role.

The basic functions in a DM system are

By nature, DM is an iterative process, repeated with revised estimates/questions, till satisfactory/useful answers are obtained.

Issues

A major requirement of DM is that the knowledge it uncovers must be interesting. Typically, assessing this automatically is nearly impossible. In some restricted contexts, quantitative bounds can be specified. For example, if the price varies more than 10% on a stock in a day, that variation is likely to be interesting. In general, interestingness and utility are heavily context and use dependent. Discovering that all those who are pregnant are female is, obviously, not a useful information for the doctors; whereas observing that people with BMI above a certain threshold are not pregnant may be an important observation. Distinguishing these two cases require substantial amount of domain knowledge and common-sense.

In a database, almost no rule will hold to 100% certainty. If nothing else, errors in data entry, may show up examples where the discovered pattern is violated. Hence, insisting on patterns that are valid on the entire database, is likely to cause loss of useful information also. Thus in general, systems allow a measure of certainty to be associated with discovered knowledge, which provides a feel for the human user.

From the earlier description of DM, it is clear that DM's search space of exploration is quite vast. The actual definition of search space depends on the type of knowledge representation pattern employed to represent the derived knowledge. In general, the number of possible patterns that can emerge from a database is of exponential complexity, making an exhaustive search impractical. Though one cannot frame general rules or procedures to say which part of the space will yield useful knowledge, there is a lot that can be done to control the complexity and prevent blind alleys, using background knowledge about the field. Thus domain knowledge plays a critical role in making DM practical. Some of these roles are:

The databases generally used for DM, are created for routine uses and hence likely to contain incomplete or even inconsistent entries. The affected fields may be very important from a DM point of view. Since DM is unlikely to have been one of the aims of existing databases, often parameters or attributes which are useful or important for such analysis are absent.

In addition, one must also worry about the nature of the attributes. Time is one dimension which affects validity of values in a record. We have constant value fields (such as name, registration number, etc), slowly changing values (weight of a person) and highly time critical values (such as temperature, pulse rate, etc). For the first two, one can estimate missing values if values suitably close to the time-period are known. But values of fields such as the last one, can not be estimated.

Knowledge Representation

A major issue in DM is the form in which the knowledge is to be represented. This has significant repercussions on the efficiency of search, intelligibility and power of the system. In most cases, the knowledge is meant to be used for human use - in planning, etc. Hence a symbolic representation is preferred. Some form of uncertainty representation such as, probabilities, belief measures, fuzzy values, etc, are normally added to capture the extent of uncertainty associated with the derived knowledge.

Some of the commonly used representation models are described briefly below:

In the next section, we will explore some of the sister disciplines which have a say in the development of DM. We will also outline the contributions.

Relevant Technologies

As mentioned earlier, DM is a multi-disciplinary area. The use of a large database as the data source brings in the area of database systems. The use of heuristic search and learning involved brings in AI and the associated discipline of machine learning. Very often, discoveries have to be made in the strength of statistical occurrences of patterns in the database. This brings in use of statistics. The similarity to human discovery of ideas/concepts in scientific discovery makes that another area to watch for. Let us look at each of these briefly.

Databases

In almost all cases reported, the source of data for DM has been a conventional database. Typically quite large in size, running to millions of records. Although conventional database techniques are not optimal for DM algorithms, invariably researchers have found the resulting overhead a small price to pay for the advantage of having a ready-made efficient and reliable mechanism. The algorithms for mining are tailored to understand and workaround the resulting limitations of a database system.

AI

A pioneer in DM, Gregory P Shapiro, summarises the role that AI techniques play in this inter-disciplinary area: "In my experience, KBS part in a KDD system is perhaps 5% or less of the total system, but it is the most critical part - like the human brain, which is also less than 5% of the weight of the body."

Given the vague definition of what one is looking for and the vast search space, heuristic search and use of domain knowledge becomes important constituents of the task. And Machine Learning [Shavlik and Dietterich, 1990] has been an active AI discipline.

Machine Learning

Faced with the difficulty of acquiring all the required knowledge by human input (and interaction with human experts), machine learning became an important area of AI research early enough. A variety of models have been proposed, ranging from inductive learning to explanation based learning. Since in the context of DM, human expertise to explain items will not be practical and the data source is a bunch of worked out examples, inductive learning techniques are the most popular ML paradigm for DM. It must be noted that, in typical ML applications, the examples are hand-crafted by humans and hence are normally small and reliable. In the DM context, neither of these are true.

Machine learning techniques [Shavlik and Dietterich, 1990] in this context can be categorised as two: supervised and unsupervised. In the case of supervised learning, each instance in the training set is tagged with the actual class name. The system's task is to find descriptions which identify all the instances of a class. In unsupervised learning, no such tagging is available. This is typically called clustering; the task being to group 'similar' instances using some kind of a distance measure. Each resulting group forms a class.

In general, given a set of instances, there will be more than one description which fits the classification of the instances. A guiding factor commonly used in such cases for a choice is Ockham's razor. This says, choose the simpler one. The idea here is that simpler rules are likely to be more realistic.

Inducing class descriptions from examples is generally called inductive learning. There are many algorithms for this. A popular one represents the description as a decision tree and uses statistical measurements in choosing a suitable tree. This is the ID (inductive dichotomizer) family. This is now part of many expert system shells.

Rule induction has been applied with interesting results in many practical applications. [Langley and Simon, 1995] describes many applications in this area, including chemical process control, making credit decisions, diagnosis of mechanical devices, classification of celestial objects, reducing banding in rotogravure printing, improving separation of gas from oil and preventing breakdown of electrical transformers.

ID3 works with a training set of instances described in terms of attribute-value combinations. The model arrives at a decision tree (See section 4) made using these attributes. In order to decide on the attribute to be used at a specific node, ID3 uses an information theoretic measure. The available attributes are considered one at a time and the resulting break-up of instances is examined. The attribute which maximises the information gain is selected [Quinlan, 1993]. Using this approach, ID3 is able to generate fairly compact trees.

Although these systems attain a high classification accuracy, they are not able to make use of any domain knowledge in guiding them. Systems which generate decision rules, eg AQ15, also offer the possibility of doing constructive induction. These approaches find it hard to handle incompleteness and inconsistency; they resort to suitable pre or post processing of the database for this purpose. Different extensions of ID3 for handling noise, missing values, etc have also been proposed.

ILP, Inductive Logic Programming, [Muggleton, 1991;1992] is another emerging area in this context. Offering the power of full FOL in the knowledge it derives and the ability to use domain knowledge are two strong points of ILP. In ILP, the task is to create a FOL program, which together with the domain knowledge specified can explain every element of the training set. FOL, Golem and Progol are examples of systems in this category.

Statistics

Statistical procedures and techniques are widely used in many of the DM tools available today. Similarity detection, deviation analysis, dependency analysis, etc are some of the areas which uses statistical models. [Scheines and Spirtes, 1992] describes use of statistical models for finding latent variables in databases. [Klosgen, 1992] describes Explora, a DM tool based on statistical analyses.

Neural Networks

Neural networks [Haykin, 1994], though very popular in AI, has found little role to play in DM so far. This is largely due to the fact that they produce numerical networks which are not intelligible to humans. Thus the knowledge derived is not useful for transfer to other domains or tasks. Another draw back of NN is the inability to use domain knowledge. Given that random exploration is likely to be quite expensive, this becomes a serious concern.

Parallel Computing

Given the complexity of the task, there is no short cut to almost exhaustive enumeration of the search space. More so, because you are not looking for a well-defined solution, but exploring for any possible surprises. We have no definition for what constitutes a surprise/interestingness. Therefore, this is an application which can use any amount of computing power. And naturally, this has now become the target of many parallel processor vendors. There are DM tools such as Orchestrate, which comes with a parallel processor hardware. Thinking Machines Inc, on returning from bankruptcy, has decided to target at the market of Parallel Processing and Data Mining, combined.

Example Applications

In this section, we will examine a number of applications of DM in a range of areas and tasks.

SKICAT (Sky Image Cataloguing and Analysis Tool), was developed by the team of Usama Fayyad at Jet Propulsion Laboratory, USA for analysing photographs of sky and identify objects. A survey of Northern sky was done by Palomer Observatory, which yields about 3000 plates which captures about 2 billion celestial objects. The task was to analyse these images and classify the objects as stars, galaxies, quasars etc. It is reported that out of a similar survey conducted previously, only 5% of the images were ever processed for analysis. The result of using SKICAT for the task was impressive. They could analyse images which are one magnitude fainter than what was possible with manual analysis and obtain 94% accuracy. The effort identified 10 new high-red-shift quasars using this. This was important from an astronomic point of view, since these correspond to periods much closer to the birth of the universe, compared to what people have been able to do so far. A press release dated December 1, 95 says, "Astronomers have discovered 16 new extremely distant quasars, the result of a search made nearly 40 times more efficient than previously possible by applying AI to the new Palomer digital sky survey."

HEALTH-KEFIR is a data mining application developed at the GTE Labs - one of the leading centres in this area. KEFIR stands Key Findings Reporter. The task was to discover keyfindings in large relational databases. The system examines the health-care data of GTE employees with a view to reduce the company expense on this account, by identifying more cost effective remedial measures. The data are examined in many different ways and "interesting" trends are reported. For example, an increase in pregnancy will not be very useful (GTE may not want to reduce this rate!); whereas an increase in premature births will be useful to know. GTE could invest in educating the would-be mothers and take advance action to reduce the risk of a premature birth. This will improve the value of the health-care, at the same time reducing the expenses incurred. In the context of this system, "interesting" patterns are those for which GTE has procedures to improve the outcome and/or reduce costs. The system was developed on SUN Sparc station with Informix database, Tcl, C and SQL interface. GTE is currently planning to make this into a general package for health-care database analysis.

Another application in the health-care side was reported from Southern California Spinal Disorders Hospital in Los Angeles. They are using the IDIS system to discover subtle factors affecting success/failure in back surgery.

US Department of Treasury has been reported to be using a DM system to watch transactions. It handles 200,000 transactions a week, ruling out any manual procedures of inspection. The system has identified about a billion dollar of potential laundering and has led to one criminal conviction. Similarly, the US Internal Revenue Service has developed a system to detect fraud and improve tax collection. This system is PC based and was developed using packages Knowledgeseeker, ModelWare and AIM.

Victoria's Secret Stores is a market-chain specialising in dress materials such as lacy lingerie. It is a 1.3 billion dollar company with 670 shops and a range of about 1000 items. They have been using an inventory model based on the notion of an average shop. Recently, they invested 5 million dollars in a data warehousing project. In the process, they discovered to their surprise, there is no concept of an average shop in businesses such as these. For example, they noticed that when the average shop is assumed to sell an equal number of black and ivory bras, in Miami area, the ivory bras sell 10 times as many as the black ones. Thus the inventory model for Miami has to be different from the others. Similarly, they also found that discounts need not be announced globally. There are areas where people can afford to pay the price, where the products may be priced without discounts, without any loss of business. The stores expects to save as much as 3 million dollars per year, by this move alone.

This is an example of using a data analysis tool to clean-up and reorganise a database. Office of Children Administrative Research (OCAR) of the Dept of Social and Health Services has a record of 4.5 million records. Each entry is for a child and has fields for name, social security number and date of birth. But as it usually happens, most of these fields are left blank or incorrectly filled. So identifying records related to a specific child was a mammoth task. A system called 'data cleaner' was used for this task to gather related records together. The system provides the user with a rule-based notation to specify equivalent records. OCAR has been able to obtain over 97% accuracy in this job with this tool.

This one is in the sports area. This story was reported in Wall Street Journal on 22 March 1996. The new discovery here is that, talent is essential, but technology can be important. For long time, sports coaches have been using video recording of games to study and improve performances of players. This is a cumbersome process, since the extent of video can be long and there is no clear identification of what to watch for. NBA has been exploring use of high tech weapons to improve performance of its basketball teams, when it encountered Inderpal Bhandari of IBM, who was working on a DM system based on attribute focusing. He explored the sports domain and discovered some patterns on an experimental basis. Some of them were found to be known to the coaches, but there were some they did not know, for example, Charles Smith led team in shots blocked. The coaches were impressed and Bhandari was asked to take up the project. The system was remodelled as Advanced Scout and is being widely used by NBA to analyse playing patterns in the games.

Some of the other major beneficiaries of use of DM tools/technology include American Express (fraud detection), the financial giant JP Morgan, etc.

The following story illustrates a frightening aspect of the power of DM techniques. Beverly Cook, a researcher at the University of Wisconsin used the IDIS package to analyse death-penalty related votes and opinions of a supreme court justice. The analysis revealed that the behaviour of the judge was linked to the race of the accused. Such discoveries can have significant implications to the judicial system as a whole. Today, as the technology of DM is maturing, people are also worried about privacy issues more than ever before.

Tools Available

The commercial world is fast reacting to the growth and potential in this area. A wide range of tools are marketed under the label of DM. Unfortunately, many of them are simple report generators or database query languages, which have no relevance to DM at all.

The tools range from data cleaning tools to sophisticated visualisation tools (for visual inspection of patterns to help identify interesting trends) and application specific DM suites. Many tools offer a suite of techniques to choose from.

Many of the tools are priced at a high-end, indicating the niche they are targetting. For example, IDIS on PC costs 1900 USD which requires a server licence in addition depending on the number of records being processed. For a one million record, the fee is 25,000 USD. A tool called PRISM costs 400,000 USD for a 1-million record database. DM Workstation, another system, costs around 50,000 USD and Analytix is upwards of 100,000 USD.

General purpose tools typically require a significant amount of domain knowledge to be applied during the mining, by the user. This requires much background knowledge to use the system. Thus there is a move to offer application oriented tools. For example, AT&T offers "Sales and Marketing Solution Packs" for vertical markets. The cost of this, including s/w, h/w and services, is over 250,000 USD. GTE is commercialising its KEFIR system for health-care applications.

Conclusion

An indication of the rapid growth of this area is the growth of the yearly conference on Data Mining. The number of papers are increasing steadily; KDD'96 is reported to have received over 200. There are two journals recently announced in this area: "Intelligent Data Analysis - an international journal" from Elsevier and "The Journal of Data Mining and Knowledge Discovery" by Kluwer Academic Publishers. Both of these are due in 96-97. A report of the NSF Workshop on the future of database research held in May 1995 [Silberschatz et al, 1995] mentions Data mining and Data warehousing as important developments from the database research point of view.

DM is becoming a strategically important area for many business organisations, opening up novel avenues. IBM has also entered the market in a big-way by announcing its `decision support offerings for mining', which includes a mining package as well as consultancy and application suites. The kit includes ready-made, but customisable, applications such as customer segmentation, item set analysis and fraud detection. Claims of DM h/w and s/w market goes as high as a billion dollars.

E X DeJesus introducing the topic of DM, in Byte October 1995, takes a look at the future: "Computers may reveal new treatments for diseases or new insights into the nature of the universe. We may well see the day when the Nobel prize for a great discovery is awarded to a search algorithm."

References

[Agrawal et al, 1993] Agrawal Rakesh, Imielinski Tomasz and Swami Arun. Database Mining: A Performance Perspective. IEEE transactions on data and knowledge engineering, Vol 5, No 6, December 1993.

[Frawley et al, 1992] Frawley WJ, Shapiro GP and Mathews CJ. Knowledge Discovery in Databases: An Overview. AI Magazine, Vol 14, No 3, Fall 1992.

[Haykin, 1994] Haykin Simon. Neural Networks: A Comprehensive Foundation. IEEE Press, 1994.

[Holsheimer and Siebes, 1994] Holsheimer M and Siebes APJM. Data Mining: The Search for Knowledge in Databases. CS-R9406, CWI Technical Report, 1994.

IBM White paper on Data mining available from:
http://booksrv2.raleigh.ibm.com/cgi-bin/bookmgr/bookmgr.cmd/BOOKS/datamine/

[Kantola et al, 1992] Kantola M, Mannila H, Raiha K and Siirtola H. Discovering Functional and Inclusion Dependencies in Relational Databases. International Journal of Intelligent systems, Vol 7, 1992.

[KDD Nuggets, 1995-1996]. This is a bi-weekly digest maintained and produced by Gregory P Shapiro and mailed free to those interested in KDD. There is a wealth of uptodate information in these. Contact: http:// info.gte.com/~kdd.

[Klosgen, 1992] Klosgen Willi. Problems for Knowledge Discovery in Databases and Their treatment in the Statistics Interpreter Explora. International Journal of Intelligent systems, Vol 7, 1992.

[Langley and Simon, 1995] Pat Langley and Herbert A Simon. Applications of Machine Learning and Rule Induction. CACM, November 1995.

[Lewinson, 1993] Lisa Lewinson. Data Mining: Intelligent Technology Gets Down to Business. PC-AI, Vol 7, No 1, 1993. A non-technical introduction to the area.

[Major and Riedinger, 1992] Major John A and Riedinger Dan R. EFD: A Hybrid Knowledge/Statistical-based System for the Detection of Fraud. International Journal of Intelligent Systems, Vol 7, 1992.

[Mathews et al, 1993] Mathews CJ, Chan PK, Shapiro GP. Systems for Knowledge Discovery in Databases. IEEE transactions on data and knowledge engineering, Vol 5, No 6, December 1993.

[Muggleton, 1991] S Muggleton. Inductive Logic Programming. New generation computing, 8/4, 1991.

[Muggleton, 1992] S Muggleton (ed). Inductive Logic Programming. Academic press, London 1992.

[Scheines and Spirtes, 1992] R Scheines and P Spirtes. Finding Latent Variable Models in Large Databases. International Journal of Intelligent Systems, Vol 7, 1992.

[Shapiro and Mathews, 1992] Shapiro GP, Mathews CJ. Knowledge Discovery Workbench for Exploring Business Databases. International Journal of Intelligent Systems, Vol 7, 1992.

[Shapiro et al, 1994] Shapiro GP, Matheus C, Smyth P and Uthuruswamy R. KDD-93: Progress and Challenges in Knowledge Discovery in Databases. AI Magazine, Spril 1994.

[Shavlik and Dietterich, 1990] Shavlik JW and Dietterich TG. Readings in Machine Learning. Morgan Kaufmann, 1990.

[Silberschatz et al, 1995] Silberschatz A, Stonebraker M and Ullman J (ed). Database Research: Achievements and Opportunities into the 21st Century. NSF Workshop report, May 1995.

[Quinlan, 1989] Quinlan JR. Learning Relations: Comparison of a Symbolic and a Connectionist Approach. TR 346, Univ of Sydney, 1989.

[Quinlan, 1993] JR Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.