Data mining

The motivation behind mining data, whether commercial or scientific, is the same: the need to find useful information in data to enable better decision making or a better understanding of the world. Traditionally, data analysts have turned to data mining techniques when their data has become too large for manual or visual analysis. In the science and engineering domains, the size of the data is only one reason why data mining techniques are gaining popularity. Science data in areas such as remote sensing, astronomy, and computer simulations is routinely measured in terabytes and petabytes. However, what makes the analysis of these data sets challenging is not just their size but the complexity of the data. Advances in technology have introduced complexity into scientific data, complexity that can take various forms: multi-sensor, multi-spectral, multi-resolution data; spatiotemporal data; high-dimensional data; structured and unstructured mesh data from simulations; data contaminated with different types of noise; three-dimensional data; and so on (2003). As a result of this complexity, visual data analysis, given its subjective nature and the human limitations in absorbing detail, is becoming impractical even for moderate-sized scientific data sets; for large to massive data sets, visual analysis is practically impossible. Science and engineering data sets therefore provide a very rich environment for the application of data mining (2003). This paper will discuss the IRIS data mining software and apply various classification and clustering techniques to the different IRIS varieties.

IRIS data mining software

Data mining is a set of dissimilar and independent analytical tools, such as neural networks, decision tree algorithms, logistic regression, multiple regression, fuzzy logic, genetic algorithms, clustering, market basket analysis, and other analytical methods. Each of these techniques is designed for a certain purpose in the analytical arena, with its own goal in mind, its own set of assumptions, its own data structure requirements, its own range of applicability, and its own way of interpreting results (2003). Data mining software is used to sift through large amounts of data to discover meaningful relationships that were previously unknown, and then to use the resulting business intelligence to make crucial business decisions. Data mining allows different individuals to retrieve as much or as little data and information as they need. They can retrieve current detailed data and information from within or outside the organization and summarize it by any desired category (2003).

 

Data mining products provide a basic analysis capability and the ability to drill down to obtain more detail and to summarize details as necessary. Additionally, some data mining products go beyond basic analysis capabilities by providing statistical and mathematical routines to calculate the coefficients and powers of pre-specified independent variables. This is known as curve fitting or trend analysis (2003). Trend analysis is used to determine patterns and relationships, and whether key measurements are still within their limits and expectations. It is not difficult to determine patterns when the variables are known. Mathematical techniques are available to determine the relationship of one dependent variable to several independent variables. However, the difficulty is identifying key patterns and trends when the analyst does not know the independent variables or, for that matter, may not even know which dependent variable should be analyzed. This is where more sophisticated and powerful knowledge-based techniques can be useful (2003).
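The curve fitting described above can be sketched as a least-squares fit of a straight-line trend to a pre-specified independent variable. The data points below are hypothetical, purely for illustration:

```python
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# A hypothetical upward trend in some measured quantity over five periods.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
```

A trend-analysis routine would compare the fitted slope against the expected limits for the measurement; this sketch assumes the simplest linear model, while real products also fit higher powers of the independent variables.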

 

Some data mining products utilize data-visualization techniques to present these relationships graphically. The analyst can change the scale, display format, and presented factors to better represent the relationships. The relationship need not be a causal one, nor is it necessarily readily apparent (2003). Data mining software is capable of slicing and dicing a company's database, looking for patterns in order to provide the most useful nuggets to guide marketing efforts. Data mining solutions help a company provide sophisticated analysis functionality for predictive modelling, trend analysis, application scoring, and customer segmentation. The software helps identify the most profitable customers, detect and define customer attrition trends, determine optimal timing for product rollouts, model customer buying behavior, and discover previously unrecognized patterns in customer data that lead to new marketing opportunities (2001). IRIS provides visual analysis of spatially referenced data. It makes use of thematic maps to present data to its user. Its generic knowledge about combining and presenting statistical data in thematic maps is encoded as heuristic rules and used together with specialized applications.

Classification technique

The decision tree is constructed by using a single variable to partition the data set into one or more subsets. After each step, each of the partitions is partitioned as if it were a separate data set; each subset is partitioned without any regard for the other subsets. This process is repeated until the stopping criteria are met. This recursive partitioning creates a tree structure. The root of the tree is the entire data set (2003). The subsets and sub-subsets form the branches. When a subset meets the stopping criteria and is not repartitioned, it is a leaf. Each subset in the final tree is called a node. The major difference between a decision tree and a multiple partition decision tree is the number of variables used for partitioning the data set. In multiple partitioning, the number of subsets for a variable can range from two to the number of unique values of the predictor variable (2003).
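The recursive partitioning described above can be sketched as follows: each step splits the current subset on a single binary variable, and a subset becomes a leaf when the stopping criteria are met (here, purity or minimum size). The variable name and the toy records are illustrative, not taken from the IRIS data itself:

```python
def partition(records, variables, min_size=2):
    """Recursively partition (features, label) records into a tree of subsets."""
    labels = {label for _, label in records}
    # Leaf: stopping criteria met, so this subset is not repartitioned.
    if len(labels) == 1 or len(records) < min_size or not variables:
        return {"leaf": True, "labels": labels, "size": len(records)}
    # In practice the most significant variable is chosen; this sketch
    # simply takes the next available variable.
    var = variables[0]
    left = [r for r in records if r[0][var]]
    right = [r for r in records if not r[0][var]]
    return {"leaf": False, "split_on": var,
            "left": partition(left, variables[1:], min_size),
            "right": partition(right, variables[1:], min_size)}

# Hypothetical binary attribute distinguishing two iris varieties.
data = [({"wide_petal": True}, "virginica"),
        ({"wide_petal": True}, "virginica"),
        ({"wide_petal": False}, "setosa"),
        ({"wide_petal": False}, "setosa")]
tree = partition(data, ["wide_petal"])
```

The root holds the entire data set, the two branches are the subsets produced by the split, and both become leaves because each is pure.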

 

The predictor variable used to form a partition is chosen to be the variable that is most significantly associated with the dependent variable. A chi-square test of independence for a contingency table is used as the measure of association. The stopping criterion is the p value for the chi-square test. If a predictor has more than two categories, there may be a large number of ways to partition the data. A combinatorial search algorithm is used to find a partition that has a small p value for the chi-square test (2003). This algorithm does not specifically support continuous explanatory variables; continuous variables are treated as ordinal, and only values of a variable that are adjacent to each other can be combined. With a large number of continuous variables and a large number of objects, this can have a large impact on execution time (2003). Tree models are fit in a forward stepwise fashion. The analysis begins with all items classified as measuring a single undifferentiated skill. Potential improvements to this model are evaluated by using a recursive partitioning algorithm to estimate the reductions in unexplained variation resulting from all possible splits of all possible hypothesized skills. This evaluation is accomplished using deviance, a statistical measure of the unexplained variation remaining after each new variable is added to the model. Deviance is calculated as the sum of squared differences between the observed and predicted values of item difficulty (2003). The next section will classify the variables using GINI splits.
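The two split criteria mentioned above can be sketched directly: the chi-square statistic measures the association between a candidate partition and the class labels, and Gini impurity (the basis of the GINI splits) measures how mixed a subset's labels are. The contingency table and labels below are illustrative:

```python
def chi_square(table):
    """Chi-square statistic of independence for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Rows: two candidate partitions; columns: counts of two class labels.
association = chi_square([[10, 20], [20, 10]])
impurity = gini(["setosa", "setosa", "virginica", "virginica"])
```

A larger chi-square statistic (a smaller p value) indicates a stronger association and hence a better split; a GINI split is chosen to minimize the weighted impurity of the resulting subsets.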

Clustering

As an applied statistical technique, cluster analysis has been studied extensively for more than 40 years and across many disciplines, including the social sciences. This is because clustering is a fundamental data analysis step and a pattern is a very general concept. One may desire to group similar species, sounds, gene sequences, images, signals, or database records, among other possibilities. Clustering has also been studied in the fields of machine learning and statistical pattern recognition as a type of unsupervised learning, because it does not rely on predefined class-labelled training examples. However, serious efforts to perform effective and efficient clustering on large data sets started only in recent years with the emergence of data mining (2003).

 

There are also a variety of soft clustering techniques, such as those based on fuzzy logic or statistical mechanics, wherein a data point may belong to multiple clusters with different degrees of membership. The main reason there is such a diversity of techniques is that although the clustering problem is easy to conceptualize, it may be quite difficult to solve in specific instances. Moreover, the quality of clustering obtained by a given method is very data dependent, and although some methods are uniformly inferior, there is no method that works best over all types of data sets (2003). Clustering is the process of organizing data into similar groups. In this data, datasets b and c share the same qualities and characteristics, so these two datasets should be the ones grouped together.
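The idea of organizing data into similar groups can be sketched with a tiny k-means loop, one of the simplest hard clustering methods (each point belongs to exactly one cluster, unlike the soft techniques mentioned above). The one-dimensional points and initial centers are illustrative:

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of points.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = kmeans_1d(points, [0.0, 10.0])
```

The loop converges with the two centers at the group means, grouping the three low values together and the three high values together, just as the similar datasets b and c would be grouped.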

References




