accuracy In the context of a supervised model, accuracy refers to the fraction of cases for which the model's prediction matches the actual target value, typically measured on held-aside test data.
algorithm A specific technique or procedure for producing a data mining model. An algorithm uses a specific model representation and may support one or more functional areas. Examples include decision trees, backpropagation neural networks, and Naïve Bayes for supervised mining functions, and Apriori for the association mining function.
algorithm settings A collection of settings, or parameters, to affect algorithm-specific behavior during model building.
anomaly detection A mining function that produces models for detecting deviations from the norm in a dataset. The data provided for model building consists of normal cases from which an anomaly detection algorithm learns patterns that are captured in the resulting model. Applying the model flags cases that deviate or are unusual from the normal cases in some way.
antecedent In an association rule, the left-hand side is called the antecedent. For example, in the rule “If A, then B,” “A” is the antecedent. See also consequent.
API Application Program Interface.
apply The data mining operation that scores data using a data mining model (i.e., it applies a model to apply data to produce results according to apply settings). In JDM, apply is performed using an apply task.
apply data The data used as input when applying a model.
apply settings A user specification detailing the output desired from applying a model to data. This output may include predicted values, associated probabilities, key values, and other supplementary data.
apply task A task that when executed applies the specified model to the apply data. The results are produced according to the apply settings.
association A data mining technique that identifies relationships among items, producing association rules. One of the JDM mining functions.
association rules Association rules capture co-occurrence of items among transactions. A typical rule is an implication of the form A ⇒ B, which means that the presence of itemset A implies the presence of itemset B with certain support and confidence. The support of the rule is the ratio of the number of transactions where the itemsets A and B are present to the total number of transactions. The confidence of the rule is the ratio of the number of transactions where the itemsets A and B are present to the number of transactions where itemset A is present. See also antecedent and consequent.
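The support and confidence ratios can be sketched as follows; this is an illustrative Python fragment with hypothetical transactions, not JDM API code:

```python
# Hypothetical market baskets; item names are illustrative only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Rule: {milk} => {bread}
rule_support = support({"milk", "bread"}, transactions)       # 3 of 5 baskets
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
```

Here the rule "milk ⇒ bread" has support 3/5 = 0.6 and confidence 0.6/0.8 = 0.75.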
attribute A generic column of data, minimally with a name and datatype. There are several specializations of attribute; see logical attribute, physical attribute, and signature attribute. Attributes are used in statistics, data mining, and other disciplines to describe observations, objects, data records, and other entities. Attributes are also referred to as variables, fields, columns, dimensions, features, and properties. Attributes are often categorized with regard to their mathematical properties, that is, in terms of the intrinsic organization or structure of the associated values (or value range or scale).
attribute assignment The mapping of one attribute to another used to associate input data with a model’s attributes, or a model’s output with an output table.
attribute importance A data mining technique that ranks the attributes in order of influence to predicting a target, or importance to model quality. One of the JDM mining functions. Also, a measure of the importance of an attribute within a model as recorded in the model signature.
attribute type Specifies how a logical attribute is to be interpreted during model building. Commonly, four types of attributes are distinguished: nominal or categorical attributes, ordinal or rank attributes, interval attributes, and real or real-valued attributes (also called true measures). JDM restricts itself to three types: categorical, numerical, and ordinal.
attribute usage Specifies how a logical attribute is to be used when building a model: active, supplementary, or inactive.
binning A data mining transformation which maps a set of input values to a smaller set of bins. The input values may be discrete or continuous.
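One simple binning scheme, equal-width binning, can be sketched as follows (the attribute values are hypothetical):

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to one of n_bins equal-width bins.

    Returns a list of bin indices in the range 0 .. n_bins-1.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        i = int((v - lo) / width)
        indices.append(min(i, n_bins - 1))  # clamp the maximum into the last bin
    return indices

ages = [18, 22, 35, 41, 58, 64]
bins = equal_width_bins(ages, 3)  # three bins: young, middle, older
```

Other schemes, such as equal-frequency binning, place roughly the same number of cases in each bin instead.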
build The data mining operation that produces a model.
build data The data used as input to building a model. Sometimes referred to as the training data.
build settings A collection of settings, or parameters, specifying the type of data mining model to build, including mining function and algorithm settings. Build settings exist for each of the mining functions, including: classification, regression, association, sequences, attribute importance, and clustering.
build task A task that when executed builds a model as specified by the build settings.
case A collection of related attribute values used as input to model building, testing, or scoring. In a simple table, a case corresponds to an individual record. In transactional format data, a case may be represented by multiple records, where columns play the roles of identifier, attribute name, and attribute value. See also single record case and multi-record case.
case identifier The unique identifier associated with a case. Also referred to as “case id.”
categorical attribute An attribute where the values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA, etc.). Categorical attributes are either nonordered (nominal) like state, gender, and so on, or ordered (ordinal) such as high, medium, or low temperatures. Categorical attributes tell us which of several categories a thing belongs to. For example, we can say that a beverage is BEER, LIQUOR, SOFTDRINK, or WINE. Categorical attributes exhibit the lowest degree of organization, since the set of categories such an attribute may assume possesses no systematic intrinsic organization or order. The only relation between the categories of such attributes is the identity relation, that is, whether two categories are equal. The lack of an order relation makes it impossible to tell if one attribute category is greater than another, or if one category is closer to another.
category A distinct value of a categorical attribute. Also referred to as a class.
category set A named collection of related categories.
centroid A cluster centroid is a vector that encodes, for each logical attribute, either the mean (numerical attributes) or the mode (categorical attributes) of the cases in the build data assigned to a cluster.
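Computing a centroid can be sketched as follows; the cases and attribute names are hypothetical:

```python
from statistics import mean, mode

# Hypothetical cases assigned to one cluster.
cluster_cases = [
    {"age": 30, "income": 50000, "region": "east"},
    {"age": 40, "income": 60000, "region": "east"},
    {"age": 35, "income": 70000, "region": "west"},
]

def centroid(cases):
    """Mean for numerical attributes, mode for categorical ones."""
    result = {}
    for attr in cases[0]:
        values = [c[attr] for c in cases]
        if all(isinstance(v, (int, float)) for v in values):
            result[attr] = mean(values)   # numerical: mean
        else:
            result[attr] = mode(values)   # categorical: most frequent value
    return result

c = centroid(cluster_cases)
```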
classification A supervised data mining technique that produces a model capable of classifying cases into categories or assigning cases to categories. A classification model requires a categorical target attribute in the build dataset. One of the JDM mining functions.
cluster A collection of cases that are similar to one another as determined by a clustering mining function. A cluster can be defined by its centroid, or by an area determined by an attribute vector space—a set of attribute value ranges (numerical) and attribute values (categorical). Predicate rules involving the cluster attributes are often used to define clusters in a human-understandable way.
clustering An unsupervised data mining technique that, given a set of cases, each having a set of attributes, and a similarity measure among them, groups the cases into clusters such that cases in the same cluster are more similar to one another and cases in different clusters are less similar to one another. One of the JDM mining functions.
confusion matrix A table of counts of the actual versus predicted class values. It indicates where the model correctly predicted outcomes, and where it became confused or made mistakes.
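Building a confusion matrix from test results can be sketched as follows (the actual and predicted values are hypothetical):

```python
from collections import Counter

# Hypothetical actual and predicted target values from a test dataset.
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

# Count (actual, predicted) pairs; rows are actual categories,
# columns are predicted categories.
matrix = Counter(zip(actual, predicted))

# The diagonal (actual == predicted) gives the correct predictions,
# from which accuracy follows directly.
correct = sum(n for (a, p), n in matrix.items() if a == p)
accuracy = correct / len(actual)
```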
consequent In an association rule, the right-hand side is called the consequent. For example, in the rule “If A, then B,” “B” is the consequent. See also antecedent.
cost matrix A two-dimensional, N × N table that defines the cost associated with incorrect predictions. A cost matrix is typically used in classification models, where N is the number of distinct categories in the target, and the columns (reflecting predicted categories) and rows (reflecting actual categories) are labeled with target categories.
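One common use of a cost matrix is to pick the predicted category with the lowest expected cost rather than the highest probability. A sketch with a hypothetical 2 × 2 matrix:

```python
# Hypothetical cost matrix: cost[actual][predicted].
# Misclassifying an actual "fraud" case as "ok" is 10x worse than the reverse.
cost = {
    "fraud": {"fraud": 0, "ok": 10},
    "ok":    {"fraud": 1, "ok": 0},
}

def min_cost_prediction(probabilities):
    """Pick the predicted category with the lowest expected cost,
    given per-category probabilities from a classifier."""
    categories = list(cost)
    def expected_cost(predicted):
        return sum(probabilities[actual] * cost[actual][predicted]
                   for actual in categories)
    return min(categories, key=expected_cost)

# With equal costs this case would be classified "ok" (p = 0.8),
# but the cost matrix tips the decision toward "fraud".
choice = min_cost_prediction({"fraud": 0.2, "ok": 0.8})
```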
cross validation A method of evaluating the accuracy of a classification or regression model, typically used when there are relatively few cases to divide between build and test datasets. In cross validation, the build data is divided into several parts, with each part in turn being used to evaluate a model built using the remaining parts.
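The partitioning used in k-fold cross validation can be sketched as follows (the dataset here is a hypothetical list of case identifiers):

```python
def k_fold_splits(cases, k):
    """Yield (build, test) partitions: each fold in turn is held aside
    for evaluation while the remaining folds are used for building."""
    folds = [cases[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        build = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield build, test

data = list(range(10))
splits = list(k_fold_splits(data, 5))
```

Every case appears in exactly one test fold, so each case is evaluated by a model that never saw it during building.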
cycle In time series, cycle describes the cyclic behavior of the target attribute, or signal. A cycle can be periodic, or regular, that is, having the same number of values within the cycle period. Alternatively, a cycle can be seasonal, or irregular, that is, having an irregular number of values within the cycle period. For example, monthly cycles have an irregular number of days per month, whereas a day has a constant number of hours. Time series analysis can generate useful information about the periodicity or seasonality of a time series sequence.
data mining The process of discovering hidden, previously unknown and usable information from a large amount of data. This information is represented in a compact form, referred to as a model.
data mining engine (DME) The component in the JDM architecture that implements the algorithms to support data mining. The data mining engine may also support the persistent MOR.
data mining server (DMS) The component in the JDM architecture that implements the data mining engine and persistent MOR. This is distinguished from the data mining engine since a server implies a separate component as in a client-server architecture.
data preparation status An indication of whether a logical attribute provided as input to a build operation has been prepared by the user, or if the user expects the DME to perform automatic data preparation on the input data. A user may specify a logical attribute as prepared or unprepared.
DBMS Database Management System.
descriptive data mining Data mining that results in a transparent model that can be inspected to understand the process or behavior of a model. Effectively provides a characterization of a dataset in a concise and summary manner determined by the mining function and algorithm used. See also predictive data mining.
ensemble model A collection of primitive supervised data mining models (e.g., as produced from the classification mining function) that can be used together to improve model accuracy.
enterprise information system (EIS) Generically, the application or enterprise system that supports a set of business processes and information technology infrastructure. The business processes are provided as a set of services. In support of data mining, an instance of an enterprise information system can be a set of backend component(s) that provide data mining functionality to the enterprise.
explode A transformation that translates a discrete (categorical or ordinal) attribute into n attributes using the indicator or thermometer approach, where n corresponds to the cardinality of the attribute (number of distinct values). The indicator approach assigns the value 1 to the attribute that maps to the discrete value of the original attribute. The thermometer approach assigns the value 1 to the attribute that maps to the discrete value of the original attribute and all attributes that precede that value in the ordered sequence.
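The indicator and thermometer encodings can be sketched as follows (the attribute and its ordered values are hypothetical):

```python
def explode(value, ordered_values, style="indicator"):
    """Translate one discrete value into a vector of n 0/1 attributes,
    where n is the cardinality of the original attribute.

    indicator:   1 only at the position of the value.
    thermometer: 1 at the value's position and all preceding positions.
    """
    i = ordered_values.index(value)
    if style == "indicator":
        return [1 if j == i else 0 for j in range(len(ordered_values))]
    return [1 if j <= i else 0 for j in range(len(ordered_values))]

levels = ["low", "medium", "high"]
ind = explode("medium", levels, "indicator")
thermo = explode("medium", levels, "thermometer")
```

The indicator encoding of "medium" is [0, 1, 0]; the thermometer encoding is [1, 1, 0].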
export The operation that supports taking mining objects from within the DME and representing them in a transportable format for storage in an external system such as a file or database table cell, or for exchange with other systems or applications. In JDM, export is performed using an export task. See also import.
extension A feature that is not covered by any of the relevant JDM specifications, or a nonstandard implementation of a feature that is covered.
feature extraction An unsupervised mining technique that produces new attributes as combinations of input attributes, producing a reduced set of attributes containing more highly summarized information about those attributes.
feature selection The process of selecting the features (attributes) that are deemed important to producing a quality data mining model. Feature selection is done based on the importance computed using attribute importance algorithms. See also attribute importance.
generic interface An approach used in JDM 2.0 to enable vendors to specify name-value pairs of settings for build settings, algorithm settings, apply settings, and statistics settings. This provides a way for vendors to extend the standard settings while using a standard interface definition.
irregular component In time series, the random or chaotic noisy residuals of data after the time-dependent components have been removed, namely, the trend, periodic, and seasonal components. It results from short-term fluctuations in the series that are neither systematic nor predictable. In a highly irregular series, these fluctuations can dominate movements, which will mask the trend and seasonality.
import The operation that supports taking mining objects from an external system such as a file or database table cell and importing them to the DME and MOR. In JDM, import is performed using an import task. See also export.
incremental learning An aspect of model building that refines or enhances an existing model by taking into account new data, thereby avoiding the need to rebuild the model on the complete dataset.
item An element that can be compared against another to determine if they are different. Typically used in the context of association models. For market basket analysis, an item may correspond to a retail product.
itemset A set of items, typically used as an antecedent or consequent in a rule, as produced from an association model. No item in an itemset can appear more than once. Itemsets can be compared to determine if they are different.
Java Specification Request (JSR) The actual description of a proposed and final specification for the Java platform following Sun’s Java Community Process. See http://www.jcp.org.
JDM implementation A JDM technology-enabled client API, data mining engine, and mining object repository.
lift A measure of how well a classification model improves the identification or prediction of cases with the positive target value over a random selection, given actual results. Lift may also be used as a measure to compare different data mining models. Since lift is computed using a dataset with actual outcomes, lift compares how well a model performs with respect to this dataset on predicted outcomes. Lift allows a user to infer how a model will perform on new data.
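One common formulation computes lift within the top-scoring fraction of cases; a sketch with hypothetical scored test data:

```python
# Hypothetical scored test cases: (predicted probability of the
# positive class, actual outcome), available after applying the model.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0)]

def lift_at(scored, fraction):
    """Positive rate in the top-scoring fraction of cases divided by
    the overall positive rate (lift of 1.0 = no better than random)."""
    ranked = sorted(scored, key=lambda sp: sp[0], reverse=True)
    n = int(len(ranked) * fraction)
    top_rate = sum(actual for _, actual in ranked[:n]) / n
    overall_rate = sum(actual for _, actual in scored) / len(scored)
    return top_rate / overall_rate

top_quartile_lift = lift_at(scored, 0.25)
```

Here the top quartile is all positives (rate 1.0) against an overall rate of 0.5, for a lift of 2.0.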
logical attribute A description of a domain of data used as input to mining operations. Logical attributes specify attribute type and data preparation status, among other properties.
logical data A set of logical attributes used as input to building a data mining model.
mining function A major subdomain of data mining that shares common high-level characteristics. For JDM 1.1, functions include: classification, regression, attribute importance, association, and clustering.
mining object repository (MOR) The logical or physical architectural component that stores JDM mining objects, such as tasks, models, settings, and their components.
mining result The end product(s) of a mining operation. For example, a build task produces a mining model, a test task produces a test metrics object.
missing value A data value for an attribute of a case that is missing because it was not measured, not answered, unknown, or lost. Data mining methods vary in the way they treat missing values. Typically, they may ignore the missing values, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values.
missing values treatment A transformation that specifies how to replace missing values, for example, with the attribute mean or mode, a specific value, and so on.
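Mean and mode replacement can be sketched as follows (the attribute values are hypothetical, with None standing for a missing value):

```python
from statistics import mean, mode

def replace_missing(values, strategy="mean"):
    """Replace None entries with the mean (numerical) or mode
    (categorical) of the non-missing values."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]

incomes = [50000, None, 70000, 60000]
colors = ["red", "blue", None, "red"]
filled_incomes = replace_missing(incomes, "mean")
filled_colors = replace_missing(colors, "mode")
```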
model A compact representation of patterns found using historical data. A model is the result of executing a build task. Model representation is specific to the algorithm used. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent attribute or target) from other, known values (independent attributes or predictors).
model comparison A phase in the data mining process that involves comparing multiple models to select the model of highest quality or that best matches the needs of the business problem. Comparison can be based on various criteria, for example, maximum accuracy, minimum Type I error, and so on.
model detail The specific representation of a model that is algorithm dependent. For example, a decision tree has specific model detail of the tree nodes and their relationships.
model signature A collection of signature attributes, derived from the logical data used to build a model. The input data to a model for scoring must be compatible with the model signature.
multi-record case A representation of physical data that uses multiple records to store a single case. The data typically has three columns with roles of sequence id, attribute name, and value.
multi-target model A type of supervised model that can predict multiple targets, both categorical (classification) and numerical (regression). A multi-target model may be more efficient at representing the knowledge extracted during model building, and more efficient to compute.
normalization A transformation that maps numerical values to a particular numerical range, typically 0 … 1. There are several types of normalization (e.g., z-score, min-max, and shift-scale).
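Min-max and z-score normalization can be sketched as follows (input values are hypothetical):

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

def z_score(values):
    """Z-score normalization: zero mean, unit standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

scaled = min_max([10, 20, 30])   # mapped into 0 ... 1
zs = z_score([10, 20, 30])       # centered on zero
```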
numerical attribute An attribute whose values are numbers. The numeric value can be either an integer or a real number. See also categorical attribute and ordinal attribute.
OLAP Online Analytical Processing.
ordinal attribute An ordinal attribute is similar to a categorical attribute except that there is an order defined on the discrete categorical values, for example, temperature where the discrete values are high, medium, and low. There is an order defined on the values: high > medium > low. Ordinal attributes define a total order relation on the categories. For example, if x, y, and z are ranked 5, 6, and 7, we can tell x < y < z, but not whether (z − y) < (y − x). Consider the ordinal attribute speed that takes the following ranked categories: STATIONARY, SLOW, FAST, VERY FAST, where rank(STATIONARY) = 1, rank(SLOW) = 2, rank(FAST) = 3, and rank(VERY FAST) = 4. We can tell that SLOW represents a smaller speed value than FAST. However, it is not possible to tell if, for example, the difference between two adjacent values is the same or not: is the difference between SLOW and FAST equal to, smaller than, or greater than the difference between FAST and VERY FAST?
outlier A data value that did not (or is not thought to have) come from the typical population of data. Outliers are values that fall outside the boundaries that enclose most other values in the data. This can apply to values of an attribute, or to entire cases.
outlier treatment The approach to replacing outliers in numerical data attributes. There are several techniques including specifying explicit boundaries, percentages in the tails of the distribution, and number of standard deviations, such that values outside the valid range are replaced either by null values or edge values.
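The standard-deviation technique with edge or null replacement can be sketched as follows (the data values are hypothetical):

```python
from statistics import mean, pstdev

def clip_outliers(values, n_stddev=2, replace_with="edge"):
    """Treat values more than n_stddev standard deviations from the
    mean: replace them with the nearest edge value or with None (null)."""
    m, s = mean(values), pstdev(values)
    lo, hi = m - n_stddev * s, m + n_stddev * s
    out = []
    for v in values:
        if lo <= v <= hi:
            out.append(v)                        # within the valid range
        elif replace_with == "edge":
            out.append(lo if v < lo else hi)     # clamp to the boundary
        else:
            out.append(None)                     # replace with a null
    return out

data = [10, 12, 11, 13, 12, 100]  # 100 is a likely outlier
treated = clip_outliers(data, n_stddev=2)
```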
percentage A value between 0 and 100 that represents a part of a whole. For example, 75 percent indicates three quarters of a whole.
physical attribute An object that corresponds to a field in a formatted file, or column in a database table.
physical dataset Identifies data as a set of cases to be used as input to data mining. Using tasks, physical attributes can be mapped to logical attributes of a model’s signature or logical data of a build settings object. The data referenced by a physical dataset object can be used in model building, scoring (apply), lift computation, statistical analysis, etc.
physical data record A collection of named attribute values used as input and output for single or multi-record scoring.
predictor A logical attribute used as input to a supervised model or algorithm to build a model. Also referred to as an independent variable.
predictive data mining Data mining that results in a model by performing inference on build data, and attempting to predict outcomes for cases in apply datasets. See also descriptive data mining.
prior probabilities The set of prior probabilities, or priors, specifies the distribution of categories, or classes, in the original population. Through skewed sampling, such as stratified sampling, prior probabilities will differ from the distribution observed in the build dataset. Priors allow the algorithm to adjust predictions to reflect original population distributions.
probability A value between zero and one (0 … 1) that indicates the likelihood of an event. Zero indicates there is no chance of the event occurring. One indicates it is probabilistically certain the event will occur.
quality of fit In clustering, a value between zero and one that is a measure of how well a given case fits in the predicted cluster. Values closer to zero indicate a poor fit; values closer to one indicate a good fit.
receiver operating characteristics (ROC) A measure of comparison between individual models to determine thresholds that yield a high proportion of positive hits. ROC curves aid users in adjusting the cost matrix to minimize error rates. ROC was originally used in signal detection theory to gauge the true hit versus false alarm ratio when sending signals over a noisy channel.
recode A transformation that defines an explicit set of mappings, where each mapping involves an original value and a replacement (or recoded) value. Upon performing the transformation on a column, all matching original values are replaced with the recoded values.
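A recode transformation can be sketched as follows (the mapping from state codes to regions is hypothetical):

```python
def recode(values, mapping):
    """Replace each matching original value with its recoded value;
    values not in the mapping pass through unchanged."""
    return [mapping.get(v, v) for v in values]

# Hypothetical mapping collapsing state codes into regions.
states = ["CA", "NY", "MA", "TX"]
regions = recode(states, {"CA": "west", "NY": "east", "MA": "east"})
```

"TX" has no mapping, so it passes through unchanged.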
reference implementation A software implementation of a JSR specification that validates the interface for practical implementation and usage. It must meet the tests defined in the TCK. See also technology compatibility kit (TCK).
regression A supervised data mining technique that predicts continuous targets. One of the JDM mining functions.
residual(s) In regression, the difference between the actual target value and the predicted value. In time series, residual is what remains after accounting for trend, cyclic variations, and interventions.
return on investment A measure used to make capital investment decisions. One possible calculation involves (increased revenue − costs)/investment.
rule An expression of the general form if X, then Y. An output of certain models (e.g., association rules models or decision tree models). The X may be a compound predicate.
sample (n) A representative set of cases taken from a larger data population. (v) To extract a set of cases from a larger population, typically at random to minimize bias in the dataset.
seasonality In time series, this is a periodic effect due to the recurrence of certain drivers of the time series, for example strong sales around holidays. See also time series and cycle.
session The duration of an open connection to the DME.
settings The parameters used to control mining operations. See build settings, apply settings, algorithm settings.
signature attribute A type of attribute used to define one of the inputs to a model for test and apply. See model signature.
single-record case A representation of physical data that uses a single record to store each case. Each column contains data to be mined that can correspond to a logical attribute.
SOA Service Oriented Architecture.
statistics The science and practice of collecting, organizing, and analyzing data. In JDM, statistics refers to the type of summary data made available on individual attributes (univariate) and analysis of multiple attributes (multivariate). Univariate statistics include values such as the mean, mode, median, standard deviation. Multivariate statistics include tests such as F Tests and T Tests.
stratified sampling A sampling technique such that the cases selected are based on percentages or counts of class values from a specific attribute. For example, a target attribute with values high, medium, and low, where the original distribution of cases is 75 percent, 20 percent, and 5 percent, respectively, may be stratified to ensure an equal number of cases of each value in the sampled dataset.
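Count-based stratified sampling can be sketched as follows; the skewed dataset mirrors the 75/20/5 percent example above, with hypothetical cases:

```python
import random

def stratified_sample(cases, class_of, per_class, seed=0):
    """Draw per_class cases at random from each class value."""
    rng = random.Random(seed)
    by_class = {}
    for c in cases:
        by_class.setdefault(class_of(c), []).append(c)
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, per_class))
    return sample

# Hypothetical skewed data: 75% "high", 20% "medium", 5% "low".
cases = ([("high", i) for i in range(75)]
         + [("medium", i) for i in range(20)]
         + [("low", i) for i in range(5)])
balanced = stratified_sample(cases, class_of=lambda c: c[0], per_class=5)
```

The balanced sample contains five cases of each class, regardless of the original skew.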
structured data Data that contains primitive data types such as integers, floats, or category strings. Examples include age, marital status, temperature.
supervised learning The process of building data mining models using a known dependent attribute, referred to as the target. All classification and regression techniques are supervised.
system default For an enumeration class, an implementation-defined default value that corresponds to one of the allowed values for the enumeration class. This default value may differ according to context. Vendors must document the system default for each context.
system determined For an enumeration class, a user may request that the implementation determine the best value for the enumeration. The implementation-selected value may take into account, for example, other settings or data. JDM implementers are expected to document the behavior users can expect.
target In supervised learning, the identified logical attribute that is to be predicted. Also referred to as a dependent variable.
taxonomy A hierarchical grouping of a set of categorical values. For example, a geography taxonomy groups cities into states, states into regions, and regions into countries.
task A container within which to specify arguments to data mining operations to be performed by the DME. Data mining tasks include: model build, test, apply, import, and export.
Technology Compatibility Kit The suite of tests, tools, and documentation, as defined through the Java Community Process, that allows implementers of a specification to determine if their implementation is compliant with that specification.
test The data mining operation that determines the accuracy of a model. This is typically performed by using held-aside (test) data identical in form to the build data, scoring that test data, and comparing the actual target value with the predicted target value. Testing is only applicable for supervised models. In JDM, test is performed using a test task.
test data The input data used for testing a model.
test task A task that when executed produces test results for supervised models.
text mining A data mining technique for extracting patterns and insights out of unstructured text data. Text mining goes beyond the notion of search in that previously unknown information can be discovered through the use of data mining algorithms.
time series A data mining technique that supports the analysis of time series data. A series of values X(t) are recorded according to some function of time and are thus ordered by an index describing the time (t) at which the values were recorded.
training The step in the model building process that produces a possibly nonoptimized form of the model. For example, a tree algorithm may produce a full tree during training, but may require an evaluation phase to effectively select the best subtree. See build.
training data See build data.
transformation A function applied to data resulting in a new form or representation of the data. Binning and normalization are examples of data transformations. See also binning, explode, and normalization.
trend In time series, this is typically considered to be a long-term change in the mean level of a series. What constitutes “long-term” depends on the sampling rate of the time series. See also time series.
UML Unified Modeling Language.
URI Uniform Resource Identifier.
unstructured data Data that represents complex content, often with an inherent structure. Examples of unstructured data include text, images, audio, and video. See also structured data.
unsupervised learning The process of building data mining models without the guidance (supervision) of a known, correct result. In supervised learning, this correct result is provided in the target attribute. Unsupervised learning has no such target attribute. Clustering and association are examples of unsupervised learning.
Web service A software application identified by a URI, whose interfaces and bindings are capable of being defined, described, and discovered as XML artifacts. A Web service supports direct interactions with other software agents using XML-based messages exchanged via Internet-based protocols.
weight A numeric value associated with an attribute or case. Weights associated with attributes instruct the DME to consider the contribution of attributes with greater weights more important than those with lesser weights. Weights associated with cases—by identifying an attribute as containing weight values—instruct the DME to consider the contribution of cases with greater weights more important than those with lesser weights.