ANSWERS|| 2026 LATEST UPDATE||
VERIFIED A+
What is data mining? - ANSWERThe process of sorting through large data sets to
identify patterns and establish relationships to solve problems through data analysis
What are the steps involved in data mining when viewed as a process of knowledge
discovery? - ANSWERData Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
What are the data mining functionalities? - ANSWERCharacterization and discrimination
Mining of frequent patterns, associations, and correlations
Classification and regression
Clustering analysis
Outlier analysis
Data Characterization - ANSWERA summary of the general characteristics or features
of a target class of data. The data corresponding to the user-specified class is typically
collected by a query. For example, to study the characteristics of software products with
sales that increased by 10% in the previous year, the data related to such products can
be collected by executing an SQL query on the sales database.
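The query step described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the in-memory sales table, its column names (product, category, sales_prev, sales_curr), the sample rows, and the function name characterize_growing_software are all hypothetical, and the 10% threshold is read as "grew by at least 10%".

```python
import sqlite3

def characterize_growing_software(rows, min_growth=0.10):
    """Collect the target class (software products whose sales grew by at
    least min_growth) with an SQL query, then summarize it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per product with last year's and this year's sales.
    conn.execute(
        "CREATE TABLE sales (product TEXT, category TEXT, "
        "sales_prev REAL, sales_curr REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    target = conn.execute(
        """
        SELECT product, sales_prev, sales_curr
        FROM sales
        WHERE category = 'software'
          AND sales_curr >= sales_prev * (1 + ?)
        ORDER BY product
        """,
        (min_growth,),
    ).fetchall()
    conn.close()
    # The characterization itself is a summary of the collected class.
    growths = [(curr - prev) / prev for _, prev, curr in target]
    summary = {
        "count": len(target),
        "mean_growth": sum(growths) / len(growths) if growths else None,
    }
    return target, summary

# Made-up sample data for illustration only.
rows = [
    ("Editor", "software", 100, 125),
    ("Viewer", "software", 200, 205),
    ("Tracker", "software", 50, 80),
    ("Router", "hardware", 100, 150),
]
target, summary = characterize_growing_software(rows)
print([name for name, _, _ in target], summary)
```

The query only collects the target class; the summary dictionary is what plays the role of the characterization.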
Data discrimination - ANSWERcomparison of the target class with one or a set of
comparative classes
Data mining methodology challenges - ANSWERMining various and new kinds of
knowledge
Mining knowledge in multidimensional space
Integrating new methods from multiple disciplines
Boosting the power of discovery in a networked environment
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
Explain one challenge of mining a huge amount of data in comparison with mining a
small amount of data. - ANSWERAlgorithms must scale well, so that even vast amounts of
data can be processed efficiently and in a short amount of time.
What is an outlier? - ANSWERAn object which does not fit in with the general behavior
of the model.
Does an outlier need to be discarded always? - ANSWERIn most cases of data mining,
outliers are discarded. However, there are special circumstances, such as fraud
detection, where outliers can be useful.
The mode is the only measure of central tendency that can be used for nominal
attributes. (T/F) - ANSWERTrue. An example of this would be hair color, with different
categories such as black, brown, blond, and red. Which one is the most common one?
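The hair color example can be checked with a short sketch. The sample list below is made up for illustration; Counter and multimode come from the Python standard library.

```python
from collections import Counter
from statistics import multimode

# Made-up nominal data: categories have no order and no arithmetic meaning.
hair_colors = ["black", "brown", "blond", "brown", "red", "black", "brown"]

counts = Counter(hair_colors)
most_common_color, frequency = counts.most_common(1)[0]

# multimode returns every value tied for the highest count,
# which matters when a data set has more than one mode.
modes = multimode(hair_colors)

print(most_common_color, frequency, modes)
```

A mean or median of these values would be meaningless, because the categories cannot be added or ordered; counting is the only operation that applies.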
Nominal attribute - ANSWERRefers to symbols or names of things. Categorical. Its values
can also be represented using numbers; however, they are not meant to be used
quantitatively. Has no median, but has a mode
Binary Attributes - ANSWERA nominal attribute with only two categories or states: 0 or
1, where 0 typically means that the attribute is absent, and 1 means that it is present.
Ordinal Attributes - ANSWERAn attribute with possible values that have a meaningful
order or ranking among them, but the magnitude between successive values is not known.
Numeric Attributes - ANSWERQuantitative; that is, it is a measurable quantity,
represented in integer or real values. Can be interval-scaled or ratio-scaled.
Discrete Attribute - ANSWERhas a finite or countably infinite set of values
Continuous Attributes - ANSWERtypically represented as floating-point variables.
The mean is in general affected by outliers (T/F) - ANSWERTrue
Not all numerical data sets have a median. (T/F) - ANSWERFalse
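Both true/false answers above can be seen in a small sketch. The salary figures are made up for illustration; mean and median are from the Python standard library.

```python
from statistics import mean, median

# Made-up salaries in thousands; the last value added below is an extreme outlier.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]

mean_before, mean_after = mean(salaries), mean(with_outlier)
median_before, median_after = median(salaries), median(with_outlier)

# The mean jumps because every value contributes to the sum;
# the median barely moves because it depends only on the middle position(s).
print(mean_before, mean_after)
print(median_before, median_after)
```

The second list has an even number of values, so its median is the average of the two middle values. This is why every non-empty numeric data set has a median.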
What are the differences between the measures of central tendency and the measures
of dispersion? - ANSWERThe measures of central tendency are the mean, median,
mode, and midrange. They locate the middle or center of a data distribution, that is,
where most values fall. The measures of dispersion are the range, quartiles,
interquartile range, five-number summary, boxplots, variance, and standard deviation.
They describe how spread out the data are and help identify outliers.
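Several of the dispersion measures listed above can be computed with the standard library, as in the sketch below. Two assumptions are made here that the answer does not state: quartiles use the default "exclusive" method of statistics.quantiles (other quartile conventions give slightly different values), and outliers are flagged with the common 1.5 x IQR rule of thumb. The salary values and function names are made up for illustration.

```python
from statistics import quantiles, pstdev

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4)  # default method is "exclusive"
    return data[0], q1, q2, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (assumed rule of thumb)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(salaries)
value_range = summary[-1] - summary[0]
outliers = iqr_outliers(salaries)
spread = pstdev(salaries)  # population standard deviation

print(summary, value_range, outliers, round(spread, 2))
```

The five-number summary is exactly what a boxplot draws, which is why the two appear side by side in the list of dispersion measures.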
How would you catalog a boxplot, as a measure of dispersion or as a data visualization
aid? Why? - ANSWERAs a data visualization aid. A boxplot shows visually where the
minimum and maximum values lie and how the quartiles relate to each other: the box
spans the interquartile range, with a line marking the median. It does not give you a
single specific measure, but it lets you visualize the data set. For example, in a
boxplot of the grades in a class, a box close to the minimum boundary shows that most
scores were low.
What do we understand by similarity measure? - ANSWERIt quantifies the similarity
between two objects. Usually, large values are for similar objects and zero or negative
values are for dissimilar objects.
What is the importance of similarity measures? - ANSWERThey are important because
they help us see patterns in data. They also give us knowledge about our data. They
are used in clustering algorithms. Similar data points are put into the same clusters, and
dissimilar points are placed into different clusters.
What do we understand by dissimilarity measure and what is its importance? -
ANSWERIt measures the difference between two objects: the greater the difference
between the two objects, the higher the value.
What is the importance of dissimilarity measures? - ANSWERThe importance of this is
that in some instances, having two objects with low dissimilarity could mean something
negative. For example, cheating.
Discuss one of the distance measures that are commonly used for computing the
dissimilarity of objects described by numeric attributes. - ANSWEREuclidean distance
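d(i, j) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + ··· + (xip − xjp)^2)
Manhattan distance d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ··· + |xip − xjp|
Minkowski distance d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ··· + |xip − xjp|^h)^(1/h)
Supremum distance d(i, j) = max over f = 1, ..., p of |xif − xjf|
Here p is the number of numeric attributes and h is a real number with h >= 1. Manhattan
distance is the Minkowski distance with h = 1, Euclidean distance is the case h = 2, and
supremum distance is the limit as h goes to infinity.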
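The four distances can be written as one short sketch. The function names and the two sample points are illustrative assumptions, and objects are taken to be equal-length tuples of numbers.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric objects."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def manhattan(x, y):
    """Sum of absolute attribute differences (Minkowski with h = 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences (Minkowski with h = 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def supremum(x, y):
    """Largest absolute difference over all attributes (limit as h grows)."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two made-up objects described by two numeric attributes.
x, y = (1, 2), (3, 5)
print(manhattan(x, y), euclidean(x, y), supremum(x, y))
print(minkowski(x, y, 1), minkowski(x, y, 2), minkowski(x, y, 50))
```

Raising h puts more and more weight on the attribute with the largest difference, so the Minkowski value for a large h is already very close to the supremum distance.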
In many real-life databases, objects are described by a mixture of attribute types. How
can we compute the dissimilarity between objects of mixed attribute types? -
ANSWERThere are two main approaches to determining the dissimilarity between objects
of mixed attribute types. The first separates the attributes by type and performs a data
mining analysis for each type. This is acceptable if the analyses give consistent results,
but in real-life projects it is often not viable, because analyzing each attribute type
separately will most likely generate different results. The second approach is
preferable: it processes all attribute types together in a single analysis by combining
the attributes into one dissimilarity matrix.
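One common way to combine attribute types into a single dissimilarity is to compute a per-attribute dissimilarity scaled to [0, 1] and average it over the attributes that are present in both objects. The sketch below assumes that formulation; the schema format, the function name mixed_dissimilarity, and the sample objects are hypothetical, and the answer above does not prescribe this exact scheme.

```python
def mixed_dissimilarity(a, b, schema):
    """Average per-attribute dissimilarity in [0, 1] over non-missing attributes.

    schema holds one (kind, info) pair per attribute (assumed format):
      ("nominal", None)           0 if the values match, 1 otherwise
      ("numeric", (min, max))     absolute difference divided by the range
      ("ordinal", [levels...])    rank difference divided by (levels - 1)
    A value of None marks a missing measurement and is skipped.
    """
    total, used = 0.0, 0
    for x, y, (kind, info) in zip(a, b, schema):
        if x is None or y is None:
            continue
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            lo, hi = info
            d = abs(x - y) / (hi - lo) if hi > lo else 0.0
        elif kind == "ordinal":
            d = abs(info.index(x) - info.index(y)) / (len(info) - 1)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += d
        used += 1
    if used == 0:
        raise ValueError("no attribute is present in both objects")
    return total / used

# Made-up objects: (test code, rating, score).
schema = [
    ("nominal", None),
    ("ordinal", ["fair", "good", "excellent"]),
    ("numeric", (22, 64)),
]
objects = [("A", "excellent", 45), ("B", "fair", 22), ("A", "excellent", 28)]

# Lower-triangular dissimilarity matrix, one row per object.
matrix = [
    [mixed_dissimilarity(objects[i], objects[j], schema) for j in range(i + 1)]
    for i in range(len(objects))
]
for row in matrix:
    print([round(d, 2) for d in row])
```

Scaling every attribute to [0, 1] before averaging keeps one attribute type from dominating the result, which is what makes a single combined analysis workable.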
What do we understand by data quality and what is its importance? - ANSWERData have
quality when they satisfy the requirements of the intended use. Quality has many
factors, including accuracy, completeness, consistency, timeliness, believability, and
interpretability. It also depends on the intended use of the data: for some users the
same data may be inconsistent, while for others it may just be hard to interpret.