### **Solutions to the Data Mining II Question Paper (4986)**
#### **Section A**
1. **(a) How does the number of clusters affect anomaly detection in k-means clustering? (2 marks)**
- In k-means clustering, anomalies are usually identified as points far from their nearest centroid or as members of very small clusters, so the number of clusters ($ k $) directly determines which points stand out.
- **Fewer Clusters ($ k $ is small)**: With fewer clusters, anomalies are absorbed into larger clusters and can pull the centroids toward themselves, making them less distinguishable. The model underfits the structure of the data, so anomalies are not effectively identified.
- **More Clusters ($ k $ is large)**: With more clusters, anomalies are more likely to form their own small cluster or be isolated as outliers. This can improve anomaly detection but may also lead to overfitting, where noise or small groups of normal points are mistaken for meaningful patterns.
- **Optimal $ k $**: Choosing an appropriate $ k $ is crucial. Techniques like the elbow method or silhouette analysis can help determine a suitable number of clusters for effective anomaly detection [[5]].
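The effect of $ k $ can be illustrated with a minimal sketch. It assumes a toy 1-D dataset, hand-picked initial centroids, and the distance to the assigned centroid as the anomaly score; the function name `kmeans_1d` and all values are hypothetical and chosen only for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two tight groups plus one unusual value (9.0).
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 9.0]

for init in ([1.0, 5.0], [1.0, 5.0, 9.0]):
    centroids, labels = kmeans_1d(data, init)
    sizes = [labels.count(j) for j in range(len(centroids))]
    # Anomaly score: distance from each point to its assigned centroid.
    scores = [abs(p - centroids[lab]) for p, lab in zip(data, labels)]
    print(f"k={len(init)} cluster sizes={sizes} max score={max(scores):.2f}")
```

With $ k = 2 $ the value 9.0 joins the upper group and drags that centroid to about 6.0, so it is only visible through its larger distance. With $ k = 3 $ it ends up in a cluster of size one, so cluster size rather than distance is what reveals it.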
2. **(b) In a dataset of monthly sales figures for a retail store, the mean monthly sales are Rs. 50,000 with a standard deviation of Rs. 5,000. In a
certain month, the store recorded sales of Rs. 65,000. Calculate the z-score for this month's sales. (2 marks)**
- The formula for the z-score is:
\[
z = \frac{x - \mu}{\sigma}
\]
where:
- $ x $ = observed value (Rs. 65,000)
- $ \mu $ = mean (Rs. 50,000)
- $ \sigma $ = standard deviation (Rs. 5,000)
- Substituting the values:
\[
z = \frac{65{,}000 - 50{,}000}{5{,}000} = \frac{15{,}000}{5{,}000} = 3
\]
- **Answer**: The z-score for this month's sales is $ \boxed{3} $, meaning the sales were three standard deviations above the mean.
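The same calculation can be checked with a short sketch. The function name `z_score` and the outlier threshold of 3 are assumptions for illustration; the threshold is a common rule of thumb, not something the question specifies.

```python
def z_score(x, mean, std):
    """Return how many standard deviations x lies from the mean."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std


z = z_score(65_000, 50_000, 5_000)
print(z)  # 3.0

# Assumed rule of thumb: flag values with |z| >= 3 as potential outliers.
is_outlier = abs(z) >= 3
print(is_outlier)  # True
```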
3. **(c) Consider a dataset with binary labels. The dataset is trained using Adaboost method. The decision boundary obtained after one
iteration is shown in Figure II.**
- **(i) Which points shall have higher weights? Justify your answer. (2 marks)**
- In AdaBoost, after each iteration the weights of misclassified points are increased and the weights of correctly classified points are decreased, so that the next weak learner focuses on the difficult cases. From Figure II, the points that lie on the wrong side of the decision boundary are the ones whose weights increase: the points marked with triangles ($ \Delta $) that the current weak learner misclassifies.
- **Answer**: Points marked with triangles ($ \Delta $) will have higher weights because they are misclassified by the current weak learner.
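The reweighting step can be sketched as follows. This assumes the standard discrete AdaBoost update with labels in $ \{-1, +1\} $; because Figure II is not reproduced here, the ten points and the three misclassifications below are hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One discrete-AdaBoost weight update; labels must be -1 or +1."""
    # Weighted error of the current weak learner.
    eps = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < eps < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    # Learner importance: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - eps) / eps)
    # t * p is +1 for correct predictions (weight shrinks)
    # and -1 for mistakes (weight grows).
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical example: 10 equally weighted points, 3 of them misclassified.
y_true = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, 1, -1, 1, -1, -1, 1]
weights = [0.1] * 10

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
for t, p, w in zip(y_true, y_pred, new_weights):
    status = "wrong" if t != p else "right"
    print(f"{status}: weight {w:.4f}")
```

After normalization the misclassified points together carry half of the total weight, which is why the next weak learner is pushed to classify them correctly.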
- **(ii) What is overfitting in the context of classification? Name two methods to prevent it. (3 marks)**
- **Overfitting**: Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant details, leading to poor
generalization on unseen data.
- **Methods to Prevent Overfitting**:
1. **Regularization**: Adds a penalty term to the loss function to constrain model complexity (e.g., L1 or L2 regularization).
2. **Cross-Validation**: Trains and validates the model on multiple splits of the data to estimate how well it generalizes, which guides the choice of model complexity and hyperparameters.
3. **Early Stopping**: Stops training when the performance on a validation set starts to degrade.
4. **Feature Selection**: Reduces the number of input features to avoid learning noise.
- **Answer**: Overfitting is when a model performs well on training data but poorly on unseen data. Two methods to prevent it are
**regularization** and **cross-validation**.
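To make the cross-validation idea concrete, here is a minimal sketch of how k-fold splits can be generated. The helper `k_fold_indices` is hypothetical and splits the indices in order without shuffling; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if not start <= i < stop]
        yield train, validation
        start = stop


# Each sample is used for validation exactly once across the folds.
for train, validation in k_fold_indices(10, 3):
    print("train:", train, "validation:", validation)
```

A model whose training score is much higher than its average validation score across the folds is likely overfitting.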
- **(iii) Can clustering be used for dimensionality reduction? Justify your answer. (3 marks)**
- **Yes**, clustering can be used for dimensionality reduction in certain contexts. For example:
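- **Prototype-based Clustering**: Algorithms like K-means summarize each cluster by a centroid. Each point can then be described by its cluster label (vector quantization) or by its distances to the $ k $ centroids, which replaces the original features with at most $ k $ values.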
- **Hierarchical Clustering**: Agglomerative clustering can be used to identify groups of similar features, which can then be aggregated or
reduced.
- However, clustering is not typically designed for explicit dimensionality reduction like PCA or t-SNE. It focuses on grouping similar data
points rather than reducing feature space.
- **Answer**: Yes, clustering can be used for dimensionality reduction by representing clusters with prototypes (e.g., centroids) or aggregating
similar features.
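The prototype-based idea can be sketched as follows. The example assumes the centroids have already been fitted; the 4-D points and the two centroids are hypothetical values chosen for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math


def to_centroid_distances(points, centroids):
    """Map each point to the vector of its Euclidean distances to every centroid."""
    return [[math.dist(p, c) for c in centroids] for p in points]


# Hypothetical 4-D points and two already-fitted centroids.
points = [
    (1.0, 1.0, 1.0, 1.0),
    (1.0, 2.0, 1.0, 2.0),
    (8.0, 8.0, 8.0, 8.0),
    (9.0, 8.0, 9.0, 8.0),
]
centroids = [(1.0, 1.5, 1.0, 1.5), (8.5, 8.0, 8.5, 8.0)]

# 4 features per point become 2 (one distance per centroid).
reduced = to_centroid_distances(points, centroids)
# Coarser still: keep only the index of the nearest centroid.
labels = [row.index(min(row)) for row in reduced]

for p, row, lab in zip(points, reduced, labels):
    print(p, "->", [round(d, 2) for d in row], "cluster", lab)
```

The reduced representation keeps only how close a point is to each prototype, so it discards within-cluster detail that methods such as PCA would partly preserve.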