SIT384 Cyber security analytics
Pass Task 8.1P: PCA dimensionality reduction
Task description:
PCA (Principal Component Analysis) is a dimensionality reduction technique that projects the data into a lower-dimensional space. It can be used to reduce high-dimensional data to 2 or 3 dimensions so that we can visualize and hopefully understand the data better.
In this task, you use PCA to reduce the dimensionality of a given dataset and visualize the data.
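To make the idea of projecting into a lower-dimensional space concrete, the following is a minimal sketch of PCA written with NumPy only. The toy data, the variable names and the use of SVD are assumptions made for illustration; the task itself uses scikit-learn's PCA class.

```python
import numpy as np

# Hypothetical toy data: 100 samples, 5 features, most variance along one direction.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(100, 5))

# 1. Center each feature so the principal axes pass through the data mean.
X_centered = X - X.mean(axis=0)

# 2. The right singular vectors of the centered data are the principal axes,
#    ordered by decreasing variance.
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                 # shape (2, 5): first two principal axes

# 3. Project the data onto the first two axes to get a 2D representation.
X_2d = X_centered @ components.T    # shape (100, 2)

# Variance captured along each principal axis.
explained_variance = singular_values**2 / (len(X) - 1)
print(X.shape, "->", X_2d.shape)
print("variance along the first two axes:", explained_variance[:2])
```

The first axis captures the most variance and the second the next most, which is why the first two components are the usual choice for a 2D plot.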
You are given:
• Breast cancer dataset which can be retrieved from:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
Detailed info is available at: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
• PCA(n_components=2)
• 3D plot settings (please refer to prac7 for 3D plot examples):
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
cmap = plt.cm.get_cmap("Spectral")
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=10, azim=10)
ax.scatter(x, y, z, c=cancer.target, cmap=cmap)
• Other settings of your choice
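The 3D plot settings above follow an older Matplotlib style. In recent Matplotlib releases, constructing Axes3D directly may not attach the axes to the figure, and plt.cm.get_cmap may no longer be available. The sketch below is one possible equivalent, assuming a current Matplotlib. The axis labels and title are illustrative choices, not required values.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

fig = plt.figure(figsize=(10, 8))
# Create the 3D axes through the figure so they are attached to it.
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=10, azim=10)  # same viewing angles as the given settings

# First three scaled features as x, y and z, colored by class label.
ax.scatter(X_scaled[:, 0], X_scaled[:, 1], X_scaled[:, 2],
           c=cancer.target, cmap="Spectral")
ax.set_xlabel(cancer.feature_names[0])
ax.set_ylabel(cancer.feature_names[1])
ax.set_zlabel(cancer.feature_names[2])
ax.set_title("First 3 features of the scaled data")  # illustrative title
# Call plt.show() to display the figure in an interactive session.
```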
You are asked to:
• use StandardScaler() to first fit and transform the cancer.data,
• apply PCA (n_components=2) to fit and transform the scaled cancer.data set
• print the scaled dataset shape and PCA transformed dataset shape for comparison
• create a 2D plot with the first principal component as x axis and the second principal component as y axis
• set proper xlabel, ylabel for the 2D plot
• print the PCA component shape and component values
• create a 3D plot with the first 3 features (as x, y and z) of the scaled cancer.data set
• create a 3D plot with the first principal component as x axis and the second principal component as y axis, no value for z axis
• set proper title for the two 3D plots
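The scaling, PCA and printing steps listed above can be sketched as follows. This is a minimal outline rather than a complete solution. It leaves out the plots, and the variable names are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Standardize every feature to zero mean and unit variance so that
# features with large numeric ranges do not dominate the components.
X_scaled = StandardScaler().fit_transform(cancer.data)

# Keep only the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Scaled dataset shape:", X_scaled.shape)
print("PCA transformed shape:", X_pca.shape)
print("PCA component shape:", pca.components_.shape)
print("PCA components:\n", pca.components_)
```

Each row of pca.components_ holds the weight of every original feature in one principal component, so it has one column per input feature.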
The sample output shown in the following figures is for demonstration purposes only. Yours might differ from it.
Submission:
Submit the following files to OnTrack:
1. Your program source code (e.g. task8_1.py)
2. A screenshot of your program running
Check the following things before submitting:
1. Add proper comments to your code
SIT384 Cyber security analytics
Pass Task 7.1P: K-Means and Hierarchical Clustering
Task description:
In machine learning, clustering is used to analyze and group data that has no prelabelled classes, or no class attribute at all. K-Means clustering and hierarchical clustering are both unsupervised learning algorithms.
K-Means divides objects into clusters, where a cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Each object is in exactly one cluster, not several.
In hierarchical clustering, clusters have a tree-like structure or parent-child relationship. In the agglomerative form used here, the two most similar clusters are merged, and merging continues until all objects are in the same cluster.
In this task, you use the K-Means and Agglomerative Hierarchical algorithms to cluster a synthetic dataset and compare their results.
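To illustrate how K-Means places each object in exactly one cluster, here is a minimal NumPy sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its points). The toy points and the hand-picked starting centroids are assumptions for illustration; scikit-learn's KMeans adds smarter initialization and repeated restarts.

```python
import numpy as np

# Hypothetical toy points forming two obvious groups.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
# Assumed starting centroids (k = 2), chosen by hand for illustration.
centroids = np.array([[0.0, 0.0], [6.0, 6.0]])

for _ in range(10):
    # Assignment step: each point joins the cluster of its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving, so the clustering has converged
    centroids = new_centroids

print(labels)     # one cluster index per point
print(centroids)
```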
You are given:
• np.random.seed(0)
• make_blobs function with input:
o n_samples: 200
o centers: [3, 2], [6, 4], [10, 5]
o cluster_std: 0.9
• KMeans() with settings: init = 'k-means++', n_clusters = 3, n_init = 12
• AgglomerativeClustering() with settings: n_clusters = 3, linkage = 'average'
• Other settings of your choice
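The given settings can be combined as in the sketch below. It only creates the dataset and fits the two models. Plotting is left out, and the variable names are arbitrary.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# make_blobs draws from NumPy's global random state when random_state is not set,
# so seeding it makes the generated dataset reproducible.
np.random.seed(0)

X, y = make_blobs(n_samples=200,
                  centers=[[3, 2], [6, 4], [10, 5]],
                  cluster_std=0.9)

kmeans = KMeans(init="k-means++", n_clusters=3, n_init=12).fit(X)
agglom = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)

print("Dataset shape:", X.shape)
print("K-Means cluster centers:\n", kmeans.cluster_centers_)
print("K-Means labels (first 10):", kmeans.labels_[:10])
print("Agglomerative labels (first 10):", agglom.labels_[:10])
```

The cluster indices of the two models are arbitrary, so the same group of points may receive different label numbers from each algorithm.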
You are asked to:
• plot your created dataset
• plot the two clustering models for your created dataset
• set the K-Means plot with title “KMeans”
• set the Agglomerative Hierarchical plot with title “Agglomerative Hierarchical”
• calculate distance matrix for Agglomerative Clustering using the input feature matrix (linkage = complete)
• display dendrogram
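The distance matrix and dendrogram steps can be sketched with SciPy as follows. Using pdist, squareform and linkage is an assumption about the tooling, since the task does not name the functions to use. The dendrogram is computed with no_plot=True here so that the sketch runs without a display; drop that argument to draw it with Matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs

np.random.seed(0)
X, _ = make_blobs(n_samples=200,
                  centers=[[3, 2], [6, 4], [10, 5]],
                  cluster_std=0.9)

# Pairwise Euclidean distances between all samples, in condensed form.
condensed = pdist(X)
# The same distances as a full, symmetric 200 x 200 matrix.
dist_matrix = squareform(condensed)

# Complete linkage: the distance between two clusters is the largest
# distance between any pair of their members.
Z = linkage(condensed, method="complete")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)
print("Distance matrix shape:", dist_matrix.shape)
print("Linkage matrix shape:", Z.shape)
```

Each row of Z records one merge, so 200 samples produce 199 rows.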
The sample output shown in the following figure is for demonstration purposes only. Yours might differ from it.
Submission:
Submit the following files to OnTrack:
1. Your program source code (e.g. task7_1.py)
2. A screenshot of your program running
Check the following things before submitting:
 
  