Demographic information - Academic Bible

Assignment 2

Q1. This assignment requires an understanding of the concepts explained in data mining and predictive

modeling.

(a) For this exercise, your goal is to build a model to identify inputs or predictors that differentiate risky

customers from others (based on patterns pertaining to previous customers) and then use those inputs

to predict new risky customers. This sample case is typical for this domain. The sample data to be used

in this exercise is CreditRisk.xlsx.

The data set has 425 cases and 15 variables pertaining to past and current customers who have borrowed

from a bank for various reasons. The data set contains customer-related information such as financial

standing, reason for the loan, employment, demographic information, and the outcome or dependent

variable for credit standing, classifying each case as good or bad, based on the institution’s past

experience.

Take 400 of the cases as training cases and set aside the other 25 for testing. Build a decision tree model

to learn the characteristics of the problem. Test its performance on the other 25 cases. Report on your

model’s learning and testing performance. Prepare a report that identifies the decision tree model and

training parameters, as well as the resulting performance on the test set.
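The 400/25 hold-out split described above can be illustrated with a short sketch. It is not part of the rattle or Weka workflow the assignment asks for; the function name `holdout_split` and the seed value are assumptions made purely for illustration.

```python
import random

def holdout_split(n_cases, n_test, seed=42):
    """Return (train_idx, test_idx), a random partition of range(n_cases).

    The seed is an arbitrary, hypothetical choice; fixing it only makes
    the split reproducible so the report can state which cases were held out.
    """
    if not 0 < n_test < n_cases:
        raise ValueError("n_test must be between 1 and n_cases - 1")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # The first n_test shuffled indices are held out; the rest are used for training.
    return sorted(indices[n_test:]), sorted(indices[:n_test])

# 425 cases in total: 400 for training, 25 set aside for testing.
train_idx, test_idx = holdout_split(n_cases=425, n_test=25)
print(len(train_idx), len(test_idx))
```

Drawing the test cases at random, rather than taking the last 25 rows, avoids a biased test set if the spreadsheet happens to be sorted by some attribute.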

You can use either R Package rattle (GUI-based) or Weka.

To use Weka, go through the learning resources for Weka decision trees.

To use rattle, go through the learning resources for rattle decision trees.

(b) Using the same dataset also develop a Neural Network (NN) model using either rattle or Weka

(Multilayer Perceptron)

validation and Leave-one-out for classification assessment). Also generate ROC plots. Explain and

discuss the results.

http://www.cs.waikato.ac.nz/ml/weka/
https://www.youtube.com/watch?v=CV6dohykPhY
https://www.youtube.com/watch?v=IPh8PxDtgdQ
http://www.springer.com/cda/content/document/cda_downloaddocument/9781441998897-c1.pdf?SGWID=0-0-45-1277951-p174110667
https://www.youtube.com/watch?v=ARGfOHPVERc
https://rattle.togaware.com/rattle-examples.html
https://www.youtube.com/watch?v=mo2dqHbLpQo
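As background for the leave-one-out assessment and ROC plots requested in part (b), the following is a minimal sketch of how both are computed. The labels and scores are made-up values, not output from CreditRisk.xlsx, and the helper names (`roc_points`, `auc`, `leave_one_out`) are hypothetical; rattle and Weka produce these results through their own interfaces.

```python
def roc_points(labels, scores, positive="bad"):
    """Return [(fpr, tpr), ...] as the decision threshold is lowered.

    Assumes both classes are present, otherwise a rate's denominator is zero.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Every case scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == positive)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def leave_one_out(n_cases):
    """Yield (train_indices, test_index) pairs; each case is held out exactly once."""
    for i in range(n_cases):
        yield [j for j in range(n_cases) if j != i], i

# Hypothetical predicted probabilities of "bad" credit standing for six cases.
labels = ["bad", "bad", "good", "bad", "good", "good"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print("AUC:", round(auc(points), 3))
print("leave-one-out folds for 6 cases:", len(list(leave_one_out(6))))
```

A curve that hugs the top-left corner (AUC near 1) indicates a model that ranks risky customers well above the others, while a curve along the diagonal (AUC near 0.5) is no better than chance.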

Report everything in a .pdf file with descriptions of preprocessing steps, model development, model

output interpretation and the explanations of the model significance.

Q2. The attached Excel file Iris.xlsx lists a dataset that consists of samples from each of three flower

species of Iris (Iris setosa, Iris virginica and Iris versicolor) and 4 attributes – Petal Length, Petal Width,

Sepal Length, Sepal Width. The dataset has been partitioned into training and testing sets.

Develop a k-nearest neighbor (kNN) model that predicts the class of each test case given the attribute

information. Assume that the model uses Euclidean distance to find the nearest neighbor using k=1.

Show all the calculation steps in Excel for predicting the testing set classes. Develop a confusion

matrix of the predicted and actual classes. Calculate overall classifier accuracy from the confusion matrix.

Do not use Weka or rattle for this exercise, use only Excel. Upload the Excel file showing all the calculation

details.
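The calculation that the spreadsheet needs to reproduce can be sketched as follows. This is only an illustration of the logic, since the submission itself must be done in Excel; the measurements are hypothetical rows, not values taken from Iris.xlsx, and the helper names are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train, query):
    """Return the label of the training row closest to `query` (k = 1)."""
    nearest = min(train, key=lambda row: euclidean(row[0], query))
    return nearest[1]

# Hypothetical rows: (petal length, petal width, sepal length, sepal width), label.
train = [
    ((1.4, 0.2, 5.1, 3.5), "setosa"),
    ((4.7, 1.4, 7.0, 3.2), "versicolor"),
    ((6.0, 2.5, 6.3, 3.3), "virginica"),
]
test = [
    ((1.5, 0.2, 5.0, 3.4), "setosa"),
    ((4.5, 1.5, 6.4, 3.2), "versicolor"),
    ((5.1, 1.9, 5.8, 2.7), "virginica"),
    ((5.0, 1.7, 6.7, 3.0), "virginica"),
]

predicted = [predict_1nn(train, features) for features, _ in test]
actual = [label for _, label in test]

# Confusion matrix stored as counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))

# Accuracy = correctly classified cases (the diagonal) / all test cases.
accuracy = sum(n for (a, p), n in confusion.items() if a == p) / len(test)

for (a, p), n in sorted(confusion.items()):
    print(f"actual={a:<10} predicted={p:<10} count={n}")
print("accuracy:", accuracy)
```

In Excel the same steps map to one distance column per training row (square root of the sum of squared differences), a minimum over those columns to find the nearest neighbor, and a count of each (actual, predicted) pair for the confusion matrix.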

Q3. In this exercise you’ll use the R package tidyverse (see chapter 4 of Introduction to Data Science: Data

Analysis and Prediction Algorithms with R by Rafael A. Irizarry). You need to go through chapter 4

before attempting the following questions.

Using dplyr functions (e.g., filter, mutate, select, summarise, group_by, etc.) and the “murders” dataset

(available in the dslabs R package), write appropriate R syntax to answer the following:

a. Calculate the total number of murders in each region, excluding OH, AL, and AZ (Hint: filter(!abb %in% x), where x is

the exclusion vector)

b. Display the regional population and regional murder numbers.

c. How many states are there in each region? (Hint: n())

d. What is Ohio’s murder rank in the North Central region? (Hint: use rank(), row_number())

e. How many states have a murder count greater than their regional average? (Hint: nrow())

f. Display the 2 least populated states in each region (Hint: slice_min())

Use the pipe operator %>% for all the queries. Show all the output results.

Upload a single PDF file with the answers to Q1 and Q3. For Q2, just upload the Excel file.

This is a group assignment. Do not share or interact between the groups.

https://rafalab.github.io/dsbook/tidyverse.html
