BIT 5524

Introduction to Business Intelligence & Analytics

Homework

1. (6 points)

Assuming that data mining approaches are to be used in the following cases, identify whether the task required is supervised or unsupervised learning:

a. Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.

b. Deciding whether to issue a loan to an applicant based on demographic and financial data, with reference to a database of similar data on prior customers.

c. Identifying a network data packet as dangerous (e.g., virus or hacker attack) based on comparison to other packets whose threat status is known.

d. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms.

e. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions.

f. Automated sorting of mail by zip code scanning.

2. (24 points)

A company that manufactures riding mowers wants to identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying households as prospective owners or non-owners on the basis of Income (in $1000s), Lot Size (in 1000 sq ft), and whether or not they own a dog. The marketing expert looked at a random sample of 24 households, given in the file RidingMowers_BH.xls.

a. Calculate the entropy of the entire dataset.

b. Calculate the information gains resulting from an initial split on each attribute, and determine the attribute that provides the most information about the target value. (For Income, use the split <60, >=60. For Lot Size, use the split <19, >=19.)

c. Propose a complete classification tree for the data, using the attribute from b to make the first split. You do NOT need to do additional information gain calculations to determine the subsequent splits; just use inspection to choose any additional splits. The terminal nodes do NOT need to be pure, but aim for at least 75% certainty in any terminal node. (Note: to get full points, your tree does not have to match mine, but it does have to be logically correct.)

d. Predict whether a homeowner with an income of $72,000 who owns a dog and lives on a lot that is 18,500 square feet will purchase a riding mower.
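
The entropy and information gain calculations in parts a and b can be sketched in Python as follows. This is a minimal illustration, assuming a two-class target and the standard Shannon entropy in bits. The function names `entropy` and `information_gain` are hypothetical helpers, and the counts are made up for illustration; they are NOT taken from RidingMowers_BH.xls.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    # Classes with zero records contribute nothing (0 * log 0 is treated as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


# Hypothetical counts for illustration only (not the homework data):
# 10 records, 5 in each class, split into two branches of 5 records each.
parent = [5, 5]
children = [[4, 1], [1, 4]]
gain = information_gain(parent, children)
```

To apply the same idea to the homework data, tally the owner and non-owner counts for the whole sample and for each branch of a candidate split, then compare the gains across Income, Lot Size, and dog ownership.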

3. (20 points)

A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as legitimate (920 correctly so).

a. Construct the classification matrix (using the convention that “positive” = fraudulent).

Calculate the following rates:

b. Accuracy

c. True positive rate

d. True negative rate

e. Precision
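
The rates in parts b through e all follow from the four cells of the classification matrix. The sketch below shows one way to derive those cells from the counts in the problem statement and compute the rates, assuming the usual definitions (true positive rate = TP / (TP + FN), true negative rate = TN / (TN + FP), precision = TP / (TP + FP)). The function name `classification_metrics` and the variable names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common rates computed from classification matrix cell counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Counts from the problem statement ("positive" = fraudulent).
predicted_fraud, correct_fraud = 88, 30
predicted_legit, correct_legit = 952, 920

tp = correct_fraud                    # flagged fraudulent, actually fraudulent
fp = predicted_fraud - correct_fraud  # flagged fraudulent, actually legitimate
tn = correct_legit                    # flagged legitimate, actually legitimate
fn = predicted_legit - correct_legit  # flagged legitimate, actually fraudulent

metrics = classification_metrics(tp, fp, tn, fn)
```

Deriving the false positive and false negative counts by subtraction keeps the matrix consistent with the totals given in the problem (88 + 952 = 1040 records).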

Remember to use the file naming convention outlined in the Syllabus when you upload your file (e.g., SmithJ_HW2_5524FA18.doc). Keep in mind that your writing is to be your own words, with citations as appropriate. This homework is to be individual work, and not discussed in your Learning Community. Please feel free to ask any clarifying questions in the appropriate discussion forum on the Canvas site.