Need help with your Discussion

Get a timely done, PLAGIARISM-FREE paper
from our highly-qualified writers!

glass
pen
clip
papers
heaphones

Data Mining Assignment

Data Mining Assignment

Data Mining Assignment

Question Description

Please answer the questions below. They are picked from the textbook.

Chapter 2:

18.

This exercise compares and contrasts some similarity and distance measures.

(a)

For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001

y = 0100011000

(b)

Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.

(c)

Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

(d)

If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain.

Chapter5:

20.

Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, “+” and “−.” Half of the data set is used for training while the remaining half is used for testing.

(a)

Suppose there are an equal number of positive and negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?

(b)

Repeat the previous analysis assuming that the classifier predicts each test record to be positive class with probability 0.8 and negative class with probability 0.2.

(c)

Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error of a classifier that predicts every test record to be positive?

(d)

Repeat the previous analysis assuming that the classifier predicts each test record to be positive class with probability 2/3 and negative class with probability 1/3.

Chapter 6:

17.

Suppose we have market basket data consisting of 100 transactions and 20 items. If the support for item a is 25%, the support for item b is 90% and the support for itemset {a, b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.

(a)

Compute the confidence of the association rule {a} -> {b}. Is the rule interesting according to the confidence measure?

(b)

Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.

(c)

What conclusions can you draw from the results of parts (a) and (b)?

(d) NOT NEEDED FOR THE TEST

Chapter 7:

5.

For the data set with the attributes given below, describe how you would convert it into a binary transaction data set appropriate for association analysis. Specifically, indicate for each attribute in the original data set.

(a) How many binary attributes it would correspond to in the transaction data set,

(b) How the values of the original attribute would be mapped to values of the binary attributes, and

(c) If there is any hierarchical structure in the data values of an attribute that could be useful for grouping the data into fewer binary attributes.

The following is a list of attributes for the data set along with their possible values. Assume that all attributes are collected on a per-student basis:

• Year : Freshman, Sophomore, Junior, Senior, Graduate: Masters, Graduate: PhD, Professional

• Zip code : zip code for the home address of a U.S. student, zip code for the local address of a non-U.S. student

• College : Agriculture, Architecture, Continuing Education, Education, Liberal Arts, Engineering, Natural Sciences, Business, Law, Medical, Dentistry, Pharmacy, Nursing, Veterinary Medicine

• On Campus : 1 if the student lives on campus, 0 otherwise

• Each of the following is a separate attribute that has a value of 1 if the person speaks the language and a value of 0, otherwise.

– Arabic

– Bengali

– Chinese Mandarin

– English

– Portuguese

– Russian

– Spanish

Chapter 8:

1.

Consider a data set consisting of 2^(20) data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^(16) prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?

9.

Give an example of a data set consisting of three natural clusters, for which (almost always) K-means would likely find the correct clusters, but bisecting K-means would not.

Have a similar assignment? "Place an order for your assignment and have exceptional work written by our team of experts, guaranteeing you A results."

Order Solution Now

Our Service Charter


1. Professional & Expert Writers: Eminence Papers only hires the best. Our writers are specially selected and recruited, after which they undergo further training to perfect their skills for specialization purposes. Moreover, our writers are holders of masters and Ph.D. degrees. They have impressive academic records, besides being native English speakers.

2. Top Quality Papers: Our customers are always guaranteed of papers that exceed their expectations. All our writers have +5 years of experience. This implies that all papers are written by individuals who are experts in their fields. In addition, the quality team reviews all the papers before sending them to the customers.

3. Plagiarism-Free Papers: All papers provided by Eminence Papers are written from scratch. Appropriate referencing and citation of key information are followed. Plagiarism checkers are used by the Quality assurance team and our editors just to double-check that there are no instances of plagiarism.

4. Timely Delivery: Time wasted is equivalent to a failed dedication and commitment. Eminence Papers are known for the timely delivery of any pending customer orders. Customers are well informed of the progress of their papers to ensure they keep track of what the writer is providing before the final draft is sent for grading.

5. Affordable Prices: Our prices are fairly structured to fit in all groups. Any customer willing to place their assignments with us can do so at very affordable prices. In addition, our customers enjoy regular discounts and bonuses.

6. 24/7 Customer Support: At Eminence Papers, we have put in place a team of experts who answer all customer inquiries promptly. The best part is the ever-availability of the team. Customers can make inquiries anytime.

We Can Write It for You! Enjoy 20% OFF on This Order. Use Code SAVE20

Stuck with your Assignment?

Enjoy 20% OFF Today
Use code SAVE20