Problem 1

Part of the challenge of data mining text is that the sequence and context of words matter in communication. Consider the use of the word “good” in a movie review. Briefly explain how the word “good” could be used to convey both positive and negative feelings about a movie, why this highlights the importance of context, and whether you believe there is a way to work around this problem.
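
As a starting point for thinking about this, here is a minimal Python sketch of why word order matters. It assumes a simple bag-of-words representation (counting words and ignoring their order) and compares it with word pairs (bigrams). The two reviews and the helper names `tokenize`, `bag_of_words`, and `bigrams` are made up for illustration and are not part of the course materials.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return [w for w in words if w]


def bag_of_words(text):
    """Count each word, ignoring the order in which words appear."""
    return Counter(tokenize(text))


def bigrams(text):
    """Return the set of adjacent word pairs, which keeps some local context."""
    words = tokenize(text)
    return set(zip(words, words[1:]))


# Two made-up reviews (illustrative only, not taken from any dataset).
positive = "This movie was good, not boring."
negative = "This movie was boring, not good."

# Same words, different order: the word counts cannot tell them apart.
print(bag_of_words(positive) == bag_of_words(negative))

# Word pairs differ: only the negative review contains ("not", "good").
print(("not", "good") in bigrams(positive))
print(("not", "good") in bigrams(negative))
```

Bigrams are only one possible workaround, and they capture context only between adjacent words.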

Problem 2

This module provided an overview of a handful of other commonly used data mining techniques.

Consider a problem from your current or a past job, a hobby, or an interest that would make for a good application of one of the following techniques:

• Text-based data mining

• Co-occurrence grouping and associations

• Profiling

• Link prediction

Describe why this would be an appropriate example of a problem that can be solved with one of the methods above and how the results of the analysis would be used.

Please do not choose a hypothetical example, such as something from the textbook or the slides. It should be something with which you have personal experience (yes, this problem is like problem 2 from problem set 2).

Problem 3

You have been hired by a hotel chain to take another crack at improving their booking and profitability. Armed with more data mining knowledge than ever before, you decide to once again create a classification decision tree model to predict cancelations, only this time you brought in the big guns: ensemble methods.

Target variable:

· is_canceled: whether the reservation was canceled

Predictor variables:

· hotel_type: whether the hotel is a “resort” or “city” hotel

· summer: whether the reservation was made for the summer season or not

· children: whether children are listed on the reservation

· previous_cancelations: whether the person who made the reservation has canceled before

We have 3 different tree induction models. Evaluate each model on the test set.


1. A regular single decision tree

2. An ensemble of trees using random forests (which BigML calls "decision forests")

3. An ensemble of trees using boosting (which BigML calls "boosted trees")

Finally, describe and compare the performance of each model and comment on whether their relative performance met your expectations.
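
If it helps to compare the models on a common footing, the following Python sketch computes accuracy, precision, and recall from test-set labels. It assumes you have the actual is_canceled values and one model's predictions as lists of 0s and 1s (1 = canceled). The function names and the tiny label lists are placeholders invented for illustration; they are not results from the hotel data or from BigML.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, treating 1 as 'canceled'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Return accuracy, precision, and recall for one model's predictions."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Placeholder labels, made up for illustration only (not real results).
example_actual = [1, 0, 1, 1, 0, 0, 1, 0]
example_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

example_metrics = summarize(example_actual, example_predicted)
print(example_metrics)
```

Running `summarize` once per model on the same test set gives three sets of numbers that can be compared side by side. Which metric matters most depends on what a missed cancelation costs the hotel relative to a false alarm.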

