PS7
Problem 1
Part of the challenge of data mining text is that the sequence and context of words matter in communication. Consider the use of the word “good” in a movie review. Briefly explain how the word “good” could be used to convey both positive and negative feelings about a movie, why this highlights the importance of context, and whether you believe there is a way to work around this problem.
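To make the issue concrete, the sketch below scores two short reviews in two ways: once by summing word scores with no regard for order, and once with a simple rule that flips a word's score when it directly follows a negator. The lexicon, the negator list, and both function names are hypothetical, chosen only for illustration; this is not a method prescribed by the module.

```python
# Hypothetical word-level sentiment lexicon (an assumption for illustration).
LEXICON = {"good": 1, "great": 1, "bad": -1, "boring": -1}
# Hypothetical list of negating words (also an assumption).
NEGATORS = {"not", "never", "hardly"}


def bag_of_words_score(text):
    """Sum word scores, ignoring word order and context."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def negation_aware_score(text):
    """Flip the score of any word that directly follows a negator."""
    score = 0
    previous = ""
    for word in text.lower().split():
        value = LEXICON.get(word, 0)
        if previous in NEGATORS:
            value = -value
        score += value
        previous = word
    return score


reviews = ["the acting was good", "the acting was not good"]
for review in reviews:
    print(review, bag_of_words_score(review), negation_aware_score(review))
```

The order-blind scorer gives both reviews the same score, while the negation rule separates them. The rule only looks one word back, so it would still miss sarcasm or a negator placed several words earlier.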
Problem 2
This module provided an overview of a handful of other commonly used data mining techniques.
Consider a problem from your current or a past job, a hobby, or an interest that would make for a good application of one of the following techniques:
• Text-based data mining
• Co-occurrence grouping and associations
• Profiling
• Link prediction
Describe why this would be an appropriate example of a problem that can be solved with one of the methods above and how the results of this analysis would be used.
Please do not choose a hypothetical example, such as something from the textbook or the slides. It should be something with which you have personal experience (yes, this problem is like problem 2 from problem set 2).
Problem 3
You have been hired by a hotel chain to take another crack at improving their booking and profitability. Armed with more data mining knowledge than ever before, you decide to once again create a classification decision tree model to predict cancelations, only this time you brought in the big guns: ensemble methods.
Target variable:
· is_canceled: whether the reservation was canceled
Attributes:
· hotel_type: whether the hotel is a “resort” or “city” hotel
· summer: whether the reservation was made for the summer season or not
· children: whether children are listed on the reservation
· previous_cancelations: whether the person who made the reservation has canceled before
We have three different tree induction models. Evaluate each model on the test set:
1. A regular single decision tree
https://bigml.com/shared/evaluation/xTXf88MOhwF3cLqmAOBkqOTh9rA
2. An ensemble of trees using random forests (which BigML calls “decision forests”)
https://bigml.com/shared/evaluation/iDLqmKeWNuwr6kDGBK2XFM3ZarD
3. An ensemble of trees using boosting (which BigML calls “boosted trees”)
https://bigml.com/shared/evaluation/uMi3GEWbLih6L5f1q1soFA08kiX
Finally, describe and compare the performance of each model and comment on whether their relative performance met your expectations.
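When comparing the three evaluations, it may help to put each model's confusion matrix into the same set of summary metrics. The sketch below is one way to do that, treating a canceled reservation as the positive class. The function name `summarize` is hypothetical, and the counts in the example call are placeholders rather than values from the BigML evaluations; substitute the counts reported for each model.

```python
def summarize(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts.

    tp: canceled reservations predicted as canceled
    fp: kept reservations predicted as canceled
    fn: canceled reservations predicted as kept
    tn: kept reservations predicted as kept
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model never predicts "canceled"
    # or the test set contains no cancelations.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Placeholder counts only (an assumption), NOT taken from the linked evaluations.
example = summarize(tp=30, fp=10, fn=20, tn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Running the same function on all three models gives a like-for-like comparison, which is useful because accuracy alone can look strong when cancelations are a minority of reservations.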