Now we are going to look at logistic regression and try to predict a dichotomous outcome using the BreastCancer dataset.
The dataset ships with the mlbench package, so install the package if needed, load it, and then load the data.
install.packages("mlbench")
library(ggplot2)
library(mlbench)
data(BreastCancer)
data <- BreastCancer
head(data)
Exercise 3a.1
- Explore the data:
  - How many observations are there?
  - Are any variables stored as factors?
  - Is the variable Cl.thickness normally distributed, and what are its median and interquartile range (IQR)?
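The questions above map onto a handful of base R calls. The following is a minimal sketch on a made-up data frame, not on BreastCancer itself; the names `toy`, `score` and `group` are hypothetical and only illustrate the pattern.

```r
# Toy data frame standing in for a real dataset (hypothetical names and values)
set.seed(1)
toy <- data.frame(
  score = sample(1:10, 50, replace = TRUE),                # integer score from 1 to 10
  group = factor(sample(c("a", "b"), 50, replace = TRUE))  # a categorical column
)

str(toy)                             # column types at a glance

n_obs      <- nrow(toy)              # number of observations
n_complete <- nrow(na.omit(toy))     # rows without any missing value
is_fac     <- sapply(toy, is.factor) # which columns are stored as factors?

score_median <- median(toy$score)    # median of a numeric column
score_iqr    <- IQR(toy$score)       # interquartile range of a numeric column
```

A histogram of the column is a quick visual check of whether a normal distribution is plausible.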
Are you experiencing problems?
If we look at the data, we see that the variables are loaded as factors.
The levels of these factors are ordered by value (Cell.shape is an ordinal score, not a label such as “round” or “square”), so we can convert the variables to numeric. We also remove the Id column. Had a variable been purely categorical, with no internal ordering, we should have left it as a factor.
# remove id column
data <- data[,-1]
# convert factors to numeric
for(i in 1:9) {
  data[, i] <- as.numeric(as.character(data[, i]))
}
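The loop goes through as.character() before as.numeric() because calling as.numeric() directly on a factor returns the internal level codes rather than the labels. A small sketch with a hypothetical factor `f` shows the difference:

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "2", "5"))

codes  <- as.numeric(f)                 # internal level codes, not the labels
values <- as.numeric(as.character(f))   # the labels parsed as numbers

print(codes)   # 1 2 3
print(values)  # 10 2 5
```

The levels are sorted as text ("10", "2", "5"), which is why the codes do not match the values here.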
Exercise 3a.2
- Use the glm() function to build a logistic regression model with Class as a function of Cell.shape alone.
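As a reminder of the general call pattern, here is a minimal sketch on simulated data rather than on BreastCancer. The variables `score` and `outcome` and the coefficients used to simulate them are assumptions made for illustration only.

```r
# Simulated data (hypothetical): a 1-10 score and a two-level outcome
set.seed(42)
score   <- sample(1:10, 200, replace = TRUE)
p       <- plogis(-3 + 0.6 * score)   # assumed true probability of "yes"
outcome <- factor(ifelse(runif(200) < p, "yes", "no"), levels = c("no", "yes"))
toy     <- data.frame(score, outcome)

# With a factor response, the first level ("no") is treated as failure,
# so the model describes the probability of "yes"
fit <- glm(outcome ~ score, family = binomial, data = toy)
summary(fit)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
```

The same pattern applies to the exercise once the response and predictor are swapped for the columns of the dataset.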
Exercise 3a.3
Add Cl.thickness, Cell.size and Mitoses to the model, together with their three pairwise interactions.
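In R formula syntax, an interaction between two terms is written with a colon, and `(a + b + c)^2` is shorthand for all main effects plus all pairwise interactions. The sketch below uses placeholder names `y`, `a`, `b` and `c` (not columns of the dataset) to show that the two spellings expand to the same terms:

```r
# Placeholder response y and predictors a, b, c (hypothetical names)
f_long  <- y ~ a + b + c + a:b + a:c + b:c   # interactions written out
f_short <- y ~ (a + b + c)^2                 # shorthand for the same model

labels_long  <- attr(terms(f_long), "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

print(labels_short)
print(identical(labels_long, labels_short))
```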
Exercise 3a.4
Create a boxplot comparing clump thickness (Cl.thickness) between benign and malignant samples.
ggplot(data, aes(Class, Cl.thickness)) + geom_boxplot()