Chapter 13

  1. To collect a dataset, you can do any of the following:
  1. Git LFS can work seamlessly with Git. Specifically, we can use Git LFS to track the extensions of large files that we want to have version control on, and Git will work with Git LFS to delegate those files when we want to push our projects to GitHub. Afterward, we simply need to use Git in the usual way.
  2. An attribute that contains continuous, numerical data often has its missing values filled out with the mean. On the other hand, attributes with discrete numerical data as well as categorical data can use the mode to fill out their missing values.
  3. In a naive encoding scheme, you may inadvertently apply some sort of an ordered relation to the data when the original data is replaced with numerical values. With one-hot encoding, we can avoid this problem by creating new binary attributes that contain the same data as the original attribute. However, in an attribute with a large number of unique values, one-hot encoding might greatly increase the dimensionality of our dataset, which is undesirable in many cases.
  4. Bar charts can be applied to categorical attributes while distribution plots can visualize numerical attributes, both discrete and continuous.
  5. The feature correlation matrix of a dataset can identify any attribute that is highly correlated with the target attribute we are interested in, which can help us obtain valuable insights regarding the dataset.
  1. Sometimes, we use specific machine learning models to analyze a dataset and compute the feature importance of each dataset attribute. This feature importance value denotes how important that attribute was during the learning process of the mode, thus indicating some sort of correlation between that attribute and our target attribute.