These commands assume you are using the standard Python stack of pandas, NumPy, scikit-learn, and matplotlib. If you do not use this ML pipeline, you can stop reading now.
Cross-validation is a method for splitting the data set into training and test sets, then training the algorithm on the training set and evaluating it on the test set. You do not need to divide the data manually; scikit-learn's train_test_split does it for you.
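As a minimal sketch of such a split, assuming a recent scikit-learn where train_test_split lives in sklearn.model_selection; the toy arrays X and y are hypothetical placeholders, not data from this post:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```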
It’s important to understand that train_test_split does NOT preserve the percentage of samples for each class by default. If you want to preserve it (and if you have unbalanced classes, you almost certainly do), use a stratified split.
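One way to get a stratified split, sketched here on the assumption that your scikit-learn version supports the stratify argument of train_test_split (StratifiedShuffleSplit is another option); the 80/20 toy labels are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced data: 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fraction of class 1 in each split.
print(y_train.mean(), y_test.mean())
```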
If the values in different columns have different magnitudes, you will need to normalize your features before you do any training. To do that, you can use scikit-learn's StandardScaler. According to the documentation, StandardScaler subtracts the mean and scales to unit variance.
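A short sketch of that scaling step, with hypothetical toy arrays. Fitting the scaler on the training data only and reusing it on the test data is a common convention assumed here, not something the text above prescribes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same mean and standard deviation to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```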
Converting A Pandas Data Frame To Numpy Matrix For Scikit-Learn
Older versions of scikit-learn did not accept pandas DataFrames directly, and some code still expects a plain NumPy array. That is OK, because you can convert a dataframe into a NumPy array easily enough with the following command:
X = df.to_numpy(dtype=float)  # as_matrix() and np.float were removed from recent pandas and NumPy
Note that this only works if all of your data is numerical (no text).
Dropping All Rows That Contain NAN in a Given Column
If a certain feature is required, you will need to drop any rows in your dataframe that have NaN values for that feature. To do this, you can execute something simple like the following:
df = df[np.isfinite(df['FeatureColumn'])]
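The np.isfinite approach applies to numeric columns and also removes infinite values. As an alternative sketch, pandas' own dropna with the subset argument removes only the rows where the given column is missing. The dataframe below is a hypothetical example:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing value in FeatureColumn.
df = pd.DataFrame({
    "FeatureColumn": [1.0, np.nan, 3.0],
    "Label": ["a", "b", "c"],
})

# Keep only the rows where FeatureColumn is present.
df = df.dropna(subset=["FeatureColumn"])

print(df)
```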
Dropping Useless Columns
Sometimes you have columns in your dataframe that you know are not useful for training a model, so drop them, in place:
df.drop(['UselessColumn1', 'UselessColumn2'], axis=1, inplace=True)
If you look at the columns in your dataframe, UselessColumn1 and UselessColumn2 should be gone.
Discretizing Columns In A Dataframe
Supervised learning algorithms usually expect numerical features, so if you need to convert a column that contains a finite set of text classes (states, countries, etc.) to numbers, scikit-learn can do the encoding for you.
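The original snippet for this step is not shown, so the following is only one possible sketch, assuming scikit-learn's LabelEncoder is an acceptable choice. The State column and the new StateCode column are hypothetical names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataframe with a text column of state abbreviations.
df = pd.DataFrame({"State": ["NY", "CA", "NY", "TX"]})

encoder = LabelEncoder()
# Each distinct string is mapped to an integer code (classes are sorted first).
df["StateCode"] = encoder.fit_transform(df["State"])

print(df)
print(list(encoder.classes_))
```

Keep in mind that integer codes imply an ordering between classes. If that is a problem for your model, a one-hot encoding may be a better fit.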
Boolean ANDs or ORs with two columns
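Sometimes you want to filter the rows of a dataframe by ANDing or ORing together conditions on two columns and take the result. Wrap each condition in parentheses and combine them with & (AND) or | (OR).
The parentheses are very important: & and | bind more tightly than comparison operators such as > and ==, so without them pandas will usually raise an error.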
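The filtering described above can be sketched as follows. The column names ColumnA and ColumnB and the thresholds are hypothetical:

```python
import pandas as pd

# Hypothetical dataframe with two numeric columns.
df = pd.DataFrame({"ColumnA": [1, 2, 10, 20], "ColumnB": [3, 0, 7, 1]})

# AND: rows where both conditions hold. Note the parentheses around each condition.
both = df[(df["ColumnA"] > 4) & (df["ColumnB"] > 2)]

# OR: rows where at least one condition holds.
either = df[(df["ColumnA"] > 4) | (df["ColumnB"] > 2)]

print(both)
print(either)
```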