Process matters in every non-trivial task. Following a software process is widely recommended when building software, and its advantages are described in many resources. A machine learning project can follow a process too. Several processes have been defined for machine learning projects; the one I am sharing here is based on the blog MachineLearningMastery.com:
A) Problem Definition
1. What is the problem?
2. Why does the problem need to be solved?
3. How would I solve the problem?
4. Describe provided data
B) Analyze Data
1. Summarize Data - describe the data
2. Visualize Data - plot different graphs to get an idea of the data, such as scatter plots and histograms
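As a rough illustration of the summarize and visualize steps, here is a minimal sketch using only the Python standard library. The `heights_cm` values, the function names and the bin width are my own assumptions for the example, not part of the original process; in practice a plotting library would draw the histogram.

```python
import statistics
from collections import Counter

# Hypothetical sample: heights in centimetres for a handful of records.
heights_cm = [150.0, 160.0, 160.0, 170.0, 180.0]

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def count_by_bin(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    bins = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

summary = summarize(heights_cm)
histogram = count_by_bin(heights_cm, bin_width=10)

# A crude text histogram: one '#' per record in the bin.
for lower_edge, count in histogram.items():
    print(f"{lower_edge:>4} | {'#' * count}")
```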
C) Prepare Data
1. Select data - decide which data to use, what is missing, and what should be eliminated
2. Pre-process data - formatting, cleaning, sampling
3. Transform data / Feature engineering - operations such as scaling, attribute decomposition, and attribute aggregation
a) scaling - bring attributes to a uniform scale (for instance, express an attribute in the same unit for every instance, e.g. meters or kilometers)
b) decomposition - decompose complex features; for instance, a date can be decomposed into year, month, day, and so on
c) aggregation - where possible, aggregate or combine trivial features into a more meaningful one. For instance, dob and age can be aggregated
into dob alone, since the age can be found from the dob and the current date
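The three transformations above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the unit table and the sample values are hypothetical.

```python
from datetime import date

def to_meters(value, unit):
    """Scaling: express a distance in one common unit (meters)."""
    factors = {"m": 1.0, "km": 1000.0}  # assumed set of units
    return value * factors[unit]

def decompose_date(d):
    """Decomposition: split a date into year, month and day features."""
    return {"year": d.year, "month": d.month, "day": d.day}

def age_from_dob(dob, today):
    """Aggregation: keep only dob and derive the age in whole years."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

# Hypothetical records mixing kilometers and meters.
distances = [(2.5, "km"), (300.0, "m")]
scaled = [to_meters(value, unit) for value, unit in distances]

parts = decompose_date(date(2015, 3, 14))
age = age_from_dob(date(1990, 6, 15), today=date(2015, 3, 14))
```

Passing `today` explicitly keeps the age calculation reproducible; in real use it would typically be `date.today()`.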
D) Evaluate Algorithms
Figure out the right algorithm for the problem.
1. Test Harness - if different algorithms all perform poorly on the problem and dataset, we might need to re-transform the data or adjust the problem definition.
Rather than training and testing on the same data set, prepare separate train and test data: (i) randomly split the data into train and test sets n times and take the average, or (ii) use k-fold cross-validation: split the dataset into k subsets, train on (k-1) subsets and test on the remaining one, and repeat this k times so that every instance is used for both training and testing.
Statistical significance tests such as Student's t-test can also be performed on the results.
2. Spot-check Algorithms - rather than picking the algorithm that did well on other problems or exhaustively trying all possible algorithms, quickly find the methods that do well on this problem. Design the experiment, run it, then analyze the results. Be methodical.
Start building up your suite of algorithms for spot-check experiments.
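To make the test harness and spot-check idea concrete, here is a small sketch of k-fold cross-validation in plain Python, used to compare two very simple predictors. Everything in it is an assumption for illustration: the toy one-dimensional dataset, the two predictors (a majority-class baseline and 1-nearest-neighbor) and the function names. A real project would more likely rely on a library for this.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class(train, x):
    """Baseline: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def nearest_neighbor(train, x):
    """1-NN: predict the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def cross_validate(data, predict, k=5):
    """Mean accuracy over k folds; each row is tested exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [row for i, row in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        correct = sum(predict(train, x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical 1-D dataset: values below 10 are "low", the rest "high".
data = [(float(x), "low" if x < 10 else "high") for x in range(20)]

# Spot check: run every candidate through the same harness.
results = {
    "majority_class": cross_validate(data, majority_class),
    "nearest_neighbor": cross_validate(data, nearest_neighbor),
}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Because every candidate goes through the same folds, the scores are directly comparable, which is the point of building the harness before trying algorithms.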
E) Improve Results
We can attempt to improve the results using the following approaches:
a) Algorithm Tuning - adjusting the configuration parameters of the algorithms used
b) Ensembling -
(i) Bagging - each model in the ensemble votes with equal weight. To promote model variance, bagging trains each model in the ensemble on a randomly drawn subset of the training set
(ii) Boosting - models are linked in a chain, where each model learns to fix the mistakes made by the model before it, and so on down the line
(iii) Blending/Stacking/Stacked Aggregation - involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs
c) Extreme Feature Engineering - this strategy is about exposing more structure in the problem for the algorithms to learn. Here we push the idea of feature engineering to its extreme limits
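As one possible illustration of bagging, the sketch below trains several copies of a simple model on random draws from the training set and lets them vote with equal weight. It assumes sampling with replacement (a bootstrap sample), uses 1-nearest-neighbor as the base model, and works on a made-up dataset; all names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows at random, with replacement."""
    return [rng.choice(data) for _ in data]

def nearest_neighbor(train, x):
    """Base model: predict the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def bagging_predict(data, x, n_models=11, seed=0):
    """Each model sees its own random sample; votes are weighted equally."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        votes[nearest_neighbor(sample, x)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical, well-separated 1-D data.
data = ([(float(x), "low") for x in range(5)]
        + [(float(x), "high") for x in range(100, 105)])

prediction_low = bagging_predict(data, 1.0)
prediction_high = bagging_predict(data, 103.0)
```

An odd number of models avoids tied votes in a two-class problem.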
F) Present Results
(i) Report the results, including the following: context of the problem and motivation; problem description; solution description; findings about the data; methods that worked or didn't work; limitations of the model, data, and approach; conclusions (summary of context, problem, and solution)
(ii) Operationalize the model -
a) Algorithm Implementation - use a production-level library that supports the method you wish to use
b) Modeling tests - write automated tests
c) Tracking - track the performance of the model at run time and, if required, think carefully before allowing models to update themselves in a production environment
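For the tracking step, one simple option is to keep a rolling accuracy over the most recent predictions and flag the model when it drops below a threshold. This is only a sketch under my own assumptions: the `AccuracyTracker` class, the window size and the threshold are hypothetical, and it presumes the true labels become available after prediction.

```python
from collections import deque

class AccuracyTracker:
    """Track accuracy over the most recent `window` predictions."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched the true label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy, or None before any outcome is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True when rolling accuracy has fallen below the threshold."""
        current = self.accuracy()
        return current is not None and current < self.threshold

# Hypothetical stream of (predicted, actual) pairs from production.
stream = [("a", "a"), ("a", "a"), ("b", "a"), ("b", "a"), ("b", "a")]

tracker = AccuracyTracker(window=4, threshold=0.75)
for predicted, actual in stream:
    tracker.record(predicted, actual)

rolling_accuracy = tracker.accuracy()
alert = tracker.needs_attention()
```

The same class is easy to cover with automated tests, which fits the "Modeling tests" item above.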