- Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
- Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization; see the sketch below).
- A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.
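To make the second point concrete, here is a minimal sketch of what tracking a "preparation set" means. This is plain pandas with made-up numbers, not Ramp's API: the mean and stdev come from the training rows only, and those stored values are what get applied to any new data. Ramp does this bookkeeping (and the caching of the results) for you.

```python
# Minimal sketch of the "preparation set" idea (plain pandas, not Ramp's API):
# normalization statistics are computed from the training rows only, then the
# same stored values are reused for any new rows, so nothing leaks from the
# test set and nothing needs to be recomputed.
import pandas

data = pandas.DataFrame({'Length': [12.0, 80.0, 45.0, 200.0, 7.0]})
train = data['Length'].iloc[:3]          # rows we are allowed to learn from

prep = {'mean': train.mean(), 'std': train.std()}   # the "prep set" statistics

# apply the *stored* training statistics to every row, including unseen ones
normalized = (data['Length'] - prep['mean']) / prep['std']
```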
Detecting insults
Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the GitHub repo so you are up to date. First, we load up the data using pandas. (You can download the data from the Kaggle site; you'll have to sign up.)
```python
import pandas

training_data = pandas.read_csv('train.csv')
print training_data

# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 3947 entries, 0 to 3946
# Data columns:
# Insult     3947  non-null values
# Date       3229  non-null values
# Comment    3947  non-null values
# dtypes: int64(1), object(2)
```
Anyway, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
```python
from ramp import *

context = DataContext(
    store='~/data/insults/ramp',
    data=training_data)
```
Next, we'll specify a base configuration for our analysis.
```python
base_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
)
```
Model exploration
Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
```python
import sklearn.linear_model
import sklearn.naive_bayes
from ramp.estimators.sk import BinaryProbabilities

base_features = [
    Length('Comment'),
    Log(Length('Comment') + 1)
]

factory = ConfigFactory(base_config,
    features=[
        # first feature set is basic attributes
        base_features,

        # second feature set adds word features
        base_features + [
            text.NgramCounts(
                text.Tokenizer('Comment'),
                mindocs=5,
                bool_=True)],

        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
            trained.FeatureSelector(
                [text.NgramCounts(
                    text.CharGrams('Comment', chars=5),
                    bool_=True,
                    mindocs=30)],
                selector=selectors.BinaryFeatureSelector(),
                n_keep=1000,
                target=F('Insult')),
        ],

        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
            text.LSI(
                text.CharGrams('Comment', chars=5),
                mindocs=30,
                num_topics=100),
        ]
    ],

    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
    model=[
        BinaryProbabilities(
            sklearn.linear_model.LogisticRegression()),
        BinaryProbabilities(
            sklearn.naive_bayes.GaussianNB())
    ])
```
Now, let's run cross-validation and compare the results:
```python
for config in factory:
    models.cv(config, context, folds=5, repeat=2,
              print_results=True)
```
```
...
Configuration
model: Probabilites for LogisticRegression(C=1.0, class_weight=None,
    dual=False, fit_intercept=True, intercept_scaling=1,
    penalty=l2, tol=0.0001)
3 features
target: Insult

auc
0.8679 (+/- 0.0101) [0.8533, 0.8855]
...
Configuration
model: Probabilites for GaussianNB()
3 features
target: Insult

auc
0.6055 (+/- 0.0171) [0.5627, 0.6265]
...
```
Predictions
We can also create a quick utility that processes a given comment and spits out its probability of being an insult:
```python
from hashlib import md5

from pandas import DataFrame


def probability_of_insult(config, ctx, txt):
    # create a unique index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)

    # add the new comment to our DataFrame
    d = DataFrame(
        {'Comment': [txt]},
        index=pandas.Index([idx]))
    ctx.data = ctx.data.append(d)

    # specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(
        config,
        ctx,
        predict_index=pandas.Index([idx]))

    return pred[idx]
```
```
probability_of_insult(
    logreg_lsi_100_config,
    context,
    "ur an idiot")
> .8483555

probability_of_insult(
    logreg_lsi_100_config,
    context,
    "ur great")
> .099361
```
And more
There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on GitHub: https://github.com/kvh/ramp. Email me or submit an issue if you have any bugs/suggestions/comments.