- Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
- Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used to prepare features -- e.g. the mean and standard deviation used for normalization).
- A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.
Detecting insults
Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the GitHub repo so you are up to date.

First, we load the data using pandas. (You can download the data from the Kaggle site; you'll have to sign up.)
    import pandas

    training_data = pandas.read_csv('train.csv')

    print training_data

    pandas.core.frame.DataFrame
    Int64Index: 3947 entries, 0 to 3946
    Data columns:
    Insult     3947  non-null values
    Date       3229  non-null values
    Comment    3947  non-null values
    dtypes: int64(1), object(2)
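The Insult column is the binary target we want to predict. As a quick sanity check (plain pandas, not part of Ramp), you can look at how the labels are distributed before doing anything fancier:

    # Plain-pandas peek at the target distribution, continuing from the snippet above.
    print training_data['Insult'].value_counts()   # counts of 0s and 1s
    print training_data['Insult'].mean()           # fraction of comments labeled insults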
Anyway, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
    from ramp import *

    context = DataContext(
        store='~/data/insults/ramp',
        data=training_data)
Next, we'll specify a base configuration for our analysis.
    base_config = Configuration(
        target='Insult',
        metrics=[metrics.AUC()],
        )
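The only metric here is AUC: the area under the ROC curve, which measures how well the model's predicted probabilities rank insulting comments above benign ones (1.0 is a perfect ranking, 0.5 is chance). If you want the same number outside of Ramp, a reasonably recent scikit-learn computes it directly; a tiny illustration with made-up labels and scores:

    from sklearn.metrics import roc_auc_score

    # made-up labels and predicted insult probabilities, just to show the call
    y_true = [0, 0, 1, 1, 0, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
    print roc_auc_score(y_true, y_scores)   # 1.0 = perfect ranking, 0.5 = chance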
Model exploration
Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
    import sklearn.linear_model
    import sklearn.naive_bayes
    from ramp.estimators.sk import BinaryProbabilities

    base_features = [
        Length('Comment'),
        Log(Length('Comment') + 1)
    ]

    factory = ConfigFactory(base_config,
        features=[
            # first feature set is basic attributes
            base_features,

            # second feature set adds word features
            base_features + [
                text.NgramCounts(
                    text.Tokenizer('Comment'),
                    mindocs=5,
                    bool_=True)],

            # third feature set creates character 5-grams
            # and then selects the top 1000 most informative
            base_features + [
                trained.FeatureSelector(
                    [text.NgramCounts(
                        text.CharGrams('Comment', chars=5),
                        bool_=True,
                        mindocs=30)
                    ],
                    selector=selectors.BinaryFeatureSelector(),
                    n_keep=1000,
                    target=F('Insult')),
            ],

            # the fourth feature set creates 100 latent vectors
            # from the character 5-grams
            base_features + [
                text.LSI(
                    text.CharGrams('Comment', chars=5),
                    mindocs=30,
                    num_topics=100),
            ]
        ],

        # we'll try two estimators (and wrap them so
        # we get class probabilities as output):
        model=[
            BinaryProbabilities(
                sklearn.linear_model.LogisticRegression()),
            BinaryProbabilities(
                sklearn.naive_bayes.GaussianNB())
        ])
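The character 5-gram features in the third and fourth sets deserve a comment: they hold up against the creative spellings common in abusive comments, where word-level tokens fall apart. If Ramp's wrappers feel opaque, the representation they start from is roughly what scikit-learn's CountVectorizer produces with a character analyzer. Here is a rough non-Ramp sketch of the same idea; the parameters are CountVectorizer's own, and the mapping to Ramp's mindocs/bool_ options is my reading rather than a documented equivalence:

    from sklearn.feature_extraction.text import CountVectorizer

    # Binary presence/absence of character 5-grams that occur in at least
    # 30 comments -- roughly the matrix the third feature set starts from.
    vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5),
                                 min_df=30, binary=True)
    char_grams = vectorizer.fit_transform(training_data['Comment'])
    print char_grams.shape   # (number of comments, number of surviving 5-grams)

Ramp then either keeps the 1,000 most informative of these columns (via the trained FeatureSelector) or compresses them into 100 LSI topics, which keeps the larger feature sets tractable for the estimators.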
Now, let's run cross-validation and compare the results:
    for config in factory:
        models.cv(config, context, folds=5, repeat=2,
                  print_results=True)
    ...
    Configuration
    model: Probabilites for LogisticRegression(C=1.0, class_weight=None,
        dual=False, fit_intercept=True, intercept_scaling=1, penalty=l2,
        tol=0.0001)
    3 features
    target: Insult

    auc
    0.8679 (+/- 0.0101) [0.8533,0.8855]
    ...
    Configuration
    model: Probabilites for GaussianNB()
    3 features
    target: Insult

    auc
    0.6055 (+/- 0.0171) [0.5627,0.6265]
    ...
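Each summary line is an AUC aggregated over the held-out folds: folds=5 with repeat=2 runs 5-fold cross-validation twice, and the printed value is the mean with a spread and range across those folds. In the excerpts shown, logistic regression clearly beats the Gaussian naive Bayes baseline. For readers who want the same mechanics outside of Ramp, here is a minimal, self-contained scikit-learn sketch (toy data and the modern API, not Ramp internals):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    # Toy stand-in for the real feature matrix and Insult labels,
    # just to show what "5 folds, repeated twice" means.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
    scores = cross_val_score(LogisticRegression(), X, y, scoring='roc_auc', cv=cv)
    print(scores.mean(), scores.min(), scores.max())   # mean and range over 10 folds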
Predictions
We can also create a quick utility that processes a given comment and spits out its probability of being an insult:
    from hashlib import md5
    from pandas import DataFrame

    def probability_of_insult(config, ctx, txt):
        # create a unique index for this text
        idx = int(md5(txt).hexdigest()[:10], 16)

        # add the new comment to our DataFrame
        d = DataFrame(
            {'Comment': [txt]},
            index=pandas.Index([idx]))
        ctx.data = ctx.data.append(d)

        # specify which instances to predict with predict_index
        # and make the prediction
        pred, predict_x, predict_y = models.predict(
            config, ctx,
            predict_index=pandas.Index([idx]))

        return pred[idx]
    probability_of_insult(
        logreg_lsi_100_config,
        context,
        "ur an idiot")
    > .8483555

    probability_of_insult(
        logreg_lsi_100_config,
        context,
        "ur great")
    > .099361
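One portability note if you are following along today: the snippets in this post are Python 2. For example, hashlib.md5 requires bytes on Python 3, so the index line in probability_of_insult would need to encode the comment first:

    from hashlib import md5

    txt = "ur an idiot"   # example comment
    # On Python 3, md5 wants bytes, so encode the text before hashing
    idx = int(md5(txt.encode('utf-8')).hexdigest()[:10], 16)
    print(idx)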
And more
There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on GitHub: https://github.com/kvh/ramp. Email me or submit an issue if you have any bugs, suggestions, or comments.