[Repost] Getting started with Ramp: Detecting insults

Ramp is a Python library for rapid machine learning prototyping. It provides a simple, declarative syntax for exploring features, algorithms and transformations quickly and efficiently. At its core it's a pandas wrapper around various Python machine learning and statistics libraries (scikit-learn, rpy2, etc.). Some features:

Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization.)
A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.
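To make the "preparation set" idea concrete, here is a minimal sketch in plain Python. It is not Ramp's implementation: the helper names `fit_normalizer` and `normalize` and the sample numbers are made up for illustration. The point is only that the mean and stdev come from the preparation rows and are then reused, unchanged, on any other rows.

```python
from statistics import mean, pstdev

def fit_normalizer(prep_values):
    """Compute the statistics from the preparation set only."""
    return mean(prep_values), pstdev(prep_values)

def normalize(values, mu, sigma):
    """Apply previously fitted statistics to any rows, seen or unseen."""
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

prep_lengths = [10, 20, 30]   # hypothetical comment lengths in the preparation set
new_lengths = [20, 50]        # rows that were not part of the preparation set

mu, sigma = fit_normalizer(prep_lengths)
scaled = normalize(new_lengths, mu, sigma)
```

The new rows never influence `mu` or `sigma`, which is what keeps a normalized feature comparable between training and prediction.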

Detecting insults

Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the GitHub repo so you are up-to-date.

First, we load up the data using pandas. (You can download the data from the Kaggle site; you'll have to sign up.)

import pandas

training_data = pandas.read_csv('train.csv')

print(training_data)
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3947 entries, 0 to 3946
Data columns:
Insult     3947  non-null values
Date       3229  non-null values
Comment    3947  non-null values
dtypes: int64(1), object(2)

We've got about 4000 comments along with the date they were posted and a boolean indicating whether or not the comment was classified as insulting. If you're curious, the insults in question range from the relatively civilized ("... you don't have a basic grasp on biology") to the mundane ("suck my d***, *sshole"), to the truly bottom-of-the-internet horrific (pass).

Anyway, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.

from ramp import *
 
context = DataContext(
              store='~/data/insults/ramp',
              data=training_data)

We just provided a directory path for the store, so Ramp will use the default HDFPickleStore, which attempts to store objects (on disk) in the fast HDF5 format and falls back to pickling if that is not an option.
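The "try HDF5, fall back to pickling" behaviour can be sketched roughly as follows. This is an illustration of the idea only, not the code of HDFPickleStore: the class name `FallbackStore`, its methods, and the file layout are all hypothetical. The only pandas calls used are `to_hdf` and `pandas.read_hdf`, and the HDF5 path is taken only when the object supports it and PyTables is installed.

```python
import os
import pickle
import tempfile

class FallbackStore(object):
    """Hypothetical store: try HDF5 first, fall back to pickle."""

    def __init__(self, path):
        self.path = os.path.expanduser(path)
        if not os.path.isdir(self.path):
            os.makedirs(self.path)

    def _file(self, key, ext):
        # Assumes key is already safe to use as a file name.
        return os.path.join(self.path, key + ext)

    def _discard(self, filename):
        if os.path.exists(filename):
            os.remove(filename)

    def save(self, key, value):
        h5, pkl = self._file(key, ".h5"), self._file(key, ".pkl")
        try:
            # Only pandas objects have to_hdf, and it needs PyTables.
            value.to_hdf(h5, key="data", mode="w")
            self._discard(pkl)
            return "hdf5"
        except Exception:
            self._discard(h5)  # drop any partial or stale HDF5 file
            with open(pkl, "wb") as f:
                pickle.dump(value, f)
            return "pickle"

    def load(self, key):
        h5 = self._file(key, ".h5")
        if os.path.exists(h5):
            import pandas
            return pandas.read_hdf(h5, key="data")
        with open(self._file(key, ".pkl"), "rb") as f:
            return pickle.load(f)

# A plain dict cannot be written as HDF5, so this takes the pickle path.
store = FallbackStore(tempfile.mkdtemp())
backend = store.save("feature_lengths", {"a": 1, "b": 2})
restored = store.load("feature_lengths")
```

Catching every exception is deliberately blunt here; a real store would likely distinguish "format not supported" from genuine I/O errors.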

Next, we'll specify a base configuration for our analysis.

base_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
    )

Here we have specified the DataFrame column 'Insult' as the target for our classification and the AUC for our metric.
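For readers who have not met AUC before: it is the probability that a randomly chosen insulting comment gets a higher score than a randomly chosen non-insulting one, with ties counted as half. The sketch below computes it directly from that definition. The function name `auc_score` is hypothetical and the quadratic loop is for clarity only; it is not how metrics.AUC is implemented.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
result = auc_score(labels, predicted)   # 3 of the 4 pairs are ordered correctly
```

A score of 0.5 is what random guessing would give, and 1.0 means every insult outranks every non-insult.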

Model exploration

Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.

import sklearn.linear_model
import sklearn.naive_bayes
from ramp.estimators.sk import BinaryProbabilities

base_features = [
    Length('Comment'),
    Log(Length('Comment') + 1)
]
 
factory = ConfigFactory(base_config,
    features=[
        # first feature set is basic attributes
        base_features,
 
        # second feature set adds word features
        base_features + [
            text.NgramCounts(
                text.Tokenizer('Comment'),
                mindocs=5,
                bool_=True)],
 
        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
            trained.FeatureSelector(
                [text.NgramCounts(
                    text.CharGrams('Comment', chars=5),
                    bool_=True,
                    mindocs=30)
                ],
                selector=selectors.BinaryFeatureSelector(),
                n_keep=1000,
                target=F('Insult')),
            ],
 
        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
            text.LSI(
                text.CharGrams('Comment', chars=5),
                mindocs=30,
                num_topics=100),
            ]
    ],
 
    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
    model=[
        BinaryProbabilities(
            sklearn.linear_model.LogisticRegression()),
        BinaryProbabilities(
            sklearn.naive_bayes.GaussianNB())
    ])

We've defined some base features along with four feature sets that seem promising.
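Four feature sets and two estimators give eight candidate configurations, assuming the factory enumerates every combination of the options it is given. The sketch below shows that enumeration with plain dictionaries. `iterate_configs` and the string labels are hypothetical stand-ins, not Ramp's ConfigFactory.

```python
import itertools

def iterate_configs(base, **options):
    """Yield one config dict per combination of the given option lists."""
    names = sorted(options)
    for combo in itertools.product(*(options[name] for name in names)):
        config = dict(base)              # copy so the base stays untouched
        config.update(zip(names, combo))
        yield config

base = {"target": "Insult", "metric": "auc"}
configs = list(iterate_configs(
    base,
    features=["base", "base+words", "base+top-chargrams", "base+lsi"],
    model=["logistic regression", "gaussian naive bayes"]))

for config in configs:
    print(config["features"], "/", config["model"])
```

Each yielded dictionary is independent of the others, so a run can modify or cache one configuration without affecting the rest.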

Now, let's run cross-validation and compare the results:

for config in factory:
    models.cv(config, context, folds=5, repeat=2, 
              print_results=True)
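With folds=5 and repeat=2, each configuration is scored on ten train/test splits. The sketch below generates such splits over row positions. It is only an assumption about what "repeated k-fold" means here: the article does not say how models.cv shuffles or partitions the rows, and `repeated_kfold` and its `seed` parameter are hypothetical.

```python
import random

def repeated_kfold(n_rows, folds=5, repeat=2, seed=0):
    """Yield (train, test) index lists: `folds` splits per repetition."""
    rng = random.Random(seed)
    for _ in range(repeat):
        order = list(range(n_rows))
        rng.shuffle(order)                   # new random partition each repetition
        for k in range(folds):
            test = sorted(order[k::folds])   # every folds-th row after shuffling
            held_out = set(test)
            train = [i for i in range(n_rows) if i not in held_out]
            yield train, test

# 3947 rows, as in the training data loaded earlier
splits = list(repeated_kfold(3947, folds=5, repeat=2))
```

Within one repetition every row is held out exactly once, so the reported mean and spread summarize ten scores per configuration.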

Here are a couple of snippets of the output:

...
 
Configuration
 model: Probabilites for LogisticRegression(
          C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, tol=0.0001)
 3 features
 target: Insult
auc
0.8679 (+/- 0.0101) [0.8533,0.8855]
 
...
 
Configuration
 model: Probabilites for GaussianNB()
 3 features
 target: Insult
auc
0.6055 (+/- 0.0171) [0.5627,0.6265]
 
...

As expected, logistic regression dominates naive Bayes. The best feature sets are the 100-topic LSI and the top 1000 character 5-grams. Once a feature is computed, it does not need to be computed again in separate contexts. Binary feature selection is an exception, though: because it uses the target "y" values to select features, Ramp needs to recreate it for each cross-validation fold using only that fold's training values. (You can also cheat and tell it not to do this, training it just once against the entire data set.)
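Why target-based selection has to be redone per fold is easier to see in code. In the sketch below, boolean character 5-gram features are ranked using the training comments and their labels only, and the held-out comment is then encoded with whatever was selected. The scoring rule (difference in document rate between the two classes) is a simple stand-in chosen for illustration, not Ramp's BinaryFeatureSelector, and all function names and sample comments are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=5):
    """Set of distinct character n-grams in one comment (boolean features)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def select_ngrams(train_texts, train_labels, n_keep, n=5):
    """Rank n-grams by how differently they occur in the two classes."""
    doc_freq = {0: Counter(), 1: Counter()}
    class_size = Counter(train_labels)
    for text, label in zip(train_texts, train_labels):
        doc_freq[label].update(char_ngrams(text, n))
    vocab = set(doc_freq[0]) | set(doc_freq[1])

    def score(gram):
        # max(..., 1) avoids dividing by zero if a class is missing
        rate1 = doc_freq[1][gram] / float(max(class_size[1], 1))
        rate0 = doc_freq[0][gram] / float(max(class_size[0], 1))
        return abs(rate1 - rate0)

    # Sort by score, breaking ties alphabetically for a stable result.
    ranked = sorted(vocab, key=lambda gram: (-score(gram), gram))
    return ranked[:n_keep]

def featurize(texts, selected, n=5):
    """One row of 0/1 flags per text, one column per selected n-gram."""
    return [[int(g in char_ngrams(t, n)) for g in selected] for t in texts]

train_texts = ["you idiot", "total idiot", "nice point", "good point"]
train_labels = [1, 1, 0, 0]
test_texts = ["what an idiot"]      # held out: never seen by select_ngrams

selected = select_ngrams(train_texts, train_labels, n_keep=2)
test_features = featurize(test_texts, selected)
```

If `select_ngrams` were given the held-out comments and their labels too, the selected columns would already encode information about the test fold and the cross-validated score would look better than it should.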

Predictions

We can also create a quick utility that processes a given comment and spits out its probability of being an insult:

from hashlib import md5
from pandas import DataFrame

def probability_of_insult(config, ctx, txt):
    # create a unique index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)
 
    # add the new comment to our DataFrame
    d = DataFrame(
            {'Comment': [txt]},
            index=pandas.Index([idx]))
    ctx.data = ctx.data.append(d)
 
    # Specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(
            config, 
            ctx,
            predict_index=pandas.Index([idx]))
 
    return pred[idx]

And we can run it on some sample text:

probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur an idiot")
 
> .8483555
 
probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur great")
 
> .099361

Ramp will need to create the model for the full training data set the first time you make a prediction, but will then cache and store it, allowing you to quickly classify subsequent text.
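The "fit once, then reuse" behaviour can be pictured as a small disk cache keyed by a description of what was computed. The sketch below is an assumption about the general pattern, not Ramp's storage code: the article does not describe how Ramp keys its cache, and `cached`, `fit_model` and the description string are hypothetical.

```python
import os
import pickle
import tempfile
from hashlib import md5

def cached(store_dir, description, compute):
    """Return compute() the first time, the pickled result afterwards."""
    key = md5(description.encode("utf-8")).hexdigest()
    filename = os.path.join(store_dir, key + ".pkl")
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(filename, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def fit_model():
    calls.append("fit")      # stands in for an expensive training run
    return {"weights": [0.1, -0.2]}

store_dir = tempfile.mkdtemp()
first = cached(store_dir, "logreg + lsi(100) on full data", fit_model)
second = cached(store_dir, "logreg + lsi(100) on full data", fit_model)
```

The second call never reaches `fit_model`; it only pays for reading the pickle, which is why later predictions come back quickly.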

And more

There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on GitHub: https://github.com/kvh/ramp. Email me or submit an issue if you have any bugs/suggestions/comments.