For the past almost four months I have been competing in a Kaggle competition about diabetic retinopathy grading based on high-resolution eye images. In this post I try to reconstruct my progression through the competition: the challenges I faced, the things I tried, what worked and what didn’t. This is not meant as complete documentation but, nevertheless, some more concrete examples can be found at the end and certainly in the code. In the end I finished fifth out of almost 700 competing teams. (Some things still need to be added in the next few days.)
Diabetic retinopathy (DR) is the leading cause of blindness in the working-age population of the developed world and is estimated to affect over 93 million people. (From the competition description where some more background information can be found.)
The grading process consists of recognising features on images of the eye, ranging from very fine details, such as microaneurysms, to bigger features, such as exudates, and sometimes their position relative to each other. (This is not an exhaustive list; you can look at, for example, the long list of criteria used in the UK to grade DR (pdf).)
Some annotated examples from the literature to get an idea of what this really looks like (the medical details/terminology are not very important for the rest of this post):
Now let’s look at it as someone who simply wants to try to model this problem.
You have 35126 images in the training set that look like this
annotated by a patient id and “left” or “right” (each patient has two images, one per eye) and divided into 5 fairly unbalanced classes (per eye/image, not per patient!)
Class | Name | Number of images | Percentage |
---|---|---|---|
0 | Normal | 25810 | 73.48% |
1 | Mild NPDR | 2443 | 6.96% |
2 | Moderate NPDR | 5292 | 15.07% |
3 | Severe NPDR | 873 | 2.48% |
4 | PDR | 708 | 2.01% |
You are asked to predict the class (thus, one of the 5 numbers) for each of the 53576 test images and your predictions are scored on the quadratic weighted kappa metric. For the public leaderboard that was updated during the competition, 20 percent of the test data was used. The other 80 percent are used to calculate the final ranking.
Not surprisingly, all my models were convolutional networks (convnets) adapted for this task. I recommend reading the well-written blog post from the ≋ Deep Sea ≋ team that won the Kaggle National Data Science Bowl competition since there are a lot of similarities in our approaches and they provide some more/better explanation.
A small but important note: notice that 20 percent of the test images is roughly 11k images while the training set is only 35k images. Hence, the leaderboard score can actually provide more stable scoring than only using, for example, a 90%/10% split on the training set, whose validation part would be only about a third of the size of the public leaderboard dataset. Normally I don’t like to make many submissions during a competition but because training and evaluating these models is quite time-consuming and the public leaderboard could provide some interesting and somewhat reliable information, I did try to make use of this at the end.
Also note that competing in a competition such as this requires you – unless you have a huge amount of resources and time – to interpret a stream of fairly complex and noisy data as quickly and efficiently as possible and thus, the things I have “learned” are not really tested rigorously and might even be wrong. In the same mindset, a lot of the code is written very quickly and even obvious optimisations might be postponed (or never done!) because the trade-off is not worth it, there are more immediate priorities, you don’t want to risk breaking some things, I don’t know how (to do it quickly), etc. I started working on this competition somewhere in the beginning of April because before that I was in the process of moving to London.
This post is written in an approximate chronological order, which might be somewhat confusing to read through at first. Therefore, I’ll try to sketch in a very concise fashion what my eventual models looked like. (It’s not completely independent of the rest of the text, which does try to provide some more explanation.)
My best models used a convolutional network with the following relatively basic architecture (listing the output size of each layer)
Nr | Name | batch | channels | width | height | filter/pool |
---|---|---|---|---|---|---|
0 | Input | 64 | 3 | 512 | 512 | |
1 | Conv | 64 | 32 | 256 | 256 | 7//2 |
2 | Max pool | 64 | 32 | 127 | 127 | 3//2 |
3 | Conv | 64 | 32 | 127 | 127 | 3//1 |
4 | Conv | 64 | 32 | 127 | 127 | 3//1 |
5 | Max pool | 64 | 32 | 63 | 63 | 3//2 |
6 | Conv | 64 | 64 | 63 | 63 | 3//1 |
7 | Conv | 64 | 64 | 63 | 63 | 3//1 |
8 | Max pool | 64 | 64 | 31 | 31 | 3//2 |
9 | Conv | 64 | 128 | 31 | 31 | 3//1 |
10 | Conv | 64 | 128 | 31 | 31 | 3//1 |
11 | Conv | 64 | 128 | 31 | 31 | 3//1 |
12 | Conv | 64 | 128 | 31 | 31 | 3//1 |
13 | Max pool | 64 | 128 | 15 | 15 | 3//2 |
14 | Conv | 64 | 256 | 15 | 15 | 3//1 |
15 | Conv | 64 | 256 | 15 | 15 | 3//1 |
16 | Conv | 64 | 256 | 15 | 15 | 3//1 |
17 | Conv | 64 | 256 | 15 | 15 | 3//1 |
18 | Max pool | 64 | 256 | 7 | 7 | 3//2 |
19 | Dropout | 64 | 256 | 7 | 7 | |
20 | Maxout (2-pool) | 64 | 512 | |||
21 | Concat with image dim | 64 | 514 | |||
22 | Reshape (merge eyes) | 32 | 1028 | |||
23 | Dropout | 32 | 1028 | |||
24 | Maxout (2-pool) | 32 | 512 | |||
25 | Dropout | 32 | 512 | |||
26 | Dense (linear) | 32 | 10 | |||
27 | Reshape (back to one eye) | 64 | 5 | |||
28 | Apply softmax | 64 | 5 |
(Where a//b in the last column denotes pool or filter size a x a with stride b x b.)
where the reshape was done to combine the representations of the two eyes belonging to the same patient. All layers were initialised with the SVD variant of the orthogonal initialisation (based on Saxe et al.) and the non-linear layers used leaky rectify units max(alpha*x, x) with alpha=0.5.
The inputs were 512x512 images which were augmented in real-time (the augmentations are described in a later section), together with their original image dimensions. Because of the great class imbalance, some classes were oversampled to get a more uniform distribution of classes within batches. Somewhere in the middle of training this oversampling stopped and images were sampled from the true training set distribution, to let the model adapt to the real class frequencies.
Training was done with Stochastic Gradient Descent (SGD) and Nesterov momentum for almost 100k iterations on a loss which was a combination of a continuous kappa loss together with the cross-entropy (or log) loss:
kappalogclipped = cont_kappa + 0.5 * T.clip(log_loss, log_cutoff, 10**3)
An important part was converting the softmax probabilities for each label to a discrete prediction. Using the label with the highest probability (i.e., argmax) did quite well but is unstable, and a significant improvement comes from converting these probabilities to one continuous value (by weighted sum), ranking these values and then using some boundaries to assign labels (e.g., first 10% is label 0, next 20% is label 1, etc.).
Doing all this gets you to somewhere around +0.835 with a single model.
A final good improvement then came from ensembling a few models using the mean of their log probabilities for each class, converting these to normal probabilities in [0, 1] again and using
weighted_probs = probs[:, 1] + probs[:, 2] * 2 + probs[:, 3] * 3 + probs[:, 4] * 4
to get one vector of predictions on which we can apply the ranking procedure from the previous paragraph to assign labels. A few candidate boundaries were determined using scipy’s minimize function on the kappa score of some ensembles.
The patients are always split into a training set (90%) and a validation set (10%) by stratified sampling based on the maximum label of the two eyes per patient.
First of all, since the original images are fairly large (say, 3000x2000 pixels on average) and most contained a fairly significant black border, I started by downscaling all the images by a factor of five (without interpolation) and trying to remove most of these black borders.
This made it computationally much more feasible to do many augmentations: we augment the training set with seemingly similar images to increase the number of training examples.
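The downscaling and border removal can be sketched roughly like this (a minimal sketch: the brightness threshold and the exact cropping logic are my own illustrative choices, not necessarily what the actual preprocessing script did):

```python
import numpy as np

def downscale_and_crop(img, factor=5, threshold=10):
    """Subsample by `factor` without interpolation, then crop to the bounding
    box of pixels brighter than `threshold` to drop most of the black border."""
    small = img[::factor, ::factor]         # (H, W, 3) -> roughly (H/5, W/5, 3)
    mask = small.mean(axis=2) > threshold   # non-black pixels
    rows, cols = np.where(mask)
    return small[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```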
These augmentations (transformations) were
Most of these were implemented from the start. The specific ranges/parameters depend on the model (some examples can be found at the end). During training random samples are picked from the training set and transformed before queueing them for input to the network. The augmentations were done by spawning different threads on the CPU such that there was almost no delay in waiting for another batch of samples. This worked very well for most of the competition, which is why I did not spend much time trying to optimise this.
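As a rough illustration of this threaded real-time augmentation setup (not the competition code itself: the augmentation shown is just a placeholder random flip, and the worker and queue sizes are arbitrary):

```python
import queue
import threading
import numpy as np

def augment(batch):
    # Placeholder transformation: random horizontal flips per image
    # (the real pipeline applied affine and colour transformations
    # with model-specific parameter ranges).
    flips = np.random.rand(batch.shape[0]) < 0.5
    batch = batch.copy()
    batch[flips] = batch[flips, :, :, ::-1]
    return batch

def start_augmentation_workers(sample_batch_fn, num_workers=4, max_batches=8):
    """Spawn CPU threads that keep a queue of augmented batches filled,
    so the GPU rarely has to wait for the next batch."""
    q = queue.Queue(maxsize=max_batches)

    def worker():
        while True:
            q.put(augment(sample_batch_fn()))

    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()
    return q  # the training loop calls q.get() to fetch the next batch
```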
Resizing was done after the cropping and before the other transformations, because otherwise some of the other operations become computationally way too intensive. It can be done in two ways:
The method chosen also depends on the model. Method 1 was used a lot in the beginning, then 2 in the middle and in the end I revisited the choice again. Eventually I stuck with method 2 since my best performing models in the end used that and I did not want to have another hyperparameter to take into account, and I was afraid of losing too much information with the center crops from the first method.
During the training the input is normalised by subtracting the total mean and dividing by the total standard deviation estimated on a few hundred samples before training (a pixel based normalisation was also used for a lot of models, at the moment of writing I have not figured out if there is a big difference).
The original image dimensions (before they are rescaled to a square input) are also always concatenated as features with the other dense representations (I never removed this or tested if it was of any use).
The first few nets were initially trained with SGD and Nesterov momentum on the cross-entropy (or log) loss function. Thus, as a classification problem. However, the competition used the (quadratic weighted) kappa metric to score submissions and not the log loss or accuracy. The (quadratic weighted) kappa score of a class prediction y for a ground truth t can be given by the Python code (not optimised for edge cases and assuming one-hot encoded):
import numpy as np

def quadratic_kappa(y, t, eps=1e-15):
    # Assuming y and t are one-hot encoded!
    num_scored_items = y.shape[0]
    num_ratings = y.shape[1]
    ratings_mat = np.tile(np.arange(0, num_ratings)[:, None],
                          reps=(1, num_ratings))
    ratings_squared = (ratings_mat - ratings_mat.T) ** 2
    weights = ratings_squared / (float(num_ratings) - 1) ** 2
    # We norm for consistency with other variations.
    y_norm = y / (eps + y.sum(axis=1)[:, None])
    # The histograms of the raters.
    hist_rater_a = y_norm.sum(axis=0)
    hist_rater_b = t.sum(axis=0)
    # The confusion matrix.
    conf_mat = np.dot(y_norm.T, t)
    # The numerator (kept as a matrix, summed below).
    nom = weights * conf_mat
    expected_probs = np.dot(hist_rater_a[:, None],
                            hist_rater_b[None, :])
    # The denominator (kept as a matrix, summed below).
    denom = weights * expected_probs / num_scored_items
    return 1 - nom.sum() / denom.sum()
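For example, with some hypothetical integer labels converted to one-hot vectors:

```python
labels = np.array([0, 2, 2, 4, 1])   # ground truth
preds = np.array([0, 2, 1, 4, 0])    # discrete predictions
t = np.eye(5)[labels]
y = np.eye(5)[preds]
print(quadratic_kappa(y, t))         # 1.0 would be perfect agreement
```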
Alternatively, for some more information about the metric, see the evaluation page of the competition. The kappa metric tries to represent the agreement between two different raters, where the score typically varies from 0 (random agreement) to 1 (complete agreement). The downside of this metric is that it is discrete – forcing you to pick only one class of the five per sample instead of allowing you to use the probabilities for each class – which necessarily gives you some more variance when evaluating models. There already is a decent amount of noise in the dataset (very questionable ground truth classifications, noisy/broken images, etc.) so there certainly was a noticeable amount of variance throughout the competition to take into account.
I only trained one or two models on the cross-entropy loss since the competition was scored on the kappa metric and I felt it was important to quickly change to a loss function closely based on this metric. Several variations of a continuous kappa loss – you simply use your softmax prediction matrix y as input to the quadratic_kappa function above – were tried, but using only a kappa loss seemed too unstable in the beginning of training to learn well. Combining the continuous kappa with the cross-entropy loss seemed to fix this. Eventually this “kappalogclipped” loss was used
kappalogclipped = cont_kappa + 0.5 * T.clip(log_loss, log_cutoff, 10**3)
where there was an additional y_pow parameter which determined the power to which to raise the predictions before calculating the continuous kappa (squashing lower probabilities, making the predictions more discrete). This y_pow was 2 for a lot of the later experiments and has a significant influence on the model. By using y_pow=2 the continuous kappa approximates the discrete kappa very well, which will give you some more variance in the updates. The log_cutoff parameter determined from which point on the log loss was ignored (clipped); this was 0.80 for most of the experiments. The scaling by 0.5 is something left behind by many other experiments with losses.
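Putting the pieces together, here is a minimal NumPy sketch of how I read this objective, reusing quadratic_kappa from above; the real loss was implemented symbolically in Theano, and I am assuming cont_kappa means 1 minus the continuous kappa (so that lower is better), which may differ in detail from the actual code.

```python
def kappalogclipped_loss(y, t, y_pow=2, log_cutoff=0.80, eps=1e-15):
    # Squash low probabilities so the continuous kappa better approximates
    # the discrete one, then renormalise.
    y_sq = y ** y_pow
    y_sq = y_sq / (eps + y_sq.sum(axis=1, keepdims=True))

    # Continuous kappa as a loss term: minimise (1 - kappa).
    cont_kappa = 1.0 - quadratic_kappa(y_sq, t)

    # Cross-entropy, clipped from below; in the symbolic (Theano) version
    # the clipping zeroes its gradient once it drops under log_cutoff.
    log_loss = -np.mean(np.sum(t * np.log(np.clip(y, eps, 1.0)), axis=1))
    return cont_kappa + 0.5 * np.clip(log_loss, log_cutoff, 1e3)
```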
There has been a lot of discussion about optimising the kappa metric, partly because of the recent CrowdFlower Search Results Relevance competition which also used this metric to score the submissions. Most of the competitors in that competition used regression on the Mean Squared Error (MSE) objective together with some decoding strategy to convert the continuous predictions to discrete ones. The first place winner has an excellent write-up where he compared many different methods to optimise the kappa metric and concludes that MSE together with a ranking based discretisation worked best. I also considered such an approach but since training my models took quite a while and I did not want to lose too much time testing all these different methods – and it already worked quite well! – I stuck with the kappalogclipped loss. These other losses also don’t take the distribution of the predictions into account, which I thought might be important (even though you could optimise for that after training). The kappalogclipped loss had the benefit of allowing me to easily monitor a relatively reliable kappa score during training by using the label with the highest probability (the same argmax strategy was used for almost all the models but is revisited at the end). Also note that the fact that these labels are ordered is implicitly defined by the kappa loss itself.
I did test the MSE objective very briefly at the end and got somewhat similar performance.
The models were still trained with SGD and Nesterov momentum using a learning rate schedule (decreasing the learning rate 3-5 times). Most of the time some L2 regularisation on the network parameters (weight decay) was added.
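In Lasagne terms this kind of setup looks roughly like the sketch below; the learning rate, momentum and L2 values are illustrative, not the settings actually used.

```python
import theano
import lasagne

def build_updates(loss, network, learning_rate, l2_weight=5e-4, momentum=0.9):
    params = lasagne.layers.get_all_params(network, trainable=True)
    # Add L2 weight decay on all network parameters.
    loss = loss + l2_weight * lasagne.regularization.regularize_network_params(
        network, lasagne.regularization.l2)
    return lasagne.updates.nesterov_momentum(
        loss, params, learning_rate=learning_rate, momentum=momentum)

# Keep the learning rate in a shared variable so it can be decreased
# 3-5 times during training according to the schedule.
learning_rate = theano.shared(lasagne.utils.floatX(0.01))
# later, e.g.: learning_rate.set_value(lasagne.utils.floatX(0.001))
```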
My first models used 120x120 rescaled input and I stayed with that for a decent amount of time in the beginning (first 3-4 weeks). A week or so later my first real model had an architecture that looked like this (listing the output size of each layer)
Nr | Name | batch | channels | width | height | filter/pool |
---|---|---|---|---|---|---|
0 | Input | 32 | 3 | 120 | 120 | |
1 | Cyclic slice | 128 | 3 | 120 | 120 | |
2 | Conv | 128 | 32 | 120 | 120 | 3//1 |
3 | Conv | 128 | 16 | 120 | 120 | 3//1 |
4 | Max pool | 128 | 16 | 59 | 59 | 3//2 |
5 | Conv roll | 128 | 64 | 59 | 59 | |
6 | Conv | 128 | 64 | 59 | 59 | 3//1 |
7 | Conv | 128 | 32 | 59 | 59 | 3//1 |
8 | Max pool | 128 | 32 | 29 | 29 | 3//2 |
9 | Conv roll | 128 | 128 | 29 | 29 | |
10 | Conv | 128 | 128 | 29 | 29 | 3//1 |
11 | Conv | 128 | 128 | 29 | 29 | 3//1 |
12 | Conv | 128 | 128 | 29 | 29 | 3//1 |
13 | Conv | 128 | 64 | 29 | 29 | 3//1 |
14 | Max pool | 128 | 64 | 14 | 14 | 3//2 |
15 | Conv roll | 128 | 256 | 14 | 14 | |
16 | Conv | 128 | 256 | 14 | 14 | 3//1 |
17 | Conv | 128 | 256 | 14 | 14 | 3//1 |
18 | Conv | 128 | 256 | 14 | 14 | 3//1 |
19 | Conv | 128 | 128 | 14 | 14 | 3//1 |
20 | Max pool | 128 | 128 | 6 | 6 | 3//2 |
21 | Dropout | 128 | 128 | 6 | 6 | |
22 | Maxout (2-pool) | 128 | 512 | |||
23 | Cyclic pool | 32 | 512 | |||
24 | Concat with image dim | 32 | 514 | |||
25 | Dropout | 32 | 514 | |||
26 | Maxout (2-pool) | 32 | 512 | |||
27 | Dropout | 32 | 512 | |||
28 | Softmax | 32 | 5 |
(Where a//b in the last column denotes pool or filter size a x a with stride b x b.)
which used the cyclic layers from the ≋ Deep Sea ≋ team. As nonlinearity I used the leaky rectify function, max(alpha*x, x), with alpha=0.3. Layers were almost always initialised with the SVD variant of the orthogonal initialisation (based on Saxe et al.). This gave me around 0.70 kappa. However, I quickly realised that, given the grading criteria for the different classes (think of the microaneurysms which are pretty much impossible to detect on 120x120 images), I would have to use bigger input images to get anywhere near a decent model.
Something else that I had already started testing in models somewhat, which seemed to be quite critical for decent performance, was oversampling the smaller classes. I.e., you make samples of certain classes more likely than others to be picked as input to your network. This resulted in more stable updates and better, quicker training in general (especially since I was using small batch sizes of 32 or 64 samples because of GPU memory restrictions).
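A small sketch of what such oversampling can look like (a hypothetical helper, not the competition code): each image’s sampling probability is made inversely proportional to its class frequency so that batches are roughly class-balanced.

```python
import numpy as np

def balanced_batch_indices(labels, batch_size=64, rng=np.random):
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]   # rarer classes get larger weights
    probs = weights / weights.sum()
    return rng.choice(len(labels), size=batch_size, replace=True, p=probs)
```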
First I wanted to take into account the fact that for each patient we get two retina images: the left and right eye. By combining the dense representations of the two eyes before the last two dense layers (one of which being a softmax layer) I could use both images to classify each image. Intuitively you can expect some pairs of labels to be more probable than others and since you always get two images per patient, this seems like a good thing to do.
This gave me the basic architecture for 512x512 rescaled input which was used pretty much until the end (except for some experiments):
Nr | Name | batch | channels | width | height | filter/pool |
---|---|---|---|---|---|---|
0 | Input | 64 | 3 | 512 | 512 | |
1 | Conv | 64 | 32 | 256 | 256 | 7//2 |
2 | Max pool | 64 | 32 | 127 | 127 | 3//2 |
3 | Conv | 64 | 32 | 127 | 127 | 3//1 |
4 | Conv | 64 | 32 | 127 | 127 | 3//1 |
5 | Max pool | 64 | 32 | 63 | 63 | 3//2 |
6 | Conv | 64 | 64 | 63 | 63 | 3//1 |
7 | Conv | 64 | 64 | 63 | 63 | 3//1 |
8 | Max pool | 64 | 64 | 31 | 31 | 3//2 |
9 | Conv | 64 | 128 | 31 | 31 | 3//1 |
10 | Conv | 64 | 128 | 31 | 31 | 3//1 |
11 | Conv | 64 | 128 | 31 | 31 | 3//1 |
12 | Conv | 64 | 128 | 31 | 31 | 3//1 |
13 | Max pool | 64 | 128 | 15 | 15 | 3//2 |
14 | Conv | 64 | 256 | 15 | 15 | 3//1 |
15 | Conv | 64 | 256 | 15 | 15 | 3//1 |
16 | Conv | 64 | 256 | 15 | 15 | 3//1 |
17 | Conv | 64 | 256 | 15 | 15 | 3//1 |
18 | Max pool | 64 | 256 | 7 | 7 | 3//2 |
19 | Dropout | 64 | 256 | 7 | 7 | |
20 | Maxout (2-pool) | 64 | 512 | |||
21 | Concat with image dim | 64 | 514 | |||
22 | Reshape (merge eyes) | 32 | 1028 | |||
23 | Dropout | 32 | 1028 | |||
24 | Maxout (2-pool) | 32 | 512 | |||
25 | Dropout | 32 | 512 | |||
26 | Dense (linear) | 32 | 10 | |||
27 | Reshape (back to one eye) | 64 | 5 | |||
28 | Apply softmax | 64 | 5 |
(Where a//b in the last column denotes pool or filter size a x a with stride b x b.)
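In Lasagne, the eye-merging head (layers 22-28 in the table above) can be sketched roughly as follows, assuming consecutive rows of the batch are the two eyes of one patient; the layer parameters and dropout probabilities are illustrative guesses, not the exact configuration used.

```python
import lasagne.layers as nn
import lasagne.nonlinearities as nl

def merge_eyes_head(dense_per_eye, batch_size=64):
    # dense_per_eye: (64, 514) features per eye, image dims already concatenated.
    l = nn.ReshapeLayer(dense_per_eye, (batch_size // 2, -1))       # (32, 1028)
    l = nn.DropoutLayer(l, p=0.5)
    l = nn.DenseLayer(l, num_units=1024, nonlinearity=None)
    l = nn.FeaturePoolLayer(l, pool_size=2)                         # maxout -> (32, 512)
    l = nn.DropoutLayer(l, p=0.5)
    l = nn.DenseLayer(l, num_units=10, nonlinearity=None)           # (32, 10)
    l = nn.ReshapeLayer(l, (batch_size, 5))                         # back to per-eye
    return nn.NonlinearityLayer(l, nonlinearity=nl.softmax)         # (64, 5)
```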
Some things that had also been changed:

- The alpha of the leaky rectify units, max(alpha*x, x), made a big difference in performance. I started using alpha=0.5, which worked very well. In the small tests I did, using alpha=0.3 or lower gave significantly lower scores.

Throughout I also kept trying to find a way to work better with the high resolution input. I tried splitting the image into four (or sixteen) non-overlapping (or only slightly overlapping) squares, passing these through a smaller convnet in parallel and then combining these representations (by stacking them or pooling over them), but this did not seem to work. Something I was a little more hopeful about was using the spatial transformation layers from the Spatial Transformer Networks paper from DeepMind. The intention was to use some coarse input to make the ST-modules direct their attention to some parts of the image in higher resolution (for example, potential microaneurysms!) and hopefully they would be able to detect those finer details.
However, training this total architecture end-to-end, without initialising with a pre-trained net, was incredibly difficult at first. Even with norm constraints, lowered learning rates for certain components, smaller initialisation, etc., it was quite difficult, for one, to not have it diverge to some limiting values for the transformation parameters (which sort of makes sense). However, this might be partly because of the specific problem and/or my implementation. In the paper they also seem to work with pre-trained nets and this does seem like the way to go. Unfortunately, when I first tried this, it started overfitting quite significantly on the training set. I wish I had more time to explore this but since my more basic networks already worked so well and seemed to still allow for a decent amount of improvement, I had to prioritise those.
Some small things I have learned:
In general, I think these layers are very interesting but they certainly didn’t seem that straightforward to me. However, this might be partially because of this specific problem.
In the last two to three weeks I was trying to wrap up any (very) experimental approaches and started focusing on looking more closely at the models (and their predictions) to see where I could maybe still improve. The “basic architecture” from the previous section barely changed for more than a month and most of my improvements came from optimising the learning process. Three weeks before the end of the competition, my best (leaderboard) score of about 0.828 – which was top 3 for a few weeks – came from a simple log mean ensemble of 2-3 models that individually scored around 0.82.
When looking at the two images of certain patients next to each other I noticed something which is harder to see when looking at single images: camera artifacts.
Sometimes they can even resemble microaneurysms (well, they can be tiny black dots, the size of a microaneurysm) and it can be very hard to distinguish between the two unless you have the other image. What’s more, these artifacts seemed to be fairly common! The problem is: it is very unlikely my models at the time would be able to figure this out, because the two eyes were only combined at the dense layers, long after such fine, low-level details have been merged away.
One thing I tried to take this into account was to merge the outputs of the first convolutional or pooling layer for each of the two eyes (or maybe even the input layer). Then, theoretically, the convolutional layers could detect similar patterns in the left and right eye. However, I felt I was reducing the input space way too much after the merging (which was – and had to be – done in the first few layers) and thus, I instead took the outputs a and b of some layer (for the two eyes of a patient) and replaced them with ab and ba by stacking them on the channel dimension (instead of simply replacing them both by ab). This way the net still had access to the other low-level representations but I was not losing half the input space.
Unfortunately this did not seem to help that much and I was running out of time such that I put this aside.
Nearer to the end of the competition I also started testing the clever pseudo-labeling idea from the ≋ Deep Sea ≋ team, which uses the predictions from other (ensembles of) models on the test set to help guide or regularise new models. Essentially, during training I added some test images to the batches, such that on average roughly 25% of the batch was comprised of images from the test set, together with the softmax predictions for those images from my best ensemble. This probably helped to push me to about 0.83 for a single model.
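A rough sketch of that batch construction (all names are illustrative, and in practice the images were of course loaded and augmented on the fly rather than held in memory):

```python
import numpy as np

def mixed_batch(train_x, train_y, test_x, test_soft_y,
                batch_size=64, test_fraction=0.25, rng=np.random):
    n_test = int(batch_size * test_fraction)
    n_train = batch_size - n_test
    tr = rng.choice(len(train_x), n_train, replace=False)
    te = rng.choice(len(test_x), n_test, replace=False)
    x = np.concatenate([train_x[tr], test_x[te]])
    # One-hot training labels mixed with soft ensemble predictions as targets.
    y = np.concatenate([train_y[tr], test_soft_y[te]])
    return x, y
```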
For a long time I simply used the class with the highest probability from my softmax output as my prediction (i.e., argmax). Even though this allowed me to get this far, it was obvious that this method is quite unstable since it doesn’t take the magnitude of the (other) probabilities into account, only their size relative to each other. To get around that, I used a similar ranking decoding strategy as was used by some people in the CrowdFlower Search Results Relevance competition: first we convert the probabilities from the softmax output to one value by weighing each probability by the class label {0, 1, 2, 3, 4}
weighted_probs = probs[:, 1] + probs[:, 2] * 2 + probs[:, 3] * 3 + probs[:, 4] * 4
Next we rank the weighted probabilities from low to high and try to find the optimal boundaries [x_0, x_1, x_2, x_3] to assign labels. I.e., the images with a weighted probability in [0, x_0[ are assigned the label 0, the images with a weighted probability in [x_0, x_1[ are assigned the label 1, etc. I used scipy’s minimize function to quickly find some boundaries that optimise the kappa score of some ensemble on the train set. This probably could have been optimised better but I did not want to overtune and risk badly overfitting.
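A sketch of this decoding and boundary search, reusing quadratic_kappa from earlier; the starting boundaries roughly follow the class distribution and the Powell method is my own illustrative choice for the derivative-free search, not necessarily what was used.

```python
import numpy as np
from scipy.optimize import minimize

def decode_by_ranking(weighted_probs, boundaries):
    # boundaries: four increasing fractions in (0, 1), e.g. [0.73, 0.80, 0.95, 0.98]
    order = np.argsort(weighted_probs)
    labels = np.zeros(len(weighted_probs), dtype=int)
    cuts = (np.asarray(boundaries) * len(weighted_probs)).astype(int)
    bounds = np.concatenate([[0], cuts, [len(weighted_probs)]])
    for label in range(5):
        labels[order[bounds[label]:bounds[label + 1]]] = label
    return labels

def fit_boundaries(weighted_probs, true_labels, x0=(0.73, 0.80, 0.95, 0.98)):
    def neg_kappa(b):
        y = np.eye(5)[decode_by_ranking(weighted_probs, b)]
        t = np.eye(5)[true_labels]
        return -quadratic_kappa(y, t)
    return minimize(neg_kappa, x0, method='Powell').x
```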
This helped quite a bit, pushing some of my single models from 0.83 to 0.835. However, I was kind of surprised that changing the prediction distribution quite significantly with different boundaries did not result in bigger differences in scores.
Something that stood out was the fact that the models were highly variable on the validation sets. Although this was mostly because of the discrete metric, the noise in the dataset itself and my small validation set, I still wanted to find out more. If you look at the quadratic_kappa metric/loss above you can see that it is determined by two matrices: nom and denom. The kappa score is then given by 1 - sum(nom) / sum(denom). Hence, you want to minimise the nom sum and maximise the denom sum. The denominator denom is fairly stable and captures the distributions of the predictions and the targets. The numerator nom is the one which will give us the most insight into our model’s predictions. Let’s look at the nom and denom for some random well-performing model (+0.82 kappa) from when I was using the simple argmax decoding:
The matrices are normalised and show the percentage of the total sum located at each position. You immediately notice something: the error is largely dominated by predicting class 0 when it was really class 2 (i.e., you predict a normal eye but it is really a moderate NPDR eye). Initially this gives you some hope, since you think you should be able to bring down some of these misclassifications which are 2 classes apart, and they would have an important impact on your score. However, the problem with this, and in general, was that the ground truth was not at all certain (since I’m not an ophthalmologist, not even a doctor!) and some really questionable classifications had already come to light.
Because the other ranking decoding strategy is only applied after training and was done quite late in the competition, I don’t have any error distribution pictures for those yet. But I do remember the nom behaving quite similarly and most of the error (30-40%) coming from predicting class 0 when it was really class 2.
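For reference, the matrix being inspected here is just the element-wise numerator term of the kappa above, normalised so each cell shows its share of the total error; it can be reproduced with something like:

```python
import numpy as np

def normalised_error_matrix(y, t, num_ratings=5):
    """Weighted confusion matrix (the `nom` term), with each cell expressed
    as its fraction of the total weighted error."""
    ratings = np.arange(num_ratings)
    weights = (ratings[:, None] - ratings[None, :]) ** 2 / float(num_ratings - 1) ** 2
    conf_mat = np.dot(y.T, t)
    nom = weights * conf_mat
    return nom / nom.sum()
```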
A good improvement then came from ensembling a few models using the mean of their log probabilities for each class, converting these to normal probabilities in [0, 1] again and using the ranking decoding strategy from one of the previous paragraphs to assign labels to the images. A few candidate boundaries were determined using scipy’s minimize function on the kappa score of some ensembles. Doing this on the best models at the time pushed my models to +0.84.
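A small sketch of that ensembling step (model_probs being a list of (N, 5) softmax outputs, one per model):

```python
import numpy as np

def log_mean_ensemble(model_probs, eps=1e-15):
    log_mean = np.mean([np.log(np.clip(p, eps, 1.0)) for p in model_probs], axis=0)
    probs = np.exp(log_mean)
    probs /= probs.sum(axis=1, keepdims=True)
    # Weighted value fed into the ranking decoder above.
    weighted_probs = probs[:, 1] + probs[:, 2] * 2 + probs[:, 3] * 3 + probs[:, 4] * 4
    return probs, weighted_probs
```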
Some other things that did not seem to work (well enough):
Some things I haven’t tried:
I have read quite a few papers (and skimmed a lot more) and even though some of them were very interesting, I was quite limited in time and resources to try them all out. Some I have read and found interesting (not necessarily directly related to this competition):
The actual process was quite a bit more lengthy and chaotic (especially at the end) but hopefully this captures the most important elements of my competition experience. All in all, a relatively basic architecture was able to achieve top scores in this competition. Nevertheless, the predictions always felt quite “weak” and my feeling is that there is still quite a bit of room for improvement. Without a doubt the biggest difficulty for me was dealing with the large amount of variance resulting from
Maybe I should have worked harder on taking 2. into account, although, when checking the misclassifications of my models, I still felt that 1. was a very significant problem as well. It would be very interesting (and important!) to get some scores of other (human) expert raters.
In hindsight, optimising learning on the special kappa metric seemed to be much more important than optimising the architecture and I did lose a lot of time trying to work with some more special architectures because I thought they might be needed to be able to finish high enough. It is also possible that the MSE objective was the better choice for this competition. I tested it briefly at the end and the performance seemed somewhat similar but I would expect it to be more stable than using my kappalogclipped loss. I should also have explored different train/validation splits instead of always using 90%/10% splits. This possibly could have made evaluating the models more stable.
Personally, I very much enjoyed the competition. I learned a lot, am left with a whole bunch of ideas to work on and many pages of ugly code to rework. Hopefully next time I can compete together with other smart, motivated people since I miss having those interactions and doing everything on my own while also working a full time job was quite challenging and exhausting! Congratulations to the winners and all the people who contributed!
Also, when comparing results from this competition to other approaches or software, take into account that it does not necessarily make any sense because they may be training and/or evaluating on different datasets!
Everything was trained on an NVIDIA GTX 980 in the beginning; this was the GPU of the desktop I was also working on, which wasn’t ideal. Therefore, later I also tried using the GRID K520 on AWS (even though it was at least two times slower). The code is based on the code from the ≋ Deep Sea ≋ team that won the Kaggle National Data Science Bowl competition and mostly uses Lasagne (which uses Theano). Models (including static and learned parameters and data from the training) were dumped via cPickle and a quickly written script was used to produce images for each of these dumps summarising the model and its performance. For example:
This way it was much easier to compare models and keep track of them. (This is just an example to give you an idea of one method I used to compare models. If you are curious about the specifics, you can find all the information in the code itself.)
Example activations for one of the better models with the basic 512x512 architecture (each vertical block represents the output of one channel of the layer):
I also want to publish some models with all the learned parameters. To be continued.
I would like to thank Kaggle, the California Healthcare Foundation and EyePACS for organising and/or sponsoring this challenging competition. Also many thanks to the wonderful developers and contributors of Theano and Lasagne for their continuous effort on these libraries.