Understanding Machine Learning Algorithms by Implementing Them From Scratch: Book Recommendations and Overcoming Obstacles

Posted: 2015-09-18 11:39:42

The first part is the original English text. Original link: http://machinelearningmastery.com/understand-machine-learning-algorithms-by-implementing-them-from-scratch/

The second part is the Chinese translation, reposted from: http://www.csdn.net/article/2015-09-08/2825646

Understand Machine Learning Algorithms By Implementing Them From Scratch (and tactics to get around

Implementing machine learning algorithms from scratch seems like a great way for a programmer to understand machine learning.

And maybe it is.

But there are some downsides to this approach too.

In this post you will discover some great resources that you can use to implement machine learning algorithms from scratch.

You will also discover some of the limitations of this seemingly perfect approach.

Have you implemented a machine learning algorithm from scratch in an effort to learn about it? Leave a comment, I’d love to hear about your experience.

Implement machine learning algorithms from scratch!
Photo by Tambako The Jaguar, some rights reserved.

Benefits of Implementing Machine Learning Algorithms From Scratch

I promote the idea of implementing machine learning algorithms from scratch.

I think you can learn a lot about how algorithms work. I also think that as a developer, it provides a bridge into learning the mathematical notations, descriptions and intuitions used in machine learning.

I’ve discussed the benefits of implementing algorithms from scratch before in the post “Benefits of Implementing Machine Learning Algorithms From Scratch”.

In the post I listed the benefits as:

the understanding you gain
the starting point it provides
the ownership of the algorithm and code it forces

Also in that post I comment on how you can shortcut the process by leveraging existing tutorials and books. There is a wealth of good resources for getting started, but there are also stumbling blocks to watch out for.

In the next section I point out three books that you can follow to implement machine learning algorithms from scratch.

I’ve helped a lot of programmers get started in machine learning over the last few years. From my experience, I list 5 of the most common stumbling blocks that I see tripping up programmers and the tactics that you can use to overcome them.

Finally, you will discover 3 quick tips for getting the most from code tutorials and going from a copy-paste programmer (if you happen to be one) to truly diving down the rabbit hole of machine learning algorithms.

Great Books You Can Use To Implement Algorithms

I have implemented a lot of algorithms from scratch, directly from research papers. It can be very difficult.

It is a much gentler start to follow someone else’s tutorial.

There are many excellent resources that you can use to get started implementing machine learning algorithms from scratch.

Perhaps the most authoritative are books that guide you through tutorials.

There are many benefits to starting with a book. For example:

Someone else has figured out the algorithm and how to turn it into code.
You can use it as a known working starting point for tinkering and experimentation.

Some great books that guide you through implementing machine learning algorithms step-by-step are:

Data Science from Scratch: First Principles with Python by Joel Grus

This truly is from scratch, working through visualization, stats, probability, working with data and then 12 or so different machine learning algorithms.

This is one of my favorite beginner machine learning books from this year.

Machine Learning: An Algorithmic Perspective by Stephen Marsland

This is the long-awaited second edition of this popular book. It covers a large number of diverse machine learning algorithms with implementations.

I like that it gives a mix of mathematical description, pseudo code as well as working source code.

Machine Learning in Action by Peter Harrington

This book works through the 10 most popular machine learning algorithms providing case study problems and worked code examples in Python.

I like that there is a good effort to tie the code to the descriptions using numbering and arrows.

Did I miss a good book that provides programming tutorials for implementing machine learning algorithms from scratch?

Let me know in the comments.

5 Stumbling Blocks When Implementing Algorithms From Scratch (and how to overcome them)

Implementing machine learning algorithms from scratch using tutorials is a lot of fun.

But there can be stumbling blocks, and if you’re not careful, they may trip you up and kill your motivation.

In this section I want to point out the 5 most common stumbling blocks that I see and how to roll with them and not let them hold you up. I want you to get unstuck and plow on (or move on to another tutorial).

Some good general advice for avoiding the stumbling blocks below is to carefully check the reviews of books (or the comments on blog posts) before diving into a tutorial. You want to be sure that the code works and that you’re not wasting your time.

Another general tactic is to dive in no matter what, figure out the parts that are not working and re-implement them yourself. This is a great hack to force understanding, but it’s probably not for the beginner and you may require a good technical reference close at hand.

Anyway, let’s dive into the 5 common stumbling blocks with machine learning from scratch tutorials:

1) The Code Does Not Work

The worst and perhaps most common stumbling block is that the code in the example does not work.

In fact, if you spend some time in the book reviews on Amazon for some texts or in the comments of big blog posts, it’s clear that this problem is more prevalent than you think.

How does this happen? A few reasons come to mind that might give you clues to applying your own fixes and carrying on:

The code never worked. This means that the book was published without being carefully edited. Not much you can do here other than perhaps getting into the mind of the author and trying to figure out what they meant. Maybe even try contacting the author or the publisher.
The language has moved on. This can happen, especially if the post is old or the book has been in print for a long time. Two good examples are the version of Ruby moving from 1.x to 2.x and Python moving from 2.x to 3.x.
The third-party libraries have moved on. This is for those cases where the implementations were not totally from scratch and some utility libraries were used, such as for plotting. This is often not that bad. You can often just update the code to use the latest version of the library and modify the arguments to meet the API changes. It may even be possible to install an older version of the library (if there are few or no dependencies that you might break in your development environment).
The dataset has moved on. This can happen if the data file is a URL and is no longer available (perhaps you can find the file elsewhere). It is much worse if the example is coded against a third-party API data source like Facebook or Twitter. These APIs can change a lot and quickly. Your best bet is to understand the most recent version of the API and adapt the code example, if possible.

A good general tactic if the code does not work is to look for the associated errata, whether for a book, GitHub repository, code download or similar. Sometimes the problems have been fixed and the fixes are available on the book’s or author’s website. Some simple Googling should turn them up.
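As an illustration of the "language has moved on" case, here is a minimal sketch of two Python 2 idioms that often appear in older tutorials, next to a Python 3 version of the same logic. The helper names `mean` and `label_counts` and the sample data are made up for this example; they do not come from any of the books mentioned.

```python
# Python 2 tutorial code often looks like this (kept as comments because
# it does not run on Python 3):
#
#   print "mean:", sum(values) / len(values)   # print statement; int / int truncates
#   for key, value in counts.iteritems(): ...  # dict.iteritems() was removed
#
# A Python 3 version of the same logic:

def mean(values):
    """Return the arithmetic mean; `/` is true division in Python 3."""
    return sum(values) / len(values)

def label_counts(labels):
    """Count how often each label occurs (hypothetical helper)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts

values = [1, 2, 4]
counts = label_counts(["a", "b", "a"])

print("mean:", mean(values))        # print is a function in Python 3
for key, value in counts.items():   # items() replaces iteritems()
    print(key, value)
```

Note the silent behavior change: with integer inputs, the Python 2 expression truncates the mean to 2, while Python 3 returns roughly 2.33. Errors like this do not raise an exception, so they are worth checking for when results differ from the book.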

Code machine learning algorithms completely from scratch. Photo by Tambako The Jaguar, some rights reserved

2) Poor Descriptions Of Code

I think the second worst stumbling block when implementing algorithms from scratch is when the descriptions provided with the code are bad.

These types of problems are particularly bad for a beginner, because you are trying your best to stay motivated and actually learn something from the exercise. All of that goes up in smoke if the code and text do not align.

I (perhaps kindly) call them “bad descriptions” because there may be many symptoms and causes. For example:

A mismatch between code and description. This may have been caused by the code and text being prepared at different times and not being correctly edited together. It may be something small like a variable name change or it may be whole function names or functions themselves.
Missing explanations. Sometimes you are given large slabs of code that you are expected to figure out. This is frustrating, especially in a book where it’s page after page of code that would be easier to understand on the screen. If this is the case, you might be better off finding the online download for the code and working with it directly.
Terse explanations. Sometimes you get explanations of the code, but they are too brief, like “uses information gain” or whatever. Frustrating! You still may have enough to research the term, but it would be much easier if the author had included an explanation in the context and relevant to the example.

A good general tactic is to look up descriptions of the algorithm in other resources and try to map them onto the code you are working with. Essentially, try to build your own descriptions for the code.

This just might not be an option for a beginner and you may need to move on to another resource.
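To make the "uses information gain" example concrete, here is a small sketch of what building your own description might look like for that term. It assumes the common decision-tree definition (entropy of the parent group minus the size-weighted entropy of the child groups); the function names and toy labels are illustrative and not taken from any particular book.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children look like the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

Writing a few lines like this alongside a terse tutorial gives you something to compare against the author's code line by line.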

3) Code is not Idiomatic

We programmers can be pedantic about the “correct” use of our languages (e.g. Python code is not Pythonic). This is a good thing; it shows good attention to detail and best practices.

When sample code is not idiomatic to the language in which it is written it can be off-putting. Sometimes it can be so distracting that the code becomes unreadable.

There are many reasons that this may be the case, for example:

Port from another language. The sample code may be a port from another programming language, such as FORTRAN in Java or C in Python. To a trained eye, this can be obvious.
Author is learning the language. Sometimes the author may use a book or tutorial project to learn a language. This can show up as inconsistency throughout the code examples. It can be frustrating and even distracting when examples are verbose, making poor use of language features and APIs.
Author has not used the language professionally. This can be more subtle to spot and can show up as the use of esoteric language features and APIs. It can be confusing when you have to research or decode the strange code.

If idiomatic code is deeply important to you, these stumbling blocks could be an opportunity. You could port the code from the “Java-Python” hybrid (or whatever) to a pure Pythonic implementation.

In so doing, you would gain a deeper understanding of the algorithm and more ownership over the code.
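As a rough sketch of such a port, the two functions below compute the same Euclidean distance. The first is written the way a line-by-line port from C might look; the second uses ordinary Python idioms. Both function names are invented for this illustration.

```python
from math import sqrt

def euclidean_distance_ported(a, b):
    """C-style port: manual index counter and manual accumulation."""
    total = 0.0
    i = 0
    while i < len(a):
        diff = a[i] - b[i]
        total = total + diff * diff
        i = i + 1
    return sqrt(total)

def euclidean_distance(a, b):
    """Same computation using zip and a generator expression."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance_ported(p, q))  # 5.0
print(euclidean_distance(p, q))         # 5.0
```

Keeping the original version around and checking that both return the same values is a cheap way to confirm the rewrite did not change the algorithm.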

4) Code is not Connected to the Math

A good code example or tutorial will provide a bridge from the mathematical description to the code.

This is important because it allows you to travel across and start to build an intuition for the notation and the concise mathematical descriptions.

The problem is, sometimes this bridge may be broken or missing completely.

Errors in the math. This is insidious for the beginner who is already straining to build connections from the math to the code. Incorrect math can mislead or, worse, consume vast amounts of time with no payoff. Knowing that it is possible is a good start.
Terse mathematical description. Equations may be littered around the sample code, leaving it to you to figure out what they are and how they relate to the code. You have few options: you could treat it as a math-free example and refer to a different, more complete reference text, or you could put in the effort to relate the math to the code yourself. This is more likely with authors who are not familiar with the mathematical description of the algorithm and seemingly drop it in as an afterthought.
Missing mathematics. Some references are math-free, by design. In this case you may need to find your own reference text and build the bridge yourself. This is probably not for beginners, but it is a skill well worth investing the time into.

A beginner might want to stick with code and ignore the math, to build confidence and momentum. Later, it will pay to invest in a high-quality reference text and start relating the code to the math.

You want to get good at relating the algebra to standard code constructs and build an intuition for the process involved. It’s an applied skill. You need to put in the work and practice.
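As one small example of relating algebra to code constructs, take the mean squared error of a straight-line model: MSE = (1/n) * sum over i of (y_i - yhat_i)^2, with yhat_i = b0 + b1 * x_i. The sketch below is an assumption-laden illustration (the model, function names and data are chosen here for clarity) showing how each piece of the formula maps onto a line of code.

```python
def predict(x, b0, b1):
    """yhat = b0 + b1 * x"""
    return b0 + b1 * x

def mean_squared_error(xs, ys, b0, b1):
    """MSE = (1/n) * sum_i (y_i - yhat_i)^2"""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):            # the summation over i becomes a loop
        error = y - predict(x, b0, b1)  # the term inside the parentheses
        total += error ** 2             # the square, accumulated into the sum
    return total / n                    # the 1/n factor

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(mean_squared_error(xs, ys, b0=0.0, b1=2.0))  # 0.0, the line fits exactly
print(mean_squared_error(xs, ys, b0=0.0, b1=1.0))  # (1 + 4 + 9) / 3
```

The same pattern (a sigma becomes a loop or `sum`, a subscript becomes an index or `zip`) covers a large share of the equations in introductory texts.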

5) Incomplete Code Listing

We saw in 2) that you can have no descriptions and long listings of code. This problem can be inverted where you don’t have enough code. This is the case when the code listing is incomplete.

I am a big believer in complete code listings. I think the code listing should give you everything you need to give a “complete” and working implementation, even if it is the simplest possible case.

You can build on a simple case, but you can’t run an incomplete example. You have to put in work and tie it all together.

Some reasons for this stumbling block are:

Elaborate descriptions. Verbose writing can be a sign of incomplete thinking. Not always, but sometimes. If something is not well understood there may be an implicit attempt to cover it up with a wash of words. If there is no code at all, you could take it as a challenge to design the algorithm from the description and corroborate it from other descriptions and resources.
Code snipp

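Abstract: Some developers currently lack a foundation in machine learning algorithms. This article summarizes and recommends some ways for such developers to learn machine learning algorithms from scratch.

[Editor's note] Not every developer has a foundation in machine learning algorithms, so how can a developer start from zero and learn them well? This article summarizes and recommends some approaches to learning machine learning algorithms from scratch, including suitable books, how to overcome the obstacles you will face, and tips for quickly gaining more knowledge.

Implementing machine learning algorithms from scratch seems like an excellent way for a developer to understand machine learning. Perhaps it really is, but the approach also has some drawbacks.

In this article you will find some good resources that you can use to implement machine learning algorithms from scratch. You will also discover some limitations of this seemingly perfect approach. Have you implemented a machine learning algorithm from scratch in an effort to learn about it? Leave a comment; I would love to hear about your experience.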

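Implement machine learning algorithms from scratch! Photo by Tambako The Jaguar

Benefits of Implementing Machine Learning Algorithms From Scratch

I promote the idea of implementing machine learning algorithms from scratch.

I think you can learn a lot about how algorithms work. I also think that, for a developer, it provides a bridge into the mathematical notation, descriptions and intuitions used in machine learning.

I have already discussed the benefits of implementing machine learning algorithms from scratch in the post “Benefits of Implementing Machine Learning Algorithms From Scratch”.

In that post I listed the benefits as:

the understanding you gain;
the starting point it provides;
the ownership of the algorithm and code it gives you.

In that post I also commented on how you can shorten the learning process by using existing tutorials and books. There are plenty of good resources for beginners, but there are also stumbling blocks to watch out for.

In the next section I point out three books that you can follow to implement machine learning algorithms from scratch.

Over the past few years I have helped many programmers get started in machine learning. Based on my experience, I list the five most common obstacles that trip programmers up, along with tactics you can use to overcome them.

Finally, you will find 3 quick tips for getting more out of code tutorials and for going from a copy-paste programmer (if you happen to be one) to someone who truly dives deep into machine learning algorithms.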

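Great Books You Can Use To Implement Algorithms

I have implemented many algorithms from scratch, directly from research papers. This can be very difficult.

Following someone else's tutorial is a much gentler start. There are many excellent resources that you can use to begin implementing machine learning algorithms from scratch. Perhaps the most authoritative are books that guide you through complete tutorials.

There are many benefits to starting with a book. For example:

Someone else has already worked out the algorithm and turned it into code;
You can use it as a known working starting point for modification and experimentation.

Some excellent books that guide you step by step through implementing machine learning algorithms are:

Data Science from Scratch: First Principles with Python by Joel Grus

This book truly starts from scratch, working through visualization, statistics, probability and working with data, and then about 12 different machine learning algorithms.

This is one of my favorite machine learning books for beginners this year.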

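Machine Learning: An Algorithmic Perspective by Stephen Marsland

This is the long-awaited second edition of this popular book. It covers a large number of diverse machine learning algorithms with implementations.

I like that it gives a mathematical description and pseudocode as well as working source code.

Machine Learning in Action by Peter Harrington

This book works through the 10 most popular machine learning algorithms, providing case study problems and worked code examples in Python.

I like the way it ties the code closely to the descriptions using numbering and arrows.

Did I miss a good book of programming tutorials for implementing machine learning algorithms from scratch?

If so, let me know in the comments!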

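5 Obstacles When Implementing Machine Learning Algorithms From Scratch (and how to overcome them)

Implementing machine learning algorithms from scratch by following tutorials is a lot of fun. But there can be stumbling blocks, and if you are not careful they may trip you up and kill your motivation.

In this section I want to point out the five common stumbling blocks that I see and how to roll with them rather than let them hold you up. My aim is for you to get unstuck and push on (or move on to another tutorial).

Some good general advice for avoiding the obstacles below is to carefully check the reviews of a book (or the comments on a blog post) before you dive into a tutorial. You want to be sure that the code works and that you are not wasting your time.

Another general tactic is to dive in regardless, find the parts that do not work, and re-implement them yourself. This is a great way to force understanding, but it is probably not for beginners, and you may need a good technical reference close at hand.

In any case, let's dive into the 5 common obstacles in machine learning from scratch tutorials: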

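1) The code does not work

The worst and most common obstacle is that the code in the example does not work.

In fact, if you spend some time browsing book reviews on Amazon or the comments on blog posts, it is clear that this problem is more common than you might think.

How does this happen? A few reasons come to mind that may give you clues for applying your own fixes and carrying on:

The code never worked. This means that the book was published without careful editing. There is not much you can do here other than try to get into the author's mind and work out what they meant. You could also try contacting the author or the publisher.
The language has changed. This can happen, especially if the post is old or the book has been in print for a long time. Two good examples are Ruby moving from 1.x to 2.x and Python moving from 2.x to 3.x.
The third-party libraries have changed. This applies to cases where the implementation was not entirely from scratch and some utility libraries were used, such as for plotting. This is often not that bad. You can usually update the code to use the latest version of the library and modify the arguments to match the API changes. It may even be possible to install an older version of the library (if there are few or no dependencies that you might break in your development environment).
The dataset has changed. This can happen if the data file is a download link that is no longer available (perhaps you can find the file elsewhere). It is much worse if the example is coded against a third-party API data source such as Facebook or Twitter. These APIs can change a lot, and quickly. Your best bet is to understand the latest version of the API and adapt the code example, if possible.

A good general tactic if the code does not work is to look for the associated errata, whether for a book, a GitHub repository, a code download or similar. Sometimes the problems have already been fixed on the book's or author's website. Some simple Googling should turn them up.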

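2) Poor descriptions of code

I think the second worst stumbling block when implementing algorithms from scratch is that the descriptions provided with the code are bad.

These problems are particularly bad for beginners, because you are trying hard to stay motivated and actually learn something from the exercise. All of that goes up in smoke if the code and the text do not match.

I (perhaps kindly) call them “bad descriptions” because there can be many symptoms and causes. For example:

A mismatch between code and description. This may be caused by the code and text being prepared at different times and not being correctly edited together. It may be something small, such as a changed variable name, or it may be whole function names or the functions themselves.
Missing explanations. Sometimes you are given large slabs of code that you are expected to figure out for yourself. This is frustrating, especially in a book with page after page of code that would be easier to understand on a screen. If this is the case, you may be better off finding the online download of the code and working with it directly.
Terse explanations. Sometimes you do get explanations of the code, but they are too brief, such as “uses information gain” or the like. Frustrating! You may still have enough to research the term, but it would be much easier if the author had included an explanation in context and relevant to the example.

A good general tactic is to look up descriptions of the algorithm in other resources and try to map them onto the code you are working with. Essentially, try to build your own descriptions of the code.

This may not be an option for a beginner, and you may need to move on to another resource.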

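3) The code is not idiomatic

We programmers can be pedantic about the “correct” use of our languages (e.g. Python code that is not Pythonic). This is a good thing; it shows attention to detail and best practices.

When sample code is not idiomatic for the language it is written in, it can be off-putting. Sometimes it is so distracting that the code becomes hard to read.

There are many possible reasons for this, for example:

A port from another language. The sample code may be a port from another programming language, such as FORTRAN in Java or C in Python. To a trained eye, this can be obvious.
The author is learning the language. Sometimes an author uses a book or tutorial project to learn a language. This can show up as inconsistency across the code examples. It can be frustrating and even distracting when the examples are verbose and make poor use of language features and APIs.
The author has not used the language professionally. This can be more subtle to spot and can show up as the use of esoteric language features and APIs. It can be confusing when you have to research or decode the strange code.

If idiomatic code is deeply important to you, these obstacles could be an opportunity. You could port the code from the “Java-Python” hybrid (or whatever it is) to a purely Pythonic implementation.

In doing so, you would gain a deeper understanding of the algorithm and more ownership of the code.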

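4) The code is not connected to the math

A good code example or tutorial provides a bridge from the mathematical description to the code.

This is important because it lets you cross between the two and start to build an intuition for the notation and the concise mathematical descriptions.

The problem is that sometimes this bridge is broken or missing entirely.

Errors in the math. This is insidious for beginners, who are already straining to build connections from the math to the code. Incorrect math can mislead or, worse, consume vast amounts of time with no payoff. Knowing that this can happen is a good start.
Terse mathematical descriptions. Equations may be scattered around the sample code, leaving you to work out what they are and how they relate to the code. You have few options: you can treat it as a math-free example and refer to a different, more complete reference text, or you can put in the effort to relate the math to the code yourself. This is more likely with authors who are not familiar with the mathematical description of the algorithm and seem to have dropped it in as an afterthought.
Missing math. Some references are math-free by design. In this case you may need to find your own reference text and build the bridge yourself. This is probably not for beginners, but it is a skill well worth investing time in.

A beginner might want to stick with the code and ignore the math in order to build confidence and momentum. Later, it will pay to invest in a high-quality reference text and start relating the code to the math.

You want to get good at relating the algebra to standard code constructs and at building an intuition for the process involved. It is an applied skill that takes work and practice.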

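5) Incomplete code listings

We saw in 2) that you can have long code listings with no descriptions. The problem can also be inverted, where you do not have enough code. This is the case when the code listing is incomplete.

I am a big believer in complete code listings. I think a code listing should give you everything you need for a “complete” and working implementation, even if it is the simplest possible case.

You can build on a simple case, but you cannot run an incomplete example. You have to put in the work to tie it all together.

Some reasons this obstacle may arise are:

Elaborate descriptions. Verbose writing can be a sign of incomplete thinking. Not always, but sometimes. If something is not well understood, there may be an implicit attempt to cover it up with a wash of words. If there is no code at all, you could take it as a challenge to design the algorithm from the description and corroborate it against other descriptions and resources.
Code snippets. A concept may be elaborately described and then demonstrated with a small code snippet. This helps tie the concept closely to the snippet, but it takes a lot of work on your part to combine everything into a working system.
No sample output. A key aspect that code examples often miss is sample output. If it is present, it gives you an unambiguous idea of what to expect when you run the code. Without sample output, it is pure guesswork.

In some cases, tying the code together yourself can be a fun challenge. Again, this is not for beginners, but it may be a fun exercise once you have a few algorithms under your belt.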
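For contrast, here is a sketch of what a complete listing for "the simplest possible case" might look like: a 1-nearest-neighbor classifier with its data, its entry point and its sample output all in one place. The toy data and the function name `nearest_neighbor` are made up for this illustration, and it assumes Python 3.8 or later for `math.dist`.

```python
from math import dist

def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    closest_point, closest_label = min(train, key=lambda row: dist(row[0], query))
    return closest_label

# Made-up toy data: (features, label)
train = [
    ((1.0, 1.0), "a"),
    ((1.5, 2.0), "a"),
    ((8.0, 8.0), "b"),
    ((9.0, 7.5), "b"),
]

for query in [(2.0, 1.0), (7.0, 9.0)]:
    print(query, "->", nearest_neighbor(train, query))

# Sample output:
# (2.0, 1.0) -> a
# (7.0, 9.0) -> b
```

Nothing here needs to be assembled from scattered snippets, and the sample output tells you immediately whether your copy behaves the same way.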

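3 Tips To Get More From Implementing Algorithms

You may implement a fair number of algorithms. Once you have done a few, you can do a few more, and before you know it you have built your own little library of algorithms that you understand intimately.

In this section I want to give you 3 quick tips that you can use to get the most out of your experience implementing machine learning algorithms.

Add advanced features. Take your working code example and build on it. If the tutorial is any good, it will list ideas for extensions. If not, you can research some yourself. List a number of candidate extensions to the algorithm and implement them one by one. This will force you to understand the code at least well enough to make the modifications.
Adapt to another problem. Run the algorithm on a different dataset. If anything breaks, fix it. Go further and adapt the implementation to a different problem. If the code example is for two-class classification, update it for multi-class classification or regression.
Visualize algorithm behavior. I find plotting algorithm performance and behavior in real time an invaluable learning tool, even today. You can start by plotting, at the epoch level (all algorithms are iterative at some level), things like accuracy on the test and training sets. From there, you can pick out algorithm-specific visualizations, such as the 2D grid of a self-organizing map, the coefficients of a regression over a time series, and the Voronoi tessellation for a k-nearest neighbors algorithm.

I think these tips will take you much further than the tutorials and code examples.

The last one in particular will give you deeper insight into algorithm behavior that few practitioners take the time to learn.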
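As a starting point for the visualization tip, the sketch below records training and test accuracy after every epoch of a simple perceptron, so the numbers can be printed or handed to a plotting library. The perceptron, the toy data and all names here are assumptions chosen for illustration; the same bookkeeping applies to any iterative algorithm.

```python
def predict(weights, bias, x):
    """Step-function prediction: 1 if the weighted sum is non-negative, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0.0 else 0

def accuracy(weights, bias, data):
    """Fraction of (features, label) rows classified correctly."""
    correct = sum(1 for x, y in data if predict(weights, bias, x) == y)
    return correct / len(data)

def train_perceptron(train, test, epochs=10, learning_rate=0.1):
    """Train with the perceptron update rule, logging accuracy per epoch."""
    weights = [0.0] * len(train[0][0])
    bias = 0.0
    history = []  # one (epoch, train_accuracy, test_accuracy) tuple per epoch
    for epoch in range(1, epochs + 1):
        for x, y in train:
            error = y - predict(weights, bias, x)
            bias += learning_rate * error
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
        history.append((epoch, accuracy(weights, bias, train), accuracy(weights, bias, test)))
    return weights, bias, history

# Made-up toy data: (features, label)
train = [((1.0, 1.0), 0), ((2.0, 1.5), 0), ((6.0, 5.0), 1), ((7.0, 6.5), 1)]
test = [((1.5, 0.5), 0), ((6.5, 6.0), 1)]

_, _, history = train_perceptron(train, test)
for epoch, train_acc, test_acc in history:
    print(f"epoch {epoch:2d}  train={train_acc:.2f}  test={test_acc:.2f}")
```

Plotting the two accuracy columns of `history` against the epoch number (for example with matplotlib, if it is installed) gives the epoch-level curve described above.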

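Your Action Step

This has been a long post, and you have now learned how to implement machine learning algorithms from scratch.

Importantly, you have learned about the most common obstacles, some framing for how they come about, and some tactics you can use to turn them into opportunities.

Your next step is obvious: implement an algorithm from scratch.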

Original address: http://www.cnblogs.com/gooking/p/4818501.html
