Data Analysis with Pandas-(1)-Getting started with matrices

Posted: 2015-11-05 06:07:13

1. Reading data into NumPy

NumPy is a Python module with many functions for working with data. If you want to do serious work with data in Python, you'll use NumPy a lot. We'll work through importing NumPy and loading a CSV file.
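
A minimal sketch of loading CSV data with genfromtxt. To stay self-contained it reads from an in-memory string rather than a file; the column layout (Year, WHO region, Country, Type, Liters) and the rows are made-up toy values, not the real data set.

```python
from io import StringIO

import numpy as np

# Toy stand-in for the CSV file (hypothetical rows, assumed column layout).
csv_text = """Year,WHO region,Country,Type,Liters
1986,Western Pacific,Viet Nam,Wine,0
1985,Africa,Algeria,Beer,0.5
"""

# genfromtxt accepts a file name or a file-like object.
world_alcohol = np.genfromtxt(StringIO(csv_text), delimiter=",")
print(world_alcohol)
```

With a real file, the first argument would be the file name instead of the StringIO object.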

2. Fixing the data types

If you looked at the data you read in on the last screen, you may have noticed that it looked very strange. This is because genfromtxt reads the data into a NumPy array, and every element in an array has to be the same data type: everything is a string, or everything is an integer, and so on. By default, NumPy tried to convert all of our data to floats, so every value that isn't a number came out as nan. We'll need to specify the data type when we read our data in so we can avoid that.
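
One way to specify the data type is sketched below. The dtype value "U75" (unicode strings of up to 75 characters) and skipping the header row are assumptions chosen for illustration, and the rows are made-up toy values.

```python
from io import StringIO

import numpy as np

# Toy stand-in for the CSV file (hypothetical rows, assumed column layout).
csv_text = """Year,WHO region,Country,Type,Liters
1986,Western Pacific,Viet Nam,Wine,0
1985,Africa,Algeria,Beer,0.5
"""

# Read every field as a string and skip the header line.
world_alcohol = np.genfromtxt(
    StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1
)
print(world_alcohol)
```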

3. Indexing the data

Now that we know how to read in a file, let's start pulling values out. Remember how all elements in a matrix have an index? Indexes start at 0, so we can print the item at row 1, column 2 (the first row, second column) by typing print(world_alcohol[0,1]).
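
A small sketch of indexing, using a toy array in place of the real data (the rows and the column layout Year, WHO region, Country, Type, Liters are assumptions for illustration):

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
])

print(world_alcohol[0, 1])  # first row, second column
print(world_alcohol[1, 2])  # second row, third column
```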

4. Vectors

When we grab a whole row or column from the matrix, we actually end up with a vector. Just like a matrix is a 2-dimensional array because it has rows and columns, a vector is a 1-dimensional array. Vectors are similar to Python lists in that they can be indexed with only one number. Think of a vector as just a single row, or a single column.
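
As a rough illustration, a slice (:) in one position pulls out a whole column or row as a vector. The array below is a toy stand-in with made-up rows and an assumed column layout.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
])

countries = world_alcohol[:, 2]  # every row, third column
first_row = world_alcohol[0, :]  # first row, every column

print(countries.ndim)  # 1, so this is a vector
print(countries[1])    # a vector takes a single index
```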

5. Array shape

All arrays, whether they are 1-dimensional (vectors), 2-dimensional (matrices), or higher-dimensional, have a number of elements in each dimension. For example, a matrix may have 200 rows and 10 columns. We can use the shape attribute to find these dimensions.
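
A quick sketch of reading shape on a matrix and on a vector, using a toy array with made-up rows:

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

print(world_alcohol.shape)        # (rows, columns)
print(world_alcohol[:, 2].shape)  # a vector has a single dimension
```

Note that shape is read without parentheses, and it is always a tuple, even for a vector.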

6. Boolean elements

We can also use boolean statements on arrays to get truth values. The interesting part about this is that the booleans are computed elementwise.

A comparison such as world_alcohol[:,3] == "Beer" compares each element of the fourth column of world_alcohol with "Beer" and creates a new vector of True/False values.
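
A minimal sketch of an elementwise comparison on the fourth column, using a toy array (made-up rows, assumed column layout):

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1986", "Africa", "Algeria", "Wine", "1.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

# One True/False value per row of the fourth column.
beer = world_alcohol[:, 3] == "Beer"
print(beer)
```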

7. Subsets of vectors

We can subset vectors based on boolean vectors like the ones we generated in the last screen.

The expression world_alcohol[:,3][beer] selects only the elements in the fourth column whose value is "Beer". It goes through each position in the fourth column vector (from 0 to the last index) and checks whether the beer vector is True at the same position. If it is True, the element of the fourth column at that position goes into the subset. If it is False, the element is skipped.
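
The same idea as a runnable sketch, on a toy array with made-up rows. The second selection, which picks countries with the same boolean vector, is an extra illustration that is not in the original text.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

beer = world_alcohol[:, 3] == "Beer"

# Keep only the positions where beer is True.
print(world_alcohol[:, 3][beer])

# The boolean vector can subset any vector of the same length.
beer_countries = world_alcohol[:, 2][beer]
print(beer_countries)
```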

8. Subsets of matrices

We can subset a matrix in the same way that we can subset a vector.

Indexing with world_alcohol[beer,:] selects all of the rows in world_alcohol where the "Type" column equals "Beer". Because matrices are indexed using two numbers, we substitute the boolean vector beer for the first number. We can alter the second number to select different columns.

For example, world_alcohol[beer,1] would select the second column for the rows where the "Type" column equals "Beer".
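
A short sketch of both forms of matrix subsetting, on a toy array with made-up rows and an assumed column layout:

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

beer = world_alcohol[:, 3] == "Beer"

beer_rows = world_alcohol[beer, :]       # whole rows where Type is Beer
beer_second_col = world_alcohol[beer, 1]  # second column of those rows

print(beer_rows)
print(beer_second_col)
```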

9. Subsets with multiple conditions

So now we can find all of the rows that correspond to "Algeria", for example. But what if we really want to find all the rows for "Algeria" in "1985"?

We'll have to use multiple conditions to generate our boolean vector.

Joining two comparisons with & generates a boolean vector that uses multiple conditions. Each comparison is wrapped in parentheses so that the two component vectors are generated first (order of operations matters here, because & binds more tightly than ==). Then the two vectors are compared index by index. If both vectors are True at index 1, the resulting vector will be True at index 1. If either vector is False at index 1, the result will be False at index 1.

We can add more than two conditions if we want: we just have to put an & symbol between each one. The resulting vector will contain True in the positions corresponding to rows where all conditions are True, and False for rows where any condition is False.
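
An expanded sketch of combining conditions with &. It assumes the year is in the first column and the country in the third, which the text does not confirm; the rows are made-up toy values.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1986", "Africa", "Algeria", "Wine", "1.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

is_algeria = world_alcohol[:, 2] == "Algeria"  # [False, True, True, False]
is_1985 = world_alcohol[:, 0] == "1985"        # [False, True, False, True]

# True only where both vectors are True.
both = is_algeria & is_1985
print(both)

# The same thing in one expression; the parentheses are required.
both_inline = (world_alcohol[:, 2] == "Algeria") & (world_alcohol[:, 0] == "1985")
print(world_alcohol[both_inline, :])
```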

10. Convert a column to floats

We now know almost everything we need to compute how much alcohol the people in a country drank in a given year! But there are a couple of things we need to work through first. First, we need to convert the "Liters of alcohol drunk" column (the fifth one) to floats. The values are strings right now, and summing strings would not give a meaningful number. We can use the astype method on the array to do this.
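
A minimal sketch of astype on a column of numeric strings. The toy rows below are made up and contain only values that convert cleanly.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

# astype returns a new array; the original column stays a string column.
liters = world_alcohol[:, 4].astype(float)
print(liters)
print(liters.dtype)
```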

11. Replace values in an array

There are values in our alcohol consumption column that are preventing us from converting the column from strings to floats. In order to fix this, we first have to learn how to replace values. We can replace values in a NumPy array by just assigning to them with the equals sign.

For example, world_alcohol[:,4][world_alcohol[:,4] == '0'] = '10' will replace any item in the alcohol consumption column that equals '0' (remember that the world alcohol matrix is all string values) with '10'.
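
The replacement can be sketched on a toy array (made-up rows). The column slice is a view of the matrix, so assigning through it changes world_alcohol itself.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Canada", "Beer", "0"],
])

is_zero = world_alcohol[:, 4] == "0"

# Assign "10" to every position where the mask is True.
world_alcohol[:, 4][is_zero] = "10"
print(world_alcohol[:, 4])
```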

12. Convert the alcohol consumption column to floats

Now that you know what the bad value is, we can replace it and then convert the column to floats.
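
A sketch of replace-then-convert. The text does not say what the bad value is; the example below assumes it is an empty string and replaces it with "0", purely for illustration. The rows are made-up toy values.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
# The empty string stands in for the bad value (an assumption).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", ""],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

is_bad = world_alcohol[:, 4] == ""
world_alcohol[:, 4][is_bad] = "0"

# Every remaining value is a numeric string, so the conversion succeeds.
liters = world_alcohol[:, 4].astype(float)
print(liters)
```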

13. Compute the total alcohol consumption

We can compute the total value of a column using the sum method.
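
For instance, on a toy array with made-up rows:

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Africa", "Algeria", "Wine", "1.5"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

# Convert the string column to floats first, then add it up.
liters = world_alcohol[:, 4].astype(float)
total_alcohol = liters.sum()
print(total_alcohol)  # 4.0
```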

14. Finding how much alcohol a person in a country drank in a year

We can subset a vector with another vector, as we learned earlier. This means that we can find the total alcohol consumed by any given country in any given year now.
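
A sketch that combines the pieces: build a boolean vector for one country and year, subset the float column with it, and sum the result. Column positions (year first, country third, liters fifth) are assumptions and the rows are made up.

```python
import numpy as np

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Africa", "Algeria", "Wine", "1.5"],
    ["1986", "Africa", "Algeria", "Beer", "0.25"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

liters = world_alcohol[:, 4].astype(float)
is_algeria_1985 = (world_alcohol[:, 2] == "Algeria") & (world_alcohol[:, 0] == "1985")

# Sum only the rows that match both conditions.
total = liters[is_algeria_1985].sum()
print(total)  # 2.0
```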

15. A function to sum yearly alcohol consumption

Now that we know how to find the total alcohol consumption of the average person in a country in a given year, we can make a function out of it. A function will make it easier for us to calculate the alcohol consumption for all countries.
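
One possible shape for such a function is sketched below. The name total_consumption, its parameters, and the column positions are hypothetical choices for illustration, not the original code; the rows are made up.

```python
import numpy as np

def total_consumption(data, country, year):
    """Sum the liters column (index 4) for one country and year.

    Assumes year is in column 0 and country in column 2, all as strings.
    """
    is_match = (data[:, 2] == country) & (data[:, 0] == year)
    return data[is_match, 4].astype(float).sum()

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Africa", "Algeria", "Wine", "1.5"],
    ["1986", "Africa", "Algeria", "Beer", "0.25"],
    ["1985", "Americas", "Canada", "Beer", "2.0"],
])

print(total_consumption(world_alcohol, "Algeria", "1985"))  # 2.0
print(total_consumption(world_alcohol, "Canada", "1986"))   # 0.0, no matching rows
```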

16. Finding the country that drinks the least

We can now loop over our dictionary keys to find the country with the lowest amount of alcohol consumed per person in 1989.
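
A sketch of the loop, assuming the dictionary maps each country name to its 1989 total. The helper total_consumption, the dictionary name totals, and the toy rows are all hypothetical.

```python
import numpy as np

def total_consumption(data, country, year):
    # Assumes year in column 0, country in column 2, liters in column 4.
    is_match = (data[:, 2] == country) & (data[:, 0] == year)
    return data[is_match, 4].astype(float).sum()

# Made-up rows: Year, WHO region, Country, Type, Liters.
world_alcohol = np.array([
    ["1989", "Africa", "Algeria", "Beer", "0.5"],
    ["1989", "Africa", "Algeria", "Wine", "0.25"],
    ["1989", "Americas", "Canada", "Beer", "2.0"],
    ["1989", "Western Pacific", "Viet Nam", "Wine", "1.0"],
])

# Build the dictionary: one total per distinct country.
totals = {}
for country in np.unique(world_alcohol[:, 2]):
    totals[str(country)] = total_consumption(world_alcohol, country, "1989")

# Loop over the keys, remembering the smallest total seen so far.
lowest_country = None
for country in totals:
    if lowest_country is None or totals[country] < totals[lowest_country]:
        lowest_country = country

print(lowest_country, totals[lowest_country])
```

The built-in min(totals, key=totals.get) would give the same country in one line; the explicit loop mirrors the wording of the text.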


Original article: http://www.cnblogs.com/yuehq/p/4937906.html
