A NumPy array must hold a single, consistent data type.
pandas is built on NumPy, but each column of a DataFrame can have a different data type.
pandas defines two data types of its own: Series and DataFrame.
A Series is essentially a list "stood on end": a single labeled column.
The following builds a DataFrame whose columns have different data types:
import pandas as pd
# keys are the column labels; values are the data, column by column
data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98]
}
brics = pd.DataFrame(data)
brics.index = ["BR", "RU", "IN", "CH", "SA"]
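A minimal sketch of the point about mixed data types, using the same BRICS data as the notes. It only inspects the dtypes and the custom row labels; the variable names are illustrative.

```python
import pandas as pd

data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}
brics = pd.DataFrame(data)
brics.index = ["BR", "RU", "IN", "CH", "SA"]

# Each column carries its own dtype: text columns vs. floating-point columns
print(brics.dtypes)

# The custom labels replace the default 0..4 integer index
print(list(brics.index))
```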
DataFrame from CSV file
brics = pd.read_csv('path/to/brics.csv', index_col=0)
Index and Select Data
● Square brackets
● Advanced methods
  ● loc
  ● iloc
Column Access [ ]: single brackets [ ] return a Series.
Double brackets [[ ]] return a DataFrame, which keeps the column names:
brics[["country", "capital"]]
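A small sketch of the single-bracket vs. double-bracket difference, on a cut-down version of the BRICS data (three rows, two columns, chosen only to keep the example short):

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India"],
        "capital": ["Brasilia", "Moscow", "New Delhi"],
    },
    index=["BR", "RU", "IN"],
)

as_series = brics["country"]               # single brackets -> Series
as_frame = brics[["country"]]              # double brackets -> one-column DataFrame
two_cols = brics[["country", "capital"]]   # double brackets -> two-column DataFrame

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(list(two_cols.columns))
```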
Row Access [ ]
● Square brackets: limited functionality
● Ideally, as with 2D NumPy arrays: my_array[rows, columns]
● In pandas:
  ● loc (label-based)
  ● iloc (integer position-based)
● Square brackets
  ● Column access: brics[["country", "capital"]]
  ● Row access, only through slicing: brics[1:4]
● loc (label-based)
  ● Row access: brics.loc[["RU", "IN", "CH"]]
  ● Column access: brics.loc[:, ["country", "capital"]]
  ● Row & Column access: brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
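The selections listed above can be sketched side by side. The notes give no iloc example, so the iloc calls here are an added illustration of the integer-position counterparts, not something taken from the source.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Square brackets: rows only through slicing (positions 1, 2, 3; end excluded)
rows_slice = brics[1:4]

# loc: label-based
rows_loc = brics.loc[["RU", "IN", "CH"]]
sub_loc = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

# iloc: the same selections by integer position
rows_iloc = brics.iloc[[1, 2, 3]]
sub_iloc = brics.iloc[[1, 2, 3], [0, 1]]

print(rows_loc.equals(rows_iloc))
print(sub_loc.equals(sub_iloc))
```

Note that loc needs the row labels ("RU", "IN", "CH") while iloc needs their positions (1, 2, 3); both return the same rows here.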
You can think of a DataFrame as a group of Series that share an index. This makes it easy to select specific columns that you want from the DataFrame.
Also a couple pointers:
1) Selecting a single column from the DataFrame will return a Series
2) Selecting multiple columns from the DataFrame will return a DataFrame
Row selection can be done through multiple ways.
Some of the basic and common methods are:
1) Slicing
2) An individual index (through the functions iloc or loc)
3) Boolean indexing
You can also combine multiple selection requirements through boolean
operators like & (and) or | (or)
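A minimal sketch of boolean indexing with combined conditions, using the BRICS area and population columns. The thresholds (8 and 1000) are arbitrary values picked for illustration.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Wrap each comparison in parentheses: & and | bind more tightly than > or <
large_and_populous = brics[(brics["area"] > 8) & (brics["population"] > 1000)]
large_or_populous = brics[(brics["area"] > 8) | (brics["population"] > 1000)]

print(list(large_and_populous.index))
print(list(large_or_populous.index))
```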
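# df is assumed to be an all-numeric DataFrame that includes 'gold', 'silver' and 'bronze' columns
import numpy
df.apply(numpy.mean)  # mean of every column
avg_medals = df[['gold', 'silver', 'bronze']].apply(numpy.mean)  # mean of each medal column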
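The notes do not show the data behind df, so the following sketch uses a made-up medal table (the countries and counts are hypothetical) to show what apply() with a mean function returns: one value per column, as a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical medal counts, invented for illustration only
df = pd.DataFrame(
    {
        "gold": [13, 11, 6],
        "silver": [11, 5, 8],
        "bronze": [9, 10, 5],
    },
    index=["A", "B", "C"],
)

# apply() calls the function once per column and collects the results
avg_medals = df[["gold", "silver", "bronze"]].apply(np.mean)
print(avg_medals)
```

For this particular case, df[["gold", "silver", "bronze"]].mean() should give the same numbers without going through apply().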
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col=0)
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']    # Series holding the cars_per_cap column
many_cars = cpc > 500         # boolean Series
car_maniac = cars[many_cars]  # use the boolean Series to select the matching observations
# Print car_maniac
print(car_maniac)
Earlier, comparison operators like < and >= worked with NumPy arrays out of the box. Unfortunately, this is not true for the boolean operators and, or, and not.
To use these operators with NumPy, you will need np.logical_and(), np.logical_or() and np.logical_not().
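A short sketch of the three functions on a plain NumPy array. The cars_per_cap values below are hypothetical stand-ins, since the contents of cars.csv are not shown in these notes.

```python
import numpy as np

# Hypothetical cars-per-capita values, invented for illustration
cpc = np.array([809, 731, 588, 18, 200, 70, 45])

# Element-wise "and": between 100 and 500
between = np.logical_and(cpc > 100, cpc < 500)

# Element-wise "not": everything outside that range
outside = np.logical_not(between)

# Element-wise "or": very low or very high
extreme = np.logical_or(cpc < 50, cpc > 700)

print(cpc[between])
print(cpc[extreme])
```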
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Import numpy, you'll need this
import numpy as np
# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]
# Print medium
print(medium)
Loop over DataFrame
Iterating over a Pandas DataFrame is typically done with the iterrows() method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents are available:
for lab, row in brics.iterrows():
    ...
The row data that's generated by iterrows() on every run is a Pandas Series.
This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:
for lab, row in brics.iterrows():
    print(row['country'])
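A self-contained sketch of the loop above that uses both the row label and a value from the row. The two-row DataFrame is a shortened version of the BRICS data, and collecting the text in a list is just a convenience for this example.

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

lines = []
for lab, row in brics.iterrows():
    # lab is the row label; row is a Series indexed by column name
    lines.append(lab + ": " + row["capital"])

print(lines)
```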
Add column (1)
Add a new column 'name_length' that holds the length of each country name:
for lab, row in brics.iterrows():
    brics.loc[lab, "name_length"] = len(row["country"])
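As a possible alternative to the loop, the same column can be computed in one step by applying len to the country column. This is a sketch of a different technique from the one in the notes, shown next to the loop version so the results can be compared.

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Loop version from the notes, run on a copy
looped = brics.copy()
for lab, row in looped.iterrows():
    looped.loc[lab, "name_length"] = len(row["country"])

# One-step version: call len on every element of the column
brics["name_length"] = brics["country"].apply(len)

print(brics)
```

The one-step version avoids building a Series for every row, so it tends to be the more convenient choice on larger DataFrames.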