A NumPy array must hold a single, consistent data type.

pandas is built on NumPy, but a DataFrame can hold a different data type in each column.
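
A minimal sketch of the difference, using a two-row excerpt of the country data from these notes: NumPy upcasts mixed numbers to one dtype, while a DataFrame tracks a dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has exactly one dtype; mixed ints and floats are upcast.
arr = np.array([1, 2.5, 3])
print(arr.dtype)  # float64

# A DataFrame tracks a separate dtype for each column.
df = pd.DataFrame({
    "country": ["Brazil", "Russia"],  # text column
    "area": [8.516, 17.10],           # float column
})
print(df.dtypes)
```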

The brics DataFrame built below has columns with different data types (strings and floats).

import pandas as pd

pandas defines two main data structures: Series and DataFrame.

A Series is like a list stood upright: a single column of values, each with a label.
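
As a small illustration (the variable name area is only for this sketch), a Series can be built from a list plus an index, then read by label or by position:

```python
import pandas as pd

# Areas taken from the brics data in these notes, labeled by country code.
area = pd.Series([8.516, 17.10, 3.286], index=["BR", "RU", "IN"])

print(area["RU"])    # look up by label
print(area.iloc[0])  # look up by integer position
```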

# Note: the variable name dict shadows Python's built-in dict type.
dict = {
 "country":    ["Brazil", "Russia", "India", "China", "South Africa"],
 "capital":    ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
 "area":       [8.516, 17.10, 3.286, 9.597, 1.221],
 "population": [200.4, 143.5, 1252, 1357, 52.98] }

brics = pd.DataFrame(dict)

The dictionary keys become the column labels; the values are the data, column by column.

brics.index = ["BR", "RU", "IN", "CH", "SA"]

DataFrame from CSV file

brics = pd.read_csv('path/to/brics.csv', index_col=0)
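
To see what index_col=0 does without a file on disk, here is a sketch that reads the same kind of data from an in-memory string. The CSV contents and the csv_text name are illustrative assumptions, not the real brics.csv.

```python
import io
import pandas as pd

# Stand-in for a CSV file; the first (unnamed) column holds the row labels.
csv_text = """,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

# index_col=0 turns the first column into the row index instead of a data column.
brics = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(brics.index.tolist())
print(brics.columns.tolist())
```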

Index and Select Data

● Square brackets

● Advanced methods
    ● loc

    ● iloc

Column Access [ ]: single brackets, as in brics["country"], return a Series.

Double brackets [[ ]] return a DataFrame, which keeps the column names.

brics[["country", "capital"]]
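
A quick check of the two return types, on a two-row excerpt of the data above (the names one and two are just for this sketch):

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

one = brics["country"]     # single brackets
two = brics[["country"]]   # double brackets
print(type(one).__name__)  # Series
print(type(two).__name__)  # DataFrame
```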

Row Access [ ]

● Square brackets: limited functionality

● Ideally
    ● 2D NumPy arrays: my_array[rows, columns]

● Pandas
    ● loc (label-based)

    ● iloc (integer position-based)

● Square brackets
    ● Column access: brics[["country", "capital"]]

    ● Row access (only through slicing): brics[1:4]

● loc (label-based)
    ● Row access: brics.loc[["RU", "IN", "CH"]]

    ● Column access: brics.loc[:, ["country", "capital"]]

    ● Row & column access: brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
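
The notes list iloc but give no example of it. As a sketch, the same selection can be written with labels (loc) or with integer positions (iloc); the variable names are illustrative.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Same rows and columns, selected by label and by integer position.
by_label = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
by_position = brics.iloc[[1, 2, 3], [0, 1]]

# Plain square brackets with a slice select rows at positions 1, 2 and 3.
by_slice = brics[1:4]

print(by_label.equals(by_position))
```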

You can think of a DataFrame as a group of Series that share an index. This makes it easy to select specific columns that you want from the DataFrame.

A couple of pointers:

1) Selecting a single column from the DataFrame will return a Series

2) Selecting multiple columns from the DataFrame will return a DataFrame
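
One way to picture "a group of Series that share an index" is to build a DataFrame from Series directly. This is a sketch with illustrative variable names:

```python
import pandas as pd

index = ["BR", "RU", "IN"]
country = pd.Series(["Brazil", "Russia", "India"], index=index)
area = pd.Series([8.516, 17.10, 3.286], index=index)

# Each Series becomes a column; the shared labels become the row index.
df = pd.DataFrame({"country": country, "area": area})

# Pulling a column back out gives a Series with the same index as the DataFrame.
print(df["area"].index.equals(df.index))
```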

Row selection can be done in several ways.

Some of the basic and common methods are:

1) Slicing

2) An individual index (through the indexers iloc or loc)

3) Boolean indexing

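You can also combine multiple selection conditions with boolean operators such as & (and) and | (or). Wrap each condition in parentheses, because & and | bind more tightly than comparisons like > and <.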
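
A short sketch of boolean indexing with combined conditions, using the area and population columns from the brics data. The thresholds and result names are arbitrary choices for illustration.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Each comparison yields a boolean Series; & and | combine them element-wise.
large_and_populous = brics[(brics["area"] > 8) & (brics["population"] > 1000)]
large_or_populous = brics[(brics["area"] > 8) | (brics["population"] > 1000)]

print(large_and_populous.index.tolist())
print(large_or_populous.index.tolist())
```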

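import numpy
# df is assumed to be an existing DataFrame; apply numpy.mean to every column
df.apply(numpy.mean)

# Mean of the medal columns only; the result is a Series indexed by column name
count = df[['gold', 'silver', 'bronze']].apply(numpy.mean)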
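
Since df is not defined in these notes, here is a self-contained sketch with made-up medal counts (the data and the name avg are hypothetical) showing what apply returns:

```python
import numpy as np
import pandas as pd

# Hypothetical medal counts for three countries.
df = pd.DataFrame({
    "gold": [10, 4, 1],
    "silver": [5, 3, 2],
    "bronze": [0, 6, 3],
})

# apply calls np.mean once per column and collects the results in a Series.
avg = df[["gold", "silver", "bronze"]].apply(np.mean)
print(avg)
```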

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']  # Series holding the cars_per_cap column
many_cars = cpc > 500       # boolean Series: True where cars_per_cap exceeds 500

# Use the boolean Series to subset the cars DataFrame to the matching observations
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)

Before, the comparison operators like < and >= worked with NumPy arrays out of the box. Unfortunately, this is not true for the boolean operators and, or, and not.

To use these operators with NumPy, you will need np.logical_and(), np.logical_or() and np.logical_not().

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Import numpy, you'll need this
import numpy as np

# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]

# Print medium
print(medium)
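
np.logical_or() and np.logical_not() follow the same pattern. The sketch below uses a small made-up cars table instead of cars.csv (the values and the names extreme and not_high are assumptions) and also shows the equivalent & form for the medium selection.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cars.csv.
cars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45]},
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)
cpc = cars["cars_per_cap"]

# Either condition holds: very few or very many cars per capita.
extreme = cars[np.logical_or(cpc < 50, cpc > 700)]

# Negation: everything that is not above 500.
not_high = cars[np.logical_not(cpc > 500)]

# The & operator gives the same result as np.logical_and here.
medium = cars[(cpc > 100) & (cpc < 500)]

print(extreme.index.tolist())
print(not_high.index.tolist())
print(medium.index.tolist())
```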

Loop over DataFrame

Iterating over a Pandas DataFrame is typically done with the iterrows() method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents are available:

for lab, row in brics.iterrows() :
    ...

The row data that's generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:

for lab, row in brics.iterrows() :
    print(row['country'])
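
A runnable sketch of the loop on a two-row excerpt of brics, collecting one string per row to show that lab is the row label and row is a Series indexed by column name (the lines list is illustrative):

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

lines = []
for lab, row in brics.iterrows():
    # lab is the row label; row["capital"] selects one value from the row Series.
    lines.append(lab + ": " + row["capital"])

print(lines)
```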

Add column (1)

Add a new column 'name_length' holding the number of characters in each country name:
for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])
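
The loop above writes one cell at a time. A common alternative, sketched here as an option and not something these notes prescribe, is to compute the whole column at once with apply on the country Series:

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# apply(len) calls len on every country name and returns a new Series,
# which is assigned as the name_length column in one step.
brics["name_length"] = brics["country"].apply(len)
print(brics["name_length"].tolist())
```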
