Norsaa 2019-12-12
Two kinds of missing data:
Types:
None: None is an object; type(None) is NoneType
np.nan (NaN): np.nan is a float; type(np.nan) is float
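A quick interactive check of the two types (a minimal sketch):

```python
import numpy as np

print(type(None))        # <class 'NoneType'>
print(type(np.nan))      # <class 'float'>

# A defining quirk of NaN: it never compares equal to itself
print(np.nan == np.nan)  # False
```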
The difference between the two:
Operations on object dtype are slower than on float dtype.
Timing both:
```python
import numpy as np

%timeit np.arange(1000, dtype=object).sum()
# Result: 63.4 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.arange(1000, dtype=float).sum()
# Result: 6.45 µs ± 84.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
None and NaN in pandas
```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame(np.random.randint(0, 100, size=(8, 6)))
df.iloc[2, 3] = np.nan
df.iloc[5, 5] = None  # pandas coerces None to np.nan (float) internally
df.iloc[5, 3] = np.nan
df.iloc[7, 2] = np.nan
df
```
 | 0 | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---
0 | 0 | 81 | 60.0 | 92.0 | 99 | 1.0 |
1 | 62 | 67 | 98.0 | 41.0 | 43 | 80.0 |
2 | 82 | 77 | 6.0 | NaN | 42 | 64.0 |
3 | 63 | 30 | 63.0 | 22.0 | 99 | 24.0 |
4 | 97 | 60 | 68.0 | 85.0 | 24 | 35.0 |
5 | 44 | 59 | 50.0 | NaN | 29 | NaN |
6 | 65 | 0 | 4.0 | 74.0 | 35 | 83.0 |
7 | 17 | 78 | NaN | 33.0 | 61 | 6.0 |
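As the comment above notes, writing None (or np.nan) into an integer column upcasts it to float; a quick way to confirm the coercion (sketch):

```python
# The None written at df.iloc[5, 5] arrives as np.nan, and the affected
# columns (2, 3 and 5) now report a float64 dtype
print(df.iloc[5, 5])   # nan
print(df.dtypes)
```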
Deleting the rows that contain null values
```python
# Both Series and DataFrame support isnull()
df.isnull()  # marks which elements are null
```
 | 0 | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | False | True | False | False |
3 | False | False | False | False | False | False |
4 | False | False | False | False | False | False |
5 | False | False | False | True | False | True |
6 | False | False | False | False | False | False |
7 | False | False | True | False | False | False |
1. Cleaning rows that contain nulls
```python
# Which rows contain nulls?
df.isnull().all(axis=1)  # all() is normally paired with notnull() instead
# Result:
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6    False
# 7    False
# dtype: bool

df.isnull().any(axis=1)
# Result:
# 0    False
# 1    False
# 2     True
# 3    False
# 4    False
# 5     True
# 6    False
# 7     True
# dtype: bool

df.notnull().all(axis=1)
# Result:
# 0     True
# 1     True
# 2    False
# 3     True
# 4     True
# 5    False
# 6     True
# 7    False
# dtype: bool

# Drop the rows that contain nulls
df.loc[df.notnull().all(axis=1)]
# isnull()  --> any: True means the row contains at least one null
# notnull() --> all: False means the row contains at least one null
```
 | 0 | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---
0 | 45 | 98 | 41.0 | 65.0 | 90 | 50.0 |
1 | 65 | 87 | 99.0 | 34.0 | 97 | 71.0 |
3 | 35 | 19 | 38.0 | 54.0 | 91 | 24.0 |
4 | 62 | 86 | 62.0 | 54.0 | 87 | 38.0 |
6 | 60 | 99 | 15.0 | 32.0 | 54 | 16.0 |
2. Filtering out rows with nulls directly with the dropna function
```python
# For the drop family of methods: axis=0 targets rows, axis=1 targets columns
df.dropna(axis=0)
```
 | 0 | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---
0 | 45 | 98 | 41.0 | 65.0 | 90 | 50.0 |
1 | 65 | 87 | 99.0 | 34.0 | 97 | 71.0 |
3 | 35 | 19 | 38.0 | 54.0 | 91 | 24.0 |
4 | 62 | 86 | 62.0 | 54.0 | 87 | 38.0 |
6 | 60 | 99 | 15.0 | 32.0 | 54 | 16.0 |
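dropna also accepts thresh and subset for finer control; a short sketch of both (the column labels are those of the example frame):

```python
# Keep a row only if it has at least 5 non-null values
df.dropna(axis=0, thresh=5)

# Only look for nulls in columns 3 and 5 when deciding what to drop
df.dropna(axis=0, subset=[3, 5])
```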
Filling null values
```python
# Fill with an arbitrary value
df.fillna(value=666)  # replaces every null with value

# Filling with a neighboring value is usually preferable
df.fillna(method='ffill', axis=1)  # axis picks the direction; method='ffill' fills forward, 'bfill' fills backward

# If nulls remain after one pass, chain a second fill in the other direction
df.fillna(method='bfill', axis=1).fillna(method='ffill', axis=1)
```
 | 0 | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---
0 | 0.0 | 81.0 | 60.0 | 92.0 | 99.0 | 1.0 |
1 | 62.0 | 67.0 | 98.0 | 41.0 | 43.0 | 80.0 |
2 | 82.0 | 77.0 | 6.0 | 6.0 | 42.0 | 64.0 |
3 | 63.0 | 30.0 | 63.0 | 22.0 | 99.0 | 24.0 |
4 | 97.0 | 60.0 | 68.0 | 85.0 | 24.0 | 35.0 |
5 | 44.0 | 59.0 | 50.0 | 50.0 | 29.0 | 29.0 |
6 | 65.0 | 0.0 | 4.0 | 74.0 | 35.0 | 83.0 |
7 | 17.0 | 78.0 | 78.0 | 33.0 | 61.0 | 6.0 |
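Besides neighbor filling, a column statistic is another common choice; a minimal sketch of a per-column mean fill:

```python
# fillna accepts a Series indexed by column label, so each column's
# nulls are replaced by that column's own mean
df.fillna(value=df.mean(axis=0))
```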
Task: the Excel data below contains two empty columns (none, none1) and scattered missing readings; clean the null values.
The raw data:
time | none | 1 | 2 | 3 | 4 | none1 | 5 | 6 | 7
---|---|---|---|---|---|---|---|---|---
2019/1/27 17:00 | | -24.8 | -18.2 | -20.8 | -18.8 | | NULL | NULL | NULL
2019/1/27 17:01 | | -23.5 | -18.8 | -20.5 | -19.8 | | -15.2 | -14.5 | -16
2019/1/27 17:02 | | -23.2 | -19.2 | NULL | NULL | | -13 | NULL | -14
2019/1/27 17:03 | | -22.8 | -19.2 | -20 | -20.5 | | NULL | -12.2 | -9.8
2019/1/27 17:04 | | -23.2 | -18.5 | -20 | -18.8 | | -10.2 | -10.8 | -8.8
Implementation:

```python
import pandas as pd

# Preprocess: read the Excel data, then delete the empty columns none and none1
df = pd.read_excel('./data/testData.xlsx')
df.drop(labels=['none', 'none1'], axis=1, inplace=True)
df

# Dropping the rows that hold nulls loses data; if that cost is too high, fill instead
df.dropna(axis=0)

# Fill: forward fill down the columns, then backward fill to catch any leading nulls
new_df = df.fillna(axis=0, method='ffill').fillna(axis=0, method='bfill')

# Check that every null was filled
new_df.isnull().any(axis=0)
# Result:
# time    False
# 1       False
# 2       False
# 3       False
# 4       False
# 5       False
# 6       False
# 7       False
# dtype: bool
```
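Since the readings are ordered in time, linear interpolation is a reasonable alternative to ffill/bfill; a sketch, assuming every column other than time is numeric:

```python
# Interpolate each numeric column linearly down the rows,
# leaving the time column untouched
num_cols = df.columns.drop('time')
df[num_cols] = df[num_cols].interpolate(method='linear', axis=0)
```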
Preparing the data:
```python
df = DataFrame(data=np.random.randint(0, 100, size=(10, 8)))
df.iloc[1] = [1, 1, 1, 1, 1, 1, 1, 1]
df.iloc[3] = [1, 1, 1, 1, 1, 1, 1, 1]
df.iloc[5] = [1, 1, 1, 1, 1, 1, 1, 1]
df.iloc[7] = [1, 1, 1, 1, 1, 1, 1, 1]
df
```
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---|---|---
0 | 85 | 27 | 59 | 81 | 14 | 65 | 86 | 17 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 65 | 2 | 69 | 44 | 99 | 12 | 75 | 37 |
3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4 | 47 | 63 | 62 | 9 | 77 | 36 | 15 | 90 |
5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
6 | 7 | 38 | 2 | 5 | 93 | 86 | 4 | 92 |
7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
8 | 22 | 12 | 45 | 93 | 29 | 42 | 96 | 59 |
9 | 11 | 10 | 49 | 60 | 45 | 58 | 23 | 53 |
Dropping duplicated rows
```python
df.drop_duplicates(keep='last')
# keep='first' keeps the first occurrence of each duplicated row
# keep='last'  keeps the last occurrence
# keep=False   drops every occurrence (a boolean, not the string 'False')
```
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---|---|---
0 | 85 | 27 | 59 | 81 | 14 | 65 | 86 | 17 |
2 | 65 | 2 | 69 | 44 | 99 | 12 | 75 | 37 |
4 | 47 | 63 | 62 | 9 | 77 | 36 | 15 | 90 |
6 | 7 | 38 | 2 | 5 | 93 | 86 | 4 | 92 |
7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
8 | 22 | 12 | 45 | 93 | 29 | 42 | 96 | 59 |
9 | 11 | 10 | 49 | 60 | 45 | 58 | 23 | 53 |
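To inspect which rows drop_duplicates would remove before removing them, duplicated() returns the matching boolean mask (sketch):

```python
# True marks the rows that drop_duplicates(keep='last') would discard
df.duplicated(keep='last')

# keep=False flags, and here drops, every copy of a duplicated row
df.drop_duplicates(keep=False)
```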
Task: build a data source of 1000 rows and 3 columns (A, B, C) with values in [0, 1), then clean out the rows whose value in column C exceeds twice that column's standard deviation.
Implementation:
```python
# A 1000-row, 3-column (A, B, C) data source with values in [0, 1)
df = DataFrame(data=np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])
df.head()
```
 | A | B | C
---|---|---|---
0 | 0.483195 | 0.786152 | 0.169980 |
1 | 0.000227 | 0.593595 | 0.378660 |
2 | 0.859977 | 0.190864 | 0.885891 |
3 | 0.790337 | 0.898124 | 0.244129 |
4 | 0.212919 | 0.687666 | 0.827701 |
```python
# Twice the standard deviation of column C
std_twice = df['C'].std() * 2

# The condition that flags outliers; keep only the rows that do not satisfy it
df.loc[~(df['C'] > std_twice)]
```
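The same cleaning can be phrased as collecting the offending row labels and dropping them, which mutates df in place; a sketch:

```python
# Labels of the rows where C exceeds twice the standard deviation
bad_index = df.loc[df['C'] > std_twice].index

# Drop those rows from the original frame
df.drop(labels=bad_index, axis=0, inplace=True)
```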
pd.concat: pandas provides the pd.concat function, which behaves much like np.concatenate.
Parameter overview:
- objs
- axis=0
- keys
- join='outer' / 'inner': the concatenation mode; 'outer' concatenates all items (matching columns or not), while 'inner' only concatenates the items that match and drops the rest
- ignore_index=False
Matching concatenation
```python
df1 = DataFrame({'employee': ['Bobs', 'Linda', 'Bill'],
                 'group': ['Accounting', 'Product', 'Marketing'],
                 'hire_date': [1998, 2017, 2018]})
pd.concat((df1, df1), axis=0)
```
 | employee | group | hire_date
---|---|---|---
0 | Bobs | Accounting | 1998 |
1 | Linda | Product | 2017 |
2 | Bill | Marketing | 2018 |
0 | Bobs | Accounting | 1998 |
1 | Linda | Product | 2017 |
2 | Bill | Marketing | 2018 |
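The keys and ignore_index parameters listed above change how the result is indexed; a short sketch reusing df1:

```python
# keys labels each input block, producing a hierarchical row index
pd.concat((df1, df1), axis=0, keys=['a', 'b'])

# ignore_index throws away the duplicated 0/1/2 labels and renumbers the rows
pd.concat((df1, df1), axis=0, ignore_index=True)
```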
Mismatched concatenation
```python
df2 = df1.copy()
df2.columns = ['employee', 'groupps', 'hire_date']
"""
  employee     groupps  hire_date
0     Bobs  Accounting       1998
1    Linda     Product       2017
2     Bill   Marketing       2018
"""
pd.concat((df1, df2), axis=0)
```
 | employee | group | groupps | hire_date
---|---|---|---|---
0 | Bobs | Accounting | NaN | 1998 |
1 | Linda | Product | NaN | 2017 |
2 | Bill | Marketing | NaN | 2018 |
0 | Bobs | NaN | Accounting | 1998 |
1 | Linda | NaN | Product | 2017 |
2 | Bill | NaN | Marketing | 2018 |
join
- inner: only concatenates the items that match
- outer: concatenates all items
```python
pd.concat((df1, df2), axis=0, join='inner')
```
 | employee | hire_date
---|---|---
0 | Bobs | 1998 |
1 | Linda | 2017 |
2 | Bill | 2018 |
0 | Bobs | 1998 |
1 | Linda | 2017 |
2 | Bill | 2018 |
Using the append function
append can only concatenate vertically (row-wise).
```python
df1.append(df2)
```
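Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions the equivalent call is:

```python
pd.concat((df1, df2), axis=0)
```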
Merging overview:
One-to-one merge
```python
df1 = DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                 'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                 'hire_date': [2004, 2008, 2012]})
pd.merge(df1, df2, on='employee')
```
df1 looks like:
 | employee | group
---|---|---
0 | Bob | Accounting
1 | Jake | Engineering
2 | Lisa | Engineering
df2 looks like:
 | employee | hire_date
---|---|---
0 | Lisa | 2004
1 | Bob | 2008
2 | Jake | 2012
The merged table:
 | employee | group | hire_date
---|---|---|---
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
Many-to-many merge
```python
df1 = DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                 'group': ['Accounting', 'Engineering', 'Engineering']})
df5 = DataFrame({'group': ['Engineering', 'Engineering', 'HR'],
                 'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df1, df5, how='outer')
```
df1 looks like:
 | employee | group
---|---|---
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
df5 looks like:
 | group | supervisor
---|---|---
0 | Engineering | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
The merged table:
 | employee | group | supervisor
---|---|---|---
0 | Bob | Accounting | NaN |
1 | Jake | Engineering | Carly |
2 | Jake | Engineering | Guido |
3 | Lisa | Engineering | Carly |
4 | Lisa | Engineering | Guido |
5 | NaN | HR | Steve |
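The how parameter decides which keys survive: 'inner' (the default) keeps only keys present in both frames, 'outer' keeps all of them, and 'left'/'right' keep one side's keys. A sketch against the frames above:

```python
# Keeps Bob with a NaN supervisor, but drops the unmatched HR row
pd.merge(df1, df5, how='left')
```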
Normalizing the key
1. When columns conflict, i.e. the two tables share more than one column name, use on= to specify which column serves as the key, and pair it with suffixes to rename the conflicting columns (see the sketch after the merged table below).
```python
df1 = DataFrame({'employee': ['Jack', 'Summer', 'Steve'],
                 'group': ['Accounting', 'Finance', 'Marketing']})
df2 = DataFrame({'employee': ['Jack', 'Bob', 'Jake'],
                 'hire_date': [2003, 2009, 2012],
                 'group': ['Accounting', 'sell', 'ceo']})
pd.merge(df1, df2, on='group')
```
df1 looks like:
 | employee | group
---|---|---
0 | Jack | Accounting |
1 | Summer | Finance |
2 | Steve | Marketing |
df2 looks like:
 | employee | group | hire_date
---|---|---|---
0 | Jack | Accounting | 2003 |
1 | Bob | sell | 2009 |
2 | Jake | ceo | 2012 |
The merged table:
 | employee_x | group | employee_y | hire_date
---|---|---|---|---
0 | Jack | Accounting | Jack | 2003 |
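In the merged table above, the clashing employee column received the default suffixes _x and _y; the suffixes parameter picks custom ones instead (the suffix strings here are illustrative):

```python
# Rename the conflicting employee columns explicitly
pd.merge(df1, df2, on='group', suffixes=('_left', '_right'))
```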
2. When the two tables have no column in common to join on, use left_on and right_on to specify which column on each side of the merge acts as the join key.
```python
df1 = DataFrame({'employee': ['Bobs', 'Linda', 'Bill'],
                 'group': ['Accounting', 'Product', 'Marketing'],
                 'hire_date': [1998, 2017, 2018]})
df5 = DataFrame({'name': ['Lisa', 'Bobs', 'Bill'],
                 'hire_dates': [1998, 2016, 2007]})
pd.merge(df1, df5, left_on='employee', right_on='name', how='outer')
```
df1 looks like:
 | employee | group | hire_date
---|---|---|---
0 | Bobs | Accounting | 1998 |
1 | Linda | Product | 2017 |
2 | Bill | Marketing | 2018 |
df5 looks like:
 | hire_dates | name
---|---|---
0 | 1998 | Lisa |
1 | 2016 | Bobs |
2 | 2007 | Bill |
The merged table:
 | employee | group | hire_date | hire_dates | name
---|---|---|---|---|---
0 | Bobs | Accounting | 1998.0 | 2016.0 | Bobs |
1 | Linda | Product | 2017.0 | NaN | NaN |
2 | Bill | Marketing | 2018.0 | 2007.0 | Bill |
3 | NaN | NaN | NaN | 1998.0 | Lisa |
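After a left_on/right_on merge, the two key columns carry the same information, so one of them is usually dropped; a sketch:

```python
# Remove the now-redundant right-hand key column
pd.merge(df1, df5, left_on='employee', right_on='name', how='outer').drop(labels='name', axis=1)
```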
A note on how quantiles are computed: the calculation runs in three steps; after locating the target position between two neighboring values, lower (i) is the smaller value of that pair, higher (j) is the larger value, and fraction is the fractional part produced in the third step, i.e. where between 0 and 1 the requested quantile falls within that pair.
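A tiny worked example of that interpolation (a sketch using numpy): the 50th percentile of [1, 3, 5, 7] lands halfway between positions 1 and 2, so lower = 3, higher = 5, fraction = 0.5, and the result is 3 + (5 - 3) * 0.5 = 4.

```python
import numpy as np

# Default linear interpolation: lower=3, higher=5, fraction=0.5 -> 4.0
print(np.percentile([1, 3, 5, 7], 50))  # 4.0
```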
A Series is a one-dimensional-array-like object consisting of a set of data and an associated set of index labels. index: the label sequence, the same length as the data; labels are usually unique, though pandas does not enforce this. If no index argument is passed, a default integer index from 0 to N-1 is created automatically.
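A minimal sketch of both ways of building a Series:

```python
import pandas as pd

# Explicit labels
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# No index passed: a default integer index 0..N-1 is created
s2 = pd.Series([10, 20, 30])
```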