Python Data Analysis: The Pandas Data Processing Tool (2)

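The previous part covered how to construct Series and DataFrames in Pandas, how to read and write data in various formats, and how to produce a first description of the data. This part goes further: handling string and date data, data cleaning, selecting subsets of the data, pivot tables, and grouped aggregation.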

4. Handling string and date data with Pandas

[Figure: the data table to be processed]

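Data processing requirements:
Change the data types of the birthday and tel (mobile number) fields.
Add age and workage (years of service) fields based on the birthday and start_work fields.
Hide the middle four digits of the mobile number tel.
Add an email domain field based on the email information.
Extract each person's major from the other field.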

import pandas as pd

# Read in the data
employee_info = pd.read_excel(r"E:/Data/3/data_test03.xlsx",header=0)
employee_info.dtypes

name                  object
gender                object
birthday              object
start_work    datetime64[ns]
income                 int64
tel                    int64
email                 object
other                 object
dtype: object

# Change the data types
# birthday: string -> datetime; tel: integer -> string
employee_info.birthday = pd.to_datetime(employee_info.birthday, format="%Y/%m/%d")
employee_info.tel = employee_info.tel.astype(str)
employee_info.dtypes

name                  object
gender                object
birthday      datetime64[ns]
start_work    datetime64[ns]
income                 int64
tel                   object
email                 object
other                 object
dtype: object
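The Excel file used above is not available here, so the following is a minimal sketch of the same two conversions on a small made-up table. The sample values are hypothetical, and the use of errors="coerce" is an added assumption (it is not in the original code): it turns a date string that does not match the format into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data standing in for data_test03.xlsx
raw = pd.DataFrame({
    "birthday": ["1990/03/15", "1985/11/02", "not a date"],
    "tel": [13812345678, 13998765432, 13700001111],
})

# Parse the date strings; values that do not match the format become NaT
raw["birthday"] = pd.to_datetime(raw["birthday"], format="%Y/%m/%d", errors="coerce")

# Store the phone number as text so it can be sliced like a string later
raw["tel"] = raw["tel"].astype(str)

print(raw.dtypes)
```

Keeping tel as a string matters for the later masking step, since integer columns have no string slicing.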

# Add the age and workage (years of service) fields
# age = current year - year of birth
# workage = current year - year in which work started
employee_info['age'] = pd.Timestamp.today().year - employee_info.birthday.dt.year
employee_info['workage'] = pd.Timestamp.today().year - employee_info.start_work.dt.year
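A year difference overstates the age by one for anyone whose birthday has not yet occurred in the current year. The sketch below shows the year-difference calculation on hypothetical sample rows, plus an optional corrected column. The function name add_age_and_tenure, the today parameter, and the age_exact column are illustrative assumptions, not part of the original article.

```python
import pandas as pd

def add_age_and_tenure(df, today=None):
    """Return a copy of df with age, workage and age_exact columns (illustrative helper)."""
    today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
    out = df.copy()
    # Plain year differences, as in the article
    out["age"] = today.year - out["birthday"].dt.year
    out["workage"] = today.year - out["start_work"].dt.year
    # True where this year's birthday is still ahead of the reference date
    month, day = out["birthday"].dt.month, out["birthday"].dt.day
    not_yet = (month > today.month) | ((month == today.month) & (day > today.day))
    out["age_exact"] = out["age"] - not_yet.astype(int)
    return out

# Hypothetical sample data
sample = pd.DataFrame({
    "birthday": pd.to_datetime(["1990/03/15", "1985/11/02"], format="%Y/%m/%d"),
    "start_work": pd.to_datetime(["2012-07-01", "2008-09-01"]),
})

# A fixed reference date keeps the result reproducible
result = add_age_and_tenure(sample, today="2024-06-30")
print(result[["age", "workage", "age_exact"]])
```

Passing a fixed today value is only for reproducibility; omitting it falls back to the current date, as in the original code.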

# Add the email domain field
# Split the string using an anonymous (lambda) function
# split returns two parts, [user name, domain]; the domain is at index 1

employee_info['email_domain'] = employee_info.email.apply(func = lambda x: x.split('@')[1])
employee_info
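Two of the requirements listed earlier, hiding the middle four digits of tel and extracting the major from other, can be sketched with the vectorized .str accessor. The sample rows are hypothetical, and the layout of the other field (comma-separated key:value pairs containing a "major:" key) is an assumption, since the original table is not shown; the regular expression would need adjusting to the real format.

```python
import pandas as pd

# Hypothetical sample rows; the format of `other` is assumed, not taken from the source
sample = pd.DataFrame({
    "tel": ["13812345678", "13998765432"],
    "email": ["zhangsan@example.com", "lisi@example.org"],
    "other": [
        "{education:bachelor,major:e-commerce,hobby:sports}",
        "{education:master,major:statistics,hobby:reading}",
    ],
})

# Keep the first three and last four digits, replace the middle four with asterisks
sample["tel_masked"] = sample["tel"].str[:3] + "****" + sample["tel"].str[7:]

# Vectorized alternative to apply + lambda for the email domain
sample["email_domain"] = sample["email"].str.split("@").str[1]

# Capture the text after "major:" up to the next comma or closing brace
sample["major"] = sample["other"].str.extract(r"major:([^,}]+)", expand=False)

print(sample[["tel_masked", "email_domain", "major"]])
```

Slicing by position is used for the mask rather than str.replace, because replacing the four-digit substring could also hit the same digits elsewhere in the number.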

                counts  min_weight    avg_price
color cut
D     Fair         163        0.25  4291.061350
      Good         662        0.23  3405.382175
      Ideal       2834        0.20  2629.094566
      Premium     1603        0.20  3631.292576
      Very Good   1513        0.23  3470.467284
E     Fair         224        0.22  3682.312500
      Good         933        0.23  3423.644159
      Ideal       3903        0.20  2597.550090
      Premium     2337        0.20  3538.914420
      Very Good   2400        0.20  3214.652083
F     Fair         312        0.25  3827.003205
      Good         909        0.23  3495.750275
      Ideal       3826        0.23  3374.939362
      Premium     2331        0.20  4324.890176
      Very Good   2164        0.23  3778.820240
G     Fair         314        0.23  4239.254777
      Good         871        0.23  4123.482204
      Ideal       4884        0.23  3720.706388
      Premium     2924        0.23  4500.742134
      Very Good   2299        0.23  3872.753806
H     Fair         303        0.33  5135.683168
      Good         702        0.25  4276.254986
      Ideal       3115        0.23  3889.334831
      Premium     2360        0.23  5216.706780
      Very Good   1824        0.23  4535.390351
I     Fair         175        0.41  4685.445714
      Good         522        0.30  5078.532567
      Ideal       2093        0.23  4451.970377
      Premium     1428        0.23  5946.180672
      Very Good   1204        0.24  5255.879568
J     Fair         119        0.30  4975.655462
      Good         307        0.28  4574.172638
      Ideal        896        0.23  4918.186384
      Premium      808        0.30  6294.591584
      Very Good    678        0.24  5103.513274
