Welcome to a quick exercise for you to practice your pandas skills! We will be using the SF Salaries Dataset from Kaggle! Just follow along and complete the tasks outlined in bold below. The tasks will get harder and harder as you go along.
Import pandas as pd.
import pandas as pd
import numpy as np
Read Salaries.csv as a dataframe called sal.
df_Salaire = pd.read_csv('Salaries.csv')
Check the head of the DataFrame.
df_Salaire.head()
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.00 | 400184.25 | NaN | 567595.43 | 567595.43 | 2011 | NaN | San Francisco | NaN |
1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 | NaN | 538909.28 | 538909.28 | 2011 | NaN | San Francisco | NaN |
2 | 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.60 | NaN | 335279.91 | 335279.91 | 2011 | NaN | San Francisco | NaN |
3 | 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.00 | 56120.71 | 198306.90 | NaN | 332343.61 | 332343.61 | 2011 | NaN | San Francisco | NaN |
4 | 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.60 | 9737.00 | 182234.59 | NaN | 326373.19 | 326373.19 | 2011 | NaN | San Francisco | NaN |
Use the .info() method to find out how many entries there are.
df_Salaire.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 148654 entries, 0 to 148653 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 148654 non-null int64 1 EmployeeName 148654 non-null object 2 JobTitle 148654 non-null object 3 BasePay 148045 non-null float64 4 OvertimePay 148650 non-null float64 5 OtherPay 148650 non-null float64 6 Benefits 112491 non-null float64 7 TotalPay 148654 non-null float64 8 TotalPayBenefits 148654 non-null float64 9 Year 148654 non-null int64 10 Notes 0 non-null float64 11 Agency 148654 non-null object 12 Status 0 non-null float64 dtypes: float64(8), int64(2), object(3) memory usage: 14.7+ MB
What is the average BasePay ?
df_Salaire.BasePay.mean() # or df_Salaire['BasePay'].mean()
66325.44884050643
What is the highest amount of OvertimePay in the dataset ?
df_Salaire.OvertimePay.max() # or df_Salaire[OvertimePay].max()
245131.88
What is the job title of JOSEPH DRISCOLL ? Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll).
df_Salaire[df_Salaire['EmployeeName'] == 'JOSEPH DRISCOLL'] # df_Salaire[df_Salaire.EmployeeName == 'JOSEPH DRISCOLL']
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24 | 25 | JOSEPH DRISCOLL | CAPTAIN, FIRE SUPPRESSION | 140546.86 | 97868.77 | 31909.28 | NaN | 270324.91 | 270324.91 | 2011 | NaN | San Francisco | NaN |
df_Salaire[df_Salaire['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']
24 CAPTAIN, FIRE SUPPRESSION Name: JobTitle, dtype: object
How much does JOSEPH DRISCOLL make (including benefits)?
df_Salaire[df_Salaire['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']
24 270324.91 Name: TotalPayBenefits, dtype: float64
What is the name of highest paid person (including benefits)?
df_Salaire.TotalPayBenefits.max()
567595.43
df_Salaire.TotalPayBenefits == df_Salaire.TotalPayBenefits.max()
0 True 1 False 2 False 3 False 4 False ... 148649 False 148650 False 148651 False 148652 False 148653 False Name: TotalPayBenefits, Length: 148654, dtype: bool
df_Salaire[df_Salaire['TotalPayBenefits'] == df_Salaire['TotalPayBenefits'].max()]
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.0 | 400184.25 | NaN | 567595.43 | 567595.43 | 2011 | NaN | San Francisco | NaN |
df_Salaire[df_Salaire['TotalPayBenefits'] == df_Salaire['TotalPayBenefits'].max()]['EmployeeName']
0 NATHANIEL FORD Name: EmployeeName, dtype: object
df_Salaire.loc[df_Salaire['TotalPayBenefits'].idxmax()]
Id 1 EmployeeName NATHANIEL FORD JobTitle GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY BasePay 167411 OvertimePay 0 OtherPay 400184 Benefits NaN TotalPay 567595 TotalPayBenefits 567595 Year 2011 Notes NaN Agency San Francisco Status NaN Name: 0, dtype: object
df_Salaire.iloc[df_Salaire['TotalPayBenefits'].argmax()]
Id 1 EmployeeName NATHANIEL FORD JobTitle GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY BasePay 167411 OvertimePay 0 OtherPay 400184 Benefits NaN TotalPay 567595 TotalPayBenefits 567595 Year 2011 Notes NaN Agency San Francisco Status NaN Name: 0, dtype: object
What is the name of lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?
df_Salaire['TotalPayBenefits'].argmin()
148653
df_Salaire.iloc[df_Salaire['TotalPayBenefits'].argmin()]
Id 148654 EmployeeName Joe Lopez JobTitle Counselor, Log Cabin Ranch BasePay 0 OvertimePay 0 OtherPay -618.13 Benefits 0 TotalPay -618.13 TotalPayBenefits -618.13 Year 2014 Notes NaN Agency San Francisco Status NaN Name: 148653, dtype: object
df_Salaire[df_Salaire['TotalPayBenefits'] == df_Salaire['TotalPayBenefits'].min()]
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
148653 | 148654 | Joe Lopez | Counselor, Log Cabin Ranch | 0.0 | 0.0 | -618.13 | 0.0 | -618.13 | -618.13 | 2014 | NaN | San Francisco | NaN |
What was the average (mean) BasePay of all employees per year? (2011-2014) ?
df_Salaire.groupby('Year').mean()
Id | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Notes | Status | |
---|---|---|---|---|---|---|---|---|---|
Year | |||||||||
2011 | 18080.0 | 63595.956517 | 4531.065429 | 3617.081926 | NaN | 71744.103871 | 71744.103871 | NaN | NaN |
2012 | 54542.5 | 65436.406857 | 5023.417824 | 3653.437583 | 26439.966967 | 74113.262265 | 100553.229232 | NaN | NaN |
2013 | 91728.5 | 69630.030216 | 5281.641980 | 3819.969007 | 23829.076572 | 77611.443142 | 101440.519714 | NaN | NaN |
2014 | 129593.0 | 66564.421924 | 5401.993737 | 3505.421251 | 24789.601756 | 75463.918140 | 100250.918884 | NaN | NaN |
df_Salaire.groupby('Year').mean()['BasePay']
Year 2011 63595.956517 2012 65436.406857 2013 69630.030216 2014 66564.421924 Name: BasePay, dtype: float64
df_Salaire.groupby('Year').mean()['BasePay']
Year 2011 63595.956517 2012 65436.406857 2013 69630.030216 2014 66564.421924 Name: BasePay, dtype: float64
How many unique job titles are there?
df_Salaire.JobTitle.nunique()
2159
What are the top 5 most common jobs?
df_Salaire.JobTitle.value_counts(5)
Transit Operator 0.047331 Special Nurse 0.029525 Registered Nurse 0.025132 Public Svc Aide-Public Works 0.016939 Police Officer 3 0.016286 ... MEDIA PRODUCTION SUPERVISOR 0.000007 PRINCIPAL RECREATION SUPERVISOR 0.000007 SENIOR GENERAL UTILITY MECHANIC 0.000007 Chief Forensic Toxicologist 0.000007 SENIOR FOOD SERVICE SUPERVISOR 0.000007 Name: JobTitle, Length: 2159, dtype: float64
How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurence in 2013?)
Combien de titres d'emploi étaient représentés par une seule personne en 2013 ? (par exemple, les titres de postes n'ayant qu'une seule occurrence en 2013)
df_Salaire[df_Salaire['Year']==2013]['JobTitle'].value_counts()==1
Transit Operator False Special Nurse False Registered Nurse False Public Svc Aide-Public Works False Custodian False ... IS Operator-Journey True Chief Adult Probation Officer True Asphalt Plant Supervisor 1 True Laboratory Technician I True Asst Chf, Bur Clm Invest&Admin True Name: JobTitle, Length: 1051, dtype: bool
sum(df_Salaire[df_Salaire['Year']==2013]['JobTitle'].value_counts()==1)
202
How many people have the word Chief in their job title? (This is pretty tricky)
def chief_string(title):
if 'chief' in title.lower().split():
return True
else:
return False
chief_string('GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY')
False
chief_string('chief MANAGER-METROPOLITAN TRANSIT AUTHORITY')
True
df_Salaire.JobTitle.iloc[10]
'ASSISTANT CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)'
df_Salaire.JobTitle.apply(lambda x: chief_string(x))
0 False 1 False 2 False 3 False 4 True ... 148649 False 148650 False 148651 False 148652 False 148653 False Name: JobTitle, Length: 148654, dtype: bool
sum(df_Salaire.JobTitle.apply(lambda x: chief_string(x)))
477
df_Salaire.JobTitle
0 GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 1 CAPTAIN III (POLICE DEPARTMENT) 2 CAPTAIN III (POLICE DEPARTMENT) 3 WIRE ROPE CABLE MAINTENANCE MECHANIC 4 DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) ... 148649 Custodian 148650 Not provided 148651 Not provided 148652 Not provided 148653 Counselor, Log Cabin Ranch Name: JobTitle, Length: 148654, dtype: object
477
Bonus : Y a-t-il une corrélation entre la longueur de la chaîne d'intitulé du poste et le salaire ?
df_Salaire['title_len'] = df_Salaire['JobTitle'].apply(len)
df_Salaire[['JobTitle', 'title_len']].corr()
title_len | |
---|---|
title_len | 1.0 |