Part I: Data Exploration and Cleaning

Recently I spent one and a half months learning this course, and I have so much fun in it! Now since I have completed 80 days of lessons, it is time for me to sort out what I’ve learned before I move on!

In this course, I learned data analysis and data science on Day 71–80. Here is the Part I.

Photo by Windows on Unsplash

Preliminary Exploration

After importing a CSV, the first step is to check out your data:

  • What’s the dimension?
  • What are the column names?
  • Have a glimpse at the first/last 5 rows or just random 5 rows.
  • What are the data types in each column and are there any NaN values?
df.shape 
df.columns
df.head()/df.tail()/df.sample()
df.info()

Data Cleaning

In order to have a high quality data analysis, it is important to have CLEAN and organized data before you dive into data analysis.

You need to check where there exist problems as follows and if so, how will you deal with them?

  1. NaN
  • Is there any?
# To check whether there are NaN in each column
df.isna().any()
# To check how many NaNs in each column
df.isna().sum()
  • How to deal with it?

You can delete it if the sample size is large enough

df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# axis=0,delete rows/axis=1,delete columns
# how='any',delete columns/rows that include any NaN; how='all', delete columns/rows that are all NaN

You can also fill it with other values — the mean or the mode.

#example
df['Age'].fillna(df['Age'].mean(), inplace=True)

2. Duplicates

Check if there are any duplicates. You can set a subset to check specific columns.

df.duplicated().any()
df.duplicated(subset=['brand'])

Remove them:

df.drop_duplicates(subset=['brand', 'style'], keep='last',inplace=True)
# keep='last'/'first'/'False'; False--drop all duplicates

3. Data Type

convert to date

pd.to_datetime()

convert to numeric

pd.to_numeric()

4. Is data validated?

Sometimes, the values cannot be identified by Python. For example, the number 60,000 cannot be taken as an integer due to the comma, and you need to remove it and then convert it.

df.price=df.price.astype(str).str.replace(',','')df.price=pd.to_numeric(df_apps_clean.Installs)

Sometimes, values are combined together in one column, such as First Name and Last Name, or something like 2/3. We need to split them by using split(‘’,expand=True)

#split 2/3 and turn it into a float values=df_data['prize_share'].str.split('/',expand=True)df_data['share_pct']=pd.to_numeric(values[0])/pd.to_numeric(values[1])

For date type, if it is daily data but you want to turn it into monthly data, you can use resample:

df_btc_monthly=df_btc_price.resample('M',on='DATE').last()

Extract nested data

For example, if one item falls into two categories, we need to split them

# use split first so that there are two columns for each genresstack = df.Genres.str.split(';',expand=True)#then stack stack.stack()

Pivot

test_df = pd.DataFrame({'Age': ['Young', 'Young', 'Young', 'Young', 'Old', 'Old', 'Old', 'Old'],'Actor': ['Jack', 'Arnold', 'Keanu', 'Sylvester', 'Jack', 'Arnold', 'Keanu', 'Sylvester'],'Power': [100, 80, 25, 50, 99, 75, 5, 30]})pivote_df=test_df.pivot(index='Age',columns='Actor',values='Power')pivote_df

Data Manipulation

filter data

df_apps_clean=df_apps_clean[df_apps_clean.Price<250]# on multi conditions-- use loc or queryinternational_release=data.loc[(data.USD_Domestic_Gross==0)&(data.USD_Worldwide_Gross!=0)]international_release=data.query('USD_Domestic_Gross==0 and USD_Worldwide_Gross !=0')

select/drop data

col_subset = ['year', 'category', 'laureate_type', 'full_name']multiple_winners[col_subset]]# or drop columnsfeatures = data.drop('PRICE', axis=1)# drop specified data future_releases=data[data['Release_Date']>scrape_datedata_clean=data.drop(future_releases.index)

change column names

themes_by_year.rename(columns={'theme_id':'nr_themes'},inplace=True)`

add columns

df['new_column']# or use Insert which can specify the location 

Index is quite important! If one column is index, then you can access it by its column name.

#setdf.set_index('column_name')#make it column again df.reset_index()

rank based on a specific column

df.sort_values('col',ascending=True,inplace=True)

I will update basic data analysis in the next article. If there is anything wrong or confusing, please tell me and I will look it up : )

Data Science | Lifestyle | Learning Notes