Part II: Data Analysis

Yan
4 min read · Aug 28, 2021

Here is Part I

Once we have clean, organized data, we can start the analysis.

I think there are several important things that we need to make clear when analyzing data:

It is all about problem-solving.

We need to know why we are doing the analysis: what problem we are solving, or what we are exploring.

As long as we are clear about our goals, even if we don't know how to do something, we can look it up on Google/Stack Overflow, and that is exactly the process of learning.


Pay attention to the data structure/type.

Even when we get the data we want, we must be clear about its structure before we can use it, especially since the same data can often be obtained in different ways.

When I was learning, I was confused by Series vs. DataFrame, particularly because value_counts returns a Series while groupby + agg returns a DataFrame. Once I got the hang of that, things became much easier.


Data Analysis

Select data you need

There are many ways to access specific rows and columns.

loc VS iloc

loc is label based — df.loc['row_label']

while iloc is position based — df.iloc[0]

Slices

df.loc['ind0':'ind3']  # label slice: start and end both included
df.iloc[0:3]           # position slice: end not included
# we can select rows and columns at the same time
df.loc[0:5, 'col0':'col3']
df.iloc[0:6, [2, 3, 5]]
# select all rows / all columns
df.loc[:, 'col0':'col3']
df.loc[0:5, :]
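To see the inclusive-vs-exclusive difference concretely, here is a minimal sketch with a hypothetical toy DataFrame (the labels and values are made up for illustration):

```python
import pandas as pd

# toy DataFrame with string row labels to contrast loc and iloc
df = pd.DataFrame(
    {"col0": [1, 2, 3, 4], "col1": [5, 6, 7, 8]},
    index=["ind0", "ind1", "ind2", "ind3"],
)

# loc slices by label and INCLUDES the end label -> 3 rows
by_label = df.loc["ind0":"ind2"]

# iloc slices by position and EXCLUDES the end -> 2 rows
by_position = df.iloc[0:2]

print(len(by_label), len(by_position))  # 3 2
```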

Count

When counting, we may have different purposes and thus use different methods.

  1. df.column.nunique() — count the number of unique values in a column
  2. df.column.value_counts() — tell you how many times each value appears — returns a Series
  3. df.count() — count non-NA cells for each column or row
  4. df.groupby('column').count() — count non-NA cells within each group — returns a DataFrame
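A quick sketch comparing the four, using hypothetical toy data (note how the None value is excluded by every counting method):

```python
import pandas as pd

# hypothetical toy data to compare the four counting methods
df = pd.DataFrame({"city": ["NY", "NY", "LA", None], "sales": [1, 2, 3, 4]})

n_unique = df.city.nunique()          # 2 -- None is not counted
counts = df.city.value_counts()       # Series: NY -> 2, LA -> 1
non_na = df.count()                   # non-NA cells per column: city 3, sales 4
grouped = df.groupby("city").count()  # DataFrame indexed by city

print(n_unique)                # 2
print(type(counts).__name__)   # Series
print(type(grouped).__name__)  # DataFrame
```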

Sort

Sort the df based on given column(s) and return the corresponding dataframe.

df.sort_values(by='column', ascending=True, inplace=True)
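For example, with a hypothetical scores table (note that inplace=True modifies the DataFrame and returns None):

```python
import pandas as pd

# hypothetical scores table
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [30, 10, 20]})

# descending sort on one column; inplace=True modifies df in place
df.sort_values(by="score", ascending=False, inplace=True)

print(df.score.tolist())  # [30, 20, 10]
```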

Merge

When you have two DataFrames and need information from each, you need to merge them.

merged_df=pd.merge(df1,df2,on='column',how='inner')
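A minimal sketch with two hypothetical tables sharing an 'id' column, showing that an inner merge keeps only the keys present in both frames:

```python
import pandas as pd

# two hypothetical tables sharing an 'id' column
orders = pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 30]})
users = pd.DataFrame({"id": [1, 2, 4], "name": ["ann", "bob", "cat"]})

# inner merge keeps only ids present in BOTH frames (1 and 2)
merged = pd.merge(orders, users, on="id", how="inner")

print(merged.id.tolist())  # [1, 2]
```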

Groupby

When doing data analysis, we inevitably use groupby a lot, because grouping the data lets us compare categories and can reveal useful insights.

df.groupby('col').sum() — returns a DataFrame aggregating all columns

df.groupby('col').agg({'col_x': 'sum'}) — run an operation on a specific column — including sum/count/mean/…

We can also use it to compute a cumulative sum: just use groupby twice!

For example, if we want to calculate the cumulative sum of people who win prizes in each country, we can do this:

# Get the data we need; as_index=False keeps the group keys as columns so we can sort
prize_by_year = df_data.groupby(by=['birth_country', 'year'], as_index=False).count()
prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country', 'prize']]
# Use groupby twice; the second groupby(level=0) groups by the first index level
# (birth_country), so cumsum runs within each country
# Don't forget to reset the index!
cumulative_prizes = prize_by_year.groupby(by=['birth_country', 'year']).sum().groupby(level=0).cumsum().reset_index()
cumulative_prizes
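The same double-groupby pattern on a toy dataset makes it easier to verify (the countries, years, and prize counts here are made up):

```python
import pandas as pd

# toy prize counts per country and year (hypothetical numbers)
df = pd.DataFrame({
    "country": ["US", "US", "UK", "UK"],
    "year": [2000, 2001, 2000, 2001],
    "prize": [1, 2, 3, 1],
})

# first groupby sums per (country, year); the second groupby(level=0)
# groups by country, so cumsum accumulates within each country over the years
cum = (df.groupby(["country", "year"]).sum()
         .groupby(level=0).cumsum()
         .reset_index())

print(cum[cum.country == "US"].prize.tolist())  # [1, 3]
```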

Calculation

Firstly, make sure the data type is right!

Secondly, when we get the result, we may want it to be readable for reporting and thus we need to round the result. Here are two ways to round to the nearest hundredth:

  1. round(1/3, 2)
  2. f'{1/3:.2f}'

Deal with Time

Oftentimes we need to work with time-series data to draw insights, which requires manipulating it as we want, such as extracting the year from a datetime or calculating moving averages. Here is what I've encountered.

from Year to Decade

First we can get the year using pd.DatetimeIndex(df.date).year or df.date.dt.year

Then we use floor division so that we can turn 1988 into 1980:

df['decade'] = 10 * (pd.DatetimeIndex(df['Release_Date']).year // 10)
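A self-contained sketch with hypothetical release dates:

```python
import pandas as pd

# hypothetical release dates
df = pd.DataFrame({"Release_Date": pd.to_datetime(["1988-06-01", "1994-03-15"])})

# floor-divide the year by 10, then multiply back: 1988 -> 1980
df["decade"] = 10 * (df["Release_Date"].dt.year // 10)

print(df["decade"].tolist())  # [1980, 1990]
```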

Moving average

In order to see the long-term trend better, sometimes we use a moving average. Calling rolling(...).mean() on a Series returns a Series.

# prepare data: number of prizes per year
prize_per_year = df_data.groupby(by='year').count().prize
# rolling window of 5 years
moving_average = prize_per_year.rolling(window=5).mean()
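On a toy Series with made-up yearly totals, the behavior of the window is easy to check: the first four entries are NaN because the window is not yet full.

```python
import pandas as pd

# toy yearly totals to show the 5-year rolling mean (a Series)
prize_per_year = pd.Series([1, 2, 3, 4, 5, 6], index=range(2000, 2006))

moving_average = prize_per_year.rolling(window=5).mean()

# first 4 values are NaN; then the means of 1..5 and 2..6
print(moving_average.tolist())  # [nan, nan, nan, nan, 3.0, 4.0]
```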

Resample: from Daily data to Monthly data

df_monthly = df_price.resample('M', on='DATE').last()
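A sketch with hypothetical daily prices spanning a month boundary: 'M' buckets rows by month end, and .last() keeps each month's final row.

```python
import pandas as pd

# hypothetical daily prices spanning a month boundary
df_price = pd.DataFrame({
    "DATE": pd.date_range("2021-01-30", periods=4, freq="D"),
    "price": [10, 11, 12, 13],
})

# group the daily rows into monthly buckets and keep the last row of each month
df_monthly = df_price.resample("M", on="DATE").last()

print(df_monthly.price.tolist())  # [11, 13]
```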

Everything I'm sharing here comes from projects I've done, and as I learn more I will keep updating this post! If anything is wrong or confusing, feel free to tell me in the comments! Tata!
