Data analysis notebook for YouBike services#

We need to install pandas, matplotlib, and seaborn in the virtual environment.
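If the packages are not installed yet, a minimal install cell for a pip-based virtual environment could look like this (versions are not pinned here):

# install the required packages into the active environment (IPython magic)
%pip install pandas matplotlib seaborn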

import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt

# keep notebook output compact: show at most 5 rows,
# and render large DataFrames as an info() summary instead of a full table
pd.set_option('display.max_rows', 5)
pd.set_option('display.large_repr', 'info')

# read every CSV file under ./data/ and concatenate them into a single DataFrame
li = []
for dirname, _, filenames in os.walk('./data/'):
    for filename in filenames:
        df = pd.read_csv(os.path.join(dirname, filename), index_col=None, header=0)
        li.append(df)

bikecsv = pd.concat(li, axis=0, ignore_index=True)

bikecsv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814262 entries, 0 to 814261
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   _id              814262 non-null  object
 1   station_no       814262 non-null  int64 
 2   station_name     814262 non-null  object
 3   station_name_en  814262 non-null  object
 4   district         814262 non-null  object
 5   district_en      814262 non-null  object
 6   total            814262 non-null  int64 
 7   rentAvailable    814262 non-null  int64 
 8   returnAvaiable   814262 non-null  int64 
 9   updated_time     814262 non-null  object
 10  info_time        814262 non-null  object
dtypes: int64(4), object(7)
memory usage: 68.3+ MB

Before analysis, it is necessary to know the columns in the CSV data#

Here is a table describing the meaning and type of each column:

| Column Name | Type (Numerical/Categorical) | Nullable | Description |
|---|---|---|---|
| _id | Categorical | non-null | UUID generated by MongoDB, also serves as a serial ID |
| station_no | Categorical | non-null | Station serial number |
| station_name | Categorical | non-null | Station name |
| district | Categorical | non-null | District in which the station is located |
| total | Numerical | non-null | Total docks at the station (total = rentAvailable + returnAvailable) |
| rentAvailable | Numerical | non-null | Bikes available for rent at the station |
| returnAvailable | Numerical | non-null | Empty docks available for returning bikes at the station |
| updated_time | Date string | non-null | Date-time string in TZ GMT+8 |
| info_time | Date string | non-null | Date-time string in TZ GMT+8 |

Additionally, info_time can be categorized as workday or weekend for analysis.
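As a quick sketch of the idea (the full categorization is implemented later in this notebook), a pandas Timestamp exposes its weekday directly, where values 0-4 mean Monday through Friday; the timestamp below is only an illustrative example:

# hypothetical example: weekday() < 5 means Monday-Friday
pd.Timestamp('2024-01-06 10:00:00').weekday() < 5   # False, 2024-01-06 is a Saturday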

After identifying the meaning of each data column, here is a checklist for processing the DataFrame:#

  • Define the outliers according to the description of columns

  • Identify and remove erroneous data in the CSV

  • Process date strings with pandas

  • Drop some serial ID columns

Drop some categorical columns from the dataset:#

# drop the serial-ID columns and keep only the *_en name columns;
# passing columns=... already implies axis=1 (columns), so axis is not needed
bikecsv = bikecsv.drop(columns=['_id', 'station_no', 'station_name', 'district'])
bikecsv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814262 entries, 0 to 814261
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   station_name_en  814262 non-null  object
 1   district_en      814262 non-null  object
 2   total            814262 non-null  int64 
 3   rentAvailable    814262 non-null  int64 
 4   returnAvaiable   814262 non-null  int64 
 5   updated_time     814262 non-null  object
 6   info_time        814262 non-null  object
dtypes: int64(3), object(4)
memory usage: 43.5+ MB

Next, we need to remove the erroneous data:#

As defined in the column descriptions, total is equal to rentAvailable plus returnAvailable.

# keep only rows where the dock counts are consistent; .copy() makes filtered an
# independent DataFrame so later column assignments do not raise SettingWithCopyWarning
filtered = bikecsv.loc[bikecsv['total'] == bikecsv['rentAvailable'] + bikecsv['returnAvaiable'], :].copy()

print('Correct data Percentage: '+str(round(len(filtered)/len(bikecsv),4) * 100)+'%')
Correct data Percentage: 86.86%
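A small formatting note: multiplying a rounded fraction by 100 can reintroduce floating-point noise in the printed string. A hedged alternative (not what this notebook originally ran) is Python's percentage format specifier:

# format the fraction directly as a percentage with two decimal places
print(f'Correct data Percentage: {len(filtered) / len(bikecsv):.2%}')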

Transforming date strings into the timestamp data type#

print("Let's show some insights from the dates")

## convert the datetime strings to the datetime64 dtype
filtered['info_time'] = pd.to_datetime(filtered['info_time'], format='%Y-%m-%d %H:%M:%S')
filtered['updated_time'] = pd.to_datetime(filtered['updated_time'], format='%Y-%m-%d %H:%M:%S')

## time difference, in minutes, between the station's last update and the fetch time
timeDiff = (filtered['info_time'] - filtered['updated_time']) / pd.Timedelta(minutes=1)

filtered['TimeDiff'] = timeDiff
Let's show some insights from the dates
## determine whether a timestamp falls within working hours on a workday
def isWorkHour(ts):
    work_day = ts.weekday() < 5                     # Monday-Friday
    work_hour = (ts.hour >= 8) and (ts.hour < 19)   # 08:00-18:59
    return work_day and work_hour

workday = filtered['info_time'].apply(isWorkHour)

filtered['isWorkDay'] = workday   # True only during working hours on weekdays

Create a new column, ratio#

This column indicates the renting rate, i.e. rentAvailable as a percentage of total, at each station and point in time.

## Add ratio between rentAvailable and total
filtered['ratio'] = filtered['rentAvailable'] / filtered['total'] * 100

filtered
<class 'pandas.core.frame.DataFrame'>
Index: 707282 entries, 0 to 814261
Data columns (total 10 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   station_name_en  707282 non-null  object        
 1   district_en      707282 non-null  object        
 2   total            707282 non-null  int64         
 3   rentAvailable    707282 non-null  int64         
 4   returnAvaiable   707282 non-null  int64         
 5   updated_time     707282 non-null  datetime64[ns]
 6   info_time        707282 non-null  datetime64[ns]
 7   TimeDiff         707282 non-null  float64       
 8   isWorkDay        707282 non-null  bool          
 9   ratio            707282 non-null  float64       
dtypes: bool(1), datetime64[ns](2), float64(2), int64(3), object(2)
memory usage: 54.6+ MB

After feature engineering, let’s work on data visualization#

We will use info_time, the timestamp at which the open data was fetched, and district_en, the district in which each bike station is located, and compute the average renting rate for every hour and district.

# group by hour of 'info_time' and by 'district_en'
groupByHourAndDistrict = filtered.groupby([filtered['info_time'].dt.hour, 'district_en'])
# mean renting ratio per hour, with one column per district
grouped = groupByHourAndDistrict['ratio'].mean().unstack()

# one distinct color per district; plt.colormaps replaces the deprecated plt.cm.get_cmap
colors = plt.colormaps['tab10'].resampled(len(grouped.columns))

# to use a custom figure size, plt.figure(figsize=...) would have to be called here, before plotting
for i, area in enumerate(grouped.columns):
    plt.plot(grouped.index, grouped[area], marker='o', label=area, color=colors(i))

plt.title('Distribution of Info Time by District')
plt.xlabel('Hour in a day')
plt.ylabel('rentAvailable percentage avg.')
plt.legend(title='District', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
_images/7feda38f990d2e6b07af2da9c135c67d4a0c3304234e1d776a2927e44268a167.png

Inspecting the distribution of the total number of parking spaces per station#

  • Using a histogram, find the mode of the total number of parking spaces.

  • Using a boxplot, identify the quartiles of the total number of parking spaces.

groupByStation = bikecsv.groupby(['station_name_en'], as_index=False)

dataframe = groupByStation['total'].max().sort_values(by='total',ascending=True)

parkinglots = dataframe['total']
print(len(parkinglots))

plt.style.use('ggplot')
fig = plt.figure('Parking distribution',figsize=(16,10))
# create two subplots side by side
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

ax1.boxplot(parkinglots,whis=2)
ax1.set_title('Boxplots of Parking Spaces')
ax1.set_xlabel('Station')
ax1.set_ylabel('Total Parking Spaces')

ax2.hist(parkinglots, bins=20)
ax2.set_title('Histogram of Parking Spaces')
ax2.set_xlabel('Total Parking Spaces')
ax2.set_ylabel('Frequency')


plt.tight_layout()
plt.show()
1408
_images/e8066a7f442168b483620f54829b613336357e39342d92a0e25cd45140d8adcc.png

The boxplot and histogram show that the total number of parking spaces varies widely across stations: most stations sit at the smaller end of the range, with a long tail of much larger stations.

These visualizations also show that the most common station size is 15 parking spaces. Additionally, we can split the stations into four ranges to compare their average usage rates.
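For reference, pandas can compute equal-frequency (quartile-based) groups directly with pd.qcut; this is only an illustrative sketch, while the notebook below uses hand-picked bin edges derived from the describe() output:

# split the stations into four equal-frequency groups by total parking spaces
# (duplicates='drop' guards against repeated bin edges)
quartile_bins = pd.qcut(parkinglots, 4, duplicates='drop')
print(quartile_bins.value_counts())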

parkinglots.describe()
count    1408.00000
mean       24.99929
            ...    
75%        31.00000
max        99.00000
Name: total, Length: 8, dtype: float64

parkinglots.describe() provides summary statistics, including the quartiles and the standard deviation of parking spaces across stations.
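Note that the output above is truncated by the pd.set_option('display.max_rows', 5) call at the top of this notebook; the limit can be lifted temporarily to inspect every statistic:

# temporarily lift the row limit so the full describe() output is printed
with pd.option_context('display.max_rows', None):
    print(parkinglots.describe())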

In the following code, we will use pandas' built-in cut method to label each record with a range based on the quartiles of parking spaces, plus an extra bin for outliers.

We will then inspect the renting ratios for each range of total parking spaces, including the outlier bin.

qtrBikecsv = filtered.copy()
# bin edges based on the distribution of parkinglots; the last bin (60, 99] collects the outliers
qcuts = pd.cut(qtrBikecsv['total'], bins=[0, 15, 20, 30, 60, 99], labels=['Q1', 'Q2', 'Q3', 'Q4', 'Outlier'])

qtrBikecsv = pd.concat((qtrBikecsv, qcuts.rename('range')),axis=1)

qtrBikecsv
<class 'pandas.core.frame.DataFrame'>
Index: 707282 entries, 0 to 814261
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   station_name_en  707282 non-null  object        
 1   district_en      707282 non-null  object        
 2   total            707282 non-null  int64         
 3   rentAvailable    707282 non-null  int64         
 4   returnAvaiable   707282 non-null  int64         
 5   updated_time     707282 non-null  datetime64[ns]
 6   info_time        707282 non-null  datetime64[ns]
 7   TimeDiff         707282 non-null  float64       
 8   isWorkDay        707282 non-null  bool          
 9   ratio            707282 non-null  float64       
 10  range            707282 non-null  category      
dtypes: bool(1), category(1), datetime64[ns](2), float64(2), int64(3), object(2)
memory usage: 55.3+ MB
# set the chart style with matplotlib
plt.style.use('seaborn-v0_8-darkgrid')

# unique values of the 'range' column
uniq_ranges = qtrBikecsv['range'].unique()

# number of subplot rows and columns
num_rows = len(uniq_ranges)
num_cols = 1  # one chart per row

# create the subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(10, 6*num_rows), sharey=True)

# handle each unique range value in turn
for i, uniq in enumerate(uniq_ranges):
    # select the records in this range, then group by hour of 'info_time' and by district
    subset = qtrBikecsv.loc[qtrBikecsv['range'] == uniq, :]
    groupQtrDist = subset.groupby([subset['info_time'].dt.hour, 'district_en'])

    # mean ratio per group, with the district level unstacked into columns
    pivot_table = groupQtrDist['ratio'].mean().unstack()
    # draw the line chart
    pivot_table.plot(kind='line', ax=axes[i])

    # set the title and axis labels
    axes[i].set_title(f'Ratio by District for Range {uniq}')
    axes[i].set_xlabel('Hour of the Day')
    axes[i].set_ylabel('Mean Ratio' if i == 0 else "")  # only show the y-axis label on the first subplot
    # set the legend
    axes[i].legend(title='District', bbox_to_anchor=(1, 1))

# adjust the spacing between subplots
plt.tight_layout()

# show the chart
plt.show()
_images/bc2cf64491ea74cc51904f3a1f26a42d483564f1b1bcb11f865ec0f6a128e91c.png

Comparing the Outlier chart with the other ranges shows that the average renting ratios of outlier stations change drastically from hour to hour.

This suggests that those hourly renting ratios would distort a linear regression model.

Thus, we need to remove the outliers from the dataset before building a linear model, as sketched below.
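A minimal sketch of that final step, using the 'range' labels created above (the variable name modelData is only illustrative):

# drop the outlier stations before fitting any linear model
modelData = qtrBikecsv.loc[qtrBikecsv['range'] != 'Outlier'].copy()
print(len(modelData), 'records remain after removing the outlier bin')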