Data Preparation

This module covers the following steps:

  ➤ 1. Read Input Data

  ➤ 2. Display columns

  ➤ 3. Print the data for visual inspection

  ➤ 4. Verify the data type

  ➤ 5. Convert object data type to numerical data type

  ➤ 6. Display converted data

  ➤ 7. Save the data into a CSV file

Data Preparation

↪ 1. Read Input Data

The input typically requires cleaning before the data can be analyzed. Data cleaning is the process of preparing the data for analysis by fixing or removing incorrectly formatted, duplicate, incomplete, irrelevant, or improperly formatted data.

The steps required for the data cleaning process are generally on the data itself and the process may require a significant effort which could be time-consuming and tedious.

The Data Cleaning process is out of the scope of this tutorial. The following pages give some flavor for data cleaning which is suitable only for the data that was read above.

The read_csv() method of the Pandas library can be used to read a CSV file.

      import pandas as pd
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda.csv', 
                       header=0)

Data Preparation

↪ 2. Display columns

      print("Input Dataframe columns:", data.columns)

      ---Output---       #Input Dataframe columns: Index(['Symbol ', 'Series ', 'Date ', 'Prev Close ', 'Open Price ',       # 'High Price ', 'Low Price ', 'Last Price ', 'Close Price ',       # 'Average Price ', 'Total Traded Quantity ', 'Turnover ₹ ',       # 'No. of Trades ', 'Deliverable Qty ', '% Dly Qt to Traded Qty '],       # dtype='object')

Observe the spaces in column names and those need fixing for easy reference of the columns. The below line is the simplest step to fix the column names.

      data.columns = ['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price',
             'High Price', 'Low Price', 'Last Price', 'Close Price',
             'Average Price', 'Total Traded Quantity', 'Turnover',
             'No. of Trades', 'Deliverable Qty', '% Dly Qt to Traded Qty']

Data Preparation

↪ 3. Print the data for visual inspection

      print(data.head(5))

      ---Output---       # - Symbol Series Date Prev Close Open Price High Price Low Price       # 0 ANDROMEDA EQ 31-May-2022 1,401.55 1,395.90 1,404.80 1,380.00       # 1 ANDROMEDA EQ 01-Jun-2022 1,388.95 1,380.00 1,400.45 1,379.15       # 2 ANDROMEDA EQ 02-Jun-2022 1,394.85 1,387.25 1,388.00 1,378.60       # 3 ANDROMEDA EQ 03-Jun-2022 1,385.10 1,398.30 1,400.75 1,376.30       # 4 ANDROMEDA EQ 06-Jun-2022 1,380.30 1,376.30 1,387.00 1,360.65       #       # - Last Price Close Price Average Price Total Traded Quantity       # 0 1,384.00 1,388.95 1,396.23 6,742,694       # 1 1,396.10 1,394.85 1,391.06 6,045,948       # 2 1,385.20 1,385.10 1,383.53 5,737,510       # 3 1,379.00 1,380.30 1,388.75 3,478,622       # 4 1,377.00 1,378.45 1,380.25 3,086,633       #       # - Turnover No. of Trades Deliverable Qty % Dly Qt to Traded Qty       # 0 9,414,334,821.10 178,695 4,400,633 65.27       # 1 8,410,276,065.05 152,770 3,461,082 57.25       # 2 7,938,005,254.70 160,429 3,458,109 60.27       # 3 4,830,949,975.05 174,541 1,879,231 54.02       # 4 4,260,332,744.20 126,359 1,388,767 44.99
Question: What is the issue?

Data Preparation

↪ 4. Verify the data type

Observe that the numerical data has a 'comma' in them. The info() method shows the data type.

      data.info()

      ---Output---       #       # RangeIndex: 250 entries, 0 to 249       # Data columns (total 15 columns):       # # Column Non-Null Count Dtype       # --- ------ -------------- -----       # 0 Symbol 250 non-null object       # 1 Series 250 non-null object       # 2 Date 250 non-null object       # 3 Prev Close 250 non-null object       # 4 Open Price 250 non-null object       # 5 High Price 250 non-null object       # 6 Low Price 250 non-null object       # 7 Last Price 250 non-null object       # 8 Close Price 250 non-null object       # 9 Average Price 250 non-null object       # 10 Total Traded Quantity 250 non-null object       # 11 Turnover ₹ 250 non-null object       # 12 No. of Trades 250 non-null object       # 13 Deliverable Qty 250 non-null object       # 14 % Dly Qt to Traded Qty 250 non-null float64       # dtypes: float64(1), object(14)       # memory usage: 29.4+ KB

All the numerical columns are of the 'object' data type. These columns need to be converted into numerical data type. The two steps below convert the 'object' columns into 'float' columns.

Data Preparation

↪ 5. Convert object data type to numerical data type

Step 1: Remove comma from the numerical columns

      data["Close Price"] = data["Close Price"].str.replace(',','')
      data["Prev Close"] = data["Prev Close"].str.replace(',','')
      data["Open Price"] = data["Open Price"].str.replace(',','')
      data["High Price"] = data["High Price"].str.replace(',','')
      data["Low Price"] = data["Low Price"].str.replace(',','')
      data["Last Price"] = data["Last Price"].str.replace(',','')
      data["Average Price"] = data["Average Price"].str.replace(',','')
      data["No. of Trades"] = data["No. of Trades"].str.replace(',','')
      data["Deliverable Qty"] = data["Deliverable Qty"].str.replace(',','')
      data["Total Traded Quantity"] = data["Total Traded Quantity"].str.replace(',','')
      data["Turnover"] = data["Turnover"].str.replace(',','')

Step 2: Convert the data type into float

      data["Close Price"] = data["Close Price"].astype(float)
      data["Prev Close"] = data["Prev Close"].astype(float)
      data["Open Price"] = data["Open Price"].astype(float)
      data["High Price"] = data["High Price"].astype(float)
      data["Low Price"] = data["Low Price"].astype(float)
      data["Last Price"] = data["Last Price"].astype(float)
      data["Average Price"] = data["Average Price"].astype(float)
      data["No. of Trades"] = data["No. of Trades"].astype(float)
      data["Deliverable Qty"] = data["Deliverable Qty"].astype(float)
      data["Total Traded Quantity"] = data["Total Traded Quantity"].astype(float)
      data["Turnover"] = data["Turnover"].astype(float)

Again, display the data type.

      data.info()

      ---Output---       #       # Int64Index: 250 entries, 104 to 44       # Data columns (total 15 columns):       # # Column Non-Null Count Dtype       # --- ------ -------------- -----       # 0 Symbol 250 non-null object       # 1 Series 250 non-null object       # 2 Date #Heatmap 250 non-null object       # 3 Prev Close 250 non-null float64       # 4 Open Price 250 non-null float64       # 5 High Price 250 non-null float64       # 6 Low Price 250 non-null float64       # 7 Last Price 250 non-null float64       # 8 Close Price 250 non-null float64       # 9 Average Price 250 non-null float64       # 10 Total Traded Quantity 250 non-null float64       # 11 Turnover 250 non-null float64       # 12 No. of Trades 250 non-null float64       # 13 Deliverable Qty 250 non-null float64       # 14 % Dly Qt to Traded Qty 250 non-null float64       # dtypes: float64(12), object(3)       # memory usage: 31.2+ KB

Data Preparation

↪ 6. Display converted data

The numerical data columns have been converted into a float data type. Print the data again.

      print(data.head(5))

      ---Output---       # - Symbol Series Date Prev Close Open Price High Price       # 0 ANDROMEDA EQ 31-May-2022 1401.55 1395.90 1404.80       # 1 ANDROMEDA EQ 01-Jun-2022 1388.95 1380.00 1400.45       # 2 ANDROMEDA EQ 02-Jun-2022 1394.85 1387.25 1388.00       # 3 ANDROMEDA EQ 03-Jun-2022 1385.10 1398.30 1400.75       # 4 ANDROMEDA EQ 06-Jun-2022 1380.30 1376.30 1387.00       #       # - Low Price Last Price Close Price Average Price Total Traded Quantity       # 0 1380.00 1384.0 1388.95 1396.23 6742694.0       # 1 1379.15 1396.1 1394.85 1391.06 6045948.0       # 2 1378.60 1385.2 1385.10 1383.53 5737510.0       # 3 1376.30 1379.0 1380.30 1388.75 3478622.0       # 4 1360.65 1377.0 1378.45 1380.25 3086633.0       #       # - Turnover No. of Trades Deliverable Qty % Dly Qt to Traded Qty       # 0 9.414335e+09 178695.0 4400633.0 65.27       # 1 8.410276e+09 152770.0 3461082.0 57.25       # 2 7.938005e+09 160429.0 3458109.0 60.27       # 3 4.830950e+09 174541.0 1879231.0 54.02       # 4 4.260333e+09 126359.0 1388767.0 44.99

Data Preparation

↪ 7. Save the data into a CSV file

      data.to_csv("andromeda-cleaned.csv", index=False)