FIZ353 - Numerical Analysis | 09/10/2020
Importing a CSV file with Pandas
Getting rid of the unnecessary header stuff
Designating the column labels
Designating the index using one of the columns
Accessing the data
Old-school style: referring via the col/row indexes
Referring via the col/row labels
Filtering
Processing
Creating a DataFrame
Exporting the DataFrame to a CSV file
Emre S. Tasci emre.tasci@hacettepe.edu.tr
Numerical analysis deals with various kinds of data: experimental results, surveys, collected values, and so on. But one thing is almost always certain: there is a huge amount of data lying around, and we will be trying to make the most of it.
The first step of data processing is, of course, to introduce the data to our code by importing (loading) it.
Most of the time, the data is collected in matrix form and stored in comma-separated value (CSV) form, e.g.,
The data above is taken from the meteoblue site and contains the temperature, humidity and precipitation data for Ankara for this month so far (8 days). But as you can observe, the actual data begins on the 11th line, with the previous lines containing identification information such as location, coordinates, units, etc. These kinds of preliminary lines are referred to as headers.
Importing the data in a CSV file is pretty straightforward using Pandas' read_csv command:
import pandas as pd
data_with_headerclutter = pd.read_csv("data/01_dataexport_20201008T180753.csv")
print(data_with_headerclutter)
Here we see a couple of things, the most important one being that Pandas doesn't worry about different types of variables the way NumPy would. It just takes whatever it finds and, for the moment, treats the first line as the header row, taking everything in.
Second, when we ask it to display the imported data, it prints a summary: 5 lines from the top and 5 lines from the bottom, also reporting the true size [201 x 4]. We can change the number of lines shown by setting the 'display.min_rows' & 'display.max_rows' options:
## Display a total of 6 rows only
#pd.set_option('display.min_rows', 6)
pd.set_option('display.max_rows', 6)
print(data_with_headerclutter)
## We can as well print it out by direct referral:
#data_with_headers
data_with_headerclutter.shape
## Display all rows
pd.set_option('display.max_rows', None)
print(data_with_headerclutter)
The headers are getting in the way (especially the 10th line, "timestamp | Ankara Temp...", which even messes up the output due to its enormous width). So we simply tell Pandas to skip the first 9 lines, so that parsing starts with the 10th line (index 9 in zero-based counting):
pd.set_option('display.min_rows', 10)
pd.set_option('display.max_rows', 10)
data_without_headerclutter = pd.read_csv("data/01_dataexport_20201008T180753.csv",
                                         skiprows=9)
data_without_headerclutter
... almost there, but it has assumed the line just before the data to be the column label line (which is usually what we want, but this time there is just too much clutter in it), so we'll enter the column labels manually:
data_without_headerclutter.columns = ['Timestamp','Temperature','Relative Humidity','Precipitation Total']
print(data_without_headerclutter)
data_without_headerclutter
Currently, the index (the row IDs) is a simple enumeration. But if we wish, we can designate a column that holds unique values (such as the Timestamp column in our example) as the index via the set_index command:
data_without_headerclutter = data_without_headerclutter.set_index('Timestamp')
data_without_headerclutter
and thus we now have 3 columns with the former Timestamp column being the index (row identifier).
We can get the list of the column names and row indexes:
# Columns
data_without_headerclutter.columns
# Row indexes
data_without_headerclutter.index
Now that we have managed to import the datafile, we have all the data within our reach - hooray! 8)
From here on, it works more or less in the same vein as NumPy or GNU Octave/MATLAB arrays, the main difference being the ability to also refer to entries directly using the column and row labels.
For this kind of reference, we use the iloc command. But before we do that, let me shorten the dataset name to df (for 'DataFrame'):
df = data_without_headerclutter
# (I'm just tired of typing 'data_without_headerclutter' all the time!)
df
df.iloc[2,0] # Row 2, Col 0
df.iloc[[4,1],[1,0]] # Rows 4 and 1 && Cols 1 and 0
df.iloc[1:4,0:2] # Rows [1,4) && Cols [0,2)
df.iloc[[1,3,6],:] # Rows 1,3 and 6 && All cols
And then, we have the option to call by the labels, using loc:
df.loc[['20201001T0300'],['Temperature','Precipitation Total']]
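loc also supports label ranges. One point worth stressing: unlike iloc's positional slices, label slices include both endpoints. A minimal sketch on a toy frame mimicking our timestamp index (the values are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking our timestamp-indexed data (values are illustrative)
toy = pd.DataFrame(
    {'Temperature': [12.1, 11.8, 11.5, 12.9],
     'Relative Humidity': [80, 82, 85, 78]},
    index=['20201001T0000', '20201001T0100',
           '20201001T0200', '20201001T0300'])

# Unlike iloc's positional slices, loc's label slices include BOTH endpoints
sub = toy.loc['20201001T0100':'20201001T0200', 'Temperature']
print(sub)
```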
For filtering, we simply state a condition and get True/False Boolean results back. Let's work with a smaller dataframe:
sdf = df.iloc[0:5,0:2]
sdf
Let's try to find those entries with temperature below 13 degrees:
sdf.iloc[:,[0]] < 13
So we see that the assertion holds for 3 of the 5 entries; let's pick them:
filter1 = sdf.iloc[:,[0]] < 13
sdf.loc[filter1['Temperature']==True]
We can achieve the same thing more directly by using the column name itself:
sdf.Temperature < 13
and then feeding this as the index:
sdf[sdf.Temperature < 13]
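Multiple column criteria can be combined with & (and) / | (or); note that each condition must be wrapped in parentheses, since & and | bind tighter than the comparisons. A sketch on a toy frame with illustrative values:

```python
import pandas as pd

# Toy frame (illustrative values)
toy = pd.DataFrame({'Temperature': [12.5, 14.0, 11.9],
                    'Relative Humidity': [88, 70, 92]})

# Combine column criteria with & (and) / | (or); the parentheses are mandatory
picked = toy[(toy.Temperature < 13) & (toy['Relative Humidity'] > 80)]
print(picked)
```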
What about the row index? Can we specify a criterion for it as well? First, let's re-display our small dataframe:
sdf
Let's seek the entries recorded before (and including) 01:00, or the one at 04:00 (here, or stands for the union, not the intersection!):
# Joining multiple criteria based on row label (index)
sdf[(sdf.index <= '20201001T0100') | (sdf.index == '20201001T0400')]
As the row index consists of strings, we don't even need to write the whole timestamp out to make the comparison:
sdf.index > "20201001T02"
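The resulting Boolean series can again be fed straight back as a mask; a sketch on a toy frame (values illustrative):

```python
import pandas as pd

# Toy frame with a string timestamp index (illustrative values)
toy = pd.DataFrame({'Temperature': [12.1, 11.8, 13.2]},
                   index=['20201001T0100', '20201001T0200', '20201001T0300'])

# The Boolean series from the prefix comparison works directly as a mask
late = toy[toy.index > "20201001T02"]
print(late)
```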
Now that we know how to slice via filtering, we can do whatever we want with the sections of the data we're interested in. For example, let us calculate the mean temperature on October 1st:
# First get the October 1st day's Temperature data:
oct1_temp = df[df.index < "20201002"]['Temperature']
print(oct1_temp)
oct1_temp.mean()
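Besides mean, a Series offers the other usual aggregations (min, max, std, ...), and describe gives several of them at once. A quick sketch on a small made-up series:

```python
import pandas as pd

# A small temperature series (illustrative values)
temps = pd.Series([16.1, 15.0, 17.8, 16.7])

print(temps.mean(), temps.min(), temps.max())
print(temps.describe())  # count / mean / std / quartiles in one go
```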
Let's calculate all the daily mean temperatures:
for i in range(1,9):
    day_temp = df[(df.index > "2020100"+str(i-1)) & (df.index < "2020100"+str(i+1))]['Temperature']
    day_temp_mean = day_temp.mean()
    print("October {} mean temperature: {:5.2f} degrees C".format(i, day_temp_mean))
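The same daily means can also be obtained without an explicit loop using groupby, grouping on the date prefix of the index. A sketch on a toy frame spanning two days (our real df would be grouped the same way):

```python
import pandas as pd

# Toy frame spanning two days (illustrative values)
toy = pd.DataFrame(
    {'Temperature': [10.0, 12.0, 20.0, 22.0]},
    index=['20201001T0000', '20201001T0100',
           '20201002T0000', '20201002T0100'])

# Group by the first 8 characters of the timestamp (the date part)
daily_means = toy.groupby(toy.index.str[:8])['Temperature'].mean()
print(daily_means)
```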
We can directly create a dataframe with the DataFrame command:
# Start with an empty dataframe:
mean_temps = pd.DataFrame({'day' : [],'meanTemp' : []})
mean_temps
# We can fill it individually
mean_temps.append({'day': "20201001", 'meanTemp' : 16.41}, ignore_index=True)
# or create it from the existing list
days = []
means = []
for i in range(1,9):
    day_temp = df[(df.index > "2020100"+str(i-1)) & (df.index < "2020100"+str(i+1))]\
               ['Temperature']
    day_temp_mean = day_temp.mean()
    # print("October {} mean temperature: {:5.2f} degrees C".format(i, day_temp_mean))
    days.append("2020100"+str(i))
    means.append(day_temp_mean)
data = {'days': days, 'means': means}
df_mean_temps = pd.DataFrame(data)
df_mean_temps
new_entry = pd.Series({'days': "20201013", 'means': 23.5})
df_mean_temps = df_mean_temps.append(new_entry, ignore_index=True)
df_mean_temps
# Specifying the index:
new_entry = pd.Series({'days': "20201011", 'means': 24.5}, name='k14')
df_mean_temps = df_mean_temps.append(new_entry)
df_mean_temps.index
df_mean_temps.shape
df_mean_temps.iloc[[9],:]
df_mean_temps.loc[['k14'],:]
df_mean_temps
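A side note: the DataFrame.append method used above was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same row addition is done with pd.concat. A sketch of the equivalent:

```python
import pandas as pd

mean_temps = pd.DataFrame({'day': [], 'meanTemp': []})

# Wrap the new row in its own one-row DataFrame, then concatenate
new_row = pd.DataFrame([{'day': "20201001", 'meanTemp': 16.41}])
mean_temps = pd.concat([mean_temps, new_row], ignore_index=True)
print(mean_temps)
```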
# With the row labels (index) and column labels included:
df_mean_temps.to_csv('data/01_out.csv')
print(df_mean_temps.to_csv())
# Without the row labels but with the column labels:
print(df_mean_temps.to_csv(index=False))
# Without the row labels and the column labels:
print(df_mean_temps.to_csv(index=False, header=False))
# Specify header for the index column:
print(df_mean_temps.to_csv(index_label='id'))
# Use ';' as the separator, instead of ','
df_mean_temps.to_csv(sep=';',path_or_buf='data/01_out.csv')
print(df_mean_temps.to_csv(sep=';'))
Even though Pandas offers very flexible ways of doing things, we could just as well have used NumPy:
import numpy as np
data = np.genfromtxt("data/01_dataexport_20201008T180753.csv", delimiter=',')
print(data)
As you can see, unfortunately, we can't place different types of variables into the same array this way (the non-numeric entries come out as nan). But other than that, the rest of the operations are similar:
data.shape
data_crop = data[11:,1:4]
data_crop
np.mean(data_crop[:,0])
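If mixed types are really needed on the NumPy side, genfromtxt can produce a structured array via dtype=None together with names=True. A sketch on a small inline CSV (illustrative data, not the meteoblue file):

```python
import io
import numpy as np

# A small inline two-column CSV (illustrative data)
csv = io.StringIO("timestamp,temp\n"
                  "20201001T0000,12.5\n"
                  "20201001T0100,11.8\n")

# dtype=None + names=True yields a structured array with per-column types
rec = np.genfromtxt(csv, delimiter=',', names=True, dtype=None, encoding='utf-8')
print(rec['timestamp'][0], rec['temp'].mean())
```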