DataFrame in Python

in an easy way


You’re asking one question: “What is the deal with DataFrames?” Being curious, you fire up Chrome and come across this unearthly definition from the official pandas documentation:

A “DataFrame” is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

Let’s break it down.

Not to be confused with the cute and cuddly animals, “Pandas” is one of the most popular packages in Python. It offers powerful yet flexible data structures that make data manipulation and analysis easy, among many other things.

“IPL Score chart” An example of what a DataFrame looks like. Source : FuzzyLogix

DataFrames in Python come with the Pandas library. They are defined as two-dimensional labeled data structures with columns of potentially different types.
“Dict” is short for dictionary. Like a dictionary, a DataFrame maps each column name to its values, and it can be built from Series, arrays, constants, etc.
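
To make the “dict-like container for Series objects” idea concrete, here is a minimal sketch. The column names are borrowed from the IPL table used later, and the numbers are made up purely for illustration.

```python
import pandas as pd

# Two Series sharing the same index labels (values are made up for illustration).
matches = pd.Series([14, 14, 16], index=['a', 'b', 'c'])
wins = pd.Series([9, 7, 10], index=['a', 'b', 'c'])

# A dict of Series becomes a DataFrame: the keys turn into column names.
frame = pd.DataFrame({'Mat': matches, 'Won': wins})

# Pulling a column back out returns a Series again.
won_column = frame['Won']
print(type(won_column).__name__)  # Series
print(frame.shape)                # (3, 2)
```

Looking up a column by name works much like looking up a key in a dictionary, which is where the “dict-like” part of the definition comes from.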

The term “DataFrame” comes from the world of statistics, where it generally means a table, or tabular data.

As is the case with tables, DataFrames are made up of rows and columns. Each column holds values of a single data type, so the values within a column must be similar, while the values across a row can be of unrelated types. DataFrames usually contain some metadata in addition to the data, for example column and row names.
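
A small sketch of that rule, assuming a toy frame with placeholder team names and a hypothetical 'WinRate' column that is not part of the IPL table:

```python
import pandas as pd

frame = pd.DataFrame({
    'Team': ['Team A', 'Team B'],   # text column (placeholder names)
    'Mat': [14, 16],                # integer column
    'WinRate': [0.5, 0.75],         # float column (hypothetical, for illustration)
})

# Each column reports a single dtype.
print(frame.dtypes)

# A single row, by contrast, mixes text, integer and float values.
first_row = frame.iloc[0]
print(first_row.tolist())
```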

Now that you’ve got an idea of what DataFrames are and what they can and can’t do, let’s see an example.

We are going to use two libraries: numpy and pandas. Here I have imported numpy as np and pandas as pd.

import numpy as np
from pandas import Series, DataFrame
import pandas as pd

While you can create DataFrames from scratch, you can also convert other data structures to Pandas DataFrames.

Importing the data.

How about we get data from Wikipedia? Say, some sports stats or something?

Let’s get the “IPL” teams records summary. For those of you who don’t know, “IPL” stands for “Indian Premier League”. It is a professional “Twenty20” cricket league in India.

For IPL records and statistical data – Click here

Wikipedia

Or, if for some unknown reason you don’t want to click a button, you can open a website using Python. Just run :

import webbrowser
website='https://en.wikipedia.org/wiki/List_of_Indian_Premier_League_records_and_statistics'
webbrowser.open(website)

On the site page, select the entire table of “Team Records” results summary and copy (CTRL + C) it to the clipboard. Make sure you don’t select and copy unnecessary page elements such as spaces, letters or paragraphs.

Now, the entire table is saved on your clipboard.

For the next step, we are going to read the clipboard data and store it in a variable called ipl_results.

ipl_results = pd.read_clipboard()
ipl_results

The result will look as follows:
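
The exact table depends on what you copied. If you can’t use the clipboard (on a remote server, for example), the sketch below imitates the same step. It relies on the fact that read_clipboard passes the clipboard text on to read_csv. The team names, spans and numbers are placeholders, not real IPL statistics, and only a few of the table’s columns are included.

```python
import io
import pandas as pd

# Stand-in for the copied table: tab-separated text, which is what a browser
# typically puts on the clipboard. All values are placeholders, NOT real IPL data.
copied_text = (
    "Team\tSpan\tMat\tWon\tNR\tTitles\n"
    "Team A\t2008-2019\t10\t6\t0\t1\n"
    "Team B\t2008-2019\t10\t5\t1\t0\n"
    "Team C\t2008-2019\t10\t4\t0\t0\n"
)

# read_clipboard hands the clipboard text to read_csv; here read_csv is called directly.
ipl_results = pd.read_csv(io.StringIO(copied_text), sep="\t")
print(ipl_results)
```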

Now we are free to perform many operations on it.

1. Getting the list of all columns :

To get the list of all columns, run:

ipl_results.columns

Which will give you the result :

Notice that each column name is within single quotation marks and that the names are separated by commas.
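
As a quick sketch on a tiny stand-in frame (placeholder values), the columns attribute can also be turned into a plain Python list or checked for a particular name:

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Team A'], 'Mat': [10], 'Won': [6]})

# .columns is an Index object; wrap it in list() for a plain Python list.
column_names = list(frame.columns)
print(column_names)  # ['Team', 'Mat', 'Won']

# Membership tests work directly on the Index.
print('Won' in frame.columns)  # True
```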

2. Getting select columns :

Statistical data is often — usually, in fact — messy. Sometimes, when you are manipulating some complex data, you only require the data that is necessary to your goal.

Let’s say, in the above example, I only care about the team name, the matches the team played, the matches the team won and the number of times the team won the “IPL” series title.

pd.DataFrame(ipl_results, columns=['Team', 'Mat', 'Won', 'Titles'])
# This will give me only the required columns

The output will be :
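
A minimal sketch of the same selection on a small stand-in frame (team names and numbers are placeholders, not real IPL data). It also shows the bracket syntax, which is probably the more common way to pick columns in pandas code:

```python
import pandas as pd

frame = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'Span': ['2008-2019', '2008-2019'],
    'Mat': [10, 10],
    'Won': [6, 5],
    'Titles': [1, 0],
})

wanted = ['Team', 'Mat', 'Won', 'Titles']

# Constructor form, as used above.
via_constructor = pd.DataFrame(frame, columns=wanted)

# Bracket form: pass a list of column names.
via_brackets = frame[wanted]

print(via_brackets)
print(via_constructor.equals(via_brackets))  # True
```

One difference worth knowing: the bracket form raises a KeyError for a column name that does not exist, while the constructor form quietly creates it.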

3. Adding a new Column:

Say, I want to add a new column to the data.

The IPL results summary is missing the “Captain” column. What will happen if I add the ‘Captain’ column in the code we ran in the earlier operation?

pd.DataFrame(ipl_results, columns=['Team', 'Mat', 'Won', 'Titles', 'Captain'])

Okay, the DataFrame contains the newly added column, but what’s the deal with “NaN”?

The new column gets added, but we haven’t provided its values. Hence its values are filled with “NaN”.

“NaN” means not-a-number. Usually “NaN” is used to just mean that some data is missing.
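
Here is a small sketch, on a placeholder frame, of how those missing values can be detected and counted with isna():

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Team A', 'Team B'], 'Won': [6, 5]})

# Asking for a column that does not exist yet yields a column full of NaN.
with_captain = pd.DataFrame(frame, columns=['Team', 'Won', 'Captain'])

# isna() flags missing entries; sum() counts them per column.
missing_per_column = with_captain.isna().sum()
print(missing_per_column['Captain'])  # 2
```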

4. Filling the Data :

To fill in the data in the “Captain” column, follow these two steps.

Create a Series:

captains = pd.Series(['MS Dhoni','Rohit Sharma','Virat Kohli'], index = [5,0,1])

Here, the ‘index’ argument lets you choose the index label for each value, so that each captain lands on the row of the right team.

Assign the Series to the “Captain” column:

ipl_results['Captain'] = captains
ipl_results

The output is:

Adding captains to their respective team.
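
The key point is that the assignment matches values by index label, not by position. A minimal sketch with placeholder team and captain names:

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Team A', 'Team B', 'Team C', 'Team D']})

# Index labels 3 and 0 decide which rows receive a value (names are placeholders).
captains = pd.Series(['Captain X', 'Captain Y'], index=[3, 0])

# Assignment aligns on index labels, not on the order of the Series.
frame['Captain'] = captains
print(frame)
```

Rows whose label does not appear in the Series (1 and 2 here) are left as NaN.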

The nice thing about data in a DataFrame is that it is very easy to convert into other formats such as Excel, CSV, HTML, LaTeX, etc.
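
As a rough sketch of two of those conversions, using a placeholder frame: both to_csv and to_html return a string when no file path is given.

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Team A', 'Team B'], 'Won': [6, 5]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = frame.to_csv(index=False)
print(csv_text)

# to_html returns an HTML <table> as a string.
html_text = frame.to_html(index=False)
print(html_text)
```

Passing a file name as the first argument, for example frame.to_csv('teams.csv') with a hypothetical file name, writes the result to disk instead.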

5. Deleting the row/column :

To delete rows :

Say I don’t need the rows with the index values 3, 4, 7 and 9:

ipl_drop_row = ipl_results.drop([3,4,7,9])
ipl_drop_row

The drop function returns a new DataFrame with those rows deleted; ipl_results itself is left unchanged.
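
A minimal sketch on a placeholder frame, showing that drop works on index labels and leaves the original frame alone:

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Team A', 'Team B', 'Team C', 'Team D']})

# drop returns a new DataFrame; rows are removed by index label.
trimmed = frame.drop([1, 3])

print(list(trimmed.index))  # [0, 2]
print(len(frame))           # 4, the original is unchanged

# reset_index(drop=True) renumbers the remaining rows from 0.
renumbered = trimmed.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```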

To delete columns :

Deleting multiple columns is just as easy. Say I don’t need the columns Span, Tie+W, Tie+L and NR.

ipl_drop_columns = ipl_results.drop(['Span','Tie+W','Tie+L','NR'], axis = 1)
ipl_drop_columns

Here, axis=1 means along the “columns”. It’s a column-wise operation.
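
If axis=1 feels cryptic, drop also accepts a columns keyword that does the same job. A short sketch on a placeholder frame:

```python
import pandas as pd

frame = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'Span': ['2008-2019', '2008-2019'],
    'Won': [6, 5],
    'NR': [0, 1],
})

# axis=1 tells drop that the labels are column names.
by_axis = frame.drop(['Span', 'NR'], axis=1)

# The columns= keyword expresses the same thing more explicitly.
by_keyword = frame.drop(columns=['Span', 'NR'])

print(list(by_axis.columns))       # ['Team', 'Won']
print(by_axis.equals(by_keyword))  # True
```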

This gives the output :

6. Creating a DataFrame from scratch :

Obviously, making your DataFrames is your first step in almost anything that you want to do when it comes to cleaning up a messy data set in Python. But sometimes, you will want to start from scratch.

One way to create a DataFrame in Python is to use a dictionary of lists.

data = {'City' : ['Mumbai','Delhi','Pune'], 'Population' : [9900000,7100000,4500000]}
city_frame = pd.DataFrame(data)
city_frame

Here, the dictionary keys City and Population are used for the column headings.

This will give the output :
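
A DataFrame can also be built from a list of dictionaries, one dictionary per row. The sketch below reuses the same city figures in that layout:

```python
import pandas as pd

# Same figures as above, arranged as one dictionary per row.
rows = [
    {'City': 'Mumbai', 'Population': 9900000},
    {'City': 'Delhi', 'Population': 7100000},
    {'City': 'Pune', 'Population': 4500000},
]
city_frame = pd.DataFrame(rows)
print(city_frame)
print(city_frame.shape)  # (3, 2)
```

The dictionary-of-lists layout is handy when you already have the data column by column. The list-of-dictionaries layout fits better when records arrive one at a time.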

This is it! You’ve finally completed the basics of DataFrames.

Note that, although we have covered some important topics, there is more to DataFrames than what we have learned so far.

In Python, there are many different approaches to solving the same problem. No single approach is “best”; it usually depends on your goals and needs.

Thanks for the read. Have a good day!