Introduction to pandas

pandas is a Python library for data analysis and manipulation. It provides two core data structures, Series (1D labeled arrays) and DataFrame (2D labeled tables), that simplify handling structured data. With pandas, you can read, clean, transform, and analyze data from sources such as CSV files, Excel workbooks, and databases.

Install pandas

You can use the pip command to install pandas.

pip install pandas

How to read CSV files

Here is a small CSV file. Let’s see how to open it and work with it in pandas.

ID,Name,Age,Salary,City
1,Alice,25,50000,New York
2,Bob,30,55000,Los Angeles
3,Charlie,35,60000,Chicago
4,David,40,65000,Houston
5,Eve,28,52000,San Francisco
6,Frank,33,58000,Seattle
7,Grace,29,53000,Boston
8,Hank,45,70000,Denver
9,Ivy,31,56000,Miami
10,Jack,38,62000,Atlanta

Read the CSV file using read_csv()

read_csv() loads the given file into a pandas DataFrame. A DataFrame lets us preprocess the data (filter, edit, sort).

import pandas as pd
df = pd.read_csv('sample_data.csv')

Here sample_data.csv is a file in the current working directory, and df is the resulting pandas DataFrame.
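To try this without creating a file first, the same kind of data can be loaded from an in-memory string. This is a minimal sketch: the name csv_text and the use of io.StringIO as a stand-in for sample_data.csv are assumptions made for illustration, and only the first three rows are included.

```python
import io

import pandas as pd

# Hypothetical stand-in for sample_data.csv: the header plus three rows.
csv_text = """ID,Name,Age,Salary,City
1,Alice,25,50000,New York
2,Bob,30,55000,Los Angeles
3,Charlie,35,60000,Chicago
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # column types inferred by pandas
```

Checking shape and dtypes right after loading is a quick way to confirm that the header and the numeric columns were parsed as expected.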

Getting the first five and last five rows.

first_five = df.head(5)
print(first_five)
last_five = df.tail(5)
print(last_five)
   ID     Name  Age  Salary           City
0   1    Alice   25   50000       New York
1   2      Bob   30   55000    Los Angeles
2   3  Charlie   35   60000        Chicago
3   4    David   40   65000        Houston
4   5      Eve   28   52000  San Francisco

   ID   Name  Age  Salary     City
5   6  Frank   33   58000  Seattle
6   7  Grace   29   53000   Boston
7   8   Hank   45   70000   Denver
8   9    Ivy   31   56000    Miami
9  10   Jack   38   62000  Atlanta

head(n) and tail(n) each take one argument, n, the number of rows to return. If you omit it, n defaults to 5.
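The effect of the row-count argument can be seen with other values. A small sketch, assuming a hand-built DataFrame that mirrors the first six names of the sample data (the variable names are illustrative):

```python
import pandas as pd

# Small hand-built frame mirroring part of the sample data.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
})

first_two = df.head(2)     # first 2 rows
last_three = df.tail(3)    # last 3 rows
default_head = df.head()   # no argument: returns up to 5 rows

print(first_two)
print(last_three)
print(len(default_head))
```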

Finding the mean of a column.

(The mean is the average: the sum of the values divided by how many values there are.)

df['Salary'].mean()
58100.0

You might have noticed the df['Salary'] syntax. This is how to access an entire column in pandas. After selecting the ‘Salary’ column, we can find its mean with .mean().
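To make the column selection and the mean concrete, here is a short sketch on a three-row frame (the data subset and variable names are chosen for illustration). It compares .mean() with the sum divided by the count:

```python
import pandas as pd

# Three rows taken from the sample data, built by hand.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000, 55000, 60000],
})

salaries = df["Salary"]   # selecting one column returns a Series
average = salaries.mean()

# The same value computed by hand: sum divided by count.
manual_average = salaries.sum() / len(salaries)

print(type(salaries).__name__)
print(average, manual_average)
```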

Sort a dataframe by a column’s values.

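Let’s sort all the rows by the Age column. By default, sort_values() sorts in ascending order and returns a new DataFrame, so df itself is left unchanged.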

df.sort_values('Age')
   ID     Name  Age  Salary           City
0   1    Alice   25   50000       New York
4   5      Eve   28   52000  San Francisco
6   7    Grace   29   53000         Boston
1   2      Bob   30   55000    Los Angeles
8   9      Ivy   31   56000          Miami
5   6    Frank   33   58000        Seattle
2   3  Charlie   35   60000        Chicago
9  10     Jack   38   62000        Atlanta
3   4    David   40   65000        Houston
7   8     Hank   45   70000         Denver
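Notice that each row keeps its original index label after sorting. The sketch below uses a three-row frame built by hand for illustration. It shows a descending sort with the ascending parameter, and reset_index(drop=True) to renumber the rows:

```python
import pandas as pd

# Three rows taken from the sample data, built by hand.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [25, 30, 28],
})

# Sort from oldest to youngest; the result is a new DataFrame.
oldest_first = df.sort_values("Age", ascending=False)

# Rows keep their original index labels; renumber them from 0 if needed.
renumbered = oldest_first.reset_index(drop=True)

print(oldest_first)
print(renumbered)
print(df)  # the original frame is still in its initial order
```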