Python and Data Science - A cool starter project for middle and high schoolers

Updated: Apr 11, 2022

In our modern world, data is being gathered every day from websites, sensors, our phones, and more. Data Science is the practice of drawing insights from data, and it is a fascinating space for middle school and high school students to explore. In this blog post we describe a starter exercise that you can do to get your feet wet in Python with Data Science. We will use Python libraries (the same ones that professionals use!) to explore a dataset of monthly rainfall.

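The first step is to access the dataset. You will find that Python has many libraries that help you access and process different types of datasets. Here we install the earthpy library. The leading ! tells a notebook such as Google Colab to run the line as a shell command instead of as Python code.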

!pip3 install earthpy

Next, we will load the dataset and process it using the Pandas library. Pandas is a very powerful Python library for working with DataFrames, the data structure used for most data science tasks. A DataFrame is a table in which each column can hold a different type of data (categories, text, numbers, and so on).

import pandas as pd
import earthpy as et

# URL for .csv with avg monthly precip data
avg_monthly_precip_url = "https://ndownloader.figshare.com/files/12710618"

# Download file from URL
et.data.get_data(url=avg_monthly_precip_url)
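
Before reading the real file, it may help to see what "every column can be a different type" looks like in practice. The sketch below builds a tiny DataFrame by hand. The name toy, the is_winter column, and all of the values are made up for illustration and are not taken from the rainfall dataset.

```python
import pandas as pd

# Made-up values for illustration only; these are NOT the real rainfall data.
toy = pd.DataFrame({
    "months": ["Jan", "Feb", "Mar"],      # text
    "precip": [1.0, 2.5, 0.5],            # decimal numbers
    "is_winter": [True, True, False],     # True/False values
})

print(toy)         # the whole table
print(toy.dtypes)  # the type Pandas chose for each column
```

Each key in the dictionary becomes a column name, and each list becomes that column's values.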

Once the download has finished, we can read the file into a Pandas DataFrame and look at it.

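import os

# earthpy saves downloads under your home folder (on Google Colab that is /root)
fname = os.path.join(os.path.expanduser('~'), 'earth-analytics', 'data', 'earthpy-downloads', 'avg-precip-months-seasons.csv')
dataset = pd.read_csv(fname)
dataset.head()  # show the first five rows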

Once you run the code, you will see the first rows of the dataset displayed as a table.

You can see that this dataset has three columns: months, precip, and seasons. Calling dataset.head() shows the first five rows of the dataset.
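
Besides head(), a few other quick checks are handy when you first open a table. The sketch below uses a small stand-in table with the same three column names. The values are made up and toy is a hypothetical name, so the numbers will not match the real file.

```python
import pandas as pd

# Hypothetical stand-in table with made-up numbers (not the real dataset).
toy = pd.DataFrame({
    "months": ["Jan", "Feb", "Mar", "Apr", "May", "June"],
    "precip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "seasons": ["Winter", "Winter", "Spring", "Spring", "Spring", "Summer"],
})

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # the column names
first_three = toy.head(3)  # ask for 3 rows instead of the default 5
print(first_three)
```

You can run the same three calls on dataset to check its size and column names before going further.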

Next, we will subsample the data, which means picking a random subset of its rows. Subsampling is a common operation when a dataset is too big to process all at once. You can subsample using Pandas.

print('Length of the dataset before sampling', len(dataset))
sub_sampled_data = dataset.sample(n=5)
print('Length of the dataset after sampling', len(sub_sampled_data))

You may or may not choose to subsample - it is not required.
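
Because sample() picks rows at random, you will get different rows each time you run it. If you want the same rows every time, you can pass a random_state, and if you would rather ask for a fraction of the rows than a fixed count, you can pass frac. The sketch below shows both on a made-up table. The name toy, the values, and the particular random_state numbers are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up table with 12 rows; the values are placeholders, not real rainfall.
toy = pd.DataFrame({
    "months": ["m1", "m2", "m3", "m4", "m5", "m6",
               "m7", "m8", "m9", "m10", "m11", "m12"],
    "precip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
               7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
})

# Using the same random_state twice picks the same 5 rows both times.
sample_a = toy.sample(n=5, random_state=42)
sample_b = toy.sample(n=5, random_state=42)
print(sample_a.equals(sample_b))

# frac asks for a share of the rows instead of a fixed number.
half = toy.sample(frac=0.5, random_state=0)
print(len(half))
```

Fixing the random_state is useful when you want to share your results with a friend or a teacher and have them see the same rows you did.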

Next, we will visualize the data using a bar plot. You can use the code below:

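import matplotlib.pyplot as plt

# Create a figure with one set of axes, leaving room for the tick labels
fig, ax = plt.subplots()
# Draw one bar per month, with the bar height given by the precip column
ax.bar(dataset["months"], dataset["precip"])
plt.show()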

Once you run this code on the original file, you will see a bar plot with one bar for each month.
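
A plot is easier to read when it has a title and axis labels. The sketch below shows one way to add them, using a small made-up table. The name toy, the values, the label text, and the file name toy_precip.png are all placeholders chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # draw without opening a window; leave this line out in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; these are NOT the real rainfall data.
toy = pd.DataFrame({
    "months": ["Jan", "Feb", "Mar", "Apr"],
    "precip": [1.0, 2.5, 0.5, 3.0],
})

fig, ax = plt.subplots()
bars = ax.bar(toy["months"], toy["precip"])  # one bar per row
ax.set_title("Toy rainfall example (made-up numbers)")
ax.set_xlabel("Month")
ax.set_ylabel("Precipitation")
fig.savefig("toy_precip.png")  # save the picture to a file
```

The same set_title, set_xlabel, and set_ylabel calls work on the plot of the real dataset.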

There you have it! You have taken your first steps with data science: you downloaded a dataset, sampled it, and visualized it. The code for this example is also available in a Google Colab notebook here: https://colab.research.google.com/drive/1lPxpwXxOfb2jtjfRyqcOOaHAua-fG-wW?usp=sharing#scrollTo=QdusFVOAwBSx

Enjoy!
