For each one of the following questions, write Python code in PyCharm.
- For each question, create a new Python file. Name the files lastname_firstname_hw5_1.py, lastname_firstname_hw5_2.py, etc.
- Create a header in each file using comments to display your name and HW information. After that write your Python code.
- Create a Word document with screenshots of your program output. Zip the Python files and the Word doc together. Name it lastname_firstname_hw5. Submit on Blackboard.
The avocado.csv dataset contains price and quantity of avocados sold over time in various regions.
- Download the avocado CSV file.
- Read the dataset into Python using Pandas.
- Include only these columns: Date, AveragePrice, Total Volume
- Store the data in a DataFrame avocado
- Convert Date column to a timestamp using datetime.
- Print the dataframe
- Create a figure with 4 subplots
- Sort avocado by Date inplace in ascending order.
- Plot the average price of avocados over time in subplot 1. Use scatter.
- Plot the total volume of avocados sold over time in subplot 2. Use scatter.
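The loading and scatter-plot steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the three inline rows are made-up stand-ins so the snippet runs without the file, the extra region column is only there to show the column filter, and the 2x2 layout is one possible way to get 4 subplots. With the real data, pass the path to avocado.csv to read_csv instead of sample_csv.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows standing in for avocado.csv (hypothetical values).
sample_csv = io.StringIO(
    "Date,AveragePrice,Total Volume,region\n"
    "2015-01-11,1.24,41195.08,Albany\n"
    "2015-01-04,1.22,40873.28,Albany\n"
    "2015-01-04,1.00,435021.49,Atlanta\n"
)

# Keep only the three required columns while reading.
avocado = pd.read_csv(sample_csv, usecols=["Date", "AveragePrice", "Total Volume"])

# Convert the Date strings to pandas timestamps.
avocado["Date"] = pd.to_datetime(avocado["Date"])
print(avocado)

# A 2x2 grid gives four subplots; this layout is an assumption.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Sort in place by date, oldest first.
avocado.sort_values("Date", inplace=True)

# Subplot 1: average price over time.
axes[0, 0].scatter(avocado["Date"], avocado["AveragePrice"])
axes[0, 0].set_title("Average price over time")
axes[0, 0].tick_params(axis="x", labelrotation=45)

# Subplot 2: total volume over time.
axes[0, 1].scatter(avocado["Date"], avocado["Total Volume"])
axes[0, 1].set_title("Total volume over time")
axes[0, 1].tick_params(axis="x", labelrotation=45)
```

Calling plt.show() at the end of the script displays the figure in PyCharm.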
You notice that the plots are cluttered. The reason is that there are many dates in the dataframe and there are several transactions on the same date!
To address this, we will aggregate the volume and price by date.
Create a new dataframe avocado1 which sums the Total Volume for each date. Here are the steps:
- Create a new column in avocado called TotalRevenue which is the product of average price and total volume
- Then create a new dataframe called avocado1 which groups together the dataframe over the date
avocado1 = avocado.groupby('Date').sum()
- Print avocado1. You will notice that AveragePrice was also summed across the rows for each date. A sum of prices is not a valid average price.
- Recalculate the average price using this
avocado1['AveragePrice'] = avocado1['TotalRevenue']/avocado1['Total Volume']
- Print the updated dataframe. AveragePrice should now hold the revenue-weighted average price for each date.
- Plot the average price of avocado1 over time in subplot 3. Use Plot.
- Plot the total volume of avocado1 sold over time in subplot 4. Use Plot.
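The aggregation steps above can be checked on a tiny example. The sketch below uses made-up rows (two sales on one date, one on another) to show why the summed AveragePrice is wrong and how dividing TotalRevenue by Total Volume recovers a volume-weighted price. The numbers are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Made-up rows: two sales on the same date, one on a later date.
avocado = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11"]),
        "AveragePrice": [1.00, 2.00, 1.50],
        "Total Volume": [100.0, 300.0, 200.0],
    }
)

# Revenue per row = price * volume.
avocado["TotalRevenue"] = avocado["AveragePrice"] * avocado["Total Volume"]

# Summing per date adds up every numeric column, including AveragePrice.
avocado1 = avocado.groupby("Date").sum()
print(avocado1)  # AveragePrice for 2015-01-04 is 1.00 + 2.00 = 3.00 here

# Replace the summed price with total revenue divided by total volume.
avocado1["AveragePrice"] = avocado1["TotalRevenue"] / avocado1["Total Volume"]
print(avocado1)  # for 2015-01-04: (100 + 600) / 400 = 1.75
```

After groupby, Date becomes the index of avocado1, so the x-values for the line plots are in avocado1.index.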
- Create a figure with 2 subplots
- Use the code on slide 52 of Lecture 5 to smooth out the last two plots from question 2. Plot the smoothed curves in subplots 1 and 2. You could smooth over a 20-day window.
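The slide code is not reproduced here, so the following is only one common way to smooth a time series, a rolling mean, and may differ from what the lecture uses. The series is synthetic (a slow trend plus a day-to-day zigzag), and the names price and smoothed are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a slow upward trend plus a day-to-day zigzag.
dates = pd.date_range("2020-01-01", periods=100, freq="D")
t = np.arange(100)
price = pd.Series(1.0 + 0.01 * t + 0.2 * (-1.0) ** t, index=dates)

# Rolling mean over 20 rows; with one row per day this is a 20-day window.
# The first 19 values are NaN because the window is not yet full.
smoothed = price.rolling(window=20).mean()

print(price.tail(3))
print(smoothed.tail(3))
```

If the rows are not one per day, a 20-row window is not a 20-day window. In that case a time-based window such as rolling("20D") on a sorted datetime index is an option.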
- Create a statistical summary of the data in the file “CommuteStLouis.csv”. Plot a histogram of age for the CommuteStLouis data.
             Age    Distance        Time
count  500.00000  500.000000  500.000000
mean    41.38800   14.156000   21.970000
std     13.79994   10.748895   14.232436
min     16.00000    0.000000    1.000000
25%     30.00000    6.000000   11.500000
50%     42.00000   11.000000   20.000000
75%     52.00000   20.000000   30.000000
max     84.00000   80.000000  130.000000
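A summary table like the one above comes from the describe method of a DataFrame. The sketch below uses a made-up six-row stand-in for the file, so its numbers will not match the table. With the real data, commute would come from pd.read_csv("CommuteStLouis.csv"), assuming the file sits in the working directory.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for CommuteStLouis.csv (hypothetical values).
commute = pd.DataFrame(
    {
        "Age": [22, 35, 41, 58, 63, 29],
        "Distance": [5, 12, 20, 8, 30, 3],
        "Time": [10, 20, 35, 15, 45, 8],
    }
)

# count, mean, std, min, quartiles and max for each numeric column.
summary = commute.describe()
print(summary)

# Histogram of age; the bin count is an arbitrary choice.
fig, ax = plt.subplots()
ax.hist(commute["Age"], bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Age distribution")
```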
- For the data CommuteStLouis:
- Produce a correlation matrix of age, distance and time. Which two numeric variables are most highly correlated? What is the correlation coefficient for the above pair?
               Age  Distance      Time
Age       1.000000 -0.000774  0.030292
Distance -0.000774  1.000000  0.830241
Time      0.030292  0.830241  1.000000
- Create a scatterplot matrix of the numeric variables in the data. What do the figures in the diagonal going from the top left to the bottom right show? What can you say about the skewness of the various attributes?
- Produce a side-by-side boxplot of distance travelled by gender. Do the data in the file indicate that women tend to commute shorter distances?
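The three tasks above map onto standard pandas calls, sketched below on made-up data. The column name Sex for gender is an assumption and should be replaced by whatever the file actually uses. The step that searches for the strongest off-diagonal correlation is only a convenience, since with three variables the pair can also be read directly from the printed matrix.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for CommuteStLouis.csv; "Sex" is an assumed column name.
commute = pd.DataFrame(
    {
        "Age": [22, 35, 41, 58, 63, 29, 47, 52],
        "Distance": [5, 12, 20, 8, 30, 3, 15, 10],
        "Time": [10, 20, 35, 15, 45, 8, 25, 22],
        "Sex": ["F", "M", "M", "F", "M", "F", "F", "M"],
    }
)
numeric = commute[["Age", "Distance", "Time"]]

# Pearson correlation matrix of the numeric columns.
corr = numeric.corr()
print(corr)

# Blank out the diagonal (always 1.0), then find the largest |r|.
values = corr.to_numpy().copy()
np.fill_diagonal(values, np.nan)
row, col = np.unravel_index(np.nanargmax(np.abs(values)), values.shape)
pair = (corr.index[row], corr.columns[col])
print(pair, corr.iloc[row, col])

# Scatterplot matrix with a histogram of each variable on the diagonal.
pd.plotting.scatter_matrix(numeric, diagonal="hist")

# Side-by-side boxplots of distance, one box per gender value.
commute.boxplot(column="Distance", by="Sex")
```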
Options: You can do Questions 3 and 4 either as one figure with two subplots or as two separate figures. Your choice.
- For the pair identified in Question 2.a, create a scatter plot.
- Also superimpose a linear regression line on plot 1.
- Show the distribution of residuals of the data from Part 3.
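A rough sketch of the scatter plot, fitted line, and residual distribution follows. It assumes the most correlated pair is Distance and Time, as the correlation matrix earlier suggests, and uses made-up values. np.polyfit with degree 1 is one of several ways to get a least-squares line and is not necessarily the method the course expects.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up distance/time pairs (hypothetical values), sorted by distance.
distance = np.array([3.0, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0])
time = np.array([8.0, 10.0, 15.0, 22.0, 20.0, 25.0, 35.0, 45.0])

# Degree-1 least-squares fit: time is approximately slope * distance + intercept.
slope, intercept = np.polyfit(distance, time, deg=1)
fitted = slope * distance + intercept

# Residual = observed value minus value predicted by the line.
residuals = time - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: scatter plot with the regression line on top.
ax1.scatter(distance, time, label="data")
ax1.plot(distance, fitted, color="red", label="least-squares line")
ax1.set_xlabel("Distance")
ax1.set_ylabel("Time")
ax1.legend()

# Plot 2: distribution of the residuals; the bin count is arbitrary.
ax2.hist(residuals, bins=5)
ax2.set_xlabel("Residual")
ax2.set_ylabel("Count")
```

Residuals from a least-squares line with an intercept sum to zero up to rounding, which is a quick sanity check on the fit.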