
For each one of the following questions, write Python code in PyCharm.

• For each question, create a new Python file. Name the files lastname_firstname_hw5_1.py, lastname_firstname_hw5_2.py, and so on. Begin each file with a comment header such as:

#Name
#HW5
# Q1


• Create a Word document with screenshots of your program output. Zip the Python files and the Word document together, name the zip file lastname_firstname_hw5, and submit it on Blackboard.

Problem 1

The avocado.csv dataset contains price and quantity of avocados sold over time in various regions.

1. Preparation
   a. Read the dataset into Python using Pandas.
   b. Include only these columns: Date, AveragePrice, Total Volume.
   c. Store the data in a DataFrame named avocado.
   d. Convert the Date column to a timestamp using datetime.
   e. Print the dataframe.
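The preparation steps could look roughly like the sketch below. The in-memory CSV and its rows are made up so that the example runs without the real file; only the three column names come from the assignment. `pd.to_datetime` is one common way to do the timestamp conversion.

```python
import io

import pandas as pd

# Hypothetical stand-in for avocado.csv; these rows are invented for illustration.
sample_csv = io.StringIO(
    "Date,AveragePrice,Total Volume,region\n"
    "2015-01-11,1.20,1000,Albany\n"
    "2015-01-04,1.00,2000,Albany\n"
    "2015-01-04,1.50,4000,Boston\n"
)

# With the real data, pass "avocado.csv" instead of sample_csv.
# usecols keeps only the three columns the assignment asks for.
avocado = pd.read_csv(sample_csv, usecols=["Date", "AveragePrice", "Total Volume"])

# Convert the Date strings to pandas timestamps (dtype datetime64).
avocado["Date"] = pd.to_datetime(avocado["Date"])

print(avocado)
```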
2. Plotting
   a. Create a figure with 4 subplots.
   b. Sort avocado by Date in place, in ascending order.
   c. Plot the average price of avocados over time in subplot 1. Use scatter.
   d. Plot the total volume of avocados sold over time in subplot 2. Use scatter.
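A minimal sketch of the first two subplots follows. The DataFrame is a made-up stand-in for avocado, and the 2x2 grid is only one possible layout for the 4 subplots; the assignment does not prescribe one.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for the avocado DataFrame built during preparation.
avocado = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-11", "2015-01-04", "2015-01-04"]),
    "AveragePrice": [1.20, 1.00, 1.50],
    "Total Volume": [1000, 2000, 4000],
})

# Sort in place by Date, ascending.
avocado.sort_values("Date", ascending=True, inplace=True)

# Assumed layout: a 2x2 grid gives the 4 subplots.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
ax1, ax2, ax3, ax4 = axes.flatten()

# Subplot 1: average price over time, as a scatter plot.
ax1.scatter(avocado["Date"], avocado["AveragePrice"])
ax1.set_title("Average price")

# Subplot 2: total volume over time, as a scatter plot.
ax2.scatter(avocado["Date"], avocado["Total Volume"])
ax2.set_title("Total volume")

# plt.show()  # uncomment to display the figure
```

Subplots 3 and 4 (`ax3`, `ax4`) stay empty here; they are filled in after the data has been aggregated by date.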

You will notice that the plots are cluttered. The reason is that the dataframe contains many dates, and there are several transactions on the same date.

To address this, we will aggregate the volume and price by date.

Create a new dataframe avocado1 which sums the Total Volume for each date. Here are the steps:

• Create a new column in avocado called TotalRevenue, which is the product of AveragePrice and Total Volume.
• Then create a new dataframe called avocado1 by grouping avocado by Date and summing the columns within each date.

• Print avocado1. You will notice that AveragePrice was also summed across the rows of each date. A sum of prices is not an average price, so this column is not correct.
• Recalculate the average price using this

• You should now have the following dataframe. Print the dataframe
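The aggregation steps above can be sketched as follows. The rows are made up, and the recalculation formula is an assumption, since the formula the text refers to is not reproduced here: dividing the summed TotalRevenue by the summed Total Volume yields a volume-weighted average price per date.

```python
import pandas as pd

# Invented rows standing in for the avocado DataFrame.
avocado = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11"]),
    "AveragePrice": [1.00, 1.50, 1.20],
    "Total Volume": [2000, 4000, 1000],
})

# Revenue of each row = price * volume.
avocado["TotalRevenue"] = avocado["AveragePrice"] * avocado["Total Volume"]

# Group by date and sum every column; Date becomes the index.
avocado1 = avocado.groupby("Date").sum()
print(avocado1)  # AveragePrice is now a sum of prices, which is not meaningful

# Assumed fix: volume-weighted average price per date.
avocado1["AveragePrice"] = avocado1["TotalRevenue"] / avocado1["Total Volume"]
print(avocado1)
```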

1. Plot the average price from avocado1 over time in subplot 3. Use plot (a line plot).
2. Plot the total volume from avocado1 over time in subplot 4. Use plot (a line plot).

3. Plotting
   a. Create a figure with 2 subplots.
   b. Use the code on slide 52 of Lecture 5 to smooth out the last two plots from question 2. Plot the smoothed curves in subplots 1 and 2. You could smooth over a 20-day window.
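The code on slide 52 is not available here, so the sketch below swaps in a pandas rolling mean as one common smoothing technique; it is not necessarily what the lecture uses. The data is synthetic, with one row per day, so a window of 20 rows corresponds to 20 days.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series standing in for avocado1 (indexed by Date).
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=100, freq="D")
avocado1 = pd.DataFrame(
    {
        "AveragePrice": 1.4 + 0.1 * rng.standard_normal(100),
        "Total Volume": 1e6 + 1e5 * rng.standard_normal(100),
    },
    index=dates,
)

# Assumed smoothing: mean over a sliding window of 20 rows.
# The first 19 values are NaN because the window is not yet full.
window = 20
smooth_price = avocado1["AveragePrice"].rolling(window).mean()
smooth_volume = avocado1["Total Volume"].rolling(window).mean()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
ax1.plot(smooth_price.index, smooth_price)
ax1.set_title("Smoothed average price")
ax2.plot(smooth_volume.index, smooth_volume)
ax2.set_title("Smoothed total volume")

# plt.show()  # uncomment to display the figure
```

If the dates are not one day apart, a row-based window of 20 no longer means 20 days. With a sorted datetime index, `rolling("20D")` uses a time-based window instead.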

Problem 2

1. Create a statistical summary of the data in the file “CommuteStLouis.csv”. Plot a histogram of age for the CommuteStLouis data.

             Age    Distance        Time
count  500.00000  500.000000  500.000000
mean    41.38800   14.156000   21.970000
std     13.79994   10.748895   14.232436
min     16.00000    0.000000    1.000000
25%     30.00000    6.000000   11.500000
50%     42.00000   11.000000   20.000000
75%     52.00000   20.000000   30.000000
max     84.00000   80.000000  130.000000
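A summary table of this shape is what `DataFrame.describe()` prints. The sketch below uses synthetic values in place of CommuteStLouis.csv, so its numbers will not match the table above; the column names Age, Distance and Time are taken from that table.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in; with the real file use: commute = pd.read_csv("CommuteStLouis.csv")
rng = np.random.default_rng(1)
commute = pd.DataFrame({
    "Age": rng.integers(16, 85, size=50),
    "Distance": rng.integers(0, 81, size=50),
    "Time": rng.integers(1, 131, size=50),
})

# count, mean, std, min, quartiles and max for every numeric column.
summary = commute.describe()
print(summary)

# Histogram of age; the bin count of 10 is an arbitrary choice.
fig, ax = plt.subplots()
ax.hist(commute["Age"], bins=10)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Age distribution")

# plt.show()  # uncomment to display the figure
```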

2. For the CommuteStLouis data:
   a. Produce a correlation matrix of age, distance and time. Which two numeric variables are most highly correlated? What is the correlation coefficient for that pair?

               Age  Distance      Time
Age       1.000000 -0.000774  0.030292
Distance -0.000774  1.000000  0.830241
Time      0.030292  0.830241  1.000000
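A matrix like this comes from `DataFrame.corr()`. The sketch below also picks out the strongest off-diagonal pair programmatically. The data is synthetic, built so that Time depends on Distance, so the coefficients differ from the ones above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CommuteStLouis data; Time is constructed from Distance.
rng = np.random.default_rng(2)
distance = rng.uniform(0, 40, size=200)
commute = pd.DataFrame({
    "Age": rng.uniform(16, 84, size=200),
    "Distance": distance,
    "Time": 1.2 * distance + rng.normal(0, 5, size=200),
})

# Pearson correlation between every pair of columns.
corr = commute[["Age", "Distance", "Time"]].corr()
print(corr)

# Ignore the diagonal (always 1.0) and locate the largest absolute correlation.
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
pair = (corr.index[i], corr.columns[j])

print("Most correlated pair:", pair)
print("Coefficient:", corr.loc[pair[0], pair[1]])
```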

   b. Create a scatterplot matrix of the numeric variables in the data. What do the figures on the diagonal, going from the top left to the bottom right, show? What can you say about the skewness of the various attributes?
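One way to draw a scatterplot matrix is `pandas.plotting.scatter_matrix`. The sketch below uses synthetic data and also prints `DataFrame.skew()` as a numeric companion to reading skewness off the plots; a positive value indicates a longer right tail.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for the CommuteStLouis data; Distance is drawn right-skewed on purpose.
rng = np.random.default_rng(3)
distance = rng.exponential(14, size=200)
commute = pd.DataFrame({
    "Age": rng.uniform(16, 84, size=200),
    "Distance": distance,
    "Time": 1.5 * distance + rng.normal(0, 5, size=200),
})

numeric = commute[["Age", "Distance", "Time"]]

# diagonal="hist" draws a histogram of each variable on the diagonal;
# the off-diagonal cells are pairwise scatter plots.
axes = scatter_matrix(numeric, diagonal="hist", figsize=(8, 8))

# Sample skewness of each column.
skew = numeric.skew()
print(skew)

# import matplotlib.pyplot as plt; plt.show()  # uncomment to display the figure
```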

   c. Produce a side-by-side boxplot of distance travelled by gender. Do the data in the file indicate that women tend to commute shorter distances?
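A side-by-side boxplot can be drawn with `DataFrame.boxplot` and its `by` argument. The gender column name is not given in the text, so `Sex` with values "F" and "M" is an assumption here; check the actual header of the CSV. The data is synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in; the column name "Sex" and its values are assumptions.
rng = np.random.default_rng(4)
commute = pd.DataFrame({
    "Sex": ["F", "M"] * 20,
    "Distance": rng.exponential(14, size=40),
})

# One box per gender, drawn side by side on the same axes.
fig, ax = plt.subplots()
commute.boxplot(column="Distance", by="Sex", ax=ax)
ax.set_ylabel("Distance")
fig.suptitle("")  # drop the automatic "Boxplot grouped by Sex" title

# Group medians give a numeric check on what the boxes show.
medians = commute.groupby("Sex")["Distance"].median()
print(medians)

# plt.show()  # uncomment to display the figure
```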

Option: you can do Questions 3 and 4 either as one figure with two subplots or as two separate figures. The choice is yours.

3. For the pair identified in Question 2.a, create a scatter plot (plot 1).
   a. Also superimpose a linear regression line on plot 1.
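Per the correlation matrix above, the pair is Distance and Time. A minimal sketch of the scatter plot with a fitted line follows, using `numpy.polyfit` with degree 1 as one way to get a least-squares line; the data is synthetic and the choice of Distance as the x-axis is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Distance and Time columns.
rng = np.random.default_rng(5)
distance = rng.uniform(0, 40, size=200)
commute = pd.DataFrame({
    "Distance": distance,
    "Time": 1.2 * distance + 5 + rng.normal(0, 5, size=200),
})

# Least-squares fit of Time = slope * Distance + intercept.
slope, intercept = np.polyfit(commute["Distance"], commute["Time"], deg=1)

fig, ax = plt.subplots()
ax.scatter(commute["Distance"], commute["Time"], s=10, label="data")

# Draw the fitted line across the observed range of Distance.
xs = np.linspace(commute["Distance"].min(), commute["Distance"].max(), 100)
ax.plot(xs, slope * xs + intercept, color="red", label="linear fit")
ax.set_xlabel("Distance")
ax.set_ylabel("Time")
ax.legend()

# plt.show()  # uncomment to display the figure
```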

4. Show the distribution of the residuals from the regression in Question 3.
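Residuals are the observed values minus the values predicted by the fitted line. A histogram is one way to show their distribution, sketched below on synthetic data with an arbitrary bin count.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Distance and Time columns.
rng = np.random.default_rng(6)
distance = rng.uniform(0, 40, size=200)
commute = pd.DataFrame({
    "Distance": distance,
    "Time": 1.2 * distance + 5 + rng.normal(0, 5, size=200),
})

# Refit the same degree-1 least-squares line as in the scatter plot.
slope, intercept = np.polyfit(commute["Distance"], commute["Time"], deg=1)

# Residual = observed Time minus Time predicted by the line.
predicted = slope * commute["Distance"] + intercept
residuals = commute["Time"] - predicted

fig, ax = plt.subplots()
ax.hist(residuals, bins=20)
ax.set_xlabel("Residual (observed - predicted)")
ax.set_ylabel("Count")
ax.set_title("Distribution of residuals")

# plt.show()  # uncomment to display the figure
```

For a least-squares line with an intercept, the residuals average to zero, so the histogram should be centred near 0. Its shape shows whether the errors are roughly symmetric.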