Hw3 Errorbars And Correlation - Data Science Assignment

Question: Hw3 Errorbars And Correlation - Data Science Assignment

I need help with this data science assignment because I have midterms in other subjects to prepare for. I have uploaded the required files and need to submit this very soon.

HW3 Errorbars and correlation homework

To complete this homework, you need to download one CSV file, which contains the monthly totals of new cases of measles, mumps, and chicken pox for New York City during the years 1931-1971 (a total of 41 years). The data file contains 123 rows and 12 columns.

Each row represents one year, with the 12 columns holding the months from Jan to Dec. The first 41 rows are the number of new measles cases in each year during that period, the next 41 rows are for mumps, and the remaining 41 rows are for chicken pox. Within each disease, the rows are in chronological order by year.

Complete the Python script skeleton to analyze the data for the following tasks. For your information, the data has been loaded with the Pandas package and organized into a NumPy 3D array of shape (3, 41, 12), where the first dimension represents the three diseases in the order mentioned above. Several other variables are also defined for your convenience.
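
The data layout described above can be sketched as follows. This is only an illustration on synthetic numbers, since the actual CSV file and the skeleton's variable names are not shown in the post; `table`, `data`, `diseases`, and `years` are hypothetical names. With Pandas, `table` would typically come from calling `to_numpy()` on the loaded DataFrame.

```python
import numpy as np

# Synthetic stand-in for the homework table (the real CSV is not shown here):
# 123 rows (3 diseases x 41 years) by 12 columns (Jan..Dec).
rng = np.random.default_rng(0)
table = rng.integers(0, 5000, size=(123, 12))

# Rows are stacked disease by disease, so a plain row-major reshape recovers
# the (disease, year, month) layout described in the assignment.
data = table.reshape(3, 41, 12)

diseases = ["measles", "mumps", "chicken pox"]  # order given in the text
years = np.arange(1931, 1972)                   # 41 years, 1931..1971

# Row 41 of the flat table is the first mumps year, row 122 the last
# chicken pox year.
print(np.array_equal(data[1, 0], table[41]))
print(np.array_equal(data[2, 40], table[122]))
```

Under this layout, `data[1]` is the 41 x 12 mumps matrix and `data[:, :, 0]` holds the January counts for all diseases and years.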

More Instructions
Electronic submission due 11:59pm, Wed 10/9.

Q1 (20 pts). Calculate the average number of cases per year for each disease, and estimate the 95% confidence interval of the average (Lec4.pptx slide #4). Plot as an errorbar. (Use marker='d', linestyle='', capsize=5 to show a figure similar to example Figure 1 on the next page.)

Q2 (20 pts). For each disease, calculate the fraction of cases that occurred in each month of the year during this period. You will calculate a matrix C of size 3 x 12, where each row is for a disease, and the value Cij is the total number of cases of disease i that occurred in month j (over all 41 years), divided by the total number of cases of disease i. (Hint: use matrix multiplication instead of for loops for this if you can.) Plot the vectors as a line graph. (See example figure 2.)
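
The computations behind Q1 and Q2 can be sketched as below, on synthetic data rather than the real file. Two assumptions are made here that the post does not confirm: "average number of cases per year" is read as the mean of the 41 annual totals, and the 95% interval uses the normal approximation (mean ± 1.96 × standard error), which may or may not be the method on the referenced lecture slide. All variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in for the (3, 41, 12) array of monthly case counts;
# the real homework data is not reproduced here.
rng = np.random.default_rng(1)
data = rng.integers(100, 5000, size=(3, 41, 12)).astype(float)

# --- Q1: mean annual total and a 95% confidence interval per disease ---
annual = data.sum(axis=2)                  # (3, 41): total cases per year
n_years = annual.shape[1]
mean_annual = annual.mean(axis=1)          # (3,): one mean per disease
sem = annual.std(axis=1, ddof=1) / np.sqrt(n_years)   # standard error
half_width = 1.96 * sem                    # ASSUMED normal approximation
ci_low, ci_high = mean_annual - half_width, mean_annual + half_width

# --- Q2: fraction of each disease's cases falling in each month ---
# Multiplying by a vector of ones sums over the year axis without a loop:
# (41,) @ (3, 41, 12) -> (3, 12).
ones = np.ones(n_years)
monthly_totals = ones @ data
C = monthly_totals / monthly_totals.sum(axis=1, keepdims=True)

print(np.round(mean_annual, 1), np.round(half_width, 1))
print(np.round(C.sum(axis=1), 6))          # each row of C sums to 1
```

For the plot, `mean_annual` and `half_width` can be passed to Matplotlib's `errorbar` as `y` and `yerr`, together with the marker, linestyle, and capsize settings given in Q1; each row of `C` is one line in the Q2 graph.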

Q3.1 (10 pts) Scatter plot the average number of mumps cases that occurred in each month of the year during this period against the average monthly number of chicken pox cases, i.e., you are scatter plotting two vectors, x and y, each of which has 12 values, representing the average number of mumps or chicken pox cases in Jan, Feb, etc., averaged over 41 years. (See example figure 3.1.)

Q3.2 (10 pts) Scatter plot the total number of mumps cases in each year against that of chicken pox cases, i.e., you are scatter plotting two vectors, x and y, each of which has 41 values, representing the total number of mumps or chicken pox cases in year 1931, 1932, etc. (See example figure 3.2.)

Q4.1 (5 pts) Calculate and print out the Pearson correlation coefficient between the monthly mumps cases and monthly chicken pox cases (the two vectors x and y you calculated in Q3.1).

Q4.2 (5 pts) Calculate and print out the Pearson correlation coefficient between the annual mumps cases and annual chicken pox cases (the two vectors x and y you calculated in Q3.2).

Q4.3 (5 pts) Similar to Q4.1, but calculate the Spearman rank correlation instead (using the argsort method, lecture4.pptx slide #12).

Q4.4 (5 pts) Similar to Q4.2, but calculate the Spearman rank correlation instead (using the argsort method, lecture4.pptx slide #12).

Q5 (20 pts) Calculate and show the correlation matrix between each of the 12 months for the number of mumps cases. More formally, you have a matrix M of size 41 x 12, where Mij is the number of mumps cases in year i and month j. You need to calculate a matrix C of size 12 x 12, where Cij is the correlation between Mi and Mj, and Mi is the i-th column of M. Use imshow(C) to display the matrix, and colorbar() to show the color map. Changing the months from 0-11 to 1-12 is optional but can be done with xticks and yticks as usual: xticks(range(12), range(1,13)). (See example Fig 4.)

Q6 (Bonus: 20 pts).
Calculate and plot the average fraction of cases of each disease occurring in each month. Taking mumps cases as an example, you start by calculating a matrix F of size 41 x 12, where Fij is the fraction of mumps cases in year i occurring in month j, i.e., $F_{ij} = M_{ij} / \sum_k M_{ik}$, with M as defined in Q5. Double check that the sum of each row is equal to 1. Then calculate the mean of each column of F to obtain a vector of size 12. Repeat this for the other two diseases and plot the three vectors in the same figure. (Optional: work with the three-dimensional data array to get a 3 x 12 matrix instead of three separate vectors.) (See example Fig 5.)

Challenge questions (Provide your answers at the end of the Python script file. See the comments in there.):

c1a. (3 pts) In Figure 4, why is the correlation between January and December so low?

c1b. (3 pts) Support your answer to c1a using 1 line of code.

c2. (10 pts) Figure 2 and Figure 5 look similar but are evidently different. Briefly explain the different meanings of the two figures, and describe a scenario that would cause dramatic differences between them. (Say you are allowed to add one more year of data. What pattern in that year would cause a big difference in Fig 2 but not much in Fig 5?)
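
The correlation steps in Q4 and Q5 can be sketched as below, again on synthetic data. The "argsort method" on the lecture slide is not shown in the post, so the double-argsort ranking used here is an assumption about what it refers to; it assigns ranks 0..n-1 and breaks ties arbitrarily instead of averaging them. The helper names `pearson`, `ranks`, and `spearman` are hypothetical.

```python
import numpy as np

# Synthetic stand-in for the (3, 41, 12) data array.
rng = np.random.default_rng(2)
data = rng.integers(100, 5000, size=(3, 41, 12)).astype(float)
mumps, pox = data[1], data[2]              # each (41, 12): years x months

def pearson(x, y):
    """Pearson correlation of two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def ranks(x):
    """Ranks 0..n-1 via a double argsort (ASSUMED method; ties not averaged)."""
    return np.argsort(np.argsort(x))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x).astype(float), ranks(y).astype(float))

# Q3.1 / Q4.1 / Q4.3: 12 monthly averages over the 41 years
x_month, y_month = mumps.mean(axis=0), pox.mean(axis=0)
# Q3.2 / Q4.2 / Q4.4: 41 annual totals
x_year, y_year = mumps.sum(axis=1), pox.sum(axis=1)

print("monthly:", pearson(x_month, y_month), spearman(x_month, y_month))
print("annual: ", pearson(x_year, y_year), spearman(x_year, y_year))

# Q5: 12 x 12 correlation matrix between the month columns of M (mumps).
# rowvar=False tells np.corrcoef that the columns are the variables.
C = np.corrcoef(mumps, rowvar=False)
print(C.shape)
```

The resulting `C` is what would be passed to `imshow` for the Q5 figure. On random data like this the correlations are close to zero, so the numbers printed here say nothing about the real measles, mumps, and chicken pox series.
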
Answers 0

No answers posted
