High School R Homework



The vector `problem.1.data` contains numeric data.


## Part (a)


Construct a stripchart of the values in `problem.1.data`. Do you see any evidence of any outliers?


**Solution**
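
One possible approach, sketched on a made-up vector because the contents of `HW13.Rdata` are not available here. The name `example.data` and its values are assumptions for illustration only; the real work would use `problem.1.data`.

```r
# Hypothetical stand-in for problem.1.data (the real vector lives in HW13.Rdata)
example.data <- c(12.1, 9.8, 11.4, 10.7, 99999, 10.2, -9, 11.9)

# One-dimensional scatter of the values; "jitter" spreads overlapping points
stripchart(example.data, method = "jitter", pch = 1,
           main = "Stripchart of example.data")

# The numeric range tells the same story without a plot
example.range <- range(example.data)
print(example.range)
```

With a value as large as 99999 in the vector, the axis stretches so far that the ordinary values collapse onto a single spot, which is why the smaller -9 code only becomes visible after the first repair.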




## Part (b)


Repair this outlier.


* First, report the location of the outlier.


* Next, convert the value to `NA`.


* Finally, display the value of the vector at this location, and show that it has the value `NA`.


**Solution**
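
The three steps above could look roughly like this. It is a sketch on the same made-up vector, assuming the outlier is the 99999 value mentioned in part (e); `example.data` and `outlier.location` are hypothetical names.

```r
# Hypothetical stand-in for problem.1.data
example.data <- c(12.1, 9.8, 11.4, 10.7, 99999, 10.2, -9, 11.9)

# Step 1: report the location (index) of the outlier
outlier.location <- which(example.data == 99999)
cat("Outlier location:", outlier.location, "\n")

# Step 2: convert the value at that location to NA
example.data[outlier.location] <- NA

# Step 3: display the value at that location to confirm it is now NA
print(example.data[outlier.location])
```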







## Part (c)


Construct a stripchart of the repaired version of `problem.1.data`. Do you see any indication of a -9 being used to represent a missing value?


**Solution**





## Part (d)


It's not best practice to represent a missing value using the value -9, so we need to repair this.


* First, report the location of the element with -9.


* Next, convert the value to `NA`.


* Finally, display the value of the vector at this location, and show that it has the value `NA`.


**Solution**
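
A similar sketch for the -9 code, again on a made-up vector that already carries the `NA` from the earlier repair. The names are hypothetical. One detail worth noting: `which()` drops the `NA` produced by comparing `NA == -9`, so the earlier repair does not interfere with the search.

```r
# Hypothetical vector after the part (b) repair: position 5 is already NA
example.data <- c(12.1, 9.8, 11.4, 10.7, NA, 10.2, -9, 11.9)

# Step 1: report the location of the -9 (which() skips NA comparisons)
missing.location <- which(example.data == -9)
cat("Location of -9:", missing.location, "\n")

# Step 2: convert the value at that location to NA
example.data[missing.location] <- NA

# Step 3: display the value at that location to confirm it is now NA
print(example.data[missing.location])
```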








## Part (e)


Construct a stripchart of the repaired version of `problem.1.data`. Do you see any indication of an extreme outlier (i.e.\ 99999) or a -9 being used to represent a missing value?


**Solution**




Additional Instructions:

---
title: "HW 13"
output:
  pdf_document: default
  html_notebook: default
---

```{r}
load("HW13.Rdata")
```

# Problem 1

The vector `problem.1.data` contains numeric data.

## Part (a)

Construct a stripchart of the values in `problem.1.data`. Do you see any evidence of any outliers?

**Solution**

## Part (b)

Repair this outlier.

* First, report the location of the outlier.
* Next, convert the value to `NA`.
* Finally, display the value of the vector at this location, and show that it has the value `NA`.

**Solution**

## Part (c)

Construct a stripchart of the repaired version of `problem.1.data`. Do you see any indication of a -9 being used to represent a missing value?

**Solution**

## Part (d)

It's not best practice to represent a missing value using the value -9, so we need to repair this.

* First, report the location of the element with -9.
* Next, convert the value to `NA`.
* Finally, display the value of the vector at this location, and show that it has the value `NA`.

**Solution**

## Part (e)

Construct a stripchart of the repaired version of `problem.1.data`. Do you see any indication of an extreme outlier (i.e.\ 99999) or a -9 being used to represent a missing value?

**Solution**

\newpage

End of problem 1

\newpage

# Problem 2

The factor variable `problem.2.data` was loaded in with the R objects.

## Part (a)

Directly display all the levels of this factor variable.

**Solution**

## Part (b)

There should only be 5 levels for this factor:

* "Classic Widget"
* "Widget 2.0"
* "Widget 3k"
* "Quadcore Widget"
* "Widget Mach 5"

Group the factor levels so that there are now only these 5 levels. When you've done this, directly display the levels once again to show that there are only 5 levels.

**Solution**

## Part (c)

Tabulate the number of values in `problem.2.data` for each level. Store this table in a variable and display it directly.

**Solution**

## Part (d)

Using the table that you created in part (c), create a pie chart of the counts of each widget product.
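
Parts (b) through (d) of Problem 2 can be sketched on a small made-up factor. The misspelled variants below are invented for illustration, since the actual stray levels in `problem.2.data` are not shown; `widget.factor` and `widget.counts` are hypothetical names.

```r
# Hypothetical stand-in for problem.2.data; the misspelled variants are invented
widget.factor <- factor(c("Classic Widget", "classic widget", "Widget 2.0",
                          "Widget 2.0", "Widgit 3k", "Widget 3k"))

# Assigning an existing level name to a stray level merges the two levels
levels(widget.factor)[levels(widget.factor) == "classic widget"] <- "Classic Widget"
levels(widget.factor)[levels(widget.factor) == "Widgit 3k"] <- "Widget 3k"
print(levels(widget.factor))

# Tabulate the cleaned factor and draw the pie chart from the table
widget.counts <- table(widget.factor)
print(widget.counts)
pie(widget.counts, main = "Counts by widget product")
```

Relabeling through `levels()` changes every observation carrying the stray level at once, which tends to be less error-prone than editing individual elements.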

**Solution**

\newpage

End of problem 2

\newpage

# Problem 3

## Part (a)

Load in the data in the file "Problem 3 Data.csv". Then display the first 8 rows using a `head()` statement.

**Solution**

## Part (b)

Select the `tail.length` column from the data frame and store it in a variable. Use the `class()` function to determine the class of this variable. Report your result with a single sentence.

**Solution**

## Part (c)

Determine the location that contains the value "Missing" in the variable that you created in part (b). Report this location using a `cat()` statement.

**Solution**

## Part (d)

Modify the value of the object that you created in part (b) at the location you found in part (c) so that it now has the value `NA`. Then convert this modified object to a numeric vector. (Be careful! You have to use two operations to do this.) Report the sample mean of this numeric vector using a `cat()` statement, rounding to 5 decimal places. (Remember to use `na.rm = TRUE`.)

**Solution**

## Part (e)

Select the `species` column as a factor from the data frame you loaded in in part (a). Then use `tapply()` to create a summary of the mean tail length for each species. Store this summary in a variable, and display it directly.

**Solution**

## Part (f)

Using your summary from part (e), create a barplot of the mean tail length across the four species.

**Solution**

\newpage

End of problem 3

\newpage

# Problem 4

In this problem, we'll put the ideas of problems 2 and 3 together to create a more complex analysis. The file "Problem 4 Data.csv" contains customer satisfaction data on 3 breakfast cereal brands.

## Part (a)

Read in the file "Problem 4 Data.csv" and store this in a data frame variable. Then display the first 6 rows using a `head()` statement.

**Solution**

## Part (b)

Select the `brand` column from the data frame, and store it in a variable.
There should be 3 different cereal brands in this variable:

* Sugar Bomzz
* Krispy Yummm
* Healthy Kale and Tofu

Repair the data in this column by correcting any misspellings. When you're all done, use the `table()` function to tabulate the number of observations for each cereal brand, and display this summary table directly.

**Solution**

## Part (c)

Use the summary table you constructed in part (b) to create a pie chart of the number of observations for each brand.

**Solution**

## Part (d)

Now we'll consider the customer satisfaction ratings. Select the `satisfaction` column from the data frame from part (a) and store it in a variable. There are 4 entries in this variable that have the values -9, 99999, or "Missing". Convert each of these to `NA`. There are a variety of ways to approach this, so it's up to you to figure out what to do. When you're all done, use `tapply()` to construct a summary of the mean customer satisfaction score across the three cereal brands, and display this summary directly.

**Solution**

## Part (e)

Use your summary of the mean customer satisfaction score across the three brands that you constructed in part (d) to create a barplot.

**Solution**

\newpage

End of problem 4

\newpage

# Problem 5

Enzyme data from 3 labs is contained in these files:

* "Problem 5 Data A.csv"
* "Problem 5 Data B.csv"
* "Problem 5 Data C.csv"

These files are contained in the folder "Problem 5" in the "Data Files" folder.

## Part (a)

Read in the file "Problem 5 Data A.csv" and store the data frame in a variable. Then display the first 6 rows using a `head()` statement.

**Solution**

## Part (b)

Read in the file "Problem 5 Data B.csv" and store the data frame in a variable. Then display the first 6 rows using a `head()` statement.

**Solution**

## Part (c)

Read in the file "Problem 5 Data C.csv" and store the data frame in a variable. Then display the first 6 rows using a `head()` statement.

**Solution**

## Part (d)

Combine the three laboratory data sets from parts (a), (b), and (c) together into a single data frame and store it in a variable. You might have to make some adjustments before you're able to do this. The column names for this combined data frame should be:

* `enzyme.a`
* `enzyme.b`

Once you've created the combined data frame, display the first 6 rows using a `head()` statement.

**Solution**

## Part (e)

Create a scatterplot of the data in the combined data frame from part (d). The horizontal $x$-axis should represent the measurements for enzyme A, and the vertical $y$-axis should represent the measurements for enzyme B. Then superimpose a least-squares regression line on this graph.

**Solution**

\newpage

End of problem 5

\newpage

# Problem 6

The file "Problem 6 Data.csv" contains data on revenues and costs for a series of projects at four different locations.

## Part (a)

Read in the file "Problem 6 Data.csv", and store the data frame in a variable. Then display the first 6 rows of this variable using a `head()` statement.

**Solution**

## Part (b)

Select the rows where the `location` column has the value "Salt Lake City", and save this data frame in a variable. Then display the first 6 rows using a `head()` statement.

**Solution**

## Part (c)

Select the `cost` column in the Salt Lake City data frame from part (b), and store this in a variable. Then report the sample mean and sample standard deviation of these values using a separate `cat()` statement for each one, rounding to 5 decimal places.

**Solution**

## Part (d)

Select the `revenue` column in the Salt Lake City data frame from part (b), and store this in a variable. Notice that although the values all appear to be numeric, this variable is actually a factor. This means that one of the numeric values in the column was somehow corrupted. Repair this value, and tell us how you did it.
When you're all done, report the sample mean and sample standard deviation of these values using a separate `cat()` statement for each one, rounding to 5 decimal places.

**Solution**

## Part (e)

Using the cost vector from part (c) and the repaired revenue vector from part (d), construct a vector of profit values, defined as the revenue minus the cost. Report the sample mean and sample variance of the profit using a `cat()` statement, rounding to 5 decimal places.

**Solution**

## Part (f)

Construct a histogram of the profits for Salt Lake City using the profit vector you created in part (e). Since there are only 100 observations in this dataset, use 20 breaks for this histogram. Then superimpose an empirical density curve on this graph.

**Solution**

\newpage

End of problem 6

\newpage

# Problem 7

## Part (a)

Read in the data file "Problem 7 Data.csv" and display the first 6 rows using a `head()` command.

**Solution**

## Part (b)

Select the `species` column, and save it in a variable. This factor should have exactly 3 levels:

* Aardvark
* Armadillo
* Hedgehog

Repair any incorrect factor levels, and tell us how you did it.

**Solution**

## Part (c)

Select the `weight` column, and save it in a variable. Repair any -9 or 99999 values, and tell us how you did it.

**Solution**

## Part (d)

Create a stratified stripchart of the values in the `weight` vector from part (c) across the species in the corresponding factor from part (b).

**Solution**

## Part (e)

Select the `length` column, and save it in a variable. Repair any -9 or 99999 values, and tell us how you did it.

**Solution**

## Part (f)

Create a stratified boxplot of the values in the `length` vector from part (e) across the species in the corresponding factor from part (b).

**Solution**

\newpage

End of problem 7

\newpage

# Problem 8

It's often useful to have a sense not just of the variability of our data, but the variability relative to the overall average of the data.
You can think of this as the "noise-to-signal" ratio, where the variation of the data is the "noise" and the mean of the data is the "signal". In this problem, we'll create two functions to measure this "noise-to-signal" ratio, and then we'll incorporate them into a simple reporter routine.

## Part (a)

The *coefficient of variation* is the ratio of the standard deviation to the mean. Write a function to calculate the sample coefficient of variation for a numeric vector:

* The function takes one input argument, which should be a numeric vector.
* First, calculate the sample standard deviation of the values of the vector using the `sd()` function.
* Next, calculate the sample mean of the values of the vector.
* Return the ratio of the sample standard deviation divided by the sample mean.

Your function should ignore missing data, so remember to set `na.rm = TRUE` when appropriate. Don't print anything out with this function -- just return the calculated value. There's nothing to report for this part, but write your code clearly so the TAs can understand what you're doing.

**Solution**

## Part (b)

Run your function on the vector `problem.8.test.vector.1`. Store the return value in a variable, and report it using a `cat()` statement, rounding to 5 decimal places.

**Solution**

## Part (c)

Run your function on the vector `problem.8.test.vector.2`. Store the return value in a variable, and report it using a `cat()` statement, rounding to 5 decimal places.

**Solution**

## Part (d)

The *relative range size* is the ratio of the sample range to the sample mean, where the sample range is defined as the difference between the sample maximum and sample minimum. Write a function to calculate the relative range size for a numeric vector:

* The function takes one input argument, which should be a numeric vector.
* First, calculate the sample maximum of the values of the vector using the `max()` function.
* Next, calculate the sample minimum of the values of the vector using the `min()` function.
* Calculate the sample range as the sample maximum minus the sample minimum.
* Next, calculate the sample mean of the values of the vector.
* Return the ratio of the sample range divided by the sample mean.

Your function should ignore missing data, so remember to set `na.rm = TRUE` when appropriate. Don't print anything out with this function -- just return the calculated value. There's nothing to report for this part, but write your code clearly so the TAs can understand what you're doing.

**Solution**

## Part (e)

Run your function on the vector `problem.8.test.vector.3`. Store the return value in a variable, and report it using a `cat()` statement, rounding to 5 decimal places.

**Solution**

## Part (f)

Run your function on the vector `problem.8.test.vector.4`. Store the return value in a variable, and report it using a `cat()` statement, rounding to 5 decimal places.

**Solution**

## Part (g)

Load in the file "Problem 8 Data.csv" and store this in a data frame. Then write a `for()` loop that iterates across the columns of the data frame:

* If the column is a factor, then do nothing.
* If the column is a numeric vector, then print out the name of the column, the value of the coefficient of variation for that column, and the relative range size for that column. Don't just print out the numbers: use a `cat()` statement with an indication of what the value represents.

**Solution**

\newpage

End of problem 8

\newpage

# Problem 9: Extra Credit (5 points)

## Part (a)

Read in the data from the file "Problem 9 Data.csv" and store it in a data frame. Then display the first 6 rows using a `head()` statement.

**Solution**

## Part (b)

Notice that the first column of this data frame has a complex structure:

* The first four characters of every entry are a four-digit identification number.
* This is always followed by a hyphen character '-'.
* Finally, there is a three-character tag indicating the office location for the item.

Our goal is to extract the office location and convert it into something more readable. In this step, select the first column, convert it to a character vector using `as.character()`, and store this in a variable. Then display the first 6 elements of this character vector using a `cat()` statement.

**Solution**

## Part (c)

Create a new character vector by extracting the characters in each element of the vector from part (b) that represent the location. These will be the characters starting at position 6 in the string, and stopping at position 8. You'll need to use the `substr()` function to do this, and if you need a review you should check out Module 4 on character values from Lecture 8. When you've created this character vector, display the first 6 elements using a `cat()` statement.

**Solution**

## Part (d)

Convert the character string vector in part (c) to a factor. Then change the levels of this factor using these values:

| Old Levels | New Levels |
|:-----------|:-----------|
| BOS | Boston |
| LON | London |
| SLC | Salt Lake City |
| SHA | Shanghai |

When you're done, directly display the first 6 elements of this factor using a `cat()` statement.

**Solution**

## Part (e)

Now that you've cleaned up the factor, the rest should be straightforward. Select the values in the `rating` column of the data frame from part (a), and then summarize the mean rating score across the locations using the factor from part (d) and the `tapply()` function. Save this summary in a variable, and directly display it.

**Solution**

## Part (f)

Use the summary of the mean rating score across the locations from part (e) to create a bar plot.

**Solution**
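
The Problem 9 pipeline (extract positions 6 to 8, relabel the levels, summarize with `tapply()`) can be sketched on made-up data. The item codes and ratings below are invented, and the variable names are hypothetical; only the code format, the level mapping, and the `rating` column name come from the assignment text.

```r
# Hypothetical entries in the "four digits, hyphen, three-letter tag" format
item.codes <- c("1001-BOS", "1002-LON", "1003-SLC", "1004-SHA", "1005-BOS")
ratings <- c(4, 3, 5, 2, 2)  # invented stand-in for the rating column

# Characters 6 through 8 hold the office location tag
location.codes <- substr(item.codes, 6, 8)
cat(location.codes, "\n")

# Fix the level order explicitly so the new names line up with the old ones
location.factor <- factor(location.codes, levels = c("BOS", "LON", "SLC", "SHA"))
levels(location.factor) <- c("Boston", "London", "Salt Lake City", "Shanghai")

# Mean rating per location, then a bar plot of that summary
mean.rating <- tapply(ratings, location.factor, mean)
print(mean.rating)
barplot(mean.rating, main = "Mean rating by location")
```

Stating the `levels` order when the factor is created is a design choice: without it the levels are sorted alphabetically, so assigning new names positionally could silently mislabel locations.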
