Stat 121 Fall 2011: February 2012

Tuesday, February 21, 2012

Exam 2 Review!

It's that time- Exam time!

Main Topics (Aka BIG topics)
1. Regression
2. Two-Way Tables
3. Sampling Distribution of X-bar

Side Topics (Small topics)
1. Statistic versus Parameter
2. Control Charts
3. Response versus Explanatory Variables
4. Probability

1. REGRESSION

Main Point: Regression is all about knowing your vocab, vocab, vocab. And how to apply it of course.

Regression is for bivariate, quantitative data (if you don't know what that means, I'd look it up!). The whole point is to find out if there is an association between your two variables.

a. Scatterplots
Scatterplots have Strength, Form and Direction. Understand what each of these means and how to recognize them.

Form: Linear/non linear
Direction: Negative/positive
Strength: Correlation coefficient, r

Correlation Coefficient: Be able to recognize false statements about r, be able to guesstimate r based on a graph, and know how outliers affect r.
Things to remember about r:
-No units
-Only quantitative data
-Only linear data
-between -1 and 1
-affected by outliers
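
If you like seeing these facts in action, here is a minimal Python sketch. The house sizes and prices are numbers made up for illustration, and `correlation` is just a helper name for this example, not something from your course software.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: house size (square feet) and price (thousands of dollars).
size_sqft = [1000, 1500, 2000, 2500, 3000]
price = [150, 200, 260, 300, 360]

r = correlation(size_sqft, price)

# No units: converting square feet to square meters leaves r unchanged.
size_sqm = [s * 0.0929 for s in size_sqft]
r_metric = correlation(size_sqm, price)

# Affected by outliers: one huge, cheap house pulls r way down.
r_outlier = correlation(size_sqft + [6000], price + [100])

print(round(r, 3), round(r_metric, 3), round(r_outlier, 3))
```

The first two numbers printed match, and the third is much lower, even though only one point was added.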

r-squared (r^2): "The percent variation in y, explained by x" See previous posts for a good explanation on this. Be able to "interpret in context" (i.e. replace the highlighted words) as well as recognize this definition to know when they WANT r-squared.

TEST YOURSELF: What is the difference between r and r^2? What do they tell us?

Least Squares Regression Line: "Minimizing the sum of the squared residuals". Never hurts to know a definition. Basically, this is our best fit line, and its equation lets us predict values. The equation is:

y-hat = bx + a

Where y-hat is your predicted y, b=slope and a= y-intercept.
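
To make "minimizing the sum of the squared residuals" concrete, here is a small sketch. The study hours and exam scores are invented for illustration, and the function names are hypothetical. On the exam, b and a come from the output, so this is only for building intuition.

```python
def least_squares_line(xs, ys):
    """Return (b, a) for y-hat = b*x + a, the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # y-intercept
    return b, a

def sum_squared_residuals(xs, ys, b, a):
    """Add up (observed y - predicted y)^2 over all points."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Made-up data: hours studied and exam score.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [62, 70, 71, 80, 87]

b, a = least_squares_line(hours_studied, exam_score)
predicted = b * 3.5 + a  # predicted score for 3.5 hours of studying

# Any other line, such as one with a nudged slope, has a bigger sum.
best = sum_squared_residuals(hours_studied, exam_score, b, a)
nudged = sum_squared_residuals(hours_studied, exam_score, b + 0.5, a)

print(b, a, predicted, best, nudged)
```

For this made-up data the slope is 6 and the intercept is 56, so each extra hour of studying goes with an average increase of 6 points.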

Realize you never have to come up with b and a by yourself, they will always be given in outputs. Practice knowing how to read these outputs!!

Slope: "Average change in y for every one unit increase in x". This is the big one! Know how to interpret it in context (i.e., exchanging out the highlighted words) and know how to recognize the definition when they are asking just for the number. (Example: "What is the average change in farm population for every year that passes?" Do you see the "change in y" for a "unit increase" in x? They would just want the slope value, found from the output.)

Realize that it may not be WORD FOR WORD this definition. They can change up the order, for example, saying "for every one unit increase in x, what is the average change in y?" or, for a practical example, "For every year that passes, what is the change in farm population on average?" If you KNOW the definition, you should be able to recognize it.

Summary: Know your definitions, because everything is given to you in the output. You merely need to recognize what they are asking. Don't panic. You know this stuff.

2. TWO-WAY TABLES

Main Point: This is about calculation. Two-way tables assess relationships (aka, association between variables) for bivariate categorical data.

Marginal Distributions
These deal with "overalls". They are exclusively in the margins. Don't leave the margins. Okay? Because they are distributions, expect more than one number.

Conditional Distributions
These are based on a condition, or rather, a specific. They are still distributions, but they are based on the "inside" numbers.

Conditional Values
These are a singular value based on two conditions. The first condition is the total that governs.
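
Here is a rough sketch of all three calculations in Python. The counts are made up and are NOT the table used in the practice questions; the variable names are only for this example.

```python
# Made-up counts, NOT the table from this post.
# Rows: year in school. Columns: "Do you mainly study in the library?"
counts = {
    "Freshman": {"Yes": 30, "No": 20},
    "Senior": {"Yes": 10, "No": 40},
}

# The margins: row totals, column totals, and the grand total.
row_totals = {year: sum(row.values()) for year, row in counts.items()}
col_totals = {
    answer: sum(row[answer] for row in counts.values())
    for answer in ("Yes", "No")
}
grand_total = sum(row_totals.values())

# Marginal distributions: stay in the margins, divide by the grand total.
marginal_year = {year: t / grand_total for year, t in row_totals.items()}
marginal_library = {ans: t / grand_total for ans, t in col_totals.items()}

# Conditional distribution for seniors: stay in the senior row,
# divide the inside numbers by the senior total.
conditional_given_senior = {
    ans: c / row_totals["Senior"] for ans, c in counts["Senior"].items()
}

# Conditional values: the FIRST condition decides which total governs.
p_library_given_senior = counts["Senior"]["Yes"] / row_totals["Senior"]
p_senior_given_library = counts["Senior"]["Yes"] / col_totals["Yes"]

print(marginal_year, marginal_library, conditional_given_senior)
print(p_library_given_senior, p_senior_given_library)
```

Notice that the last two values use the same inside number (10) but different totals (50 seniors versus 40 library studiers), so they come out as 0.2 and 0.25.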

TEST YOURSELF
In case the above didn't make much sense, let's practice!

[Table: "Study in the Library mainly", broken down by year in school]

What is the marginal distribution of years?
What is the marginal distribution of people who study in the library?

What is the conditional distribution of sophomores?
What is the conditional distribution of people who DO study in the library?

For those who are seniors, what proportion study in the library?
For those who study in the library, what proportion are seniors?

Answers:
1. 0.1677, .3829, .2974, .1518
2. 0.48, .515

3. 0.5619, .438
4. 0.294, .444, .1503 .... (keep going)

5. 0.354
6. 0.111

Summary: Know your calculations.

3. SAMPLING DISTRIBUTION OF X-BAR

Main Point: This is probably the most concept-heavy portion of the second test, and thus, the thing people have the hardest time with. Know. These. Concepts. Understand them well, and you won't have a problem.

What is a sampling distribution of x-bar? It is the distribution of sample means from every possible sample of a particular size n.

Know the difference between a population graph and a sampling distribution of x-bar graph. (here's a hint to remind you: population = individual, S.D. of x-bar = sample mean).

Central Limit Theorem: For a non-normal population, if n >= 30, then the sampling distribution of x-bar is approximately normal. (As a side note: for an already normal population, the sampling distribution of x-bar is exactly normal for any n, not just n >= 30.)

Know how to use the formula associated with sampling distributions of x-bar, i.e. z = (x-bar - mu) / (sigma / sqrt(n)). Remember, you use it in the same way as the population z-score. But practice anyway.

Things to know: The standard deviation of a sampling distribution of x-bar is sigma/sqrt(n).

The mean of a sampling distribution of x-bar is ALWAYS. ALWAYS. ALWAYS = to mu. Always. Doesn't matter what n is. Always. Always.
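
You can check both facts on a population small enough to list every possible sample. This is only a sketch: the four-value population is made up, and it assumes samples are drawn with replacement, which is the setting where sigma/sqrt(n) holds exactly.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

population = [2, 4, 6, 12]  # made-up tiny population
n = 2                       # sample size

mu = fmean(population)      # population mean
sigma = pstdev(population)  # population standard deviation

# Every possible sample of size n (drawn with replacement) and its mean.
sample_means = [fmean(sample) for sample in product(population, repeat=n)]

mean_of_x_bars = fmean(sample_means)
sd_of_x_bars = pstdev(sample_means)

print(len(sample_means))           # number of possible samples
print(mu, mean_of_x_bars)          # these two match
print(sigma / sqrt(n), sd_of_x_bars)  # and so do these two
```

Changing n changes the spread of the sample means, but their mean stays at mu.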

TEST YOURSELF: What happens to the following graphs?

Let's start with a population of 15,000 that is severely left skewed.

If I take a sample of 800 people and graph it, what does it look like?
If I take all possible samples of size n=10 and graph it, what does it look like?
If I take all possible samples of size n=600 and graph it, what does it look like?

Answers:
1. The key here is that I only took ONE SAMPLE. Remember, the CLT only applies to sampling distributions of x-bar. So, we would expect it to look like the population.

2. This is a sampling distribution of x-bar, but n<30, so the CLT doesn't apply. This graph would be closer to normal (less skewed) than the population, but we cannot call it normal because the CLT doesn't apply.

3. This is a sampling distribution of x-bar and n>30, so the CLT applies and we have an approximately normal distribution.
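
If you want to watch these three answers happen, here is a simulation sketch. Everything in it is an assumption for illustration: the left-skewed population is invented, 2,000 random samples stand in for "all possible samples", and `skewness` is a simple homemade measure (about 0 for a symmetric shape, negative for left skew).

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Average cubed z-score: near 0 if symmetric, negative if left skewed."""
    m = fmean(values)
    s = pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

rng = random.Random(121)  # fixed seed so the run is repeatable

# A made-up, severely left-skewed population of 15,000 values.
population = [100 - rng.expovariate(0.1) for _ in range(15_000)]

def simulated_sample_means(n, how_many=2000):
    """Means of many random samples of size n (a stand-in for all possible samples)."""
    return [fmean(rng.sample(population, n)) for _ in range(how_many)]

# 1. ONE sample of 800 individuals: looks like the population.
one_sample = rng.sample(population, 800)

skew_population = skewness(population)
skew_one_sample = skewness(one_sample)

# 2. Sample means with n = 10: less skewed, but not normal yet.
skew_n10 = skewness(simulated_sample_means(10))

# 3. Sample means with n = 600: skew is close to 0.
skew_n600 = skewness(simulated_sample_means(600))

print(skew_population, skew_one_sample, skew_n10, skew_n600)
```

The one big sample stays about as skewed as the population, while the skew of the sample means shrinks toward 0 as n grows.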

As a side note, don't forget about the law of large numbers, which only deals with samples (not sampling distributions of x-bar). Check out your notes on that one.

Summary: Again this is concept heavy. Of course, it does require some calculations in way of the formula, but if you were okay with these on the first exam, you should be okay now.

4. THE SMALL STUFF

Be sure to go over your definitions for:

statistic versus parameters (be able to tell the difference)

Control Charts (remember the equations for the upper and lower limits aren't on your equation sheet. Memorize them!)

Probability (know facts about the probability distribution and what it takes to have a proper one)

Response versus explanatory variables (be able to tell the difference).

And of course, brush up on knowing the different experimental designs!

GOOD LUCK!!
-Hillary

Need more Help?

I have another TA friend who writes a fantastic website sort of like this blog that is a great resource if you need more help! While my blog is homework focused, hers is more concept focused. If you have a concept you are struggling with, more often than not you will be able to find great example problems and PowerPoint slides on the topic. Check Kiya's website out!

https://sites.google.com/site/kiyabyustat/

I've also listed a link to her site in the sidebar :)

Happy Stat 121-ing!

-Hillary

Assignment 14

We will be going over Control Charts on Thursday (and doing questions 1-2).

Questions 5-7 are a great review on the difference between population distributions and sampling distributions of x-bar. Remember, if I were to take out ONE data point from a population graph, what would it represent? What about if I were to take it out of a sampling distribution of x-bar? What would it represent? (This should help with questions 6 and 7).

Questions 8-10 are a simple review on statistics versus parameters. I think question 10 poses the most challenge. Remember your key words to know if it is talking about a parameter. (Hint: things that are "known" or about "all" are generally parameters).

Finally, questions 11-13 were discussed at the end of lab last week. They relate to the central limit theorem and how graph shapes (as well as the mean) will change. Look over your class notes! (PS- These are VERY helpful questions to know for the exam! Be sure to understand them.)

-Hillary

Assignment 13

Sorry this is so late folks! My computer broke over the weekend which makes it a little difficult to write blog posts!

Luckily if you were in lab, Assignment 13 shouldn't have posed too much of a problem.

Questions 2-4: When you are doing these, remember what our "new" standard deviation is (aka what standard deviation is for a sampling distribution of x-bar). Particularly on question 3, think about what you are solving for, and where this symbol appears in your equation. On this question, you won't be using an entire formula from your equation sheet. Adapt!

Questions 5-11 test your knowledge on the difference between the graph of an individual (population) or of a sample mean (sampling distribution of x-bar). Be careful what equation you use!!

Question 14 is probably the hardest for students, but you definitely know how to do it! The key for this problem is labeling what you know. Write down on your paper the equation. Then list the variables you have to fill:

mu, sigma, n, x-bar and z.

We clearly are solving for z to get a proportion. That means the values for mu, sigma, n and x-bar are given somewhere in this question. Find them! See how much easier this problem becomes once you label what you have? Then it is just plug and chug.

HINT: If it asks the probability that the company's average loss will not exceed, we are looking for the left proportion :) (less than).
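
Here is what the "label, then plug and chug" routine could look like in Python. The numbers are made up and are NOT the ones from Question 14, so you still need to find yours in the problem.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: label what you know (made-up values, not from the assignment).
mu = 50      # population mean
sigma = 12   # population standard deviation
n = 36       # sample size
x_bar = 53   # the sample mean we are asked about

# Step 2: z-score for a sample mean, using sigma / sqrt(n).
z = (x_bar - mu) / (sigma / sqrt(n))

# Step 3: "will not exceed" means less than, so take the left proportion.
left_proportion = NormalDist().cdf(z)

print(z, round(left_proportion, 4))
```

With these made-up numbers z is 1.5, and the proportion to the left is about 0.933, the same value you would look up in the z table.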

Monday, February 13, 2012

Assignment 12

Probability is a pretty easy concept that I do not want to spend too much time on in class, since it will make up very little of your exam AND almost all of you have seen these concepts before.

The probability of an event is always between 0 and 1. This makes sense. The lowest a probability can be is a 0% chance of happening, and nothing can have more than a 100% chance of happening.

If there is a distribution of probabilities, they should all add up to 1. Again, this makes sense. As an example, let's say we got the probability of college students at BYU having 0, 1, 2, 3, 4+ roommates. The distribution may look like:

Roommates   | 0   | 1   | 2   | 3   | 4+  |
Probability | .05 | .10 | .10 | .30 | .45 |

As you notice, .05+.10+.10+.30+.45 = 1. This is because at BYU, you HAVE to have one of those options (you cannot have less than 0 roommates, and I have covered everything in "4+"). So it has to encompass 100%.
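
Both rules are easy to turn into a quick check. This is a minimal sketch using the roommate probabilities above; `is_valid_distribution` is a hypothetical helper name.

```python
from math import isclose

def is_valid_distribution(probabilities):
    """True if every probability is between 0 and 1 and they add up to 1."""
    probabilities = list(probabilities)
    each_ok = all(0 <= p <= 1 for p in probabilities)
    # isclose avoids tiny rounding errors when adding decimals.
    return each_ok and isclose(sum(probabilities), 1.0)

roommates = {"0": 0.05, "1": 0.10, "2": 0.10, "3": 0.30, "4+": 0.45}

print(is_valid_distribution(roommates.values()))  # the roommate example
print(is_valid_distribution([0.5, 0.6]))          # adds up to more than 1
print(is_valid_distribution([1.2, -0.2]))         # adds to 1, but values are out of range
```

Only the roommate distribution passes. The other two each break one of the rules.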

Questions 1-4
Pretty straightforward. Choose the proportion that makes the most sense.

Questions 5-7
The key to these questions is writing out the right possibilities. Make sure you get every possible combination. I'll give you the first FOUR as a hint...but you need to come up with the rest.

GGG
GGB
GBG
BGG

The hard part for most people about these questions is the whole "x=2". X stands for the number of girls a couple has. So "x=2" is asking: how many arrangements have exactly two girls, no more, no less?

For question 7, remember all you know about probability. What are all the possible values of "X" (the number of girls in the combination)? Look at your arrangements. Make sure the probabilities add up to 1!
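
Here is the same idea on a smaller problem, so the homework is still yours to finish: a sketch with TWO children instead of three. It assumes every arrangement is equally likely, which is an assumption of this example.

```python
from collections import Counter
from itertools import product

children = 2  # smaller than the homework on purpose

# Write out every possible arrangement of girls (G) and boys (B).
arrangements = ["".join(birth) for birth in product("GB", repeat=children)]

# X = number of girls in each arrangement.
girls_count = Counter(a.count("G") for a in arrangements)

# Probability of each value of X, assuming equally likely arrangements.
distribution = {
    x: girls_count[x] / len(arrangements) for x in sorted(girls_count)
}

print(arrangements)   # ['GG', 'GB', 'BG', 'BB']
print(distribution)   # {0: 0.25, 1: 0.5, 2: 0.25}
```

Here x=1 has two arrangements (GB and BG), and the three probabilities add up to 1, just like they should.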

The rest of the assignment is about parameter versus statistics, experimental design and association versus causation. We have gone over the last two extensively in class, and they are good reviews for the exam! Statistic versus parameter we will discuss in class.

Good Luck!

Assignment 11

We went over most of this assignment in my first lab, and the WHOLE assignment in my second lab, so you should be well equipped for this!

For questions 8-13, remember that marginal distributions deal with the MARGINS, so they only use the totals. (You'll notice you have to compute the total values by yourself).

Conditional distributions are based on one condition. Block out the row/column you are interested in. Remember, that will be the SPECIFIC. (For example, "whether you buy or not" is not a specific, since it covers both nonbuyers and buyers. "Higher" IS a specific, because it is one particular category rather than all of "quality").

Remember: Be careful with the word CAUSE. What does that mean? When can we conclude causation?

Wednesday, February 8, 2012

Assignment 10

We went over everything you need for assignment ten last week, so here is some help. We won't be going over it this Thursday in Lab.

In one of my labs, I wasn't able to get to "r^2" (r-squared). The definition of r-squared is as follows:

"The percent variation in y, explained by x".

Realize that it is a percentage basically describing how much our x variable (explanatory) explains or describes our y variable (response). For example, Let's use house price versus house size again. You can probably imagine what this would look like (Draw it if it will help). House size is our explanatory variable and price is our response. It has a positive relationship because as house size increases, so does house price.

Now, let's say our r-value (correlation) for this is .8. To get r^2, we just square it. Thus, we get .64. r^2 is usually in a percentage, so we would say 64%.

According to the definition, (The percent variation in y, explained by x), this means "64% of variation in house price is explained by how big your house is".

This probably makes sense. A lot of how expensive our house is is because of the size. But the other 36% could be explained by location, schools nearby, property, newness, etc. This should help with problem 6.

Problem 7 is the weird one I told you about. I'll step you through it. Remember, you'll never have to do this again.

You'll notice we have the variables "Sy and Sx" and "Y-bar and X-bar". Sx and x-bar refer to the standard deviation and mean of the x, or explanatory, variable. That means Sy and Y-Bar refer to the standard deviation and mean of y, or response, variable.

Looking at the problem, which is the explanatory and which is the response variable? Try to figure it out on your own first.

Did you get that the wife's height is the explanatory and the husband's height is the response? The clue here was that we were using the "regression line to predict the husband's height from the wife's height".

Knowing that, then it becomes easy. Sx=2.7 and x-bar=64. Sy=2.8 and y-bar=69.3. r=correlation coefficient, which is given.

Solve for b first, then plug it into the next equation.
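
As a sketch of that two-step order, here it is in Python. I am assuming the two equations are the usual ones, b = r * Sy / Sx and a = y-bar - b * x-bar, and the r value below is a made-up placeholder, NOT the one given in the problem.

```python
# Values from the problem as listed above.
s_x, x_bar = 2.7, 64.0   # wife's height (explanatory)
s_y, y_bar = 2.8, 69.3   # husband's height (response)

r = 0.5  # placeholder only: use the correlation given in the problem

# Step 1: solve for the slope.
b = r * s_y / s_x

# Step 2: plug b into the intercept equation.
a = y_bar - b * x_bar

def predict_husband_height(wife_height):
    """Predicted husband's height from y-hat = b*x + a."""
    return b * wife_height + a

print(b, a, predict_husband_height(x_bar))
```

A handy check on your arithmetic: plugging x-bar into the finished line should give back y-bar.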

For questions 9-12, make sure you follow the StatCrunch instructions on Blackboard. They will help you produce an output that will make answering these questions easy.

Question 12 is asking about something called "extrapolation" which you should have learned in lecture. Extrapolation means trying to predict a y for an x outside of the range of your data. For example, let's use the example of credit hours versus hours of sleep at night.

Let's say we only collected data up until 15 credit hours. We could NOT use our line to predict someone who was taking 18 credit hours. Why not? Because we would not know what the regression line was doing after 15 credit hours.
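
One way to picture this rule is a prediction function that refuses to extrapolate. This is only a sketch: the slope, intercept, and the 6 to 15 credit range are invented for the credit hours example, and `predict_within_range` is a hypothetical name.

```python
def predict_within_range(x, b, a, x_min, x_max):
    """Predict y-hat = b*x + a, but only for x inside the range of the data."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x={x} is outside the data range [{x_min}, {x_max}]: extrapolation"
        )
    return b * x + a

# Made-up line: hours of sleep = -0.2 * credit hours + 9.5,
# built from data that only covered 6 to 15 credit hours.
b, a = -0.2, 9.5
x_min, x_max = 6, 15

print(predict_within_range(12, b, a, x_min, x_max))  # inside the range: fine

try:
    predict_within_range(18, b, a, x_min, x_max)     # outside the range
except ValueError as err:
    print("Cannot predict:", err)
```

The line itself would happily spit out a number for 18 credit hours. The point is that we have no data telling us the pattern continues out there.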

Be careful on Question 14 - make sure you are talking about differences in the students not the environments.

Good Luck!
-Hillary

Saturday, February 4, 2012

Assignment 9

We talked about most of this in class. Be careful on number 5: how are slope and correlation related? Think about it. Is it possible that a graph could have points NOT close together and another graph have points that ARE close together, yet the "best fit line" had the same slope?

Another hint is that r values can only take on certain numbers. What number is the slope?

Questions 7-9 we talked about in class. Look back to your notes on correlation, "r". Each one of the rules we discussed falls into one of these phrases.
Correlation coefficient:

has no units
is affected by outliers
can only be between quantitative variables
only describes linear relationships
is between -1 and 1

Question 12 uses the slope definition. Here is a reminder of that definition:

"The average change in y for every one unit increase in x".

All of the red variables can be exchanged for the specific circumstance.

Good Luck!
-Hillary