Stat 121 Fall 2011: 2012

Thursday, April 5, 2012

Final Exam Review

So this review will basically just cover the new material since Exam 3. If you want to see topics from Exam 3 and before, you can look at the three other exam reviews I have written for each exam.

However, after lab today a student said "I know everything we've learned all interconnects...but how do we keep it STRAIGHT?" It was a good question. So, the first part of this review will be an overall, comprehensive look at some things we have done this semester. You should be able to click on them to make the charts larger.

1. A list of all "tests" we have performed and why/when we perform them. (There is a document on blackboard similar to this. It tells the same information, mine is just in a different format. Use whichever one makes the most sense for you)

2. A list of all conditions for any test we have done.

3. A comprehensive list of symbols and their meanings.

Okay, on to new material.

New material mostly covers:
1. Proportions
2. Chi-Square
3. Tests of Significance on Slope (Regression)
4. (small topic) Can you tell which test to use?

1. Proportions

Proportions represent categorical variables, like survey questions. You should know:

a. How to compute proportions.

p-hat is just "number of successes" over the total sample size, or n.

Be able to use the test equations. BE CAREFUL about which "p" you are using: whether it is the "null p" or the "p-hat, sample p".
Seriously though, they love to test you on that, especially with true or false. For example: "For a one-sided confidence interval estimation, we would test normality by checking np>=10 and n(1-p)>=10." You said true, right? Well, it's FALSE. For confidence intervals, we check with p-hat, not p. Watch out.
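Speaking of np checks, know the np checks for normality. In case you forgot, they are:
For a test of significance: np>=10 and n(1-p)>=10, using the null p.
For a confidence interval: n(p-hat)>=10 and n(1-p-hat)>=10, using the sample p-hat.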

Know how to calculate two-sample proportions, namely knowing what plain "p-hat" is there (the pooled proportion p-hat: the total successes from both samples over the total of both sample sizes).
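Here is a minimal sketch of the proportion calculations above. The survey numbers and function names are made up for illustration; the one-sample z formula is the standard one, which I am assuming matches your equation sheet. Notice how the same sample can fail the np check for a test (null p) and pass it for a confidence interval (p-hat).

```python
from math import sqrt

def sample_proportion(successes, n):
    """p-hat: number of successes over the sample size."""
    return successes / n

def normality_ok(p, n):
    """The np checks: n*p >= 10 and n*(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def one_sample_z(p_hat, p0, n):
    """Test statistic for a one-sample test; the NULL p goes in the denominator."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def pooled_proportion(x1, n1, x2, n2):
    """Pooled p-hat for a two-sample test: all successes over all observations."""
    return (x1 + x2) / (n1 + n2)

# Hypothetical survey: 18 of 40 people said yes, and the null claims p = 0.2
n = 40
p_hat = sample_proportion(18, n)
p0 = 0.2

check_for_test = normality_ok(p0, n)    # uses the null p: 40 * 0.2 = 8, too small
check_for_ci = normality_ok(p_hat, n)   # uses p-hat: 18 and 22, both fine
z = one_sample_z(p_hat, p0, n)

# Hypothetical second sample: 30 of 60 said yes
pooled = pooled_proportion(18, 40, 30, 60)
```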

b. General things

Be able to write parameters for proportions (homework is good for this).
Know facts about the sampling distribution of p-hat (HINT: It's a lot like facts about the sampling distribution of x-bar...)
Really anything asked in a four step process (be able to conclude, get a p-value, etc).

2. CHI-SQUARE

a. Computing chi-square

Remember chi-square is just on two-way tables.
We compute expected counts (the zorro method: row total X column total / table total), and then the chi-square contributor for EACH cell.
We sum up all of the chi-square contributors to get the chi-square test statistic. Using the new table, which works exactly like the t-table, we get a p-value.
Don't forget about conditions.
Chi-square is always testing to see if there is SOME association between the variables.
Remember the relationship between expected counts and observed counts.
You get degrees of freedom in a different way [(r-1)*(c-1)]
Don't talk about causation. Just don't do it. We need an experiment for causation.
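If it helps to see the chi-square steps laid out, here is a small sketch. The two-way table is made up for illustration. Each expected count is row total times column total over table total, each contributor is (observed - expected)^2 / expected, and the test statistic is the sum of the contributors.

```python
# Hypothetical 2 x 3 two-way table of observed counts
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count for each cell: row total * column total / table total
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Chi-square contributor for EACH cell: (observed - expected)^2 / expected
contributors = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Test statistic: sum of all the contributors
chi_square = sum(sum(row) for row in contributors)

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

With the test statistic and df in hand, you would go to the chi-square table for the p-value, just like the t-table.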

3. REGRESSION / SLOPE ANALYSIS

a. Theory

The theory of tests of significance on regression is that SLOPE determines if there is an association or not. Thus we are testing if slope is zero versus slope is not zero (greater than/ less than / not equal to).
Be able to write/interpret a parameter in context.
"Slope is the average change in y for every one unit increase in x"
Change out what is in green, and you are on your merry way.
Be sure you know how to check conditions

b. Calculations

Remember, regression is really all based on the "output". Be sure you know how to read them. Then the equations become easy.
Don't be fooled: DF=n-2.
Conclude in a similar fashion as with every test.
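On the exam the slope, its standard error, and the t test statistic all come from the output, but here is a rough sketch of where those numbers come from. The data are made up, and the variable names are my own. The test statistic for Ho: slope = 0 is the slope divided by its standard error, with DF = n-2.

```python
from math import sqrt

# Hypothetical (x, y) data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares slope and intercept
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals: observed y minus predicted y
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

df = n - 2                                      # NOT n - 1 for regression
s = sqrt(sum(e ** 2 for e in residuals) / df)   # standard deviation about the line
se_slope = s / sqrt(sxx)                        # standard error of the slope
t_stat = slope / se_slope                       # test statistic for Ho: slope = 0
```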

c. Reminders

Don't forget other stuff you "used to" know about regression. Namely

How to write best-fit equations based on the output
how to plug in numbers to those equations
r
r^2 (and interpretation of it)

d. Confidence Intervals versus Prediction intervals

This is pretty simple stuff. Confidence intervals : are on means. Prediction intervals: Are on individuals
Which one is wider, and why?

4. WHICH TEST IS IT?

As it is the end of a semester, we have gone through a whole lot of different procedures. They are going to ask you questions about "which procedure are we using?" Use the chart I gave above about all the different procedures to help you with this one.

Above all, take these slow. Eliminate one thing at a time.

Example:

"Suzy wants to test which oven is best. In oven one, she bakes ten loaves of bread and times the average time it takes for them to bake. She then puts ten loaves in oven two, and calculates the average baking time for those. She discovers, with a p-value of 0.002, that oven two cooks faster"

Answers:
a. One-sample t-test for means
b. One-sample z-test for means
c. One-sample z-test for proportions
d. Two-sample t-test for means
e. Two-sample z-test for means
f. Two-sample z-test for proportions
g. Two-sample t-confidence interval for means
h. Two-sample z-confidence interval for proportions
i. ANOVA
j. Chi-Square
k. Regression

Whoa. There are a lot of choices. Ask yourself some questions.
1. Is this for means or proportions? She is calculating the average bake time. Means.
2. Is this z or t? No mention of sigma. T-test.
3. Is this a confidence interval or test? It mentions a p-value, so test.
4. Is this one-sample, two-sample or matched pairs? There are two ovens, so two different data samples collected. This is two-sample.

The correct answer is (d).

You'll most likely never actually have to go through all of those questions, but they are good types of questions to ask to narrow things down.
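The four narrowing questions can be written as a tiny sketch. The function and argument names are hypothetical, and it only covers the one-sample, two-sample and matched pairs procedures (not ANOVA, chi-square or regression). It assumes, as in the answer choices above, that proportions always use z.

```python
def name_procedure(about, sigma_known, reports_p_value, design):
    """Build a procedure name from the four narrowing questions.

    about: "means" or "proportions"
    sigma_known: True if the population standard deviation is given
    reports_p_value: True for a test, False for a confidence interval
    design: "One-sample", "Two-sample" or "Matched pairs"
    """
    if about == "proportions":
        letter = "z"                       # proportions use z here
    else:
        letter = "z" if sigma_known else "t"
    kind = "test" if reports_p_value else "confidence interval"
    return f"{design} {letter}-{kind} for {about}"

# Suzy's ovens: averages, no sigma mentioned, a p-value, two ovens
suzy = name_procedure("means", False, True, "Two-sample")
print(suzy)
```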

For help on other exam material, see my other reviews posted.

GOOD LUCK GUYS. YOU'LL DO GREAT.

Have a great life :)
-Hillary

Tuesday, April 3, 2012

Assignment 29

THE LAST *REAL* Assignment!! Assignment 30 is just an extra credit TA evaluation :) Congrats, you made it!

So...I want to do all of Assignment 29 in class. So, I'm not going to write about it here just yet. If we don't get to some questions in any of my classes and I think you need clarification, check back here after Thursday. I'll write something then.

Also, My Final Exam review will be coming soon.

-Hillary

Assignment 28

(I didn't go over the Chi-Square assignment since we did all but three problems in class =] )

Questions 1-3 should be a great review of regression. We will review this in class, but you should be able to get these pretty easily.

We will be doing questions 4-8 in class. We haven't gone over this concept yet.

Questions 8 until the end are GREAT final review questions. Basically, this is taking ALL of the tests we have done and making you decide which one we should be using!

Here's a quick reminder of what the three we are comparing here are:

Chi-Square: Compares multiple proportions (more than two). AKA, both of our variables are categorical. This means we will always be using a two-way table.

Regression: Compares two quantitative variables. (Like weight versus height). We use scatterplots to recognize correlation in these (using best fit lines).

ANOVA: Compares multiple means. (like I want to know if a certain scent makes people stay in a restaurant longer, so I compare the mean time of three, like lemon, sage and lavender)

Hope that helps!
-Hillary

Wednesday, March 28, 2012

Assignment 26

Question 1-4 are about confidence intervals on p. These confidence intervals are exactly like confidence intervals for means! Just follow the equation.

Some things to note:
So the equation is:

p-hat +/- z* sqrt( p-hat*(1 - p-hat) / n )

So we can see that this equation follows what we are used to. P-hat is like x-bar: the sample proportion. We know how to find z*- it hasn't changed! The next portion is just the standard error- just like we are used to!

Keep note though, I said it was the standard error. Can you tell the reason? It's because we are using p-hat, not Po, since we do not have a hypothesis.
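As a quick sketch of the confidence interval equation for p, here it is with made-up numbers (120 successes out of 200). The function name is my own. Instead of reading z* off the table, it uses Python's `statistics.NormalDist`, so z* comes out as 1.95996 rather than the table's rounded 1.960.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval for p: p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    # z* leaves (1 - confidence)/2 in each tail of the standard normal curve
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)   # uses p-hat, not the null p
    margin = z_star * standard_error
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 successes out of 200
low, high = proportion_ci(120, 200)
print(round(low, 4), round(high, 4))
```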

The confidence interval conclusion has not changed. (BE CAREFUL: We aren't talking about the true mean any more! Remember the proper parameter- check question 1 or last assignment for help.)

Question 5: Here's a hint: Whenever the question is "Why can't we" it means CHECK CONDITIONS!!

Question 6: Check out the equation on the right hand side of the equation sheet, in the row right under the heading "proportions".

Questions 9-12: We will be doing questions 13-16 in class, so these questions will be easier to answer after that!

Good Luck!
-Hillary

Wednesday, March 21, 2012

Exam 3 Review

Exam 3 is here! Remember this exam typically has the lowest averages, so we typically say it is the most challenging. This may or may not be true for you specifically, but prepare well.

What makes it challenging? It is a LOT of interpretation. Most people are comfortable with all the calculations. We test you on your understanding of the concepts and definitions.

Basically, this means know your definitions. Know them in and out. Know how to interpret them and recognize them.

Main Topics:
1. Tests of Significance
2. Confidence Interval Estimations
a. For both 1&2, need to know t and z tests, four step process
3. ANOVA

Side Topics
1. Type I/Type II errors
2. What type of procedure is this?
3. Two-sided confidence intervals
4. Sample size
5. Symbols

1. Tests of Significance

Definitions to KNOW:

Test of significance: An outcome that is unlikely to happen if a claim is true is good evidence that the claim is not true. (This is the theory of a test of significance. Remember the coin example?)
p-value: The probability of getting an x-bar as extreme or more extreme if the null hypothesis is true. (KNOW THIS. You will need to be able to INTERPRET this as well. Meaning if I give you an actual situation, you could put numbers into the right locations. You can see previous posts for a more in depth explanation of this).
Parameter: The mean of what you are finding out about the population. Okay, so this isn't really a definition, but be comfortable writing parameters. (Remember we need the MEAN and the POPULATION).
Null/Alt Hypothesis: Null Hypothesis: Statement of no change. Alternative: What we want to prove.

Procedure

Obviously you need to be comfortable with every part of the four step process for a test of significance (for both t and z tests).

Write the parameter, null and alternative hypothesis and state the level of significance. I've talked about these previously, but make sure you can do them.

Conditions
For a Z-test the conditions are (and are met by):
1. Randomization: Met through SRS OR RAT.
2. Normality: Met through CLT OR graph displaying approximate normality.
3. Sigma is known: Yes or no. They give it to you or they don't.

For a t-test conditions are (and are met by):
1. Randomization: Met through SRS OR RAT.
2. Normality: Met through CLT OR graph displaying no extreme skewness or outliers.

For a z-test, we use the equation z = (x-bar - mu) / (sigma / sqrt(n)). This is called the test statistic. We then go to the z-table and get a value for the p-value.

Things to remember about how to find z-test p-values:

One-sided test with Ha: Mu<#: Read p-value directly off table.
One-sided test with Ha: Mu>#: 1-table value = P-value.
Two-sided test with x-bar < null hypothesis mu: 2*(table value)= p-value.
Two-sided test with x-bar > null hypothesis mu: 2*(1-table value)=p-value.

Don't take my word for it though...draw the picture :)
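The four table rules above can be sketched in code. The numbers are made up, and `NormalDist().cdf(z)` stands in for the z-table (it gives the area to the LEFT of z, which is what the table value is here). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(x_bar, mu0, sigma, n, alternative):
    """Return (z, p-value). alternative is "less", "greater" or "two-sided"."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    table_value = NormalDist().cdf(z)      # area to the left of z
    if alternative == "less":
        p_value = table_value              # read directly off the table
    elif alternative == "greater":
        p_value = 1 - table_value
    elif alternative == "two-sided":
        # the smaller tail, doubled: covers both x-bar < mu0 and x-bar > mu0
        p_value = 2 * min(table_value, 1 - table_value)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value

# Hypothetical numbers: x-bar = 52, null mu = 50, sigma = 6, n = 36
z, p_greater = z_test_p_value(52, 50, 6, 36, "greater")
_, p_two_sided = z_test_p_value(52, 50, 6, 36, "two-sided")
```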

For a t-test, we use the equation t = (x-bar - mu) / (s / sqrt(n)). This is called the t test statistic. We then go to the t-table and get a value for p-value.

We like the t table because it already accounts for if it's a one-sided or two sided or if it is greater than or less than. The basic process to find the p-value is as follows:

Take your t test-statistic
Find your degrees of freedom (df) (n-1)
Enter the table on your df row.
Find the two values that sandwich your t test stat.
Follow those two columns down to the bottom.
Decide if you have a one-sided or two-sided test
Read the two p-value values off
Say "Number on right < P-value < Number on left"

I have an example on a previous post.

Conclude

Compare p-value with alpha
Reject/Fail to Reject Null (p-value < alpha, reject; p-value > alpha, fail to reject).
Conclude in context.

2. Confidence Interval Estimation

Definitions to KNOW:

What is a confidence interval: It is used to estimate the mean. Gives reasonable values for the mean, etc.
Confidence Level: If the procedure were repeated many times, confidence level is the percentage of INTERVALS we would expect to contain the true mean. (This is an important one. Realize what confidence level is NOT: It is NOT how often our interval will contain mu, or x-bar or the percentage of time we are right).

Margin of Error: the amount we expect our mean (mu) to differ from our sample mean (x-bar).

Procedure

Write the parameter, choose confidence level. Conditions are the same as for test of hypothesis.

Z confidence interval

Use equation x-bar +/- z* (sigma/sqrt(n)).

Finding z*

Go to the t-table
Find the row with your confidence level in it (top of chart)
Follow it down to the third row from the bottom labeled "z*".
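If you want to double check a z* you read off the table, here is a small sketch. The sample numbers are made up and the function names are my own. The z* values from `statistics.NormalDist` should agree with the table row after rounding.

```python
from math import sqrt
from statistics import NormalDist

def z_star(confidence):
    """Critical value with (1 - confidence)/2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def z_interval(x_bar, sigma, n, confidence):
    """x-bar +/- z* (sigma / sqrt(n))"""
    margin = z_star(confidence) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Common confidence levels and their z* values
for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))

# Hypothetical sample: x-bar = 100, sigma = 15, n = 25
low, high = z_interval(100, 15, 25, 0.95)
```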

t confidence interval

Use equation x-bar +/- t* (s/sqrt(n))

Finding t*

Find degrees of freedom
Go to where your df and your confidence level intersect
That is your t*

Conclude

Use the cookie-cutter answer to conclude for confidence intervals.

"We are _______% confident that the true mean _____________ lies between (_____,_____)"

a. Difference between Z and t tests and four step process

We basically stepped through the four step process for both confidence intervals and tests of significance above. Wouldn't hurt to go over the different sections, though.

Remember, we use a z-test if sigma (population standard deviation) is KNOWN, and we use a t-test if sigma is UNKNOWN.

For multiple choice, it's helpful to really remember whether you are using a t or z test. Remember, a z-test will give you a one-number p-value. A t-test will give you a range of values. Keep this in mind for your answers.

It may help you to remember the differences between the distributions (we talked about this briefly in class, check your notes). For example, the t-distribution has more area in the tails (less precise). As n increases, the t-distribution becomes more like the z-distribution in shape.

3. ANOVA (Analysis of Variance)

Anova is all about reading the output. Be sure you know how to:

Write Hypothesis
Find the p-value
check conditions
Conclude IN CONTEXT based on the confidence intervals

If you want to go through an example, you can see my post about Assignment 24.

SIDE TOPICS

1. Type I/ Type II Errors

You should know the definitions of these/ what they are. I can't really give you the graph on the blog, but check over your notes and be sure to understand/know where everything is on the graph.

Be able to know the connection between alpha and beta, and when we might make alpha big or small (relatively).

Type I Error: Rejecting a true null hypothesis

Type II Error: Failing to Reject a false null hypothesis

Power: The probability of rejecting a false null hypothesis (in other words, correctly accepting a true alternative hypothesis)

Also be able to do something like this:

Ho: Hillary is not addicted to "drawsomething"

Ha: Hillary is addicted to "drawsomething"

If she is not addicted, she gets to keep the app. If she is addicted, the app gets deleted. Be able to write the errors and power for something like that.

2. What type of procedure is this?

Be able to recognize procedures (aka is this a "one sample t-test" or a "two sample z-test" etc).

Think of the chart we drew in class. You can choose one thing from each column.

One Sample | z | test (of significance)

Two Sample | t | confidence interval estimation

Matched Pairs | |

3. Two-sided confidence interval estimation

I hesitate to even write about this. This was a question on the homework we didn't have time to go over in class. This is when you can use a confidence interval to prove/disprove a hypothesis.

BE AWARE: This can ONLY be used in a TWO-SIDED TEST. And it should ONLY be used when specifically asked for. You should never, ever conclude a confidence interval this way. Concluding confidence intervals means using the "cookie-cutter" answer. Only if they ask you to do this should you, regardless if it is a two-sided test or not.

Okay. Read that warning a few times. Make sure you understand.

Anyways, Let's say we have the following hypothesis:

Ho: Mu=15

Ha: Mu does not equal 15.

Since we have a two-sided test, we could use a confidence interval to decide whether to reject or fail to reject the null hypothesis.

For example, let's say we got the interval (20, 28). Because the purpose of a confidence interval is to estimate the true mean, what could we say in this case? Clearly, 15 is not in the interval. So we could reject the null in this case.

Let's say we got an interval of (13,20)? Well, because fifteen is IN the interval, there is not sufficient evidence that mu is NOT 15. Thus, we fail to reject the null.
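The decision rule for those two intervals fits in a few lines. This is only a sketch with a made-up function name, and it assumes the confidence level matches alpha (for example, a 95% interval with alpha = 0.05) and that the test is two-sided.

```python
def two_sided_decision(interval, null_value):
    """Decide a TWO-SIDED test from a confidence interval (only when asked to!)."""
    low, high = interval
    if low <= null_value <= high:
        return "fail to reject the null"
    return "reject the null"

# The two intervals from the example, with Ho: Mu = 15
first = two_sided_decision((20, 28), 15)
second = two_sided_decision((13, 20), 15)
print(first)
print(second)
```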

4. Sample Size Estimation

This is super easy. It's just an equation. You did it once on the homework. The equation you use is on the second row, far right side, of your equation sheet.

The only thing you need to know is that you always round UP. No matter what. The reason is that this is solving for a specific number of people you need to sample in order to meet specifications given to you. So if, for example, your equation gave you:

n= 24.3 people.

If you use 24 people, you don't meet the requirements. So you must round up to 25.
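Here is a sketch of the round-up rule. I am assuming the equation on your sheet is the usual one for estimating a mean, n = (z* sigma / m)^2, where m is the desired margin of error; check it against your equation sheet. The numbers are made up.

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, margin, confidence=0.95):
    """Assumed formula: n = (z* sigma / m)^2, always rounded UP."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_exact = (z_star * sigma / margin) ** 2
    return n_exact, ceil(n_exact)

# Hypothetical specs: sigma = 10, margin of error = 4, 95% confidence
n_exact, n_needed = sample_size(10, 4)
print(round(n_exact, 2), n_needed)
```

Even though the exact answer is barely over 24, we still need 25 people.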

5. Symbols

So symbols are on every exam. But a few of you brought them up in class, and since I like you guys, I'm going to do a section on them. Here are symbols you should know:

µ (possible answers: mean of the sampling dist. of x-bar, population mean)

σ (population standard deviation)

x-bar (sample mean)

σ/√n (standard deviation of the sampling distribution of x-bar)

s/√n (standard error of the sampling distribution of x-bar, or just standard error)

s (standard deviation of a sample)

This is not necessarily a comprehensive list, but it should help.

Finally, don't forget the written portion. Be sure you:

Go over assignment 21. It was there for a reason.
Know how to do the four step for both two-sample and matched pairs. (this includes things like, how does the parameter change? When do you do each one? What do you graph? What equations do you use? We went over all this in class).

Remember, this is not necessarily a comprehensive list, just things I think will help.

GOOD LUCK!!!

-Hillary

Monday, March 19, 2012

Assignment 24

We already went through a bulk of assignment 24, but I realize it was very rushed, so I want to go through an example that isn't on your homework (but will clearly help you do all of your homework.) It also may be a good idea to go through this example as an exam review if you have already finished the homework. Just try to answer the questions before looking at my answers.

Problem (in honor of this Thursday night's event):

The Capitol wants to know what the average survival time in hours of the tributes from each of the twelve districts will be for the 74th hunger games. Sample data was collected from a random sample of 20 hunger games. Data can be considered to be normally distributed.

1. What type of study is this?

Observational. The hunger games (the past 20 sampled) have already happened. We are observing. (Although the hunger games themselves are certainly not an observational study...)

2. What are the hypotheses of this test?

Ho: µ1=µ2=µ3=µ4=µ5=µ6=µ7=µ8=µ9=µ10=µ11=µ12

Ha: Not all means are the same.

Below is the ANOVA output.

3. Is the normality condition met?

Yes, stated in the problem. (could be met through graphing as well).

4. Is the randomization condition met?

Yes, stated in the problem.

5. Is the equal variance condition met?

The largest Standard Deviation is: 28. The Smallest standard deviation is: 20

28/20=1.4 < 2. This condition is met.
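The equal variance check is easy to sketch. Only the largest (28) and smallest (20) standard deviations come from the problem above; the middle values in the list are made up to stand in for the other districts.

```python
def equal_variance_ok(standard_deviations):
    """Largest standard deviation over smallest must be less than 2."""
    ratio = max(standard_deviations) / min(standard_deviations)
    return ratio, ratio < 2

# 28 and 20 are from the example; 24 and 22 are hypothetical fillers
ratio, condition_met = equal_variance_ok([28, 24, 22, 20])
print(ratio, condition_met)
```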

6. How does p-value compare to alpha?

We can find the p-value on the chart. It is 0. Because we have 95% confidence intervals, that gives us an alpha=0.05. Thus, 0<0.05, so our result is significant. Thus, at least one of the means differs.

7. What does this mean in context? (In other words, what do the confidence intervals tell us? or What means differ? or Are some districts likely to last longer? All of these ways are reasonable ways to ask the same question).

We can see that districts 3-11 have confidence intervals that significantly overlap. This shows that their true means are unlikely to differ by much. However, we can see that districts one and two's confidence intervals sit much higher, showing that on average they last much longer. District 12's confidence interval sits much lower, showing that on average its tributes last much less time in the arena.

May the odds ever be in your favor...(can you tell I'm excited?)

-Hillary

Assignment 23

Sorry this is late. Wrote about it but it must have not saved...

For the first few questions, think about what each part of the procedure does for you.

You can have a:

One sample: (one population, one group of data)

Two Sample: (two populations, a treatment and no treatment, two sets of data)

Matched Pairs: (two sets of data, yet both are on the same individuals).

AND

t-test: sigma unknown

z-test: sigma known

For questions 3-12, we are doing a two sample test because we have men and women scores for the same thing.

Remember, the parameter differs from a one sample. What is the word we need in there that makes it different? HINT: It starts with a "D".

Remember this question is a great way to prepare for the exam written part.

Good Luck!

-Hillary

Thursday, March 15, 2012

Assignment 22

Way to go today! I realize it was long and boring, but you guys made it through it! Great news though, we are back on schedule now which means more homework in class! But onto assignment 22.

Something I forgot to mention in class is that we call sigma/sqrt(n) the standard deviation of x-bar.

Well s/sqrt (n) = standard error.

Questions 2-5 are just a simple one-sample t-test, so that should be simple enough.

Questions 9-15: What kind of procedure is this?

In this case, they didn't necessarily give you the differences, but you can clearly see that each student has TWO measurements: two thighs were hit with tennis balls. (And of course, I gave you this example in class, so that might have helped :) )

Be careful on calculating the t-test statistic. Remember we only care about the differences. You need to use statcrunch to do this.

Good Luck!

Sunday, March 11, 2012

Assignment 21 + t-test Practice

We talked about this assignment 21 a lot on Thursday, so I don't feel the need to go over it a whole lot. We answered all of the "Plan" section, and even one of the answers to the "Plan" section of part B.

A common mistake on this one is the "list and state how the conditions are met". This means you must STATE the condition (like Normality:) then after the colon, state how it is met for this particular problem.

Remember how the conditions change slightly for a t-test.

On part B, be careful with the t-test confidence interval: we are using t*, not Z*, which I think is a common mistake.

Okay, now onto t-test practice so you can actually solve the problems!

Let's do an example!

Hillary thinks that the statistics department isn't correctly stating the actual amount of late-fee money they receive from Stat 121 students. They claim that on average each test gives them 5,000 dollars. Hillary takes a simple random sample of 10 different testing periods over the last five years and gets a mean of 5,800 dollars and standard deviation of 750. Alpha = 0.05. Assume test fees are normally distributed.

STATE: Is the true mean income earned by Stat 121 late fees greater than 5,000 dollars?

Okay. So there are a few things we notice here off the bat. First where is the standard deviation from? It says in the problem it is from the sample, meaning that we know S, not sigma. This means we will be doing a t-test. Also, the STATE lets us know what our hypothesis will end up being (greater than).

For the sake of this problem, we aren't going to go through the entire Plan or Solve steps, only because the point of this problem is to help you learn how to use the t-table.

Ho: Mu=5000
Ha: Mu > 5000

t = (5800 - 5000) / [750/sqrt(10)] = 3.37

Now we go to the t chart. We need one more thing though before we use it: degrees of freedom. Remember, df= n-1.

So in this case, df= 10-1 = 9.

Go to the df = 9 row in the t-table. Find the two values that sandwich our t value.

I see the t* values of 3.250 and 3.690.

I then trace my fingers down to the "one sided t test" row (because we are greater than) and read off the two p-value values: .005 and .0025

Thus I can say my p-value is: .0025 < p-value < .005.

Conclude as usual.
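Here is the same sandwich lookup as a sketch. The df = 9 row is typed in from a standard t-table (one-sided tail probability paired with its critical value); double check it against your own table. The function names are made up.

```python
from math import sqrt

# df = 9 row of a standard t-table: (one-sided tail probability, critical value)
T_TABLE_DF9 = [
    (0.25, 0.703), (0.20, 0.883), (0.15, 1.100), (0.10, 1.383),
    (0.05, 1.833), (0.025, 2.262), (0.02, 2.398), (0.01, 2.821),
    (0.005, 3.250), (0.0025, 3.690), (0.001, 4.297), (0.0005, 4.781),
]

def t_statistic(x_bar, mu0, s, n):
    return (x_bar - mu0) / (s / sqrt(n))

def bracket_p_value(t, table_row, two_sided=False):
    """Return (lower, upper) bounds on the p-value from one t-table row.

    For a one-sided test this assumes t points in the direction of Ha.
    """
    t = abs(t)
    factor = 2 if two_sided else 1
    lower, upper = 0.0, 0.5 * factor
    for tail, critical in table_row:
        if t >= critical:
            upper = tail * factor      # t is past this column: p is at most this
        else:
            lower = tail * factor      # first column t does not reach
            break
    return lower, upper

t = t_statistic(5800, 5000, 750, 10)
lower, upper = bracket_p_value(t, T_TABLE_DF9)
print(round(t, 2), lower, upper)
```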

Now let's try a t confidence interval.

The only thing that changes for a t test versus a z test is we are finding t* instead of z*.

Let's try finding t* using our problem above.

We need two things to find t*. One, degrees of freedom which we already found to be 9. Second is confidence level. Since we had an alpha= 0.05, it follows that our confidence level is 95%.

Now we simply find where our df and confidence level intersect. This is our t*.

From the table, I get t*= 2.262.
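To finish the thought, here is a sketch that plugs t* = 2.262 into x-bar +/- t* (s/sqrt(n)) using the numbers from the late-fee problem. The function name is my own.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_star):
    """x-bar +/- t* (s / sqrt(n)), with t* read from the t-table."""
    margin = t_star * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Late-fee example: x-bar = 5800, s = 750, n = 10, t* = 2.262 (df = 9, 95%)
low, high = t_interval(5800, 750, 10, 2.262)
print(round(low, 2), round(high, 2))
```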

Hope that helps! Remember don't wait to do assignment 21. The open lab will be overflowing on Wednesday. Get it done! As always if you have questions email me!

-Hillary

Assignment 20

Most of assignment 20 is stuff you are familiar with. Remember, since the first few questions are about "she knows the standard deviation should" we are still talking about sigma, thus doing z-tests. Try not to get confused! Otherwise, it's just the standard procedure.

Remember the definition for p-value:

"P-value is the probability of getting an x-bar as extreme or more extreme if the null hypothesis were true"

This definition has slightly more things you can see "subbed in" for. Let's try an example.

Let's do the example we talked about in class, the pink cookies from the vending machine. We get an x-bar of 650 calories, and we are testing:

Ho: Mu=600 cal
Ha: Mu>600 cal.

We calculate a p-value of 0.03. If the question asked us to interpret the p-value in context, we might say:

"The probability is 3% of getting a sample mean as high or higher than 650 calories if the true mean calories of the cookies was 600."

I highlighted the same colors of the sentence that correspond to the definition sentence. See how the main points are there and how you can recognize them? There are obviously different ways to re-arrange the sentence, but all the main parts have to be there.

The last questions (questions 8-11) are what we couldn't go over and you should have learned in class. Just some hints (we will go over what it REALLY means this Thursday.)

A type I error is REJECTING a TRUE null hypothesis.

A type II error is FAILING to REJECT a FALSE null hypothesis.

For example:

Ho: The cake is done.
Ha: The cake is not done.

In a type I error, we REJECT a null hypothesis that was actually TRUE. So, We would say that the cake is not done when the cake was actually done (meaning we left it in the oven and overcooked it).

In a type II error, we would take the cake out, but it wasn't done yet (because we failed to reject the null, but it was false).

alpha=probability of a type I error
beta=probability of a type II error.

I hope that helps you answer the questions, although it contains none of the explanation. I think it will make more sense once we go over it.

Good luck!
-Hillary

Monday, March 5, 2012

Assignment 18

Assignment 18 is testing your knowledge on all the vocabulary we talked about on Thursday dealing with tests of significance.

Don't get confused by the wording on question 1. You know what the mean and standard deviation are of a sampling distribution of x-bar: remember, we always assume the null hypothesis is true.

Remember: Test-statistic = z-score.

I realize in question 6 that they do not give you an alpha. But even without an alpha, you should be able to answer the question. Which p-value gives us more evidence against the null (helps us accept the alternative)? What does it mean when p-value is low? When p-value is high? How do we get these p-values? If our test statistic (z) is further from the mean, does that give us a high or low p-value? DRAW IT OUT ON A GRAPH! It will help. I promise.

The rest of the questions step you through the process we talked about at the end of class.

Be careful on p-value in question 12: Our alternative hypothesis is greater than. What proportion do we want from the table?

Good Luck!
-Hillary

Assignment 17

We pretty much did all of assignment 17 in class on Thursday. Something I didn't mention:

Statistically significant means the result is unlikely to have happened due to chance alone. In other words, if the p-value is less than alpha, and we reject the null hypothesis, our result was statistically significant.

-Hillary

Thursday, March 1, 2012

Assignment 16

I am SO SORRY this is so late. This assignment used to be Assignment 17 so I was expecting to be able to do more of it in class.

The key here is to remember that you only need to choose ONE POPULATION.

So, for example, you'd have one of these populations:

Middle-aged American women of a healthy weight and BMI that don't drink wine.

OR

Middle-aged American women of a healthy weight and BMI that do drink wine.

Once you choose one, roll with it for the rest of the time.

Thus when you write the parameter, just write about one of the populations (the one you chose). How does a parameter differ from the population? What word do we need to add? (Hint...it starts with "m").

Remember the difference between confidence level and confidence interval. Interval is what we actually report, LEVEL is that complicated definition we talked about. Here's a hint for question 5, you should NOT answer:

"That is the percentage of the time we will find mu in the interval". This is WRONG. Hopefully this WRONG answer will help you remember the right one :)

The last question we want to use our conclude cookie-cutter answer.

Good Luck!

Tuesday, February 21, 2012

Exam 2 Review!

It's that time- Exam time!

Main Topics (Aka BIG topics)
1. Regression
2. Two-Way Tables
3. Sampling Distribution of X-bar

Side Topics (Small topics)
1. Statistic versus Parameter
2. Control Charts
3. Response versus Explanatory Variables
4. Probability

1. REGRESSION

Main Point: Regression is all about knowing your vocab, vocab, vocab. And how to apply it of course.

Regression is on Bi-variate, quantitative data (if you don't know what that means, I'd look it up!) The whole point is to find out if there is an association between your two variables.

a. Scatterplots
Scatterplots have Strength, Form and Direction. Understand what each of these means and how to recognize them.

Form: Linear/non linear
Direction: Negative/positive
Strength: Correlation coefficient, r

Correlation Coefficient: Be able to recognize false statements about r, be able to guesstimate r based on a graph, and know how outliers affect r.
Things to remember about r:
-No units
-Only quantitative data
-Only linear data
-between -1 and 1
-affected by outliers

r-squared (r^2): "The percent variation in y, explained by x" See previous posts for a good explanation on this. Be able to "interpret in context" (i.e. replace the highlighted words) as well as recognize this definition to know when they WANT r-squared.

TEST YOURSELF: What is the difference between r and r^2? What do they tell us?

Least-Squares Regression Line: "Minimizing the sum of the squared residuals". Never hurts to know a definition. Basically, this is our best fit line that allows us to predict values based on the equation of the line. Which is:

y-hat = bx + a

Where y-hat is your predicted y, b=slope and a= y-intercept.

Realize you never have to come up with b and a by yourself, they will always be given in outputs. Practice knowing how to read these outputs!!

Slope: "Average change in y for every one unit increase in x". This is the big one! Know how to interpret in context (i.e., exchanging out the highlighted words) and know how to recognize the definition when they are asking just for the number. (Example: "What is the average change in farm population for every year that passes?" Do you see the "change in y" for a "unit increase" in x? They would just want the slope value, found from the output.)

Realize that it may not be WORD FOR WORD this definition. They can change up the order, for example, saying "for every one unit increase in x, what is the average change in y?" or, for a practical example, "For every year that passes, what is the change in farm population on average?" If you KNOW the definition, you should be able to recognize it.
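If it helps to see the line and the slope definition in action, here is a minimal Python sketch. The slope and intercept are made-up numbers (assumptions, not from any real output), and `predict` is just a name picked for illustration.

```python
# Hypothetical slope and intercept, as if read off a regression output.
b = 2.5    # slope (assumed value for illustration)
a = 10.0   # y-intercept (assumed value for illustration)

def predict(x):
    """Return y-hat, the predicted y, for a given x."""
    return b * x + a

y_hat_at_4 = predict(4)
y_hat_at_5 = predict(5)

# Increase x by ONE unit: the predicted y changes by exactly the slope.
change = y_hat_at_5 - y_hat_at_4

print(y_hat_at_4, y_hat_at_5, change)
```

Notice that `change` equals `b`. That is all the slope definition is saying.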

Summary: Know your definitions, because everything is given to you in the output. You merely need to recognize what they are asking. Don't panic. You know this stuff.

2. TWO-WAY TABLES

Main Point: This is about calculation. Two-way tables assess relationships (aka, association between variables) for bivariate categorical data.

Marginal Distributions
These deal with "overalls". They are exclusively in the margins. Don't leave the margins. Okay? Because they are distributions, expect more than one number.

Conditional Distributions
These are based on a condition, or rather, a specific. They are still distributions, but they are based on the "inside" numbers.

Conditional Values
These are a singular value based on two conditions. The first condition is the total that governs.
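Here is a rough Python sketch of the three calculations. The counts are completely made up (they are NOT the table from lab), and there are only two class years to keep it short.

```python
# Hypothetical counts: rows are class year, columns are whether the
# student mainly studies in the library. These numbers are invented.
counts = {
    "Freshman":  {"Yes": 30, "No": 20},
    "Sophomore": {"Yes": 10, "No": 40},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of year: row totals over the grand total.
marginal_year = {
    year: sum(row.values()) / grand_total for year, row in counts.items()
}

# Conditional distribution for sophomores: the "inside" numbers of the
# sophomore row over the sophomore total.
soph = counts["Sophomore"]
cond_given_soph = {answer: n / sum(soph.values()) for answer, n in soph.items()}

# Conditional value: of those who answered "Yes", what proportion are
# sophomores? The first condition ("Yes") gives the total that governs.
yes_total = sum(row["Yes"] for row in counts.values())
soph_given_yes = counts["Sophomore"]["Yes"] / yes_total

print(marginal_year)
print(cond_given_soph)
print(soph_given_yes)
```

The two distributions come back as several numbers each, while the conditional value is a single number.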

TEST YOURSELF
In case the above didn't make much sense, let's practice!

[Two-way table: year in school by "Study in the Library mainly"]

What is the marginal distribution of years?
What is the marginal distribution of people who study in the library?

What is the conditional distribution of sophomores?
What is the conditional distribution of people who DO study in the library?

For those who are seniors, how many study in the library?
For those who study in the library, how many are seniors?

Answers:
1. 0.1677, .3829, .2974, .1518
2. 0.48, .515

3. 0.5619, .438
4. 0.294, .444, .1503 .... (keep going)

5. 0.354
6. 0.111

Summary: Know your calculations.

3. Sampling Distributions of x-bar.

Main Point: This is probably the most concept-heavy portion of the second test, and thus, the thing people have the hardest time with. Know. These. Concepts. Understand them well, and you won't have a problem.

What is a sampling distribution of x-bar? It is the distribution of sample means from every possible sample of a particular size n.

Know the difference between a population graph and a sampling distribution of x-bar graph. (here's a hint to remind you: population = individual, S.D. of x-bar = sample mean).

Central Limit Theorem: For a non-normal population, if n >= 30, then the sampling distribution of x-bar is approximately normal. (As a side note: for an already normal population, the sampling distribution of x-bar is exactly normal for any n, not just n >= 30.)

Know how to use the formula associated with sampling distributions of x-bar, i.e. z = (x-bar - mu) / (sigma / sqrt(n)). Remember, you use it in the same way as the population z-score. But practice anyways.

Things to know: The standard deviation of a sampling distribution of x-bar is sigma/sqrt(n).

The mean of a sampling distribution of x-bar is ALWAYS. ALWAYS. ALWAYS = to mu. Always. Doesn't matter what n is. Always. Always.
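As a quick sketch of the sampling distribution formula, here it is in Python. Every number below is an assumption chosen so the arithmetic is clean. `NormalDist` from Python's standard library stands in for the z-table.

```python
from math import sqrt
from statistics import NormalDist

mu = 100.0      # population mean (assumed)
sigma = 15.0    # population standard deviation (assumed)
n = 36          # sample size (assumed)
x_bar = 105.0   # observed sample mean (assumed)

# The "new" standard deviation for a sampling distribution of x-bar.
sd_of_x_bar = sigma / sqrt(n)

# Same idea as the population z-score, but with sigma/sqrt(n) underneath.
z = (x_bar - mu) / sd_of_x_bar

# Proportion of sample means below x_bar (area to the left of z).
prop_below = NormalDist().cdf(z)

print(sd_of_x_bar, z, round(prop_below, 4))
```

The mean in the numerator is still mu. Only the standard deviation changes when you move from individuals to sample means.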

TEST YOURSELF:What happens to the following graphs?

Let's start with a population of 15,000 that is severely left skewed.

If I take a sample of 800 people and graph it, what does it look like?
If I take all possible samples of size n=10 and graph it, what does it look like?
If I take all possible samples of size n=600 and graph it, what does it look like?

Answers:
1. The key here is that I only took ONE SAMPLE. Remember, the CLT only applies to sampling distributions of x-bar. So, we would expect it to look like the population.

2. This is a sampling distribution of x-bar; however, n<30, so the CLT doesn't apply. This graph would be closer to normal than the population graph, but it cannot be considered normal because the CLT doesn't apply.

3. This is a sampling distribution of x-bar and n>30, so the CLT applies and we have an approximately normal distribution.
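If you want to see these three answers happen, here is a rough simulation sketch in Python. Everything in it is an assumption for illustration: the population is invented, `skewness` is a homemade helper, and 1,000 random samples stand in for "all possible samples" (which would be far too many to list).

```python
import random
from statistics import fmean

random.seed(121)  # fixed seed so the sketch is repeatable

# Hypothetical severely left-skewed population of 15,000 values.
population = [10 - random.expovariate(1.0) for _ in range(15_000)]

def skewness(values):
    """Rough skewness: negative = left skewed, near 0 = roughly symmetric."""
    m = fmean(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

def sample_means(n, repeats=1000):
    """Means of many random samples of size n (stand-in for 'all possible')."""
    return [fmean(random.sample(population, n)) for _ in range(repeats)]

one_sample = random.sample(population, 800)  # ONE sample of individuals
means_10 = sample_means(10)                  # sampling distribution, n < 30
means_600 = sample_means(600)                # sampling distribution, n >= 30

print("population skew:     ", round(skewness(population), 2))
print("one sample of 800:   ", round(skewness(one_sample), 2))
print("sample means, n=10:  ", round(skewness(means_10), 2))
print("sample means, n=600: ", round(skewness(means_600), 2))
print("mean of population vs. mean of sample means:",
      round(fmean(population), 2), round(fmean(means_600), 2))
```

The single sample of 800 should stay about as skewed as the population, while the n=600 sample means should come out close to symmetric and centered at the population mean.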

As a side note, don't forget about the law of large numbers, which only deals with samples (not sampling distributions of x-bar). Check out your notes on that one.

Summary: Again this is concept heavy. Of course, it does require some calculations in way of the formula, but if you were okay with these on the first exam, you should be okay now.

4. THE SMALL STUFF

Be sure to go over your definitions for:

statistic versus parameters (be able to tell the difference)

Control Charts (remember the equations for the upper and lower limits aren't on your equation sheet. Memorize them!)

Probability (know facts about the probability distribution and what it takes to have a proper one)

Response versus explanatory variables (be able to tell the difference).

And of course, brush up on knowing the different experimental designs!

GOOD LUCK!!
-Hillary

Need more Help?

I have another TA friend who writes a fantastic website sort of like this blog that is a great resource if you need more help! While my blog is homework focused, hers is more concept focused. If you have a concept you are struggling with, more often than not you will be able to find great example problems and PowerPoint slides on the topic. Check Kiya's website out!

https://sites.google.com/site/kiyabyustat/

I've also listed a link to her site in the sidebar :)

Happy Stat 121-ing!

-Hillary

Assignment 14

We will be going over Control Charts on Thursday (and doing questions 1-2).

Questions 5-7 are a great review on the difference between population distributions and sampling distributions of x-bar. Remember, if I were to take out ONE data point from a population graph, what would it represent? What about if I were to take it out of a sampling distribution of x-bar? What would it represent? (This should help with questions 6 and 7).

Questions 8-10 are a simple review on statistics versus parameters. I think question 10 poses the most challenge. Remember your key words to know if it is talking about a parameter. (Hint: things that are "known" or about "all" are generally parameters).

Finally, questions 11-13 were discussed at the end of lab last week. This is relating the central limit theorem and how graph shapes will change (as well as the mean). Look over your class notes! (PS- These are VERY helpful questions to know for the exam! Be sure to understand them.)

-Hillary

Assignment 13

Sorry this is so late folks! My computer broke over the weekend which makes it a little difficult to write blog posts!

Luckily if you were in lab, Assignment 13 shouldn't have posed too much of a problem.

Questions 2-4: When you are doing these, remember what our "new" standard deviation is (aka what standard deviation is for a sampling distribution of x-bar). Particularly on question 3, think about what you are solving for, and where this symbol appears in your equation. On this question, you won't be using an entire formula from your equation sheet. Adapt!

Questions 5-11 test your knowledge on the difference between the graph of an individual (population) or of a sample mean (sampling distribution of x-bar). Be careful what equation you use!!

Question 14 is probably the hardest for students, but you definitely know how to do it! The key for this problem is labeling what you know. Write down on your paper the equation. Then list the variables you have to fill:

mu, sigma, n, x-bar and z.

We clearly are solving for z to get a proportion. That means the values for mu, sigma, n and x-bar are given somewhere in this question. Find them! See how much easier this problem becomes once you label what you have? Then it is just plug and chug.

HINT: If it asks the probability that the company's average loss will not exceed, we are looking for the left proportion :) (less than).

Monday, February 13, 2012

Assignment 12

Probability is a pretty easy concept that I do not want to spend too much time on in class, since it will make up very little of your exam AND most all of you have seen these concepts before.

The probability of an event can be between 0 and 1. This makes sense. The lowest a probability can be is a 0% chance of something happening. And nothing can have more than a 100% chance of happening.

If there is a distribution of probabilities, they should all add up to 1. Again, this makes sense. As an example, let's say we got the probability of college students at BYU having 0, 1, 2, 3, 4+ roommates. The distribution may look like:

Roommates   | 0   | 1   | 2   | 3   | 4+  |
Probability | .05 | .10 | .10 | .30 | .45 |

As you notice, .05+.10+.10+.30+.45 = 1. This is because at BYU, you HAVE to have one of those options (you cannot have less than 0 roommates, and I have covered everything in "4+"). So it has to encompass 100%.
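Those two rules (each probability between 0 and 1, and the whole distribution adding up to 1) can be checked with a few lines of Python. This is only a sketch using the roommate numbers above. The variable names are mine.

```python
from math import isclose

# The roommate distribution from the example above.
roommates = {"0": 0.05, "1": 0.10, "2": 0.10, "3": 0.30, "4+": 0.45}

# Rule 1: every probability is between 0 and 1.
each_between_0_and_1 = all(0 <= p <= 1 for p in roommates.values())

# Rule 2: the probabilities add up to 1 (isclose guards against
# tiny floating-point rounding in the sum).
total = sum(roommates.values())
adds_to_one = isclose(total, 1.0)

print(each_between_0_and_1, adds_to_one)
```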

Questions 1-4
Pretty straightforward. Choose the proportion that makes the most sense.

Questions 5-7
The key to these questions is writing out the right possibilities. Make sure you get every possible combination. I'll give you the first FOUR as a hint...but you need to come up with the rest.

GGG
GGB
GBG
BGG

The hard part for most people about these questions is the whole "x=2". X stands for the number of girls a couple has. So "x=2" is asking: in how many arrangements are there exactly two girls, no more, no less?

For question 7, remember all you know about probability. What are all the possible values of "X" (the number of girls in the combination)? Look at your arrangements. Make sure the probabilities add up to 1!
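To show the idea without giving away the homework, here is a small Python sketch for a couple with only TWO children (a smaller case than the assignment). It assumes every arrangement is equally likely, which is my assumption for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

children = 2  # deliberately smaller than the homework case

# Every possible arrangement of girls (G) and boys (B).
arrangements = ["".join(p) for p in product("GB", repeat=children)]

# X = number of girls in each arrangement.
girls_counts = Counter(a.count("G") for a in arrangements)

# Probability of each value of X, assuming equally likely arrangements.
distribution = {
    x: Fraction(count, len(arrangements))
    for x, count in sorted(girls_counts.items())
}

print(arrangements)
print(distribution)
print(sum(distribution.values()))  # a proper distribution adds up to 1
```

The same steps work for the homework: list every arrangement, count the girls in each one, then turn the counts into proportions.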

The rest of the assignment is about parameter versus statistics, experimental design and association versus causation. We have gone over the last two extensively in class, and they are good reviews for the exam! Statistic versus parameter we will discuss in class.

Good Luck!

Assignment 11

We went over most of this assignment in my first lab, and the WHOLE assignment in my second lab, so you should be well equipped for this!

For questions 8-13, Remember that marginal distributions deal with the MARGINS, so they are only total rows. (You'll notice you have to compute the total values by yourself).

Conditional distributions are based on one condition. Block out the row/column you are interested in. Remember, that will be the SPECIFIC. (For example, "whether you buy or not" is not a specific since there are nonbuyers and buyers. "Higher" IS a specific, because it isn't just "quality".)

Remember: Be careful with the word CAUSE. What does that mean? When can we conclude causation?

Wednesday, February 8, 2012

Assignment 10

We went over everything you need for assignment ten last week, so here is some help. We won't be going over it this Thursday in Lab.

In one of my labs, I wasn't able to get to "r^2" (r -squared). The definition of r-squared is as follows:

"The percent variation in y, explained by x".

Realize that it is a percentage basically describing how much our x variable (explanatory) explains or describes our y variable (response). For example, Let's use house price versus house size again. You can probably imagine what this would look like (Draw it if it will help). House size is our explanatory variable and price is our response. It has a positive relationship because as house size increases, so does house price.

Now, let's say our r-value (correlation) for this is .8. To get r^2, we just square it. Thus, we get .64. r^2 is usually reported as a percentage, so we would say 64%.

According to the definition, (The percent variation in y, explained by x), this means "64% of variation in house price is explained by how big your house is".

This probably makes sense. A lot of how expensive our house is is because of the size. But the other 36% could be explained by location, schools nearby, property, newness, etc. This should help with problem 6.
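The arithmetic in the house example is short enough to sketch in Python (the variable names are just for illustration):

```python
r = 0.8  # correlation between house size and house price, from the example

r_squared = r ** 2

# Percent of variation in price explained by size, and what is left over.
percent_explained = round(r_squared * 100)
percent_unexplained = 100 - percent_explained

print(percent_explained, percent_unexplained)
```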

Problem 7 is the weird one I told you about. I'll step you through it. Remember, you'll never have to do this again.

You'll notice we have the variables "Sy and Sx" and "Y-bar and X-bar". Sx and x-bar refer to the standard deviation and mean of the x, or explanatory, variable. That means Sy and Y-Bar refer to the standard deviation and mean of y, or response, variable.

Looking at the problem, which is the explanatory and which is the response variable? Try to figure it out on your own first.

Did you get that the wife's height is the explanatory and the husband's height is the response? The clue here was that we were using the "regression line to predict the husband's height from the wife's height".

Knowing that, then it becomes easy. Sx=2.7 and x-bar=64. Sy=2.8 and y-bar=69.3. r=correlation coefficient, which is given.

Solve for b first, then plug it into the next equation.
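Here is a sketch of that two-step calculation in Python. It assumes the two equations are the usual b = r(Sy/Sx) and a = y-bar - b(x-bar). The value of r below is a placeholder I made up, NOT the one from the homework, so plug in the r you are given.

```python
# Means and standard deviations from the problem.
s_x, x_bar = 2.7, 64.0    # wife's height (explanatory)
s_y, y_bar = 2.8, 69.3    # husband's height (response)

r = 0.5  # HYPOTHETICAL correlation, for illustration only

# Step 1: slope. Assumed formula: b = r * (Sy / Sx).
b = r * s_y / s_x

# Step 2: intercept. Assumed formula: a = y-bar - b * x-bar.
a = y_bar - b * x_bar

print(round(b, 4), round(a, 4))

# Sanity check: this line always passes through (x-bar, y-bar).
print(round(b * x_bar + a, 1))
```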

For questions 9-12, make sure you follow the StatCrunch instructions on Blackboard. They will help you produce an output that will make answering these questions easy.

Question 12 is asking about something called "extrapolation" which you should have learned in lecture. Extrapolation means trying to predict a y for an x outside of the range of your data. For example, let's use the example of credit hours versus hours of sleep at night.

Let's say we only collected data up until 15 credit hours. We could NOT use our line to predict someone who was taking 18 credit hours. Why not? Because we would not know what the regression line was doing after 15 credit hours.
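One way to picture the rule is a tiny Python check. The function name `is_extrapolation` and the credit-hour values are made up for illustration.

```python
def is_extrapolation(x_new, x_values):
    """True when x_new falls outside the range of x values we collected."""
    return x_new < min(x_values) or x_new > max(x_values)

# Hypothetical credit hours in our data (nothing above 15).
credit_hours = [6, 9, 12, 14, 15]

print(is_extrapolation(18, credit_hours))  # outside the data: don't predict
print(is_extrapolation(12, credit_hours))  # inside the data: prediction is fair game
```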

Be careful on Question 14 - make sure you are talking about differences in the students not the environments.

Good Luck!
-Hillary

Saturday, February 4, 2012

Assignment 9

We talked about most of this in class. Be careful on number 5: how are slope and correlation related? Think about it. Is it possible that a graph could have points NOT close together and another graph have points that ARE close together, yet the "best fit line" had the same slope?

Another hint is that r values can only take on certain numbers. What number is the slope?

Questions 7-9 we talked about in class. Look back to your notes on correlation, "r". Each one of the rules we discussed falls into one of these phrases.
Correlation coefficient:

has no units
is affected by outliers
can only be between quantitative variables
only describes linear relationships
is between -1 and 1

Question 12 uses the slope definition. Here is a reminder of that definition:

"The average change in y for every one unit increase in x".

All of the red variables can be exchanged for the specific circumstance.

Good Luck!
-Hillary

Monday, January 30, 2012

Exam 1 Review

Here are some things to focus on for exam 1. This is not meant to be a comprehensive list. It is merely trying to help you focus on some important things to study.

1. Definitions

Definitions are a huge part of this exam. Be sure to understand them, not just memorize them. Some definitions to know:

Population versus Sample (can you identify the sample/population?)
Experiment versus Observational study (How can you tell? Be sure to know! Experiment = treatments applied, observational study = just looking at something that has already happened)
A control / comparison
Replication (be careful with this one!)
Explanatory Variable versus Response Variable (explanatory = treatments, response = a measurable thing on the individual)

2. Randomization

This is a huge part of all your tests. KNOW the randomized designs and how to recognize them.

Know the difference between RANDOMIZED SAMPLING and RANDOMIZED DESIGNS.
Randomized Sampling includes: SRS (Simple Random Sample), Stratified Sampling, and Multi-Stage Sampling
Randomized Experimental Designs: CRD (completely randomized design), Block Design and Matched Pairs.
Realize that if it's an experiment, we care more about the randomized experimental design than the sampling design.

3. Graphing

Know what types of graphs are categorical and quantitative (and which ones we like and how to recognize them)
Five number summary. How to find it, how to recognize it on a box plot. (The percentage between Q1 and Q3 is....)
Shape, Center and Spread of a graph. (Shape: Skewed right/left etc, Center: mean/median and when to use them, and Spread: IQR and St. Deviation. Know things about both.)
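For the five number summary, here is a minimal Python sketch on made-up data. One caution: software packages and textbooks use slightly different rules for quartiles, so the `method="inclusive"` choice here is an assumption and may not match your calculator on every data set.

```python
from statistics import quantiles

# Hypothetical data set, already sorted to make it easy to follow.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Q1, median, Q3 (the three cut points that split the data into quarters).
q1, med, q3 = quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1  # spread measure that goes with the median

print(five_number)
print(iqr)
```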

4 . Z-Scores!

Check my post below for a comprehensive review on z-score equation problems.

Remember, many things are concept based, so make sure you understand why, not just how.

Good Luck!
-Hillary

Z-Score Review

Z-scores are going to be an important part of the upcoming test (well and the rest of the semester). The way I see it, there are three main types of z-score problems.

1. A "higher than" or "lower than" Problem: These types of problems give you an "x" and want you to find the percentage of something higher or lower than that value.

2. An "in between" problem: These types of problems want you to find the percentage or proportion between two x-values.

3. A "give you the percentage/proportion" problem: These problems give you a proportion and would like you to work backwards to solve for x.

Using the examples in class (although the numbers may be slightly different), I'll give you a problem of each type. We talked about Jimmer's statistics with the Kings according to ESPN.

Mu: 8.8 ppg (points per game)
Sigma: 1.4 ppg

Type 1: Find the proportion of games that Jimmer scores above 11 ppg.

Type 2: Find the proportion of games that Jimmer scores between 7 and 11 ppg.

Type 3: Find the threshold ppg for the top 5% of all of Jimmer's games.

Solution:

Type 1. This question is asking us for a proportion above. Thus, we plug in the numbers to our z-score equation: z= (11-8.8)/1.4 = 1.57
We take this z-score and look it up on the z-table. From the table, we read: 0.9418. But since we are asking for above and we know the z-table only gives us the proportion to the left, or underneath, we subtract the proportion from one. Thus, our answer is 1-0.9418 = .0582, or 5.82%.
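If you want to double-check a Type 1 answer, here is a small Python sketch. `NormalDist` from the standard library plays the role of the z-table, and rounding z to two decimals is my way of mimicking a table lookup.

```python
from statistics import NormalDist

mu, sigma = 8.8, 1.4   # Jimmer's mean and standard deviation (ppg)
x = 11

# z-score, rounded to two decimals the way we would for the table.
z = round((x - mu) / sigma, 2)

# The table (and cdf) gives the area to the LEFT of z.
left = NormalDist().cdf(z)

# "Above" means subtract from 1.
above = 1 - left

print(z, round(left, 4), round(above, 4))
```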

Type 2. This question is asking for us to find the proportion between. This might seem difficult, but if you draw it out it will make more sense. The very first thing you need to do in between problems is do two separate z-score equations for both x-values. In this case, our x-values are 7 and 11. Since we already did the problem for 11, all we need is to do it for 7. So, z=(7-8.8)/1.4 = -1.29. Looking this up on the z-table, we get the proportion of 0.0985.

Now think of the graph. We can get the area to the left of 7, and the area to the left of 11. Draw that out on a piece of paper on a normal graph. See what we have to do? It's clear from the graph we just need to subtract the smaller proportion from the larger proportion.

So, 0.9418-0.0985= 0.8433 = 84.33%
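The same in-between steps can be sketched in Python, again using `NormalDist` as a stand-in for the z-table and rounding each z to two decimals like a table lookup:

```python
from statistics import NormalDist

mu, sigma = 8.8, 1.4   # Jimmer's mean and standard deviation (ppg)

# Two separate z-scores, one for each x-value.
z_low = round((7 - mu) / sigma, 2)
z_high = round((11 - mu) / sigma, 2)

# Area to the left of each z-score.
left_of_low = NormalDist().cdf(z_low)
left_of_high = NormalDist().cdf(z_high)

# Subtract PROPORTIONS (never z-scores): larger area minus smaller area.
between = left_of_high - left_of_low

print(z_low, z_high, round(between, 4))
```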

Type 3: The last type is typically the hardest. They are giving us a proportion. Where do we find proportions? That's right, in the middle of the z-table. We must work backwards: we use the proportion to find a z-score, then solve our equation for x.

This one is even trickier though. Because we want the top five percent, we have to remember to look up what the area to the LEFT is, since that is what the table tells us. Thus, we look 95% up in the table. The closest proportion I can find is .9505 (you could also use .9495, going above or below doesn't matter, as long as it's the closest). This corresponds to a z-score of 1.65.

Plugging it into my equation I get 1.65= (x-8.8)/1.4. x= 11.11 ppg.
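Working backwards can also be sketched in Python. Here `inv_cdf` does the "find the proportion in the middle of the table" step. It returns an exact z rather than the closest table entry, so expect a slightly different answer than the table-based 11.11.

```python
from statistics import NormalDist

mu, sigma = 8.8, 1.4   # Jimmer's mean and standard deviation (ppg)

top = 0.05             # we want the top 5%
left_area = 1 - top    # the table works with area to the LEFT

# Work backwards: proportion -> z-score.
z = NormalDist().inv_cdf(left_area)

# Then solve z = (x - mu) / sigma for x.
x = mu + z * sigma

print(round(z, 3), round(x, 2))
```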

Hopefully that helps with the concept. Here are some more practice questions below I'd try out (the numbers don't correspond with the type. Try to figure out what type they are for yourself). The answers are listed after with some tips if you got it wrong.

Let's look at average months dating to engagement time at BYU. Let's say the average is 5 months with a standard deviation of 2 months. Find the following.

1. The number of months until engaged that are in the bottom 10 percent of BYU students.

2. The percentage of students who get engaged at 3 months of dating or less.

3. The proportion of students who get engaged at 9 months or more.

4. The number of months that are in the top two percent of students.

5. The proportion of students who get engaged between 4 and 13 months.

Try these on your own! But here are the answers:

1. Type 3 problem. -1.28 is the z-score from the table, so the x-value you get is : 2.44 months.

2. Type 1 problem. Since it is "or less" you keep the proportion from the table. Z-score= -1. Answer: 0.1587.

3. Type 1 problem. This is an "or more" so you need to subtract the proportion from 1. Z-score= 2. Answer: 0.0228.

4. Type 3 problem: This is "in the top". So we look up 98% in the table. The closest proportion is .9798, which gives a z-score of 2.05. Answer = 5 + 2.05(2) = 9.1 months.

5. Type 2 problem. Z-score for 4: -0.5, proportion: 0.3085. Z-score for 13: 4. Proportion:... Wait...what? four? But that isn't on our table!! That's okay. What is the proportion for four? It just means that EVERYTHING is under it on the graph. Meaning our proportion is 1 (the whole area). Similarly, if we got a z-score of -4, we would assume what? (no area, so = 0).

So, now we subtract. 1- 0.3085 = 0.6915 = 69.15%.

Some things to remember:

NEVER. NEVER EVER. Subtract z-scores. Or add them. ONLY SUBTRACT OR ADD PROPORTIONS.

Be careful about "above" or "below". Know what the table shows you. Draw pictures if in doubt.

Good Luck!