# stats final

:(

## Terms

extreme conformer
outlier that doesn't influence the model but does inflate the R²
pie chart
shows how a whole divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
case
an individual about whom we have data (row)
confounding variable
some other variable associated with a factor has an effect on the response variable
uniform
a distribution that is roughly flat
disjoint
two events that share no outcomes in common, mutually exclusive
nonresponse
when a large fraction of those sampled do not respond
observational study
a study based on data in which no manipulation of factors has been employed
outliers
extreme values that don't appear to belong with the rest of the data
spread
a numerical summary of how tightly the values are clustered around the center
something has to happen rule
the sum of the probabilities of all possible outcomes must be 1
high R squared
does not demonstrate the appropriateness of the regression
cluster samples
these randomly select among heterogeneous subgroups that each resemble the population at large, making our sampling tasks more manageable
randomize, replicate, block, control
principles of experimental design
histogram
uses adjacent, equal-width bars to show the distribution of values in a quantitative variable
independent variables
the conditional distribution of one variable is the same for each category of the other
statistic
any summary calculated from the sampled data (Latin)
bias
Any systematic failure of a sampling method to represent its population
probability
the proportion of times the event occurs in many repeated trials of a random phenomenon (the long-term relative frequency of an event)
sample survey
a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population; a poll
complement rule
the probability of an event occurring is 1 minus the probability that it doesn't occur
re-expression
a means of altering the data to achieve the conditions necessary to utilize particular summaries or models
does not change
quantitative variables, straightness, no outliers
3 conditions needed for correlation
timeplot
displays quantitative data collected over time (x-axis); can reveal trends that are ignored by box-plots and stem-and-leaf plots
extrapolation
unreliable predictions of y-values based on x-values outside the range of the original data
undercoverage
when individuals from a subgroup of the population are selected less often than they should be
experimental units
individuals on whom an experiment is performed
normal probability plot
precise method of checking the nearly normal condition
residuals scatterplot
checking if linear regression models are straight enough
mean and SD (value based)
when describing the distribution of a quantitative variable, if the shape is symmetric, report....
independent
P(B|A) = P(B)
sample size
what the precision of the statistics of a sample depends on
randomness
our greatest enemy and our most important tool
Q1
median of the lower half (25%)
trial
the sequence of several components representing events
placebo
A (fake) treatment known to have no effect, administered so that all groups experience the same conditions
lurking variable
Creates an association between two other variables that tempts us to think that one may cause the other
factor
a variable whose levels are controlled by the experimenter
data table
an arrangement of data in which each row represents a case and each column represents a variable
a valid experiment
to prove a cause-and-effect relationship, we need to perform
response
for scatterplots, the _____ variable is plotted on the y-axis
reverse conditioning
using a tree diagram to find P(A ∩ B) and P(A) when you have P(A|B) but want P(B|A), since P(B|A) = P(A ∩ B)/P(A)
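
A minimal sketch of reverse conditioning, following the two branches of a tree diagram. The function name `reverse_condition` and the probabilities in the usage lines are made up for illustration; they are not from these notes.

```python
def reverse_condition(p_b, p_a_given_b, p_a_given_not_b):
    """Return P(B|A) from P(B), P(A|B), and P(A|not B)."""
    p_a_and_b = p_b * p_a_given_b                  # branch: B, then A
    p_a_and_not_b = (1 - p_b) * p_a_given_not_b    # branch: not B, then A
    p_a = p_a_and_b + p_a_and_not_b                # the two branches are disjoint
    return p_a_and_b / p_a                         # P(B|A) = P(A and B) / P(A)


# Hypothetical inputs: P(B) = 0.01, P(A|B) = 0.95, P(A|not B) = 0.05
p_b_given_a = reverse_condition(0.01, 0.95, 0.05)
print(round(p_b_given_a, 3))
```

Even with a high P(A|B), P(B|A) can be small when B itself is rare, which is why the two should not be confused.
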
inflection point
the ______ of a normal curve identifies one standard deviation from the mean
response variable
the result of each trial with respect to what we were interested in
IQR
Q3-Q1, the middle half of the data
68, 95, 99.7 rule
__% of the data fall w/in 1 standard deviation of the mean, about ___% w/in 2, and __% w/in 3.
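
The rule can be checked against the normal model directly. This sketch assumes an exactly normal distribution and uses the identity P(|z| < k) = erf(k/√2); the helper name `normal_within` is hypothetical.

```python
import math


def normal_within(k):
    """Proportion of a normal model within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Print the percentage within k SDs, rounded to one decimal place
    print(k, round(100 * normal_within(k), 1))
```
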
median
middle value
center is shifted to 0, standard deviation is rescaled to 1
effect of standardizing (normal model)
regression line
unique line that minimizes the sum of the squared residuals
scatterplot form straighter, scatterplot scatter more consistent, histogram distribution more symmetric, boxplots spread more similar
four reasons to consider re-expression
deviation
how far each data value is from the mean
large residual
outlier that might not influence model much but isn't consistent with the overall form
regression to the mean
predicted y hat tends to be fewer SD from its mean than its corresponding x was from its mean
sample
a (representative) subset of a population, examined in hope of learning about the population
union +
with probabilities, "or" is the _____ of two events and translates into _
comparative, double-blinded, placebo-controlled, randomized
the best experiment
boxplot
displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values (effective for comparing groups)
blinding
Individuals associated with an experiment are not aware of how subjects have been allocated to treatment groups
sample space
the collection of all possible outcomes
control group
the group of experimental units assigned to a baseline treatment level (default or placebo)
standardizing the data
units can be eliminated by...
response bias
when respondents' answers may be affected by survey design
convenience
when the sample is comprised of individuals readily available
variation
Q3
median of the upper half (75%)
bar chart
shows bars representing the count of each category in a categorical variable
symmetric
a distribution where the two halves on either side of the center look approximately like mirror images of each other
census
a sample that consists of the entire population
reducing bias
often the best use of time and resources when sampling or surveying
mean
average
median and IQR
when describing the distribution of a quantitative variable, if the shape is skewed, report....
representative sample
Statistics computed from it accurately reflect the corresponding population parameters
dotplot
graphs a dot for each case against a single axis
statistically significant
when an observed difference is too large for us to believe it is likely to have occurred by chance
statistics, latin
summaries of the data denoted with ____ letters (mean: x̄, SD: s)
complement
with probabilities, "not" and "at least" indicate ________
treatment
process, intervention, or other controlled circumstance applied to randomly assigned experimental units
time series
measurements of a variable taken at regular time intervals
simulation
a sequence of random outcomes that model a situation; an artificial representation of a random process used to study long term effects
an experiment
manipulates factor levels to create treatments, randomly assigns subjects, compares the responses of the subject groups
randomization
the best defense against bias
segmented bar chart
a stacked relative frequency bar chart
center
a typical value that attempts to summarize the entire distribution with a single number
high leverage points
have x-values far from the mean of x and pull more strongly on the regression line
placebo effect
The tendency of many human subjects (often 20% or more of experimental subjects) to show a response even when administered a placebo
lurking variable
a variable other than x and y that simultaneously affects both variables (background variable)
inferring causation, extrapolation, outliers and influential points, change in scatterplot pattern, summary data
what can go wrong with regression
stratified samples
these can reduce sampling variability by identifying homogeneous subgroups and then randomly sampling within each
trial
a single attempt or realization of a random phenomenon
Max, Q3, Median, Q1, Min
5-number summary
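
A rough sketch of the 5-number summary and IQR, using the "median of the lower/upper half" definition of Q1 and Q3 from these notes. It assumes the middle value is left out of both halves when the count is odd; some textbooks and software use other quartile rules, so results can differ slightly. The function names are made up for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # lower half of the sorted data
    upper = data[half + n % 2:]     # upper half; skips the middle value if n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])


def iqr(values):
    """IQR = Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))
```
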
population
the entire group of individuals or instances about whom we hope to learn
independent
the outcome of one trial doesn't influence or change the outcome of another
range
max-min data values
sampling variability
the natural tendency of randomly drawn samples to differ from each other
scatterplot
shows relationship between two quantitative variables on the same cases
outcome
the value measured, observed, or reported for each trial
variable
holds information about the same characteristic for many cases (column)
sampling frame
a list of individuals, which defines but may not be representative of the entire population, from which the sample is drawn
event
a combination of outcomes usually for the purpose of attaching a probability to them
association
a deliberately vague term describing the relationship between two variables
shape
uniform, single, multiple modes; symmetry vs skewed
boring: no direction, shape, outliers
a good scatterplot of the residuals v the x-values is...
correlation
the strength and direction of the linear association between two quantitative variables
random phenomena
the rules and concepts of probability that give us a language to talk and think about ________
prospective
Subjects are followed to observe future outcomes
conditional distribution
the distribution of a variable restricting the Who to consider only a small group of individuals
matching
in a retrospective or prospective study, subjects who are similar in ways not under study may be paired and then compared with each other on the variables of interest as a way to reduce unwanted variation
multiplies (divides)
multiplying/dividing every data value by the same constant __________ measures of position/center and _________ measures of spread by that constant
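
A quick check of the rescaling rule with made-up data (heights converted from inches to centimeters). The list `heights_in` and the helper `rescale` are hypothetical.

```python
from statistics import mean, median, stdev


def rescale(values, factor):
    """Multiply every data value by the same constant."""
    return [factor * x for x in values]


heights_in = [60, 62, 65, 68, 70]        # made-up values, in inches
heights_cm = rescale(heights_in, 2.54)   # same data, in centimeters

# Center (mean, median) and spread (SD) are all multiplied by 2.54
print(mean(heights_cm) / mean(heights_in))
print(median(heights_cm) / median(heights_in))
print(stdev(heights_cm) / stdev(heights_in))
```
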
multistage sample
a scheme that combines several sampling methods
intersection x
with probabilities, "and" is the ______ of the two events and translates into _
events are disjoint
P(B|A) = 0
individual
object described by a set of data
the law of averages
assumes that the more something hasn't happened the more likely it becomes
response
a variable whose values are compared across different treatments
relative frequencies
a casual term for probability
levels
specific values that the experimenter chooses for a factor
quantitative, straight, outlier
correlation/linear regression model conditions
percents
relationships among categorical variables are described by calculating _____ from the counts given to prevent variation.
clear, concise, complete, in context
4 Cs: conclusions are...
voluntary response
individuals choose whether to respond on their own
standardizing
uses standard deviation as a ruler to measure distance from the mean creating z-scores
Simpson's paradox
when averages are taken across different groups, they can appear to contradict the overall averages
Venn diagrams and two-way contingency tables
______ and ______ should be used to display the sample space and help probability calculations
parameters
key numbers in math models used to represent reality (Greek)
form, direction, strength, unusual features
four descriptions of a scatterplot
marginal distribution
the distribution of one of the variables in the totals (in the last row/column of a table)
stem and leaf plot
a sideways histogram that shows the individual values
mode
a hump or local high point in the shape of the distribution of a variable (unimodal, bimodal, multimodal)
addition rule
if A and B are disjoint events, then the probability of A or B is _______
influential point
outlier that distorts the model
causation
scatterplots and correlation coefficients never prove ______
multiplication rule
if A and B are independent events, then the probability of A and B is _____
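
The probability rules in these notes (complement, addition for disjoint events, multiplication for independent events) can be sketched as small helpers. The helper names and the die example are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction


def p_not(p_a):
    """Complement rule: P(not A) = 1 - P(A)."""
    return 1 - p_a


def p_or_disjoint(p_a, p_b):
    """Addition rule, valid only when A and B are disjoint."""
    return p_a + p_b


def p_and_independent(p_a, p_b):
    """Multiplication rule, valid only when A and B are independent."""
    return p_a * p_b


p_six = Fraction(1, 6)                      # fair die, one roll
p_five_or_six = p_or_disjoint(p_six, p_six)
p_two_sixes = p_and_independent(p_six, p_six)

# "At least one six in 4 rolls" = complement of "no six in 4 independent rolls"
p_no_six_in_4 = p_not(p_six) ** 4
p_at_least_one_six = p_not(p_no_six_in_4)
print(p_five_or_six, p_two_sixes, p_at_least_one_six)
```
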
explanatory
for scatterplots, the ______ variable is plotted on the x-axis
retrospective
Subjects are selected and then their previous conditions or behaviors are determined
skewed
a non-symmetrical distribution where one tail stretches out further than the other
back to back stem and leaf plot
comparing two related distributions with a moderate number of observations
normal model
if the distribution of a quantitative variable is unimodal and roughly symmetric, we can replace histograms with...
parameters, greek
numerically valued attributes of a model with _____ letters (mean: µ, SD: σ)
identifier variable
ID number often used to protect confidentiality
z = (x - mean)/SD
z-score
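
A minimal sketch of standardizing, assuming the sample standard deviation (dividing by n - 1) is the SD meant here. The function name `z_scores` and the data are made up.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (x - mean) / SD for every value."""
    m = mean(values)
    s = stdev(values)   # sample SD, divides by n - 1
    return [(x - m) / s for x in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
z = z_scores(scores)

# After standardizing, the center is 0 and the SD is 1 (up to rounding error)
print(round(mean(z), 10), round(stdev(z), 10))
```
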
the law of large numbers
the long run relative frequency of repeated independent events settles down to the true probability as the number of trials increases
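
A small simulation sketch of the law of large numbers. The function name, the default p = 0.5 (a fair coin), and the fixed seed are assumptions chosen so the run is repeatable.

```python
import random


def running_relative_frequency(n_trials, p=0.5, seed=0):
    """Simulate independent trials and track the relative frequency of success."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    successes = 0
    freqs = []
    for i in range(1, n_trials + 1):
        if rng.random() < p:    # this trial is a success with probability p
            successes += 1
        freqs.append(successes / i)
    return freqs


freqs = running_relative_frequency(100_000)
# Early values bounce around; later values settle near the true probability
print(freqs[9], freqs[999], freqs[-1])
```
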
shape, center, spread, and any unusual features
describe a histogram's distribution by telling about its
simple random sample (SRS)
a sample in which each set of n elements in the population has an equal chance of selection; the standard method of randomization
data
values along with their context
residual
leftovers; observed value-predicted
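
A sketch of the least squares regression line and its residuals, written out by hand rather than with a library. The function names and the small data set are hypothetical.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return slope, intercept


def residuals(xs, ys):
    """Residual = observed y minus predicted y."""
    slope, intercept = least_squares_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


xs = [1, 2, 3, 4]   # made-up data
ys = [2, 4, 5, 8]
print(least_squares_line(xs, ys))
print(residuals(xs, ys))
```

The residuals of a least squares line always sum to zero (up to rounding), so a scatterplot of residuals against x is centered on 0.
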
seasonal variation
a pattern in a time series that repeats itself at known regular intervals of time
calculate the regression line with and without the point
way to verify an outlier and its effects is to
systematic samples
these select individuals systematically from the sampling frame (e.g., every 10th individual); they work when there is no relation between the order of the sampling frame and the variables of interest
block
Group together subjects for experiments that are similar and randomize within those groups as a way to remove unwanted variation (parallel treatments on different groups); like stratifying
