# stats final

## Terms

- extreme conformer
- outlier that doesn't influence the model but does inflate R²
- pie chart
- shows how a whole divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
- case
- an individual about whom we have data (row)
- confounding variable
- a variable, associated with a factor, that also has an effect on the response variable
- uniform
- a distribution that is roughly flat
- disjoint
- two events that share no outcomes in common, mutually exclusive
- nonresponse
- when a large fraction of those sampled do not respond
- observational study
- a study based on data in which no manipulation of factors has been employed
- outliers
- extreme values that don't appear to belong with the rest of the data
- spread
- a numerical summary of how tightly the values are clustered around the center
- something has to happen rule
- the sum of the probabilities of all possible outcomes must be 1
- high R squared
- does not demonstrate the appropriateness of the regression
- cluster samples
- these randomly select among heterogeneous subgroups that each resemble the population at large, making our sampling tasks more manageable
- randomize, replicate, block, control
- principles of experimental design
- histogram
- uses adjacent, equal-width bars to show the distribution of values in a quantitative variable
- independent variables
- the conditional distribution of one variable is the same for each category of the other
- statistic
- any summary calculated from the sampled data (Latin)
- bias
- Any systematic failure of a sampling method to represent its population
- probability
- the proportion of times the event occurs in many repeated trials of a random phenomenon (the long-term relative frequency of an event)
- sample survey
- a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population; a poll
- complement rule
- the probability of an event occurring is 1 minus the probability that it doesn't occur
- re-expression
- a means of altering the data to achieve the conditions necessary to utilize particular summaries or models
- does not change
- adding/subtracting a constant to every data value __________ measures of spread
- quantitative variables, straightness, no outliers
- 3 conditions needed for correlation
- timeplot
- displays quantitative data collected over time (x-axis); can reveal trends that are ignored by box-plots and stem-and-leaf plots
- extrapolation
- unreliable predictions of y-values based on x-values outside the range of the original data
- undercoverage
- when individuals from a subgroup of the population are selected less often than they should be
- experimental units
- individuals on whom an experiment is performed
- normal probability plot
- precise method of checking the nearly normal condition
- residuals scatterplot
- used to check whether the relationship is straight enough for a linear regression model
- mean and SD (value based)
- when describing the distribution of a quantitative variable, if the shape is symmetric, report....
- independent
- P(B|A) = P(B)
- sample size
- what the precision of the statistics of a sample depends on
- randomness
- our greatest enemy and our most important tool
- Q1
- median of the lower half (25th percentile)
- trial
- in a simulation, the sequence of several components representing events
- placebo
- A (fake) treatment known to have no effect, administered so that all groups experience the same conditions
- lurking variable
- Creates an association between two other variables that tempts us to think that one may cause the other
- factor
- a variable whose levels are controlled by the experimenter
- data table
- an arrangement of data in which each row represents a case and each column represents a variable
- a valid experiment
- to prove a cause-and-effect relationship, we need to perform
- response
- for scatterplots, the _____ variable is plotted on the y-axis
- reverse conditioning
- when you have P(A|B) but want P(B|A): use a tree diagram to find P(A ∩ B) and P(A), then P(B|A) = P(A ∩ B) / P(A)
- inflection point
- the ______ of a normal curve identifies one standard deviation from the mean
- response variable
- the result of each trial with respect to what we were interested in
- IQR
- Q3 − Q1; the spread of the middle half of the data
- 68, 95, 99.7 rule
- __% of the data fall w/in 1 standard deviation of the mean, about ___% w/in 2, and __% w/in 3.
- median
- middle value
- center is shifted to 0, standard deviation is rescaled to 1
- effect of standardizing (normal model)
- regression line
- unique line that minimizes the sum of the squared residuals
- scatterplot form straighter, scatterplot scatter more consistent, histogram distribution more symmetric, boxplots spread more similar
- four reasons to consider re-expression
- deviation
- how far each data value is from the mean
- large residual
- outlier that might not influence model much but isn't consistent with the overall form
- regression to the mean
- each predicted ŷ tends to be fewer SDs from its mean than the corresponding x was from its mean
- sample
- a (representative) subset of a population, examined in hope of learning about the population
- union +
- with probabilities, "or" is the _____ of two events and translates into _
- comparative, double-blinded, placebo-controlled, randomized
- the best experiment
- boxplot
- displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values (effective for comparing groups)
- blinding
- Individuals associated with an experiment are not aware of how subjects have been allocated to treatment groups
- sample space
- the collection of all possible outcomes
- control group
- the group of experimental units assigned to a baseline treatment level (default or placebo)
- standardizing the data
- units can be eliminated by...
- response bias
- when respondents' answers may be affected by survey design
- convenience
- when the sample consists of individuals who are readily available
- variation
- statistics is about
- Q3
- median of the upper half (75th percentile)
- bar chart
- shows bars representing the count of each category in a categorical variable
- symmetric
- a distribution where the two halves on either side of the center look approximately like mirror images of each other
- census
- a sample that consists of the entire population
- reducing bias
- often the best use of time and resources when sampling or surveying
- mean
- average
- median and IQR
- when describing the distribution of a quantitative variable, if the shape is skewed, report....
- representative sample
- Statistics computed from it accurately reflect the corresponding population parameters
- dotplot
- graphs a dot for each case against a single axis
- statistically significant
- when an observed difference is too large for us to believe it is likely to have occurred by chance
- statistics, latin
- summaries of the data denoted with ____ letters (mean: x̄, SD: s)
- complement
- with probabilities, "not" and "at least" indicate ________
- treatment
- process, intervention, or other controlled circumstance applied to randomly assigned experimental units
- time series
- measurements of a variable taken at regular time intervals
- simulation
- a sequence of random outcomes that model a situation; an artificial representation of a random process used to study long term effects
- an experiment
- manipulates factor levels to create treatments, randomly assigns subjects, compares the responses of the subject groups
- randomization
- the best defense against bias
- segmented bar chart
- a stacked relative frequency bar chart
- center
- a typical value that attempts to summarize the entire distribution with a single number
- high leverage points
- have x-values far from the mean of x and pull more strongly on the regression line
- placebo effect
- The tendency of many human subjects (often 20% or more of experimental subjects) to show a response even when administered a placebo
- lurking variable
- a variable other than x and y that simultaneously affects both variables (background variable)
- inferring causation, extrapolation, outliers and influential points, change in scatterplot pattern, summary data
- what can go wrong with regression
- stratified samples
- these can reduce sampling variability by identifying homogeneous subgroups and then randomly sampling within each
- trial
- a single attempt or realization of a random phenomenon
- Max, Q3, Median, Q1, Min
- 5-number summary
- population
- the entire group of individuals or instances about whom we hope to learn
- independent
- the outcome of one trial doesn't influence or change the outcome of another
- range
- max − min of the data values
- sampling variability
- the natural tendency of randomly drawn samples to differ from each other
- scatterplot
- shows relationship between two quantitative variables on the same cases
- outcome
- the value measured, observed, or reported for each trial
- variable
- holds information about the same characteristic for many cases (column)
- sampling frame
- a list of individuals, which defines but may not be representative of the entire population, from which the sample is drawn
- event
- a combination of outcomes usually for the purpose of attaching a probability to them
- association
- a deliberately vague term describing the relationship between two variables
- shape
- uniform, single, multiple modes; symmetry vs skewed
- boring: no direction, shape, outliers
- a good scatterplot of the residuals v the x-values is...
- correlation
- a numerical measure of the direction and strength of the linear association between two quantitative variables
- random phenomena
- the rules and concepts of probability give us a language to talk and think about ________
- prospective
- Subjects are followed to observe future outcomes
- conditional distribution
- the distribution of a variable when the Who is restricted to a smaller group of individuals
- matching
- in a retrospective or prospective study, subjects who are similar in ways not under study may be paired and then compared with each other on the variables of interest as a way to reduce unwanted variation
- multiplies (divides)
- multiplying/dividing every data value by a constant __________ both the measures of position/center and the measures of spread by that same constant
- multistage sample
- a scheme that combines several sampling methods
- intersection x
- with probabilities, "and" is the ______ of the two events and translates into _
- events are disjoint
- P(B|A) = 0
- individual
- object described by a set of data
- the law of averages
- the mistaken belief that the longer something hasn't happened, the more likely it becomes
- response
- a variable whose values are compared across different treatments
- relative frequencies
- a casual term for probability
- levels
- specific values that the experimenter chooses for a factor
- quantitative, straight, outlier
- correlation/linear regression model conditions
- percents
- relationships among categorical variables are described by calculating _____ from the counts given to prevent variation.
- clear, concise, complete, in context
- 4 Cs: conclusions are...
- voluntary response
- individuals choose whether to respond on their own
- standardizing
- uses standard deviation as a ruler to measure distance from the mean creating z-scores
- Simpson's paradox
- when averages are taken across different groups, they can appear to contradict the overall averages
- Venn diagrams and two-way contingency tables
- ______ and ______ should be used to display the sample space and help probability calculations
- parameters
- key numbers in math models used to represent reality (Greek)
- form, direction, strength, unusual features
- four descriptions of a scatterplot
- marginal distribution
- the distribution of one of the variables in the totals (in the last row/column of a table)
- stem and leaf plot
- a sideways histogram that shows the individual values
- mode
- a hump or local high point in the shape of the distribution of a variable (unimodal, bimodal, multimodal)
- addition rule
- if A and B are disjoint events, then the probability of A or B is _______
- influential point
- outlier that distorts the model
- causation
- scatterplots and correlation coefficients never prove ______
- multiplication rule
- if A and B are independent events, then the probability of A and B is _____
- explanatory
- for scatterplots, the ______ variable is plotted on the x-axis
- retrospective
- Subjects are selected and then their previous conditions or behaviors are determined
- skewed
- a non-symmetrical distribution where one tail stretches out further than the other
- back to back stem and leaf plot
- comparing two related distributions with a moderate number of observations
- normal model
- if the distribution of a quantitative variable is unimodal and roughly symmetric, we can replace histograms with...
- parameters, greek
- numerically valued attributes of a model with _____ letters (mean: µ, SD: σ)
- identifier variable
- ID number often used to protect confidentiality
- z = (x − mean) / SD
- z-score
- the law of large numbers
- the long run relative frequency of repeated independent events settles down to the true probability as the number of trials increases
- shape, center, spread, and any unusual features
- describe a histogram's distribution by telling about its
- simple random sample (SRS)
- a sample in which each set of n elements in the population has an equal chance of selection; the standard method of randomization
- data
- values along with their context
- residual
- leftovers; observed value − predicted value (y − ŷ)
- seasonal variation
- a pattern in a time series that repeats itself at known regular intervals of time
- calculate the regression line with and without the point
- way to verify an outlier and its effects is to
- systematic samples
- these select individuals at regular intervals from the sampling frame (for example, every 10th one) after a random start; these work when there is no relation between the order of the sampling frame and the variables of interest
- block
- Group together subjects for experiments that are similar and randomize within those groups as a way to remove unwanted variation (parallel treatments on different groups); like stratifying
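
As a rough illustration of the 5-number summary, IQR, range, z-score, and shifting/rescaling entries above, here is a minimal Python sketch. The data values are made up, and the quartile rule (leaving the median out of both halves when the count is odd) is only one common convention; calculators and software may give slightly different quartiles.

```python
from statistics import mean, median, stdev


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are the medians of the lower and upper halves. For an odd
    number of values the overall median is left out of both halves
    (an assumed convention; other tools may differ).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + n % 2:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def z_scores(values):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


scores = [2, 4, 4, 5, 7, 9, 11]  # made-up data values

low, q1, med, q3, high = five_number_summary(scores)
iqr = q3 - q1            # spread of the middle half
data_range = high - low  # max - min

shifted = [x + 10 for x in scores]  # add a constant to every value
scaled = [x * 3 for x in scores]    # multiply every value by a constant

print("5-number summary:", (low, q1, med, q3, high))
print("IQR:", iqr, "range:", data_range)
print("SD original / shifted / scaled:",
      stdev(scores), stdev(shifted), stdev(scaled))
print("z-scores:", [round(z, 2) for z in z_scores(scores)])
```

Adding 10 leaves the standard deviation unchanged, while multiplying by 3 triples it. The z-scores themselves have mean 0 and standard deviation 1, which is the "effect of standardizing" entry.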
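
The correlation, regression line, residual, and R² entries can be tied together with a small sketch. It assumes the usual textbook formulas (slope = r × SD of y / SD of x, and a line that passes through the point of means); the x and y values are invented, and the function names are hypothetical helpers rather than anything from a particular library.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimizing squared residuals."""
    r = correlation(xs, ys)
    slope = r * stdev(ys) / stdev(xs)
    intercept = mean(ys) - slope * mean(xs)  # line passes through the means
    return intercept, slope


xs = [1, 2, 3, 4, 5]                 # made-up explanatory values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # made-up response values

b0, b1 = least_squares_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]  # observed - predicted
r_squared = correlation(xs, ys) ** 2

print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("residuals:", [round(e, 3) for e in residuals])
print("R squared:", round(r_squared, 4))
```

The residuals of a least squares line sum to (approximately) zero, so what matters when checking the model is their pattern against x, not their total. Predicting for an x well outside 1 to 5 here would be extrapolation.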
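
Reverse conditioning, together with the complement and addition rules, can be sketched numerically as follows. All three input probabilities are hypothetical, chosen only for illustration: B stands for "has a condition" and A for "tests positive". The sketch also uses the general multiplication rule, P(A and B) = P(B) × P(A|B), which does not require independence.

```python
# Hypothetical inputs (not from the notes), chosen only for illustration.
p_b = 0.01              # P(B): has the condition
p_a_given_b = 0.95      # P(A|B): tests positive given the condition
p_a_given_not_b = 0.05  # P(A|not B): tests positive without the condition

# Complement rule: P(not B) = 1 - P(B)
p_not_b = 1 - p_b

# Multiply along each branch of the tree diagram.
p_a_and_b = p_b * p_a_given_b
p_a_and_not_b = p_not_b * p_a_given_not_b

# The two branches are disjoint, so the addition rule gives P(A).
p_a = p_a_and_b + p_a_and_not_b

# Reverse the conditioning: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print("P(A) =", round(p_a, 4))
print("P(B|A) =", round(p_b_given_a, 4))
```

With these made-up numbers P(B|A) comes out far smaller than P(A|B), which is why the two conditional probabilities should never be swapped.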