📊 Unit 1 — Exploring Data
Comparing Distributions (SOCS)
🔠 The SOCS Framework
- Shape — symmetric, skewed left/right, unimodal/bimodal
- Outliers — mention if present; use IQR rule to identify
- Center — median (or mean) with value and units
- Spread — IQR, range, or standard deviation with value and units
📦 Boxplot Anatomy
- Whiskers: min & max (excluding outliers)
- Box edges: Q1 (25th %ile) and Q3 (75th %ile)
- Line in box: Median (Q2)
- IQR = Q3 − Q1
- Outlier rule: value < Q1 − 1.5·IQR or > Q3 + 1.5·IQR
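The outlier rule above can be sketched in a few lines. This is a minimal illustration, assuming quartiles are found as the medians of the lower and upper halves of the sorted data (other quartile conventions give slightly different fences); the `travel_times` values are made up for the example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method (an assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # values below the median position
    upper = data[half + n % 2:]     # values above the median position
    return median(lower), median(data), median(upper)

def iqr_outliers(values):
    """Return ((low_fence, high_fence), outliers) under the 1.5 * IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return (low_fence, high_fence), outliers

travel_times = [5, 10, 10, 15, 15, 20, 25, 30, 60]  # hypothetical minutes
fences, outliers = iqr_outliers(travel_times)
print(fences, outliers)
```

Here Q1 = 10 and Q3 = 27.5, so IQR = 17.5 and only the 60-minute value falls outside the fences.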
⚖️ Mean vs. Median
- Symmetric distribution → mean ≈ median
- Skewed right → mean > median (pulled toward tail)
- Skewed left → mean < median
- Outliers pull the mean but not the median
- Median is resistant; mean is not
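A quick numeric check of resistance, using made-up income values: adding one extreme value moves the mean a lot and the median only slightly.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]      # hypothetical incomes, in $1000s
with_outlier = incomes + [400]      # same data plus one extreme value

# Symmetric data: mean and median agree.
print(mean(incomes), median(incomes))

# One high outlier: the mean is pulled toward it, the median barely moves.
print(mean(with_outlier), median(with_outlier))
```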
⚠️ AP Exam Language
When comparing distributions, always use comparative language: "The median travel time for elementary students (≈ 25 min) is greater than the median for middle school students (≈ 15 min)." Never just state each group's value — you must compare them.
Normal Distribution
📐 Z-Score Formula
z = (x − μ) / σ
z tells you how many standard deviations above (+) or below (−) the mean a value falls.
🎯 Empirical Rule (68-95-99.7)
- μ ± 1σ contains ≈ 68% of data
- μ ± 2σ contains ≈ 95% of data
- μ ± 3σ contains ≈ 99.7% of data
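The empirical rule can be checked against the Normal curve directly. The sketch below assumes the usual z-score definition z = (x − μ) / σ and uses Python's standard-library `NormalDist`; the helper names `z_score` and `area_within` are illustrative only.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

standard = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The three areas come out near 0.68, 0.95 and 0.997, which is where the rule's name comes from.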
💡 Normal Distribution Problem Types
- Find probability/proportion: convert to z-score, then use normalcdf or table → P(X < x)
- Find value from percentile: use invNorm(percentile as decimal, μ, σ)
- Find middle X%: the area in each tail = (1 − X%)/2; use invNorm at that tail area and at 1 minus that area to get the two cutoffs
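For readers without a calculator at hand, the three problem types can be sketched with Python's standard-library `NormalDist`, whose `cdf` and `inv_cdf` methods play roughly the roles of normalcdf and invNorm. The mean of 64 and standard deviation of 2.5 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=64, sigma=2.5)   # hypothetical heights, in inches

# 1. Proportion below a value: P(X < 67)
p_below = heights.cdf(67)

# Proportion between two values: P(62 < X < 67)
p_between = heights.cdf(67) - heights.cdf(62)

# 2. Value at a percentile: the 90th percentile
p90 = heights.inv_cdf(0.90)

# 3. Cutoffs for the middle 80%: each tail holds (1 - 0.80) / 2 = 0.10
middle = 0.80
tail = (1 - middle) / 2
low, high = heights.inv_cdf(tail), heights.inv_cdf(1 - tail)

print(round(p_below, 4), round(p_between, 4), round(p90, 2))
print(round(low, 2), round(high, 2))
```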
📈 Unit 2 — Bivariate Data & Linear Regression
The Least-Squares Regression Line (LSRL)
ŷ = a + bx (a = y-intercept, b = slope)
📝 Interpreting the Slope
For each additional [one unit of x], the predicted [y variable] increases/decreases by [|b|] [y units] on average.
📝 Interpreting the Y-Intercept
When [x variable] is 0, the predicted [y variable] is [a] [y units].
Ask: does x = 0 make sense in context? If not, the y-intercept has limited practical meaning.
📊 Residuals
Residual = Observed − Predicted
e = y − ŷ
- Positive residual → point is above the line (the model underestimated y)
- Negative residual → point is below the line (the model overestimated y)
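To make the residual definition concrete, the sketch below fits a least-squares line ŷ = a + bx with the standard formulas b = Sxy / Sxx and a = ȳ − b·x̄, then computes e = y − ŷ for each point. The `hours` and `scores` data are invented for the example.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                 # slope
    a = y_bar - b * x_bar         # y-intercept
    return a, b

hours = [1, 2, 3, 4, 5]           # hypothetical study hours
scores = [52, 60, 61, 70, 77]     # hypothetical test scores

a, b = fit_lsrl(hours, scores)
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]  # observed - predicted

print(a, b)
print(residuals)
```

For this data the line is ŷ = 46 + 6x. The point (2, 60) has residual +2 (above the line) and (3, 61) has residual −3 (below the line); the residuals of a least-squares line always sum to zero.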
🔗 Correlation Coefficient (r)
- Always between −1 and +1
- r > 0: positive association
- r < 0: negative association
- |r| close to 1: strong linear relationship
- r = ±√(R²); the sign of r matches the sign of the slope
- r has no units; not affected by changing units
📐 Coefficient of Determination (R²)
Interpretation: About [R² written as a percent] of the variation in [y] is explained by the linear relationship with [x].
The remaining [100% minus that percent] is due to other factors or random variation.
📏 Standard Deviation of Residuals (s)
Interpretation: The actual [y] values typically differ from the predicted values by about s [y units].
Measures typical prediction error of the model.
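The three summary numbers (r, R², and s) can all be computed from the same sums of squares. This is a sketch using the standard formulas r = Sxy / √(Sxx·Syy), R² = 1 − SSE / Syy, and s = √(SSE / (n − 2)); the function name `regression_summary` and the data are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

def regression_summary(xs, ys):
    """Return slope, intercept, r, R^2 and s for a simple linear fit."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)                      # correlation coefficient
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / syy                      # share of variation explained
    s = sqrt(sse / (n - 2))                        # typical size of a residual
    return {"a": a, "b": b, "r": r, "r_squared": r_squared, "s": s}

hours = [1, 2, 3, 4, 5]           # hypothetical study hours
scores = [52, 60, 61, 70, 77]     # hypothetical test scores
summary = regression_summary(hours, scores)
print(summary)
```

With this data R² is about 0.96 and s is about 2.16, so predicted scores are typically off by roughly 2 points.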
Residual Plots & Model Appropriateness
✅ Linear Model IS Appropriate If:
- Residual plot shows random scatter with no pattern
- No curved (U-shaped or arch-shaped) pattern in residuals
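- R² is reasonably high (supporting evidence only; a high R² alone does not show the relationship is linear, so check the residual plot)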
- Scatterplot shows roughly linear pattern
❌ Linear Model NOT Appropriate If:
- Residual plot shows a curved pattern
- Residual plot shows a fan shape (increasing spread)
- Scatterplot is clearly curved (exponential, quadratic)
⚠️ Outliers & Extrapolation
- Outlier in regression: point far from the line (large residual)
- High leverage: point with extreme x-value
- Influential point: removing it changes the LSRL significantly
- Extrapolation: predicting outside the range of x-data — unreliable, avoid!
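One way to see what "influential" means is to fit the line with and without a suspect point. In the made-up data below, the last point has an extreme x-value (high leverage), and including it flips the sign of the slope.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope b = Sxy / Sxx."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5, 15]          # hypothetical data; x = 15 is far from the rest
ys = [2, 4, 5, 4, 5, 2]

b_with = slope(xs, ys)                 # all six points
b_without = slope(xs[:-1], ys[:-1])    # drop the high-leverage point

print(round(b_with, 4), round(b_without, 4))
```

Without the extreme point the slope is 0.6; with it the slope is about −0.11, so that single point is influential.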
💡 Computer Output Quick-Read
- Find slope in "Coef" column, row of the x-variable name
- Find y-intercept in "Coef" column, row labeled "Constant"
- Find R² from "R-Sq"; then r = ±√(R²), with the sign matching the slope
- Find s labeled directly as "s = …"
🔬 Unit 3 — Collecting Data
Sampling Methods
| Method | How It Works | Key Feature |
| --- | --- | --- |
| Simple Random Sample (SRS) | Every individual & every group of n individuals has an equal chance of selection | Gold standard; unbiased if done correctly |
| Stratified Random Sample | Divide population into strata (homogeneous groups); take SRS from each stratum | More precise when strata differ on response variable |
| Cluster Sample | Divide into clusters (heterogeneous groups); randomly select entire clusters | Practical when population is spread out |
| Systematic Sample | Select every k-th individual from a list after random start | Easy to implement |
| Convenience Sample | Select whoever is easiest to reach | Very prone to bias — avoid! |
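The four random methods in the table can be sketched with Python's `random` module. Everything about the population here is hypothetical: 120 students, with `grade` used as the stratum and `homeroom` as the cluster. The seed is fixed only so the example is repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 4 grades of 30 students, 12 homerooms of 10 students.
population = [{"id": i, "grade": 9 + i % 4, "homeroom": i // 10} for i in range(120)]

def simple_random_sample(pop, n):
    """SRS: every group of n individuals is equally likely to be chosen."""
    return rng.sample(pop, n)

def stratified_sample(pop, key, n_per_stratum):
    """Split into strata by `key`, then take an SRS from every stratum."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(pop, key, n_clusters):
    """Randomly choose whole clusters and keep everyone in them."""
    clusters = sorted({person[key] for person in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [person for person in pop if person[key] in chosen]

def systematic_sample(pop, k):
    """Random start among the first k individuals, then every k-th one."""
    start = rng.randrange(k)
    return pop[start::k]

srs = simple_random_sample(population, 12)
stratified = stratified_sample(population, "grade", 5)
clustered = cluster_sample(population, "homeroom", 2)
systematic = systematic_sample(population, 10)
```

Note the structural difference: the stratified sample contains some students from every grade, while the cluster sample contains every student from only a few homerooms.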
Sources of Bias
🎭 Voluntary Response Bias
People choose to respond; those with strong opinions are overrepresented.
Example: online poll where only motivated people participate
🚪 Convenience Bias
Sampling whoever is nearby; sample may not represent the population.
Example: surveying only football game attendees
💬 Question Wording Bias
Leading or loaded questions push respondents toward a particular answer.
Example: "Do you support building a new prison that will bring dangerously high crime to our community?"
📭 Non-Response Bias
People who don't respond differ systematically from those who do.
🕵️ Undercoverage
Some groups in the population have a lower probability of being included in the sample.
🗣️ Response Bias
People give inaccurate answers (social desirability, interviewer effect).
Experimental Design
🧪 Key Vocabulary
- Experimental units: the individuals (people, animals, or objects) to which treatments are assigned
- Factor: an explanatory variable (manipulated)
- Level: specific value of a factor
- Treatment: specific combination of factor levels applied
- Response variable: the outcome measured
- Placebo: fake treatment that looks real
🎯 Principles of Experiment Design
- Randomization: randomly assign units to treatments to reduce confounding
- Replication: apply each treatment to enough units to detect real effects
- Control: keep other variables the same across groups so they cannot explain differences in the response (use a control group or placebo for comparison)
🚧 Control Group
A group that receives no treatment (or placebo). Advantage: shows what changes occur without the treatment, giving a baseline for comparison.
🧱 Blocking
Group similar experimental units into blocks, then randomly assign treatments within each block.
Choose the blocking variable most strongly related to the response variable; blocking on it reduces variability and makes comparisons more precise.
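A randomized block design can be sketched as "shuffle within each block, then deal out treatments in turn." The function name, the soil-type blocks, and the plot labels below are all hypothetical.

```python
import random

def randomized_block_assignment(units, block_of, treatments, seed=None):
    """Randomly assign treatments separately within each block.

    units      : list of unit labels
    block_of   : function mapping a unit to its block
    treatments : list of treatment names
    """
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)                  # random order within the block
        for i, unit in enumerate(shuffled):    # deal treatments out in turn
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical experiment: 12 garden plots, blocked by soil type.
plots = [f"{soil}-{i}" for soil in ("sandy", "clay") for i in range(6)]
assignment = randomized_block_assignment(
    plots, lambda plot: plot.split("-")[0], ["fertilizer", "control"], seed=1
)
print(assignment)
```

Because the randomization happens inside each block, every soil type ends up with three fertilized and three control plots, so soil type cannot be confounded with the treatment.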
💡 Stratification vs. Blocking
Stratification is used in sampling (observational). Blocking is used in experiments. Both involve grouping similar individuals — the goal is to reduce variability within groups.
✅ Stratified Sampling: When is campus better than gender?
Stratify by campus when students' satisfaction with buildings differs more by campus than by gender — i.e., campus explains more variation in the response than gender does.
🎲 Unit 4 — Probability & Random Variables
Basic Probability Rules
🔒 Mutually Exclusive (Disjoint)
- Events A and B cannot both occur
- P(A ∩ B) = 0
- P(A ∪ B) = P(A) + P(B)
- Mutually exclusive events are NOT independent (unless P(A) = 0 or P(B) = 0)
🔀 Independence
- Knowing A occurred doesn't change probability of B
- P(A | B) = P(A) — or equivalently:
- P(A ∩ B) = P(A) · P(B)
- Check: does P(A|B) = P(A)? If yes → independent
📋 Two-Way Tables
- Joint probability: P(A and B) = cell / table total
- Marginal probability: P(A) = row or column total / table total
- Conditional probability: P(A | B) = cell for A and B / total of the row or column for B
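The three kinds of probability, plus the independence check P(A | B) = P(A), can be read off a small table in code. The counts below are invented (100 students classified by grade and how they get to school).

```python
from math import isclose

# Hypothetical two-way table of counts: (grade, transport mode) -> count
counts = {
    ("junior", "car"): 30, ("junior", "bus"): 20,
    ("senior", "car"): 45, ("senior", "bus"): 5,
}
total = sum(counts.values())

def joint(grade, mode):
    """P(grade and mode) = cell / table total."""
    return counts[(grade, mode)] / total

def marginal_grade(grade):
    """P(grade) = row total / table total."""
    return sum(c for (g, _), c in counts.items() if g == grade) / total

def marginal_mode(mode):
    """P(mode) = column total / table total."""
    return sum(c for (_, m), c in counts.items() if m == mode) / total

def conditional_grade_given_mode(grade, mode):
    """P(grade | mode) = cell / column total for that mode."""
    mode_total = sum(c for (_, m), c in counts.items() if m == mode)
    return counts[(grade, mode)] / mode_total

def independent(grade, mode):
    """Independent exactly when P(grade | mode) equals P(grade)."""
    return isclose(conditional_grade_given_mode(grade, mode), marginal_grade(grade))

print(joint("senior", "car"), marginal_grade("senior"))
print(conditional_grade_given_mode("senior", "car"), independent("senior", "car"))
```

Here P(senior) = 0.50 but P(senior | car) = 0.60, so being a senior and driving are not independent in this table.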
❌ Common Mistake
"Mutually exclusive" and "independent" are NOT the same. If two events are mutually exclusive and both have positive probability, they are DEPENDENT (knowing one occurred means the other definitely did not).
Discrete Random Variables
Combining Random Variables
⚠️ Critical Rule
When combining independent random variables, VARIANCES always add (even for X − Y). Standard deviations do NOT add — only variances do. Always add variances first, then take the square root.
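A short sketch of the rule for independent X and Y: means add or subtract with the sign, variances always add, and the standard deviation is the square root of the summed variance. The function name and numbers are illustrative.

```python
from math import sqrt

def combine_independent(mu_x, sd_x, mu_y, sd_y, sign=+1):
    """Mean and SD of X + Y (sign=+1) or X - Y (sign=-1) for independent X, Y."""
    mean = mu_x + sign * mu_y
    variance = sd_x ** 2 + sd_y ** 2    # variances add for both sum and difference
    return mean, sqrt(variance)

# Hypothetical values: X has mean 50, SD 3; Y has mean 30, SD 4.
total_mean, total_sd = combine_independent(50, 3, 30, 4, sign=+1)
diff_mean, diff_sd = combine_independent(50, 3, 30, 4, sign=-1)

print(total_mean, total_sd)   # SD is sqrt(9 + 16), not 3 + 4
print(diff_mean, diff_sd)     # same SD as the sum
```

Both the sum and the difference have standard deviation 5, not 7 and not 1.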
Binomial Distribution B(n, p)
✅ BINS Conditions
- Binary — two outcomes (success/failure)
- Independent — trials are independent
- Number — fixed number of trials (n)
- Success — constant probability p each trial
📐 Binomial Formulas
P(X = k) = C(n,k) · pᵏ · (1−p)^(n−k)
μ_X = np
σ_X = √(np(1−p))
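The binomial formulas translate directly into code using `math.comb` for C(n, k). The values n = 10 and p = 0.3 are arbitrary choices for the example, and `binom_cdf` is an added helper for P(X ≤ k).

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k), summing the pmf from 0 through k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 10, 0.3                      # hypothetical: 10 trials, 30% success chance
mu = n * p                          # mean
sigma = sqrt(n * p * (1 - p))       # standard deviation

print(round(binom_pmf(3, n, p), 4))   # P(X = 3)
print(round(binom_cdf(3, n, p), 4))   # P(X <= 3)
print(mu, round(sigma, 3))
```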
Geometric Distribution G(p)
🎯 When to Use Geometric
Count the number of trials until the first success. Same BINS conditions except no fixed n.
📐 Geometric Formulas
P(X = k) = (1−p)^(k−1) · p
μ_X = 1/p (expected # of trials)
P(X > k) = (1−p)^k
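The geometric formulas can be sketched the same way. The example assumes rolling a fair die until the first six, so p = 1/6; the function names are illustrative.

```python
def geom_pmf(k, p):
    """P(X = k): first success on trial k, after k - 1 failures."""
    return (1 - p) ** (k - 1) * p

def geom_more_than(k, p):
    """P(X > k): the first k trials are all failures."""
    return (1 - p) ** k

p = 1 / 6                       # hypothetical: chance of rolling a six
expected_trials = 1 / p         # mean number of rolls until the first six

print(round(geom_pmf(3, p), 4))        # first six on exactly the 3rd roll
print(round(geom_more_than(3, p), 4))  # no six in the first 3 rolls
print(expected_trials)
```

The two formulas agree with each other: P(X > 3) equals 1 minus the sum of P(X = 1), P(X = 2), and P(X = 3).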
Sampling Distributions & Central Limit Theorem
📊 Sampling Distribution of x̄
μ_x̄ = μ
σ_x̄ = σ / √n (standard deviation of the sampling distribution of x̄)
By the CLT: for large n (rule of thumb: n ≥ 30), the sampling distribution of x̄ is approximately Normal regardless of the population's shape.
💡 Effect of Sample Size
- Larger n → smaller σ_x̄ → x̄ is less variable
- Larger n → sampling distribution is more Normal
- Averaging n independent observations divides the standard deviation by √n (σ_x̄ = σ / √n)
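A small simulation can illustrate the effect of sample size. The sketch below assumes a strongly right-skewed population (exponential with μ = 1 and σ = 1), draws many samples of size n, and compares the spread of the sample means with σ / √n. The seed and the number of repetitions are arbitrary choices.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_means(n, reps=5000):
    """Simulate `reps` sample means, each from n exponential(1) draws."""
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

means_n4 = sample_means(4)
means_n25 = sample_means(25)

# Theory: sigma / sqrt(n) = 1 / 2 = 0.5 for n = 4, and 1 / 5 = 0.2 for n = 25.
sd_n4 = pstdev(means_n4)
sd_n25 = pstdev(means_n25)

print(round(mean(means_n4), 3), round(sd_n4, 3))
print(round(mean(means_n25), 3), round(sd_n25, 3))
```

Both sets of sample means center near the population mean of 1, and the simulated spreads land close to 0.5 and 0.2, matching σ / √n.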
💡 Binomial vs. Geometric — Quick Check
- Binomial: "How many successes in n trials?" → fixed n, count successes
- Geometric: "How many trials until the first success?" → no fixed n, count trials