1. Resistant Lines in Bivariate Analysis
Definition:
A resistant line is a line that summarizes the relationship between two quantitative variables (bivariate data) and is not heavily influenced by outliers. It is typically built from medians rather than means, so it gives a robust fit even when some data points deviate greatly.
Use:
Used to describe trends in scatter plots when data have outliers or are not perfectly linear.
Example:
If we plot height vs. weight and a few data points are extreme (very tall but light), the resistant line is pulled less by those points than the least-squares regression line, so it reflects the general trend better.
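As a minimal sketch, one common form of resistant line is the median-median line: sort the points by x, split them into three groups, and fit the slope through the medians of the outer groups. The function name `resistant_line` and the data are illustrative assumptions, not part of the notes, and this is a single pass without the iterative refinement some textbooks add.

```python
from statistics import median

def resistant_line(points):
    """Median-median line through (x, y) pairs; returns (slope, intercept)."""
    pts = sorted(points)                      # order by x (ties broken by y)
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")
    k, r = divmod(n, 3)
    outer = k + (1 if r == 2 else 0)          # size of the left and right groups
    groups = (pts[:outer], pts[outer:n - outer], pts[n - outer:])
    # Summary point of each group: (median x, median y)
    summary = [(median(x for x, _ in g), median(y for _, y in g)) for g in groups]
    (x_left, y_left), _, (x_right, y_right) = summary
    if x_right == x_left:
        raise ValueError("outer groups share the same median x")
    slope = (y_right - y_left) / (x_right - x_left)
    # Intercept: average of the three intercepts implied by the summary points
    intercept = sum(y - slope * x for x, y in summary) / 3
    return slope, intercept

# Illustrative data: y = 2x + 1 for x = 1..9, except one extreme value at x = 9
data = [(x, 2 * x + 1) for x in range(1, 9)] + [(9, 100)]
slope, intercept = resistant_line(data)
```

With this data the extreme point does not move the fitted line, because only the group medians enter the calculation.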
2. Explanatory Variable and Response Variable
Explanatory Variable (Independent Variable):
The variable that explains or influences changes in another variable.
Response Variable (Dependent Variable):
The variable that responds to changes in the explanatory variable.
Example:
In studying the effect of study hours on marks:
- Explanatory variable → Study hours
- Response variable → Marks scored
3. Marginal Proportion in Contingency Table
Definition:
A marginal proportion is the proportion of observations in each category of one variable, obtained by dividing the marginal total by the overall total.
Example:
If 30 out of 100 students are male, the marginal proportion for “male” = 30/100 = 0.3.
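The same calculation can be sketched in a few lines. The list below assumes the remaining 70 students fall in a single "female" category, which the example does not state; the variable names are illustrative.

```python
from collections import Counter

# 30 male students out of 100; the other 70 are assumed to be one category
students = ["male"] * 30 + ["female"] * 70

counts = Counter(students)                 # marginal totals per category
overall_total = sum(counts.values())       # 100

# Marginal proportion = marginal total / overall total
marginal_proportion = {cat: c / overall_total for cat, c in counts.items()}
```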
4. Uses of Contingency Table
Definition:
A contingency table is used to display the frequency distribution of two or more categorical variables.
Uses:
- To study relationships or associations between categorical variables.
- To compute conditional and marginal probabilities.
- To perform Chi-square tests for independence.
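To illustrate the last use, here is a rough sketch of the Pearson chi-square statistic for a 2×2 table, computed by hand. The counts are illustrative assumptions, no continuity correction is applied, and 3.841 is the standard 5% critical value for 1 degree of freedom.

```python
# Illustrative counts: rows = Passed / Failed, columns = Male / Female
observed = [[20, 25],
            [10, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two variables were independent
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)   # degrees of freedom
CRITICAL_5_PERCENT_DF1 = 3.841
reject_independence = chi2 > CRITICAL_5_PERCENT_DF1
```

For these counts the statistic is about 2.22, which is below 3.841, so independence would not be rejected at the 5% level.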
5. Causal Explanation
Definition:
Causal explanation identifies how one variable directly influences another variable, establishing a cause-and-effect relationship.
Example:
Increased exercise → causes → decrease in body weight.
6. Multivariate Analysis
Definition:
A statistical method that analyzes more than two variables simultaneously to understand relationships among them.
Example:
Studying how income, education, and age together affect expenditure.
7. Grouping Time Series Data
Definition:
Grouping time series data means arranging observations collected over time into intervals (e.g., yearly, monthly) for better analysis.
Example:
Daily sales data grouped into monthly totals.
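A minimal sketch of this grouping using only the standard library; the dates and sales figures are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Illustrative daily sales: 30 Jan to 3 Feb 2024, amounts 10, 20, 30, 40, 50
start = date(2024, 1, 30)
daily_sales = {start + timedelta(days=i): 10 * (i + 1) for i in range(5)}

# Group by (year, month) and add up the daily amounts
monthly_totals = defaultdict(int)
for day, amount in daily_sales.items():
    monthly_totals[(day.year, day.month)] += amount
```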
8. Two Common Methods for Resampling Time Series Data
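- Upsampling: Increasing the frequency of data (e.g., daily → hourly); the new time points have to be filled or interpolated.
- Downsampling: Decreasing the frequency of data (e.g., daily → monthly); the values in each period are aggregated, for example by sum or mean.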
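Both methods can be sketched without any time series library. The readings below are illustrative, and forward fill and the mean are only one possible choice for each direction.

```python
from datetime import datetime, timedelta
from statistics import mean

# Illustrative daily readings
daily = {
    datetime(2024, 1, 1): 10.0,
    datetime(2024, 1, 2): 14.0,
    datetime(2024, 1, 3): 12.0,
}

# Upsampling: daily -> 12-hourly, forward-filling each day's value
upsampled = {}
for stamp, value in daily.items():
    for hours in (0, 12):
        upsampled[stamp + timedelta(hours=hours)] = value

# Downsampling: daily -> monthly, aggregating with the mean
by_month = {}
for stamp, value in daily.items():
    by_month.setdefault((stamp.year, stamp.month), []).append(value)
monthly_mean = {month: mean(values) for month, values in by_month.items()}
```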
9. Advantages of Resampling in Time Series Analysis
- Simplifies large datasets for easier visualization.
- Helps in identifying trends or seasonal patterns.
- Useful for aligning data with other time-based variables.
10. Common Time-Based Indexing Operations
- Shifting (lead/lag data).
- Resampling (upsampling/downsampling).
- Rolling or moving averages.
- Slicing data by date/time (e.g., selecting one year or month).
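The operations above, except resampling, can be sketched on a small series as follows. The dates, values, and window size are illustrative assumptions.

```python
from datetime import date, timedelta

dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(6)]
values = [5, 7, 6, 9, 8, 10]

# Shifting: lag the series by one period (first entry has no predecessor)
lagged = [None] + values[:-1]

# Rolling average: mean of the current and two previous values
window = 3
rolling_mean = [None] * (window - 1) + [
    sum(values[i - window + 1:i + 1]) / window
    for i in range(window - 1, len(values))
]

# Slicing by date: keep observations from 2 Jan to 4 Jan inclusive
selected = [
    (d, v) for d, v in zip(dates, values)
    if date(2024, 1, 2) <= d <= date(2024, 1, 4)
]
```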
11. Inequality and Measures of Inequality
Inequality:
Refers to the unequal distribution of income, wealth, or opportunities among individuals or groups.
Measures of Inequality:
- Gini Coefficient – measures income inequality (0 = perfect equality, 1 = perfect inequality).
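- Lorenz Curve – plots the cumulative share of income against the cumulative share of the population; the further it bends below the line of equality, the greater the inequality.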
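As a rough sketch, both measures can be computed from a list of incomes. The function names and the sample incomes are illustrative assumptions. Note that for a finite sample of n people this formula tops out at (n − 1)/n rather than exactly 1.

```python
def gini(incomes):
    """Gini coefficient of non-negative incomes, using the sorted-rank formula."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def lorenz_points(incomes):
    """(cumulative population share, cumulative income share) pairs."""
    values = sorted(incomes)
    n, total = len(values), sum(values)
    points, running = [(0.0, 0.0)], 0
    for i, v in enumerate(values, start=1):
        running += v
        points.append((i / n, running / total))
    return points

equal = gini([1, 1, 1, 1])          # everyone has the same income
unequal = gini([0, 0, 0, 10])       # one person holds all the income
curve = lorenz_points([0, 0, 0, 10])
```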
12. Univariate, Bivariate, and Multivariate Data Analysis
| Type | Definition | Example |
|---|---|---|
| Univariate | Analysis of a single variable | Analyzing the average salary of employees |
| Bivariate | Analysis of two variables to find a relationship | Studying relation between age and income |
| Multivariate | Analysis involving three or more variables | Studying how age, income, and education affect spending |
13. Explanatory, Response, and Dummy Variable
- Explanatory Variable: Variable used to explain another variable (independent).
- Response Variable: Variable being explained or predicted (dependent).
- Dummy Variable: A binary variable (0 or 1) representing categories for regression models.
Example:
If gender is coded as male = 1, female = 0 → it’s a dummy variable.
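A short sketch of this coding follows. The helper `dummy_code` and the education levels are hypothetical additions that show the usual extension to more than two categories, where one category is kept as the baseline.

```python
genders = ["male", "female", "female", "male"]

# Single dummy variable: male = 1, female = 0
gender_dummy = [1 if g == "male" else 0 for g in genders]

def dummy_code(values, baseline):
    """One 0/1 column per category except the baseline (k - 1 dummies)."""
    levels = sorted(set(values) - {baseline})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical three-category variable, with "school" as the baseline
education = ["school", "college", "school", "postgrad"]
education_dummies = dummy_code(education, baseline="school")
```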
14. Conventions for Constructing a Causal Path Model
Conventions:
- Arrows indicate direction of causal influence.
- Variables are placed logically — causes to the left, effects to the right.
- No circular causation.
- Variables should have a theoretical or logical basis for causality.
Example (Causal Path Model):
Age Group → Feeling Unsafe Walking Alone After Dark
Older individuals are more likely to feel unsafe, so the arrow runs from “Age Group” (cause) to “Feeling Unsafe” (effect).
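One way to check the "no circular causation" convention is to store the arrows as (cause, effect) pairs and try to order them. This is only a sketch: the function name `causal_order` is an assumption, and it relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

def causal_order(paths):
    """Return variables ordered causes-first; raise if the arrows form a cycle."""
    predecessors = {}
    for cause, effect in paths:
        predecessors.setdefault(cause, set())
        predecessors.setdefault(effect, set()).add(cause)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as err:
        raise ValueError("path model contains circular causation") from err

# The path model from the example: one arrow from cause to effect
order = causal_order([("Age Group", "Feeling Unsafe")])
```

The returned order matches the left-to-right layout convention: causes first, effects last.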
15. Contingency Table, Cell Frequency, and Marginal
Contingency Table:
A table showing the frequency distribution of two or more categorical variables.
Cell Frequency:
The number of observations in each cell (intersection of categories).
Marginal:
The row or column totals, which give the overall frequency of each category of one variable regardless of the other variable.
Example:
| | Male | Female | Total |
|---|---|---|---|
| Passed | 20 | 25 | 45 |
| Failed | 10 | 5 | 15 |
| Total | 30 | 30 | 60 |
- Cell frequency: e.g., 20 (Males who passed)
- Marginal: e.g., Total males = 30
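The cell frequencies and marginals in this table can be reproduced with a short sketch; the dictionary layout and variable names are illustrative choices.

```python
# Cell frequencies from the table above: result -> gender -> count
table = {
    "Passed": {"Male": 20, "Female": 25},
    "Failed": {"Male": 10, "Female": 5},
}

# Row marginals: total per result
row_totals = {result: sum(cells.values()) for result, cells in table.items()}

# Column marginals: total per gender
col_totals = {}
for cells in table.values():
    for gender, count in cells.items():
        col_totals[gender] = col_totals.get(gender, 0) + count

grand_total = sum(row_totals.values())

# Marginal proportion of males, and pass rate conditional on being male
marginal_male = col_totals["Male"] / grand_total
pass_rate_given_male = table["Passed"]["Male"] / col_totals["Male"]
```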
16. Good Table Manners
- Keep tables simple and clear.
- Use proper headings and labels.
- Include totals and percentages where needed.
- Avoid unnecessary decimals.
- Maintain uniform units and spacing.
17. Type I and Type II Errors
- Type I Error (α): Rejecting a true null hypothesis (false positive).
Example: Concluding a medicine works when it doesn’t.
- Type II Error (β): Failing to reject a false null hypothesis (false negative).
Example: Concluding a medicine doesn’t work when it does.
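To make α and β concrete, the sketch below computes both for a one-sided z-test. Every number in it (the means under each hypothesis, the standard deviation, the sample size) is a hypothetical assumption chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05            # chosen Type I error rate
mu_null = 0.0           # mean effect if the medicine does not work (H0)
mu_alt = 0.5            # assumed mean effect if it does work (H1)
sigma, n = 1.0, 25      # assumed known standard deviation and sample size

std_err = sigma / sqrt(n)
null_dist = NormalDist(mu_null, std_err)   # sampling distribution under H0
alt_dist = NormalDist(mu_alt, std_err)     # sampling distribution under H1

# Reject H0 when the sample mean exceeds this cutoff
critical = null_dist.inv_cdf(1 - alpha)

type_1_rate = 1 - null_dist.cdf(critical)  # P(reject H0 | H0 true) = alpha
beta = alt_dist.cdf(critical)              # P(fail to reject H0 | H1 true)
power = 1 - beta
```

Under these assumptions β comes out near 0.20. Lowering α moves the cutoff up, which raises β unless the sample size also grows.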
18. Box Plot
Definition:
A graphical summary that shows the distribution of data using the five-number summary: minimum, Q1, median, Q3, and maximum.
Use:
Identifies spread, central tendency, and outliers in the data.
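The numbers behind a box plot can be sketched with the standard library. The data are illustrative, the quartile method (`inclusive`) is one of several conventions, and the 1.5 × IQR rule for flagging outliers is a common convention that the notes do not specify.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 9, 10, 12, 30]   # illustrative values

# Quartiles; "inclusive" treats the data as the whole population
q1, med, q3 = quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, med, q3, max(data))

# Common outlier rule: points beyond 1.5 * IQR from the quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```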
19. Types of Sources of Income
- Earned Income – from employment or business (e.g., salary).
- Investment Income – from dividends, interest, or rent.
- Transfer Income – from pensions, government aid, etc.
20. Ways to Make a Contingency Table Readable
- Use clear labels and consistent formatting.
- Include totals and percentages for clarity.
- Arrange categories logically (alphabetical or numerical order).