1. What is data, and what is information?
- Data: Raw facts, figures, or observations without context. Example: 50, 70, 90.
- Information: Processed and meaningful data used for decision-making. Example: Average score is 70.
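The distinction can be sketched in a few lines of Python: the list of scores is the data, and the average derived from it is the information. The variable names are illustrative only.

```python
# Data: raw scores with no context attached.
scores = [50, 70, 90]

# Information: a summary derived from the data that can support a decision.
average = sum(scores) / len(scores)
print(f"Average score is {average:.0f}")  # Average score is 70
```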
2. List out the various data formats.
- CSV (Comma Separated Values)
- Excel (`.xls`, `.xlsx`)
- JSON (JavaScript Object Notation)
- SQL Database
- XML
- Text (`.txt`)
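As a rough sketch, pandas can load several of these formats into the same DataFrame structure. The sample records below are made up, and in-memory strings stand in for real files so the snippet is self-contained.

```python
import io

import pandas as pd

# The same two made-up records expressed in two formats.
csv_text = "name,score\nAsha,50\nRavi,70\n"
json_text = '[{"name": "Asha", "score": 50}, {"name": "Ravi", "score": 70}]'

# io.StringIO wraps each string so it can be read like a file.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)

# pandas also provides read_excel, read_sql and read_xml for the other
# formats listed; they need a real file or database connection.
```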
3. Define Exploratory Data Analysis (EDA).
- EDA is the process of examining datasets to summarize their main characteristics using statistics, visualization, and transformations before applying formal modeling.
- Example: Plotting a histogram of student marks.
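A minimal sketch of this idea, assuming a small made-up set of student marks: summary statistics come from pandas, and the histogram is drawn as text so the snippet does not depend on a plotting library.

```python
import numpy as np
import pandas as pd

# Made-up marks for ten students.
marks = pd.Series([35, 42, 58, 61, 67, 70, 72, 75, 88, 93], name="marks")

# Summary statistics: count, mean, std, min, quartiles, max.
print(marks.describe())

# Histogram counts per bin; the last bin includes its right edge.
bins = [0, 20, 40, 60, 80, 100]
counts, _ = np.histogram(marks, bins=bins)

# Text rendering of the histogram, one row per bin.
for left, right, count in zip(bins[:-1], bins[1:], counts):
    print(f"{left:>3}-{right:<3} {'#' * int(count)}")
```

With Matplotlib installed, `marks.plot.hist(bins=bins)` would draw the same distribution as a chart.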
4. Mention the four different types of measurement scales.
- Nominal – Categories without order (e.g., Gender: Male/Female).
- Ordinal – Ordered categories (e.g., Ranks: 1st, 2nd, 3rd).
- Interval – Ordered, equal spacing, no true zero (e.g., Temperature in Celsius).
- Ratio – Ordered, equal spacing, true zero (e.g., Height, Weight).
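The four scales differ in which operations are meaningful. The sketch below illustrates this with made-up values; using `pd.Categorical` for the nominal and ordinal cases is one possible representation, not the only one.

```python
import pandas as pd

# Nominal: categories with no order; only equality checks make sense.
gender = pd.Categorical(["Male", "Female", "Female"])

# Ordinal: categories with an order, so min/max and comparisons work.
rank = pd.Categorical(
    ["2nd", "1st", "3rd"],
    categories=["1st", "2nd", "3rd"],
    ordered=True,
)
best_rank = rank.min()  # "1st"

# Interval: differences are meaningful, ratios are not (no true zero).
c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low                       # 2.0, but not "twice as hot"
kelvin_ratio = (c_high + 273.15) / (c_low + 273.15)  # about 1.035

# Ratio: a true zero makes ratios meaningful.
weight_ratio = 80.0 / 40.0  # 2.0, genuinely twice as heavy

print(gender.ordered, best_rank, celsius_ratio, round(kelvin_ratio, 3), weight_ratio)
```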
5. What is transformation?
- Transformation: Applying a mathematical or statistical function to data to make it more suitable for analysis.
- Example: Applying log transformation to reduce skewness.
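As an illustration of the log transformation, the sketch below compares the skewness of a made-up, right-skewed sample before and after applying `np.log`. The values are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values (a few very large observations).
incomes = pd.Series([20, 22, 25, 27, 30, 35, 40, 60, 120, 400])

# The natural log compresses large values more than small ones.
log_incomes = np.log(incomes)

print("skewness before:", incomes.skew())
print("skewness after: ", log_incomes.skew())
```

The log is only defined for positive values, so data containing zeros or negatives needs a different transformation or a shift first.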
6. What are the steps in EDA?
- Data Collection
- Data Cleaning (handling missing values and outliers)
- Data Transformation
- Data Visualization
- Summary Statistics
- Insights & Hypothesis generation
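The steps above can be sketched end to end on a tiny made-up dataset. The column names, the median fill, and the choice of correlation as the "insight" are illustrative assumptions, not a prescribed workflow.

```python
import numpy as np
import pandas as pd

# 1. Data collection: a small made-up dataset with one missing value.
df = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E"],
    "hours": [2.0, 4.0, np.nan, 8.0, 10.0],
    "marks": [40, 55, 60, 80, 95],
})

# 2. Data cleaning: fill the missing study hours with the median.
df["hours"] = df["hours"].fillna(df["hours"].median())

# 3. Data transformation: rescale marks to a 0-1 fraction.
df["marks_fraction"] = df["marks"] / 100

# 4. Data visualization: with Matplotlib installed,
#    df.plot.scatter(x="hours", y="marks") would draw the relationship.

# 5. Summary statistics.
print(df[["hours", "marks"]].describe())

# 6. Insight / hypothesis: do more study hours go with higher marks?
corr = df["hours"].corr(df["marks"])
print("correlation between hours and marks:", round(corr, 3))
```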
7. Define discrete variable, continuous variable & categorical variable.
- Discrete Variable: Takes countable, separate values (typically whole numbers). Example: Number of students.
- Continuous Variable: Can take any value within a range, so infinitely many values are possible. Example: Height = 165.7 cm.
- Categorical Variable: Qualitative groups. Example: Blood group (A, B, O).
8. What is stacking and un-stacking?
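- Stacking: Converting columns into rows (the column labels move into the row index, giving a longer, narrower layout).
- Un-stacking: Converting rows into columns (an index level moves back into the column labels, giving a wider layout).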
- Example (Pandas): `df.stack()` and `df.unstack()`.
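A small sketch of both operations on a made-up marks table; the student and subject names are invented for the example.

```python
import pandas as pd

# Wide layout: one row per student, one column per subject.
df = pd.DataFrame(
    {"math": [50, 70], "science": [60, 90]},
    index=["Asha", "Ravi"],
)

# stack(): column labels become an inner index level, giving a
# Series indexed by (student, subject).
stacked = df.stack()
print(stacked)

# unstack(): the inner index level goes back to the columns.
unstacked = stacked.unstack()
print(unstacked)
```

Some recent pandas versions may emit a `FutureWarning` from `stack()` about a newer implementation; the result for this example is the same.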
9. Define dichotomous variable & Polytomous variables.
- Dichotomous Variable: Has only 2 categories. Example: Yes/No, Male/Female.
- Polytomous Variable: Has more than 2 categories. Example: Grades (A, B, C, D).
10. What are the Software tools available for EDA?
- Python (Pandas, NumPy, Matplotlib, Seaborn)
- R (ggplot2, dplyr)
- Excel
- Tableau
- SPSS
11. What are the Data Collection methods used for EDA?
- Surveys & Questionnaires
- Interviews
- Observations
- Experiments
- Existing Databases & Reports
12. Justify: Why do we prefer the attribute-style way of accessing variables in a pandas object over dictionary-style indexing?
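- Attribute-style (`df.column`) is simpler, shorter, and more readable.
- Example: `df.age` is easier to type and read than `df["age"]`.
- However, dictionary-style is required when a column name contains spaces or special characters, clashes with an existing DataFrame attribute or method, or when creating a new column.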
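The trade-off can be seen in a short sketch. The DataFrame below is made up, and the column named `count` is chosen deliberately because it collides with the `DataFrame.count` method.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 34],
    "first name": ["Asha", "Ravi"],  # contains a space
    "count": [3, 5],                 # same name as a DataFrame method
})

# Attribute-style and dictionary-style return the same column.
same = df.age.equals(df["age"])
print(same)  # True

# A name with a space can only be reached with brackets.
print(df["first name"])

# df.count is the DataFrame.count method, not the column.
print(callable(df.count))  # True
print(df["count"])         # the actual column
```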
13. Differentiate between numpy arrays and pandas series.
| Feature | NumPy Array | Pandas Series |
|---|---|---|
| Data | Homogeneous (one dtype) | One dtype per Series; mixed values are stored as `object` |
| Labels | Indexed by position only | Indexed by labels & position |
| Example | `np.array([10,20,30])` | `pd.Series([10,20,30], index=['a','b','c'])` |
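A brief sketch of the labeling difference, reusing the values from the table. The second Series (`other`) is a made-up addition to show label alignment.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

# NumPy: access by position only.
print(arr[1])       # 20

# pandas: access by label or by position.
print(ser["b"])     # 20
print(ser.iloc[1])  # 20

# Arithmetic between Series aligns on labels, not positions.
other = pd.Series([1, 2, 3], index=["c", "b", "a"])
total = ser + other
print(total)        # a: 13, b: 22, c: 31
```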
14. Define slices, masking, and fancy indexing.
- Slicing: Selecting a range of data. Example: `arr[1:4]` → elements at indices 1 to 3 (the stop index is excluded).
- Masking: Filtering with a Boolean condition. Example: `arr[arr>10]`.
- Fancy Indexing: Selecting multiple indices at once by passing a list or array of positions. Example: `arr[[0,2,4]]`.
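All three techniques applied to one made-up NumPy array:

```python
import numpy as np

arr = np.array([5, 12, 8, 20, 3, 15])

# Slicing: indices 1, 2 and 3 (stop index 4 is excluded).
sliced = arr[1:4]
print(sliced)  # [12  8 20]

# Masking: keep only the elements where the condition is True.
masked = arr[arr > 10]
print(masked)  # [12 20 15]

# Fancy indexing: pick arbitrary positions with a list of indices.
fancy = arr[[0, 2, 4]]
print(fancy)   # [5 8 3]
```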