EDA 2m

1. What is data, and what is information?

Data: Raw facts, figures, or observations without context. Example: 50, 70, 90.
Information: Processed and meaningful data used for decision-making. Example: Average score is 70.
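As a minimal sketch of the distinction, the snippet below treats the three example scores as raw data and derives the average from them as information. The variable names are illustrative only.

```python
# Raw data: numbers with no context attached.
scores = [50, 70, 90]

# Information: the same numbers processed into something a person can act on.
average_score = sum(scores) / len(scores)

print(f"Average score is {average_score:.0f}")
```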

2. List out the various data formats.

CSV (Comma Separated Values)
Excel (.xls, .xlsx)
JSON (JavaScript Object Notation)
SQL Database
XML
Text (.txt)
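A short sketch of loading two of these formats into pandas. The records are made up, and in-memory strings stand in for real files so that the example is self-contained. pandas also provides read_excel, read_sql and read_xml for the other formats; they are not shown here because they need a file, a database connection, or optional dependencies.

```python
import io
import json

import pandas as pd

# Hypothetical records, written once as CSV text and once as JSON text.
csv_text = "name,marks\nAsha,50\nRavi,70\n"
json_text = '[{"name": "Asha", "marks": 50}, {"name": "Ravi", "marks": 70}]'

# CSV: read_csv accepts a path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: parse with the standard library, then build a DataFrame from the records.
df_json = pd.DataFrame(json.loads(json_text))

print(df_csv)
print(df_json)
```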

3. Define Exploratory Data Analysis (EDA).

EDA is the process of examining datasets to summarize their main characteristics using statistics, visualization, and transformations before applying formal modeling.
Example: Plotting a histogram of student marks.
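A minimal sketch of a first EDA pass on student marks, assuming a small made-up sample. It computes summary statistics and the bin counts that a histogram would display. With matplotlib installed, marks.plot(kind="hist") would draw the same distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical marks for ten students.
marks = pd.Series([35, 42, 50, 55, 61, 68, 70, 74, 82, 90], name="marks")

# Summary statistics: count, mean, std, min, quartiles, max.
summary = marks.describe()

# Histogram counts per bin; every bin is half-open except the last, which includes 100.
counts, bin_edges = np.histogram(marks, bins=[0, 40, 60, 80, 100])

print(summary)
print(counts)
```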

4. Mention the four different types of measurement scales.

Nominal – Categories without order (e.g., Gender: Male/Female).
Ordinal – Ordered categories (e.g., Ranks: 1st, 2nd, 3rd).
Interval – Ordered, equal spacing, no true zero (e.g., Temperature in Celsius).
Ratio – Ordered, equal spacing, true zero (e.g., Height, Weight).
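One way to see the difference between the scales in code, as a sketch with made-up values: pandas can mark a categorical as ordered (ordinal) or unordered (nominal), and the Celsius example shows why ratios are not meaningful on an interval scale.

```python
import pandas as pd

# Nominal: categories with no order, so min() and max() are not defined.
gender = pd.Categorical(["Male", "Female", "Male"], ordered=False)

# Ordinal: categories with an explicit order, so min() and max() make sense.
rank = pd.Categorical(["2nd", "1st", "3rd"],
                      categories=["1st", "2nd", "3rd"], ordered=True)
best_rank = rank.min()

# Interval: differences are meaningful, ratios are not.
# 20 °C is 10 degrees warmer than 10 °C, but not "twice as warm":
# on the Kelvin scale (a ratio scale with a true zero) the ratio is about 1.035.
celsius_difference = 20 - 10
kelvin_ratio = (20 + 273.15) / (10 + 273.15)

# Ratio: a true zero makes ratios meaningful, e.g. 180 cm is 1.5 times 120 cm.
height_ratio = 180 / 120

print(best_rank, celsius_difference, round(kelvin_ratio, 3), height_ratio)
```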

5. What is a transformation?

Transformation: Applying a mathematical or statistical function to data to make it more suitable for analysis.
Example: Applying log transformation to reduce skewness.
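A small sketch of the log-transformation example. The data are hypothetical values that grow by a factor of ten, and the skewness helper is written out here (population skewness) rather than taken from a library.

```python
import numpy as np

def skewness(values):
    """Population skewness: third central moment divided by the variance to the power 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical right-skewed data: one very large value dominates.
incomes = np.array([1, 10, 100, 1000, 10000])

# The log transformation compresses large values far more than small ones.
log_incomes = np.log10(incomes)

print(skewness(incomes))      # strongly positive
print(skewness(log_incomes))  # approximately zero: the logged values are evenly spaced
```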

6. What are the steps in EDA?

Data Collection
Data Cleaning (handling missing values and outliers)
Data Transformation
Data Visualization
Summary Statistics
Insights & Hypothesis generation
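The steps above can be sketched on a tiny made-up dataset. The column names, the median fill, and the z-score transformation are illustrative choices, not the only way to carry out each step.

```python
import numpy as np
import pandas as pd

# 1. Data collection: a hypothetical table with one missing mark.
df = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E"],
    "marks": [50, 70, np.nan, 90, 70],
})

# 2. Data cleaning: fill the missing mark with the median of the observed marks.
median_mark = df["marks"].median()
clean = df.assign(marks=df["marks"].fillna(median_mark))

# 3. Data transformation: standardise marks to z-scores.
clean["z_score"] = (clean["marks"] - clean["marks"].mean()) / clean["marks"].std()

# 4. Data visualization: clean["marks"].plot(kind="hist") would draw a histogram
#    (requires matplotlib, so it is left out of this sketch).

# 5. Summary statistics.
summary = clean["marks"].describe()

# 6. Insights: for example, which students are above the average?
above_average = clean.loc[clean["z_score"] > 0, "student"].tolist()

print(summary)
print(above_average)
```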

7. Define discrete variable, continuous variable & categorical variable.

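Discrete Variable: Takes countable, separate values (usually whole numbers). Example: Number of students.
Continuous Variable: Can take any value within a range, limited only by measurement precision. Example: Height = 165.7 cm.
Categorical Variable: Takes values from a fixed set of qualitative groups. Example: Blood group (A, B, AB, O).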

8. What is stacking and un-stacking?

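Stacking: Pivoting column labels into the row index, which makes the data longer and narrower.
Un-stacking: Pivoting a level of the row index back into columns, which makes the data wider.
Example (Pandas): df.stack() and df.unstack().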
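A minimal sketch of both operations on a made-up marks table. stack() moves the subject columns into an inner index level, and unstack() moves them back.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject.
wide = pd.DataFrame(
    {"maths": [50, 70], "science": [60, 80]},
    index=["Asha", "Ravi"],
)

# Stacking: the column labels become the inner level of a MultiIndex,
# giving one (student, subject) entry per value.
long = wide.stack()

# Un-stacking: the inner index level becomes the columns again.
restored = long.unstack()

print(long)
print(restored)
```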

9. Define dichotomous and polytomous variables.

Dichotomous Variable: Has only 2 categories. Example: Yes/No, Male/Female.
Polytomous Variable: Has more than 2 categories. Example: Grades (A, B, C, D).
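As a sketch, the number of distinct categories in a column is enough to tell the two apart. The helper category_kind is hypothetical and written only for this illustration.

```python
import pandas as pd

def category_kind(series):
    """Classify a categorical column by its number of distinct non-missing values."""
    n_categories = series.nunique()  # missing values are ignored by default
    if n_categories == 2:
        return "dichotomous"
    if n_categories > 2:
        return "polytomous"
    return "fewer than two categories"

answers = pd.Series(["Yes", "No", "Yes", "Yes"])
grades = pd.Series(["A", "B", "C", "D", "B"])

print(category_kind(answers))
print(category_kind(grades))
```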

10. What are the Software tools available for EDA?

Python (Pandas, NumPy, Matplotlib, Seaborn)
R (ggplot2, dplyr)
Excel
Tableau
SPSS

11. What are the Data Collection methods used for EDA?

Surveys & Questionnaires
Interviews
Observations
Experiments
Existing Databases & Reports

12. Justify: Why do we prefer the attribute-style way of accessing variables in a pandas object over dictionary-style indexing?

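Attribute-style (df.column) is shorter and more readable for quick, read-only access to a column.
Example: df.age is easier to type and read than df["age"].
However, dictionary-style is required when a column name contains spaces or special characters, when it clashes with an existing DataFrame attribute or method (e.g., count), or when creating a new column.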
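A short sketch of where the two styles agree and where only dictionary-style works. The DataFrame and its column names are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 34],
    "first name": ["Asha", "Ravi"],  # contains a space
    "count": [1, 2],                 # same name as the DataFrame.count method
})

# For a plain column name, both styles return the same Series.
same = df.age.equals(df["age"])

# A name with a space cannot be written as an attribute; use brackets.
names = df["first name"]

# df.count is the DataFrame.count method, not the "count" column.
count_is_method = callable(df.count)
counts = df["count"]

# New columns must be created with brackets; assigning to df.senior would
# only set a Python attribute (pandas warns about this) and add no column.
df["senior"] = df["age"] > 30

print(same, count_is_method, list(df.columns))
```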

13. Differentiate between numpy arrays and pandas series.

Feature	NumPy Array	Pandas Series
Data	Homogeneous (one dtype per array)	One dtype per Series; the object dtype can hold mixed Python values
Labels	Indexed by position only	Indexed by labels & position
Example	`np.array([10,20,30])`	`pd.Series([10,20,30], index=['a','b','c'])`
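The contrast in the table can be sketched as follows, using the same example values. The second Series, other, is added only to show that arithmetic between Series aligns on labels rather than on position.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A NumPy array is accessed by integer position only.
second_from_array = arr[1]

# A Series can be accessed by label or, through .iloc, by position.
by_label = ser["b"]
by_position = ser.iloc[1]

# Mixed Python values are stored under the generic object dtype.
mixed = pd.Series([1, "two", 3.0])

# Arithmetic between Series matches values by label, not by order.
other = pd.Series([1, 2, 3], index=["c", "b", "a"])
aligned_sum = ser + other

print(second_from_array, by_label, by_position)
print(mixed.dtype)
print(aligned_sum)
```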

14. Define slices, masking, and fancy indexing.

Slicing: Selecting a contiguous range with start:stop, where stop is excluded. Example: arr[1:4] → elements at indices 1 to 3.
Masking: Filtering with a Boolean condition. Example: arr[arr>10].
Fancy Indexing: Selecting elements by passing a list or array of indices. Example: arr[[0,2,4]].
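The three selection styles side by side, as a minimal sketch on a made-up NumPy array:

```python
import numpy as np

arr = np.array([5, 12, 7, 20, 15, 3])

# Slicing: indices 1, 2 and 3 (the stop index 4 is excluded).
sliced = arr[1:4]

# Masking: a Boolean array keeps only the elements where the condition is True.
mask = arr > 10
masked = arr[mask]

# Fancy indexing: pick arbitrary positions with a list of indices.
fancy = arr[[0, 2, 4]]

print(sliced)
print(masked)
print(fancy)
```

In NumPy a slice is a view onto the original array, while masking and fancy indexing return copies.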