EDA 2m

1. What is data, and what is information?

  • Data: Raw facts, figures, or observations without context. Example: 50, 70, 90.
  • Information: Processed and meaningful data used for decision-making. Example: Average score is 70.
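
As a small illustration of the distinction, the sketch below turns the raw scores from the example into one piece of information. The variable names are chosen here for illustration only.

```python
# Raw data: three scores with no context attached.
scores = [50, 70, 90]

# Processing the data (here, averaging) yields information
# that can support a decision.
average = sum(scores) / len(scores)

print(f"Average score is {average:.0f}")
```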

2. List the various data formats.

  • CSV (Comma Separated Values)
  • Excel (.xls, .xlsx)
  • JSON (JavaScript Object Notation)
  • SQL Database
  • XML
  • Text (.txt)
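
To make two of these formats concrete, here is a minimal sketch that loads the same made-up records from CSV text and from JSON text. It assumes pandas is installed; the names and marks are invented, and in-memory strings stand in for real files.

```python
import io
import json

import pandas as pd

# The same three (made-up) records in two formats.
csv_text = "name,marks\nAsha,50\nRavi,70\nMeena,90\n"
json_text = (
    '[{"name": "Asha", "marks": 50},'
    ' {"name": "Ravi", "marks": 70},'
    ' {"name": "Meena", "marks": 90}]'
)

# CSV: pandas parses the text directly; StringIO mimics an open file.
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: decode to Python objects first, then build a DataFrame.
df_json = pd.DataFrame(json.loads(json_text))

print(df_csv)
print(df_json)
```

pandas also provides read_excel, read_sql, and read_xml for the other formats in the list, though these typically need extra dependencies or a database connection.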

3. Define Exploratory Data Analysis (EDA).

  • EDA is the process of examining datasets to summarize their main characteristics using statistics, visualization, and transformations before applying formal modeling.
  • Example: Plotting a histogram of student marks.
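
The histogram example can be sketched without a plotting library. The marks below are made up, and NumPy's histogram function is used to count values per bin and print a text version; with Matplotlib available, a real plot would normally be drawn instead.

```python
import numpy as np
import pandas as pd

# Made-up student marks, for illustration only.
marks = pd.Series([35, 42, 50, 55, 58, 61, 64, 70, 72, 75, 81, 90])

# Summary statistics: count, mean, std, min, quartiles, max.
print(marks.describe())

# Count marks per bin; every bin is half-open except the last,
# which also includes its right edge (so 90 is counted).
counts, bin_edges = np.histogram(marks.to_numpy(), bins=[30, 50, 70, 90])
# counts per bin: 2, 5, 5

# A text histogram: one '#' per student in the bin.
for low, high, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{low}-{high}: {'#' * n}")
```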

4. Mention the four different types of measurement scales.

  1. Nominal – Categories without order (e.g., Gender: Male/Female).
  2. Ordinal – Ordered categories (e.g., Ranks: 1st, 2nd, 3rd).
  3. Interval – Ordered, equal spacing, no true zero (e.g., Temperature in Celsius).
  4. Ratio – Ordered, equal spacing, true zero (e.g., Height, Weight).
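
The four scales differ in which operations are meaningful. The sketch below is one way to show this, assuming pandas is installed; the sample values are made up.

```python
import pandas as pd

# Nominal: labels only, no order between categories.
gender = pd.Categorical(["Male", "Female", "Female"])

# Ordinal: categories carry an order, so min/max are meaningful.
rank = pd.Categorical(
    ["2nd", "1st", "3rd"],
    categories=["1st", "2nd", "3rd"],
    ordered=True,
)
best_rank = rank.min()
worst_rank = rank.max()

# Interval: differences are meaningful, ratios are not.
c_low, c_high = 10.0, 20.0
difference = c_high - c_low  # a 10-degree difference
# 20 °C is not "twice as hot" as 10 °C: on the Kelvin scale,
# which has a true zero, the ratio is only about 1.035.
kelvin_ratio = (c_high + 273.15) / (c_low + 273.15)

# Ratio: a true zero makes ratios meaningful.
weight_a, weight_b = 40.0, 80.0
weight_ratio = weight_b / weight_a  # b is twice as heavy as a

print(gender.ordered, best_rank, worst_rank)
print(difference, round(kelvin_ratio, 3), weight_ratio)
```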

5. What is transformation?

  • Transformation: Applying a mathematical or statistical function to data to make it more suitable for analysis.
  • Example: Applying log transformation to reduce skewness.
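
A minimal sketch of the log-transformation example, assuming NumPy and pandas are installed. The income values are made up and deliberately right-skewed.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed data: most values are small, a few are very large.
incomes = pd.Series([20, 22, 25, 27, 30, 35, 40, 60, 150, 400], dtype=float)

# The natural log compresses large values much more than small ones.
log_incomes = np.log(incomes)

# Sample skewness before and after the transformation.
skew_before = incomes.skew()
skew_after = log_incomes.skew()

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

For this sample the skewness drops after the transformation but does not reach zero. The log is only defined for positive values, so data containing zeros would need a variant such as np.log1p.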

6. What are the steps in EDA?

  1. Data Collection
  2. Data Cleaning (handling missing values and outliers)
  3. Data Transformation
  4. Data Visualization
  5. Summary Statistics
  6. Insights & Hypothesis Generation
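
The steps above can be sketched on a tiny, made-up table. This is only an illustration, assuming NumPy and pandas are installed: median imputation and z-scores are one choice among many, and a text bar chart stands in for a real plot.

```python
import numpy as np
import pandas as pd

# 1. Data collection: a small made-up table with one missing mark.
df = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E"],
    "marks": [50.0, 70.0, np.nan, 90.0, 70.0],
})

# 2. Data cleaning: fill the missing mark with the median.
median_marks = df["marks"].median()
df["marks"] = df["marks"].fillna(median_marks)

# 3. Data transformation: standardize marks to z-scores.
df["marks_z"] = (df["marks"] - df["marks"].mean()) / df["marks"].std()

# 4. Data visualization: a text bar chart, one '#' per 10 marks.
for row in df.itertuples():
    print(f"{row.student}: {'#' * int(row.marks // 10)}")

# 5. Summary statistics.
summary = df["marks"].describe()
print(summary)

# 6. Insights: for example, which students score above the mean?
above_mean = df.loc[df["marks_z"] > 0, "student"].tolist()
print("Above mean:", above_mean)
```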

7. Define discrete variable, continuous variable & categorical variable.

  • Discrete Variable: Takes countable, distinct values (often whole numbers). Example: Number of students.
  • Continuous Variable: Can take any value within a range, so infinitely many values are possible. Example: Height = 165.7 cm.
  • Categorical Variable: Qualitative groups or labels. Example: Blood group (A, B, AB, O).

8. What is stacking and un-stacking?

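  • Stacking: Converting columns into rows (in pandas, the column labels move into the innermost level of the row index).
  • Un-stacking: Converting rows into columns (the innermost row-index level moves back into the column labels).
  • Example (pandas): df.stack() and df.unstack().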
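
A minimal sketch of both operations, assuming pandas is installed. The student names and marks are made up.

```python
import pandas as pd

# Wide table: one row per student, one column per subject.
df = pd.DataFrame(
    {"math": [80, 65], "science": [75, 90]},
    index=["Asha", "Ravi"],
)

# stack(): column labels become the inner level of the row index,
# giving a Series with one row per (student, subject) pair.
stacked = df.stack()
print(stacked)

# unstack(): the inner index level moves back into columns.
restored = stacked.unstack()
print(restored)
```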

9. Define dichotomous and polytomous variables.

  • Dichotomous Variable: Has only 2 categories. Example: Yes/No, Male/Female.
  • Polytomous Variable: Has more than 2 categories. Example: Grades (A, B, C, D).

10. What software tools are available for EDA?

  • Python (Pandas, NumPy, Matplotlib, Seaborn)
  • R (ggplot2, dplyr)
  • Excel
  • Tableau
  • SPSS

11. What data collection methods are used for EDA?

  • Surveys & Questionnaires
  • Interviews
  • Observations
  • Experiments
  • Existing Databases & Reports

12. Justify: Why do we prefer the attribute-style way of accessing variables in a pandas object over dictionary-style indexing?

  • Attribute-style (df.column) is simpler, shorter, and more readable.
  • Example: df.age is easier than df["age"].
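  • However, dictionary-style is required when a column name contains spaces or special characters, when it clashes with an existing DataFrame attribute or method (e.g., df.count), or when creating a new column.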
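
The trade-off can be seen in a short sketch, assuming pandas is installed. The column names and values are made up to show each case.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 35],
    "first name": ["Asha", "Ravi"],  # contains a space
    "count": [3, 5],                 # same name as a DataFrame method
})

# For a simple column name, both styles return the same Series.
ages_attr = df.age
ages_item = df["age"]

# A name with a space is not a valid Python attribute,
# so only dictionary-style works.
names = df["first name"]

# df.count refers to the DataFrame.count method, not the column.
count_is_method = callable(df.count)
counts = df["count"]

print(ages_attr.equals(ages_item), count_is_method, counts.tolist())
```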

13. Differentiate between numpy arrays and pandas series.

Feature | NumPy Array                | Pandas Series
Data    | Homogeneous (single dtype) | Single dtype as well, but the object dtype can hold mixed values
Labels  | Indexed by position only   | Indexed by labels & position
Example | np.array([10,20,30])       | pd.Series([10,20,30], index=['a','b','c'])
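
The differences can be checked directly with the two example objects. A minimal sketch, assuming NumPy and pandas are installed; the mixed Series is an extra illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A NumPy array is indexed by position only.
by_position_arr = arr[1]

# A Series can be indexed by label or, via .iloc, by position.
by_label = ser["b"]
by_position_ser = ser.iloc[1]

# A Series with mixed values falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])

print(by_position_arr, by_label, by_position_ser, mixed.dtype)
```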

14. Define slices, masking, and fancy indexing.

  • Slicing: Selecting a contiguous range of data. Example: arr[1:4] → elements at indices 1 to 3 (the stop index 4 is excluded).
  • Masking: Filtering with a Boolean condition. Example: arr[arr>10].
  • Fancy Indexing: Selecting multiple indices at once. Example: arr[[0,2,4]].
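
All three techniques on one made-up NumPy array, as a minimal sketch:

```python
import numpy as np

arr = np.array([5, 12, 7, 20, 15, 3])

# Slicing: indices 1, 2, 3 (the stop index 4 is excluded).
sliced = arr[1:4]        # [12, 7, 20]

# Masking: a Boolean array keeps only the elements where it is True.
mask = arr > 10
masked = arr[mask]       # [12, 20, 15]

# Fancy indexing: a list of positions selects several elements at once.
fancy = arr[[0, 2, 4]]   # [5, 7, 15]

print(sliced, masked, fancy)
```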