Data Directory

This directory contains all data files for the Bishop State Student Success Prediction project.

📁 File Descriptions

Source Data Files

| File | Description | Records | Size |
|------|-------------|---------|------|
| De-identified PDP AR Files.xlsx | Original Excel file with AR data | ~20K | 4 MB |
| ar_bscc_with_zip.csv | Bishop State AR data with zip codes | ~4K | 2 MB |
| bishop_state_cohorts_with_zip.csv | Student cohort data with zip codes | ~4K | 7 MB |
| bishop_state_courses.csv | Course enrollment records | ~100K | 29 MB |

Processed Data Files

| File | Description | Records | Size |
|------|-------------|---------|------|
| bishop_state_student_level_with_zip.csv | Student-level aggregated data | ~4K | 12 MB |

Output Files (Generated by ML Pipeline)

| File | Description | Records | Size |
|------|-------------|---------|------|
| bishop_state_student_level_with_predictions.csv | Student-level data with ML predictions | ~4K | 17 MB |
| bishop_state_merged_with_predictions.csv | Course-level data with ML predictions | ~100K | 72 MB |

Note: The original KCTCS source files (kctcs_*.csv, ar_kcts_with_zip.csv) remain in this directory for reference and will be removed in a future cleanup task.

📊 Data Schema

Key Identifier

  • Student_GUID: Unique student identifier used across all files
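Because every file is joined on Student_GUID, it can be worth confirming that the key really is unique in the one-row-per-student files before merging. The following is a minimal sketch using only the Python standard library. The `duplicate_guids` helper, the `Zip` column name, and the sample values are illustrative assumptions, not part of the project code.

```python
import csv
import io
from collections import Counter


def duplicate_guids(csv_file, key="Student_GUID"):
    """Return the sorted key values that appear on more than one row."""
    counts = Counter(row[key] for row in csv.DictReader(csv_file))
    return sorted(guid for guid, n in counts.items() if n > 1)


# Tiny in-memory stand-in for a one-row-per-student file (made-up values;
# the "Zip" column name is an assumption for illustration).
sample = io.StringIO(
    "Student_GUID,Zip\n"
    "a1,36603\n"
    "b2,36610\n"
    "a1,36603\n"
)
dupes = duplicate_guids(sample)
print(dupes)
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `open("bishop_state_cohorts_with_zip.csv", newline="")`.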

Data Relationships

bishop_state_cohorts_with_zip.csv (1 row per student)
    ├── Merged with → ar_bscc_with_zip.csv (1 row per student)
    └── Merged with → bishop_state_courses.csv (multiple rows per student)
         └── Aggregated to → bishop_state_student_level_with_zip.csv (1 row per student)
              └── ML Pipeline adds predictions → bishop_state_student_level_with_predictions.csv
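The one-to-many relationship between the cohort file and the course file can be illustrated with a small join on Student_GUID. This is only a sketch of the idea, not the logic in ai_model/merge_bishop_state_data.py. The `merge_on_guid` helper and the `Cohort`, `Course`, and `Grade_Points` columns are hypothetical, and the rows are made up.

```python
import csv
import io


def merge_on_guid(cohort_rows, course_rows, key="Student_GUID"):
    """Attach each course row to its student's single cohort row.

    Course rows whose key has no cohort record are skipped (inner join).
    """
    cohort_by_guid = {row[key]: row for row in cohort_rows}
    merged = []
    for course in course_rows:
        student = cohort_by_guid.get(course[key])
        if student is None:
            continue  # no matching cohort record for this enrollment
        merged.append({**student, **course})
    return merged


# Hypothetical columns and values, for illustration only.
cohorts = list(csv.DictReader(io.StringIO(
    "Student_GUID,Cohort\n"
    "a1,2021\n"
    "b2,2022\n"
)))
courses = list(csv.DictReader(io.StringIO(
    "Student_GUID,Course,Grade_Points\n"
    "a1,ENG101,4\n"
    "a1,MTH100,3\n"
    "b2,ENG101,2\n"
    "c3,BIO103,3\n"
)))

merged = merge_on_guid(cohorts, courses)
print(len(merged))
```

The merged result has one row per course enrollment, with the student's cohort fields repeated on each row. The enrollment for `c3` is dropped because no cohort record matches it.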

🔄 Data Processing Workflow

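  1. Source Files: Start with the raw CSV and Excel files listed under Source Data Files.
  2. Merge: Run ai_model/merge_bishop_state_data.py to create the merged dataset.
  3. Aggregate: Create student-level aggregated features (bishop_state_student_level_with_zip.csv).
  4. ML Pipeline: Run ai_model/complete_ml_pipeline.py to generate predictions.
  5. Output: Predictions are saved to both the student-level file (bishop_state_student_level_with_predictions.csv) and the course-level file (bishop_state_merged_with_predictions.csv).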

📝 Notes

  • All student data is de-identified
  • Missing values are handled during preprocessing
  • Zip codes are added for geographic analysis
  • Course-level data includes one row per course enrollment per student
  • Student-level data aggregates all courses into summary statistics
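The difference between course-level and student-level data can be shown with a small aggregation. The summary statistics the project actually computes are not listed here, so the `aggregate_to_student_level` helper, the `Grade_Points` column, and the two output fields below are assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_to_student_level(course_rows, key="Student_GUID"):
    """Collapse course-level rows into one summary row per student."""
    grouped = defaultdict(list)
    for row in course_rows:
        grouped[row[key]].append(float(row["Grade_Points"]))
    return [
        {
            key: guid,
            "courses_taken": len(points),
            "mean_grade_points": sum(points) / len(points),
        }
        for guid, points in sorted(grouped.items())
    ]


# Hypothetical course-level rows: one row per course enrollment per student.
course_rows = [
    {"Student_GUID": "a1", "Course": "ENG101", "Grade_Points": "4"},
    {"Student_GUID": "a1", "Course": "MTH100", "Grade_Points": "3"},
    {"Student_GUID": "b2", "Course": "ENG101", "Grade_Points": "2"},
]

student_level = aggregate_to_student_level(course_rows)
for row in student_level:
    print(row)
```

Three enrollment rows become two student rows, which mirrors the reduction from roughly 100K course records to roughly 4K student records in the tables above.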

🔒 Data Privacy

All data files contain de-identified student information. No personally identifiable information (PII) is included.

📚 Additional Documentation

For detailed field descriptions, see: