Data Directory

This directory contains all data files for the Bishop State Student Success Prediction project.

📁 File Descriptions

Source Data Files

| File | Description | Records | Size |
|------|-------------|---------|------|
| `De-identified PDP AR Files.xlsx` | Original Excel file with AR data | ~20K | 4 MB |
| `ar_bscc_with_zip.csv` | Bishop State AR data with zip codes | ~4K | 2 MB |
| `bishop_state_cohorts_with_zip.csv` | Student cohort data with zip codes | ~4K | 7 MB |
| `bishop_state_courses.csv` | Course enrollment records | ~100K | 29 MB |

Processed Data Files

| File | Description | Records | Size |
|------|-------------|---------|------|
| `bishop_state_student_level_with_zip.csv` | Student-level aggregated data | ~4K | 12 MB |

Output Files (Generated by ML Pipeline)

| File | Description | Records | Size |
|------|-------------|---------|------|
| `bishop_state_student_level_with_predictions.csv` | Student-level data with ML predictions | ~4K | 17 MB |
| `bishop_state_merged_with_predictions.csv` | Course-level data with ML predictions | ~100K | 72 MB |

Note: The original KCTCS source files (kctcs_*.csv, ar_kcts_with_zip.csv) remain in this directory for reference and will be removed in a future cleanup task.

📊 Data Schema

Key Identifier

Student_GUID: Unique student identifier used across all files
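
Because every file is keyed on Student_GUID, a duplicate check on the one-row-per-student files can catch merge problems early. The following is a minimal sketch using only Python's standard library; the function name `find_duplicate_guids`, the `Zip` column name, and the sample rows are illustrative assumptions, not part of the project's code.

```python
import csv
import io
from collections import Counter


def find_duplicate_guids(rows, key="Student_GUID"):
    """Return the sorted key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(guid for guid, n in counts.items() if n > 1)


# Made-up sample standing in for a one-row-per-student CSV file.
sample = io.StringIO(
    "Student_GUID,Zip\n"
    "a1,36603\n"
    "b2,36610\n"
    "a1,36603\n"
)
duplicates = find_duplicate_guids(csv.DictReader(sample))
print(duplicates)  # ['a1']
```

To run the same check on a real file, pass `csv.DictReader(open(path, newline=""))` in place of the in-memory sample.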

Data Relationships

bishop_state_cohorts_with_zip.csv (1 row per student)
    ├── Merged with → ar_bscc_with_zip.csv (1 row per student)
    └── Merged with → bishop_state_courses.csv (multiple rows per student)
         └── Aggregated to → bishop_state_student_level_with_zip.csv (1 row per student)
              └── ML Pipeline adds predictions → bishop_state_student_level_with_predictions.csv
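
The one-to-many relationship in the diagram above can be sketched as a lookup-based join on Student_GUID. This is a standard-library illustration under stated assumptions, not the logic in ai_model/merge_bishop_state_data.py (which may well use pandas); the function name and the `Cohort` and `Course` columns are hypothetical.

```python
def merge_courses_with_cohort(cohort_rows, course_rows, key="Student_GUID"):
    """Attach each student's cohort attributes to every one of their course rows."""
    # One row per student, so a plain dict works as the lookup table.
    cohort_by_guid = {row[key]: row for row in cohort_rows}
    merged = []
    for course in course_rows:
        # Course rows with no matching student keep only their own fields.
        student = cohort_by_guid.get(course[key], {})
        # On a column-name clash, the course-level value wins.
        merged.append({**student, **course})
    return merged


# Made-up rows; only Student_GUID is a column name taken from this README.
cohort = [
    {"Student_GUID": "a1", "Cohort": "2019"},
    {"Student_GUID": "b2", "Cohort": "2020"},
]
courses = [
    {"Student_GUID": "a1", "Course": "ENG101"},
    {"Student_GUID": "a1", "Course": "MTH100"},
    {"Student_GUID": "b2", "Course": "BIO103"},
]
merged = merge_courses_with_cohort(cohort, courses)
print(len(merged))  # 3
```

The result has one row per course enrollment, matching the course-level row count rather than the student-level one.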

🔄 Data Processing Workflow

Source Files: Start with the raw CSV and Excel files
Merge: Run ai_model/merge_bishop_state_data.py to create the merged dataset
Aggregate: Create student-level aggregated features
ML Pipeline: Run ai_model/complete_ml_pipeline.py to generate predictions
Output: Predictions are saved to both the student-level and course-level files
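
The final step writes predictions at two levels of granularity. One way this could work, sketched below, is to compute one value per student and copy it onto every row that carries that Student_GUID. The `prediction` column name, the function name, and the scores are assumptions made for illustration; the actual columns written by the pipeline are not described here.

```python
def attach_predictions(rows, predictions, key="Student_GUID", column="prediction"):
    """Return copies of `rows` with each student's prediction added as `column`."""
    # Rows whose student has no prediction get None rather than raising.
    return [{**row, column: predictions.get(row[key])} for row in rows]


# Made-up scores keyed by Student_GUID.
predictions = {"a1": 0.82, "b2": 0.41}

student_rows = [{"Student_GUID": "a1"}, {"Student_GUID": "b2"}]
course_rows = [
    {"Student_GUID": "a1", "Course": "ENG101"},
    {"Student_GUID": "a1", "Course": "MTH100"},
]

student_out = attach_predictions(student_rows, predictions)
course_out = attach_predictions(course_rows, predictions)
print(course_out[1])
```

In this sketch the same student-level value repeats on each of that student's course rows, which is consistent with the course-level output file being much larger than the student-level one.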

📝 Notes

All student data is de-identified
Missing values are handled during preprocessing
Zip codes are added for geographic analysis
Course-level data includes one row per course enrollment per student
Student-level data aggregates all courses into summary statistics
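
The course-to-student aggregation described in the notes above can be illustrated with a small group-by. This sketch uses only the standard library and a hypothetical numeric `Credits` column; the summary statistics the project actually computes are listed in DATA_DICTIONARY.md, not here.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_student_level(course_rows, key="Student_GUID", value="Credits"):
    """Collapse course rows into one summary row per student."""
    values_by_guid = defaultdict(list)
    for row in course_rows:
        # CSV readers yield strings, so convert before doing arithmetic.
        values_by_guid[row[key]].append(float(row[value]))
    return [
        {key: guid, "course_count": len(vals), "total": sum(vals), "mean": mean(vals)}
        for guid, vals in sorted(values_by_guid.items())
    ]


# Made-up course rows; "Credits" is an assumed column name.
course_rows = [
    {"Student_GUID": "a1", "Credits": "3"},
    {"Student_GUID": "a1", "Credits": "4"},
    {"Student_GUID": "b2", "Credits": "3"},
]
summary = aggregate_to_student_level(course_rows)
print(summary[0])
# {'Student_GUID': 'a1', 'course_count': 2, 'total': 7.0, 'mean': 3.5}
```

Three course rows collapse to two student rows here, the same many-to-one reduction that takes the course file (~100K rows) down to the student-level file (~4K rows).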

🔒 Data Privacy

All data files contain de-identified student information. No personally identifiable information (PII) is included.

📚 Additional Documentation

For detailed field descriptions and related documentation, see:

DATA_DICTIONARY.md: Complete data dictionary
ML_MODELS_GUIDE.md: ML models documentation
README.md: Main project README
