Introduction

This case study examines student performance and equity in East Bay public schools, focusing on Pleasanton Unified, Hayward Unified, and Fremont Unified school districts. As a parent and concerned citizen, I believe that understanding the factors influencing educational outcomes is crucial for our community’s future. This analysis aims to uncover disparities in student performance across different demographic groups and identify potential areas for intervention and policy changes to promote educational equity.

Using data from the California Department of Education, including graduation rates, chronic absenteeism, stability rates, and standardized test scores, we will explore three key questions:

How does student performance vary across different schools and districts in the East Bay?
Are there disparities in educational outcomes based on factors like socioeconomic status, race/ethnicity, or English learner status?
Can we identify potential factors associated with high-performing schools or programs that could inform strategies to improve student achievement and equity across the region?

Through this analysis, we hope to provide valuable insights to educators, policymakers, and community members, ultimately contributing to more equitable and effective educational practices in the East Bay.

Let’s start!

Data Cleaning and Preparation

Introduction

This section contains the data cleaning and preparation steps for our analysis of student performance and equity in East Bay public schools. We’ll be working with several datasets from the 2021-2022 academic year:

ACGR (Adjusted Cohort Graduation Rate) data
Stability Rate data
Chronic Absenteeism data
CAST (California Science Test) data
Entities data (school and district information)

Let’s begin by loading our libraries and importing our data.

Load Libraries and Import Data

First, we’ll load the required R packages and import our datasets. Here is a quick overview of our data set files:

acgr22-v3.txt - tab delimited Adjusted Cohort Graduation Rate data for AY 22.
sr2122.txt - tab delimited Stability Rate data for AY 22.
chronicabsenteeism22-v3.txt - tab delimited Chronic Absenteeism data for AY 22.
cast_ca2022_all_01_csv_v1.txt - Caret delimited CAST (California Science Test) data for AY 22.
cast_ca2022entities_csv.txt - Caret delimited Entities data (school and district information) for 2022.

All the data files have been sourced from the California Department of Education Website. The data source has been verified to be reliable, trustworthy and up to date.

# Load necessary libraries
# Load required libraries
library(tidyverse)
library(readr)
library(lubridate)
library(ggplot2)

# Import ACGR (Adjusted Cohort Graduation Rate) data
acgr_data <- read_delim("acgr22-v3.txt", delim = "\t")

# Import Stability Rate data
stability_data <- read_delim("sr2122.txt", delim = "\t")

# Import Chronic Absenteeism data
absenteeism_data <- read_delim("chronicabsenteeism22-v3.txt", delim = "\t")

# Import CAST (California Science Test) data
cast_data <- read_delim("cast_ca2022_all_01_csv_v1.txt", delim = "^")

# Import Entities data
entities_data <- read_delim("cast_ca2022entities_csv.txt", delim = "^")

Data Overview

Now that we have imported our data, lets take a peek inside to see what we are working with. We will do this by looking into the first few rows of the data sets using the head() function and then analysing the structure of the data set using the str() function.

# Display the first few rows and structure of each dataset
head(acgr_data)

## # A tibble: 6 × 34
##   AcademicYear AggregateLevel CountyCode DistrictCode SchoolCode CountyName
##   <chr>        <chr>          <chr>             <dbl> <chr>      <chr>     
## 1 2021-22      C              01                   NA <NA>       Alameda   
## 2 2021-22      C              01                   NA <NA>       Alameda   
## 3 2021-22      C              01                   NA <NA>       Alameda   
## 4 2021-22      C              01                   NA <NA>       Alameda   
## 5 2021-22      C              01                   NA <NA>       Alameda   
## 6 2021-22      C              01                   NA <NA>       Alameda   
## # ℹ 28 more variables: DistrictName <chr>, SchoolName <chr>,
## #   CharterSchool <chr>, DASS <chr>, ReportingCategory <chr>,
## #   CohortStudents <chr>, `Regular HS Diploma Graduates (Count)` <chr>,
## #   `Regular HS Diploma Graduates (Rate)` <chr>,
## #   `Met UC/CSU Grad Req's (Count)` <chr>,
## #   `Met UC/CSU Grad Req's (Rate)` <chr>, `Seal of Biliteracy (Count)` <chr>,
## #   `Seal of Biliteracy (Rate)` <chr>, …

str(acgr_data)

## spc_tbl_ [113,235 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ AcademicYear                           : chr [1:113235] "2021-22" "2021-22" "2021-22" "2021-22" ...
##  $ AggregateLevel                         : chr [1:113235] "C" "C" "C" "C" ...
##  $ CountyCode                             : chr [1:113235] "01" "01" "01" "01" ...
##  $ DistrictCode                           : num [1:113235] NA NA NA NA NA NA NA NA NA NA ...
##  $ SchoolCode                             : chr [1:113235] NA NA NA NA ...
##  $ CountyName                             : chr [1:113235] "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ DistrictName                           : chr [1:113235] NA NA NA NA ...
##  $ SchoolName                             : chr [1:113235] NA NA NA NA ...
##  $ CharterSchool                          : chr [1:113235] "All" "All" "All" "All" ...
##  $ DASS                                   : chr [1:113235] "All" "All" "All" "All" ...
##  $ ReportingCategory                      : chr [1:113235] "GF" "GM" "GX" "RA" ...
##  $ CohortStudents                         : chr [1:113235] "8536" "9104" "16" "4557" ...
##  $ Regular HS Diploma Graduates (Count)   : chr [1:113235] "7765" "7882" "11" "4349" ...
##  $ Regular HS Diploma Graduates (Rate)    : chr [1:113235] "91" "86.6" "68.8" "95.4" ...
##  $ Met UC/CSU Grad Req's (Count)          : chr [1:113235] "5317" "4597" "7" "3650" ...
##  $ Met UC/CSU Grad Req's (Rate)           : chr [1:113235] "68.5" "58.3" "63.6" "83.9" ...
##  $ Seal of Biliteracy (Count)             : chr [1:113235] "1306" "835" "1" "1188" ...
##  $ Seal of Biliteracy (Rate)              : chr [1:113235] "16.8" "10.6" "9.1" "27.3" ...
##  $ Golden State Seal Merit Diploma (Count): chr [1:113235] "2738" "2137" "2" "2461" ...
##  $ Golden State Seal Merit Diploma (Rate  : chr [1:113235] "35.3" "27.1" "18.2" "56.6" ...
##  $ CHSPE Completer (Count)                : chr [1:113235] "14" "17" "1" "12" ...
##  $ CHSPE Completer (Rate)                 : chr [1:113235] "0.2" "0.2" "6.3" "0.3" ...
##  $ Adult Ed. HS Diploma (Count)           : chr [1:113235] "0" "2" "0" "0" ...
##  $ Adult Ed. HS Diploma (Rate)            : chr [1:113235] "0" "0" "0" "0" ...
##  $ SPED Certificate (Count)               : chr [1:113235] "51" "106" "0" "37" ...
##  $ SPED Certificate (Rate)                : chr [1:113235] "0.6" "1.2" "0" "0.8" ...
##  $ GED Completer (Count)                  : chr [1:113235] "5" "1" "0" "1" ...
##  $ GED Completer (Rate)                   : chr [1:113235] "0.1" "0" "0" "0" ...
##  $ Other Transfer (Count)                 : chr [1:113235] "16" "25" "0" "4" ...
##  $ Other Transfer (Rate)                  : chr [1:113235] "0.2" "0.3" "0" "0.1" ...
##  $ Dropout (Count)                        : chr [1:113235] "465" "758" "2" "106" ...
##  $ Dropout (Rate)                         : chr [1:113235] "5.4" "8.3" "12.5" "2.3" ...
##  $ Still Enrolled (Count)                 : chr [1:113235] "220" "313" "2" "48" ...
##  $ Still Enrolled (Rate)                  : chr [1:113235] "2.6" "3.4" "12.5" "1.1" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   AcademicYear = col_character(),
##   ..   AggregateLevel = col_character(),
##   ..   CountyCode = col_character(),
##   ..   DistrictCode = col_double(),
##   ..   SchoolCode = col_character(),
##   ..   CountyName = col_character(),
##   ..   DistrictName = col_character(),
##   ..   SchoolName = col_character(),
##   ..   CharterSchool = col_character(),
##   ..   DASS = col_character(),
##   ..   ReportingCategory = col_character(),
##   ..   CohortStudents = col_character(),
##   ..   `Regular HS Diploma Graduates (Count)` = col_character(),
##   ..   `Regular HS Diploma Graduates (Rate)` = col_character(),
##   ..   `Met UC/CSU Grad Req's (Count)` = col_character(),
##   ..   `Met UC/CSU Grad Req's (Rate)` = col_character(),
##   ..   `Seal of Biliteracy (Count)` = col_character(),
##   ..   `Seal of Biliteracy (Rate)` = col_character(),
##   ..   `Golden State Seal Merit Diploma (Count)` = col_character(),
##   ..   `Golden State Seal Merit Diploma (Rate` = col_character(),
##   ..   `CHSPE Completer (Count)` = col_character(),
##   ..   `CHSPE Completer (Rate)` = col_character(),
##   ..   `Adult Ed. HS Diploma (Count)` = col_character(),
##   ..   `Adult Ed. HS Diploma (Rate)` = col_character(),
##   ..   `SPED Certificate (Count)` = col_character(),
##   ..   `SPED Certificate (Rate)` = col_character(),
##   ..   `GED Completer (Count)` = col_character(),
##   ..   `GED Completer (Rate)` = col_character(),
##   ..   `Other Transfer (Count)` = col_character(),
##   ..   `Other Transfer (Rate)` = col_character(),
##   ..   `Dropout (Count)` = col_character(),
##   ..   `Dropout (Rate)` = col_character(),
##   ..   `Still Enrolled (Count)` = col_character(),
##   ..   `Still Enrolled (Rate)` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(stability_data)

## # A tibble: 6 × 16
##   `Academic Year` `Aggregate Level` `County Code` `District Code` `School Code`
##   <chr>           <chr>             <chr>                   <dbl> <chr>        
## 1 2021-22         T                 00                         NA <NA>         
## 2 2021-22         T                 00                         NA <NA>         
## 3 2021-22         T                 00                         NA <NA>         
## 4 2021-22         T                 00                         NA <NA>         
## 5 2021-22         T                 00                         NA <NA>         
## 6 2021-22         T                 00                         NA <NA>         
## # ℹ 11 more variables: `County Name` <chr>, `District Name` <chr>,
## #   `School Name` <chr>, `Charter School` <chr>, DASS <chr>,
## #   `Reporting Category` <chr>, `Adjusted Cumulative Enrollment` <chr>,
## #   `Stability Count` <chr>, `Stability Rate (percent)` <chr>,
## #   `Non-Stability Count` <chr>, `Non-Stability Rate (percent)` <chr>

str(stability_data)

## spc_tbl_ [353,391 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Academic Year                 : chr [1:353391] "2021-22" "2021-22" "2021-22" "2021-22" ...
##  $ Aggregate Level               : chr [1:353391] "T" "T" "T" "T" ...
##  $ County Code                   : chr [1:353391] "00" "00" "00" "00" ...
##  $ District Code                 : num [1:353391] NA NA NA NA NA NA NA NA NA NA ...
##  $ School Code                   : chr [1:353391] NA NA NA NA ...
##  $ County Name                   : chr [1:353391] "State" "State" "State" "State" ...
##  $ District Name                 : chr [1:353391] NA NA NA NA ...
##  $ School Name                   : chr [1:353391] NA NA NA NA ...
##  $ Charter School                : chr [1:353391] "All" "All" "All" "No " ...
##  $ DASS                          : chr [1:353391] "All" "No " "Yes" "All" ...
##  $ Reporting Category            : chr [1:353391] "GF" "GF" "GF" "GF" ...
##  $ Adjusted Cumulative Enrollment: chr [1:353391] "2942620" "2869233" "129287" "2604353" ...
##  $ Stability Count               : chr [1:353391] "2649435" "2601593" "47860" "2344381" ...
##  $ Stability Rate (percent)      : chr [1:353391] "90.0" "90.7" "37.0" "90.0" ...
##  $ Non-Stability Count           : chr [1:353391] "293185" "267640" "81427" "259972" ...
##  $ Non-Stability Rate (percent)  : chr [1:353391] "10.0" "9.3" "63.0" "10.0" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Academic Year` = col_character(),
##   ..   `Aggregate Level` = col_character(),
##   ..   `County Code` = col_character(),
##   ..   `District Code` = col_double(),
##   ..   `School Code` = col_character(),
##   ..   `County Name` = col_character(),
##   ..   `District Name` = col_character(),
##   ..   `School Name` = col_character(),
##   ..   `Charter School` = col_character(),
##   ..   DASS = col_character(),
##   ..   `Reporting Category` = col_character(),
##   ..   `Adjusted Cumulative Enrollment` = col_character(),
##   ..   `Stability Count` = col_character(),
##   ..   `Stability Rate (percent)` = col_character(),
##   ..   `Non-Stability Count` = col_character(),
##   ..   `Non-Stability Rate (percent)` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(absenteeism_data)

## # A tibble: 6 × 14
##   `Academic Year` `Aggregate Level` `County Code` `District Code` `School Code`
##   <chr>           <chr>             <chr>                   <dbl> <chr>        
## 1 2021-22         T                 00                         NA <NA>         
## 2 2021-22         T                 00                         NA <NA>         
## 3 2021-22         T                 00                         NA <NA>         
## 4 2021-22         T                 00                         NA <NA>         
## 5 2021-22         T                 00                         NA <NA>         
## 6 2021-22         T                 00                         NA <NA>         
## # ℹ 9 more variables: `County Name` <chr>, `District Name` <chr>,
## #   `School Name` <chr>, `Charter School` <chr>, DASS <chr>,
## #   `Reporting Category` <chr>,
## #   ChronicAbsenteeismEligibleCumulativeEnrollment <chr>,
## #   ChronicAbsenteeismCount <chr>, ChronicAbsenteeismRate <chr>

str(absenteeism_data)

## spc_tbl_ [342,607 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Academic Year                                 : chr [1:342607] "2021-22" "2021-22" "2021-22" "2021-22" ...
##  $ Aggregate Level                               : chr [1:342607] "T" "T" "T" "T" ...
##  $ County Code                                   : chr [1:342607] "00" "00" "00" "00" ...
##  $ District Code                                 : num [1:342607] NA NA NA NA NA NA NA NA NA NA ...
##  $ School Code                                   : chr [1:342607] NA NA NA NA ...
##  $ County Name                                   : chr [1:342607] "State" "State" "State" "State" ...
##  $ District Name                                 : chr [1:342607] NA NA NA NA ...
##  $ School Name                                   : chr [1:342607] NA NA NA NA ...
##  $ Charter School                                : chr [1:342607] "All" "All" "All" "All" ...
##  $ DASS                                          : chr [1:342607] "All" "All" "All" "All" ...
##  $ Reporting Category                            : chr [1:342607] "GF" "GM" "GR13" "GR46" ...
##  $ ChronicAbsenteeismEligibleCumulativeEnrollment: chr [1:342607] "2913279" "3077837" "1280630" "1333547" ...
##  $ ChronicAbsenteeismCount                       : chr [1:342607] "866427" "931454" "397646" "363188" ...
##  $ ChronicAbsenteeismRate                        : chr [1:342607] "29.7" "30.3" "31.1" "27.2" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Academic Year` = col_character(),
##   ..   `Aggregate Level` = col_character(),
##   ..   `County Code` = col_character(),
##   ..   `District Code` = col_double(),
##   ..   `School Code` = col_character(),
##   ..   `County Name` = col_character(),
##   ..   `District Name` = col_character(),
##   ..   `School Name` = col_character(),
##   ..   `Charter School` = col_character(),
##   ..   DASS = col_character(),
##   ..   `Reporting Category` = col_character(),
##   ..   ChronicAbsenteeismEligibleCumulativeEnrollment = col_character(),
##   ..   ChronicAbsenteeismCount = col_character(),
##   ..   ChronicAbsenteeismRate = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(cast_data)

## # A tibble: 6 × 30
##   `County Code` `District Code` `School Code` Filler `Test Year`
##   <chr>         <chr>           <chr>         <lgl>        <dbl>
## 1 01            00000           0000000       NA            2022
## 2 01            00000           0000000       NA            2022
## 3 01            00000           0000000       NA            2022
## 4 01            00000           0000000       NA            2022
## 5 01            00000           0000000       NA            2022
## 6 01            00000           0000000       NA            2022
## # ℹ 25 more variables: `Student Group ID` <dbl>, `Test Type` <chr>,
## #   `Total Tested at Reporting Level` <chr>,
## #   `Total Tested with Scores at Reporting Level` <chr>, Grade <dbl>,
## #   `Test ID` <dbl>, `Students Enrolled` <chr>, `Students Tested` <chr>,
## #   `Mean Scale Score` <chr>, `Percentage Standard Exceeded` <chr>,
## #   `Percentage Standard Met` <chr>, `Percentage Standard Met and Above` <chr>,
## #   `Percentage Standard Nearly Met` <chr>, …

str(cast_data)

## spc_tbl_ [49,965 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ County Code                                           : chr [1:49965] "01" "01" "01" "01" ...
##  $ District Code                                         : chr [1:49965] "00000" "00000" "00000" "00000" ...
##  $ School Code                                           : chr [1:49965] "0000000" "0000000" "0000000" "0000000" ...
##  $ Filler                                                : logi [1:49965] NA NA NA NA NA NA ...
##  $ Test Year                                             : num [1:49965] 2022 2022 2022 2022 2022 ...
##  $ Student Group ID                                      : num [1:49965] 1 3 4 6 7 8 28 29 31 50 ...
##  $ Test Type                                             : chr [1:49965] "X" "X" "X" "X" ...
##  $ Total Tested at Reporting Level                       : chr [1:49965] "50473" "25986" "24441" "42902" ...
##  $ Total Tested with Scores at Reporting Level           : chr [1:49965] "50355" "25916" "24394" "42809" ...
##  $ Grade                                                 : num [1:49965] 5 5 5 5 5 5 5 5 5 5 ...
##  $ Test ID                                               : num [1:49965] 17 17 17 17 17 17 17 17 17 17 ...
##  $ Students Enrolled                                     : chr [1:49965] "15953" "8167" "7777" "12711" ...
##  $ Students Tested                                       : chr [1:49965] "15566" "7948" "7610" "12423" ...
##  $ Mean Scale Score                                      : chr [1:49965] "206.4" "205.9" "206.9" "211.9" ...
##  $ Percentage Standard Exceeded                          : chr [1:49965] "19.28" "19.20" "19.36" "23.96" ...
##  $ Percentage Standard Met                               : chr [1:49965] "23.25" "23.00" "23.49" "28.06" ...
##  $ Percentage Standard Met and Above                     : chr [1:49965] "42.53" "42.20" "42.85" "52.02" ...
##  $ Percentage Standard Nearly Met                        : chr [1:49965] "41.38" "40.40" "42.42" "38.28" ...
##  $ Percentage Standard Not Met                           : chr [1:49965] "16.09" "17.41" "14.73" "9.70" ...
##  $ Students with Scores                                  : chr [1:49965] "15555" "7944" "7603" "12413" ...
##  $ Life Sciences Domain Percent Below Standard           : chr [1:49965] "40.65" "42.05" "39.21" "30.24" ...
##  $ Life Sciences Domain Percent Near Standard            : chr [1:49965] "45.73" "45.85" "45.57" "52.88" ...
##  $ Life Sciences Domain Percent Above Standard           : chr [1:49965] "13.62" "12.10" "15.21" "16.89" ...
##  $ Physical Sciences Domain Percent Below Standard       : chr [1:49965] "38.80" "40.19" "37.39" "29.49" ...
##  $ Physical Sciences Domain Percent Near Standard        : chr [1:49965] "46.98" "45.21" "48.81" "52.96" ...
##  $ Physical Sciences Domain Percent Above Standard       : chr [1:49965] "14.22" "14.61" "13.80" "17.55" ...
##  $ Earth and Space Sciences Domain Percent Below Standard: chr [1:49965] "37.99" "38.12" "37.87" "27.69" ...
##  $ Earth and Space Sciences Domain Percent Near Standard : chr [1:49965] "43.39" "41.66" "45.20" "49.27" ...
##  $ Earth and Space Sciences Domain Percent Above Standard: chr [1:49965] "18.61" "20.21" "16.94" "23.04" ...
##  $ Type ID                                               : num [1:49965] 5 5 5 5 5 5 5 5 5 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `County Code` = col_character(),
##   ..   `District Code` = col_character(),
##   ..   `School Code` = col_character(),
##   ..   Filler = col_logical(),
##   ..   `Test Year` = col_double(),
##   ..   `Student Group ID` = col_double(),
##   ..   `Test Type` = col_character(),
##   ..   `Total Tested at Reporting Level` = col_character(),
##   ..   `Total Tested with Scores at Reporting Level` = col_character(),
##   ..   Grade = col_double(),
##   ..   `Test ID` = col_double(),
##   ..   `Students Enrolled` = col_character(),
##   ..   `Students Tested` = col_character(),
##   ..   `Mean Scale Score` = col_character(),
##   ..   `Percentage Standard Exceeded` = col_character(),
##   ..   `Percentage Standard Met` = col_character(),
##   ..   `Percentage Standard Met and Above` = col_character(),
##   ..   `Percentage Standard Nearly Met` = col_character(),
##   ..   `Percentage Standard Not Met` = col_character(),
##   ..   `Students with Scores` = col_character(),
##   ..   `Life Sciences Domain Percent Below Standard` = col_character(),
##   ..   `Life Sciences Domain Percent Near Standard` = col_character(),
##   ..   `Life Sciences Domain Percent Above Standard` = col_character(),
##   ..   `Physical Sciences Domain Percent Below Standard` = col_character(),
##   ..   `Physical Sciences Domain Percent Near Standard` = col_character(),
##   ..   `Physical Sciences Domain Percent Above Standard` = col_character(),
##   ..   `Earth and Space Sciences Domain Percent Below Standard` = col_character(),
##   ..   `Earth and Space Sciences Domain Percent Near Standard` = col_character(),
##   ..   `Earth and Space Sciences Domain Percent Above Standard` = col_character(),
##   ..   `Type ID` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(entities_data)

## # A tibble: 6 × 10
##   `County Code` `District Code` `School Code` Filler `Test Year` `Type ID`
##   <chr>         <chr>           <chr>         <lgl>        <dbl>     <dbl>
## 1 00            00000           0000000       NA            2022         4
## 2 01            00000           0000000       NA            2022         5
## 3 01            10017           0000000       NA            2022         6
## 4 01            10017           0112607       NA            2022         9
## 5 01            10017           0123968       NA            2022         9
## 6 01            10017           0124172       NA            2022         9
## # ℹ 4 more variables: `County Name` <chr>, `District Name` <chr>,
## #   `School Name` <chr>, `Zip Code` <dbl>

str(entities_data)

## spc_tbl_ [11,128 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ County Code  : chr [1:11128] "00" "01" "01" "01" ...
##  $ District Code: chr [1:11128] "00000" "00000" "10017" "10017" ...
##  $ School Code  : chr [1:11128] "0000000" "0000000" "0000000" "0112607" ...
##  $ Filler       : logi [1:11128] NA NA NA NA NA NA ...
##  $ Test Year    : num [1:11128] 2022 2022 2022 2022 2022 ...
##  $ Type ID      : num [1:11128] 4 5 6 9 9 9 9 7 7 9 ...
##  $ County Name  : chr [1:11128] "State of California" "Alameda" "Alameda" "Alameda" ...
##  $ District Name: chr [1:11128] NA NA "Alameda County Office of Education" "Envision Academy for Arts & Technology" ...
##  $ School Name  : chr [1:11128] NA NA NA "Envision Academy for Arts & Technology" ...
##  $ Zip Code     : num [1:11128] NA NA NA 94612 94606 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `County Code` = col_character(),
##   ..   `District Code` = col_character(),
##   ..   `School Code` = col_character(),
##   ..   Filler = col_logical(),
##   ..   `Test Year` = col_double(),
##   ..   `Type ID` = col_double(),
##   ..   `County Name` = col_character(),
##   ..   `District Name` = col_character(),
##   ..   `School Name` = col_character(),
##   ..   `Zip Code` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Cleaning

Now that we have imported our data and know how it looks like, let’s clean each dataset so that it is it a state where we can run some analysis on it. This involves converting appropriate columns to numeric type and categorical variables to factors. Here are the high level steps we will take in the next few sections:

Convert appropriate columns to numeric type in all datasets.
Convert categorical variables to factors for more efficient storage and analysis.
Check for missing values in all datasets.

Clean ACGR Data

acgr_data <- acgr_data %>%
  # Convert numeric columns from character to numeric type
  mutate(across(c(CohortStudents, 
                  `Regular HS Diploma Graduates (Count)`,
                  `Regular HS Diploma Graduates (Rate)`,
                  `Met UC/CSU Grad Req's (Count)`,
                  `Met UC/CSU Grad Req's (Rate)`,
                  `Dropout (Count)`,
                  `Dropout (Rate)`), 
                as.numeric)) %>%
  # Convert categorical variables to factors
  mutate(AcademicYear = as.factor(AcademicYear),
         AggregateLevel = as.factor(AggregateLevel),
         CharterSchool = as.factor(CharterSchool),
         DASS = as.factor(DASS),
         ReportingCategory = as.factor(ReportingCategory))

# Display the structure of the cleaned ACGR data
str(acgr_data)

## tibble [113,235 × 34] (S3: tbl_df/tbl/data.frame)
##  $ AcademicYear                           : Factor w/ 1 level "2021-22": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AggregateLevel                         : Factor w/ 4 levels "C","D","S","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CountyCode                             : chr [1:113235] "01" "01" "01" "01" ...
##  $ DistrictCode                           : num [1:113235] NA NA NA NA NA NA NA NA NA NA ...
##  $ SchoolCode                             : chr [1:113235] NA NA NA NA ...
##  $ CountyName                             : chr [1:113235] "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ DistrictName                           : chr [1:113235] NA NA NA NA ...
##  $ SchoolName                             : chr [1:113235] NA NA NA NA ...
##  $ CharterSchool                          : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DASS                                   : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ReportingCategory                      : Factor w/ 19 levels "GF","GM","GX",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ CohortStudents                         : num [1:113235] 8536 9104 16 4557 1593 ...
##  $ Regular HS Diploma Graduates (Count)   : num [1:113235] 7765 7882 11 4349 1327 ...
##  $ Regular HS Diploma Graduates (Rate)    : num [1:113235] 91 86.6 68.8 95.4 83.3 78.4 93.8 81.4 82.2 86.5 ...
##  $ Met UC/CSU Grad Req's (Count)          : num [1:113235] 5317 4597 7 3650 602 ...
##  $ Met UC/CSU Grad Req's (Rate)           : num [1:113235] 68.5 58.3 63.6 83.9 45.4 53.3 65.1 45.2 35.1 29.3 ...
##  $ Seal of Biliteracy (Count)             : chr [1:113235] "1306" "835" "1" "1188" ...
##  $ Seal of Biliteracy (Rate)              : chr [1:113235] "16.8" "10.6" "9.1" "27.3" ...
##  $ Golden State Seal Merit Diploma (Count): chr [1:113235] "2738" "2137" "2" "2461" ...
##  $ Golden State Seal Merit Diploma (Rate  : chr [1:113235] "35.3" "27.1" "18.2" "56.6" ...
##  $ CHSPE Completer (Count)                : chr [1:113235] "14" "17" "1" "12" ...
##  $ CHSPE Completer (Rate)                 : chr [1:113235] "0.2" "0.2" "6.3" "0.3" ...
##  $ Adult Ed. HS Diploma (Count)           : chr [1:113235] "0" "2" "0" "0" ...
##  $ Adult Ed. HS Diploma (Rate)            : chr [1:113235] "0" "0" "0" "0" ...
##  $ SPED Certificate (Count)               : chr [1:113235] "51" "106" "0" "37" ...
##  $ SPED Certificate (Rate)                : chr [1:113235] "0.6" "1.2" "0" "0.8" ...
##  $ GED Completer (Count)                  : chr [1:113235] "5" "1" "0" "1" ...
##  $ GED Completer (Rate)                   : chr [1:113235] "0.1" "0" "0" "0" ...
##  $ Other Transfer (Count)                 : chr [1:113235] "16" "25" "0" "4" ...
##  $ Other Transfer (Rate)                  : chr [1:113235] "0.2" "0.3" "0" "0.1" ...
##  $ Dropout (Count)                        : num [1:113235] 465 758 2 106 156 18 21 715 4 15 ...
##  $ Dropout (Rate)                         : num [1:113235] 5.4 8.3 12.5 2.3 9.8 13.4 2.3 11.8 8.9 8.8 ...
##  $ Still Enrolled (Count)                 : chr [1:113235] "220" "313" "2" "48" ...
##  $ Still Enrolled (Rate)                  : chr [1:113235] "2.6" "3.4" "12.5" "1.1" ...

Clean Stability Rate Data

stability_data <- stability_data %>%
  # Convert numeric columns from character to numeric type
  mutate(across(c(`Adjusted Cumulative Enrollment`, 
                  `Stability Count`, 
                  `Stability Rate (percent)`,
                  `Non-Stability Count`,
                  `Non-Stability Rate (percent)`), 
                as.numeric)) %>%
  # Convert categorical variables to factors
  mutate(`Academic Year` = as.factor(`Academic Year`),
         `Aggregate Level` = as.factor(`Aggregate Level`),
         `Charter School` = as.factor(`Charter School`),
         DASS = as.factor(DASS),
         `Reporting Category` = as.factor(`Reporting Category`))

# Display the structure of the cleaned Stability Rate data
str(stability_data)

## tibble [353,391 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Academic Year                 : Factor w/ 1 level "2021-22": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Aggregate Level               : Factor w/ 4 levels "C","D","S","T": 4 4 4 4 4 4 4 4 4 4 ...
##  $ County Code                   : chr [1:353391] "00" "00" "00" "00" ...
##  $ District Code                 : num [1:353391] NA NA NA NA NA NA NA NA NA NA ...
##  $ School Code                   : chr [1:353391] NA NA NA NA ...
##  $ County Name                   : chr [1:353391] "State" "State" "State" "State" ...
##  $ District Name                 : chr [1:353391] NA NA NA NA ...
##  $ School Name                   : chr [1:353391] NA NA NA NA ...
##  $ Charter School                : Factor w/ 3 levels "All","No ","Yes": 1 1 1 2 2 2 3 3 3 1 ...
##  $ DASS                          : Factor w/ 3 levels "All","No ","Yes": 1 2 3 1 2 3 1 2 3 1 ...
##  $ Reporting Category            : Factor w/ 25 levels "GF","GM","GR13",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ Adjusted Cumulative Enrollment: num [1:353391] 2942620 2869233 129287 2604353 2567882 ...
##  $ Stability Count               : num [1:353391] 2649435 2601593 47860 2344381 2317063 ...
##  $ Stability Rate (percent)      : num [1:353391] 90 90.7 37 90 90.2 38.6 78.7 86.1 33.9 89.6 ...
##  $ Non-Stability Count           : num [1:353391] 293185 267640 81427 259972 250819 ...
##  $ Non-Stability Rate (percent)  : num [1:353391] 10 9.3 63 10 9.8 61.4 21.3 13.9 66.1 10.4 ...

Clean Absenteeism Data

absenteeism_data <- absenteeism_data %>%
  # Convert numeric columns from character to numeric type
  mutate(across(c(ChronicAbsenteeismEligibleCumulativeEnrollment,
                  ChronicAbsenteeismCount,
                  ChronicAbsenteeismRate), 
                as.numeric)) %>%
  # Convert categorical variables to factors
  mutate(`Academic Year` = as.factor(`Academic Year`),
         `Aggregate Level` = as.factor(`Aggregate Level`),
         `Charter School` = as.factor(`Charter School`),
         DASS = as.factor(DASS),
         `Reporting Category` = as.factor(`Reporting Category`))

# Display the structure of the cleaned Absenteeism data
str(absenteeism_data)

## tibble [342,607 × 14] (S3: tbl_df/tbl/data.frame)
##  $ Academic Year                                 : Factor w/ 1 level "2021-22": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Aggregate Level                               : Factor w/ 4 levels "C","D","S","T": 4 4 4 4 4 4 4 4 4 4 ...
##  $ County Code                                   : chr [1:342607] "00" "00" "00" "00" ...
##  $ District Code                                 : num [1:342607] NA NA NA NA NA NA NA NA NA NA ...
##  $ School Code                                   : chr [1:342607] NA NA NA NA ...
##  $ County Name                                   : chr [1:342607] "State" "State" "State" "State" ...
##  $ District Name                                 : chr [1:342607] NA NA NA NA ...
##  $ School Name                                   : chr [1:342607] NA NA NA NA ...
##  $ Charter School                                : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DASS                                          : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reporting Category                            : Factor w/ 26 levels "GF","GM","GR13",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ ChronicAbsenteeismEligibleCumulativeEnrollment: num [1:342607] 2913279 3077837 1280630 1333547 903122 ...
##  $ ChronicAbsenteeismCount                       : num [1:342607] 866427 931454 397646 363188 241996 ...
##  $ ChronicAbsenteeismRate                        : num [1:342607] 29.7 30.3 31.1 27.2 26.8 30.1 30 40.4 43.3 NA ...

Clean CAST Data

cast_data <- cast_data %>%
  # Convert numeric columns from character to numeric type
  mutate(across(c(`Students Enrolled`, 
                  `Students Tested`, 
                  `Mean Scale Score`,
                  `Percentage Standard Exceeded`,
                  `Percentage Standard Met`,
                  `Percentage Standard Met and Above`,
                  `Percentage Standard Nearly Met`,
                  `Percentage Standard Not Met`), 
                as.numeric)) %>%
  # Convert categorical variables to factors
  mutate(`Test Year` = as.factor(`Test Year`),
         `Student Group ID` = as.factor(`Student Group ID`),
         `Test Type` = as.factor(`Test Type`),
         Grade = as.factor(Grade),
         `Test ID` = as.factor(`Test ID`))

# Display the structure of the cleaned CAST data
str(cast_data)

## tibble [49,965 × 30] (S3: tbl_df/tbl/data.frame)
##  $ County Code                                           : chr [1:49965] "01" "01" "01" "01" ...
##  $ District Code                                         : chr [1:49965] "00000" "00000" "00000" "00000" ...
##  $ School Code                                           : chr [1:49965] "0000000" "0000000" "0000000" "0000000" ...
##  $ Filler                                                : logi [1:49965] NA NA NA NA NA NA ...
##  $ Test Year                                             : Factor w/ 1 level "2022": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Student Group ID                                      : Factor w/ 55 levels "1","3","4","6",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Test Type                                             : Factor w/ 1 level "X": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Total Tested at Reporting Level                       : chr [1:49965] "50473" "25986" "24441" "42902" ...
##  $ Total Tested with Scores at Reporting Level           : chr [1:49965] "50355" "25916" "24394" "42809" ...
##  $ Grade                                                 : Factor w/ 8 levels "5","8","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Test ID                                               : Factor w/ 1 level "17": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Students Enrolled                                     : num [1:49965] 15953 8167 7777 12711 1209 ...
##  $ Students Tested                                       : num [1:49965] 15566 7948 7610 12423 1194 ...
##  $ Mean Scale Score                                      : num [1:49965] 206 206 207 212 223 ...
##  $ Percentage Standard Exceeded                          : num [1:49965] 19.3 19.2 19.4 24 41.6 ...
##  $ Percentage Standard Met                               : num [1:49965] 23.2 23 23.5 28.1 34.1 ...
##  $ Percentage Standard Met and Above                     : num [1:49965] 42.5 42.2 42.9 52 75.7 ...
##  $ Percentage Standard Nearly Met                        : num [1:49965] 41.4 40.4 42.4 38.3 22.3 ...
##  $ Percentage Standard Not Met                           : num [1:49965] 16.09 17.41 14.73 9.7 2.01 ...
##  $ Students with Scores                                  : chr [1:49965] "15555" "7944" "7603" "12413" ...
##  $ Life Sciences Domain Percent Below Standard           : chr [1:49965] "40.65" "42.05" "39.21" "30.24" ...
##  $ Life Sciences Domain Percent Near Standard            : chr [1:49965] "45.73" "45.85" "45.57" "52.88" ...
##  $ Life Sciences Domain Percent Above Standard           : chr [1:49965] "13.62" "12.10" "15.21" "16.89" ...
##  $ Physical Sciences Domain Percent Below Standard       : chr [1:49965] "38.80" "40.19" "37.39" "29.49" ...
##  $ Physical Sciences Domain Percent Near Standard        : chr [1:49965] "46.98" "45.21" "48.81" "52.96" ...
##  $ Physical Sciences Domain Percent Above Standard       : chr [1:49965] "14.22" "14.61" "13.80" "17.55" ...
##  $ Earth and Space Sciences Domain Percent Below Standard: chr [1:49965] "37.99" "38.12" "37.87" "27.69" ...
##  $ Earth and Space Sciences Domain Percent Near Standard : chr [1:49965] "43.39" "41.66" "45.20" "49.27" ...
##  $ Earth and Space Sciences Domain Percent Above Standard: chr [1:49965] "18.61" "20.21" "16.94" "23.04" ...
##  $ Type ID                                               : num [1:49965] 5 5 5 5 5 5 5 5 5 5 ...

Clean Entities Data

entities_data <- entities_data %>%
  # Convert categorical variables to factors
  mutate(`Test Year` = as.factor(`Test Year`),
         `Type ID` = as.factor(`Type ID`))

# Display the structure of the cleaned Entities data
str(entities_data)

## tibble [11,128 × 10] (S3: tbl_df/tbl/data.frame)
##  $ County Code  : chr [1:11128] "00" "01" "01" "01" ...
##  $ District Code: chr [1:11128] "00000" "00000" "10017" "10017" ...
##  $ School Code  : chr [1:11128] "0000000" "0000000" "0000000" "0112607" ...
##  $ Filler       : logi [1:11128] NA NA NA NA NA NA ...
##  $ Test Year    : Factor w/ 1 level "2022": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type ID      : Factor w/ 6 levels "4","5","6","7",..: 1 2 3 5 5 5 5 4 4 5 ...
##  $ County Name  : chr [1:11128] "State of California" "Alameda" "Alameda" "Alameda" ...
##  $ District Name: chr [1:11128] NA NA "Alameda County Office of Education" "Envision Academy for Arts & Technology" ...
##  $ School Name  : chr [1:11128] NA NA NA "Envision Academy for Arts & Technology" ...
##  $ Zip Code     : num [1:11128] NA NA NA 94612 94606 ...

Check for Missing Values

Now that we’ve cleaned our data, let’s check for any missing values in each dataset.

# Function to check and display missing values
check_missing <- function(df) {
  missing <- sapply(df, function(x) sum(is.na(x)))
  missing[missing > 0]
}

# Check for missing values in all datasets
print("Missing values in ACGR data:")

## [1] "Missing values in ACGR data:"

print(check_missing(acgr_data))

##                         DistrictCode                           SchoolCode 
##                                 8666                                 8666 
##                         DistrictName                           SchoolName 
##                                 8666                                 8666 
##                       CohortStudents Regular HS Diploma Graduates (Count) 
##                                60203                                60203 
##  Regular HS Diploma Graduates (Rate)        Met UC/CSU Grad Req's (Count) 
##                                60203                                60203 
##         Met UC/CSU Grad Req's (Rate)                      Dropout (Count) 
##                                60203                                60203 
##                       Dropout (Rate) 
##                                60203

print("Missing values in Stability Rate data:")

## [1] "Missing values in Stability Rate data:"

print(check_missing(stability_data))

##                  District Code                    School Code 
##                          11908                         135723 
##                  District Name                    School Name 
##                          11908                         135723 
## Adjusted Cumulative Enrollment                Stability Count 
##                         111618                         111618 
##       Stability Rate (percent)            Non-Stability Count 
##                         111618                         111618 
##   Non-Stability Rate (percent) 
##                         111618

print("Missing values in Absenteeism data:")

## [1] "Missing values in Absenteeism data:"

print(check_missing(absenteeism_data))

##                                  District Code 
##                                          11898 
##                                    School Code 
##                                          11898 
##                                  District Name 
##                                          11898 
##                                    School Name 
##                                          11898 
## ChronicAbsenteeismEligibleCumulativeEnrollment 
##                                         105376 
##                        ChronicAbsenteeismCount 
##                                         105376 
##                         ChronicAbsenteeismRate 
##                                         105376

print("Missing values in CAST data:")

## [1] "Missing values in CAST data:"

print(check_missing(cast_data))

##                                                 Filler 
##                                                  49965 
##                                      Students Enrolled 
##                                                  12875 
##                                        Students Tested 
##                                                  12346 
##                                       Mean Scale Score 
##                                                  33090 
##                           Percentage Standard Exceeded 
##                                                  24201 
##                                Percentage Standard Met 
##                                                  24201 
##                      Percentage Standard Met and Above 
##                                                  24201 
##                         Percentage Standard Nearly Met 
##                                                  24201 
##                            Percentage Standard Not Met 
##                                                  24201 
##            Life Sciences Domain Percent Below Standard 
##                                                   2976 
##             Life Sciences Domain Percent Near Standard 
##                                                   2976 
##            Life Sciences Domain Percent Above Standard 
##                                                   2976 
##        Physical Sciences Domain Percent Below Standard 
##                                                   2976 
##         Physical Sciences Domain Percent Near Standard 
##                                                   2976 
##        Physical Sciences Domain Percent Above Standard 
##                                                   2976 
## Earth and Space Sciences Domain Percent Below Standard 
##                                                   2976 
##  Earth and Space Sciences Domain Percent Near Standard 
##                                                   2976 
## Earth and Space Sciences Domain Percent Above Standard 
##                                                   2976

print("Missing values in Entities data:")

## [1] "Missing values in Entities data:"

print(check_missing(entities_data))

##        Filler District Name   School Name      Zip Code 
##         11128            59          1074          1074

Based on the output from our check for missing values, we can see that each dataset has some missing data. Let’s summarize the findings and suggest some approaches to handle these missing values:

Summary of Missing Values

ACGR (Adjusted Cohort Graduation Rate) Data:
- Missing values in DistrictCode, SchoolCode, DistrictName, and SchoolName (8,666 each)
- Missing values in several numeric columns including CohortStudents, graduation counts and rates, and dropout information (60,203 each)
Stability Rate Data:
- Missing values in District Code, School Code, District Name, and School Name
- Missing values in enrollment and stability metrics (111,618 each)
Absenteeism Data:
- Missing values in District Code, School Code, District Name, and School Name (11,898 each)
- Missing values in absenteeism metrics (105,376 each)
CAST (California Science Test) Data:
- The “Filler” field has the highest number of missing values (49,965).
- Performance metric fields (e.g., Percentage Standard Exceeded) consistently have 24,201 missing values each.
- The “Mean Scale Score” field has 33,090 missing values.
- Student count fields have fewer missing values:
  - Students Enrolled: 12,875 missing
  - Students Tested: 12,346 missing
- Domain-specific fields all have 2,976 missing values each.
Entities Data:
- The “Filler” field has the highest number of missing values (11,128), which is equal to the total number of records in the dataset.
- “School Name” and “Zip Code” both have 1,074 missing values each, suggesting a possible correlation between these two fields.
- The “District Name” field has the least number of missing values (59), indicating that most records have district information.

Addressing Missing Values

Given the nature of our analysis, we should consider the following approaches:

For missing school and district identifiers:
- These might represent aggregate data (e.g., county or state level). We should keep these rows but flag them appropriately.
- Create a new column ‘data_level’ to indicate whether the data is at the school, district, county, or state level.
For CAST data:
- The missing values in test scores might indicate schools or groups that didn’t participate in the test. We should flag these for our analysis.

Let’s implement these strategies on ACGR and CAST data sets. The other data sets will be used to filter these two:

# Add data level indicator to ACGR data
acgr_data <- acgr_data %>%
  mutate(data_level = case_when(
    !is.na(SchoolCode) ~ "School",
    !is.na(DistrictCode) ~ "District",
    !is.na(CountyCode) ~ "County",
    TRUE ~ "State"
  ))

# Flag missing test scores in CAST data
cast_data <- cast_data %>%
  mutate(missing_scores = is.na(`Mean Scale Score`))

Filtering

Let’s filter the data for Pleasanton unified, Hayward unified and Fremont unified school districts using the following criteria:

District codes: 75101, 61176, 61192
Aggregate level: D (District)
Charter School: No

We’ll apply these filters to each relevant dataset. We will also include only race, socioeconomic and gender data and exclude other factors like grade.

The following code does the following:

Filters each dataset for the 3 chosen school districts.
For ACGR, Stability, and Absenteeism data, it also filters for district-level data and non-charter schools.
For all datasets except Entities, it includes only race, socioeconomic, and gender data as specified.

After running this code, we’ll have filtered datasets that focus only on Pleasanton Unified School District. We can then use these filtered datasets for our subsequent analysis.

# Load necessary libraries if not already loaded
library(dplyr)
library(tidyr)

# Define the district codes we're interested in
target_districts <- c("75101", "61176", "61192")  # Pleasanton, Hayward, Fremont

# Filter ACGR (Adjusted Cohort Graduation Rate) Data
acgr_data_filtered <- acgr_data %>%
  filter(DistrictCode %in% target_districts,
         AggregateLevel == "D",
         CharterSchool == "No ") %>%
  filter(ReportingCategory %in% c("RA", "RB", "RF", "RH", "RP", "RT", "RW", "SE", "SS", "GF", "GM", "TA")) %>%
  mutate(`District Code` = DistrictCode)

# Filter Stability Rate Data
stability_data_filtered <- stability_data %>%
  filter(`District Code` %in% target_districts,
         `Aggregate Level` == "D",
         `Charter School` == "No ") %>%
  filter(`Reporting Category` %in% c("RA", "RB", "RF", "RH", "RP", "RT", "RW", "SE", "SS", "GF", "GM"))

# Filter Absenteeism Data
absenteeism_data_filtered <- absenteeism_data %>%
  filter(`District Code` %in% target_districts,
         `Aggregate Level` == "D",
         `Charter School` == "No ") %>%
  filter(`Reporting Category` %in% c("RA", "RB", "RF", "RH", "RP", "RT", "RW", "SE", "SS", "GF", "GM"))

# Filter CAST Data
cast_data_filtered <- cast_data %>%
  filter(`District Code` %in% target_districts,
         `School Code` == "0000000") %>%  # District-level data
  filter(`Student Group ID` %in% c(1, 3, 4, 6, 7, 8, 9, 10, 11, 31, 75, 99, 100, 128, 129, 130, 144, 160, 180, 181))

# Filter Entities Data
entities_data_filtered <- entities_data %>%
  filter(`District Code` %in% target_districts)

# Display the number of rows in each filtered dataset
cat("Number of rows in filtered ACGR data:", nrow(acgr_data_filtered), "\n")

## Number of rows in filtered ACGR data: 108

cat("Number of rows in filtered Stability data:", nrow(stability_data_filtered), "\n")

## Number of rows in filtered Stability data: 99

cat("Number of rows in filtered Absenteeism data:", nrow(absenteeism_data_filtered), "\n")

## Number of rows in filtered Absenteeism data: 99

cat("Number of rows in filtered CAST data:", nrow(cast_data_filtered), "\n")

## Number of rows in filtered CAST data: 269

cat("Number of rows in filtered Entities data:", nrow(entities_data_filtered), "\n")

## Number of rows in filtered Entities data: 96

# Display unique districts in our filtered data
unique_districts <- entities_data_filtered %>%
  filter(`Type ID` == 6) %>%  # Assuming Type ID 6 represents districts
  select(`District Name`, `District Code`) %>%
  distinct()

cat("\nDistricts included in our analysis:\n")

## 
## Districts included in our analysis:

print(unique_districts)

## # A tibble: 3 × 2
##   `District Name`    `District Code`
##   <chr>              <chr>          
## 1 Fremont Unified    61176          
## 2 Hayward Unified    61192          
## 3 Pleasanton Unified 75101

Analysis

1. Comparing Graduation Rates

Let’s start by comparing graduation rates across Pleasanton, Hayward, and Fremont Unified school districts. We’ll break down the rates by demographic groups to identify any disparities or trends.

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)

# Create a function to map reporting categories to full names
get_full_category_name <- function(code) {
  case_when(
    code == "RA" ~ "Asian",
    code == "RB" ~ "African American",
    code == "RF" ~ "Filipino",
    code == "RD" ~ "Not Reported",
    code == "RH" ~ "Hispanic or Latino",
    code == "RI" ~ "American Indian or Alaska Native",
    code == "RP" ~ "Pacific Islander",
    code == "RT" ~ "Two or More Races",
    code == "RW" ~ "White",
    code == "SE" ~ "English Learners",
    code == "SS" ~ "Socioeconomically Disadvantaged",
    code == "GF" ~ "Female",
    code == "GM" ~ "Male",
    code == "GX" ~ "Non-Binary",
    code == "TA" ~ "All Students",
    TRUE ~ code
  )
}

# Filter and prepare graduation rate data
graduation_data <- acgr_data_filtered %>%
  select(DistrictName, ReportingCategory, `Regular HS Diploma Graduates (Rate)`) %>%
  # Convert rate to numeric, removing any '%' signs if present
  mutate(`Regular HS Diploma Graduates (Rate)` = as.numeric(gsub("%", "", `Regular HS Diploma Graduates (Rate)`)),
         ReportingCategory = get_full_category_name(ReportingCategory))

# Calculate average graduation rate by district and reporting category
avg_grad_rates <- graduation_data %>%
  group_by(DistrictName, ReportingCategory) %>%
  summarize(AvgGradRate = mean(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE))

# Create a bar plot of graduation rates
ggplot(avg_grad_rates, aes(x = ReportingCategory, y = AvgGradRate, fill = DistrictName)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Average Graduation Rates by Demographic Group",
       x = "Demographic Group",
       y = "Average Graduation Rate (%)",
       fill = "District") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

# Calculate overall average graduation rate for each district
overall_avg_rates <- graduation_data %>%
  group_by(DistrictName) %>%
  summarize(OverallAvgRate = mean(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE))

print(overall_avg_rates)

## # A tibble: 3 × 2
##   DistrictName       OverallAvgRate
##   <chr>                       <dbl>
## 1 Fremont Unified              85.8
## 2 Hayward Unified              79.4
## 3 Pleasanton Unified           90.0

This analysis provides a visual comparison of graduation rates across different demographic groups in the three districts. The bar plot allows us to easily identify any disparities between groups or districts. The overall average graduation rates give us a quick snapshot of each district’s performance.

2. Stability Rates and Chronic Absenteeism

Next, we’ll examine the stability rates and chronic absenteeism in these districts. Stability rates indicate the percentage of students who remain in the same school throughout the academic year, while chronic absenteeism refers to students who miss a significant number of school days.

# Prepare stability rate data
stability_data_prep <- stability_data_filtered %>%
  select(`District Name`, `Reporting Category`, `Stability Rate (percent)`) %>%
  mutate(`Stability Rate (percent)` = as.numeric(`Stability Rate (percent)`),
         `Reporting Category` = get_full_category_name(`Reporting Category`))

# Prepare absenteeism data
absenteeism_data_prep <- absenteeism_data_filtered %>%
  select(`District Name`, `Reporting Category`, `ChronicAbsenteeismRate`) %>%
  mutate(ChronicAbsenteeismRate = as.numeric(ChronicAbsenteeismRate),
         `Reporting Category` = get_full_category_name(`Reporting Category`))

# Calculate average stability and absenteeism rates by district and reporting category
avg_rates <- stability_data_prep %>%
  group_by(`District Name`, `Reporting Category`) %>%
  summarize(AvgStabilityRate = mean(`Stability Rate (percent)`, na.rm = TRUE)) %>%
  left_join(
    absenteeism_data_prep %>%
      group_by(`District Name`, `Reporting Category`) %>%
      summarize(AvgAbsenteeismRate = mean(ChronicAbsenteeismRate, na.rm = TRUE)),
    by = c("District Name", "Reporting Category")
  )

# Create a scatter plot
ggplot(avg_rates, aes(x = AvgStabilityRate, y = AvgAbsenteeismRate, color = `District Name`)) +
  geom_point(size = 4) +
  geom_text(aes(label = `Reporting Category`), hjust = 0, vjust = 0, check_overlap = TRUE, size = 3) +
  theme_minimal() +
  labs(title = "Stability Rate vs Chronic Absenteeism Rate by Demographic Group",
       x = "Average Stability Rate (%)",
       y = "Average Chronic Absenteeism Rate (%)",
       color = "District") +
  scale_color_brewer(palette = "Set1")

print(avg_rates)

## # A tibble: 33 × 4
## # Groups:   District Name [3]
##    `District Name` `Reporting Category`      AvgStabilityRate AvgAbsenteeismRate
##    <chr>           <chr>                                <dbl>              <dbl>
##  1 Fremont Unified African American                      59.8               48.6
##  2 Fremont Unified Asian                                 64.5               22.5
##  3 Fremont Unified English Learners                      56.6               34.9
##  4 Fremont Unified Female                                67.1               34.2
##  5 Fremont Unified Filipino                              67.6               31.3
##  6 Fremont Unified Hispanic or Latino                    66.8               43.4
##  7 Fremont Unified Male                                  67.0               30  
##  8 Fremont Unified Pacific Islander                      80.2               44.5
##  9 Fremont Unified Socioeconomically Disadv…             64.4               41.9
## 10 Fremont Unified Two or More Races                     70.8               36.2
## # ℹ 23 more rows

This analysis compares the stability rates and chronic absenteeism rates across the three districts. The scatter plot helps visualize any potential relationship between stability and absenteeism. We can observe that districts with higher stability rates tend to have lower absenteeism rates, and vice versa.

3. Examining CAST (California Science Test) Results

Now, let’s analyze the CAST results to see how science performance varies across these districts. We’ll look at the percentage of students meeting or exceeding standards in different science domains.

# Load necessary libraries if not already loaded
library(dplyr)
library(ggplot2)
library(tidyr)

# Create a function to map student group IDs to full names
get_full_group_name <- function(id) {
  case_when(
    id == 1 ~ "All Students",
    id == 3 ~ "Female",
    id == 4 ~ "Male",
    id == 6 ~ "African American",
    id == 7 ~ "American Indian or Alaska Native",
    id == 8 ~ "Asian",
    id == 9 ~ "Filipino",
    id == 10 ~ "Hispanic or Latino",
    id == 11 ~ "Pacific Islander",
    id == 31 ~ "Socioeconomically Disadvantaged",
    id == 75 ~ "Two or More Races",
    id == 99 ~ "White",
    id == 100 ~ "Non-Binary",
    id == 128 ~ "English Learners",
    id == 129 ~ "Students with Disabilities",
    id == 130 ~ "Homeless Youth",
    id == 144 ~ "Foster Youth",
    id == 160 ~ "Migrant Education",
    id == 180 ~ "Military Dependents",
    id == 181 ~ "Not Reported",
    TRUE ~ as.character(id)
  )
}

# Prepare CAST data
cast_data_prep <- cast_data_filtered %>%
  select(`District Code`, `County Code`, Grade, `Student Group ID`,
         `Percentage Standard Met and Above`,
         `Life Sciences Domain Percent Above Standard`,
         `Physical Sciences Domain Percent Above Standard`,
         `Earth and Space Sciences Domain Percent Above Standard`) %>%
  mutate(across(c(`Percentage Standard Met and Above`,
                  `Life Sciences Domain Percent Above Standard`,
                  `Physical Sciences Domain Percent Above Standard`,
                  `Earth and Space Sciences Domain Percent Above Standard`),
                as.numeric),
         `Student Group` = get_full_group_name(`Student Group ID`))

# Add district names based on district codes
cast_data_prep <- cast_data_prep %>%
  mutate(DistrictName = case_when(
    `District Code` == 75101 ~ "Pleasanton Unified",
    `District Code` == 61176 ~ "Fremont Unified",
    `District Code` == 61192 ~ "Hayward Unified",
    TRUE ~ "Unknown"
  ))

# Calculate average percentages by district and student group
avg_cast_results <- cast_data_prep %>%
  group_by(DistrictName, `Student Group`) %>%
  summarize(
    AvgOverallMeetExceed = mean(`Percentage Standard Met and Above`, na.rm = TRUE),
    AvgLifeSciences = mean(`Life Sciences Domain Percent Above Standard`, na.rm = TRUE),
    AvgPhysicalSciences = mean(`Physical Sciences Domain Percent Above Standard`, na.rm = TRUE),
    AvgEarthSpaceSciences = mean(`Earth and Space Sciences Domain Percent Above Standard`, na.rm = TRUE)
  )

# Create a grouped bar plot
avg_cast_results_long <- avg_cast_results %>%
  pivot_longer(cols = starts_with("Avg"), 
               names_to = "Domain", 
               values_to = "Percentage") %>%
  mutate(Domain = case_when(
    Domain == "AvgOverallMeetExceed" ~ "Overall Met/Exceeded",
    Domain == "AvgLifeSciences" ~ "Life Sciences",
    Domain == "AvgPhysicalSciences" ~ "Physical Sciences",
    Domain == "AvgEarthSpaceSciences" ~ "Earth and Space Sciences",
    TRUE ~ Domain
  ))

ggplot(avg_cast_results_long, aes(x = `Student Group`, y = Percentage, fill = DistrictName)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Domain, scales = "free_y") +
  theme_minimal() +
  labs(title = "CAST Results by Science Domain and Student Group",
       x = "Student Group",
       y = "Average Percentage Above Standard",
       fill = "District") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.margin = margin(l = 30, unit = "pt")  # Add left padding
  ) +
  scale_fill_brewer(palette = "Set3")

# Print summary of results
print(avg_cast_results)

## # A tibble: 39 × 6
## # Groups:   DistrictName [3]
##    DistrictName    `Student Group`          AvgOverallMeetExceed AvgLifeSciences
##    <chr>           <chr>                                   <dbl>           <dbl>
##  1 Fremont Unified African American                        62.1            34.3 
##  2 Fremont Unified All Students                            56.9            31.4 
##  3 Fremont Unified American Indian or Alas…                80.0            51.7 
##  4 Fremont Unified Asian                                   61.6            33.0 
##  5 Fremont Unified English Learners                        11.2             4.14
##  6 Fremont Unified Female                                  56.7            30.5 
##  7 Fremont Unified Foster Youth                            48.7            31.3 
##  8 Fremont Unified Male                                    57.0            32.3 
##  9 Fremont Unified Migrant Education                        4.56            1.17
## 10 Fremont Unified Military Dependents                     57.6            30.3 
## # ℹ 29 more rows
## # ℹ 2 more variables: AvgPhysicalSciences <dbl>, AvgEarthSpaceSciences <dbl>

This analysis provides insights into the science performance across different domains in each district. The grouped bar plot allows for easy comparison of performance in Life Sciences, Physical Sciences, and Earth and Space Sciences, as well as the overall percentage of students meeting or exceeding standards.

4. Socioeconomic Factors and Educational Outcomes

Lastly, we’ll look at the performance across different science domains for each district:

library(dplyr)
library(tidyr)
library(ggplot2)

# Prepare CAST data for analysis
cast_data_prep <- cast_data_filtered %>%
  select(`District Code`, Grade, 
         `Life Sciences Domain Percent Above Standard`,
         `Physical Sciences Domain Percent Above Standard`,
         `Earth and Space Sciences Domain Percent Above Standard`) %>%
  mutate(across(contains("Percent Above Standard"), as.numeric)) %>%
  mutate(DistrictName = case_when(
    `District Code` == 75101 ~ "Pleasanton Unified",
    `District Code` == 61176 ~ "Fremont Unified",
    `District Code` == 61192 ~ "Hayward Unified",
    TRUE ~ "Unknown"
  ))

# Reshape the data for plotting
cast_data_long <- cast_data_prep %>%
  pivot_longer(
    cols = contains("Percent Above Standard"),
    names_to = "Domain",
    values_to = "Percent_Above_Standard"
  ) %>%
  mutate(Domain = gsub(" Domain Percent Above Standard", "", Domain))

# Calculate average performance by district and domain
avg_performance <- cast_data_long %>%
  group_by(DistrictName, Domain) %>%
  summarize(Avg_Percent_Above_Standard = mean(Percent_Above_Standard, na.rm = TRUE))

# Create a grouped bar plot
ggplot(avg_performance, aes(x = Domain, y = Avg_Percent_Above_Standard, fill = DistrictName)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Average Performance Across Science Domains by District",
       x = "Science Domain",
       y = "Average Percent of Students Above Standard",
       fill = "District") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.margin = margin(l = 20, unit = "pt")) +
  scale_fill_brewer(palette = "Set2")

# Print summary statistics
print(avg_performance)

## # A tibble: 9 × 3
## # Groups:   DistrictName [3]
##   DistrictName       Domain                   Avg_Percent_Above_Standard
##   <chr>              <chr>                                         <dbl>
## 1 Fremont Unified    Earth and Space Sciences                      27.2 
## 2 Fremont Unified    Life Sciences                                 26.8 
## 3 Fremont Unified    Physical Sciences                             27.5 
## 4 Hayward Unified    Earth and Space Sciences                       5.97
## 5 Hayward Unified    Life Sciences                                  7.04
## 6 Hayward Unified    Physical Sciences                              7.43
## 7 Pleasanton Unified Earth and Space Sciences                      28.6 
## 8 Pleasanton Unified Life Sciences                                 28.5 
## 9 Pleasanton Unified Physical Sciences                             30.3

# Perform analysis of variance (ANOVA) to check for significant differences
anova_result <- aov(Percent_Above_Standard ~ DistrictName + Domain, data = cast_data_long)
print(summary(anova_result))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## DistrictName   2  76965   38482 250.428 <2e-16 ***
## Domain         2    198      99   0.645  0.525    
## Residuals    754 115865     154                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 48 observations deleted due to missingness

Observations

District Performance Disparity: Hayward Unified consistently underperforms across all measured categories compared to other districts in the study, indicating a need for comprehensive review of educational strategies.
Demographic Achievement Gap: Students from migrant, socioeconomically disadvantaged, and English learner backgrounds show significantly lower performance across subjects, highlighting the need for targeted interventions.
Absenteeism and Stability Correlation: A strong inverse relationship exists between chronic absenteeism and student stability rates, with high absenteeism (>50%) correlating with notably lower stability.
English Learner Graduation Gap: English language learners exhibit a small but significant lag in graduation rates compared to their peers, suggesting a need for enhanced support systems.

Visualizations

Let’s create visualizations to support each of these findings.

library(dplyr)
library(ggplot2)
library(tidyr)
library(forcats)

# Helper function to add district names
add_district_names <- function(df) {
  df %>%
    mutate(DistrictName = case_when(
      `District Code` == 75101 ~ "Pleasanton Unified",
      `District Code` == 61176 ~ "Fremont Unified",
      `District Code` == 61192 ~ "Hayward Unified",
      TRUE ~ "Unknown"
    ))
}

District Performance Across Categories

This graph shows that Hayward unified generally has lower performance across all categories compared to the other districts.

library(tidyverse)
library(ggplot2)

# Prepare the data
cast_data_summary <- cast_data_filtered %>%
  filter(`Student Group ID` == 1, Grade == 99) %>%
  select(`District Code`, 
         `Percentage Standard Exceeded`,
         `Percentage Standard Met`,
         `Percentage Standard Nearly Met`,
         `Percentage Standard Not Met`) %>%
  pivot_longer(cols = starts_with("Percentage"), 
               names_to = "Category", 
               values_to = "Percentage") %>%
  mutate(
    Percentage = as.numeric(Percentage),
    Category = factor(Category, 
                      levels = c("Percentage Standard Not Met",
                                 "Percentage Standard Nearly Met",
                                 "Percentage Standard Met",
                                 "Percentage Standard Exceeded")),
    DistrictName = case_when(
      `District Code` == "61176" ~ "Fremont Unified",
      `District Code` == "61192" ~ "Hayward Unified",
      `District Code` == "75101" ~ "Pleasanton Unified"
    )
  )

# Create the visualization
ggplot(cast_data_summary, aes(x = DistrictName, y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = sprintf("%.1f%%", Percentage)), 
            position = position_stack(vjust = 0.5), 
            color = "white", fontface = "bold", angle=90) +
  scale_fill_manual(values = c("Percentage Standard Not Met" = "#FFA07A",
                               "Percentage Standard Nearly Met" = "#98FB98",
                               "Percentage Standard Met" = "#87CEFA",
                               "Percentage Standard Exceeded" = "#DDA0DD")) +
  labs(title = "CAST Performance Comparison Across Districts",
       subtitle = "Based on California Science Test Results",
       x = "District",
       y = "Percentage",
       fill = "Performance Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5),
        legend.position = "right",
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12)) +
  coord_flip()

Performance comparison for marginalized communities

This graph illustrates that Migrant Education, Socioeconomically Disadvantaged, and English Learners generally perform lower than the overall student population across all districts.

vulnerable_groups <- cast_data_filtered %>%
  add_district_names() %>%
  filter(`Student Group ID` %in% c(160, 31, 128, 1)) %>%  # Include All Students (1) for comparison
  mutate(`Student Group` = case_when(
    `Student Group ID` == 160 ~ "Migrant Education",
    `Student Group ID` == 31 ~ "Socioeconomically Disadvantaged",
    `Student Group ID` == 128 ~ "English Learners",
    `Student Group ID` == 1 ~ "All Students"
  )) %>%
  select(DistrictName, `Student Group`, `Percentage Standard Met and Above`) %>%
  mutate(`Percentage Standard Met and Above` = as.numeric(`Percentage Standard Met and Above`))

ggplot(vulnerable_groups, aes(x = `Student Group`, y = `Percentage Standard Met and Above`, fill = DistrictName)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Performance of Vulnerable Groups Across Districts",
       x = "Student Group",
       y = "Percentage Met and Above Standard",
       fill = "District") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Graduation Rates for English Learners Compared to Overall

This graph compares graduation rates between English Learners and the overall student population, showing that English Learners indeed have slightly lower graduation rates across all three districts.

library(dplyr)
library(ggplot2)
library(forcats)

# Prepare the data
graduation_rates <- acgr_data_filtered %>%
  add_district_names() %>%
  filter(ReportingCategory %in% c("SE", "TA", "RB")) %>%  # SE for English Learners, TA for All Students, RB for African American
  select(DistrictName, ReportingCategory, `Regular HS Diploma Graduates (Rate)`) %>%
  mutate(
    `Regular HS Diploma Graduates (Rate)` = as.numeric(gsub("%", "", `Regular HS Diploma Graduates (Rate)`)),
    ReportingCategory = case_when(
      ReportingCategory == "SE" ~ "English Learners",
      ReportingCategory == "TA" ~ "All Students",
      ReportingCategory == "RB" ~ "African American",
      TRUE ~ ReportingCategory
    )
  )

# Create the plot
ggplot(graduation_rates, aes(x = DistrictName, y = `Regular HS Diploma Graduates (Rate)`, fill = ReportingCategory)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("All Students" = "steelblue", "English Learners" = "orange", "African American" = "darkgreen")) +
  labs(title = "Graduation Rates: Comparison Across Select Groups",
       x = "District",
       y = "Graduation Rate (%)",
       fill = "Student Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Print summary statistics
summary_stats <- graduation_rates %>%
  group_by(ReportingCategory) %>%
  summarize(
    Mean_Rate = mean(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE),
    SD_Rate = sd(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE)
  )

print(summary_stats)

## # A tibble: 3 × 3
##   ReportingCategory Mean_Rate SD_Rate
##   <chr>                 <dbl>   <dbl>
## 1 African American       83.3    6.86
## 2 All Students           84.6   11.1 
## 3 English Learners       70.2   13.0

Recommendations

For Policymakers

Targeted Resource Allocation:
- Allocate additional resources to Hayward Unified, which shows lower performance across most metrics compared to Fremont and Pleasanton.
- Focus on programs that support English Learners and socioeconomically disadvantaged students, as these groups consistently show lower performance.
Early Intervention Programs:
- Implement and fund early intervention programs in elementary and middle schools, particularly in Hayward Unified, to address achievement gaps before they widen in high school.
Chronic Absenteeism:
- Develop district-wide policies to address chronic absenteeism, particularly for marginalized groups. This could include improved transportation options, health services, and family outreach programs.

For School Administrators

Data-Driven Decision Making:
- Regularly analyze performance data disaggregated by demographic groups to identify specific areas of need and track progress.
- Use this data to inform curriculum adjustments and resource allocation within schools.
Professional Development:
- Provide teachers with cultural competency training to better support diverse student populations.
- Offer training on differentiated instruction techniques to address varying learning needs.

For Teachers

Culturally Responsive Teaching:
- Incorporate culturally relevant content and examples in lesson plans to engage students from diverse backgrounds.
- Create an inclusive classroom environment that values and respects all cultural backgrounds.
Individualized Support:
- Implement personalized learning plans for students from marginalized groups who are struggling academically.
- Provide additional after-school or weekend tutoring sessions targeted at closing achievement gaps.

For Parents and Community Members

Engagement and Advocacy:
- Encourage active participation in school board meetings and parent-teacher associations, especially from underrepresented communities.
- Advocate for increased funding and resources for programs supporting marginalized students.
Mentorship Programs:
- Establish community mentorship programs pairing successful individuals from marginalized backgrounds with students to provide guidance and role models.
Home-School Partnerships:
- Develop programs to strengthen the connection between homes and schools, such as parent education workshops on supporting student learning at home.

District-Specific Recommendations

Hayward Unified:
- Focus on improving overall graduation rates and reducing dropout rates through targeted interventions and support programs.
- Address the higher chronic absenteeism rates by implementing attendance incentive programs and improving school climate.
Fremont Unified:
- While performing well overall, focus on closing the achievement gap for African American and Hispanic/Latino students.
Pleasanton Unified:
- Despite high overall performance, address the achievement gap for socioeconomically disadvantaged students.

Cross-District Collaboration

Establish a tri-district taskforce to share best practices, particularly focusing on how Pleasanton and Fremont can share successful strategies with Hayward.
Implement a student exchange program allowing students from marginalized groups to experience learning environments in higher-performing schools.

Technology and Access

Ensure all students, especially those from marginalized communities, have access to necessary technology and internet connectivity for remote learning and homework completion.
Implement digital literacy programs for both students and parents to maximize the use of available technological resources.

Mental Health and Well-being

Increase the number of school counselors and mental health professionals, especially in schools with higher proportions of marginalized students.
Implement social-emotional learning programs to help students develop resilience and coping skills.

Conclusion

By implementing these recommendations, we aim to see measurable improvements in the performance of students from marginalized communities across all three districts over the next 3 years. Regular monitoring and adjustment of these strategies will be crucial to ensure their effectiveness and to make necessary modifications as needs evolve.

East Bay Public Schools Performance and Equity Analysis

Case Study for BSU Fall 24 COID 269 Assignment 7

Setu Garg

Introduction

Data Cleaning and Preparation

Introduction

Load Libraries and Import Data

Data Overview

Data Cleaning

Clean ACGR Data

Clean Stability Rate Data

Clean Absenteeism Data

Clean CAST Data

Clean Entities Data

Check for Missing Values

Summary of Missing Values

Addressing Missing Values

Filtering

Analysis

1. Comparing Graduation Rates

2. Stability Rates and Chronic Absenteeism

3. Examining CAST (California Science Test) Results

4. Socioeconomic Factors and Educational Outcomes

Observations

Visualizations

District Performance Across Categories

Performance comparison for marginalized communities

Graduation Rates for English Learners Compared to Overall

Recommendations

For Policymakers

For School Administrators

For Teachers

For Parents and Community Members

District-Specific Recommendations

Cross-District Collaboration

Technology and Access

Mental Health and Well-being

Conclusion