This case study examines student performance and equity in East Bay public schools, focusing on Pleasanton Unified, Hayward Unified, and Fremont Unified school districts. As a parent and concerned citizen, I believe that understanding the factors influencing educational outcomes is crucial for our community’s future. This analysis aims to uncover disparities in student performance across different demographic groups and identify potential areas for intervention and policy changes to promote educational equity.
Using data from the California Department of Education, including graduation rates, chronic absenteeism, stability rates, and standardized test scores, we will explore three key questions:
Through this analysis, we hope to provide valuable insights to educators, policymakers, and community members, ultimately contributing to more equitable and effective educational practices in the East Bay.
Let’s start!
This section contains the data cleaning and preparation steps for our analysis of student performance and equity in East Bay public schools. We’ll be working with several datasets from the 2021-2022 academic year:
Let’s begin by loading our libraries and importing our data.
First, we’ll load the required R packages and import our datasets. Here is a quick overview of our data set files:
All the data files have been sourced from the California Department of Education Website. The data source has been verified to be reliable, trustworthy and up to date.
# Load necessary libraries
# Load required libraries
library(tidyverse)
library(readr)
library(lubridate)
library(ggplot2)
# Import ACGR (Adjusted Cohort Graduation Rate) data
acgr_data <- read_delim("acgr22-v3.txt", delim = "\t")
# Import Stability Rate data
stability_data <- read_delim("sr2122.txt", delim = "\t")
# Import Chronic Absenteeism data
absenteeism_data <- read_delim("chronicabsenteeism22-v3.txt", delim = "\t")
# Import CAST (California Science Test) data
cast_data <- read_delim("cast_ca2022_all_01_csv_v1.txt", delim = "^")
# Import Entities data
entities_data <- read_delim("cast_ca2022entities_csv.txt", delim = "^")
Now that we have imported our data, lets take a peek inside to see what we are working with. We will do this by looking into the first few rows of the data sets using the head() function and then analysing the structure of the data set using the str() function.
# Display the first few rows and structure of each dataset
head(acgr_data)
## # A tibble: 6 × 34
## AcademicYear AggregateLevel CountyCode DistrictCode SchoolCode CountyName
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2021-22 C 01 NA <NA> Alameda
## 2 2021-22 C 01 NA <NA> Alameda
## 3 2021-22 C 01 NA <NA> Alameda
## 4 2021-22 C 01 NA <NA> Alameda
## 5 2021-22 C 01 NA <NA> Alameda
## 6 2021-22 C 01 NA <NA> Alameda
## # ℹ 28 more variables: DistrictName <chr>, SchoolName <chr>,
## # CharterSchool <chr>, DASS <chr>, ReportingCategory <chr>,
## # CohortStudents <chr>, `Regular HS Diploma Graduates (Count)` <chr>,
## # `Regular HS Diploma Graduates (Rate)` <chr>,
## # `Met UC/CSU Grad Req's (Count)` <chr>,
## # `Met UC/CSU Grad Req's (Rate)` <chr>, `Seal of Biliteracy (Count)` <chr>,
## # `Seal of Biliteracy (Rate)` <chr>, …
str(acgr_data)
## spc_tbl_ [113,235 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ AcademicYear : chr [1:113235] "2021-22" "2021-22" "2021-22" "2021-22" ...
## $ AggregateLevel : chr [1:113235] "C" "C" "C" "C" ...
## $ CountyCode : chr [1:113235] "01" "01" "01" "01" ...
## $ DistrictCode : num [1:113235] NA NA NA NA NA NA NA NA NA NA ...
## $ SchoolCode : chr [1:113235] NA NA NA NA ...
## $ CountyName : chr [1:113235] "Alameda" "Alameda" "Alameda" "Alameda" ...
## $ DistrictName : chr [1:113235] NA NA NA NA ...
## $ SchoolName : chr [1:113235] NA NA NA NA ...
## $ CharterSchool : chr [1:113235] "All" "All" "All" "All" ...
## $ DASS : chr [1:113235] "All" "All" "All" "All" ...
## $ ReportingCategory : chr [1:113235] "GF" "GM" "GX" "RA" ...
## $ CohortStudents : chr [1:113235] "8536" "9104" "16" "4557" ...
## $ Regular HS Diploma Graduates (Count) : chr [1:113235] "7765" "7882" "11" "4349" ...
## $ Regular HS Diploma Graduates (Rate) : chr [1:113235] "91" "86.6" "68.8" "95.4" ...
## $ Met UC/CSU Grad Req's (Count) : chr [1:113235] "5317" "4597" "7" "3650" ...
## $ Met UC/CSU Grad Req's (Rate) : chr [1:113235] "68.5" "58.3" "63.6" "83.9" ...
## $ Seal of Biliteracy (Count) : chr [1:113235] "1306" "835" "1" "1188" ...
## $ Seal of Biliteracy (Rate) : chr [1:113235] "16.8" "10.6" "9.1" "27.3" ...
## $ Golden State Seal Merit Diploma (Count): chr [1:113235] "2738" "2137" "2" "2461" ...
## $ Golden State Seal Merit Diploma (Rate : chr [1:113235] "35.3" "27.1" "18.2" "56.6" ...
## $ CHSPE Completer (Count) : chr [1:113235] "14" "17" "1" "12" ...
## $ CHSPE Completer (Rate) : chr [1:113235] "0.2" "0.2" "6.3" "0.3" ...
## $ Adult Ed. HS Diploma (Count) : chr [1:113235] "0" "2" "0" "0" ...
## $ Adult Ed. HS Diploma (Rate) : chr [1:113235] "0" "0" "0" "0" ...
## $ SPED Certificate (Count) : chr [1:113235] "51" "106" "0" "37" ...
## $ SPED Certificate (Rate) : chr [1:113235] "0.6" "1.2" "0" "0.8" ...
## $ GED Completer (Count) : chr [1:113235] "5" "1" "0" "1" ...
## $ GED Completer (Rate) : chr [1:113235] "0.1" "0" "0" "0" ...
## $ Other Transfer (Count) : chr [1:113235] "16" "25" "0" "4" ...
## $ Other Transfer (Rate) : chr [1:113235] "0.2" "0.3" "0" "0.1" ...
## $ Dropout (Count) : chr [1:113235] "465" "758" "2" "106" ...
## $ Dropout (Rate) : chr [1:113235] "5.4" "8.3" "12.5" "2.3" ...
## $ Still Enrolled (Count) : chr [1:113235] "220" "313" "2" "48" ...
## $ Still Enrolled (Rate) : chr [1:113235] "2.6" "3.4" "12.5" "1.1" ...
## - attr(*, "spec")=
## .. cols(
## .. AcademicYear = col_character(),
## .. AggregateLevel = col_character(),
## .. CountyCode = col_character(),
## .. DistrictCode = col_double(),
## .. SchoolCode = col_character(),
## .. CountyName = col_character(),
## .. DistrictName = col_character(),
## .. SchoolName = col_character(),
## .. CharterSchool = col_character(),
## .. DASS = col_character(),
## .. ReportingCategory = col_character(),
## .. CohortStudents = col_character(),
## .. `Regular HS Diploma Graduates (Count)` = col_character(),
## .. `Regular HS Diploma Graduates (Rate)` = col_character(),
## .. `Met UC/CSU Grad Req's (Count)` = col_character(),
## .. `Met UC/CSU Grad Req's (Rate)` = col_character(),
## .. `Seal of Biliteracy (Count)` = col_character(),
## .. `Seal of Biliteracy (Rate)` = col_character(),
## .. `Golden State Seal Merit Diploma (Count)` = col_character(),
## .. `Golden State Seal Merit Diploma (Rate` = col_character(),
## .. `CHSPE Completer (Count)` = col_character(),
## .. `CHSPE Completer (Rate)` = col_character(),
## .. `Adult Ed. HS Diploma (Count)` = col_character(),
## .. `Adult Ed. HS Diploma (Rate)` = col_character(),
## .. `SPED Certificate (Count)` = col_character(),
## .. `SPED Certificate (Rate)` = col_character(),
## .. `GED Completer (Count)` = col_character(),
## .. `GED Completer (Rate)` = col_character(),
## .. `Other Transfer (Count)` = col_character(),
## .. `Other Transfer (Rate)` = col_character(),
## .. `Dropout (Count)` = col_character(),
## .. `Dropout (Rate)` = col_character(),
## .. `Still Enrolled (Count)` = col_character(),
## .. `Still Enrolled (Rate)` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(stability_data)
## # A tibble: 6 × 16
## `Academic Year` `Aggregate Level` `County Code` `District Code` `School Code`
## <chr> <chr> <chr> <dbl> <chr>
## 1 2021-22 T 00 NA <NA>
## 2 2021-22 T 00 NA <NA>
## 3 2021-22 T 00 NA <NA>
## 4 2021-22 T 00 NA <NA>
## 5 2021-22 T 00 NA <NA>
## 6 2021-22 T 00 NA <NA>
## # ℹ 11 more variables: `County Name` <chr>, `District Name` <chr>,
## # `School Name` <chr>, `Charter School` <chr>, DASS <chr>,
## # `Reporting Category` <chr>, `Adjusted Cumulative Enrollment` <chr>,
## # `Stability Count` <chr>, `Stability Rate (percent)` <chr>,
## # `Non-Stability Count` <chr>, `Non-Stability Rate (percent)` <chr>
str(stability_data)
## spc_tbl_ [353,391 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Academic Year : chr [1:353391] "2021-22" "2021-22" "2021-22" "2021-22" ...
## $ Aggregate Level : chr [1:353391] "T" "T" "T" "T" ...
## $ County Code : chr [1:353391] "00" "00" "00" "00" ...
## $ District Code : num [1:353391] NA NA NA NA NA NA NA NA NA NA ...
## $ School Code : chr [1:353391] NA NA NA NA ...
## $ County Name : chr [1:353391] "State" "State" "State" "State" ...
## $ District Name : chr [1:353391] NA NA NA NA ...
## $ School Name : chr [1:353391] NA NA NA NA ...
## $ Charter School : chr [1:353391] "All" "All" "All" "No " ...
## $ DASS : chr [1:353391] "All" "No " "Yes" "All" ...
## $ Reporting Category : chr [1:353391] "GF" "GF" "GF" "GF" ...
## $ Adjusted Cumulative Enrollment: chr [1:353391] "2942620" "2869233" "129287" "2604353" ...
## $ Stability Count : chr [1:353391] "2649435" "2601593" "47860" "2344381" ...
## $ Stability Rate (percent) : chr [1:353391] "90.0" "90.7" "37.0" "90.0" ...
## $ Non-Stability Count : chr [1:353391] "293185" "267640" "81427" "259972" ...
## $ Non-Stability Rate (percent) : chr [1:353391] "10.0" "9.3" "63.0" "10.0" ...
## - attr(*, "spec")=
## .. cols(
## .. `Academic Year` = col_character(),
## .. `Aggregate Level` = col_character(),
## .. `County Code` = col_character(),
## .. `District Code` = col_double(),
## .. `School Code` = col_character(),
## .. `County Name` = col_character(),
## .. `District Name` = col_character(),
## .. `School Name` = col_character(),
## .. `Charter School` = col_character(),
## .. DASS = col_character(),
## .. `Reporting Category` = col_character(),
## .. `Adjusted Cumulative Enrollment` = col_character(),
## .. `Stability Count` = col_character(),
## .. `Stability Rate (percent)` = col_character(),
## .. `Non-Stability Count` = col_character(),
## .. `Non-Stability Rate (percent)` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(absenteeism_data)
## # A tibble: 6 × 14
## `Academic Year` `Aggregate Level` `County Code` `District Code` `School Code`
## <chr> <chr> <chr> <dbl> <chr>
## 1 2021-22 T 00 NA <NA>
## 2 2021-22 T 00 NA <NA>
## 3 2021-22 T 00 NA <NA>
## 4 2021-22 T 00 NA <NA>
## 5 2021-22 T 00 NA <NA>
## 6 2021-22 T 00 NA <NA>
## # ℹ 9 more variables: `County Name` <chr>, `District Name` <chr>,
## # `School Name` <chr>, `Charter School` <chr>, DASS <chr>,
## # `Reporting Category` <chr>,
## # ChronicAbsenteeismEligibleCumulativeEnrollment <chr>,
## # ChronicAbsenteeismCount <chr>, ChronicAbsenteeismRate <chr>
str(absenteeism_data)
## spc_tbl_ [342,607 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Academic Year : chr [1:342607] "2021-22" "2021-22" "2021-22" "2021-22" ...
## $ Aggregate Level : chr [1:342607] "T" "T" "T" "T" ...
## $ County Code : chr [1:342607] "00" "00" "00" "00" ...
## $ District Code : num [1:342607] NA NA NA NA NA NA NA NA NA NA ...
## $ School Code : chr [1:342607] NA NA NA NA ...
## $ County Name : chr [1:342607] "State" "State" "State" "State" ...
## $ District Name : chr [1:342607] NA NA NA NA ...
## $ School Name : chr [1:342607] NA NA NA NA ...
## $ Charter School : chr [1:342607] "All" "All" "All" "All" ...
## $ DASS : chr [1:342607] "All" "All" "All" "All" ...
## $ Reporting Category : chr [1:342607] "GF" "GM" "GR13" "GR46" ...
## $ ChronicAbsenteeismEligibleCumulativeEnrollment: chr [1:342607] "2913279" "3077837" "1280630" "1333547" ...
## $ ChronicAbsenteeismCount : chr [1:342607] "866427" "931454" "397646" "363188" ...
## $ ChronicAbsenteeismRate : chr [1:342607] "29.7" "30.3" "31.1" "27.2" ...
## - attr(*, "spec")=
## .. cols(
## .. `Academic Year` = col_character(),
## .. `Aggregate Level` = col_character(),
## .. `County Code` = col_character(),
## .. `District Code` = col_double(),
## .. `School Code` = col_character(),
## .. `County Name` = col_character(),
## .. `District Name` = col_character(),
## .. `School Name` = col_character(),
## .. `Charter School` = col_character(),
## .. DASS = col_character(),
## .. `Reporting Category` = col_character(),
## .. ChronicAbsenteeismEligibleCumulativeEnrollment = col_character(),
## .. ChronicAbsenteeismCount = col_character(),
## .. ChronicAbsenteeismRate = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(cast_data)
## # A tibble: 6 × 30
## `County Code` `District Code` `School Code` Filler `Test Year`
## <chr> <chr> <chr> <lgl> <dbl>
## 1 01 00000 0000000 NA 2022
## 2 01 00000 0000000 NA 2022
## 3 01 00000 0000000 NA 2022
## 4 01 00000 0000000 NA 2022
## 5 01 00000 0000000 NA 2022
## 6 01 00000 0000000 NA 2022
## # ℹ 25 more variables: `Student Group ID` <dbl>, `Test Type` <chr>,
## # `Total Tested at Reporting Level` <chr>,
## # `Total Tested with Scores at Reporting Level` <chr>, Grade <dbl>,
## # `Test ID` <dbl>, `Students Enrolled` <chr>, `Students Tested` <chr>,
## # `Mean Scale Score` <chr>, `Percentage Standard Exceeded` <chr>,
## # `Percentage Standard Met` <chr>, `Percentage Standard Met and Above` <chr>,
## # `Percentage Standard Nearly Met` <chr>, …
str(cast_data)
## spc_tbl_ [49,965 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ County Code : chr [1:49965] "01" "01" "01" "01" ...
## $ District Code : chr [1:49965] "00000" "00000" "00000" "00000" ...
## $ School Code : chr [1:49965] "0000000" "0000000" "0000000" "0000000" ...
## $ Filler : logi [1:49965] NA NA NA NA NA NA ...
## $ Test Year : num [1:49965] 2022 2022 2022 2022 2022 ...
## $ Student Group ID : num [1:49965] 1 3 4 6 7 8 28 29 31 50 ...
## $ Test Type : chr [1:49965] "X" "X" "X" "X" ...
## $ Total Tested at Reporting Level : chr [1:49965] "50473" "25986" "24441" "42902" ...
## $ Total Tested with Scores at Reporting Level : chr [1:49965] "50355" "25916" "24394" "42809" ...
## $ Grade : num [1:49965] 5 5 5 5 5 5 5 5 5 5 ...
## $ Test ID : num [1:49965] 17 17 17 17 17 17 17 17 17 17 ...
## $ Students Enrolled : chr [1:49965] "15953" "8167" "7777" "12711" ...
## $ Students Tested : chr [1:49965] "15566" "7948" "7610" "12423" ...
## $ Mean Scale Score : chr [1:49965] "206.4" "205.9" "206.9" "211.9" ...
## $ Percentage Standard Exceeded : chr [1:49965] "19.28" "19.20" "19.36" "23.96" ...
## $ Percentage Standard Met : chr [1:49965] "23.25" "23.00" "23.49" "28.06" ...
## $ Percentage Standard Met and Above : chr [1:49965] "42.53" "42.20" "42.85" "52.02" ...
## $ Percentage Standard Nearly Met : chr [1:49965] "41.38" "40.40" "42.42" "38.28" ...
## $ Percentage Standard Not Met : chr [1:49965] "16.09" "17.41" "14.73" "9.70" ...
## $ Students with Scores : chr [1:49965] "15555" "7944" "7603" "12413" ...
## $ Life Sciences Domain Percent Below Standard : chr [1:49965] "40.65" "42.05" "39.21" "30.24" ...
## $ Life Sciences Domain Percent Near Standard : chr [1:49965] "45.73" "45.85" "45.57" "52.88" ...
## $ Life Sciences Domain Percent Above Standard : chr [1:49965] "13.62" "12.10" "15.21" "16.89" ...
## $ Physical Sciences Domain Percent Below Standard : chr [1:49965] "38.80" "40.19" "37.39" "29.49" ...
## $ Physical Sciences Domain Percent Near Standard : chr [1:49965] "46.98" "45.21" "48.81" "52.96" ...
## $ Physical Sciences Domain Percent Above Standard : chr [1:49965] "14.22" "14.61" "13.80" "17.55" ...
## $ Earth and Space Sciences Domain Percent Below Standard: chr [1:49965] "37.99" "38.12" "37.87" "27.69" ...
## $ Earth and Space Sciences Domain Percent Near Standard : chr [1:49965] "43.39" "41.66" "45.20" "49.27" ...
## $ Earth and Space Sciences Domain Percent Above Standard: chr [1:49965] "18.61" "20.21" "16.94" "23.04" ...
## $ Type ID : num [1:49965] 5 5 5 5 5 5 5 5 5 5 ...
## - attr(*, "spec")=
## .. cols(
## .. `County Code` = col_character(),
## .. `District Code` = col_character(),
## .. `School Code` = col_character(),
## .. Filler = col_logical(),
## .. `Test Year` = col_double(),
## .. `Student Group ID` = col_double(),
## .. `Test Type` = col_character(),
## .. `Total Tested at Reporting Level` = col_character(),
## .. `Total Tested with Scores at Reporting Level` = col_character(),
## .. Grade = col_double(),
## .. `Test ID` = col_double(),
## .. `Students Enrolled` = col_character(),
## .. `Students Tested` = col_character(),
## .. `Mean Scale Score` = col_character(),
## .. `Percentage Standard Exceeded` = col_character(),
## .. `Percentage Standard Met` = col_character(),
## .. `Percentage Standard Met and Above` = col_character(),
## .. `Percentage Standard Nearly Met` = col_character(),
## .. `Percentage Standard Not Met` = col_character(),
## .. `Students with Scores` = col_character(),
## .. `Life Sciences Domain Percent Below Standard` = col_character(),
## .. `Life Sciences Domain Percent Near Standard` = col_character(),
## .. `Life Sciences Domain Percent Above Standard` = col_character(),
## .. `Physical Sciences Domain Percent Below Standard` = col_character(),
## .. `Physical Sciences Domain Percent Near Standard` = col_character(),
## .. `Physical Sciences Domain Percent Above Standard` = col_character(),
## .. `Earth and Space Sciences Domain Percent Below Standard` = col_character(),
## .. `Earth and Space Sciences Domain Percent Near Standard` = col_character(),
## .. `Earth and Space Sciences Domain Percent Above Standard` = col_character(),
## .. `Type ID` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(entities_data)
## # A tibble: 6 × 10
## `County Code` `District Code` `School Code` Filler `Test Year` `Type ID`
## <chr> <chr> <chr> <lgl> <dbl> <dbl>
## 1 00 00000 0000000 NA 2022 4
## 2 01 00000 0000000 NA 2022 5
## 3 01 10017 0000000 NA 2022 6
## 4 01 10017 0112607 NA 2022 9
## 5 01 10017 0123968 NA 2022 9
## 6 01 10017 0124172 NA 2022 9
## # ℹ 4 more variables: `County Name` <chr>, `District Name` <chr>,
## # `School Name` <chr>, `Zip Code` <dbl>
str(entities_data)
## spc_tbl_ [11,128 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ County Code : chr [1:11128] "00" "01" "01" "01" ...
## $ District Code: chr [1:11128] "00000" "00000" "10017" "10017" ...
## $ School Code : chr [1:11128] "0000000" "0000000" "0000000" "0112607" ...
## $ Filler : logi [1:11128] NA NA NA NA NA NA ...
## $ Test Year : num [1:11128] 2022 2022 2022 2022 2022 ...
## $ Type ID : num [1:11128] 4 5 6 9 9 9 9 7 7 9 ...
## $ County Name : chr [1:11128] "State of California" "Alameda" "Alameda" "Alameda" ...
## $ District Name: chr [1:11128] NA NA "Alameda County Office of Education" "Envision Academy for Arts & Technology" ...
## $ School Name : chr [1:11128] NA NA NA "Envision Academy for Arts & Technology" ...
## $ Zip Code : num [1:11128] NA NA NA 94612 94606 ...
## - attr(*, "spec")=
## .. cols(
## .. `County Code` = col_character(),
## .. `District Code` = col_character(),
## .. `School Code` = col_character(),
## .. Filler = col_logical(),
## .. `Test Year` = col_double(),
## .. `Type ID` = col_double(),
## .. `County Name` = col_character(),
## .. `District Name` = col_character(),
## .. `School Name` = col_character(),
## .. `Zip Code` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Now that we have imported our data and know how it looks like, let’s clean each dataset so that it is it a state where we can run some analysis on it. This involves converting appropriate columns to numeric type and categorical variables to factors. Here are the high level steps we will take in the next few sections:
acgr_data <- acgr_data %>%
# Convert numeric columns from character to numeric type
mutate(across(c(CohortStudents,
`Regular HS Diploma Graduates (Count)`,
`Regular HS Diploma Graduates (Rate)`,
`Met UC/CSU Grad Req's (Count)`,
`Met UC/CSU Grad Req's (Rate)`,
`Dropout (Count)`,
`Dropout (Rate)`),
as.numeric)) %>%
# Convert categorical variables to factors
mutate(AcademicYear = as.factor(AcademicYear),
AggregateLevel = as.factor(AggregateLevel),
CharterSchool = as.factor(CharterSchool),
DASS = as.factor(DASS),
ReportingCategory = as.factor(ReportingCategory))
# Display the structure of the cleaned ACGR data
str(acgr_data)
## tibble [113,235 × 34] (S3: tbl_df/tbl/data.frame)
## $ AcademicYear : Factor w/ 1 level "2021-22": 1 1 1 1 1 1 1 1 1 1 ...
## $ AggregateLevel : Factor w/ 4 levels "C","D","S","T": 1 1 1 1 1 1 1 1 1 1 ...
## $ CountyCode : chr [1:113235] "01" "01" "01" "01" ...
## $ DistrictCode : num [1:113235] NA NA NA NA NA NA NA NA NA NA ...
## $ SchoolCode : chr [1:113235] NA NA NA NA ...
## $ CountyName : chr [1:113235] "Alameda" "Alameda" "Alameda" "Alameda" ...
## $ DistrictName : chr [1:113235] NA NA NA NA ...
## $ SchoolName : chr [1:113235] NA NA NA NA ...
## $ CharterSchool : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ DASS : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ ReportingCategory : Factor w/ 19 levels "GF","GM","GX",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ CohortStudents : num [1:113235] 8536 9104 16 4557 1593 ...
## $ Regular HS Diploma Graduates (Count) : num [1:113235] 7765 7882 11 4349 1327 ...
## $ Regular HS Diploma Graduates (Rate) : num [1:113235] 91 86.6 68.8 95.4 83.3 78.4 93.8 81.4 82.2 86.5 ...
## $ Met UC/CSU Grad Req's (Count) : num [1:113235] 5317 4597 7 3650 602 ...
## $ Met UC/CSU Grad Req's (Rate) : num [1:113235] 68.5 58.3 63.6 83.9 45.4 53.3 65.1 45.2 35.1 29.3 ...
## $ Seal of Biliteracy (Count) : chr [1:113235] "1306" "835" "1" "1188" ...
## $ Seal of Biliteracy (Rate) : chr [1:113235] "16.8" "10.6" "9.1" "27.3" ...
## $ Golden State Seal Merit Diploma (Count): chr [1:113235] "2738" "2137" "2" "2461" ...
## $ Golden State Seal Merit Diploma (Rate : chr [1:113235] "35.3" "27.1" "18.2" "56.6" ...
## $ CHSPE Completer (Count) : chr [1:113235] "14" "17" "1" "12" ...
## $ CHSPE Completer (Rate) : chr [1:113235] "0.2" "0.2" "6.3" "0.3" ...
## $ Adult Ed. HS Diploma (Count) : chr [1:113235] "0" "2" "0" "0" ...
## $ Adult Ed. HS Diploma (Rate) : chr [1:113235] "0" "0" "0" "0" ...
## $ SPED Certificate (Count) : chr [1:113235] "51" "106" "0" "37" ...
## $ SPED Certificate (Rate) : chr [1:113235] "0.6" "1.2" "0" "0.8" ...
## $ GED Completer (Count) : chr [1:113235] "5" "1" "0" "1" ...
## $ GED Completer (Rate) : chr [1:113235] "0.1" "0" "0" "0" ...
## $ Other Transfer (Count) : chr [1:113235] "16" "25" "0" "4" ...
## $ Other Transfer (Rate) : chr [1:113235] "0.2" "0.3" "0" "0.1" ...
## $ Dropout (Count) : num [1:113235] 465 758 2 106 156 18 21 715 4 15 ...
## $ Dropout (Rate) : num [1:113235] 5.4 8.3 12.5 2.3 9.8 13.4 2.3 11.8 8.9 8.8 ...
## $ Still Enrolled (Count) : chr [1:113235] "220" "313" "2" "48" ...
## $ Still Enrolled (Rate) : chr [1:113235] "2.6" "3.4" "12.5" "1.1" ...
stability_data <- stability_data %>%
# Convert numeric columns from character to numeric type
mutate(across(c(`Adjusted Cumulative Enrollment`,
`Stability Count`,
`Stability Rate (percent)`,
`Non-Stability Count`,
`Non-Stability Rate (percent)`),
as.numeric)) %>%
# Convert categorical variables to factors
mutate(`Academic Year` = as.factor(`Academic Year`),
`Aggregate Level` = as.factor(`Aggregate Level`),
`Charter School` = as.factor(`Charter School`),
DASS = as.factor(DASS),
`Reporting Category` = as.factor(`Reporting Category`))
# Display the structure of the cleaned Stability Rate data
str(stability_data)
## tibble [353,391 × 16] (S3: tbl_df/tbl/data.frame)
## $ Academic Year : Factor w/ 1 level "2021-22": 1 1 1 1 1 1 1 1 1 1 ...
## $ Aggregate Level : Factor w/ 4 levels "C","D","S","T": 4 4 4 4 4 4 4 4 4 4 ...
## $ County Code : chr [1:353391] "00" "00" "00" "00" ...
## $ District Code : num [1:353391] NA NA NA NA NA NA NA NA NA NA ...
## $ School Code : chr [1:353391] NA NA NA NA ...
## $ County Name : chr [1:353391] "State" "State" "State" "State" ...
## $ District Name : chr [1:353391] NA NA NA NA ...
## $ School Name : chr [1:353391] NA NA NA NA ...
## $ Charter School : Factor w/ 3 levels "All","No ","Yes": 1 1 1 2 2 2 3 3 3 1 ...
## $ DASS : Factor w/ 3 levels "All","No ","Yes": 1 2 3 1 2 3 1 2 3 1 ...
## $ Reporting Category : Factor w/ 25 levels "GF","GM","GR13",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Adjusted Cumulative Enrollment: num [1:353391] 2942620 2869233 129287 2604353 2567882 ...
## $ Stability Count : num [1:353391] 2649435 2601593 47860 2344381 2317063 ...
## $ Stability Rate (percent) : num [1:353391] 90 90.7 37 90 90.2 38.6 78.7 86.1 33.9 89.6 ...
## $ Non-Stability Count : num [1:353391] 293185 267640 81427 259972 250819 ...
## $ Non-Stability Rate (percent) : num [1:353391] 10 9.3 63 10 9.8 61.4 21.3 13.9 66.1 10.4 ...
absenteeism_data <- absenteeism_data %>%
# Convert numeric columns from character to numeric type
mutate(across(c(ChronicAbsenteeismEligibleCumulativeEnrollment,
ChronicAbsenteeismCount,
ChronicAbsenteeismRate),
as.numeric)) %>%
# Convert categorical variables to factors
mutate(`Academic Year` = as.factor(`Academic Year`),
`Aggregate Level` = as.factor(`Aggregate Level`),
`Charter School` = as.factor(`Charter School`),
DASS = as.factor(DASS),
`Reporting Category` = as.factor(`Reporting Category`))
# Display the structure of the cleaned Absenteeism data
str(absenteeism_data)
## tibble [342,607 × 14] (S3: tbl_df/tbl/data.frame)
## $ Academic Year : Factor w/ 1 level "2021-22": 1 1 1 1 1 1 1 1 1 1 ...
## $ Aggregate Level : Factor w/ 4 levels "C","D","S","T": 4 4 4 4 4 4 4 4 4 4 ...
## $ County Code : chr [1:342607] "00" "00" "00" "00" ...
## $ District Code : num [1:342607] NA NA NA NA NA NA NA NA NA NA ...
## $ School Code : chr [1:342607] NA NA NA NA ...
## $ County Name : chr [1:342607] "State" "State" "State" "State" ...
## $ District Name : chr [1:342607] NA NA NA NA ...
## $ School Name : chr [1:342607] NA NA NA NA ...
## $ Charter School : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ DASS : Factor w/ 3 levels "All","No ","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Reporting Category : Factor w/ 26 levels "GF","GM","GR13",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ChronicAbsenteeismEligibleCumulativeEnrollment: num [1:342607] 2913279 3077837 1280630 1333547 903122 ...
## $ ChronicAbsenteeismCount : num [1:342607] 866427 931454 397646 363188 241996 ...
## $ ChronicAbsenteeismRate : num [1:342607] 29.7 30.3 31.1 27.2 26.8 30.1 30 40.4 43.3 NA ...
cast_data <- cast_data %>%
# Convert numeric columns from character to numeric type
mutate(across(c(`Students Enrolled`,
`Students Tested`,
`Mean Scale Score`,
`Percentage Standard Exceeded`,
`Percentage Standard Met`,
`Percentage Standard Met and Above`,
`Percentage Standard Nearly Met`,
`Percentage Standard Not Met`),
as.numeric)) %>%
# Convert categorical variables to factors
mutate(`Test Year` = as.factor(`Test Year`),
`Student Group ID` = as.factor(`Student Group ID`),
`Test Type` = as.factor(`Test Type`),
Grade = as.factor(Grade),
`Test ID` = as.factor(`Test ID`))
# Display the structure of the cleaned CAST data
str(cast_data)
## tibble [49,965 × 30] (S3: tbl_df/tbl/data.frame)
## $ County Code : chr [1:49965] "01" "01" "01" "01" ...
## $ District Code : chr [1:49965] "00000" "00000" "00000" "00000" ...
## $ School Code : chr [1:49965] "0000000" "0000000" "0000000" "0000000" ...
## $ Filler : logi [1:49965] NA NA NA NA NA NA ...
## $ Test Year : Factor w/ 1 level "2022": 1 1 1 1 1 1 1 1 1 1 ...
## $ Student Group ID : Factor w/ 55 levels "1","3","4","6",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Test Type : Factor w/ 1 level "X": 1 1 1 1 1 1 1 1 1 1 ...
## $ Total Tested at Reporting Level : chr [1:49965] "50473" "25986" "24441" "42902" ...
## $ Total Tested with Scores at Reporting Level : chr [1:49965] "50355" "25916" "24394" "42809" ...
## $ Grade : Factor w/ 8 levels "5","8","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Test ID : Factor w/ 1 level "17": 1 1 1 1 1 1 1 1 1 1 ...
## $ Students Enrolled : num [1:49965] 15953 8167 7777 12711 1209 ...
## $ Students Tested : num [1:49965] 15566 7948 7610 12423 1194 ...
## $ Mean Scale Score : num [1:49965] 206 206 207 212 223 ...
## $ Percentage Standard Exceeded : num [1:49965] 19.3 19.2 19.4 24 41.6 ...
## $ Percentage Standard Met : num [1:49965] 23.2 23 23.5 28.1 34.1 ...
## $ Percentage Standard Met and Above : num [1:49965] 42.5 42.2 42.9 52 75.7 ...
## $ Percentage Standard Nearly Met : num [1:49965] 41.4 40.4 42.4 38.3 22.3 ...
## $ Percentage Standard Not Met : num [1:49965] 16.09 17.41 14.73 9.7 2.01 ...
## $ Students with Scores : chr [1:49965] "15555" "7944" "7603" "12413" ...
## $ Life Sciences Domain Percent Below Standard : chr [1:49965] "40.65" "42.05" "39.21" "30.24" ...
## $ Life Sciences Domain Percent Near Standard : chr [1:49965] "45.73" "45.85" "45.57" "52.88" ...
## $ Life Sciences Domain Percent Above Standard : chr [1:49965] "13.62" "12.10" "15.21" "16.89" ...
## $ Physical Sciences Domain Percent Below Standard : chr [1:49965] "38.80" "40.19" "37.39" "29.49" ...
## $ Physical Sciences Domain Percent Near Standard : chr [1:49965] "46.98" "45.21" "48.81" "52.96" ...
## $ Physical Sciences Domain Percent Above Standard : chr [1:49965] "14.22" "14.61" "13.80" "17.55" ...
## $ Earth and Space Sciences Domain Percent Below Standard: chr [1:49965] "37.99" "38.12" "37.87" "27.69" ...
## $ Earth and Space Sciences Domain Percent Near Standard : chr [1:49965] "43.39" "41.66" "45.20" "49.27" ...
## $ Earth and Space Sciences Domain Percent Above Standard: chr [1:49965] "18.61" "20.21" "16.94" "23.04" ...
## $ Type ID : num [1:49965] 5 5 5 5 5 5 5 5 5 5 ...
entities_data <- entities_data %>%
# Convert categorical variables to factors
mutate(`Test Year` = as.factor(`Test Year`),
`Type ID` = as.factor(`Type ID`))
# Display the structure of the cleaned Entities data
str(entities_data)
## tibble [11,128 × 10] (S3: tbl_df/tbl/data.frame)
## $ County Code : chr [1:11128] "00" "01" "01" "01" ...
## $ District Code: chr [1:11128] "00000" "00000" "10017" "10017" ...
## $ School Code : chr [1:11128] "0000000" "0000000" "0000000" "0112607" ...
## $ Filler : logi [1:11128] NA NA NA NA NA NA ...
## $ Test Year : Factor w/ 1 level "2022": 1 1 1 1 1 1 1 1 1 1 ...
## $ Type ID : Factor w/ 6 levels "4","5","6","7",..: 1 2 3 5 5 5 5 4 4 5 ...
## $ County Name : chr [1:11128] "State of California" "Alameda" "Alameda" "Alameda" ...
## $ District Name: chr [1:11128] NA NA "Alameda County Office of Education" "Envision Academy for Arts & Technology" ...
## $ School Name : chr [1:11128] NA NA NA "Envision Academy for Arts & Technology" ...
## $ Zip Code : num [1:11128] NA NA NA 94612 94606 ...
Now that we’ve cleaned our data, let’s check for any missing values in each dataset.
# Function to check and display missing values
check_missing <- function(df) {
missing <- sapply(df, function(x) sum(is.na(x)))
missing[missing > 0]
}
# Check for missing values in all datasets
print("Missing values in ACGR data:")
## [1] "Missing values in ACGR data:"
print(check_missing(acgr_data))
## DistrictCode SchoolCode
## 8666 8666
## DistrictName SchoolName
## 8666 8666
## CohortStudents Regular HS Diploma Graduates (Count)
## 60203 60203
## Regular HS Diploma Graduates (Rate) Met UC/CSU Grad Req's (Count)
## 60203 60203
## Met UC/CSU Grad Req's (Rate) Dropout (Count)
## 60203 60203
## Dropout (Rate)
## 60203
print("Missing values in Stability Rate data:")
## [1] "Missing values in Stability Rate data:"
print(check_missing(stability_data))
## District Code School Code
## 11908 135723
## District Name School Name
## 11908 135723
## Adjusted Cumulative Enrollment Stability Count
## 111618 111618
## Stability Rate (percent) Non-Stability Count
## 111618 111618
## Non-Stability Rate (percent)
## 111618
print("Missing values in Absenteeism data:")
## [1] "Missing values in Absenteeism data:"
print(check_missing(absenteeism_data))
## District Code
## 11898
## School Code
## 11898
## District Name
## 11898
## School Name
## 11898
## ChronicAbsenteeismEligibleCumulativeEnrollment
## 105376
## ChronicAbsenteeismCount
## 105376
## ChronicAbsenteeismRate
## 105376
print("Missing values in CAST data:")
## [1] "Missing values in CAST data:"
print(check_missing(cast_data))
## Filler
## 49965
## Students Enrolled
## 12875
## Students Tested
## 12346
## Mean Scale Score
## 33090
## Percentage Standard Exceeded
## 24201
## Percentage Standard Met
## 24201
## Percentage Standard Met and Above
## 24201
## Percentage Standard Nearly Met
## 24201
## Percentage Standard Not Met
## 24201
## Life Sciences Domain Percent Below Standard
## 2976
## Life Sciences Domain Percent Near Standard
## 2976
## Life Sciences Domain Percent Above Standard
## 2976
## Physical Sciences Domain Percent Below Standard
## 2976
## Physical Sciences Domain Percent Near Standard
## 2976
## Physical Sciences Domain Percent Above Standard
## 2976
## Earth and Space Sciences Domain Percent Below Standard
## 2976
## Earth and Space Sciences Domain Percent Near Standard
## 2976
## Earth and Space Sciences Domain Percent Above Standard
## 2976
print("Missing values in Entities data:")
## [1] "Missing values in Entities data:"
print(check_missing(entities_data))
## Filler District Name School Name Zip Code
## 11128 59 1074 1074
Based on the output from our check for missing values, we can see that each dataset has some missing data. Let’s summarize the findings and suggest some approaches to handle these missing values:
Given the nature of our analysis, we should consider the following approaches:
Let’s implement these strategies on ACGR and CAST data sets. The other data sets will be used to filter these two:
# Add data level indicator to ACGR data
acgr_data <- acgr_data %>%
mutate(data_level = case_when(
!is.na(SchoolCode) ~ "School",
!is.na(DistrictCode) ~ "District",
!is.na(CountyCode) ~ "County",
TRUE ~ "State"
))
# Flag missing test scores in CAST data
cast_data <- cast_data %>%
mutate(missing_scores = is.na(`Mean Scale Score`))
Let’s filter the data for Pleasanton unified, Hayward unified and Fremont unified school districts using the following criteria:
We’ll apply these filters to each relevant dataset. We will also include only race, socioeconomic and gender data and exclude other factors like grade.
The following code does the following:
After running this code, we’ll have filtered datasets that focus only on Pleasanton Unified School District. We can then use these filtered datasets for our subsequent analysis.
# Load necessary libraries if not already loaded
library(dplyr)
library(tidyr)
# Define the district codes we're interested in
target_districts <- c("75101", "61176", "61192") # Pleasanton, Hayward, Fremont
# Filter ACGR (Adjusted Cohort Graduation Rate) Data
acgr_data_filtered <- acgr_data %>%
filter(DistrictCode %in% target_districts,
AggregateLevel == "D",
CharterSchool == "No ") %>%
filter(ReportingCategory %in% c("RA", "RB", "RF", "RH", "RP", "RT", "RW", "SE", "SS", "GF", "GM", "TA")) %>%
mutate(`District Code` = DistrictCode)
# Filter Stability Rate Data
stability_data_filtered <- stability_data %>%
filter(`District Code` %in% target_districts,
`Aggregate Level` == "D",
`Charter School` == "No ") %>%
filter(`Reporting Category` %in% c("RA", "RB", "RF", "RH", "RP", "RT", "RW", "SE", "SS", "GF", "GM"))
# Filter Absenteeism Data
absenteeism_data_filtered <- absenteeism_data %>%
filter(`District Code` %in% target_districts,
`Aggregate Level` == "D",
`Charter School` == "No ") %>%
filter(`Reporting Category` %in% c("RA", "RB", "RF", "RH", "RP", "RT", "RW", "SE", "SS", "GF", "GM"))
# Filter CAST Data
cast_data_filtered <- cast_data %>%
filter(`District Code` %in% target_districts,
`School Code` == "0000000") %>% # District-level data
filter(`Student Group ID` %in% c(1, 3, 4, 6, 7, 8, 9, 10, 11, 31, 75, 99, 100, 128, 129, 130, 144, 160, 180, 181))
# Filter Entities Data
entities_data_filtered <- entities_data %>%
filter(`District Code` %in% target_districts)
# Display the number of rows in each filtered dataset
cat("Number of rows in filtered ACGR data:", nrow(acgr_data_filtered), "\n")
## Number of rows in filtered ACGR data: 108
cat("Number of rows in filtered Stability data:", nrow(stability_data_filtered), "\n")
## Number of rows in filtered Stability data: 99
cat("Number of rows in filtered Absenteeism data:", nrow(absenteeism_data_filtered), "\n")
## Number of rows in filtered Absenteeism data: 99
cat("Number of rows in filtered CAST data:", nrow(cast_data_filtered), "\n")
## Number of rows in filtered CAST data: 269
cat("Number of rows in filtered Entities data:", nrow(entities_data_filtered), "\n")
## Number of rows in filtered Entities data: 96
# Display unique districts in our filtered data
unique_districts <- entities_data_filtered %>%
filter(`Type ID` == 6) %>% # Assuming Type ID 6 represents districts
select(`District Name`, `District Code`) %>%
distinct()
cat("\nDistricts included in our analysis:\n")
##
## Districts included in our analysis:
print(unique_districts)
## # A tibble: 3 × 2
## `District Name` `District Code`
## <chr> <chr>
## 1 Fremont Unified 61176
## 2 Hayward Unified 61192
## 3 Pleasanton Unified 75101
Let’s start by comparing graduation rates across Pleasanton, Hayward, and Fremont Unified school districts. We’ll break down the rates by demographic groups to identify any disparities or trends.
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)
# Create a function to map reporting categories to full names
get_full_category_name <- function(code) {
case_when(
code == "RA" ~ "Asian",
code == "RB" ~ "African American",
code == "RF" ~ "Filipino",
code == "RD" ~ "Not Reported",
code == "RH" ~ "Hispanic or Latino",
code == "RI" ~ "American Indian or Alaska Native",
code == "RP" ~ "Pacific Islander",
code == "RT" ~ "Two or More Races",
code == "RW" ~ "White",
code == "SE" ~ "English Learners",
code == "SS" ~ "Socioeconomically Disadvantaged",
code == "GF" ~ "Female",
code == "GM" ~ "Male",
code == "GX" ~ "Non-Binary",
code == "TA" ~ "All Students",
TRUE ~ code
)
}
# Filter and prepare graduation rate data
graduation_data <- acgr_data_filtered %>%
select(DistrictName, ReportingCategory, `Regular HS Diploma Graduates (Rate)`) %>%
# Convert rate to numeric, removing any '%' signs if present
mutate(`Regular HS Diploma Graduates (Rate)` = as.numeric(gsub("%", "", `Regular HS Diploma Graduates (Rate)`)),
ReportingCategory = get_full_category_name(ReportingCategory))
# Calculate average graduation rate by district and reporting category
avg_grad_rates <- graduation_data %>%
group_by(DistrictName, ReportingCategory) %>%
summarize(AvgGradRate = mean(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE))
# Create a bar plot of graduation rates
ggplot(avg_grad_rates, aes(x = ReportingCategory, y = AvgGradRate, fill = DistrictName)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Average Graduation Rates by Demographic Group",
x = "Demographic Group",
y = "Average Graduation Rate (%)",
fill = "District") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
# Calculate overall average graduation rate for each district
overall_avg_rates <- graduation_data %>%
group_by(DistrictName) %>%
summarize(OverallAvgRate = mean(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE))
print(overall_avg_rates)
## # A tibble: 3 × 2
## DistrictName OverallAvgRate
## <chr> <dbl>
## 1 Fremont Unified 85.8
## 2 Hayward Unified 79.4
## 3 Pleasanton Unified 90.0
This analysis provides a visual comparison of graduation rates across different demographic groups in the three districts. The bar plot allows us to easily identify any disparities between groups or districts. The overall average graduation rates give us a quick snapshot of each district’s performance.
Next, we’ll examine the stability rates and chronic absenteeism in these districts. Stability rates indicate the percentage of students who remain in the same school throughout the academic year, while chronic absenteeism refers to students who miss a significant number of school days.
# Prepare stability rate data
stability_data_prep <- stability_data_filtered %>%
select(`District Name`, `Reporting Category`, `Stability Rate (percent)`) %>%
mutate(`Stability Rate (percent)` = as.numeric(`Stability Rate (percent)`),
`Reporting Category` = get_full_category_name(`Reporting Category`))
# Prepare absenteeism data
absenteeism_data_prep <- absenteeism_data_filtered %>%
select(`District Name`, `Reporting Category`, `ChronicAbsenteeismRate`) %>%
mutate(ChronicAbsenteeismRate = as.numeric(ChronicAbsenteeismRate),
`Reporting Category` = get_full_category_name(`Reporting Category`))
# Calculate average stability and absenteeism rates by district and reporting category
avg_rates <- stability_data_prep %>%
group_by(`District Name`, `Reporting Category`) %>%
summarize(AvgStabilityRate = mean(`Stability Rate (percent)`, na.rm = TRUE)) %>%
left_join(
absenteeism_data_prep %>%
group_by(`District Name`, `Reporting Category`) %>%
summarize(AvgAbsenteeismRate = mean(ChronicAbsenteeismRate, na.rm = TRUE)),
by = c("District Name", "Reporting Category")
)
# Create a scatter plot
ggplot(avg_rates, aes(x = AvgStabilityRate, y = AvgAbsenteeismRate, color = `District Name`)) +
geom_point(size = 4) +
geom_text(aes(label = `Reporting Category`), hjust = 0, vjust = 0, check_overlap = TRUE, size = 3) +
theme_minimal() +
labs(title = "Stability Rate vs Chronic Absenteeism Rate by Demographic Group",
x = "Average Stability Rate (%)",
y = "Average Chronic Absenteeism Rate (%)",
color = "District") +
scale_color_brewer(palette = "Set1")
print(avg_rates)
## # A tibble: 33 × 4
## # Groups: District Name [3]
## `District Name` `Reporting Category` AvgStabilityRate AvgAbsenteeismRate
## <chr> <chr> <dbl> <dbl>
## 1 Fremont Unified African American 59.8 48.6
## 2 Fremont Unified Asian 64.5 22.5
## 3 Fremont Unified English Learners 56.6 34.9
## 4 Fremont Unified Female 67.1 34.2
## 5 Fremont Unified Filipino 67.6 31.3
## 6 Fremont Unified Hispanic or Latino 66.8 43.4
## 7 Fremont Unified Male 67.0 30
## 8 Fremont Unified Pacific Islander 80.2 44.5
## 9 Fremont Unified Socioeconomically Disadv… 64.4 41.9
## 10 Fremont Unified Two or More Races 70.8 36.2
## # ℹ 23 more rows
This analysis compares the stability rates and chronic absenteeism rates across the three districts. The scatter plot helps visualize any potential relationship between stability and absenteeism. We can observe that districts with higher stability rates tend to have lower absenteeism rates, and vice versa.
Now, let’s analyze the CAST results to see how science performance varies across these districts. We’ll look at the percentage of students meeting or exceeding standards in different science domains.
# Load necessary libraries if not already loaded
library(dplyr)
library(ggplot2)
library(tidyr)
# Create a function to map student group IDs to full names
get_full_group_name <- function(id) {
case_when(
id == 1 ~ "All Students",
id == 3 ~ "Female",
id == 4 ~ "Male",
id == 6 ~ "African American",
id == 7 ~ "American Indian or Alaska Native",
id == 8 ~ "Asian",
id == 9 ~ "Filipino",
id == 10 ~ "Hispanic or Latino",
id == 11 ~ "Pacific Islander",
id == 31 ~ "Socioeconomically Disadvantaged",
id == 75 ~ "Two or More Races",
id == 99 ~ "White",
id == 100 ~ "Non-Binary",
id == 128 ~ "English Learners",
id == 129 ~ "Students with Disabilities",
id == 130 ~ "Homeless Youth",
id == 144 ~ "Foster Youth",
id == 160 ~ "Migrant Education",
id == 180 ~ "Military Dependents",
id == 181 ~ "Not Reported",
TRUE ~ as.character(id)
)
}
# Prepare CAST data
cast_data_prep <- cast_data_filtered %>%
select(`District Code`, `County Code`, Grade, `Student Group ID`,
`Percentage Standard Met and Above`,
`Life Sciences Domain Percent Above Standard`,
`Physical Sciences Domain Percent Above Standard`,
`Earth and Space Sciences Domain Percent Above Standard`) %>%
mutate(across(c(`Percentage Standard Met and Above`,
`Life Sciences Domain Percent Above Standard`,
`Physical Sciences Domain Percent Above Standard`,
`Earth and Space Sciences Domain Percent Above Standard`),
as.numeric),
`Student Group` = get_full_group_name(`Student Group ID`))
# Add district names based on district codes
cast_data_prep <- cast_data_prep %>%
mutate(DistrictName = case_when(
`District Code` == 75101 ~ "Pleasanton Unified",
`District Code` == 61176 ~ "Fremont Unified",
`District Code` == 61192 ~ "Hayward Unified",
TRUE ~ "Unknown"
))
# Calculate average percentages by district and student group
avg_cast_results <- cast_data_prep %>%
group_by(DistrictName, `Student Group`) %>%
summarize(
AvgOverallMeetExceed = mean(`Percentage Standard Met and Above`, na.rm = TRUE),
AvgLifeSciences = mean(`Life Sciences Domain Percent Above Standard`, na.rm = TRUE),
AvgPhysicalSciences = mean(`Physical Sciences Domain Percent Above Standard`, na.rm = TRUE),
AvgEarthSpaceSciences = mean(`Earth and Space Sciences Domain Percent Above Standard`, na.rm = TRUE)
)
# Create a grouped bar plot
avg_cast_results_long <- avg_cast_results %>%
pivot_longer(cols = starts_with("Avg"),
names_to = "Domain",
values_to = "Percentage") %>%
mutate(Domain = case_when(
Domain == "AvgOverallMeetExceed" ~ "Overall Met/Exceeded",
Domain == "AvgLifeSciences" ~ "Life Sciences",
Domain == "AvgPhysicalSciences" ~ "Physical Sciences",
Domain == "AvgEarthSpaceSciences" ~ "Earth and Space Sciences",
TRUE ~ Domain
))
ggplot(avg_cast_results_long, aes(x = `Student Group`, y = Percentage, fill = DistrictName)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ Domain, scales = "free_y") +
theme_minimal() +
labs(title = "CAST Results by Science Domain and Student Group",
x = "Student Group",
y = "Average Percentage Above Standard",
fill = "District") +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.margin = margin(l = 30, unit = "pt") # Add left padding
) +
scale_fill_brewer(palette = "Set3")
# Print summary of results
print(avg_cast_results)
## # A tibble: 39 × 6
## # Groups: DistrictName [3]
## DistrictName `Student Group` AvgOverallMeetExceed AvgLifeSciences
## <chr> <chr> <dbl> <dbl>
## 1 Fremont Unified African American 62.1 34.3
## 2 Fremont Unified All Students 56.9 31.4
## 3 Fremont Unified American Indian or Alas… 80.0 51.7
## 4 Fremont Unified Asian 61.6 33.0
## 5 Fremont Unified English Learners 11.2 4.14
## 6 Fremont Unified Female 56.7 30.5
## 7 Fremont Unified Foster Youth 48.7 31.3
## 8 Fremont Unified Male 57.0 32.3
## 9 Fremont Unified Migrant Education 4.56 1.17
## 10 Fremont Unified Military Dependents 57.6 30.3
## # ℹ 29 more rows
## # ℹ 2 more variables: AvgPhysicalSciences <dbl>, AvgEarthSpaceSciences <dbl>
This analysis provides insights into the science performance across different domains in each district. The grouped bar plot allows for easy comparison of performance in Life Sciences, Physical Sciences, and Earth and Space Sciences, as well as the overall percentage of students meeting or exceeding standards.
Lastly, we’ll look at the performance across different science domains for each district:
library(dplyr)
library(tidyr)
library(ggplot2)
# Prepare CAST data for analysis
cast_data_prep <- cast_data_filtered %>%
select(`District Code`, Grade,
`Life Sciences Domain Percent Above Standard`,
`Physical Sciences Domain Percent Above Standard`,
`Earth and Space Sciences Domain Percent Above Standard`) %>%
mutate(across(contains("Percent Above Standard"), as.numeric)) %>%
mutate(DistrictName = case_when(
`District Code` == 75101 ~ "Pleasanton Unified",
`District Code` == 61176 ~ "Fremont Unified",
`District Code` == 61192 ~ "Hayward Unified",
TRUE ~ "Unknown"
))
# Reshape the data for plotting
cast_data_long <- cast_data_prep %>%
pivot_longer(
cols = contains("Percent Above Standard"),
names_to = "Domain",
values_to = "Percent_Above_Standard"
) %>%
mutate(Domain = gsub(" Domain Percent Above Standard", "", Domain))
# Calculate average performance by district and domain
avg_performance <- cast_data_long %>%
group_by(DistrictName, Domain) %>%
summarize(Avg_Percent_Above_Standard = mean(Percent_Above_Standard, na.rm = TRUE))
# Create a grouped bar plot
ggplot(avg_performance, aes(x = Domain, y = Avg_Percent_Above_Standard, fill = DistrictName)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Average Performance Across Science Domains by District",
x = "Science Domain",
y = "Average Percent of Students Above Standard",
fill = "District") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.margin = margin(l = 20, unit = "pt")) +
scale_fill_brewer(palette = "Set2")
# Print summary statistics
print(avg_performance)
## # A tibble: 9 × 3
## # Groups: DistrictName [3]
## DistrictName Domain Avg_Percent_Above_Standard
## <chr> <chr> <dbl>
## 1 Fremont Unified Earth and Space Sciences 27.2
## 2 Fremont Unified Life Sciences 26.8
## 3 Fremont Unified Physical Sciences 27.5
## 4 Hayward Unified Earth and Space Sciences 5.97
## 5 Hayward Unified Life Sciences 7.04
## 6 Hayward Unified Physical Sciences 7.43
## 7 Pleasanton Unified Earth and Space Sciences 28.6
## 8 Pleasanton Unified Life Sciences 28.5
## 9 Pleasanton Unified Physical Sciences 30.3
# Perform analysis of variance (ANOVA) to check for significant differences
anova_result <- aov(Percent_Above_Standard ~ DistrictName + Domain, data = cast_data_long)
print(summary(anova_result))
## Df Sum Sq Mean Sq F value Pr(>F)
## DistrictName 2 76965 38482 250.428 <2e-16 ***
## Domain 2 198 99 0.645 0.525
## Residuals 754 115865 154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 48 observations deleted due to missingness
District Performance Disparity: Hayward Unified consistently underperforms across all measured categories compared to other districts in the study, indicating a need for comprehensive review of educational strategies.
Demographic Achievement Gap: Students from migrant, socioeconomically disadvantaged, and English learner backgrounds show significantly lower performance across subjects, highlighting the need for targeted interventions.
Absenteeism and Stability Correlation: A strong inverse relationship exists between chronic absenteeism and student stability rates, with high absenteeism (>50%) correlating with notably lower stability.
English Learner Graduation Gap: English language learners exhibit a small but significant lag in graduation rates compared to their peers, suggesting a need for enhanced support systems.
Let’s create visualizations to support each of these findings.
library(dplyr)
library(ggplot2)
library(tidyr)
library(forcats)
# Helper function to add district names
add_district_names <- function(df) {
df %>%
mutate(DistrictName = case_when(
`District Code` == 75101 ~ "Pleasanton Unified",
`District Code` == 61176 ~ "Fremont Unified",
`District Code` == 61192 ~ "Hayward Unified",
TRUE ~ "Unknown"
))
}
This graph shows that Hayward unified generally has lower performance across all categories compared to the other districts.
library(tidyverse)
library(ggplot2)
# Prepare the data
cast_data_summary <- cast_data_filtered %>%
filter(`Student Group ID` == 1, Grade == 99) %>%
select(`District Code`,
`Percentage Standard Exceeded`,
`Percentage Standard Met`,
`Percentage Standard Nearly Met`,
`Percentage Standard Not Met`) %>%
pivot_longer(cols = starts_with("Percentage"),
names_to = "Category",
values_to = "Percentage") %>%
mutate(
Percentage = as.numeric(Percentage),
Category = factor(Category,
levels = c("Percentage Standard Not Met",
"Percentage Standard Nearly Met",
"Percentage Standard Met",
"Percentage Standard Exceeded")),
DistrictName = case_when(
`District Code` == "61176" ~ "Fremont Unified",
`District Code` == "61192" ~ "Hayward Unified",
`District Code` == "75101" ~ "Pleasanton Unified"
)
)
# Create the visualization
ggplot(cast_data_summary, aes(x = DistrictName, y = Percentage, fill = Category)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = sprintf("%.1f%%", Percentage)),
position = position_stack(vjust = 0.5),
color = "white", fontface = "bold", angle=90) +
scale_fill_manual(values = c("Percentage Standard Not Met" = "#FFA07A",
"Percentage Standard Nearly Met" = "#98FB98",
"Percentage Standard Met" = "#87CEFA",
"Percentage Standard Exceeded" = "#DDA0DD")) +
labs(title = "CAST Performance Comparison Across Districts",
subtitle = "Based on California Science Test Results",
x = "District",
y = "Percentage",
fill = "Performance Category") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.position = "right",
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 12)) +
coord_flip()
This graph illustrates that Migrant Education, Socioeconomically Disadvantaged, and English Learners generally perform lower than the overall student population across all districts.
vulnerable_groups <- cast_data_filtered %>%
add_district_names() %>%
filter(`Student Group ID` %in% c(160, 31, 128, 1)) %>% # Include All Students (1) for comparison
mutate(`Student Group` = case_when(
`Student Group ID` == 160 ~ "Migrant Education",
`Student Group ID` == 31 ~ "Socioeconomically Disadvantaged",
`Student Group ID` == 128 ~ "English Learners",
`Student Group ID` == 1 ~ "All Students"
)) %>%
select(DistrictName, `Student Group`, `Percentage Standard Met and Above`) %>%
mutate(`Percentage Standard Met and Above` = as.numeric(`Percentage Standard Met and Above`))
ggplot(vulnerable_groups, aes(x = `Student Group`, y = `Percentage Standard Met and Above`, fill = DistrictName)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set2") +
labs(title = "Performance of Vulnerable Groups Across Districts",
x = "Student Group",
y = "Percentage Met and Above Standard",
fill = "District") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This graph compares graduation rates between English Learners and the overall student population, showing that English Learners indeed have slightly lower graduation rates across all three districts.
library(dplyr)
library(ggplot2)
library(forcats)
# Prepare the data
graduation_rates <- acgr_data_filtered %>%
add_district_names() %>%
filter(ReportingCategory %in% c("SE", "TA", "RB")) %>% # SE for English Learners, TA for All Students, RB for African American
select(DistrictName, ReportingCategory, `Regular HS Diploma Graduates (Rate)`) %>%
mutate(
`Regular HS Diploma Graduates (Rate)` = as.numeric(gsub("%", "", `Regular HS Diploma Graduates (Rate)`)),
ReportingCategory = case_when(
ReportingCategory == "SE" ~ "English Learners",
ReportingCategory == "TA" ~ "All Students",
ReportingCategory == "RB" ~ "African American",
TRUE ~ ReportingCategory
)
)
# Create the plot
ggplot(graduation_rates, aes(x = DistrictName, y = `Regular HS Diploma Graduates (Rate)`, fill = ReportingCategory)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("All Students" = "steelblue", "English Learners" = "orange", "African American" = "darkgreen")) +
labs(title = "Graduation Rates: Comparison Across Select Groups",
x = "District",
y = "Graduation Rate (%)",
fill = "Student Group") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Print summary statistics
summary_stats <- graduation_rates %>%
group_by(ReportingCategory) %>%
summarize(
Mean_Rate = mean(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE),
SD_Rate = sd(`Regular HS Diploma Graduates (Rate)`, na.rm = TRUE)
)
print(summary_stats)
## # A tibble: 3 × 3
## ReportingCategory Mean_Rate SD_Rate
## <chr> <dbl> <dbl>
## 1 African American 83.3 6.86
## 2 All Students 84.6 11.1
## 3 English Learners 70.2 13.0
By implementing these recommendations, we aim to see measurable improvements in the performance of students from marginalized communities across all three districts over the next 3 years. Regular monitoring and adjustment of these strategies will be crucial to ensure their effectiveness and to make necessary modifications as needs evolve.