Raw data
The DOHMH Childcare Center Inspections dataset contains 26280
observations with 34 variables, including “center name”, “(a center’s)
legal name”, “building (number)”, “street”, “borough”, “zip code”,
“phone (number)”, “(a center’s) permit number”, “permit expiration”,
“status”, “age range”, “maximum capacity”, “day care ID”, “program
type”, “facility type”, “childcare type”, “building identification
number”, “URL (website)”, “date permitted”, “actual” (i.e., flag for
correct date of original permit), “violation rate percent”, “average
violation rate percent”, “total educational workers”, “average total
educational workers”, “public health hazard violation rate”, “average
public health hazard violation rate”, “critical violation rate”,
“average critical violation rate”, “inspection date”, “regulation
summary”, “violation category”, “health code sub section”, “violation
status” and “inspection summary result.” Each row is a single violation
cited during an inspection.
Data cleaning and manipulation
First, we found duplicated observations in the dataset and removed
them, leaving 21541 observations.
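
The de-duplication step can be sketched in Python with pandas as follows. This is only an illustration on a toy table, not the code used for the analysis; the snake_case column names and the example rows are assumptions.

```python
import pandas as pd

# Toy stand-in for the raw data; column names and values are assumed.
raw = pd.DataFrame({
    "center_name": ["A", "A", "B"],
    "violation_category": ["CRITICAL", "CRITICAL", "GENERAL"],
    "inspection_date": ["2019-01-02", "2019-01-02", "2019-03-04"],
})

# Drop rows that are identical in every column, keeping the first occurrence.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Number of repeated rows that were removed.
n_removed = len(raw) - len(deduped)
```

Here only rows that match in every column count as duplicates, which is the default behaviour of `drop_duplicates`.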
Given the goals of our analysis, we selected 22 key variables
in this dataset, including “center name”, “borough”, “zip code”,
“status”, “age range”, “maximum capacity”, “program type”, “facility
type”, “childcare type”, “violation category”, “violation status”,
“violation rate percent”, “average violation rate percent”, “total
educational workers”, “average total educational workers”, “public
health hazard violation rate”, “average public health hazard violation
rate”, “critical violation rate”, “average critical violation rate”,
“regulation summary,” and “inspection summary result.”
Next, we dropped observations with missing values (NAs) in the
following variables: “zip code”, “age range”, “violation rate percent”,
“public health hazard violation rate” and “critical violation rate.” We
also dropped observations whose maximum capacity equals 0.
Additionally, we found that the capitalization of values in “program
type” and “facility type” was inconsistent, which could lead to wrong
analysis results, so we converted both categorical variables to lower
case.
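
A minimal sketch of these three cleaning steps in Python with pandas is shown below. The column names and the toy values are assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the de-duplicated data; names and values are assumed.
df = pd.DataFrame({
    "zip_code": ["10001", None, "11201", "10458"],
    "age_range": ["2 - 5", "0 - 2", "2 - 5", "2 - 5"],
    "violation_rate_percent": [50.0, 20.0, 33.3, 10.0],
    "public_health_hazard_violation_rate": [10.0, 5.0, 0.0, 2.5],
    "critical_violation_rate": [25.0, 10.0, 20.0, 5.0],
    "maximum_capacity": [40, 25, 0, 60],
    "program_type": ["Preschool", "INFANT TODDLER", "PRESCHOOL", "preschool"],
    "facility_type": ["GDC", "GDC", "gdc", "Gdc"],
})

# 1. Drop rows with a missing value in any of the five listed variables.
required = [
    "zip_code",
    "age_range",
    "violation_rate_percent",
    "public_health_hazard_violation_rate",
    "critical_violation_rate",
]
cleaned = df.dropna(subset=required)

# 2. Drop rows whose maximum capacity is 0.
cleaned = cleaned[cleaned["maximum_capacity"] != 0].copy()

# 3. Lower-case the two categorical variables so that values such as
#    "Preschool" and "PRESCHOOL" fall into the same category.
for col in ["program_type", "facility_type"]:
    cleaned[col] = cleaned[col].str.lower()
```

Removing the zero-capacity rows at this stage also keeps a later division by maximum capacity well defined.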
Furthermore, we created a new variable called “educational worker
ratio” (total educational workers divided by maximum capacity) to
explore whether violations are associated with the proportion of
educational workers. Finally, we calculated a new violation rate for
each distinct program using “violation category” (the number of
violation cases divided by the number of inspection cases).
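
The two derived variables can be sketched in Python with pandas as follows. This is an illustration under stated assumptions, not the code used for the analysis: we assume that a row with a non-missing “violation category” counts as a violation case, that every row counts as an inspection case, and that a program is identified by center name together with program type. The column names and toy values are also assumed.

```python
import pandas as pd

# Toy stand-in for the cleaned data; names and values are assumed.
df = pd.DataFrame({
    "center_name": ["A", "A", "A", "B", "B"],
    "program_type": ["preschool"] * 3 + ["infant toddler"] * 2,
    "maximum_capacity": [40, 40, 40, 20, 20],
    "total_educational_workers": [8, 8, 8, 5, 5],
    "violation_category": ["CRITICAL", None, "GENERAL", None, "CRITICAL"],
})

# Educational worker ratio: total educational workers / maximum capacity.
df["educational_worker_ratio"] = (
    df["total_educational_workers"] / df["maximum_capacity"]
)

# Assumption: a non-missing violation category marks a violation case.
df["is_violation"] = df["violation_category"].notna().astype(int)

# Assumption: a "distinct program" is a (center name, program type) pair.
program_keys = ["center_name", "program_type"]

# New violation rate: violation cases / inspection cases within each program.
df["new_violation_rate"] = (
    df.groupby(program_keys)["is_violation"].transform("mean")
)
```

In this toy table, center A has two violation cases out of three inspection cases and center B has one out of two, so their rates differ even though both programs were cited.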
After tidying the data, we had 16451 observations with 25 variables
for the following analysis.