2017-09-16 15:15:00+02:00
What variables might affect the total crime rate?
When it comes to the issue of crime, one of the questions we'd like to answer is what factors have a relationship with crime. In this analysis, I'll be looking at data for the counties of New York State for the year of 2016. The article Variables Affecting Crime will serve as the guide to what variables I will choose initially for the analysis. These are the steps I'll be taking:
- Collect the data from various sources.
- Clean the data.
- Explore the data to try to identify variables that are likely to have a relationship with Total Crime Rate.
- Finally, perform an OLS multiple regression on our identified variables.
NOTE:
- The last U.S. census was performed in 2010, therefore, some variables are merely estimates from the source for the year 2016.
- A few of the variables are from 2015 and have no 2016 update as of the completion of this notebook. I'm making the assumption that the true values for the year of 2016 on that variable are not substantially different from the year prior.
Source of Image: http://dankilde.tripod.com/webonmediacontents/crimestats.jpg?1409168946245
Gather and Clean the Data
We will be looking at data from five sources for our analysis.
- Census statistics taken from the U.S. Census Bureau.
- Unemployment rates taken from Bureau of Labor Statistics
- High School Drop Outrates taken from New York State Education Department
- Crime rates taken from Division of Criminal Justics Services, New York State
- Divorce rates taken from New York State Department of Health
NOTE: The links take you directly to the website where the data can be found. I've provided that data in their original form as well as the final cleaned version here.
# Import the libraries we'll be using
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from statsmodels.formula.api import ols
%matplotlib inline
1a. Census
I had to retrieve the data six counties at the time per the limit they had on their website. This resulted in 11 csv files that I concatenated as shown below.
# Get a list of all the datasets
census_datasets = []
census_datasets.append(pd.read_csv('../data/crime-notebook-data/2016-ny-census-data/QuickFacts Sep-06-2017.csv'))
for i in range(1,11):
census_datasets.append(pd.read_csv('../data/crime-notebook-data/2016-ny-census-data/QuickFacts Sep-06-2017({0}).csv'.format(i)))
# Concatenate them into one DataFrame named 'census'
census = census_datasets[0]
for dataframe in census_datasets[1:]:
census = pd.concat([census,dataframe.ix[:,2:]],axis=1)
census.head(3)
# Many of the columns that start with 'Value Note' seem to be have a lot of NaN values.
# Show's the first five columns where all values are NaN
print(census.columns[census.isnull().all()][:5])
# In fact, all columns that start with 'Value Note' are filled entirely with NaN so we can easily drop them as they
# don't give us any information. 'Fact Note' is mostly NaN as well so we'll drop that tboo.
census.dropna(axis=1,how='all',inplace=True)
census.drop('Fact Note', axis=1,inplace=True)
# If we look at the last few rows we also see that they are mostly filled with NaN's so we'll drop those too
census.tail(3)
# Drops any rows with less than 5 non-NaN values. I'll drop the row 'FIPS Code' here because we'll be using county
# names instead
census.dropna(thresh=5,inplace=True)
FIPS_index = census[census['Fact'] == 'FIPS Code'].index
census.drop(FIPS_index,inplace=True)
# Because we are going to merging data from multiple datasets we want to use the names of the counties as the index to
# merge on. Therefore, we're going to rename them so that they're easier to match with.
import re
# Function to pull out only the County name
def get_county_name_census(name):
if name == 'New York County (Manhattan Borough), New York':
return 'New York'
elif name == 'St. Lawrence County, New York':
return 'St. Lawrence'
else:
return re.split(', | ',name)[0]
col_names = list(map(get_county_name_census,census.columns[1:]))
col_names.insert(0, 'Fact')
census.columns = col_names
census.columns
# Next, we want to transpose the dataframe so that we can perform our analyis on it more appropriately. We can do this
# by setting 'Fact' as the index and then using 'T' on the dataframe to transpose it
census.set_index('Fact',inplace=True)
census = census.T
census.head()
# Currently all our values our strings. We want to remove all the unnecessary characters
# Removes ',' '$' and '%
def clean_value(val):
chars = ['$',',','%']
for c in chars:
if c in val:
val = val.replace(c,'')
return val
for col in census.columns:
census[col] = census[col].apply(clean_value)
# For some odd reason the letter 'Z' is in place for the value of the percent change in population from 2010 to 2016
# for the county of Suffolk, so we're going to change that to represent the right number.
old_value = census.ix['Suffolk']['Population, percent change - April 1, 2010 (estimates base) to July 1, 2016, (V2016)']
# This code merely represents (Total Population in 2016 - Total Population in 2010) / Total Population in 2010 for the
# county of Suffolk
suffolk = census.ix['Suffolk']
census.ix['Suffolk','Population, percent change - April 1, 2010 (estimates base) to July 1, 2016, (V2016)'] = (
float(suffolk['Population estimates, July 1, 2016, (V2016)']) - (
float(suffolk['Population estimates base, April 1, 2010, (V2016)'])
)) / float(suffolk['Population estimates base, April 1, 2010, (V2016)']) * 100
new_value = census.ix['Suffolk']['Population, percent change - April 1, 2010 (estimates base) to July 1, 2016, (V2016)']
print('Old Value: {0}, New Value: {1}'.format(old_value,new_value))
# Our last step with this dataset is to select the variables(features) we think have a relationship with
# the crime rate. I've selected the ones I believe I have a relationship. You're more than welcome to use the other
# variables as you see fit
final_cols = ['Population, percent change - April 1, 2010 (estimates base) to July 1, 2016, (V2016)',
'Owner-occupied housing unit rate, 2011-2015',
'Persons per household, 2011-2015',
'High school graduate or higher, percent of persons age 25 years+, 2011-2015',
'Bachelor\'s degree or higher, percent of persons age 25 years+, 2011-2015',
'Persons without health insurance, under age 65 years, percent',
'In civilian labor force, total, percent of population age 16 years+, 2011-2015',
'Median household income (in 2015 dollars), 2011-2015',
'Persons in poverty, percent',
'Population estimates, July 1, 2016, (V2016)',
'Land area in square miles, 2010']
census = census[final_cols]
# Now let's convert all the values to floats so we can perform calculations on them later
census = census.astype(float)
# We'll also create another column for Population Density(per sq. mile) because it's not already in the dataset but we have all
# information we need to calculate it
census['Population Density(sq.m)'] = census['Population estimates, July 1, 2016, (V2016)'] / census['Land area in square miles, 2010']
# Drop the last two columns.
census.drop(['Population estimates, July 1, 2016, (V2016)','Land area in square miles, 2010'],axis=1,inplace=True)
# Lastly for this dataset, we'll rename the column names to be a little more concise
new_col_names = ['Pop.Pct.Change 2010-16', 'Owner Occupy House Rate, 2011-2015','Persons per household',
'High School grad or Higher, 2011-2015','Bachelor\'s Degree or Higher, 2011-2015',
'Persons w/out health insurance, under 65 pct', 'Civilian Labor Force Partcipation Pct',
'Median Household income', 'Poverty Pct.', 'Population Density(sq.m)']
census.columns = new_col_names
census.reset_index(inplace=True)
census.rename(index=str,columns={'index':'County'},inplace=True)
census.head()
1b. Unemployment
unemployment = pd.read_excel('../data/crime-notebook-data/laucnty16.xlsx',sheetname='laucnty16',skiprows=np.arange(4))
# Select columns we're interested in and drop the first(all NaN) row
unemployment = unemployment[['County Name/State Abbreviation','(%)']]
unemployment.columns = ['County','Unemployment Rate']
unemployment.drop(0)
# The data includes all counties in the U.S. so we're going to extract only those counties in New York
idx1 = unemployment.index[(unemployment['County'] == 'Albany County, NY')]
idx2 = unemployment.index[(unemployment['County'] == 'Yates County, NY')]
unemployment = unemployment.ix[idx1[0]:idx2[0]]
# Function to pull out only the County Name
def get_county_name_unemployment(name):
if name == 'New York County, NY':
return 'New York'
elif name == 'St. Lawrence County, NY':
return 'St. Lawrence'
else:
return name.split(' ')[0]
unemployment['County'] = unemployment['County'].apply(get_county_name_unemployment)
# unemployment.set_index('County',inplace=True)
unemployment.head()
1c. High School Dropout
dropout = pd.read_csv('../data/crime-notebook-data/GRAD_RATE_AND_OUTCOMES_2016.csv')
# First, we want to only use the rows that are county aggregated and represent all students
dropout = dropout[(dropout['AGGREGATION_TYPE'] == 'County') & (dropout['SUBGROUP_NAME'] == 'All Students')]
dropout.head()
# Now we want to replace '-' with NaN and convert the column types to int
for col in ['ENROLL_CNT', 'DROPOUT_CNT']:
dropout[col].replace('-',np.NaN,inplace=True)
dropout[col] = dropout[col].astype(int)
# Each county has 4 types on cohorts graduating based on when a student started highschool. Therefore, we want to sum
# the counts for the cohorts to get the total count for the county
dropout = dropout.groupby('COUNTY_NAME')[['ENROLL_CNT', 'DROPOUT_CNT']].sum()
# Create a column for the dropout rate
dropout['DROPOUT_PCT'] = dropout['DROPOUT_CNT'] / dropout['ENROLL_CNT'] * 100
dropout.reset_index(inplace=True)
dropout.rename(index=str,columns={'COUNTY_NAME':'County'},inplace=True)
dropout.drop(['ENROLL_CNT','DROPOUT_CNT'],axis=1,inplace=True)
# Function to pull out only the county name
def get_county_name_dropout(name):
if name == 'NEW YORK':
return 'New York'
elif name == 'SAINT LAWRENCE':
return 'St. Lawrence'
else:
return name[0] + name[1:].lower()
dropout['County'] = dropout['County'].apply(get_county_name_dropout)
dropout.head()
1d. Divorce
# Get Divorce totals by County for 2015
divorce_total = pd.read_html('https://www.health.ny.gov/statistics/vital_statistics/2015/table51.htm')
divorce_total = divorce_total[0]
# Get Population Estimated Counts for 2015
population_2015 = pd.read_html('https://www.health.ny.gov/statistics/vital_statistics/2015/table02.htm')
population_2015 = population_2015[0]
# Combine the data using County as the key
divorce = pd.merge(divorce_total, population_2015,on='County')
#Drop Non-County rows
divorce.drop([0,1,2,3,4,5,6,7,13,14,15,16],inplace=True)
# The 'crude divorce rate' is typically the number of divorces per 1000 people
divorce['DivorceRate'] = (divorce['Total'] * 1000) / divorce['2015 Population Estimate ']
divorce = divorce[['County','DivorceRate']]
#Find index of row 'St Lawrence' and change it to 'St. Lawrence'
idx = divorce.index[divorce['County'] == 'St Lawrence']
divorce.set_value(index=idx[0],col='County',value='St. Lawrence')
divorce.head()
1e. Crime
crime = pd.read_excel('../data/crime-notebook-data/2016-county-index-rates.xls', sheetname='rates',header=None,skiprows=np.arange(4),
names=['County', '2016 Pop','Total Count','Total Rate', 'Violent Count', 'Violent Rate',
'Property Count', 'Property Rate'])
# There are a lot of empty and unnnecessary rows that we need to drop
crime.drop(0,inplace=True)
crime.dropna(inplace=True)
crime.drop(crime.County.iloc[-3:].index,inplace=True)
# Select the columns we want and turn the rates into floats. Since the rate is per 100,000 thousand people we will
# divide the rates by 100,000 and multiply by 100
crime = crime[['County','Total Rate','Violent Rate','Property Rate']]
crime = pd.concat([crime['County'],crime[['Total Rate','Violent Rate','Property Rate']].apply(pd.to_numeric)],axis=1)
crime[['Total Rate','Violent Rate','Property Rate']] = crime[['Total Rate','Violent Rate','Property Rate']].apply(
lambda num: num/1000)
#Rename 'St Lawrence' to 'St. Lawrence' for merging purposes
st_idx = crime.index[crime['County'] == 'St Lawrence']
crime.set_value(index=st_idx,col='County',value='St. Lawrence')
crime.head()
1f. Combining the data
# Datasets
df_list = [census,unemployment,dropout,divorce]
df = crime
for d in df_list:
df = df.merge(d, on='County',how='outer')
# Make sure there aren't any missing values
df.isnull().any()
# Let's see what our data looks like now
df.info()
df.head()
Exploring the data
Now that the data has been collected and cleaned. The next steps are to figure out what variables we'll want to use for our OLS regression.
Source of Image: http://cnasstudent.ucr.edu/images/statistics-arrow.jpg
2a. Statistical Overview
# Call describe(0) on the first 10 variables. This will give us descriptive statistics such as mean, std, etc..
df.ix[:,:10].describe()
# then on the remaining variables
df.ix[:,10:].describe()
# Create a heatmap of correlation matrix
sns.heatmap(df.corr())
Looking at the heatmap, we notice a few variables that have modest correlation coefficients with Total Rate. Some of these are the percent of highschool dropouts, percent of population living in poverty, and the rate of those who live in the house they own.
2b. Finding the right variables
# Reduce the variables for our regression to those with a p-value of < 0.05 when computing their correlation
# coefficient.
sig_attributes = []
for col in df.columns[4:]:
cor, p_value = pearsonr(df['Total Rate'], df[col])
if p_value <= 0.05:
sig_attributes.append(col)
print('{0}\nCorrelation Coefficient: {1}, P-value: {2}\n'.format(col,cor,p_value))
sns.heatmap(df[sig_attributes].corr())
After reducing our variables we can see more clearly which variables are correlated with each other. Four in particular stick out to me:
- Owner Occupy House Rate, 2011-2015 with Poverty Pct.: -0.599762
- Median Household Income with Poverty Pct: -0.665606
- DROPOUT_PCT wiht Poverty Pct: 0.608678
- Median Household Income with DROPOUT_PCT: -0.509701
We're likely going to have to drop even more variables after our regression because many of these variables appear collinear as we would expect. For example, we probably aren't surprised that on average counties with a higher percent of those in the population living under the poverty line have higher high school dropout rates. The reasons for this may be that school funding for such areas may be minimal, students may need to help their parents out with living expenses, etc.
sns.pairplot(df[sig_attributes])
fig , ax1 = plt.subplots(1,1,figsize=(12,5))
sns.distplot(df['Population Density(sq.m)'],ax=ax1,kde=False)
Because there are so many outlier values in the Population Density variable we will not include it in our regression.
2c. OLS Regression
#Let's create a dataset of the with only those variables we think have a relationship with the Total crime Rate
df_final = df[['Total Rate', 'Persons per household', 'Median Household income', 'Poverty Pct.', 'DROPOUT_PCT',
'Owner Occupy House Rate, 2011-2015', 'DivorceRate']]
df_final.columns = ['C_rate', 'PPH', 'MedianIncome', 'PctPoverty', 'PctDropout', 'HouseOwnerRate', 'DivorceRate']
# First we're going to run an OLS regression on all the selected variables.
est1 = ols(formula='C_rate ~ PPH + MedianIncome + PctPoverty + PctDropout + HouseOwnerRate + DivorceRate',
data=df_final).fit()
print(est1.summary())
If we look at the p-values for our variables only two are below our 0.05 designation, PctDropout and HouseOwnerRate.
est2 = ols(formula='C_rate ~ HouseOwnerRate + PctDropout',data=df_final).fit()
print(est2.summary())
Conclusion
3a. Results
After completing our analysis we found that the percent of high school students that drop out of highschool and the rate of Owner Occupied Housing have a strong relationship with the Total Crime Rate. The results for our final regression tell us quite a few things. An increase of the Owner Occupied Housing Rate by 1% will result in a decrease in the Total Crime Rate on average by about 0.0151% per thousand people. An increase of the percent of high school dropouts by one percent will result in an increase in the Total Crime Rate by 0.0669% per thousand people.
3b. Limitations
There are quite a number of limitations in this analysis but I'll briefly mention a few of them.
A. Limitations of the Analysis itself:
- We only have data for the counties of New York City in the year of 2016. Ideally, we would want to have data for multiple years as well as for different states. Many of the variables did have data going years back, however, some of them didn't.
- There are probably other variables I could have found and included in this analysis but I choose to only use a limited amount because they were only 62 points of data making too many variables potentially problematic.
B. Limitations of trying to understand what affects crime using a data driven approach: (Note that I'm not an expert on the topic)
- Many of the variables that affect crime aren't easily subjected to this form of analysis. Some of these include citizens attitudes towards crime and policies of other components of the criminal justice system.