Data Collection and Analysis
In order to measure disparate impact, relevant demographic data for our projects and programs needs to be collected and analyzed to see if one protected class is disproportionately impacted compared to other groups. Please refer to Title VI - Types of Discrimination Factsheet for more information on disparate impact and its context within the Title VI program. This section will go over how to collect the data from the United States census, various ways to display and map that data, and how to do some basic entry level analysis of whether there is a disparate impact.
Below are some relevant links to resources from around the web to help users with some basic statistical concepts. These are provided if the practitioner would like more information about some of the methods discussed in this section.
https://gse.gmu.edu/research/tr/tr-comparison#Q&Q - Basic information on the differences between qualitative and quantitative research methods.
https://www.khanacademy.org/math/statistics-probability - Statistics and probability courses from Khan Academy.
https://www.udacity.com/course/statistics--st095 - Free elementary statistics course from San Jose State University.
https://onlinecourses.science.psu.edu/stat501/node/251/ - Statistics overview from Penn State University.
“P9 Table and LEP” Microsoft Excel Plugin - Excel add-in that will generate percentages and summarize the area under review after downloading census information.
Sources of Data - U.S. Census
The US Decennial census is the best and most accurate source for getting demographic data. In this section, we will give a brief overview of what the census is and how it gathers demographic data. When you go to the census website or to American Fact Finder you will likely encounter two different studies. The decennial census is conducted every 10 years. In the census, every person in the U.S. is surveyed and asked a series of ten questions. The American Community Survey is conducted annually using a random sample of American residents and contains a much larger set of questions.
Below is a picture of the 2010 census form with the demographic questions highlighted.
As you can see, the census includes two questions related to ethnicity and race. The first question asks the person whether they are of Hispanic origin and the second asks for the person's race. It is important to note that there are some common issues associated with how these data are reported, particularly in environmental documents. The first issue we often come across is reporting out the responses to these separate questions together. For example, “the area under review is 40% Hispanic, 70% White, 20% Black, and 10% Asian”. This creates confusion as it appears we are talking about one set of responses instead of two and the percentages add up to more than 100. A second issue we encounter is that some documents will only report the data for the race question, ignoring the Hispanic population. Both issues can result in not properly accounting for the minority population in the areas under review.
When analyzing data from the Census, it is important to understand how the data are presented geographically. The chart below shows the different geographical entities used in the census.
An interactive version of this chart is available at https://www.census.gov/geo/reference/webatlas/
Using the web atlas linked above, the Census provides definitions and examples for each of the listed geographies.
The geographic unit you choose for your analysis will vary based on what you are analyzing. Typically, though, for project reviews you will be looking at block groups and tracts (avoid the use of blocks in most situations as the population within them is small and highly variable). For larger reviews, such as reviews of statewide programs, you may wish to use a larger geographical entity such as a county.
American Fact Finder
American Fact Finder, available at https://factfinder.census.gov/, is the primary source for gathering the relevant data from the US Census. Since we are looking for demographic data related to race and ethnicity, this section will show how to gather that information. The easiest ways to display this data are through table P9 from the 2010 census and table B03002 from the 5-year American Community Survey (ACS) (The table in the census handout shows when it is appropriate to use what census product). These tables show the Hispanic or Latino population and the not Hispanic or Latino population broken down by race. Using these tables will avoid some of the issues related to the separate race and ethnicity questions discussed earlier.
Selecting Geographies - List Method
The example below demonstrates how to gather this data using the drop-down list option. In it, we will gather demographic data at the block group level for Baltimore County, MD from the 2010 census.
Step 1: Click on advanced search and under topic or table name type “P9,” then click “GO.”
Step 2: After pressing Go, you will be taken to the following page. Check the box next to the top result. Then click the geography button on the menu to the left (see next step).
Step 3: Use the drop-down menu items on the pop-up to narrow your search. In our case, we will select “block group – 150,” “Maryland,” “Baltimore,” and “All Block Groups within Baltimore County, Maryland.” Then click “Add to Your Selections” at the bottom of the pop-up. Note: You can also use the map under the geographies tab to select a region or study area.
Step 4: After adding to your selections you can proceed to close the ‘select geographies’ window. Your page should then look like the one pictured below. Click on the first table in the search results.
Step 5: Now click on the download button under the actions menu toward the middle of your screen. Select that you want to ‘use the data’ and make sure that you uncheck the box for ‘merge data and annotations into a single file’. This is important for being able to cleanly analyze the data later. Click the download button to get a zip file with the relevant data. The file you will want to use within the zip file is “DEC_10_SF1_P9.csv”
Step 6: Once you have the data open in Excel, you can clean it up to show only the data that you are likely to need. In most cases you will need only the following columns: GEO.display-label, D001, D002, D005, D006, D007, D008, D009, D010, D011
Deleting the rest of the columns and selecting wrap text on the second row should give you a sheet that looks like this:
Now we have our data in a format we can use to analyze any relevant Title VI information.
The below GIF covers all the steps outlined in the tutorial as well as how our P9 Tool (available here) formats the data in a way that is easier to analyze.
Selecting Geographies - Map Method
An alternative to selecting the geographic area from the list drop down menu above is to use the map selection within American Fact Finder.
Step 1: Under this method, the first two steps are the same as above but when we get to the select geographies window, click on the tab labeled ‘map.’
Step 2: To get down to the block group level, you will first need to zoom in on the area under review. Then click on the radio button labeled ‘more geography types’ and select block group from the list.
Step 3: To begin selecting block groups (or other geographies), the map tool provides various ways to outline an area on the map. The pointer marked by a dot can be used to select individual geographies, the square and circle will draw those shapes, and the polygon tool will allow you to outline an area however you see fit. The polygon tool may be particularly useful if you are trying to draw an outline that closely resembles the boundaries of your environmental study area.
Step 4: After finishing the shape, the map should automatically highlight the geographies you have selected and provide a list of those selections on the right. If you are satisfied with the geographies selected, you can click the button marked ‘add to your selections’ to add those geographies. You then need to close the window when you are done making your selections. At this point you should be returned to a screen similar to step 4 in the previous example and can follow the subsequent steps listed there.
Using Your Downloaded Excel File
The excel file you have downloaded and cleaned up contains only the raw totals of the populations surveyed. While this can be useful for analysis we often need to know the percentages of the populations within the geographic area. If you wish to calculate the percentages yourself, you can find many helpful articles detailing the process such as https://www.ablebits.com/office-addins-blog/2015/01/14/calculate-percentage-excel-formula/.
Additionally, the Office of Civil Rights has created an excel add-in to automatically calculate certain features from the P9 table provided by the US Census. This add-in will give you the total minority population (all populations except non-Hispanic white), percentages for all populations within the selected geographies, and a summary sheet displaying the information for the entire area you have selected for review.
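For readers who prefer to script the calculation, the following is a minimal Python sketch of the same arithmetic: total minority population (everyone except non-Hispanic White) and percentages for each block group, plus an area-wide summary. It is not the add-in's code. The column meanings are assumed from the 2010 Census P9 table layout and should be checked against the header row of your own download, and the two example rows are hypothetical.

```python
# Assumed meaning of the columns kept in Step 6 (verify against your file's header row).
P9_COLUMNS = {
    "D002": "Hispanic or Latino",
    "D005": "Not Hispanic: White alone",
    "D006": "Not Hispanic: Black or African American alone",
    "D007": "Not Hispanic: American Indian and Alaska Native alone",
    "D008": "Not Hispanic: Asian alone",
    "D009": "Not Hispanic: Native Hawaiian and Other Pacific Islander alone",
    "D010": "Not Hispanic: Some other race alone",
    "D011": "Not Hispanic: Two or more races",
}


def summarize_p9(row):
    """Return the minority count and percentages for one geography (one CSV row)."""
    total = int(row["D001"])                 # D001 is the total population
    minority = total - int(row["D005"])      # everyone except non-Hispanic White
    result = {"name": row["GEO.display-label"], "total": total, "minority": minority}
    if total == 0:                           # unpopulated block groups have no percentages
        result["pct_minority"] = None
        return result
    result["pct_minority"] = round(100 * minority / total, 1)
    for col in P9_COLUMNS:
        result["pct_" + col] = round(100 * int(row[col]) / total, 1)
    return result


def summarize_area(rows):
    """Percent minority for the whole area: sum the counts first, then divide."""
    total = sum(int(r["D001"]) for r in rows)
    minority = total - sum(int(r["D005"]) for r in rows)
    return round(100 * minority / total, 1) if total else None


# Hypothetical rows standing in for lines read from the downloaded CSV with csv.DictReader.
rows = [
    {"GEO.display-label": "Example Block Group 1", "D001": "1200", "D002": "300",
     "D005": "600", "D006": "200", "D007": "5", "D008": "60", "D009": "0",
     "D010": "5", "D011": "30"},
    {"GEO.display-label": "Example Block Group 2", "D001": "800", "D002": "80",
     "D005": "640", "D006": "40", "D007": "0", "D008": "24", "D009": "0",
     "D010": "0", "D011": "16"},
]

for row in rows:
    print(summarize_p9(row))
print("Area percent minority:", summarize_area(rows))
```

Note that the area-wide percentage is computed from summed counts rather than by averaging the block group percentages, since block groups differ in population.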
The add-in is available here (“P9 Table and LEP” Microsoft Excel Plugin). The video linked here, https://www.youtube.com/watch?v=reuU2zUsEPM, describes how to install this add-in in Excel.
What the add-in does: When you download the P9 table from American Fact Finder it looks like this.
After you have installed the add-in you can simply click a button to run it. In this case I have put it under P9 in the top right of my toolbar. Running it results in a file that looks like this and contains a summary of the entire area in a separate worksheet that you can click on at the bottom of your excel application.
Sources of Data - EJScreen
EJScreen, located at https://www.epa.gov/ejscreen, is a useful tool for looking at the demographics of a location. This is primarily useful for a ‘first look’ at the demographics of an area, and data should be taken from the Census itself when doing a more detailed analysis. One of the most useful features for our purposes is to be able to easily map data from the 2010 census showing minority populations. To do this load EJScreen in your browser and go to the top menu for “Add maps” and select the option “more demographics.”
This will bring up a window that looks like this:
At the top of this window you have three options for 2010-2014 ACS, 2010 Census, and 2000 census. As with the American Fact Finder section, in most cases we will want to use the 2010 census. Below is an explanation of the menu items seen in this window:
Category: Population will always be our selection here
Variable: This is the variable you wish to map, for example percent Hispanic or percent not Hispanic Asian alone. In most cases you will want to use percentages since doing so makes it easier to compare populations.
Method: There are three methods for displaying the data here. This will determine how your information is mapped.
Equal Intervals: Equal interval divides the range of values into equal-sized subranges. This allows you to specify the number of intervals, and EJScreen will automatically determine the class breaks based on the value range.
Quantiles: Each class contains an equal number of features. Assigns the same number of data values to each class. There are no empty classes or classes with too few or too many values. Since we are often dealing with minority populations this will mean the upper most class will have a large range of percentage values. This is the map that will most often give you the best representation for our purposes.
Note: One important thing to consider when using quantiles is that the calculation is based on the national data and is not specific to the area you are looking at.
For example, here is what the legends for equal interval and quantiles look like for the percentage Hispanic population (with 5 breaks chosen; see below).
Natural Breaks: This method will generally not serve our purposes since you can't compare the maps of two different demographic categories with this.
Breaks: This determines the number of subranges. For instance, if you choose 7 instead of 5 with the equal interval method, then each color will be assigned to ~14 percentage points instead of 20. The default of 5 works for most cases.
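To make the difference between the equal interval and quantile methods concrete, here is a minimal Python sketch of how class breaks could be computed under each method. This is an illustration of the two general techniques, not EJScreen's actual implementation, and the list of percentages is hypothetical.

```python
def equal_interval_breaks(low, high, classes):
    """Upper bound of each class when the value range is cut into equal widths."""
    width = (high - low) / classes
    return [low + width * i for i in range(1, classes + 1)]


def quantile_breaks(values, classes):
    """Upper bound of each class when every class holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th class ends at the value sitting i/classes of the way through the sorted list.
    return [ordered[(n * i) // classes - 1] for i in range(1, classes + 1)]


# Hypothetical percent-minority values for 20 block groups, skewed toward low values.
pct_minority = [1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 30, 45, 70, 95]

print(equal_interval_breaks(0, 100, 5))   # five classes of 20 percentage points each
print(equal_interval_breaks(0, 100, 7))   # seven classes of roughly 14 points each
print(quantile_breaks(pct_minority, 5))   # four values per class
```

With skewed data like this, the quantile method puts the same number of block groups in each class, so the top class covers a much wider span of percentages than the others.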
The map below shows the percentage of the non-Hispanic black population for the Washington DC area based on five quantiles:
This is done at the block group level. You can click on the individual block groups to gather the actual demographics for it. Next, we will go over some other features of EJScreen although they will be less commonly used than the demographic mapping shown above.
The tool's main feature is to provide mapping of several indices that might be useful for studying environmental justice considerations in a given area. This is how each item is calculated:
Demographic Index = (% low income + % minority) / 2
Supplementary Demographic Index = (% minority + % low income + % less than high school education + % linguistic isolation + % individuals under age 5 + % individuals over age 64) / 6
EJ Index = environmental indicator * (demographic index - US average demographic index) * block group population
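The three formulas above can be written out as a short Python sketch. The function and parameter names are illustrative, percentages are assumed to be expressed as fractions between 0 and 1, and every input value below (including the US average demographic index) is hypothetical.

```python
def demographic_index(pct_low_income, pct_minority):
    """Average of percent low income and percent minority."""
    return (pct_low_income + pct_minority) / 2


def supplementary_demographic_index(pct_minority, pct_low_income, pct_less_than_hs,
                                    pct_linguistic_isolation, pct_under_5, pct_over_64):
    """Average of the six demographic percentages."""
    return (pct_minority + pct_low_income + pct_less_than_hs
            + pct_linguistic_isolation + pct_under_5 + pct_over_64) / 6


def ej_index(environmental_indicator, demographic_idx, us_average_demographic_idx,
             block_group_population):
    """Environmental indicator weighted by how far the block group's demographic
    index sits above or below the US average, and by its population."""
    return (environmental_indicator
            * (demographic_idx - us_average_demographic_idx)
            * block_group_population)


# Hypothetical block group: 40% low income, 60% minority, 1,200 residents.
di = demographic_index(0.40, 0.60)
print("Demographic index:", di)
print("EJ index:", ej_index(environmental_indicator=10.0, demographic_idx=di,
                            us_average_demographic_idx=0.35,
                            block_group_population=1200))
```

A block group whose demographic index is below the US average produces a negative EJ index under this formula.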
Summary Reports display Indices, Environmental Indicators, and Demographic Indicators in chart form by percentile. Summary Reports should be based on the census block group. You can compare percentiles on a statewide, EPA regional, or national level. It is possible to draw squares and polygons and to produce circular buffers around specific points when developing these reports. However, reports for these drawn areas rely on GIS algorithms that are not provided in the technical documentation, which may make using them for our purposes difficult to justify. Using the tabular view and ‘Standard Report’ provides a more detailed breakdown, including raw numbers and actual percentages of populations within the block group.
Sources of Data – Other Maps
One way to display this more detailed information on race and ethnicity that practitioners should consider using is a dot map. These maps can visualize demographics related to several variables instead of the single variable visible in the more commonly used heat maps. By using census data and placing a dot for each individual or number of individuals of a certain race/ethnicity these can give a fuller picture of the minority communities in your area. Below are two examples of how this map can be done.
The New York Times – Mapping Segregation
This map creates a color-coded dot for a set number of people of each race/ethnicity within a census tract. It uses one dot for 40 people at its most zoomed in, up to several thousand people per dot when looking at the national map. The map can be found here.
University of Virginia – The Racial Dot Map
This map creates a color-coded dot for each individual within a census block based on their race/ethnicity. The map can be found here. Additionally the creators of the map have placed their source code on GitHub, here, which should allow recipients to create new maps in GIS to suit their particular needs. Examples of both maps are shown in the pages below.
This section covers some simple and basic ways to conduct an analysis of demographic data so you can determine whether a disparate impact is likely to result from a project or program you are reviewing. These are very basic tests and will only give indication of whether more research should be done to more accurately measure the impacts of the project or program.
How to Analyze Demographic Data – Ratios
The 4/5 test is established in case law and is a basic test we turn to frequently as an initial look at the impacts associated with a project or program. In addition to the example provided below, wikiHow also offers simple, straightforward instructions on how to conduct this test, available here. While sex is used in that example, the steps would be the same for looking at a particular protected class versus all other citizens in the population.
In the following example, we will go through how to perform a 4/5 test with demographic data related to a project.
The chart below shows the total population of eastern Anytown potentially affected by the project.
Title VI applies to both the benefits and the burdens of the project. For this example, we will look at those who are likely to be negatively impacted by the project as a result of such things as relocations, noise, etc. We will compare the groups negatively impacted by each alternative against the population of the area.
This is the population negatively impacted by alternative 1:
White: 500/3000 = 17%
Hispanic: 200/1000 = 20%
African-American: 600/1500 = 40%
100/500 = 20%
One way to compare groups is to look at the most impacted group compared to the least impacted group. In looking at alternative 1, we see that the most impacted group is African-Americans and the least impacted group is the majority White population. Therefore, we will divide 17 by 40 and see that we end up with a ratio that is less than 4/5 or 80%. This means that the selection of Alternative 1 will likely result in a disparate impact to African-Americans in Anytown.
17/40 = 42.5%
This approach may not work if it includes an outlier with a small number of people. Another way to compare populations is by looking at the most impacted group compared to all other races/ethnicities.
(1400-600) /(6000-1500) = 18%
18/40 = 45%
Again, we show a disparate impact to African-Americans in Anytown. Now, let's look at alternative 2.
The population negatively impacted by alternative 2:
White: 600/3000 = 20%
Hispanic: 230/1000 = 23%
African-American: 280/1500 = 19%
110/500 = 22%
19/23 = 82.6%
For alternative 2, the most impacted group is Hispanics and the least impacted group is African-Americans. Dividing 19 by 23 we get a ratio that is greater than 4/5. This means that there is likely not a disparate impact to a minority population in Anytown using alternative 2.
Now comparing the most impacted group to all other populations:
(1220-230)/(6000-1000) = 20%
20/23 = 87%
Again, we show that there is likely no disparate impact as a result of alternative 2.
You can also express these differences in terms of a risk ratio. This is the same as the 4/5 rule but flipped. A benefit of the risk ratio is that expressing differences this way can make it clearer to a reader. For instance if we go back and look at alternative 1 we would do the following:
Compared to the least impacted: 40/17 = 2.4
This would allow us to say: African Americans were 2.4 times as likely as White Americans to be negatively impacted by this alternative.
Compared to all other groups: 40/18 = 2.2
We could then say: African Americans were 2.2 times as likely as all other racial/ethnic groups to be negatively impacted by this alternative.
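The arithmetic in this example can be reproduced with a short Python sketch. The counts come from the example above; the group labels for the first three rows follow from the discussion of each alternative, while "Group 4" is a placeholder label for the fourth row, whose name is not given. The function names are illustrative. The sketch uses unrounded rates, so its ratios differ slightly from the figures above, which divide rounded percentages.

```python
population = {"White": 3000, "Hispanic": 1000, "African-American": 1500, "Group 4": 500}
alternative_1 = {"White": 500, "Hispanic": 200, "African-American": 600, "Group 4": 100}
alternative_2 = {"White": 600, "Hispanic": 230, "African-American": 280, "Group 4": 110}


def four_fifths_summary(impacted, population, threshold=0.8):
    """Compare the most impacted group with the least impacted group and with
    all other groups combined, and flag ratios below the 4/5 threshold."""
    rates = {group: impacted[group] / population[group] for group in population}
    most = max(rates, key=rates.get)
    least = min(rates, key=rates.get)
    # Impact rate for everyone outside the most impacted group.
    others_rate = ((sum(impacted.values()) - impacted[most])
                   / (sum(population.values()) - population[most]))
    ratio_least = rates[least] / rates[most]
    ratio_others = others_rate / rates[most]
    return {
        "most_impacted": most,
        "least_impacted": least,
        "ratio_vs_least": ratio_least,
        "ratio_vs_others": ratio_others,
        "risk_ratio_vs_least": 1 / ratio_least,    # the same comparison, flipped
        "risk_ratio_vs_others": 1 / ratio_others,
        "likely_disparate_impact": ratio_least < threshold or ratio_others < threshold,
    }


for name, impacted in [("Alternative 1", alternative_1), ("Alternative 2", alternative_2)]:
    print(name, four_fifths_summary(impacted, population))
```

As in the worked example, alternative 1 falls well below the 4/5 threshold for African-Americans, while alternative 2 stays above it on both comparisons.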
How to Analyze Demographic Data – Simple Linear Regression
In this next section, we will provide a very simple method for conducting a slightly more advanced analysis. The example below will show how to take some information about a program and determine if there is a relationship between demographic variables and a specific outcome.
In your research, you are provided with the following information that shows how much money per capita each county received from the program. You then go to American Fact Finder and get the Percent Hispanic for each county:
[Table: Percent Hispanic and Funding Per Capita for each county]
Now we will use Excel's charting features to visually display this information as well as provide us with a simple linear regression line. First highlight both columns containing the data related to your project. Then select the ‘insert’ tab from the Excel menu. You should then go to the charts in the center of that menu. Then choose the scatter plots located in the middle of the third row of charts and select the first chart on the top left.
You should end up with a plot showing the funding per capita on the y (vertical) axis and percent Hispanic population on the x (horizontal) axis. It should look like this:
Already we can begin to see the relationship (or lack thereof) between a county's Hispanic population and how much money they received from this program. To get a line representing that relationship, make sure you have the chart selected and are in the ‘design’ menu. From there select ‘quick layout’ on the left-hand side and choose the third choice on the third row (with the fx symbol in it). This will give you a linear regression line, a formula for that line, and R squared value.
You can click and drag the equation over so that it is out of the way and you should end up with something like this:
The R squared value measures how tightly the observed data points fit the regression line drawn in the chart. An R^2 value closer to 1 means a tighter fit and a stronger relationship, potentially indicating a disparate impact in this particular example, whereas an R^2 closer to 0 means a weak fit and little relationship between the two variables, indicating that disparate impact is not likely in this example. As we can see above, our R^2 value is 0.015, which is very close to zero, and the graph alone shows that there is not a strong relationship between a county's Hispanic population and how much money it received from the program. For a counterexample, if our chart looked like the one below, a disparate impact would be much more likely and the relationship between the two variables can be easily observed:
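Outside of Excel, the same regression line and R squared value can be computed with a few lines of Python. This is a minimal sketch of ordinary least squares for one predictor. The county figures below are hypothetical stand-ins, since they are not the data from the example, and the function name is illustrative.

```python
def linear_fit(x, y):
    """Ordinary least squares for one predictor: returns slope, intercept, R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (unexplained variation / total variation in y).
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_residual / ss_total
    return slope, intercept, r_squared


# Hypothetical counties: percent Hispanic (x) and program funding per capita (y).
percent_hispanic = [2, 5, 8, 12, 20, 35]
funding_per_capita = [48, 52, 47, 55, 50, 49]

slope, intercept, r_squared = linear_fit(percent_hispanic, funding_per_capita)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

As with the Excel chart, an R squared near 0 suggests little linear relationship between the two variables, and a value near 1 suggests a strong one that would warrant further research.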
In some cases, you may want to review a statewide program that has hundreds or thousands of related projects. It would be time consuming and wasteful to try and review every single one of these projects. We can then turn to reviewing a random sample of the projects. By conducting a random sample, you can review only a fraction of the projects within that program and still be able to say something meaningful about the program. Below we will demonstrate how to obtain a random sample using an online calculator and excel.
In the example that follows, we will assume a population of 1500 projects. This sheet shows how to randomly select a sample to represent these 1500 projects.
One easy way to determine the sample size you need is with an online calculator such as http://www.raosoft.com/samplesize.html, as shown below:
In this example, we will use parameters with the same established statistical criteria FHWA uses for the agency's Compliance Assessment Program reviews:
- 90% Confidence Level – the amount of uncertainty you can tolerate.
- 10% Margin of Error – the amount of error you can tolerate.
- 50% Response Distribution – the expected response to the questions.
Using the above criteria and a population of 1500 projects, our recommended sample size is 65 projects. This is the number of projects that need to be reviewed to represent the example population.
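If you would rather compute the sample size directly, the sketch below uses a common sample size formula with a finite population correction. It is an assumption that the online calculator uses this same formula, although the sketch does return 65 for the parameters above. The function name and defaults are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.90, margin_of_error=0.10,
                response_distribution=0.50):
    """Sample size for estimating a proportion in a finite population."""
    # Two-sided critical value, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    x = z ** 2 * response_distribution * (1 - response_distribution)
    n = population * x / ((population - 1) * margin_of_error ** 2 + x)
    return math.ceil(n)   # round up so the margin of error is not exceeded


print(sample_size(1500))   # the CAP parameters used in this example
```

Tightening the margin of error or raising the confidence level increases the required sample size, so the parameters should be chosen before the sample is drawn.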
The Office of Civil Rights also provides an excel calculator with the CAP criteria locked in if you always wish to use those parameters, available here.
Once the sample size has been determined, you will need to randomly select the 65 projects. Online calculators are available; however, Microsoft Excel provides a simple way to randomly select items.
- In Excel, create a spreadsheet with one “Project Number” column, with one line for each of our 1500 projects. Note: an existing spreadsheet or table of projects can be used.
- Next, add a blank column with the label “Random.”
- In the first cell of the Random column, insert the formula =RAND(), and copy it to every cell in the Random column; you will see numbers appear in these cells.
- Next, sort the Random column in ascending order.
- Finally, select the first 65 to look at as part of your review.
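The steps above can also be sketched in Python. This is an equivalent way to draw the sample, not a replacement for the Excel method. The project numbers 1 through 1500 are hypothetical stand-ins for your own project list, and the function name is illustrative.

```python
import random

# Hypothetical project numbers; in practice, load these from your project spreadsheet.
project_numbers = list(range(1, 1501))


def select_sample(projects, sample_size, seed=None):
    """Randomly select sample_size projects without replacement.

    Passing a seed makes the selection repeatable, which helps if the
    sample needs to be documented or reproduced later.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(projects, sample_size))


selected = select_sample(project_numbers, 65, seed=2024)
print(selected)
```

Each project has the same chance of being selected, which is the same property the sort-by-random-number approach in Excel provides.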