R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Introduction

Data was web scraped from the state websites to combine the GIS information like FIPS, latitude, longitude, and County Codes by both number and Mailing Number.

Why?

The final data set in this project was created to be used in my Capstone Project for the Google Data Analytics Professional Certificate. The main dataset that I will be basing the analysis off of in the capstone project only had the county names and county codes by mailing addresses associated with the dataset. So in order to answer some of the questions in regards to the relationship of the data and their respective geographical locations I needed to retrieve some GIS data based on county BUT also needed to Join them together to then later join with the main dataset in the Case Study.

Data Sources

GIS Data with FIPS, Lat and Long

Texas Counties Centroid Map

County Codes referring to Mailing Addresses

Texas Workforce Commission Tax County Codes

Packages Used

  • Tidyverse
  • rvest
  • XML

Beginning the Web Scrape Process

Converting URL to HTML

page <- read_html("https://www.twc.texas.gov/tax-county-codes#countyCodesForEmployersQuarterlyReport")

Creating the tables as a variable in R

tables <- page %>% html_table(fill=TRUE)

Preview the data to confirm extraction

glimpse(tables)
## List of 4
##  $ : tibble [82 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ County  : chr [1:82] "Anderson" "Andrews" "Angelina" "Aransas" ...
##   ..$ Tax Code: int [1:82] 1 3 5 7 9 11 13 15 17 19 ...
##  $ : tibble [71 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ County  : chr [1:71] "Gaines" "Galveston" "Garza" "Gillespie" ...
##   ..$ Tax Code: int [1:71] 165 167 169 171 173 175 177 179 181 183 ...
##  $ : tibble [48 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ County  : chr [1:48] "McCulloch" "McLennan" "McMullen" "Madison" ...
##   ..$ Tax Code: int [1:48] 307 309 311 313 315 317 319 321 323 325 ...
##  $ : tibble [53 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ County  : chr [1:53] "Sabine" "San Augustine" "San Jacinto" "San Patricio" ...
##   ..$ Tax Code: int [1:53] 403 405 407 409 411 413 415 417 419 421 ...

Since the data was separated into 4 tables alphabetically, the following chunk separates the tables into their own variables to later aggregate them into one.

countiesAF <- tables[[1]]
countiesGL <- tables[[2]]
countiesMR <- tables[[3]]
countiesSZ <- tables[[4]]

Now we could merge the data to create one data frame using the bind function.

countymailingcodes <- rbind(countiesAF, countiesGL, countiesMR, countiesSZ)

To confirm that it was put into one data frame we can do a glimpse function to preview the number of rows.

glimpse(countymailingcodes)
## Rows: 254
## Columns: 2
## $ County     <chr> "Anderson", "Andrews", "Angelina", "Aransas", "Archer", "Ar…
## $ `Tax Code` <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, …

End Web scrape Process

The following website had the option to simply download a .csv file of displayed data.

Texas Counties Centroid Map

Then we could import the CSV using the read function and set as variable countyGISdata.

countyGISdata <- read.csv("Texas_Counties_Centroid_Map.csv")

Joining The Data

A full Join can be performed for the two tables based on County Name since we want all the data in each table to be present.

TexasCountyGISData <- full_join(countymailingcodes, countyGISdata, by=c("County" = "CNTY_NM"))

Preview the Joined dataframe.

glimpse(TexasCountyGISData)
## Rows: 258
## Columns: 9
## $ County                   <chr> "Anderson", "Andrews", "Angelina", "Aransas",…
## $ `Tax Code`               <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25…
## $ X..Lat.                  <dbl> 31.81322, 32.30506, 31.25476, 28.12710, 33.61…
## $ Y..Long.                 <dbl> -95.65252, -102.63788, -94.61174, -96.99259, …
## $ CNTY_NBR                 <int> 1, 2, 3, 4, 5, 6, 7, NA, 9, 10, 11, 12, 13, 1…
## $ FIPS                     <chr> "48,001", "48,003", "48,005", "48,007", "48,0…
## $ Shape_Leng               <dbl> 3.709381, 2.579602, 3.432799, 6.125350, 1.941…
## $ Shape_Area               <dbl> 0.26599196, 0.37226948, 0.21205691, 0.0671183…
## $ County.Centroid.Location <chr> "POINT (-95.65251774 31.81321543)", "POINT (-…

Since there are only 254 Counties in Texas I exported TexasCountyGISData to excel file as .csv for some exploratory analysis.

write.table(TexasCountyGISData, file="Texas_County_GIS_Data.csv", row.names=F, sep=",")

After viewing the joined table in excel, their were formatting issues with the following county names which resulted in “NA” values after the join above.

To keep the formatting the same I edited in excel to exclude “County” in the names and left Dewitt as one word, then saved it as Texas_Counties_GIS_Data_Updated.

Austin County -> Austin
DeWitt -> De Witt
Houston County -> Houston
Tyler County -> Tyler

Imported the .csv back to view in Rstudio

UpdatedTexasCountyGISData <- read.csv("Texas_Counties_GIS_Data_Updated.csv")

View the updated and joined dataset info

glimpse(UpdatedTexasCountyGISData)
## Rows: 254
## Columns: 9
## $ County                   <chr> "Anderson", "Andrews", "Angelina", "Aransas",…
## $ Tax.Code                 <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25…
## $ X..Lat.                  <dbl> 31.81322, 32.30506, 31.25476, 28.12710, 33.61…
## $ Y..Long.                 <dbl> -95.65252, -102.63788, -94.61174, -96.99259, …
## $ CNTY_NBR                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ FIPS                     <chr> "48,001", "48,003", "48,005", "48,007", "48,0…
## $ Shape_Leng               <dbl> 3.709381, 2.579602, 3.432799, 6.125350, 1.941…
## $ Shape_Area               <dbl> 0.26599196, 0.37226948, 0.21205691, 0.0671183…
## $ County.Centroid.Location <chr> "POINT (-95.65251774 31.81321543)", "POINT (-…

I want to rearrange the columns and rename them for uniformity in the final exported version.

colnames(UpdatedTexasCountyGISData)
## [1] "County"                   "Tax.Code"                
## [3] "X..Lat."                  "Y..Long."                
## [5] "CNTY_NBR"                 "FIPS"                    
## [7] "Shape_Leng"               "Shape_Area"              
## [9] "County.Centroid.Location"

Creating the final table, rearranging and rename the columns for uniformity in one code chunk using the pipe function.

TexasCountyGISDataFinal <- UpdatedTexasCountyGISData %>%
  
  select(County, Tax.Code, CNTY_NBR, FIPS, X..Lat., Y..Long., County.Centroid.Location, Shape_Leng, Shape_Area) %>%
  
  rename(Code_Mailing=Tax.Code, Code_Number=CNTY_NBR, X_Latitude=X..Lat., Y_Longitude=Y..Long., Centroid_Location=County.Centroid.Location, Shape_Length=Shape_Leng)
glimpse(TexasCountyGISDataFinal)
## Rows: 254
## Columns: 9
## $ County            <chr> "Anderson", "Andrews", "Angelina", "Aransas", "Arche…
## $ Code_Mailing      <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 2…
## $ Code_Number       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ FIPS              <chr> "48,001", "48,003", "48,005", "48,007", "48,009", "4…
## $ X_Latitude        <dbl> 31.81322, 32.30506, 31.25476, 28.12710, 33.61539, 34…
## $ Y_Longitude       <dbl> -95.65252, -102.63788, -94.61174, -96.99259, -98.687…
## $ Centroid_Location <chr> "POINT (-95.65251774 31.81321543)", "POINT (-102.637…
## $ Shape_Length      <dbl> 3.709381, 2.579602, 3.432799, 6.125350, 1.941220, 1.…
## $ Shape_Area        <dbl> 0.26599196, 0.37226948, 0.21205691, 0.06711832, 0.23…

Now we export to final .CSV file

write.table(TexasCountyGISDataFinal, file="Texas_Counties_GIS_Data_Final.csv", row.names=F, sep=",")

Conclusion

I now have the final version of the dataset combined from both websites and ready for use in my Case Study.

I could have just edited the names and rearranged them in excel but I thought it would be more fun to come back into Rstudio and play around with some of the functions to learn a bit more.

I converted the .csv file to a google spreadsheet to make it accessible via the link below for those that are interested in making a copy into their own drive.

Google Drive

Texas Counties GIS Data

I also posted it on kaggle.com and made it public.

Kaggle

Texas GIS Data By County

Feel free to follow my Github or LinkedIn Profile to see what I end up doing with this Dataset.