This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Data was web scraped from the state websites to combine the GIS information like FIPS, latitude, longitude, and County Codes by both number and Mailing Number.
The final data set in this project was created to be used in my Capstone Project for the Google Data Analytics Professional Certificate. The main dataset that I will be basing the analysis off of in the capstone project only had the county names and county codes by mailing addresses associated with the dataset. So in order to answer some of the questions in regards to the relationship of the data and their respective geographical locations I needed to retrieve some GIS data based on county BUT also needed to Join them together to then later join with the main dataset in the Case Study.
GIS Data with FIPS, Lat and Long
County Codes referring to Mailing Addresses
Texas Workforce Commission Tax County Codes
Converting URL to HTML
page <- read_html("https://www.twc.texas.gov/tax-county-codes#countyCodesForEmployersQuarterlyReport")
Creating the tables as a variable in R
tables <- page %>% html_table(fill=TRUE)
Preview the data to confirm extraction
glimpse(tables)
## List of 4
## $ : tibble [82 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ County : chr [1:82] "Anderson" "Andrews" "Angelina" "Aransas" ...
## ..$ Tax Code: int [1:82] 1 3 5 7 9 11 13 15 17 19 ...
## $ : tibble [71 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ County : chr [1:71] "Gaines" "Galveston" "Garza" "Gillespie" ...
## ..$ Tax Code: int [1:71] 165 167 169 171 173 175 177 179 181 183 ...
## $ : tibble [48 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ County : chr [1:48] "McCulloch" "McLennan" "McMullen" "Madison" ...
## ..$ Tax Code: int [1:48] 307 309 311 313 315 317 319 321 323 325 ...
## $ : tibble [53 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ County : chr [1:53] "Sabine" "San Augustine" "San Jacinto" "San Patricio" ...
## ..$ Tax Code: int [1:53] 403 405 407 409 411 413 415 417 419 421 ...
Since the data was separated into 4 tables alphabetically, the following chunk separates the tables into their own variables to later aggregate them into one.
countiesAF <- tables[[1]]
countiesGL <- tables[[2]]
countiesMR <- tables[[3]]
countiesSZ <- tables[[4]]
Now we could merge the data to create one data frame using the bind function.
countymailingcodes <- rbind(countiesAF, countiesGL, countiesMR, countiesSZ)
To confirm that it was put into one data frame we can do a glimpse function to preview the number of rows.
glimpse(countymailingcodes)
## Rows: 254
## Columns: 2
## $ County <chr> "Anderson", "Andrews", "Angelina", "Aransas", "Archer", "Ar…
## $ `Tax Code` <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, …
The following website had the option to simply download a .csv file of displayed data.
Then we could import the CSV using the read function and set as variable countyGISdata.
countyGISdata <- read.csv("Texas_Counties_Centroid_Map.csv")
A full Join can be performed for the two tables based on County Name since we want all the data in each table to be present.
TexasCountyGISData <- full_join(countymailingcodes, countyGISdata, by=c("County" = "CNTY_NM"))
Preview the Joined dataframe.
glimpse(TexasCountyGISData)
## Rows: 258
## Columns: 9
## $ County <chr> "Anderson", "Andrews", "Angelina", "Aransas",…
## $ `Tax Code` <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25…
## $ X..Lat. <dbl> 31.81322, 32.30506, 31.25476, 28.12710, 33.61…
## $ Y..Long. <dbl> -95.65252, -102.63788, -94.61174, -96.99259, …
## $ CNTY_NBR <int> 1, 2, 3, 4, 5, 6, 7, NA, 9, 10, 11, 12, 13, 1…
## $ FIPS <chr> "48,001", "48,003", "48,005", "48,007", "48,0…
## $ Shape_Leng <dbl> 3.709381, 2.579602, 3.432799, 6.125350, 1.941…
## $ Shape_Area <dbl> 0.26599196, 0.37226948, 0.21205691, 0.0671183…
## $ County.Centroid.Location <chr> "POINT (-95.65251774 31.81321543)", "POINT (-…
Since there are only 254 Counties in Texas I exported TexasCountyGISData to excel file as .csv for some exploratory analysis.
write.table(TexasCountyGISData, file="Texas_County_GIS_Data.csv", row.names=F, sep=",")
After viewing the joined table in excel, their were formatting issues with the following county names which resulted in “NA” values after the join above.
To keep the formatting the same I edited in excel to exclude “County” in the names and left Dewitt as one word, then saved it as Texas_Counties_GIS_Data_Updated.
Imported the .csv back to view in Rstudio
UpdatedTexasCountyGISData <- read.csv("Texas_Counties_GIS_Data_Updated.csv")
View the updated and joined dataset info
glimpse(UpdatedTexasCountyGISData)
## Rows: 254
## Columns: 9
## $ County <chr> "Anderson", "Andrews", "Angelina", "Aransas",…
## $ Tax.Code <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25…
## $ X..Lat. <dbl> 31.81322, 32.30506, 31.25476, 28.12710, 33.61…
## $ Y..Long. <dbl> -95.65252, -102.63788, -94.61174, -96.99259, …
## $ CNTY_NBR <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ FIPS <chr> "48,001", "48,003", "48,005", "48,007", "48,0…
## $ Shape_Leng <dbl> 3.709381, 2.579602, 3.432799, 6.125350, 1.941…
## $ Shape_Area <dbl> 0.26599196, 0.37226948, 0.21205691, 0.0671183…
## $ County.Centroid.Location <chr> "POINT (-95.65251774 31.81321543)", "POINT (-…
I want to rearrange the columns and rename them for uniformity in the final exported version.
colnames(UpdatedTexasCountyGISData)
## [1] "County" "Tax.Code"
## [3] "X..Lat." "Y..Long."
## [5] "CNTY_NBR" "FIPS"
## [7] "Shape_Leng" "Shape_Area"
## [9] "County.Centroid.Location"
Creating the final table, rearranging and rename the columns for uniformity in one code chunk using the pipe function.
TexasCountyGISDataFinal <- UpdatedTexasCountyGISData %>%
select(County, Tax.Code, CNTY_NBR, FIPS, X..Lat., Y..Long., County.Centroid.Location, Shape_Leng, Shape_Area) %>%
rename(Code_Mailing=Tax.Code, Code_Number=CNTY_NBR, X_Latitude=X..Lat., Y_Longitude=Y..Long., Centroid_Location=County.Centroid.Location, Shape_Length=Shape_Leng)
glimpse(TexasCountyGISDataFinal)
## Rows: 254
## Columns: 9
## $ County <chr> "Anderson", "Andrews", "Angelina", "Aransas", "Arche…
## $ Code_Mailing <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 2…
## $ Code_Number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ FIPS <chr> "48,001", "48,003", "48,005", "48,007", "48,009", "4…
## $ X_Latitude <dbl> 31.81322, 32.30506, 31.25476, 28.12710, 33.61539, 34…
## $ Y_Longitude <dbl> -95.65252, -102.63788, -94.61174, -96.99259, -98.687…
## $ Centroid_Location <chr> "POINT (-95.65251774 31.81321543)", "POINT (-102.637…
## $ Shape_Length <dbl> 3.709381, 2.579602, 3.432799, 6.125350, 1.941220, 1.…
## $ Shape_Area <dbl> 0.26599196, 0.37226948, 0.21205691, 0.06711832, 0.23…
Now we export to final .CSV file
write.table(TexasCountyGISDataFinal, file="Texas_Counties_GIS_Data_Final.csv", row.names=F, sep=",")
I now have the final version of the dataset combined from both websites and ready for use in my Case Study.
I could have just edited the names and rearranged them in excel but I thought it would be more fun to come back into Rstudio and play around with some of the functions to learn a bit more.
I converted the .csv file to a google spreadsheet to make it accessible via the link below for those that are interested in making a copy into their own drive.
Feel free to follow my Github or LinkedIn Profile to see what I end up doing with this Dataset.