install.packages("tidyverse")4 Análisis Exploratorio de datos
R ha evolucionado en los últimos años de forma que podría ser actualmente considerado un lengujae nuevo. Consideramos que la mejor forma de hacer un análisis exploratorio es usando la nueva filosofía de trabajo que se está imponiendo en el universo R desde la aparición de tidyverse.
tidyverse constituye un nuevo paradigma en el análisis de los datos con R. Es realmente un meta-package, es decir, un contenedor de una serie de librerías de R que trabajan siguiendo el mismo esquema.
Usa verbos como nombre de las funciones que expresan las acciones que realizan sobre un dataset.
Y emplea el operador pipe %>%, que permite concatenar acciones de forma más legible que con el R base.
Así pues, si no lo tenemos instalado, debemos hacerlo mediante:
Y a continuación debemos cargarlo en memoria como:
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
4.1 Pipes
%>%utiliza la salida del término que hay a la izquierda del símbolo%>%como primer argumento de la función que está a la derecha de dicho símbolo.x %>% f(y)es igual que hacerf(x, y)
4.2 dplyr
Dentro del paquete tidyverse está otro paquete llamado dplyr.
Si hemos cargado tidyverse no hace falta cargar dplyr.
dplyris a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
5 funciones clave de dplyr:
- Seleccionar variables por su nombre (
select()). - Escoger observaciones (filas) según sus valores (
filter()) - filtrar, consultar las filas según condiciones, etc. - Reordenar las filas (
arrange()). - Añadir nuevas variables como función de variables ya existentes (
mutate()). - Encontrar valores representativos de cada variable (
summarise()).
Estas se combinan con: group_by() which allows you to perform any operation “by group”.
Todos estos verbos funcionan de la misma manera:
- El primer argumento es un dataframe.
- Los demás argumentos describen qué hacer con el dataframe, usando los nombres de las variables (columnas) sin necesidad de utilizar comillas.
- El resultado es un nuevo dataframe.
4.3 Explorando un dataset
Dataset: Cyber Security Breaches Data Kaggle
Ofrecer una introducción al lenguaje R usando su más moderna versión.
Extraer conocimiento desde este dataset real Cyber Security Breaches Data usando Tidyverse.
Responderemos a preguntas con la exploración que realicemos del dataset.
Aprenderemos aspectos de dificultad creciente en el uso del Lenguaje R - Exploración de datasets - Análisis de Datos.
Cuestiones a responder:
¿Cuales son las primeras/últimas filas del dataset?
¿Qué datos tenemos almacenados en el dataset?
¿Cuántos incidentes (
Type_of_Breach) distintos hay cubiertos en el dataset- ?¿Hay valores desconocidos (NA) en el dataset?
¿Hay valores nulos (0) en el dataset en alguna columna? Encuentra algunas
¿Cuántas entidades (
Name_of_Covered_Entity) hay almacenadas en el último año de los que aparecen en el dataset?Mostrar todas las filas del dataset con más de 1000 individuos afectados.
Mostrar solo las entidades y el tipo de incidente con más de 1000 individuos afectados (
Individuals_Affected).Mostrar todas las filas del dataset con más de 100 individuos afectados en el último año (
year) de los que aparecen en el dataset .¿Cuáles son las 10 incidentes (
Type_of_Breach) con más individuos afectados?¿Qué porcentaje de compañías afectadas aparecen en el dataset (del total de filas)?
Calcular la media de infividuos afectados. ¿Cuantos incidentes hay con más individuos afectados de esta media calculada? Mostrarlos.
¿Cual son las entidades e incidentes con más y con menos individuos afectados (mostrar además la fecha y la localización?
¿Cuáles son las 3 informes (
Summary) más reportados en el dataset?
4.4 Importar datos
Pasamos a importar los dato. Rstudio nos ayuda. Nota: Import Dataset en el interfaz de Rstudio.
Cyber_Security_Breaches <- read.csv("data/Cyber Security Breaches.csv")
# Para quitar la primera columna:
Cyber_Security_Breaches$X <- NULL4.5 ¿Qué hay en el dataset?
¿Cuales son las primeras/últimas filas del dataset?
Les echamos un vistazo:
head(Cyber_Security_Breaches) Number Name_of_Covered_Entity State
1 0 Brooke Army Medical Center TX
2 1 Mid America Kidney Stone Association, LLC MO
3 2 Alaska Department of Health and Social Services AK
4 3 Health Services for Children with Special Needs, Inc. DC
5 4 L. Douglas Carlson, M.D. CA
6 5 David I. Cohen, MD CA
Business_Associate_Involved Individuals_Affected Date_of_Breach
1 1000 10/16/2009
2 1000 9/22/2009
3 501 10/12/2009
4 3800 10/9/2009
5 5257 9/27/2009
6 857 9/27/2009
Type_of_Breach Location_of_Breached_Information Date_Posted_or_Updated
1 Theft Paper 2014-06-30
2 Theft Network Server 2014-05-30
3 Theft Other Portable Electronic Device, Other 2014-01-23
4 Loss Laptop 2014-01-23
5 Theft Desktop Computer 2014-01-23
6 Theft Desktop Computer 2014-01-23
Summary
1 A binder containing the protected health information (PHI) of up to 1,272 individuals was stolen from a staff member's vehicle. The PHI included names, telephone numbers, detailed treatment notes, and possibly social security numbers. In response to the breach, the covered entity (CE) sanctioned the workforce member and developed a new policy requiring on-call staff members to submit any information created during their shifts to the main office instead of adding it to the binder. Following OCR's investigation, the CE notified the local media about the breach.
2 Five desktop computers containing unencrypted electronic protected health information (e-PHI) were stolen from the covered entity (CE). Originally, the CE reported that over 500 persons were involved, but subsequent investigation showed that about 260 persons were involved. The ePHI included demographic and financial information. The CE provided breach notification to affected individuals and HHS. Following the breach, the CE improved physical security by installing motion detectors and alarm systems security monitoring. It improved technical safeguards by installing enhanced antivirus and encryption software. As a result of OCR's investigation the CE updated its computer password policy.
3
4 A laptop was lost by an employee while in transit on public transportation. The computer contained the protected health information of 3800 individuals. The protected health information involved in the breach included names, Medicaid ID numbers, dates of birth, and primary physicians. In response to this incident, the covered entity took steps to enforce the requirements of the Privacy & Security Rules. The covered entity has installed encryption software on all employee computers, strengthened access controls including passwords, reviewed and updated security policies and procedures, and updated it risk assessment. In addition, all employees received additional security training. \n\n
5 A shared Computer that was used for backup was stolen on 9/27/09 from the reception desk area of the covered entity. The Computer contained certain electronic protected health information (ePHI) of 5,257 individuals who were patients of the CE. The ePHI involved in the breach included names, dates of birth, and clinical information, but there were no social security numbers, financial information, addresses, phone numbers, or other ePHI in any of the reports on the disks or the hard drive on the stolen Computer. Following the breach, the covered entity notified all 5,257 affected individuals and the appropriate media; added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer; added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet; and added administrative safeguards by requiring annual refresher retraining of CE staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules.\n\n
6 A shared Computer that was used for backup was stolen from the reception desk area, behind a locked desk area, probably while a cleaning crew had left the main door to the building open and the door to the suite was unlocked and perhaps ajar. The Computer contained certain electronic protected health information (ePHI) of 857 patients. The ePHI involved in the breach included names, dates of birth, and clinical information. Following the breach, the covered entity notified all affected individuals and the media, added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer, added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet, and added administrative safeguards by requiring annual refresher retraining staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules, which has already taken place.\n\n
breach_start breach_end year
1 2009-10-16 <NA> 2009
2 2009-09-22 <NA> 2009
3 2009-10-12 <NA> 2009
4 2009-10-09 <NA> 2009
5 2009-09-27 <NA> 2009
6 2009-09-27 <NA> 2009
tail(Cyber_Security_Breaches) Number Name_of_Covered_Entity State
1050 1049 St. Francis Hospital GA
1051 1050 Puerto Rico Health Insurance PR
1052 1051 Hospitalists of Brandon, LLC FL
1053 1052 Santa Rosa Memorial Hospital CA
1054 1053 Group Health Plan of Hurley Medical Center MI
1055 1054 Abrham Tekola, M.D.,INC CA
Business_Associate_Involved Individuals_Affected Date_of_Breach
1050 1175 5/30/2014
1051 American Health Inc 28413 9/20/2013
1052 Doctors First Choice Billings, Inc. 1831 2/11/2014
1053 33702 6/2/2014
1054 2289 5/13/2014
1055 5471 5/27/2014
Type_of_Breach Location_of_Breached_Information
1050 Other E-mail
1051 Theft Other
1052 Hacking/IT Incident Other
1053 Theft, Loss Other Portable Electronic Device
1054 Unauthorized Access/Disclosure E-mail
1055 Theft Desktop Computer
Date_Posted_or_Updated Summary breach_start breach_end year
1050 2014-06-18 2014-05-30 <NA> 2014
1051 2014-06-27 2013-09-20 <NA> 2013
1052 2014-06-27 2014-02-11 <NA> 2014
1053 2014-06-27 2014-06-02 <NA> 2014
1054 2014-06-27 2014-05-13 <NA> 2014
1055 2014-06-27 2014-05-27 <NA> 2014
¿Qué datos tenemos almacenados en el dataset?
Resumen estadístico de las variables del dataset
summary(Cyber_Security_Breaches) Number Name_of_Covered_Entity State
Min. : 0.0 Length:1055 Length:1055
1st Qu.: 263.5 Class :character Class :character
Median : 527.0 Mode :character Mode :character
Mean : 527.0
3rd Qu.: 790.5
Max. :1054.0
Business_Associate_Involved Individuals_Affected Date_of_Breach
Length:1055 Min. : 500 Length:1055
Class :character 1st Qu.: 1000 Class :character
Mode :character Median : 2300 Mode :character
Mean : 30262
3rd Qu.: 6941
Max. :4900000
Type_of_Breach Location_of_Breached_Information Date_Posted_or_Updated
Length:1055 Length:1055 Length:1055
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Summary breach_start breach_end year
Length:1055 Length:1055 Length:1055 Min. :1997
Class :character Class :character Class :character 1st Qu.:2010
Mode :character Mode :character Mode :character Median :2012
Mean :2011
3rd Qu.:2013
Max. :2014
¿Cuántos incidentes (
Type_of_Breach) distintos hay cubiertos en el dataset?
- mostramos los primeros
- pero ordenados…
- quitamos repetidos
- calculamos cuantos son
head(Cyber_Security_Breaches$Type_of_Breach)[1] "Theft" "Theft" "Theft" "Loss" "Theft" "Theft"
# solo muestro los primeros
head(sort(Cyber_Security_Breaches$Type_of_Breach)) [1] "Hacking/IT Incident" "Hacking/IT Incident" "Hacking/IT Incident"
[4] "Hacking/IT Incident" "Hacking/IT Incident" "Hacking/IT Incident"
amenazas <- unique(Cyber_Security_Breaches$Type_of_Breach)
amenazas [1] "Theft"
[2] "Loss"
[3] "Other"
[4] "Hacking/IT Incident"
[5] "Unauthorized Access/Disclosure "
[6] "Theft, Loss"
[7] "Improper Disposal"
[8] "Loss, Improper Disposal"
[9] "Theft, Other"
[10] "Loss, Other"
[11] "Unauthorized Access/Disclosure"
[12] "Unauthorized Access/Disclosure, Hacking/IT Incident"
[13] "Theft, Loss, Improper Disposal"
[14] "Theft, Unauthorized Access/Disclosure, Hacking/IT Incident"
[15] "Unknown"
[16] "Theft, Unauthorized Access/Disclosure"
[17] "Unauthorized Access/Disclosure, Other"
[18] "Hacking/IT Incident, Other"
[19] "Unknown, Other"
[20] "Loss, Unknown"
[21] "Loss, Unauthorized Access/Disclosure, Unknown"
[22] "Theft, Loss, Unauthorized Access/Disclosure, Unknown"
[23] "Unauthorized Access/Disclosure, Hacking/IT Incident, Other"
[24] "Theft, Hacking/IT Incident"
[25] "Theft, Loss, Other"
[26] "Theft, Unauthorized Access/Disclosure, Other"
[27] "Theft, Improper Disposal, Unauthorized Access/Disclosure"
[28] "Improper Disposal, Unauthorized Access/Disclosure"
[29] "Loss, Unauthorized Access/Disclosure"
length(amenazas)[1] 29
la variable amenazas es un vector
la variable Cyber_Security_Breaches es un data.frame
4.5.1 Información en el dataset
See https://www.kaggle.com/alukosayoenoch/cyber-security-breaches-data:
nombres <- colnames(Cyber_Security_Breaches)
nombres [1] "Number" "Name_of_Covered_Entity"
[3] "State" "Business_Associate_Involved"
[5] "Individuals_Affected" "Date_of_Breach"
[7] "Type_of_Breach" "Location_of_Breached_Information"
[9] "Date_Posted_or_Updated" "Summary"
[11] "breach_start" "breach_end"
[13] "year"
Cogemos los valores de la columna año y lo guardamos en una variable - mostramos los primeros valores.
años <- unique(Cyber_Security_Breaches$year)
head(años)[1] 2009 2010 2008 1997 2003 2011
¿Hay valores nulos (0) en el dataset en alguna columna? Encuentra algunas
any(Cyber_Security_Breaches==0)[1] TRUE
any(Cyber_Security_Breaches$Individuals_Affected==0)[1] FALSE
any(Cyber_Security_Breaches$Number==0)[1] TRUE
indice <- which(Cyber_Security_Breaches$Number==0)
Cyber_Security_Breaches[indice,] Number Name_of_Covered_Entity State Business_Associate_Involved
1 0 Brooke Army Medical Center TX
Individuals_Affected Date_of_Breach Type_of_Breach
1 1000 10/16/2009 Theft
Location_of_Breached_Information Date_Posted_or_Updated
1 Paper 2014-06-30
Summary
1 A binder containing the protected health information (PHI) of up to 1,272 individuals was stolen from a staff member's vehicle. The PHI included names, telephone numbers, detailed treatment notes, and possibly social security numbers. In response to the breach, the covered entity (CE) sanctioned the workforce member and developed a new policy requiring on-call staff members to submit any information created during their shifts to the main office instead of adding it to the binder. Following OCR's investigation, the CE notified the local media about the breach.
breach_start breach_end year
1 2009-10-16 <NA> 2009
¿Hay valores NA (desconocidos) en el dataset?
sum(is.na(Cyber_Security_Breaches))[1] 910
which(is.na(Cyber_Security_Breaches)) [1] 11606 11607 11608 11609 11610 11611 11612 11613 11614 11615 11616 11617
[13] 11618 11619 11620 11621 11622 11623 11624 11625 11626 11627 11628 11629
[25] 11630 11631 11632 11633 11634 11635 11636 11637 11638 11639 11640 11641
[37] 11642 11643 11644 11645 11646 11647 11648 11649 11650 11651 11652 11653
[49] 11654 11655 11656 11657 11658 11659 11660 11661 11662 11663 11664 11665
[61] 11666 11667 11668 11669 11670 11671 11672 11673 11674 11675 11676 11677
[73] 11678 11679 11680 11681 11682 11683 11684 11685 11686 11687 11688 11689
[85] 11690 11691 11692 11693 11694 11695 11696 11697 11698 11699 11700 11701
[97] 11702 11703 11704 11705 11706 11707 11708 11709 11710 11711 11712 11713
[109] 11714 11715 11716 11717 11718 11719 11720 11721 11722 11723 11724 11725
[121] 11726 11727 11728 11729 11730 11731 11732 11733 11734 11735 11736 11737
[133] 11738 11739 11740 11741 11742 11743 11744 11745 11746 11747 11748 11749
[145] 11750 11751 11752 11753 11754 11755 11756 11757 11758 11759 11760 11761
[157] 11762 11763 11764 11765 11766 11767 11768 11769 11770 11771 11772 11773
[169] 11774 11775 11776 11777 11778 11779 11780 11781 11782 11783 11784 11785
[181] 11786 11787 11788 11789 11790 11791 11792 11793 11794 11795 11796 11797
[193] 11798 11799 11800 11801 11802 11803 11804 11805 11806 11807 11808 11809
[205] 11810 11811 11812 11813 11814 11815 11816 11817 11818 11819 11820 11821
[217] 11822 11823 11824 11825 11826 11827 11828 11829 11830 11831 11832 11833
[229] 11834 11835 11836 11837 11838 11839 11840 11841 11842 11843 11844 11845
[241] 11846 11847 11848 11849 11850 11851 11852 11853 11854 11855 11856 11857
[253] 11858 11859 11860 11861 11862 11863 11864 11865 11866 11867 11868 11869
[265] 11870 11871 11872 11873 11874 11875 11876 11877 11878 11879 11880 11881
[277] 11882 11883 11884 11885 11886 11887 11888 11889 11890 11891 11892 11893
[289] 11894 11895 11896 11897 11898 11899 11900 11901 11902 11903 11904 11905
[301] 11906 11907 11908 11909 11910 11911 11912 11913 11914 11915 11916 11917
[313] 11918 11919 11920 11921 11922 11923 11924 11925 11926 11927 11928 11929
[325] 11930 11931 11932 11933 11934 11935 11936 11937 11938 11939 11940 11941
[337] 11942 11943 11944 11945 11946 11947 11948 11949 11950 11951 11952 11953
[349] 11954 11955 11956 11957 11958 11959 11960 11961 11962 11963 11964 11965
[361] 11966 11967 11968 11969 11970 11971 11972 11973 11974 11975 11976 11977
[373] 11978 11979 11980 11981 11982 11983 11984 11985 11986 11987 11988 11989
[385] 11990 11991 11992 11993 11994 11995 11996 11997 11998 11999 12000 12001
[397] 12002 12003 12004 12005 12006 12007 12008 12009 12010 12011 12012 12013
[409] 12014 12015 12016 12017 12019 12020 12021 12022 12023 12024 12025 12026
[421] 12027 12028 12029 12030 12032 12033 12034 12035 12036 12037 12038 12039
[433] 12040 12041 12043 12045 12047 12049 12050 12051 12052 12053 12054 12055
[445] 12056 12057 12058 12059 12060 12062 12064 12065 12066 12067 12068 12069
[457] 12071 12072 12073 12074 12075 12076 12077 12078 12079 12080 12081 12082
[469] 12083 12084 12087 12088 12089 12090 12091 12092 12093 12095 12096 12098
[481] 12101 12102 12104 12105 12106 12107 12109 12110 12111 12113 12114 12115
[493] 12117 12119 12120 12121 12122 12123 12127 12128 12129 12133 12134 12135
[505] 12138 12139 12141 12144 12145 12146 12147 12148 12150 12151 12152 12153
[517] 12154 12155 12156 12157 12158 12161 12162 12163 12164 12165 12166 12167
[529] 12169 12171 12172 12174 12175 12176 12177 12178 12179 12181 12182 12183
[541] 12184 12185 12186 12187 12188 12189 12191 12192 12193 12194 12195 12196
[553] 12197 12199 12200 12201 12202 12203 12204 12207 12208 12209 12211 12212
[565] 12214 12215 12217 12219 12222 12223 12227 12228 12230 12233 12240 12241
[577] 12245 12246 12247 12248 12250 12251 12252 12253 12255 12256 12257 12258
[589] 12260 12263 12264 12265 12266 12267 12268 12271 12272 12273 12274 12275
[601] 12276 12277 12278 12281 12283 12284 12286 12289 12290 12291 12292 12293
[613] 12294 12295 12296 12297 12298 12299 12300 12302 12304 12305 12306 12311
[625] 12314 12315 12316 12317 12321 12323 12324 12325 12326 12327 12328 12329
[637] 12330 12332 12333 12334 12335 12336 12338 12339 12341 12342 12343 12344
[649] 12345 12346 12347 12348 12352 12353 12355 12356 12359 12360 12361 12362
[661] 12363 12364 12366 12367 12370 12372 12375 12376 12377 12378 12379 12381
[673] 12386 12387 12388 12389 12391 12392 12393 12396 12398 12400 12401 12402
[685] 12403 12405 12406 12409 12410 12411 12412 12413 12414 12415 12417 12418
[697] 12422 12425 12426 12427 12429 12431 12432 12434 12435 12437 12438 12439
[709] 12440 12441 12442 12443 12444 12445 12448 12449 12452 12453 12454 12455
[721] 12456 12457 12458 12459 12460 12461 12462 12465 12466 12467 12468 12469
[733] 12470 12471 12472 12473 12474 12476 12477 12479 12480 12481 12482 12483
[745] 12484 12485 12486 12487 12489 12491 12493 12494 12498 12499 12501 12502
[757] 12504 12505 12506 12507 12508 12509 12510 12511 12512 12513 12514 12515
[769] 12516 12517 12519 12520 12521 12522 12523 12524 12526 12527 12528 12529
[781] 12530 12531 12532 12533 12534 12535 12536 12537 12539 12540 12541 12542
[793] 12543 12544 12545 12546 12547 12548 12549 12550 12551 12552 12553 12554
[805] 12555 12556 12557 12558 12559 12560 12561 12562 12563 12564 12565 12566
[817] 12567 12568 12569 12570 12571 12572 12573 12574 12575 12576 12577 12578
[829] 12579 12580 12581 12582 12583 12584 12585 12586 12587 12588 12589 12590
[841] 12591 12592 12593 12594 12595 12596 12597 12598 12599 12600 12601 12602
[853] 12603 12604 12605 12606 12607 12608 12609 12610 12611 12612 12613 12614
[865] 12615 12616 12617 12618 12619 12620 12621 12622 12623 12624 12625 12626
[877] 12627 12628 12629 12630 12631 12632 12633 12634 12635 12636 12637 12638
[889] 12639 12640 12641 12642 12643 12644 12645 12646 12647 12648 12649 12650
[901] 12651 12652 12653 12654 12655 12656 12657 12658 12659 12660
colSums(is.na(Cyber_Security_Breaches)) Number Name_of_Covered_Entity
0 0
State Business_Associate_Involved
0 0
Individuals_Affected Date_of_Breach
0 0
Type_of_Breach Location_of_Breached_Information
0 0
Date_Posted_or_Updated Summary
0 0
breach_start breach_end
0 910
year
0
# x[is.na(x)] <- mean(x, na.rm = TRUE)4.5.2 select
seleccionar las variables en las que estamos interesados - ayuda - select(dataset, col1, col2, col3)
Ejercicio: Seleccionamos las primeras filas del dataset (solo esa información)
- usando posiciones de los atributos - columnas:
Cyber_Security_Breaches <- read.csv("data/Cyber Security Breaches.csv")
breaches <- Cyber_Security_Breaches %>%
select(1:3)- usando nombres de los atributos - columnas:
some_breaches <- Cyber_Security_Breaches %>% select(c("Name_of_Covered_Entity","Business_Associate_Involved","Individuals_Affected"))
head(some_breaches) Name_of_Covered_Entity
1 Brooke Army Medical Center
2 Mid America Kidney Stone Association, LLC
3 Alaska Department of Health and Social Services
4 Health Services for Children with Special Needs, Inc.
5 L. Douglas Carlson, M.D.
6 David I. Cohen, MD
Business_Associate_Involved Individuals_Affected
1 1000
2 1000
3 501
4 3800
5 5257
6 857
Podemos seleccionar intervalos de columnas (de col1 a col4):
s1 <- select(dataset, col1:col4)
s1- ‘deseleccionar’ columnas - NO SELECCIONAR COLUMNAS
select(s1, -c(col1))Otros argumentos interesantes select():
everything(), es mover algunas columnas o variables al principio del dataframe.
select(dataset, col5, col6, everything())starts_with('abc'): encuentra todas las columnas cuyo nombre comienza por “abc”.ends_with('xyz'): encuentra todas las columnas cuyo nombre termina en “xyz”.contains('ijk'): para seleccionar las columnas cuyo nombre contenga la cadena de caracteres “ijk”.
Y otras funciones más complejas (que filtran las columnas por expresiones regulares), que se pueden ver al hacer ?select.
4.5.3 filter
Extraer filas que cumplan una cierta condición.
filter()extrae un subconjunto de las observaciones (filas), basándose en los valores de una o más columnas.
Argumentos: - nombre del dataframe - expresiones (lógicas) para filtrar el dataframe
Ejercicio: ¿Entidades (Name_of_Covered_Entity) almacenadas en el último año de los que aparecen en el dataset?
last_year <- max(Cyber_Security_Breaches$year)
entidades <- Cyber_Security_Breaches %>%
select(c(Name_of_Covered_Entity,year))%>%
filter(year==last_year)
head(entidades) Name_of_Covered_Entity year
1 Kmart Corporation 2014
2 Lewis J. Sims, DPM, PC dba Sims and Associates Podiatry 2014
3 Joseph Michael Benson M.D 2014
4 Banner Health 2014
5 City of Hope 2014
6 University of California, San Francisco 2014
¿Cuántas son? (de dos formas)
entidades %>% count() n
1 56
dim(entidades)[1][1] 56
Mostrar todas las filas del dataset con más de 10000 individuos afectados. ¿Cuantas son?
breaches_1000 <- Cyber_Security_Breaches %>%
filter(Individuals_Affected>10000)
head(breaches_1000) X Number Name_of_Covered_Entity State
1 13 12 Universal American NY
2 22 21 BlueCross BlueShield Association DC
3 23 22 Kaiser Permanente Medical Care Program CA
4 45 44 Shands at UF FL
5 50 49 Ernest T. Bice, Jr. DDS, P.A. TX
6 59 58 Providence Hospital MI
Business_Associate_Involved Individuals_Affected Date_of_Breach
1 Democracy Data & Communications, LLC ( 83000 11/12/2009
2 Merkle Direct Marketing 15000 10/7/2009
3 15500 12/1/2009
4 12580 1/27/2010
5 21000 2/20/2010
6 83945 2/4/2010
Type_of_Breach Location_of_Breached_Information Date_Posted_or_Updated
1 Other Paper 2014-01-23
2 Theft Paper 2014-04-24
3 Theft Other Portable Electronic Device, Other 2014-01-23
4 Theft Laptop 2014-01-23
5 Theft Other Portable Electronic Device, Other 2014-01-23
6 Other Other 2014-01-23
Summary
1 In its breach report and during the course of OCR's investigation, the covered entity advised that it took various corrective actions to prevent a reoccurrence of the breach. Specifically, the covered entity conducted a risk assessment which revealed that the breach posed a significant risk of financial, reputational, or other harm to the 83,000 members. The covered entity sent notification letters to 83,000 members apologizing for the breach and offered a year of free credit monitoring and a $25,000 insurance policy against identity theft ($10,000 for New York residents). The covered entity also provided training to its call centers on November 29, 2009 to answer inquiries from callers concerned about the breach. In addition, media outlets were contacted to alert of a breach in states in which more than 500 members were impacted by the breach. The covered entity advised that media outlets were identified based on location of membership impacted, as well as ensuring it was a major media outlet and press releases were sent to 21 major media outlets on December 18, 2009. The covered entity also created and implemented a new policy titled 'Personal Health Information and Personal Identifiable Information Data Security and Handling Policy Acknowledgement Form' that centralized all data requests through a 'Team Track' which is an internal electronic submission request that ensures all PHI requested data receives the sign off of the Privacy Officer and Security Officer prior to release. Further, the covered entity also provided a mandatory annual computer-based training to all staff in May 2010.\n\n
2 The covered entity's (CE) business associate (BA) mailed protected health information (PHI) of approximately 15,000 individuals to incorrect addresses due to an error in its quarterly address update process. The mailing contained demographic information, explanations of benefits, clinical information, and diagnoses. Upon discovery of the breach, the CE collected the returned mail and verified that it had not been delivered, and updated its HIPAA policies and procedures. Following OCR's investigation, the CE was able to recover all or nearly all of the misdirected envelopes.
3
4 A laptop containing certain information collected on approximately 12,580 individuals referred to Shands at UF GI Clinical Services was stolen from the private residence of an employee. The stolen information included patient names, social security numbers, and medical record numbers. As a result of the incident, the employee was counseled by her supervisor, issued written corrective action with a 3-day suspension, and provided additional HIPAA training. OCR reviewed Shands at UF's most recent Risk Analysis and Risk Management Plans and they revealed no high risk findings related to encryption, workstation use, or physical security. OCR's investigation found that Shands at UF has implemented appropriate technical safeguards, such as secure VPN network connections and network storage for workforce usage, encrypted USB portable flash drives, and PGP whole disk encryption. \n\n
5 Three unencrypted external back-up drives were stolen from a safe in the covered entity's locked office. The laptop computer contained the protected health information of approximately 21,000 individuals. The protected health information involved in the breach included names, addresses phone numbers, dates of birth, social security numbers, insurance information, and treatment histories. Following the breach, the covered entity moved back-up data offsite and encrypted all workstations. Additionally, OCR's investigation resulted in the covered entity improving their physical safeguards and in retraining employees.\n\n
6
breach_start breach_end year
1 2009-11-12 <NA> 2009
2 2009-10-07 <NA> 2009
3 2009-12-01 <NA> 2009
4 2010-01-27 <NA> 2010
5 2010-02-20 <NA> 2010
6 2010-02-04 <NA> 2010
breaches_1000 %>% count() n
1 185
Mostrar solo las entidades y el tipo de incidente con más de 10000 individuos afectados (
Individuals_Affected).
last_year <- max(Cyber_Security_Breaches$year)
breaches_10000_last_year <- Cyber_Security_Breaches %>%
filter(Individuals_Affected>10000)%>%
select(c(2, 5, 7))
head(breaches_10000_last_year) Number Business_Associate_Involved Date_of_Breach
1 12 Democracy Data & Communications, LLC ( 11/12/2009
2 21 Merkle Direct Marketing 10/7/2009
3 22 12/1/2009
4 44 1/27/2010
5 49 2/20/2010
6 58 2/4/2010
breaches_10000_last_year %>% count() n
1 185
Mostrar las entidades y el tipo de incidente del dataset con más de 10000 individuos afectados en el último año (
year) de los que aparecen en el dataset .
last_year <- max(Cyber_Security_Breaches$year)
breaches_10000_last_year <- Cyber_Security_Breaches %>%
filter(Individuals_Affected>10000)%>%
filter(year==last_year) %>%
select(c(2, 5, 7,13))
head(breaches_10000_last_year) Number Business_Associate_Involved Date_of_Breach breach_end
1 939 1/4/2014 <NA>
2 960 2/21/2014 <NA>
3 979 Sutherland Healthcare Solutions 2/5/2014 <NA>
4 997 Sutherland Healthcare Solutions, Inc. 2/5/2014 <NA>
5 1012 2/17/2014 <NA>
6 1014 2/11/2014 <NA>
breaches_10000_last_year %>% count() n
1 7
4.5.4 mutate
El verbo
mutate()se usa para añadir nuevas columnas al final del dataframe.
Ejercicio: Queremos añadir una columna (logical) que se denomine relevant si se ha identificado la compañía y tiene más de 10000 individuos afectados.
- ya de paso aprendemos a definir funciones
is_relevant("Rick Lawson, Professional Computer Services",100000000)
[1] TRUE
> is_relevant("",100000000)
[1] FALSE
is_relevant <- function(a,b){
# if((str_length(a)!=0&b>10000)){TRUE} else{FALSE}
ifelse(str_length(a) > 0 & b > 10000, TRUE, FALSE)
}
library(stringr)
Cyber_Security_Breaches <- Cyber_Security_Breaches %>%
mutate(relevant= is_relevant(Business_Associate_Involved,Individuals_Affected))
head(Cyber_Security_Breaches) X Number Name_of_Covered_Entity State
1 1 0 Brooke Army Medical Center TX
2 2 1 Mid America Kidney Stone Association, LLC MO
3 3 2 Alaska Department of Health and Social Services AK
4 4 3 Health Services for Children with Special Needs, Inc. DC
5 5 4 L. Douglas Carlson, M.D. CA
6 6 5 David I. Cohen, MD CA
Business_Associate_Involved Individuals_Affected Date_of_Breach
1 1000 10/16/2009
2 1000 9/22/2009
3 501 10/12/2009
4 3800 10/9/2009
5 5257 9/27/2009
6 857 9/27/2009
Type_of_Breach Location_of_Breached_Information Date_Posted_or_Updated
1 Theft Paper 2014-06-30
2 Theft Network Server 2014-05-30
3 Theft Other Portable Electronic Device, Other 2014-01-23
4 Loss Laptop 2014-01-23
5 Theft Desktop Computer 2014-01-23
6 Theft Desktop Computer 2014-01-23
Summary
1 A binder containing the protected health information (PHI) of up to 1,272 individuals was stolen from a staff member's vehicle. The PHI included names, telephone numbers, detailed treatment notes, and possibly social security numbers. In response to the breach, the covered entity (CE) sanctioned the workforce member and developed a new policy requiring on-call staff members to submit any information created during their shifts to the main office instead of adding it to the binder. Following OCR's investigation, the CE notified the local media about the breach.
2 Five desktop computers containing unencrypted electronic protected health information (e-PHI) were stolen from the covered entity (CE). Originally, the CE reported that over 500 persons were involved, but subsequent investigation showed that about 260 persons were involved. The ePHI included demographic and financial information. The CE provided breach notification to affected individuals and HHS. Following the breach, the CE improved physical security by installing motion detectors and alarm systems security monitoring. It improved technical safeguards by installing enhanced antivirus and encryption software. As a result of OCR's investigation the CE updated its computer password policy.
3
4 A laptop was lost by an employee while in transit on public transportation. The computer contained the protected health information of 3800 individuals. The protected health information involved in the breach included names, Medicaid ID numbers, dates of birth, and primary physicians. In response to this incident, the covered entity took steps to enforce the requirements of the Privacy & Security Rules. The covered entity has installed encryption software on all employee computers, strengthened access controls including passwords, reviewed and updated security policies and procedures, and updated it risk assessment. In addition, all employees received additional security training. \n\n
5 A shared Computer that was used for backup was stolen on 9/27/09 from the reception desk area of the covered entity. The Computer contained certain electronic protected health information (ePHI) of 5,257 individuals who were patients of the CE. The ePHI involved in the breach included names, dates of birth, and clinical information, but there were no social security numbers, financial information, addresses, phone numbers, or other ePHI in any of the reports on the disks or the hard drive on the stolen Computer. Following the breach, the covered entity notified all 5,257 affected individuals and the appropriate media; added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer; added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet; and added administrative safeguards by requiring annual refresher retraining of CE staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules.\n\n
6 A shared Computer that was used for backup was stolen from the reception desk area, behind a locked desk area, probably while a cleaning crew had left the main door to the building open and the door to the suite was unlocked and perhaps ajar. The Computer contained certain electronic protected health information (ePHI) of 857 patients. The ePHI involved in the breach included names, dates of birth, and clinical information. Following the breach, the covered entity notified all affected individuals and the media, added technical safeguards of encryption for all ePHI stored on the USB flash drive or the CD used on the replacement computer, added physical safeguards by keeping new portable devices locked when not in use in a secure combination safe in doctor's private office or in a secure filing cabinet, and added administrative safeguards by requiring annual refresher retraining staff for Privacy and Security Rules as well as requiring immediate retraining of cleaning staff in both Rules, which has already taken place.\n\n
breach_start breach_end year relevant
1 2009-10-16 <NA> 2009 FALSE
2 2009-09-22 <NA> 2009 FALSE
3 2009-10-12 <NA> 2009 FALSE
4 2009-10-09 <NA> 2009 FALSE
5 2009-09-27 <NA> 2009 FALSE
6 2009-09-27 <NA> 2009 FALSE
4.5.5 arrange
Para ordenar con
arrange()indicar columnas por las que ordenar ascendentemente, en caso de querer descendente:desc(variable)
Ejercicio: Ordenamos por esta nueva columna (primero las que si son relevantes) y año (de mayor a menor). Reordenamos las columnas
Cyber_Security_Breaches <- Cyber_Security_Breaches %>%
arrange(desc(relevant),desc(year))%>%
select(last_col(),everything()) 4.5.6 summarise
summarize se suele usar con
group_by()
- n() contar
- n_distinct() valores únicos
- mean(), min(), max()
- first(), last(), nth()
¿Cuáles son los incidentes (
Type_of_Breach) que más aparecen (de mayor a menor)?, ¿cuántas veces aparece cada uno?
Breaches <- Cyber_Security_Breaches %>%
group_by(Type_of_Breach)%>%
summarise(cantidad=n()) %>%
arrange(desc(cantidad))
Breaches# A tibble: 29 × 2
Type_of_Breach cantidad
<chr> <int>
1 Theft 516
2 Unauthorized Access/Disclosure 148
3 Other 91
4 Loss 85
5 Hacking/IT Incident 75
6 Improper Disposal 38
7 Theft, Unauthorized Access/Disclosure 26
8 Theft, Loss 15
9 Unknown 10
10 Unauthorized Access/Disclosure, Hacking/IT Incident 9
# ℹ 19 more rows
¿Cuáles son los incidentes (
Type_of_Breach) que más han ocurrido y cuantos individuos han sido afectados?
Breaches <- Cyber_Security_Breaches %>%
group_by(Type_of_Breach)%>%
summarise(sum_individual=sum(Individuals_Affected))
Breaches# A tibble: 29 × 2
Type_of_Breach sum_individual
<chr> <int>
1 Hacking/IT Incident 1878870
2 Hacking/IT Incident, Other 3200
3 Improper Disposal 671594
4 Improper Disposal, Unauthorized Access/Disclosure 10000
5 Loss 7254286
6 Loss, Improper Disposal 5690
7 Loss, Other 34534
8 Loss, Unauthorized Access/Disclosure 33638
9 Loss, Unauthorized Access/Disclosure, Unknown 2533
10 Loss, Unknown 13035
# ℹ 19 more rows
4.5.7 rename, slice
- función
rename(), que, internamente, se comporta comoselect(), pero guardando todas las variables que no se mencionan explícitamente:
¿Cuáles son los 3 incidentes (
Type_of_Breach) que más han ocurrido y cuantos individuos han sido afectados ordenados de mayor a menor por el total de los individuos ?
Breaches <- Cyber_Security_Breaches %>%
group_by(Type_of_Breach)%>%
summarise(sum_individual=sum(Individuals_Affected)) %>%
arrange(desc(sum_individual))%>%
slice(1:3)
Breaches# A tibble: 3 × 2
Type_of_Breach sum_individual
<chr> <int>
1 Theft 16515554
2 Loss 7254286
3 Unknown 1918312
4.6 Operaciones con dos tablas
4.7 Backends
In addition to data frames/tibbles, dplyr makes working with other computational backends accessible and efficient. Below is a list of alternative backends:
dtplyr: for large, in-memory datasets. Translates your dplyr code to high performance data.table code.
dbplyr: for data stored in a relational database. Translates your dplyr code to SQL.
sparklyr: for very large datasets stored in Apache Spark.
5 Ejercicios
- Obtener cuántas entidades han sido afectadas para cada tipo de incidente.
- Hallar la media de individuos afectados para los incidentes Theft:
- Hallar el máximo número de individuos afectados en cada estado ordenados de forma descendente por
Location_of_Breached_Information
- ¿Cuantos incidentes han ocurrido (filas de la tabla) en un Desktop Computer?
Obtener los tipos de incidentes (
Type_of_Breacg) que solo aparecen una vez en el dataset en algún año y ordenarlos de forma descendente por número de individuos afectados.Para cada año extraer las filas que tienen más individuos afectados que la media de los individuos.
¿Qué porcentaje de compañías afectadas aparecen en el dataset (del total de filas)?
- ¿Cual son las entidades e incidentes con más y con menos individuos afectados (mostrar además la fecha y la localización?
9.¿Cuáles son las 3 informes (Summary) más reportados en el dataset?