[R] String Manipulation Functions: Introduction and Examples

This post introduces functions and techniques for manipulating strings in R.

INTRO

Each string manipulation function is explained with an example. The examples are primarily in English; where a function also works on Korean text, an additional Korean example is included for reference.

Creating the data

The string used throughout this post can be created with the code below.

mytext <- c("Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.")

mytext

[1] "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data."

String manipulation functions

1. nchar()

nchar() takes a string as input and returns the number of characters in it. The result below shows that mytext contains 362 characters.

nchar(mytext)

[1] 362

nchar() also works on Korean text. As the result below shows, each English letter, Korean character, and space is counted as one character.

nchar("R은 데이터 분석 언어 중 하나입니다.")

[1] 21
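As a supplementary sketch, the code below shows two further properties of nchar(): it is vectorised, and its type argument switches between counting characters and counting bytes. The words vector is a made-up sample, and the Korean string "R은" is written with a Unicode escape only so that the example does not depend on the file encoding.

```r
# Hypothetical sample vector, not taken from mytext.
words <- c("Data", "science")

# nchar() is vectorised: it returns one count per element.
nchar(words)                     # 4 7

# "R\uc740" is the string "R은" written with a Unicode escape.
korean <- "R\uc740"

# type = "chars" (the default) counts characters.
nchar(korean, type = "chars")    # 2

# type = "bytes" counts storage bytes; a Hangul syllable takes 3 bytes in UTF-8.
nchar(korean, type = "bytes")    # 4
```

The byte count matters mainly when data is written to a fixed-width field or a database column with a byte limit.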

2. toupper()

toupper() converts every character in a string to uppercase and returns the result.

toupper(mytext)

[1] "DATA SCIENCE IS AN INTERDISCIPLINARY FIELD THAT USES SCIENTIFIC METHODS, PROCESSES, ALGORITHMS AND SYSTEMS TO EXTRACT KNOWLEDGE AND INSIGHTS FROM NOISY, STRUCTURED AND UNSTRUCTURED DATA,[1][2] AND APPLY KNOWLEDGE AND ACTIONABLE INSIGHTS FROM DATA ACROSS A BROAD RANGE OF APPLICATION DOMAINS. DATA SCIENCE IS RELATED TO DATA MINING, MACHINE LEARNING AND BIG DATA."

3. tolower()

tolower() converts every character in a string to lowercase and returns the result.

tolower(mytext)

[1] "data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. data science is related to data mining, machine learning and big data."
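One common use of case conversion is to normalise text before comparing or counting it. The sketch below assumes a made-up words vector with inconsistent capitalisation; it is only an illustration of the idea.

```r
# Hypothetical word vector with inconsistent capitalisation.
words <- c("Data", "data", "DATA", "science")

# Convert to lowercase first, then compare or count.
lower <- tolower(words)
sum(lower == "data")             # 3
unique(lower)                    # "data" "science"

# casefold() is a base R wrapper around tolower() and toupper().
casefold("Data", upper = TRUE)   # "DATA"
```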

4. chartr()

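chartr() replaces specific characters in a string, one character at a time. The first argument lists the characters to be replaced, the second lists the replacement characters, and the third is the string to process.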

The code below replaces each space character (" ") with an underscore (_).

chartr(" ", "_", mytext)

[1] "Data_science_is_an_interdisciplinary_field_that_uses_scientific_methods,_processes,_algorithms_and_systems_to_extract_knowledge_and_insights_from_noisy,_structured_and_unstructured_data,[1][2]_and_apply_knowledge_and_actionable_insights_from_data_across_a_broad_range_of_application_domains._Data_science_is_related_to_data_mining,_machine_learning_and_big_data."

chartr() works not only on symbols but also on both Korean and English characters.

chartr("R", "알", "R은 데이터 분석 언어 중 하나입니다.")

[1] "알은 데이터 분석 언어 중 하나입니다."
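Because chartr() translates character by character, several characters can be mapped in one call, but a whole word cannot be replaced. The sketch below uses made-up input strings and shows gsub() as one possible alternative for replacing a literal substring.

```r
# Several characters are translated in one call: "d" -> "D" and "s" -> "S".
chartr("ds", "DS", "data science")                      # "Data Science"

# chartr() maps single characters, so it cannot swap one word for another.
# gsub() with fixed = TRUE replaces a literal substring instead.
gsub("data", "information", "big data", fixed = TRUE)   # "big information"
```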

5. strsplit()

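strsplit() splits a string using an expression. The first argument is the string to split, and the second is the expression to split on, which is treated as a regular expression by default. The code below splits the string on spaces (" ").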

mylist <- strsplit(mytext," ")
mylist

[[1]]
 [1] "Data"              "science"           "is"                "an"               
 [5] "interdisciplinary" "field"             "that"              "uses"             
 [9] "scientific"        "methods,"          "processes,"        "algorithms"       
[13] "and"               "systems"           "to"                "extract"          
[17] "knowledge"         "and"               "insights"          "from"             
[21] "noisy,"            "structured"        "and"               "unstructured"     
[25] "data,[1][2]"       "and"               "apply"             "knowledge"        
[29] "and"               "actionable"        "insights"          "from"             
[33] "data"              "across"            "a"                 "broad"            
[37] "range"             "of"                "application"       "domains."         
[41] "Data"              "science"           "is"                "related"          
[45] "to"                "data"              "mining,"           "machine"          
[49] "learning"          "and"               "big"               "data."

strsplit("R은 데이터 분석 언어 중 하나입니다."," ")

[[1]]
[1] "R은"         "데이터"      "분석"        "언어"        "중"          "하나입니다."

The output above shows that the string has been split at every space. The result is returned as a list, however, so let's convert it to a vector, which is easier to work with. The unlist() function does this.

mylist2 <- unlist(mylist)
mylist2

 [1] "Data"              "science"           "is"                "an"               
 [5] "interdisciplinary" "field"             "that"              "uses"             
 [9] "scientific"        "methods,"          "processes,"        "algorithms"       
[13] "and"               "systems"           "to"                "extract"          
[17] "knowledge"         "and"               "insights"          "from"             
[21] "noisy,"            "structured"        "and"               "unstructured"     
[25] "data,[1][2]"       "and"               "apply"             "knowledge"        
[29] "and"               "actionable"        "insights"          "from"             
[33] "data"              "across"            "a"                 "broad"            
[37] "range"             "of"                "application"       "domains."         
[41] "Data"              "science"           "is"                "related"          
[45] "to"                "data"              "mining,"           "machine"          
[49] "learning"          "and"               "big"               "data."

unlist(strsplit("R은 데이터 분석 언어 중 하나입니다."," "))

[1] "R은"         "데이터"      "분석"        "언어"        "중"          "하나입니다."
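The sketch below, using made-up input strings, illustrates why strsplit() returns a list and how the split pattern is interpreted. It is a supplementary example rather than part of the walkthrough above.

```r
# strsplit() returns one list element per input string,
# which is why the result is a list even for a single string.
parts <- strsplit(c("big data", "data mining tools"), " ")
lengths(parts)                                 # 2 3
parts[[2]]                                     # "data" "mining" "tools"

# The pattern is a regular expression by default, so "." would match
# any character. fixed = TRUE treats it as a literal dot.
strsplit("a.b.c", ".", fixed = TRUE)[[1]]      # "a" "b" "c"

# A regular expression can split on a comma followed by optional spaces.
strsplit("methods, processes,algorithms", ", *")[[1]]
# "methods" "processes" "algorithms"
```

Indexing with [[1]] is an alternative to unlist() when only one string was split.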

6. sort()

sort() sorts a vector, so it can put the character vector produced by the split above into alphabetical order.

sorting <- sort(mylist2)
sorting

 [1] "a"                 "across"            "actionable"        "algorithms"       
 [5] "an"                "and"               "and"               "and"              
 [9] "and"               "and"               "and"               "application"      
[13] "apply"             "big"               "broad"             "data"             
[17] "data"              "Data"              "Data"              "data,[1][2]"      
[21] "data."             "domains."          "extract"           "field"            
[25] "from"              "from"              "insights"          "insights"         
[29] "interdisciplinary" "is"                "is"                "knowledge"        
[33] "knowledge"         "learning"          "machine"           "methods,"         
[37] "mining,"           "noisy,"            "of"                "processes,"       
[41] "range"             "related"           "science"           "science"          
[45] "scientific"        "structured"        "systems"           "that"             
[49] "to"                "to"                "unstructured"      "uses"

sort(unlist(strsplit("오픈소스 R은 데이터 분석 언어 중 하나입니다."," ")))

[1] "R은"         "데이터"      "분석"        "언어"        "오픈소스"    "중"     "하나입니다."
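The relative position of "data" and "Data" in the output above depends on the collation rules of the current locale. The sketch below, with a made-up words vector, shows the decreasing argument and method = "radix", which always sorts in C-locale order regardless of the session locale.

```r
# Hypothetical sample vector.
words <- c("data", "Data", "and", "big")

# Default sort: the order of "data" and "Data" follows the locale's collation.
sort(words)

# decreasing = TRUE reverses the order.
sort(words, decreasing = TRUE)

# method = "radix" uses C-locale order, where uppercase letters
# come before all lowercase letters.
sort(words, method = "radix")                            # "Data" "and" "big" "data"

# unique() before sort() drops repeated words.
sort(unique(c("and", "data", "and")), method = "radix")  # "and" "data"
```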

7. paste()

paste() concatenates the elements of a character vector. The separator placed between the elements is specified with the collapse= option.

paste(sorting, collapse = " ")

[1] "a across actionable algorithms an and and and and and and application apply big broad data data Data Data data,[1][2] data. domains. extract field from from insights insights interdisciplinary is is knowledge knowledge learning machine methods, mining, noisy, of processes, range related science science scientific structured systems that to to unstructured uses"

paste(unlist(strsplit("오픈소스 R은 데이터 분석 언어 중 하나입니다."," ")), collapse = "-")

[1] "오픈소스-R은-데이터-분석-언어-중-하나입니다."
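paste() also has a sep argument, which is easy to confuse with collapse. The short sketch below uses made-up inputs to show the difference: sep joins corresponding elements of several vectors, while collapse joins the elements of the resulting vector into one string.

```r
# sep joins the arguments element by element.
paste("Data", "science", sep = "_")                  # "Data_science"

# paste0() is paste() with sep = "".
paste0("x", 1:3)                                     # "x1" "x2" "x3"

# With both options, sep is applied first and collapse second.
paste(c("a", "b"), 1:2, sep = "-", collapse = "+")   # "a-1+b-2"
```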

8. substr()

substr() extracts a specified part of a string. Pass the start and stop indices as the second and third arguments to get the consecutive characters in that range; a space counts as one character.

subs <- substr(mytext, start = 3, stop = 30)
subs

[1] "ta science is an interdiscip"

substr("R은 데이터 분석 언어 중 하나입니다.", start = 4, stop = 9)

[1] "데이터 분석"
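As a supplementary sketch with made-up strings, the code below shows three related base R features: substring(), whose end index is optional, the replacement form of substr(), and the fact that substr() is vectorised.

```r
x <- "Data science"

# substring() extracts to the end of the string when no end index is given.
substring(x, 6)                           # "science"

# The replacement form overwrites the characters in the given range.
substr(x, 1, 4) <- "DATA"
x                                         # "DATA science"

# substr() is vectorised over its first argument.
substr(c("machine", "learning"), 1, 3)    # "mac" "lea"
```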

9. trimws()

trimws() removes whitespace from the start and end of a string.

substr("R은 데이터 분석 언어 중 하나입니다.", start = 3, stop = 10)

[1] " 데이터 분석 "

trimws(substr("R은 데이터 분석 언어 중 하나입니다.", start = 3, stop = 10))

[1] "데이터 분석"
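trimws() also accepts a which argument to trim only one side. The sketch below uses a made-up string and is meant only to illustrate the options.

```r
x <- "  data science  "

# Trim only the leading or only the trailing whitespace.
trimws(x, which = "left")     # "data science  "
trimws(x, which = "right")    # "  data science"

# The default, which = "both", also removes tabs and newlines.
trimws("\tdata\n")            # "data"
```

Whitespace inside the string, such as the space between "data" and "science", is left untouched.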

10. str_sub()

str_sub() can also count backwards from the end of a string when extracting a substring. For example, to select the last five characters as in the example below, use the str_sub() function from the stringr package.

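Note that when counting backwards, both the start and end arguments are negative: the start point is the fifth character from the end of the string, and the end point is the index of the last character.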

library(stringr)
str_sub(mytext, -5, -1)

[1] "data."

str_sub("R은 데이터 분석 언어 중 하나입니다.", -5, -1)

[1] "나입니다."
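If stringr is not available, a similar result can be obtained in base R by combining nchar() and substr(). The last_chars() function below is a hypothetical helper written for this illustration; it is not part of base R or stringr.

```r
# last_chars() is a hypothetical helper: it returns the last n characters of x.
last_chars <- function(x, n) {
  len <- nchar(x)                  # total number of characters
  substr(x, len - n + 1, len)      # from the n-th character from the end to the last
}

last_chars("big data.", 5)                 # "data."

# Like substr(), the helper is vectorised over x.
last_chars(c("mining", "learning"), 3)     # "ing" "ing"
```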