[Data Science with R] 1. Data Visualization (202405)

1. Syntax for Building a Plot

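A plot starts from ggplot(data), and the layers you want to draw are added on top of it. The ggplot() function plays the role of a blank canvas.
Geom functions add layers to the frame that ggplot() creates. For a list of the available geoms, see https://moogie.tistory.com/43.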

ggplot(data, mapping=aes(...)) + geom_function(mapping=aes(...))
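
As a concrete instance of the template above, here is a minimal sketch that fills in the placeholders with the mpg data set bundled with ggplot2. The choice of displ and hwy is only for illustration.

```r
library(ggplot2)

# Canvas: the data plus a global mapping of displ -> x and hwy -> y
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()   # layer: one point per row of mpg

p
```

Swapping geom_point() for another geom function changes the layer type while the canvas stays the same.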

2. Aesthetics

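To make a graphical property vary with the values of a variable, map the visual element to that variable inside aes().
Setting a visual element to a fixed value outside aes() sets the property manually, so it does not change with the data.
Common aesthetic properties
- x (position), y (position), xend (x coordinate of the end point), yend (y coordinate of the end point)
- color (outline color), fill (interior color), alpha (transparency)
- size, stroke (outline thickness), shape, width, linetype (line pattern), linewidth (line thickness)
- group (grouping)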
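
The difference between mapping an aesthetic inside aes() and setting it outside can be sketched as follows. The mpg data set and the color "blue" are arbitrary choices for the example.

```r
library(ggplot2)

# Mapped: color depends on the value of drv, so a legend is drawn
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))

# Set: every point is blue regardless of the data, so no legend appears
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
```

A common slip is writing aes(color = "blue"), which maps the constant string "blue" as if it were a data value instead of setting the color.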

[Note 1] With aes(color = displ < 5), observations with displ below 5 and observations with displ of 5 or more are drawn in different colors.
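
A short sketch of mapping a logical expression, again using mpg as an assumed example data set:

```r
library(ggplot2)

# The expression displ < 5 is evaluated per row, giving TRUE/FALSE,
# and each of the two values receives its own color
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = displ < 5))

p
```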

[Note 2] Geoms drop rows with missing values before drawing. By default this produces a warning; setting na.rm = TRUE inside the geom function removes them silently.
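
The effect of na.rm can be seen with a tiny data frame. The data frame df below is hypothetical and exists only to provide one missing value.

```r
library(ggplot2)

# Hypothetical toy data with a single missing y value
df <- data.frame(x = c(1, 2, 3, 4), y = c(2, NA, 4, 5))

# Default (na.rm = FALSE): the incomplete row is dropped with a warning when drawn
p_warn <- ggplot(df, aes(x, y)) + geom_point()

# na.rm = TRUE: the same row is dropped without a warning
p_quiet <- ggplot(df, aes(x, y)) + geom_point(na.rm = TRUE)
```

Both plots show the same three points; only the warning differs.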

[Note 3] Every geom function has a default stat and position, and supplying a different stat / position produces different effects.
(e.g. geom_col is geom_bar with the default stat "count" replaced by "identity")
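
To illustrate the relationship between geom_bar and geom_col, the sketch below draws the same bars three ways. The pre-counted table counts is a helper built here for the example, not part of the original post.

```r
library(ggplot2)

# geom_bar counts the rows itself (stat = "count")
p_bar <- ggplot(mpg, aes(x = drv)) + geom_bar()

# Pre-computed counts: columns drv and Freq
counts <- as.data.frame(table(drv = mpg$drv))

# geom_col plots the y values as given (stat = "identity")
p_col <- ggplot(counts, aes(x = drv, y = Freq)) + geom_col()

# Equivalent to geom_col: geom_bar with its stat switched to "identity"
p_bar_identity <- ggplot(counts, aes(x = drv, y = Freq)) +
  geom_bar(stat = "identity")
```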

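[Note 4] An aesthetic mapping defined at the global level, inside ggplot(), is passed on to every subsequent geom layer,
whereas a mapping defined at the local level, inside a geom function, applies only to that layer and is not passed on.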

## Global-level mapping vs. local-level mapping
library(tidyverse)
library(palmerpenguins)
library(gridExtra)
p1 <- ggplot(data = penguins, mapping=aes(x=flipper_length_mm, y=body_mass_g, color = species)) + 
  geom_point() + geom_smooth(method = "loess") + ggtitle("Global Level") +
  theme(legend.position = "bottom")
p2 <- ggplot(data = penguins, mapping=aes(x=flipper_length_mm, y=body_mass_g)) + 
  geom_point(aes(color = species)) + geom_smooth(method = "loess") + ggtitle("Local Level") +
  theme(legend.position = "bottom")
grid.arrange(p1, p2, ncol=1)

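In the first plot, color = species is defined inside ggplot(), so it is passed on to the geom functions that follow.
As a result, geom_point and geom_smooth draw the points and smoothed curves per species even though nothing is written inside them.
In the second plot, color = species is written inside geom_point, so the points are colored by species,
but a geom function does not pass its aesthetics on to later layers, so geom_smooth ignores species and fits a single curve to all the data.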

3. Faceting

Facet on a discrete variable rather than a continuous one. If you need to facet on a continuous variable, bin it first.
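
One possible way to bin a continuous variable before faceting is ggplot2's cut_number(), which splits the values into groups of roughly equal size. The column name displ_bin and the choice of three bins are assumptions made for this sketch.

```r
library(ggplot2)

# Copy mpg and add a binned version of the continuous variable displ
mpg_binned <- mpg
mpg_binned$displ_bin <- cut_number(mpg$displ, n = 3)  # three roughly equal-sized groups

# Facet on the binned (now discrete) variable
p <- ggplot(mpg_binned, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~displ_bin)

p
```

cut_width() and cut_interval() are alternatives when bins of equal width are preferred over bins of equal size.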

facet_wrap(formula, [nrow], [ncol], scales) : wraps a 1d sequence of panels into 2d
facet_grid(formula, scales) : forms a matrix of panels defined by row and column faceting variables

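Both functions split a plot into facets, but they differ in the following way.
With two faceting variables, facet_wrap shows only the facets that contain data, whereas facet_grid shows a facet for every combination even when it contains no data.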

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_wrap(drv~class)

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_grid(drv~class)

[Parameter] scales controls whether the facets share scales (x-axis and y-axis ranges, and so on).
(The default is "fixed", where all facets share the same x and y ranges; it can be changed to "free_x", "free_y", or "free".)

[Parameter] labeller renames the facet labels.
(Used as labeller = labeller(variable = c("level1" = "new level1", ...)))

mpg |> ggplot(mapping=aes(x=displ, y=hwy, color = drv)) + geom_point() + 
  facet_grid(~drv) 
mpg |> ggplot(mapping=aes(x=displ, y=hwy, color = drv)) + geom_point() + 
  facet_grid(~drv, scales = "free", labeller = labeller(drv = c("4"="four", "f"="front", "r"="rear")))

4. Visualizing the Distribution of a Single Variable

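geom_bar : bar chart of the distribution of a single categorical variable
geom_density : kernel density estimate of a single continuous variable
geom_histogram : rough view of the distribution of a single continuous variable (bins, binwidth)
geom_freqpoly : rough view of the distribution of a single continuous variable (bins, binwidth)
(For geom_histogram and geom_freqpoly, try several values of bins or binwidth to explore the data from different angles.)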

library(patchwork)  # provides +, / and | for combining ggplot objects
p1 <- mpg |> ggplot(mapping=aes(x=manufacturer)) + geom_bar() + ggtitle("bar plot")
p2 <- mpg |> ggplot(mapping=aes(x=displ)) + geom_density() + ggtitle("kernel density plot")
p3 <- mpg |> ggplot(mapping=aes(x=displ)) + geom_histogram() + ggtitle("histogram")
p4 <- mpg |> ggplot(mapping=aes(x=displ)) + geom_freqpoly() + ggtitle("freqpoly")
p1 + p2 + p3 + p4
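
Because the apparent shape of a histogram depends on the bin size, it can help to draw the same variable with more than one binwidth. The two widths below are arbitrary values chosen for the sketch.

```r
library(ggplot2)

# Coarse bins: a smooth overall shape, but detail is hidden
p_coarse <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 1) +
  ggtitle("binwidth = 1")

# Fine bins: more detail, but also more noise
p_fine <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.25) +
  ggtitle("binwidth = 0.25")
```

Setting binwidth (or bins) explicitly also avoids the message that geom_histogram prints when it falls back to its default of 30 bins.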

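[Note] Drawing a bar chart of proportions instead of counts

ggplot(data) + geom_bar(aes(x = x, y = after_stat(prop), group = 1)) (after_stat: delayed evaluation, computed after the stat)
For a proportion bar chart, adding scale_y_continuous(labels = scales::label_percent()) changes the y-axis tick labels to percentages.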

p5 <- mpg |> ggplot() + 
  geom_bar(aes(x=manufacturer, y=after_stat(prop), group=1)) +
  scale_y_continuous(labels = scales::label_percent()) +
  ggtitle("proportion")
p1 + p5

5. Visualizing the Distribution of Two Variables

Categorical vs. numeric, part 1 : geom_boxplot, geom_density, geom_violin, lvplot::geom_lv

The box-and-whisker plot is a widely used way to visualize a distribution based on its quartiles and IQR (interquartile range), and is useful for spotting outliers.
The density plot visualizes a distribution through a kernel density estimate (KDE) and can draw one density curve per level of a categorical variable.
The violin plot blends the box plot with the density curve and shows the full shape of the distribution.
The LV plot (letter-value plot) extends the box plot and addresses its shortcomings on large data sets.

## Categorical vs. numeric : geom_boxplot, geom_density, geom_violin, lvplot::geom_lv
p1 <- penguins |> ggplot(mapping=aes(x=species, y=body_mass_g, fill=species)) + geom_boxplot() +
  theme(legend.position = "none") + ggtitle("Box-Plot")
p2 <- penguins |> ggplot(mapping=aes(x=body_mass_g, color=species, fill=species)) + geom_density(alpha=0.6) +
  theme(legend.position = "none") + ggtitle("Kernel Density Estimate")
p3 <- penguins |> ggplot(aes(x=species, y=body_mass_g, fill=species)) + geom_violin() + 
  theme(legend.position = "none") + ggtitle("Violin-Plot")
p4 <- penguins |> ggplot(aes(x=species, y=body_mass_g, fill=species)) + lvplot::geom_lv() + 
  theme(legend.position = "none") + ggtitle("LV-Plot")
grid.arrange(p1, p2, p3, p4, ncol=2)

Categorical vs. numeric, part 2 : ggridges::geom_density_ridges2

diamonds |> 
  ggplot(aes(x=carat, y=cut, fill = cut)) +
  ggridges::geom_density_ridges2()

Categorical vs. categorical : geom_bar, geom_count

## Categorical vs. categorical : geom_bar, geom_count
p1 <- penguins |> ggplot(aes(x=island, fill=species)) + geom_bar(position = "stack") + ggtitle("stacked bar plot")
p2 <- penguins |> ggplot(aes(x=island, fill=species)) + geom_bar(position = "fill") + ggtitle("stacked 100% bar plot")
p3 <- penguins |> ggplot(aes(x=island, fill=species)) + geom_bar(position = "dodge") + ggtitle("dodged bar plot")
p4 <- penguins |> ggplot(aes(x=island, y=species)) + geom_count() + ggtitle("geom_count")
(p1/p2)|(p3/p4)

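[Note] The position parameter of geom_bar (default = "stack")

"identity" : places each object exactly where its value falls, so bars within a category overlap
"dodge" : places the objects within each category side by side
"fill" : shows the share of each group within a category (100% stacked bar chart)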

Numeric vs. numeric : geom_point, geom_smooth, geom_bin2d, geom_hex

p1 <- diamonds |> ggplot(mapping=aes(x=carat, y=price)) + geom_point(alpha=0.1) + ggtitle("point")
p2 <- diamonds |> ggplot(mapping=aes(x=carat, y=price, color=cut)) +
  geom_point(alpha=0.2, shape=21) +
  geom_smooth(aes(color = cut), se=FALSE, fullrange=TRUE) + ggtitle("point + smoothing")
p3 <- diamonds |> ggplot(mapping=aes(x=carat, y=price)) + geom_bin2d() + ggtitle("bin2d")
p4 <- diamonds |> ggplot(mapping=aes(x=carat, y=price)) + geom_hex() + ggtitle("hex")
(p1 / p2) | (p3 / p4)

6. An Example of Visualizing Three Variables

Numeric vs. numeric vs. categorical : ggforce::geom_mark_hull

library(ggforce)
ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_hull(aes(fill = Species, label = Species)) +
  geom_point()

7. Coordinate Systems

coord_cartesian : the default (Cartesian coordinate system)
coord_flip : swaps the x and y axes
coord_polar : polar coordinate system
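
The three coordinate systems can be compared by applying them to the same bar chart. This is a minimal sketch; the mpg variable class is just a convenient categorical variable.

```r
library(ggplot2)

# Base bar chart in the default Cartesian coordinate system
bar <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar(show.legend = FALSE)

# Same layers with x and y swapped: horizontal bars
bar_flipped <- bar + coord_flip()

# Same layers in polar coordinates: bar heights become wedge radii
bar_polar <- bar + coord_polar()
```

Only the coordinate system changes between the three plots; the data, mappings, and geom are identical.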

8. Other Controls

Title : ggtitle(label), labs(title)
Subtitle : ggtitle(subtitle), labs(subtitle)
Caption : labs(caption)
Axes (axis/ticks/range)
- Axis titles : xlab, ylab, labs(x, y)
- Tick positions : scale_x(y)_* (breaks, n.breaks) (setting breaks to NULL removes the ticks, their labels, and the grid lines)
- Tick labels : scale_x(y)_* (labels)
  (setting labels to NULL hides the numbers or categories at the ticks, similar to theme(axis.text.y = element_blank()))
  (functions such as scales::label_dollar and label_percent change the label format)
  (in particular, scale_x(y)_date(datetime) takes a date_labels parameter for showing only the year, month, or day)
- Removing grid lines : theme(panel.grid = element_blank())
- Adjusting the axis range : xlim, ylim, expand_limits(x, y), coord_cartesian(xlim, ylim), scale_x(y)_*(limits)
  (coord_cartesian zooms in on a plot that has already been computed, whereas xlim and ylim fix the range first and then the layers are computed)

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # n.breaks is only a hint, so explicit breaks keep breaks and labels the same length
  scale_x_continuous(breaks = c(2, 3.5, 5, 6.5),
                     labels = c("bad", "not bad", "good", "very good")) +
  scale_y_continuous(breaks = seq(15, 55, by = 15),
                     labels=c("low","middle","high")) + 
  theme(legend.position = "none") + labs(x=NULL, y=NULL)
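
The label helpers and the date_labels parameter mentioned in the list above can be sketched with the economics data set that ships with ggplot2. The ten-year break interval and the "%Y" format are arbitrary choices for the example.

```r
library(ggplot2)

# economics has a Date column (date) and the number of unemployed in thousands (unemploy)
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  # One tick every ten years, showing only the year
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  # Thousands separators on the y-axis tick labels
  scale_y_continuous(labels = scales::label_comma()) +
  labs(x = NULL, y = "Unemployed (thousands)")

p
```

The scales::label_*() functions return formatting functions, so scales::label_percent() or scales::label_dollar() can be passed to labels in the same way.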

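coord_cartesian and xlim/ylim work in slightly different ways.

coord_cartesian acts like a magnifying glass, zooming in or out on part of the full plot,

whereas xlim/ylim drops the data outside the range before the plot is built, so the result can differ.

In the example below the smoothed curves differ slightly, because with xlim/ylim only the data inside the range is used to fit the curve.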

# 1. xlim - data outside the range is removed, so the smoothed curve changes
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth() + xlim(3, 5)

# 2. coord_cartesian - the data is unchanged, only the visible region changes
p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() + geom_smooth() + coord_cartesian(xlim = c(3, 5))
  
p1 + p2

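Legends (settings for the legends generated by mapped aesthetics such as color, fill, linetype, and shape)
- Legend title : labs(color, fill, ...)
- Legend position : theme(legend.position) (e.g. right, left, bottom, top, none)
- Adjusting legend keys
  1. scale_color(fill)_discrete(labels = c("level1" = "label to show1", ...)) changes the key labels
  2. guides(aesthetic = guide_legend(override.aes = list(property = value, ...))) changes how the keys are drawn (e.g. size, alpha)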

# Plot
iris |> ggplot(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Species)) + 
  geom_point() + ggforce::geom_mark_hull() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL) +
  scale_color_discrete(labels = c("setosa"="sp1", "versicolor"="sp2", "virginica"="sp3"))
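
The legend controls can also be combined in a single plot. The sketch below is one possible arrangement; the legend title, the key labels, and the override values are all illustrative choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  # Legend title
  labs(color = "Drive train") +
  # Legend key labels
  scale_color_discrete(labels = c("4" = "4wd", "f" = "front", "r" = "rear")) +
  # Draw the legend keys opaque and larger than the points in the panel
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 4))) +
  # Legend position
  theme(legend.position = "bottom")

p
```

override.aes is useful when the points are drawn with a low alpha, since the legend keys would otherwise be just as faint.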

Scales
- Log transformation : scale_x_log10, scale_y_log10
- Changing colors
  - 1. Discrete colors : scale_color_brewer(palette), scale_color_manual(values = c(level1 = "color1", level2 = "color2"))
  - 2. Continuous colors : scale_color_gradient(low, high), scale_color_viridis_*
Other
- Including math : latex2exp::TeX("$formula$")
- Themes : theme_* functions or the ggthemes package
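
A short sketch of the scale functions listed above. The palette name and the manual colors are arbitrary choices, and theme_minimal() stands in for any of the theme_* functions.

```r
library(ggplot2)

# Log-transformed axes with a ColorBrewer palette for the discrete variable cut
p_log <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.2) +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_brewer(palette = "Blues") +
  theme_minimal()

# Manually assigned colors, one per level of drv
p_manual <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = c("4" = "tomato", "f" = "steelblue", "r" = "darkgreen"))
```

Naming the elements of values ties each color to a specific level, so the assignment does not depend on the order of the levels.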

Note that ggplot2 refers to axes and legends collectively as guides.
