今回は相関についてメモします。

相関

２つ変数があるときに，
「片方の変数の値が大きいほど，もう一方の変数の値も大きい」という傾向がある時，２つの変数間には正の相関があるという。
「片方の変数の値が大きいほど，もう一方の変数の値が小さい」という傾向がある時，２つの変数間には負の相関があるという。
（本橋，2015，『Rで学ぶ統計データ分析』）

相関の指標

f:id:nigimitama:20180302101026p:plain

↑アンスコムの例 - Wikipedia

２変数間の関係を見たいとき，変数が多い場合などは相関係数というひとつの指標にまとめたほうが便利なときもあります。

ただ，やっぱり目視で確認するのが一番良くて，相関係数だけを見て判断するのは危険だということを留意した上で使う必要はあります（上の図のようなケースもあるので）。

ピアソンの積率相関係数

一般的に，「相関係数」といったときはこれを指します。

数式的な定義は以下の通り。

２つの変数 $x,y$ があるとき，その平均を $\bar{x},\bar{y}$ ，共分散を $s_{xy}$ ，標準偏差を $s_x,s_y$ とおく。このとき相関係数は

$r_{xy} = \displaystyle\frac{s_{xy}}{s_x s_y} = \frac{\displaystyle \sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\displaystyle \sum^n_{i=1}(x_i-\bar{x})^2}\sqrt{\displaystyle \sum^n_{i=1}(y_i-\bar{y})^2}}$

上の図のデータ（anscombe）をピアソンの積率相関係数を使って算出すると以下のようになります

pearson = c(cor(anscombe$x1, anscombe$y1),
            cor(anscombe$x2, anscombe$y2),
            cor(anscombe$x3, anscombe$y3),
            cor(anscombe$x4, anscombe$y4))
> pearson
[1] 0.8164205 0.8162365 0.8162867 0.8165214

ピアソンの相関係数は２つの変数の直線的な関係を測る指標であり，非線形な関係は評価出来ません。

スピアマンの順位相関係数

変数の線形性を仮定しないノンパラメトリックな相関係数もあります。

経済学分野では見かけることがなく，私の手元にある統計学の入門書にも載っていない相関係数です（分野によってはたくさん使うのかな？）。

２つの変数を順位に変換し，順位の差などを使って計算を行なうそうです。 $n$ はペアの数。

$\rho = 1-\displaystyle\frac{6\displaystyle\sum^n_{i=1}(x_i-y_i)^2}{n^3-n}$

上の図のデータ（anscombe）をスピアマンの順位相関係数を使って算出すると以下のようになります。

spearman = c(cor(anscombe$x1, anscombe$y1, method = "spearman"),
             cor(anscombe$x2, anscombe$y2, method = "spearman"),
             cor(anscombe$x3, anscombe$y3, method = "spearman"),
             cor(anscombe$x4, anscombe$y4, method = "spearman"))
> spearman
[1] 0.8181818 0.6909091 0.9909091 0.5000000

ケンドールの順位相関係数

$(x_i,y_i), (x_j,y_j)$ という2個のペアのデータについて， $(x_i$ < $x_j), (y_i$ < $y_j)$ のように順位の大小関係が一致している場合，すなわち $(x_i − x_j)(y_i − y_j)>0$ のペアの数を $K$ ，
$(x_i$ < $x_j), (y_i>y_j)$ のように順位の大小関係が不一致の場合，すなわち $(x_i − x_j)(y_i − y_j)$ < $0$ のペアの数を $L$ として $K-L$ を ${}_n\mathrm{C}_2$ で割った値をケンドールの順位相関係数 $\tau$ とする。

$\displaystyle \tau =\frac{K-L}{{}_n\mathrm{C}_2} = \frac{K-L}{\frac{1}{2}n(n-1)}$

上の図のデータ（anscombe）をケンドールの順位相関係数を使って算出すると以下のようになります。

kendall = c(cor(anscombe$x1, anscombe$y1, method = "kendall"),
             cor(anscombe$x2, anscombe$y2, method = "kendall"),
             cor(anscombe$x3, anscombe$y3, method = "kendall"),
             cor(anscombe$x4, anscombe$y4, method = "kendall"))
> kendall
[1] 0.6363636 0.5636364 0.9636364 0.4264014

また，最近登場した指標では，MICやHSICといったものもあるらしいです。

MIC ( Maximal information coefficient )

MICの解説は，林先生のブログやLogics of Blueさんの解説が分かり易かったです。

"相関"の話＆そのついでに"21世紀の相関（MIC）"の話（ややマニア向け） - Take a Risk：林岳彦の研究メモ

http://takehiko-i-hayashi.hatenablog.com/entry/20130426/1366948560

MICはピアソンの相関係数のようには単純な数式では表せず、コンピュータによってゴリゴリと計算されます。基本的なアルゴリズムとしては、データ散布図を様々な数のグリッド（＝解像度）で区切っていきながら、様々な解像度の値において相互情報量が最大（＝各グリッド内に含まれるデータ密度のコントラストが最大となるようなイメージ）となるような区切り方を決定し、それらを規格化したのちの最大の情報量をMICの値として選択しているようです

MICの解説

https://www.slideshare.net/logics-of-blue/mic-31810194

相互情報量（Mutual Information）は

$I(X;Y)=\displaystyle \sum_{y\in{Y}}\sum_{x\in{X}}p(x,y)\log{\frac{p(x,y)}{p(x)p(y)}}$

であり，独立 $p(x,y)=p(x)p(y)$ のときには相互情報量はゼロになります。

上の図のデータ（anscombe）の算出結果

library(minerva)
mic = c(mine(anscombe$x1, anscombe$y1)$MIC,
        mine(anscombe$x2, anscombe$y2)$MIC,
        mine(anscombe$x3, anscombe$y3)$MIC,
        mine(anscombe$x4, anscombe$y4)$MIC)
> mic
[1] 0.9940302 0.6394725 0.9940302 0.4394970

ただし，MICはノイズに弱いという欠点があるようです。 - 21世紀の手法対決 (MIC vs HSIC) https://www.slideshare.net/motivic/tokyo-r-lt-25759212

HSIC ( Hilbert-Schmidt Independence Criteria )

データを再生核ヒルベルト空間(RKHSs)に飛ばし、そこでの関係性を見ることで非線形でも関係性を見出す手法のようです。

私は数学の勉強不足により理論のほうは理解できていません…

HSICは相関係数というよりも独立性の検定として使われるようです。

参考

Measuring Statistical Dependence with Hilbert-Schmidt Norms (Gretton et al., Algorithmic Learning Theory, 2005) http://www.gatsby.ucl.ac.uk/~gretton/papers/GreBouSmoSch05.pdf
「21世紀の相関」HSICの原論文"Measuring Statistical Dependence with Hilbert-Schmidt Norms" (Gretton et al., Algorithmic Learning Theory, 2005)メモランダム - 六本木で働くデータサイエンティストのブログ
http://tjo.hatenablog.com/entry/2015/01/15/190000

Rで行う場合，{dHSIC}や{hsicCCA}や{EMD}といったパッケージがあるようです。

# dHSICのp.value
library(dHSIC)
hsic_test_pvalue = c(dhsic.test(anscombe$x1, anscombe$y1, kernel = 'gaussian')$p.value,
                     dhsic.test(anscombe$x2, anscombe$y2, kernel = 'gaussian')$p.value,
                     dhsic.test(anscombe$x3, anscombe$y3, kernel = 'gaussian')$p.value,
                     dhsic.test(anscombe$x4, anscombe$y4, kernel = 'gaussian')$p.value)
> hsic_test_pvalue
[1] 0.008991009 0.000999001 0.000999001 0.087912088

HSIC以降の独立性の検定

HSICも新たな手法に上回られているようで，より良い指標の研究はどんどん進んでいるようですね

第50回R勉強会@東京で発表してきた - INPUTしたらOUTPUT!

http://estrellita.hatenablog.com/entry/2015/09/05/232731

MICの課題

ノイズに弱い

グリッドでやるやり方はオワコン

これからは再生核ヒルベルト空間

と思ってた時代がありました(2年前)

HSIC/dCorよりもTICを使った方がよい

SDDPも良いかも

An Empirical Comparison of the Maximal and Total Information Coefficients to Leading Measures of Dependence

Regarding equitability, our analysis finds that $MIC_e$ is the most equitable method on functional relationships in most of the settings we considered.

Regarding power against independence, we find that $TIC_e$ and Heller and Gorne's $S^{DDP}$ [Heller et al., 2016] share state-of-the-art performance, with several other methods achieving excellent power as well.

$TIC_e$

total information coefficient（総情報係数）で $TIC_e$ らしいです。

原文

Yakir A. Reshef, David N. Reshef, Hilary K. Finucane, Pardis C. Sabeti, and Michael Mitzenmacher. Measuring dependence powerfully and equitably. Journal of Machine Learning Research, 17(212):1-63, 2016. http://jmlr.org/papers/v17/15-308.html

…理論はいつか時間のあるときに勉強してみたいと思います。

Rでは{minerva}でmine()を使えばMICなどと一緒に出力されるようなので，使う事自体は簡単にできます。

library(minerva)
tic = c(mine(anscombe$x1, anscombe$y1)$TIC,
        mine(anscombe$x2, anscombe$y2)$TIC,
        mine(anscombe$x3, anscombe$y3)$TIC,
        mine(anscombe$x4, anscombe$y4)$TIC)
> tic
[1] 0.9940302 0.6394725 0.9940302 0.4394970

アンスコムの例くらいのデータではノイズが少ないのか結果はMICと同じでした

> data.frame(mic, tic, 'equal' = (mic == tic))
        mic       tic equal
1 0.9940302 0.9940302  TRUE
2 0.6394725 0.6394725  TRUE
3 0.9940302 0.9940302  TRUE
4 0.4394970 0.4394970  TRUE

$S^{DDP}$

原文

Heller, R., Heller, Y., Kaufman S., Brill B, & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables, JMLR 17(29):1-54 http://www.jmlr.org/papers/volume17/14-441/14-441.pdf
The HHG Package - Multivariate and Univariate non-parametric Independence and K-Sample tests
https://cran.r-project.org/web/packages/HHG/vignettes/HHG.html

Rでの実装はlibrary(HHG)でできます。これもいつかちゃんと理論を理解して実行したいところです。

まとめ：アンスコムのデータと各手法

ちょこちょことアンスコムのデータの相関係数を算出していたものをまとめると以下のようになります。

> data.frame(pearson, spearman, kendall, mic, tic)
    pearson  spearman   kendall       mic       tic
1 0.8164205 0.8181818 0.6363636 0.9940302 0.9940302
2 0.8162365 0.6909091 0.5636364 0.6394725 0.6394725
3 0.8162867 0.9909091 0.9636364 0.9940302 0.9940302
4 0.8165214 0.5000000 0.4264014 0.4394970 0.4394970

f:id:nigimitama:20180302163703p:plain

f:id:nigimitama:20180302164859p:plain

x2,y2のグラフは非線形ながらはっきりした傾向があるように思いますがMICはおもったより低いですね。

おまけ：もう少し複雑なデータで相関係数を比較

f:id:nigimitama:20180302175605p:plain

アンスコムのほうでは順位相関係数もいい感じでしたが，こちらだとだめですね。 MICの凄さが際立ちます。

（もしグラフのコードに関心のある方がいたらこちらをどうぞ）

棒グラフ

anscor <- data.frame(pearson, spearman, kendall, mic, tic)
anscor['datasets'] <- as.factor(rownames(anscor))
library(ggplot2)
library(reshape2)
anscor_melted <- melt(anscor)
colnames(anscor_melted) <- c('dataset','method','value')
ggplot(anscor_melted, aes(x = dataset, y = value, fill = method))+
  geom_col(position = 'dodge')+
  geom_label(aes(label = round(value, digits = 3), y = value + 0.01), alpha = 0.3, position = position_dodge(0.9), vjust = 0)+
  labs(y = 'correlation coefficient')

グラフを４つまとめて表示する関数： http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

アンスコムのグラフ

p1 <- ggplot(anscombe, aes(x = x1, y = y1))+
  geom_point(color = 'orange', size = 3)+
  geom_smooth(method = "lm", se = F)+
  annotate("text", x = 6, y = 11, label = "pearson : 0.816")+
  annotate("text", x = 6, y = 10.5, label = "spearman : 0.818", color = 'darkorange')+
  annotate("text", x = 6, y = 10, label = "kendall : 0.636", color = 'darkgreen')+
  annotate("text", x = 6, y = 9.5, label = "MIC : 0.994", color = 'darkblue')
p2 <- ggplot(anscombe, aes(x = x2, y = y2))+
  geom_point(color = 'orange', size = 3)+
  geom_smooth(method = "lm", se = F)+
  annotate("text", x = 12, y = 4.5, label = "pearson : 0.816")+
  annotate("text", x = 12, y = 4, label = "spearman : 0.691", color = 'darkorange')+
  annotate("text", x = 12, y = 3.5, label = "kendall : 0.564", color = 'darkgreen')+
  annotate("text", x = 12, y = 3, label = "MIC : 0.639", color = 'darkblue')
p3 <- ggplot(anscombe, aes(x = x3, y = y3))+
  geom_point(color = 'orange', size = 3)+
  geom_smooth(method = "lm", se = F)+
  annotate("text", x = 6, y = 13,  label = "pearson : 0.816")+
  annotate("text", x = 6, y = 12.5, label = "spearman : 0.991", color = 'darkorange')+
  annotate("text", x = 6, y = 12,  label = "kendall : 0.964", color = 'darkgreen')+
  annotate("text", x = 6, y = 11.5, label = "MIC : 0.994", color = 'darkblue')
p4 <- ggplot(anscombe, aes(x = x4, y = y4))+
  geom_point(color = 'orange', size = 3)+
  geom_smooth(method = "lm", se = F)+
  annotate("text", x = 10, y = 13, label = "pearson : 0.816")+
  annotate("text", x = 10, y = 12.5, label = "spearman : 0.500", color = 'darkorange')+
  annotate("text", x = 10, y = 12, label = "kendall : 0.426", color = 'darkgreen')+
  annotate("text", x = 10, y = 11.5, label = "MIC : 0.439", color = 'darkblue')
multiplot(p1, p3, p2, p4, cols=2)