闇のExcelに対する防衛術 - ★データ解析備忘録★

この記事は R Advent Calendar 2019 の25日目の記事です。

Rユーザにとっての闇、それはデータソースとして渡されるエクセルファイルでしょう(異論は認める)。

tidyverseの中にある readxl パッケージによってだいぶ楽になったとはいえ、まだまだ手元でエクセルファイルを直さないとまともに読み込めない、というのが現状でした（セル結合とかキツイですよね...）。

そんななか、そのあたりを面倒見てくれるパッケージが登場していますので、それを紹介したいと思います。本記事で紹介するのは以下のパッケージです。

tidyxl
unpivotr

この2つはペアで使うことが多いようです。

なお、tidyxl パッケージと unpivotr パッケージについては本記事の内容はほぼ全て以下のWeb Bookに詳しく書かれています。

nacnudus.github.io

本日のゴール

本日のゴールは、上記のWeb Book の9.1で紹介されている、以下のような、オーストラリアで実施された結婚についてのの調査結果が記録されたエクセルファイルのシートを tidyに（ここ重要） 読み込むこととします。

このファイルは、こちらからダウンロードできます。

Table 1

エクセルファイルの2シート目、"Table 1" シートには2つのテーブルがあります。1つは「Response Clear」というテーブル、もう1つは「Eligible Participants」です。ぱっと見ただけでセル結合が複数あるわ、そもそもテーブルが1シートで複数あるわで、 readxl パッケージでは太刀打ちできそうにありません。しかし、このようなエクセルシートは世にあふれており、我々Rユーザがこのような闇と対峙しなければならない場面は少なくありません。

パッケージのインストール

remotes::install_github("nacnudus/tidyxl")
remotes::install_github("nacnudus/unpivotr")

注意: 上記のコマンドを実行すると、GitHubの最新版（つまり開発版）がインストールされます。上記Web Bookやドキュメントとは一部書き方が異なります（特にunpivotrパッケージ）。本記事では開発版で採用されている書き方で解説を行います。

tidyxl

tidyxl パッケージの作者が考えたのが、「すべてのセルの情報（空白か、そうでないのか、結合されているのか、色がついているのか、、、etc）を1つデータフレームにしてしまう」ことです。

データフレームにしてしまえば、僕らがよく知る tidyverse を使った変形に持っていくことができそうですね。

それを実現するのが、 tidyxl::xlsx_cells() です。上記のエクセルファイルの情報をこの関数を使って取得してみましょう。

library(magrittr) # パイプ演算子のため

FILE <- "/path/to/ozmarriage.xlsx"

cells <- tidyxl::xlsx_cells(FILE)
head(cells)

sheet	address	row	col	is_blank	data_type	error	logical	numeric	date	character	character_formatted	formula	is_array	formula_ref	formula_group	comment	height	width	style_format	local_format_id
Contents	A1	1	1	FALSE	character	NA	NA	NA	NA	Australian Bureau of Statistics	list(character = " Australian Bureau of Statistics", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	60	7.832031	Normal	138
Contents	B1	1	2	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	60	7.832031	Normal	138
Contents	C1	1	3	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	60	140.832031	Normal	138
Contents	D1	1	4	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	60	52.000000	Normal	90
Contents	E1	1	5	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	60	8.500000	Normal	90
Contents	F1	1	6	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	60	8.332031	Normal	90

これ、なかなかすごくないですか？なんとフォントの色、フォントの種類、セルが塗られているかまで読み取っています。

ちなみに、セル結合はどんな感じになるかというと、例えば「"Table 1" シートのB5〜C5」であれば、

sheet	address	row	col	is_blank	data_type	error	logical	numeric	date	character	character_formatted	formula	is_array	formula_ref	formula_group	comment	height	width	style_format	local_format_id
Table 1	B5	5	2	FALSE	character	NA	NA	NA	NA	Response clear	list(character = "Response clear", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	18.75	10.50000	Normal	141
Table 1	C5	5	3	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	18.75	10.83203	Normal	141
Table 1	D5	5	4	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	18.75	10.50000	Normal	141
Table 1	E5	5	5	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	18.75	10.83203	Normal	141
Table 1	F5	5	6	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	18.75	11.50000	Normal	141
Table 1	G5	5	7	TRUE	blank	NA	NA	NA	NA	NA	NULL	NA	FALSE	NA	NA	NA	18.75	10.83203	Normal	141

のように、「結合されたセルは最初の1セルにだけ値が入っている状態で読み込まれる」事がわかります。

unpivotr

さて、 tidyxl パッケージの役割はエクセルファイルの構造を抽出することですので、加工は unpivotr の担当です。なぜこのパッケージが必要かというと、上記のようにエクセルのセル構造を抽出できたはいいものの、値の入っているセルだけ抜き出してもテーブルの名前や、表の項目とうまく紐付いてくれないからです。

ここで問題になってくるのは、読み込みたい表には複数のヘッダーがあることです。例えば、2つあるテーブルのうち「Response Clear」のほうであれば、「"Response Clear"」「"Yes", "No", "Total"」「"no.", "%"」という3つのヘッダーがあります。これをうまい具合にセルの値と対応させなければなりません。 f:id:songcunyouzai:20191225014218p:plain

そこで、unpivotr::behead() 関数を使います。この関数は「ヘッダーの位置を指定すると、そのヘッダーにで構成されるセルを自動で紐付けてくれる」関数です。

具体的にはこんな感じです。

table1  <-
  cells %>%
  dplyr::filter(sheet == "Table 1", row >= 5L, !is_blank) %>% # 5行目以降で、値が入っているセル
  dplyr::mutate(character = stringr::str_trim(character)) %>% # セル中の文字列の空白を除去
  unpivotr::behead("up-left", "population") 

head(table1)

sheet	address	row	col	is_blank	data_type	error	logical	numeric	date	character	character_formatted	formula	is_array	formula_ref	formula_group	comment	height	width	style_format	local_format_id	population
Table 1	B6	6	2	FALSE	character	NA	NA	NA	NA	Yes	list(character = "Yes", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	23.25	10.50000	Style3	143	Response clear
Table 1	D6	6	4	FALSE	character	NA	NA	NA	NA	No	list(character = "No", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	23.25	10.50000	Normal 2	140	Response clear
Table 1	F6	6	6	FALSE	character	NA	NA	NA	NA	Total	list(character = "Total", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	23.25	11.50000	Normal 2	140	Response clear
Table 1	B7	7	2	FALSE	character	NA	NA	NA	NA	no.	list(character = "no.", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	11.25	10.50000	Style3	60	Response clear
Table 1	C7	7	3	FALSE	character	NA	NA	NA	NA	%	list(character = "%", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	11.25	10.83203	Style3	60	Response clear
Table 1	D7	7	4	FALSE	character	NA	NA	NA	NA	no.	list(character = "no.", bold = NA, italic = NA, underline = NA, strike = NA, vertAlign = NA, size = NA, color_rgb = NA, color_theme = NA, color_indexed = NA, color_tint = NA, font = NA, family = NA, scheme = NA)	NA	FALSE	NA	NA	NA	11.25	10.50000	Style3	60	Response clear

この時点で、最終列に population という列名で「"Response Clear"」が入っていることがわかります（実際は下の方には「"Eligible Participants"」も入ってます）。

これがすなわち、ヘッダーがセルと紐付いた状態です。

この unpivotr::behead() の書き方は少し特殊で、上記コードで "up-left" というのは 「データから見たヘッダーの位置」 です。これはもとのエクセルと照らし合わせなきゃわからないのですが、これを指定するといい感じに表のヘッダーを認識できます。

どの方向を指定すればいいかは、 ?unpivotr::direction を実行するとヘルプに書いてあります。以下はその抜粋。

f:id:songcunyouzai:20191225015934p:plain f:id:songcunyouzai:20191225020045p:plain

これに沿って残りのヘッダーをつけてやって、必要な列を選択すれば tidy な読み込みは完成で、以下のようになります。

table1  <-
  cells %>%
  dplyr::filter(sheet == "Table 1", row >= 5L, !is_blank) %>%
  dplyr::mutate(character = str_trim(character)) %>%
  unpivotr::behead("up-left", "population") %>%
  unpivotr::behead("up", "response") %>%
  unpivotr::behead("up", "unit") %>%
  unpivotr::behead("left", "state") %>%
  dplyr::arrange(row, col) %>%
  dplyr::select(row, data_type, numeric, state, population, response, unit) %>%
  unpivotr::spatter(unit) %>%
  dplyr::select(-row)

table1

state	population	response	%	no.
New South Wales	Eligible Participants	Non-responding	20.5	1065445
New South Wales	Eligible Participants	Response clear	79.2	4111200
New South Wales	Eligible Participants	Response not clear(a)	0.2	11036
New South Wales	Eligible Participants	Total	100.0	5187681
New South Wales	Response clear	No	42.2	1736838
New South Wales	Response clear	Total	100.0	4111200
New South Wales	Response clear	Yes	57.8	2374362
Victoria	Eligible Participants	Non-responding	18.3	743634
Victoria	Eligible Participants	Response clear	81.4	3306727
Victoria	Eligible Participants	Response not clear(a)	0.3	11028
Victoria	Eligible Participants	Total	100.0	4061389
Victoria	Response clear	No	35.1	1161098
Victoria	Response clear	Total	100.0	3306727
Victoria	Response clear	Yes	64.9	2145629
Queensland	Eligible Participants	Non-responding	22.1	695710
Queensland	Eligible Participants	Response clear	77.7	2448075
Queensland	Eligible Participants	Response not clear(a)	0.2	7088
Queensland	Eligible Participants	Total	100.0	3150873
Queensland	Response clear	No	39.3	961015
Queensland	Response clear	Total	100.0	2448075
Queensland	Response clear	Yes	60.7	1487060
South Australia	Eligible Participants	Non-responding	20.3	242027
South Australia	Eligible Participants	Response clear	79.5	948775
South Australia	Eligible Participants	Response not clear(a)	0.2	2778
South Australia	Eligible Participants	Total	100.0	1193580
South Australia	Response clear	No	37.5	356247
South Australia	Response clear	Total	100.0	948775
South Australia	Response clear	Yes	62.5	592528
Western Australia	Eligible Participants	Non-responding	21.6	346333
Western Australia	Eligible Participants	Response clear	78.3	1257499
Western Australia	Eligible Participants	Response not clear(a)	0.2	3188
Western Australia	Eligible Participants	Total	100.0	1607020
Western Australia	Response clear	No	36.3	455924
Western Australia	Response clear	Total	100.0	1257499
Western Australia	Response clear	Yes	63.7	801575
Tasmania	Eligible Participants	Non-responding	20.3	77020
Tasmania	Eligible Participants	Response clear	79.5	301603
Tasmania	Eligible Participants	Response not clear(a)	0.2	805
Tasmania	Eligible Participants	Total	100.0	379428
Tasmania	Response clear	No	36.4	109655
Tasmania	Response clear	Total	100.0	301603
Tasmania	Response clear	Yes	63.6	191948
Northern Territory(b)	Eligible Participants	Non-responding	41.6	57496
Northern Territory(b)	Eligible Participants	Response clear	58.2	80376
Northern Territory(b)	Eligible Participants	Response not clear(a)	0.2	229
Northern Territory(b)	Eligible Participants	Total	100.0	138101
Northern Territory(b)	Response clear	No	39.4	31690
Northern Territory(b)	Response clear	Total	100.0	80376
Northern Territory(b)	Response clear	Yes	60.6	48686
Australian Capital Territory(c)	Eligible Participants	Non-responding	17.6	50595
Australian Capital Territory(c)	Eligible Participants	Response clear	82.3	236979
Australian Capital Territory(c)	Eligible Participants	Response not clear(a)	0.2	534
Australian Capital Territory(c)	Eligible Participants	Total	100.0	288108
Australian Capital Territory(c)	Response clear	No	26.0	61520
Australian Capital Territory(c)	Response clear	Total	100.0	236979
Australian Capital Territory(c)	Response clear	Yes	74.0	175459
Australia	Eligible Participants	Non-responding	20.5	3278260
Australia	Eligible Participants	Response clear	79.3	12691234
Australia	Eligible Participants	Response not clear(a)	0.2	36686
Australia	Eligible Participants	Total	100.0	16006180
Australia	Response clear	No	38.4	4873987
Australia	Response clear	Total	100.0	12691234
Australia	Response clear	Yes	61.6	7817247