2019-09-06

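How Did We Reduce Context Switching?

Scrum Development Process Scrum

This is Watanabe, Scrum Master.

A while ago at an event, a few of us were saying that you rarely see "idea notebook" style material aimed at teams that have just started Scrum. So I decided to share some of the things our team has actually tried.

This time the theme is reducing the cost of context switching.

Note: In hindsight, the book 「スクラム現場ガイド」 (The Scrum Field Guide) is exactly that kind of idea notebook, so if you are interested, please pick up a copy.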

book.mynavi.jp

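What this article covers

What is context switching?
Experience context switching for yourself!
Three things we did to reduce context switching

Intended readers

Anyone who wants to improve individual or team performance (whether or not you have adopted Scrum)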

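What is context switching?

In a word, context switching means shifting your thinking from task A to task B, which has a different context.

It may look like you are merely changing tasks, but when you finish task B and return to task A, you first have to recall how far task A had progressed and what you were supposed to do next. That recovery step is waste.

It is not so bad if you can remember. In the worst case you cannot recall anything and end up having to ask, "That thing we talked about a while back... what was I supposed to do again?"

Have any of you ever thought, "On days packed with meetings, my other work doesn't move forward at all"?

Most likely, context switches are occurring between "meetings" and "work", and mental resources are being spent on recalling the original task and getting back into it.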

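Experience context switching for yourself!

That said, I understand the feeling of "if it's a simple task, a little switching is fine!", so let's feel the cost through a simple exercise.

Below is the method I use with my team. It takes about five minutes, so please give it a try.

What you need

Paper
A pen
A stopwatch (a smartphone is fine)

Overview

You write 10 letters of the alphabet (a to j), 10 hiragana (あ to こ), and 10 digits (0 to 9) in two different orders, then measure and compare how long each takes.

Think of it as three kinds of task: "writing alphabet letters", "writing hiragana", and "writing digits".

Steps

1. First, write the three character types one character at a time in rotation. (a → あ → 0 → b → い → 1 ...)
2. Next, write one character type all the way through before moving on to the next. (a to j → あ to こ → 0 to 9)
3. Compare the time it took to finish steps 1 and 2.

How did it go?

Writing characters is something we do every day, and none of these characters are difficult, yet didn't it feel awkward to write them?

If even a seemingly simple task is affected, then naturally writing code, testing, and every other kind of work, development or not, is affected too.

So, to improve individual and team performance, it is important to reduce context switches as much as possible and create conditions where people can keep working in the same context.

In the following sections I will introduce several things our team actually tried in order to reduce context switching, along with candid comments from our engineers.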

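What we did ①: Controlling operational work (interrupt work)

The problem

Operational work (interrupt work), and the checks and consultations that came with it, occurred frequently
Operational work took up roughly 30 to 60% of all effort in a sprint

What we did

Assigned a staff member dedicated to operational work, reducing the time other staff spent checking and responding
Because several requests could arrive at once, created a kanban board just for operational work, with a rule that items are lined up in a single column in priority order

How it went

According to the engineers: "This was really good. It helped a lot."

By reducing interruptions (context switches), and simply because there was now a cap on how much operational work could be handled, the share of operational work fell from the original 30 to 60% to around 10% at most.

This measure had one of the biggest impacts of all the kaizen our team has done. That said, working out a flow that suited the team took a lot of thought, and because it affected many people, we relied on the understanding and cooperation of many.

To everyone involved who understood the situation of the project and the team and cooperated, thank you very much. In particular, thank you to Y岸-san and K林-san, who advised us while we worked out the flow, to T中-san, who rolled it out smoothly across the whole group, and above all to S-san, who took on all of the work single-handedly.

Each team will face its own challenges in putting this into practice, but I suspect many teams struggle with the same problem, so please give it a try.

Incidentally, chapter 14 of 「スクラム現場ガイド」 introduces a similar case under the name "dedicated team".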

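What we did ②: Moving the product owner's seat

The problem

(I will leave the details to your imagination, but) we had to feel our way through development with a great many unknowns, and work kept stopping each time we needed to check or consult on something

What we did

Explained the cost of context switching to the product owner, who understood and moved near the team, creating an environment where quick checks and consultations are easy

How it went

Truly complicated matters still end up written down, since settling them only verbally causes trouble later. Even so, the engineers rated it fairly well: "It was good to be able to ask simple questions, and to build something and get a call on direction on the spot."

Incidentally, having the product owner nearby is said to carry a risk of excessive interference, but our team did not have that problem.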

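What we did ③: Adjusting meetings

The problem

There were a lot of meetings
Meeting times were scattered through the day, leaving no long stretch for focused work

What we did

For meetings where attendance was not essential, negotiated with the people involved to let us skip them or make attendance optional
For must-attend meetings whose time could be changed, moved them to the morning and kept afternoons as free as possible

How it went

According to the engineers: "I no longer have to waste thought on 'there are only 30 minutes until the meeting, so I'll do something simple', which made it easier to get on with work."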

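By the way

The loss of efficiency from context switching is known to occur not only at the task level but also at the project level.

As a rough illustration, suppose there are three projects for products A, B, and C, as shown below. (Product A is complete once the required tasks A1, A2, and A3 are done.)

f:id:unifa_tech:20190903175619p:plain

First, let's look at the schedule when everything is a priority, that is, when the team juggles multiple projects at once.

Context switches occur between tasks, so with a gap inserted between tasks, the schedule looks like this.

f:id:unifa_tech:20190903170856p:plain

Next, let's look at the schedule when the projects are handled one at a time.

Context switches occur only at the A→B and B→C changeovers, so with gaps inserted there, the schedule looks like this.

f:id:unifa_tech:20190903174725p:plain

Handling the projects one at a time, with context switches kept to a minimum, leaves less waste (gaps), so it is likely to finish sooner overall.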
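The comparison between the two schedules can be sketched as simple arithmetic. The task order for the interleaved case, the task duration of 1.0, and the switch cost of 0.5 are all assumptions made for illustration (the original figures are images); the function name `total_time` is hypothetical.

```python
def total_time(order, task_time=1.0, switch_cost=0.5):
    """Return (elapsed time, number of switches) for tasks done in `order`.

    A switch is counted whenever two consecutive tasks belong to
    different projects (the first character of the task name).
    """
    switches = sum(1 for prev, cur in zip(order, order[1:]) if prev[0] != cur[0])
    return len(order) * task_time + switches * switch_cost, switches


# Assumed orderings: round-robin across projects vs. one project at a time.
interleaved = ["A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3"]
sequential = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"]

print(total_time(interleaved))  # (13.0, 8)
print(total_time(sequential))   # (10.0, 2)
```

With these assumed numbers the same nine tasks take 13 units when interleaved and 10 units when done project by project; the difference comes entirely from the eight switches versus two.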

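(This strays a little from context switching, but there is also the advantage that each of products A, B, and C can be released earlier and deliver more value.)

Running projects in parallel is sometimes unavoidable, so I suggest taking this into account where you can.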

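Supplement

The table on the page below shows how, as the number of parallel projects grows, context switching causes losses and the share of time available for each project shrinks. Have a look if you are interested.

www.scruminc.com

It is striking that once you juggle three or more projects, the share lost to context switching (waste) becomes larger than the share you can spend on any single project.

In closing

How was it?

I hope this offers some hints to anyone who wants to raise their team's performance as far as possible.

If you have already tried measures of your own, I would be moved to tears by a comment saying "here's what our team tried!"

This time I focused on "reducing the cost of context switching", but I hope to write more idea-notebook articles on other themes.

In this way, everyone in our team and company works together, repeating kaizen every day, to reach our goals efficiently.

If this way of working interests you even a little, I would be glad if you took a look at the link below.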

herp.careers

2019-09-03

not 0, but 1

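I'm Miyoshi from the design team.

This time I will write something designer-like.

You often hear the saying "Art is self-expression; design is problem solving." Personally I do not think that is 100% true (occasionally design becomes art, and art solves a problem), but there is no doubt that design is fundamentally about approaching a clearly defined problem.

Designers are generally assumed to need great taste, but sensibility is only an aid that makes the goal of solving the problem easier to reach. I believe what matters most is being good at problem solving.

Rather than creativity that produces originality from 0, it is more accurate to say we build up from 1, drawing on accumulated knowledge and experience.

Let me explain using the process of making our development team T-shirt, which I created recently, as an example.

Here are the design proposals.

f:id:unifa_tech:20190903170026j:plain

Among our in-house design work, the development team T-shirt allows a great deal of design freedom. Even so, producing it with only visual appeal in mind would not solve the problem. First, define what the best answer looks like. In this case, the effects initially asked for were:

①Evokes engineers　②Meant to be worn when speaking at events

First, to stand out when worn on stage, I chose a front print. And so that attendees sitting behind us can recognize us when we are seated at an event, I placed the company logo on the back.

I made the development team's theme, 「保育をハックする」 (Hack childcare), the centerpiece (though a few different phrases are mixed in...), and chose a black body so that it resonates with the strong wording.

After that, I gave the image concrete form around the axis of "hacking". This time I mainly used geometric shapes to strengthen the effect.

For example, treating shapes as walls and breaking through them with lines, to express "a hacker who breaks existing rules".

f:id:unifa_tech:20190903165826j:plain

In the end, considering that women would wear it too, I settled on a design with unisex elements. The multicolored line design is meant to evoke the terminal colors familiar to engineers.

In design, you must be able to explain every choice you made in words. As long as there is a clear purpose, everything must have a meaning.

Because design is always addressed to other people, it needs to be persuasive. I think it is work that cannot exist without self-sacrifice.

Making a single piece requires time to think and cooperation within the team (objective perspectives and reviews). In my case especially, I drift toward the abstract if I am not careful, so it is only with calm feedback from the team that the work finally comes together as design.

Next time, I would like to talk about how design is an accumulation of unglamorous work.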

2019-08-27

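Introduction to Image Captioning

AI Python Deep Learning

I'm Asano from the R&D department. One of the hot areas in deep learning is image captioning, which automatically generates a description of an image. It is perfect for the greedy among you who want to work on both computer vision, which interprets images, and natural language processing, which generates appropriate language. Nursery schools have a lot of paperwork such as daily logs and parent-contact notebooks. If a machine could produce a simple description of the scene from a single photo, the burden of that paperwork might be reduced.

Basic approach

f:id:unifa_tech:20190823120903j:plain:w450 — Basic structure of a network for image captioning

The target image is fed into a Convolutional Neural Network (CNN) and converted into a vector in feature space, and the caption written so far (a sequence of words) is fed into a Recurrent Neural Network (RNN) and likewise turned into a feature vector. Both vectors are then fed into a fully connected network (FC) to predict the next word. That is the basic flow.

Training in practice

f:id:unifa_tech:20190823122511j:plain:w250:left

For example, to train the model to generate the caption 「サングラスをかけた赤ちゃんが水辺でくつろいでいる」 (a baby wearing sunglasses is relaxing by the water) for this image (Source: PhotoAC), we first feed the image to the CNN and a word marking the start of the sentence to the RNN, and recompute the weights of each network so that the output word becomes 「サングラス」 (sunglasses).

In the next step, the CNN input stays the same and the RNN input becomes 「サングラス」. The network weights are adjusted so that the output becomes 「を」, the particle that follows it (figure below). In this way, training proceeds on many images paired with ground-truth captions, aiming for a model that generates correct sentences for a wide variety of images.

f:id:unifa_tech:20190823125144j:plain:w550 — Example of inputs and outputs during network training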
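The step-by-step training described above amounts to expanding each caption into (partial sequence, next word) pairs. The following is a minimal sketch of that expansion in plain Python; the function name, the `<start>`/`<end>` markers, and the way the example caption is split into words are assumptions for illustration, not taken from the original implementation.

```python
def make_training_pairs(caption_tokens, start="<start>", end="<end>"):
    """Expand one tokenized caption into (partial sequence, next word) pairs.

    Each pair corresponds to one training step: the partial sequence is the
    RNN input and the next word is the expected output.
    """
    seq = [start] + list(caption_tokens) + [end]
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


# Hypothetical tokenization of the beginning of the example caption.
pairs = make_training_pairs(["サングラス", "を", "かけた", "赤ちゃん"])
for partial, target in pairs:
    print(partial, "->", target)
```

In a real pipeline each partial sequence would be converted to integer word IDs and padded to a fixed length before being fed to the network, alongside the image feature vector.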

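Implementation

The basic structure is quite simple, so the model definition in Keras is short, as shown below. To shorten training time, the CNN part uses InceptionV3 pretrained on ImageNet, and the feature vector for each image was computed in advance. The RNN part uses an LSTM (Long Short-Term Memory). Flickr8k was used as the training data.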

from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, concatenate

def define_model(vocab_size, max_length):
    #photo feature extractor
    inputs1 = Input(shape=(2048, ))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    #sequence model
    inputs2 = Input(shape=(max_length, ))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    
    #decoder model
    decoder1 = concatenate([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
        
    return model

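Results

f:id:unifa_tech:20190823134658j:plain:w250:left

Looking at the training curves (left), the training loss (blue line) falls steadily, but the validation loss (orange line) stops improving after only a few epochs. Because the architecture here is very simple and the CNN part uses pretrained weights that are not updated, it is to be expected that generalization is not very good.

f:id:unifa_tech:20190823122323j:plain:w350:right

Let's see what caption is actually generated for an image that was not used in training (Source: COCO) (figure at right). The text at the top of the image, "man in red shirt is riding bike on the street", is the caption the model generated automatically. Not exactly right, but not far off either.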
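The post does not show how a caption is produced at inference time. One common approach, sketched here under that assumption, is greedy decoding: start from the start-of-sentence word and repeatedly append the word the model predicts next. The function names and the `<start>`/`<end>` markers are hypothetical, and the "model" below is a stand-in lookup rather than the trained network.

```python
def greedy_caption(predict_next, max_length, start="<start>", end="<end>"):
    """Build a caption word by word, always taking the predicted next word."""
    words = [start]
    for _ in range(max_length):
        nxt = predict_next(words)
        if nxt == end:
            break
        words.append(nxt)
    return " ".join(words[1:])


# Stand-in for the trained network: returns the next word of a fixed caption.
canned = "man in red shirt is riding bike on the street".split() + ["<end>"]
fake_model = lambda words: canned[len(words) - 1]

print(greedy_caption(fake_model, max_length=20))
```

With a real model, `predict_next` would encode the words so far, run the network together with the image features, and return the word with the highest softmax probability.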

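Summary

We have looked at the overall structure and flow of image captioning. In the area where images (and video) meet language, there are many other techniques besides image captioning, such as video captioning, action understanding in video, visual question answering, and video summarization, that could be useful in childcare as well. I will keep watching this space.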

2019-08-16

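Trying Out Rejection Sampling, a Basic Sampling Algorithm

Introduction

Hello, I'm Shimada from the R&D department. This time I will introduce one basic sampling algorithm used in statistical learning.

There are several sampling methods in statistical learning. Broadly, they divide into methods that do not use Markov chain Monte Carlo (MCMC) and methods that do.

Rejection sampling is the most basic of the non-MCMC methods. The algorithm requires no difficult math, so it is easy to understand intuitively.

This post gives a brief explanation and an implementation of rejection sampling.

Rejection sampling

What we want to do in Bayesian statistics is draw samples from some complicated posterior distribution using Monte Carlo methods.

Because sampling directly from the posterior is difficult, rejection sampling makes use of a simpler distribution (called the proposal distribution).

Let the target distribution be $p(z)$ and the proposal distribution be $q(z)$. We then need to find $M$ and $q(z)$ such that

$p(z) \leq Mq(z)$

holds for every $z$. This condition means that the target distribution is covered by the proposal distribution scaled by $M$.

The procedure for rejection sampling is very simple:

1. Generate a random number $y$ from the proposal distribution $q(z)$
2. Generate a random number $s$ uniformly on $[0, Mq(y)]$
3. If $s \leq p(y)$, accept $y$; otherwise reject it.
4. Repeat steps 1 to 3 $N$ times.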
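The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used later in the post: the function and parameter names are made up for the example, and the target here is a Beta(2, 2) density, $6x(1-x)$, chosen because its maximum (1.5 at $x = 0.5$) is easy to verify by hand.

```python
import random


def rejection_sample(p, q_sample, q_pdf, M, n_trials, rng=random):
    """Run n_trials proposals and return the accepted values."""
    accepted = []
    for _ in range(n_trials):
        y = q_sample(rng)                    # step 1: draw from the proposal
        s = rng.uniform(0.0, M * q_pdf(y))   # step 2: uniform height under M*q(y)
        if s <= p(y):                        # step 3: accept if under the target
            accepted.append(y)
    return accepted                          # step 4 is the loop itself


# Target: Beta(2, 2) density on [0, 1]; proposal: Uniform(0, 1), so q(z) = 1.
target = lambda x: 6.0 * x * (1.0 - x)
rng = random.Random(0)
n_trials = 20000
samples = rejection_sample(
    p=target,
    q_sample=lambda r: r.random(),
    q_pdf=lambda x: 1.0,
    M=1.5,
    n_trials=n_trials,
    rng=rng,
)
print(len(samples) / n_trials)       # fraction of proposals accepted
print(sum(samples) / len(samples))   # sample mean (Beta(2, 2) has mean 0.5)
```

Passing the sampler and density of the proposal as separate callables keeps the loop independent of the particular proposal distribution.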

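Beta distribution

This time we try rejection sampling with a beta distribution as the target. The beta distribution is the probability distribution whose density function is given by

$f(x \mid \alpha, \beta) = \dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$

Here $\alpha, \beta$ are positive real parameters. The beta distribution is a continuous probability distribution on the interval [0, 1]. Let's check its shape using SciPy, a Python library.

First, import the required libraries.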

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize as opt
from scipy.stats import beta, uniform

Next, plot the beta distribution. This is all the code needed.

np.random.seed()

a, b = 1.8, 2.9
x = np.linspace(beta.ppf(0.001, a, b), beta.ppf(0.999, a, b), 100)
plt.plot(x, beta.pdf(x, a, b))

f:id:unifa_tech:20190816113156p:plain

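Running rejection sampling

Now let's actually try rejection sampling.

The proposal can be any distribution that satisfies the bound above, but here we use the uniform distribution, which is easy to sample from and easy to understand.

First we need the constant $M$. As the figure below shows, $M$ can be taken as the maximum of the beta density, $p(y_{\max})$.

Then, along the horizontal axis, we draw a random value $y$ from the uniform distribution $q(z)$. Concretely, we pick one value between 0 and 1.

Next, along the vertical axis, we draw a random value $s$ from the uniform distribution on $[0, Mq(y)]$.

This way, two random numbers drawn from uniform distributions with different scales on the horizontal and vertical axes determine a single point.

If the value $s$ falls below the target distribution, we accept the sample. Very simple!

Naturally, the more samples we draw, the closer the result gets to the target distribution.

f:id:unifa_tech:20190816153734p:plain:w600

Now let's implement it and look at the results.

First we find $M$, using SciPy's optimize module. It provides convenient functions for optimizing an objective function; fmin, used here, finds a minimum.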

f = beta(a=a, b=b).pdf
res = opt.fmin(lambda x: -f(x), 0.1)
M = f(res)
print("M : ", M)

f:id:unifa_tech:20190816113912p:plain:w300

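The minus sign on $f(x)$ turns the problem of finding the maximum of the beta density into a minimization problem.

The result shows that the maximum of the beta density was found successfully.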
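As a side note that is not part of the original post: when $p$ is a normalized density, the probability that a single proposal is accepted follows directly from the procedure, since a proposal $z$ is accepted with probability $p(z) / (Mq(z))$:

$P(\text{accept}) = \int q(z)\,\dfrac{p(z)}{Mq(z)}\,dz = \dfrac{1}{M}\int p(z)\,dz = \dfrac{1}{M}$

So on average about $N/M$ of $N$ proposals are kept, which is why choosing the smallest $M$ that still satisfies $p(z) \leq Mq(z)$ wastes the fewest draws.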

Next, we generate random numbers and confirm that the accepted samples approach the target distribution.

First, let's look at the case of 1000 random numbers.

N_MC = 1000

y = uniform.rvs(size=N_MC)
mq = M * uniform.rvs(size=N_MC)
accept = y[mq <= f(y)]

# `normed` was removed from matplotlib; `density=True` gives the same normalized histogram
plt.hist(accept, density=True, bins=35, rwidth=0.8, label="rejection sampling")
x = np.linspace(beta.ppf(0.001, a, b), beta.ppf(0.999, a, b), 100)
plt.plot(x, beta.pdf(x, a, b), label="target distribution")
plt.legend()

f:id:unifa_tech:20190816114630p:plain:w400

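With 1000 random numbers, there is still some scatter relative to the target (beta) distribution.

Now let's increase the number of random numbers to 50000.

f:id:unifa_tech:20190816114751p:plain:w400

Compared with 1000 random numbers, the histogram now traces almost the same curve as the beta distribution.

In closing

This time we tried rejection sampling, a basic sampling algorithm. It is interesting that samples from the distribution we want can be obtained from a different distribution with such a simple idea. There are many other sampling algorithms, and I would like to try them as well.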

2019-08-15

Test Driving the Proposed Vue.js Function API

By Robin Dickson, software engineer at UniFa.

An RFC (request for comments) for Vue.js was published that explains the plan for a new Function API. Following that, a plugin was created that allows the proposed Function API to be used in current Vue applications: vue-function-api.

I thought I would experiment with the Function API by building a mini app.

Function API Installation

The base app was created using Vue CLI, and the vue-function-api plugin was installed using yarn:

$ vue create janken
$ yarn add vue-function-api

Then the plugin is installed explicitly:

import Vue from 'vue'
import { plugin } from 'vue-function-api'

Vue.use(plugin)

The current API (Standard API) still works as usual, and it is even possible to use a hybrid approach (Function API + Standard API).

The app I decided to build was Janken, or in English: Rock, Scissors, Paper.

In this app there are 4 features:

The player can choose their hand
The computer chooses their hand and a winner is calculated
The total number of points (wins) for each player is shown
The player can change their name

The initial design was:

<template>
  <div>
    <div class="score">
      <div>Player</div>
      <div>0 - 0</div>
      <div>Computer</div>
    </div>
    <div class="player-hands">
      <div>✊</div>
      <div>✊</div>
    </div>
    <ul class="hand-choices">
      <li>✊</li>
      <li>✌️</li>
      <li>🖐️</li>
    </ul>
  </div>
</template>

f:id:unifa_tech:20190814123651p:plain

Vue Implementation

Setup, Data and Value

The first change from the Standard API is that a setup option is used to set up the component logic. If you need to use props, they are passed to setup as an argument.

The score data would previously have been stored in the data option, which is not used in the Function API. Instead, the data is stored using the value API. Data and functions that are used in the template are returned from the setup option.

import { value } from "vue-function-api";

export default {
  setup() {
    const playerScore = value(0);
    const computerScore = value(0);

    return {
      playerScore,
      computerScore
    };
  }
};

<div class="score">
  <div>Player</div>
  <div>{{ playerScore }} - {{ computerScore }}</div>
  <div>Computer</div>
</div>

Methods

The next task was to enable the player to choose their hand. Using the Standard API this can be done using the methods option. In the Function API the same can be done using a function.

export default {
  setup() {
    // ...
    const playerHand = value(null);

    function submitHand(hand) {
      playerHand.value = hand;
    }
    return {
      playerScore,
      computerScore,
      playerHand,
      submitHand
    };
  }
};

To set (and also get) the value of playerHand within setup, playerHand.value must be used.

Computed

The hands are displayed in the UI using emoji. The hand data stored as a string ('rock') is converted to an emoji ('✊') with the computed API (similar to the Standard API's computed option). Again this is stored to a variable and returned from setup to be used in the template.

import { value, computed } from "vue-function-api";

export default {
  setup() {
    const handsToEmoji = {
      rock: "✊",
      scissors: "✌️",
      paper: "🖐️" 
    };
    
    const isShowGameHands = value(false);
    // ...
    const playerHand = value(null);

    // ...
    const playerDisplayEmoji = computed(() =>
      isShowGameHands.value
        ? handsToEmoji[playerHand.value]
        : handsToEmoji["rock"]
    );
    // ...
    return {
      // ...
      playerDisplayEmoji,
      computerDisplayEmoji
    };
  }
};

If there are many methods or computed values in the returned object, these can be grouped into a single object and spread into the object returned from setup.

// example code (not from the Janken app)
const methods = {
  methodA() {
    // ...
  },
  methodB(arg) {
    // ...
  }
}

const computeds = {
  computedA: computed(() => 'a'),
  computedB: computed(() => 'b')
};

return {
  ...computeds,
  ...methods
};


Composition Functions

After adding the logic for the game and editing the player name I tried refactoring using a technique made possible in the Function API. Using a composition function the logic (variables and methods) could be extracted to a separate function, and then included in the object returned from the main setup option.

function useName() {
  const playerName = value("Player");
  const isEditingName = value(false);

  function editName() {
    isEditingName.value = true;
  }
  function submitName() {
    isEditingName.value = false;
  }
  return { playerName, isEditingName, editName, submitName };
}

export default {
  setup() {
    // ...

    return {
      isShowGameHands,
      playerScore,
      computerScore,
      playerHand,
      computerHand,
      submitHand,
      ...computeds,
      ...useName()
    };
  }
}

By doing this, the code can be organised more clearly, collecting related code together rather than having it separated between different options (data, computed, methods, etc.), which can happen in the Standard API. It is also possible to reuse the logic in other components.

Although not used in this app, lifecycle hooks and watchers are used in a similar way to value and computed and can also be extracted.

The Janken app and code can be seen and used below. It only took a few steps to get started with the Function API, and there are various features I did not use that I'm looking forward to trying. For more information check out the RFC and try it yourself!

<html>
  <head>
    <script src="https://cdn.jsdelivr.net/npm/vue/dist/vue.js"></script>
    <script src="https://unpkg.com/vue-function-api@1.0.4/dist/vue-function-api.umd.js"></script>
    <title></title>
  </head>
  <body>
    <div id="app">
      <main>
          <div>
    <div v-if="isEditingName" class="emoji-button" @click="submitName">🆗</div>
    <div v-else class="emoji-button" @click="editName">✏️</div>

    <div class="score">
      <div class="input-wrapper" v-if="isEditingName">
        <input v-model="playerName" @keyup.enter="submitName" type="text" maxlength="7" />
      </div>
      <div v-else>{{ playerName }}</div>
      <div>{{ playerScore }} - {{ computerScore }}</div>
      <div>Computer</div>
    </div>
    <div class="player-hands">
      <div>{{ playerDisplayEmoji }}</div>
      <div>{{ computerDisplayEmoji }}</div>
    </div>
    <ul class="hand-choices">
      <li :class="{ selected: playerHand === 'rock' }" @click="sumbmitHand('rock')">✊</li>
      <li :class="{ selected: playerHand === 'scissors' }" @click="sumbmitHand('scissors')">✌️</li>
      <li :class="{ selected: playerHand === 'paper' }" @click="sumbmitHand('paper')">🖐️</li>
    </ul>
  </div>
      </main>
    </div>
  </body>
</html>

main {
  font-family: "Courier New", Courier, monospace;
  width: 280px;
  margin: auto;
}
.score {
  display: flex;
  justify-content: space-between;
  margin-top: 40px;
  font-size: 18px;
  font-weight: bold;
}
.score div:nth-child(1) {
  width: 35%;
  text-align: right;
}
.score div:nth-child(2) {
  width: 30%;
  text-align: center;
}
.score div:nth-child(3) {
  width: 35%;
  text-align: left;
}
.player-hands {
  display: flex;
  justify-content: space-between;
  margin-top: 80px;
  font-size: 20vh;
}
.hand-choices {
  display: flex;
  justify-content: space-between;
  margin-top: 60px;
  padding: 0;
}
.hand-choices li {
  font-size: 15vh;
  list-style: none;
  cursor: pointer;
}
.hand-choices li.selected {
  text-decoration: lightblue underline;
  transition: font-size 0.2s;
}
.score div {
  padding: 5px 0;
}
.score .input-wrapper {
  padding: 0;
}
input[type="text"] {
  width: 80px;
  padding: 5px;
  margin: 0 0 0 15px;
  border: 2px solid #ccc;
  border-radius: 5px;
  font-family: "Courier New", Courier, monospace;
}
.emoji-button {
  position: fixed;
  top: 10px;
  left: 10px;
  cursor: pointer;
}

const { plugin, value, computed } = vueFunctionApi;

Vue.config.productionTip = false;
Vue.use(plugin);

function useName() {
  const playerName = value("Player");
  const isEditingName = value(false);
  function editName() {
    isEditingName.value = true;
  }
  function submitName() {
    isEditingName.value = false;
  }
  return { playerName, isEditingName, editName, submitName };
}

var app = new Vue({
  el: '#app',
  setup() {
    const handsToEmoji = { rock: "✊", scissors: "✌️", paper: "🖐️" };
    const isShowGameHands = value(false);
    const playerScore = value(0);
    const computerScore = value(0);
    const playerHand = value(null);
    const computerHand = value(null);
    const computeds = {
      playerDisplayEmoji: computed(() =>
        isShowGameHands.value
          ? handsToEmoji[playerHand.value]
          : handsToEmoji["rock"]
      ),
      computerDisplayEmoji: computed(() =>
        isShowGameHands.value
          ? handsToEmoji[computerHand.value]
          : handsToEmoji["rock"]
      )
    };
    function sumbmitHand(hand) {
      playerHand.value = hand;
      runGame();
    }
    function randomHand() {
      const hands = Object.keys(handsToEmoji);
      return hands[Math.floor(Math.random() * hands.length)];
    }
    function addPointToWinner() {
      const handToWeakness = {
        rock: "paper",
        scissors: "rock",
        paper: "scissors"
      };
      if (handToWeakness[computerHand.value] === playerHand.value) {
        playerScore.value++;
      } else if (handToWeakness[playerHand.value] === computerHand.value) {
        computerScore.value++;
      }
    }
    function runGame() {
      computerHand.value = randomHand();
      isShowGameHands.value = true;
      addPointToWinner();
    }
    return {
      isShowGameHands,
      playerScore,
      computerScore,
      playerHand,
      computerHand,
      sumbmitHand,
      ...computeds,
      ...useName()
    };
  }
}).$mount('#app')

2019-08-08

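One Baby and Two Routers

Hello. I'm Kakimoto, a server-side engineer.

A baby and a router: at first glance they seem unrelated, but did you know there is actually a deep connection between them?

Our child was born at the end of last year, and as a result we gained one router.

UniFa has adopted remote work, but working from home requires a reasonable setup. I like my living room, so when I work remotely I always work there.

Before the baby was born I joined web meetings from the living room too, but loud crying during a meeting disturbs the other participants, so I now move to the bedroom for web meetings.

But the signal does not reach the bedroom! Our home is astonishingly small, yet the signal from the router on loan from NTT (an ONU-integrated unit) is even more astonishingly weak! So I bought a router with a stronger signal to use as an AP.

However, our child started nursery school this spring, so I no longer needed to move to the bedroom, and the router became nothing more than a shiny black ornament.

When it comes to electronics you no longer use, the first thing that comes to mind is modding! Looking around, I found DD-WRT, a Linux distribution for routers, so I decided to install it right away.

What to do

The router used this time is Buffalo's "WZR-HP-AG300H".

After some searching, the procedure looked very simple:

1. Download DD-WRT from the site
2. Access the router's admin page ( http://192.168.11.1 )
3. Choose the file in the firmware update menu and update
4. Connect to the SSID "dd-wrt"
5. Connect to the router's admin page ( http://192.168.1.1 )
6. Various settings

That is about it.

1. Downloading DD-WRT

Searching for "WZR-HP-AG300H" in the Router Database on the DD-WRT site ( https://dd-wrt.com ) found it right away.

There are two choices. "DD-WRT: Factory flash" looked like the right one, so I downloaded that.

2. Accessing the router's admin page ( http://192.168.11.1 )

The admin page should open at http://192.168.11.1 , but for some reason I could not reach it... ping did not get through either...

After some investigation, I found that only the following IPs existed on the network.

I did not know the MAC address, so I tried them one by one, and "192.168.1.8" turned out to be the Buffalo router! Thinking about it, routing is handled by the router on loan from NTT, so of course the Buffalo router has a different IP than expected.

This may be a pitfall if you are using the router as an AP.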
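Trying each address by hand, as described above, can also be scripted. The sketch below is one possible way to do it and is not what the author used: it lists every host address in a subnet and checks which ones accept a TCP connection on port 80, where a router admin page would usually listen. The subnet `192.168.1.0/24` is an assumption based on the 192.168.1.x addresses mentioned in the post, and the function names are illustrative.

```python
import ipaddress
import socket


def candidate_hosts(network):
    """List the usable host addresses in a subnet such as 192.168.1.0/24."""
    return [str(ip) for ip in ipaddress.ip_network(network).hosts()]


def has_web_ui(host, port=80, timeout=0.3):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan(network):
    """Return the hosts in the subnet that answer on port 80."""
    return [host for host in candidate_hosts(network) if has_web_ui(host)]
```

Calling `scan("192.168.1.0/24")` on the home network would return the addresses worth opening in a browser; with the short timeout a full /24 takes at most a minute or so.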

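3. Updating the firmware

It is a slightly tense moment, but I pressed the 「設定」 (Set) button without hesitation.

The firmware update begins.

4. Connecting to the SSID "dd-wrt"

About ten nerve-racking minutes followed, but then the SSID "dd-wrt" appeared, which was a relief!

5. Connecting to the router's admin page ( http://192.168.1.1 )

After the password setup screen, I reached the admin page without trouble.

6. Various settings

Settings for use as an AP

For now the device is used as an AP, so I set the WAN side and the DHCP function to Disabled and matched the Subnet Mask and Gateway to the router's settings. I doubt I will ever connect 100 devices to the router, so I set the Local IP address to "192.168.1.100".

Pressing "Apply Setting" restarts the router. Note that after the restart the settings page moves to http://192.168.1.100 .

SSID and WPA

The "WZR-HP-AG300H" can communicate on both 2.4GHz and 5GHz, but by default both use the SSID "dd-wrt", so you cannot tell them apart. I therefore gave each one a separate SSID.

Left as is, I would be providing free Wi-Fi to the neighbors and putting our home at risk, so I also configured WPA.

SSH access

Having gotten hold of a Linux machine, it would be no fun without SSH access! So I set it up.

After pressing "Apply Setting" and waiting a little, I was able to connect via SSH!

Summary

The address of the router's admin page keeps changing, so it is easy to get lost, but once you are past that, an ordinary router turns into a small Linux server!

Making it reachable from outside would also make VPN possible, so next I will somehow deal with the ONU-integrated router and with DDNS, and aim for the VPN life (probably not all that comfortable)!

UniFa is looking for engineers who cannot resist modding electronics they no longer use! We look forward to your applications!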

www.wantedly.com

2019-08-08

Experimental method for Bio-Data augmentation using only two observations for deep learning applications.

By Matthew Millar, R&D Scientist at ユニファ

This post will show a new experimental method for data augmentation geared towards bio-science for deep learning. This is important for several reasons. 1: Collecting data is time-consuming, especially when collecting enough observations to train deep learning models. 2: It can be difficult to collect or sample enough observations because of limited access or few chances to make collections. 3: Observations can sometimes only be collected at certain times or during certain periods, or the sampling period has already passed, so collecting further observations is impossible. 4: There are few species available to collect samples from. These are just four simple reasons why data augmentation is needed for biological studies.

Methods for Data Augmentation

The simplest method for data augmentation is to match the generated data both statistically and logically to the observed data. This means that the generated data should have a similar look and feel to the real-world data. The two data sets should have similar distributions, means, modes, etc. to ensure that the generated data truly simulates the observed sequences. The simulated data should also be logically like the observed data. This means that the simulated data should not have outliers modeled into it, as this will confuse any model. The augmented data should flow alongside the observations and almost mirror each observation. But just copying the real observations is not an appropriate method for data augmentation; the observations should change slightly. For example, common methods for data augmentation with CNNs are image rotation, flipping, cropping, changing color, etc., which create "new" unseen images for a CNN to be trained on. The same is true for numerical data, but it is not as easy as flipping the digits of 10 to get 01, as those are not the same value.

There are very few methods for data augmentation of numerical data, and even fewer geared specifically towards biodata or biostudies. This post will show a new method for generating near-infinite observations based simply on the minimum and maximum observations in a data set.
The data set that I am using is a publicly available data set of body measurements (BDIMS) (Heinz, Peterson, Johnson, & Kerk, 2003). It contains girth and skeletal measurements of 247 men and 260 women.

Now let's get into the coding aspect of it:

CODE

First, let's get all the import statements out of the way.

import numpy as np 
import pandas as pd 
%matplotlib inline
import matplotlib.pyplot as plt
import pymc3 as pm
import theano
from statsmodels.formula.api import glm as glm_sm
import statsmodels.api as sm
from pandas.plotting import scatter_matrix
from random import randint

Next, we need to do some quick examination of the data we downloaded.

# Read the data in from the csv file
data = pd.read_csv("bdims.csv")
print(data.columns)

Index(['bia.di', 'bii.di', 'bit.di', 'che.de', 'che.di', 'elb.di', 'wri.di',
       'kne.di', 'ank.di', 'sho.gi', 'che.gi', 'wai.gi', 'nav.gi', 'hip.gi',
       'thi.gi', 'bic.gi', 'for.gi', 'kne.gi', 'cal.gi', 'ank.gi', 'wri.gi',
       'age', 'wgt', 'hgt', 'sex'],
      dtype='object')

Now we know the column names. Let's get rid of some of the data we don't want, to make it simpler and easier to use.

filter_data = data.filter(['sex','hgt','wgt', 'che.gi','hip.gi', 'kne.gi','thi.gi', 'ank.gi', 'wri.gi', 'wai.gi' ], axis=1)
print(filter_data.head())

   sex    hgt   wgt  che.gi  hip.gi  kne.gi  thi.gi  ank.gi  wri.gi  wai.gi
0    1  174.0  65.6    89.5    93.5    34.5    51.5    23.5    16.5    71.5
1    1  175.3  71.8    97.0    94.8    36.5    51.5    24.5    17.0    79.0
2    1  193.5  80.7    97.5    95.0    37.0    57.3    21.9    16.9    83.2
3    1  186.5  72.6    97.0    94.0    37.0    53.0    23.0    16.6    77.8
4    1  187.2  78.8    97.5    98.5    37.7    55.4    24.4    18.0    80.0

Much nicer. Now we only want to look at one group, as this is biological data. So we will separate females from males and just look at males. This process will work on either sex as the steps are the same, but doing both at the same time will yield poor results, as there are biological differences between males and females in general.

# Split between male and female 
male_mask = filter_data['sex'] > 0
male = filter_data[male_mask]
female = filter_data[~male_mask]
# After separating the two sexes, let's drop the sex column as we don't need it
male = male.drop(['sex'], axis=1)
male.describe()

              hgt         wgt      che.gi      hip.gi      kne.gi      thi.gi      ank.gi      wri.gi      wai.gi
count  247.000000  247.000000  247.000000  247.000000  247.000000  247.000000  247.000000  247.000000  247.000000
mean   177.745344   78.144534  100.989879   97.763158   37.195547   56.497976   23.159109   17.190283   84.533198
std      7.183629   10.512890    7.209018    6.228043    2.272999    4.246667    1.729088    0.907997    8.782241
min    157.200000   53.900000   79.300000   81.500000   31.100000   46.800000   16.400000   14.600000   67.100000
25%    172.900000   70.950000   95.950000   93.250000   35.750000   53.700000   22.000000   16.500000   77.900000
50%    177.800000   77.300000  101.000000   97.400000   37.000000   56.000000   23.000000   17.100000   83.400000
75%    182.650000   85.500000  106.050000  101.550000   38.450000   59.150000   24.300000   17.850000   90.000000
max    198.100000  116.400000  118.700000  118.700000   45.700000   70.000000   29.300000   19.600000  113.200000

With the first step of preprocessing done, we can get into the process of creating the dataset from only two points! These two points will be the minimum and maximum based on height. Height is chosen because this variable is the dominating variable in biology and bio-mass. Weight is normally heavily dependent on height (pun intended). The dependent variable will be weight (X = height, Y = weight).
So let's find the shortest and tallest person in the dataset.

# Find the tallest and shortest people based on height
# Create a new dataframe of the smallest and largest
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
min_max_male = pd.concat([
    male[male.hgt == male.hgt.max()],
    male[male.hgt == male.hgt.min()],
])
# Sort by height
sort_min_mix_male = min_max_male.sort_values('hgt')
print(sort_min_mix_male)

       hgt   wgt  che.gi  hip.gi  kne.gi  thi.gi  ank.gi  wri.gi  wai.gi
105  157.2  58.4    91.6    91.3    35.5    55.0    20.8    16.4    80.6
126  198.1  85.5    96.9    94.9    39.2    54.4    27.5    17.9    82.5

ax1 = min_max_male.plot.scatter(x='hgt',y='wgt',c='DarkBlue')

f:id:unifa_tech:20190805162245p:plain — Max Min Plot

So the first and simplest method of interpolation is linear regression. This will give us a few extra points of missing data.

# Now use linear regression to fill in some of the missing points
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([min_max_male.hgt.min(),min_max_male.hgt.max()]).reshape((-1, 1))
y = np.array([min_max_male.wgt.min(), min_max_male.wgt.max()])
# Define a linear regression model
model = LinearRegression()
model.fit(x, y)
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)

coefficient of determination: 1.0
intercept: -45.75941320293397
slope: [0.66259169]
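With only two observations, a least-squares fit is simply the straight line through both points, which is why the coefficient of determination comes out as exactly 1.0. As a quick check that does not need scikit-learn, the slope and intercept can be computed by hand from the two rows printed earlier (157.2 cm, 58.4 kg) and (198.1 cm, 85.5 kg). The helper name `line_through` is made up for this sketch.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


slope, intercept = line_through((157.2, 58.4), (198.1, 85.5))
print("slope:", slope)
print("intercept:", intercept)
print("predicted weight at 157 cm:", intercept + slope * 157)
```

The values agree with the `LinearRegression` output above up to floating-point rounding.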

Now to make new points.

prediction = []
gen_height = []
for i in range(int(min_max_male.hgt.min()), int(min_max_male.hgt.max())):
    new_x = np.array(i).reshape((-1, 1))
    gen_height.append(i)
    pred = model.predict(new_x)
    prediction.append(pred[0])

print(len(prediction))
print(len(gen_height))
print(prediction[0])
print(gen_height[0])

41
41
58.267481662591706
157

# Let's plot the results
import matplotlib.pyplot as plt

old_min_hgt = min_max_male.hgt.min()
old_max_hgt = min_max_male.hgt.max()
old_min_wgt = min_max_male.wgt.min()
old_max_wgt = min_max_male.wgt.max()

plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()

f:id:unifa_tech:20190805162711p:plain — Linear Regression

OK, this looks fine so far. The blue dots are the original data (min and max) and the red dots are the newly generated data. This makes sense, as weight should increase as height increases. But not exactly: there are variations in weight because of other factors. Also, 41 new points don't make a deep learning dataset.
Let's create a few more points:

# Now let's fine-tune the height variable by using a float instead of an int
# We can reuse the linear regression model to generate more data
# Go from 41 observations to 409000 observations
# All equally possible to occur in the real world
current_hgt = min_max_male.hgt.min()
count = 0
large_hgt = []
while current_hgt <= min_max_male.hgt.max():
    # increase the height by 0.0001 cm
    current_hgt +=0.0001
    large_hgt.append(current_hgt)
    count +=1
print(len(large_hgt))

409000

# Now, using the newly generated fine-scale heights, let's get the weights
large_pred = []
for h in large_hgt:
    new_x = np.array(h).reshape((-1, 1))
    pred = model.predict(new_x)
    large_pred.append(pred[0])

print(len(large_pred))

409000
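The two loops above can also be written without Python-level iteration. The following is a sketch only: the endpoint values are assumptions standing in for `min_max_male.hgt.min()` and `min_max_male.hgt.max()`, and the line is evaluated directly from the printed intercept and slope instead of calling `model.predict`. Building the grid from integer multiples of the step also avoids the small floating-point drift that repeated `+=` accumulates. The grid includes both endpoints, so its length can differ by one from the loop version.

```python
import numpy as np

# Assumed endpoints (cm); in the post these come from min_max_male.hgt
lo, hi, step = 157.2, 198.1, 0.0001

# Build the grid from integer multiples of the step to avoid accumulated float error
n = int(round((hi - lo) / step)) + 1
heights = lo + step * np.arange(n)

# Intercept and slope as printed by the two-point model
alpha, beta = -45.75941320293397, 0.66259169
weights = alpha + beta * heights

print(len(heights), heights[0], heights[-1])
```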

# Now let's plot everything again

plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()

[Figure: Larger Dataset]

As you can see, the points overlap perfectly, and each observation is logical.
The blue dots are the original data, the red dots are the first step, and the green dots are the fine-tuned steps.
This jumps from 2 observations (min and max) to 41 observations (fully synthetic) to 409000 observations.
But in the real world, biology does not always follow a straight line.
Let's introduce some variability into the data generation!

# Define a new line using all the data from the real dataset
# Define a linear regression model
X = np.array(male.hgt).reshape(-1, 1)
Y = np.array(male.wgt).reshape(-1, 1)

model2 = LinearRegression()
model2.fit(X,Y)
r_sq2 = model2.score(X,Y)
print('coefficient of determination:', r_sq2)
print('intercept:', model2.intercept_)
print('slope:', model2.coef_)

coefficient of determination: 0.28594874074704446
intercept: [-60.95336414]
slope: [[0.78256845]]
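The drop from 1.0 to roughly 0.29 is expected: the coefficient of determination measures how much of the variance in weight the line explains, and real weights scatter around the line. A small sketch of the calculation on made-up data (the trend, noise level, and `r_squared` helper are illustrative assumptions, not the real dataset):

```python
import numpy as np

def r_squared(x, y):
    # Least-squares line; np.polyfit returns [slope, intercept] for degree 1
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = np.linspace(157.0, 197.0, 200)
# Made-up data: a linear trend plus noise from unobserved factors
y = -60.0 + 0.78 * x + rng.normal(0.0, 10.0, size=x.size)

# Two points always give a perfect fit; the full noisy sample does not
r2_two_points = r_squared(x[[0, -1]], y[[0, -1]])
r2_all = r_squared(x, y)
print(r2_two_points, r2_all)
```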

# Linear regression using real data
y_pred = model2.predict(X)
# Now plot all the data
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.plot(male.hgt, male.wgt, 'yo')
plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()

[Figure: Real Data]

As you can see, the regression line is somewhat close to the line of generated data. It is not perfect, and there will be a lot of variability between the two datasets. But given that the generated line is based on only two observations (the min and max), the lines are pretty close. The intercept and slope are close enough that we can use the ones found from the two points only. So let us continue and make a fully synthetic deep learning dataset from two observations.
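To put a number on "pretty close", the two printed lines can be compared directly. This sketch uses only the coefficients printed above; the `gap` helper is a name introduced here for illustration. The gap works out to about 3.6 kg at 157 cm and about 8.4 kg at 197 cm.

```python
# Coefficients as printed above
a_two, b_two = -45.75941320293397, 0.66259169   # fit on the min and max only
a_all, b_all = -60.95336414, 0.78256845         # fit on the full male dataset

def gap(height):
    # Difference in predicted weight (kg) between the two lines
    return (a_all + b_all * height) - (a_two + b_two * height)

# 157 and 197 are the first and last integer heights generated earlier
gap_low, gap_high = gap(157), gap(197)
print(round(gap_low, 2), round(gap_high, 2))
```

The gap grows with height because the full-data slope is steeper than the two-point slope.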

# The slope of the line is b, and a is the intercept found from the sklearn linear model
# Simple linear regression model Y = a + bX that will be the model for our MCMC
alpha = -45.75941320293397 # Intercept
beta = [0.66259169] # Slope
X = np.array(large_hgt)
Y = np.array(large_pred)
print(len(X))
print(len(Y))

409000
409000

# Weight Histogram
hist = male.hist(column='wgt')

[Figure: Real Data Histogram]

# Normal distribution: mu is the mean, and sigma is the standard deviation.
# Since the weight is (basically) normally distributed, we can use that knowledge
# to generate new data by sampling from a normal distribution.

# For random.normalvariate(mu, sigma)
# Note: X holds the generated heights, so std is the spread of those heights
std = np.std(X, axis=0)
real_std = np.std(male.wgt, axis=0)
print(std)
print(real_std)
11.806813005284504
10.491587167890629

temp_min_max = []
temp_min_max.append(male.wgt.max())
temp_min_max.append(male.wgt.min())
mean = np.mean(temp_min_max)
real_mean = np.mean(male.wgt)
print(mean)
print(real_mean)
85.15
78.14453441295547

Looking at the mean and standard deviation, they are close enough for this example. Let's make a million data points for our new dataset! That should be enough for any deep learning dataset.
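One detail worth noting: `std` is computed from `X`, the evenly spaced generated heights, not from weights. For evenly spaced values the standard deviation is close to the range divided by the square root of 12, which reproduces the printed 11.8068. A quick check, assuming the 40.9 cm range implied by 409000 steps of 0.0001 cm:

```python
import math

# 409000 steps of 0.0001 cm span 40.9 cm
height_range = 409000 * 0.0001

# Standard deviation of a uniform spread over that range
uniform_std = height_range / math.sqrt(12)
print(uniform_std)
```

So the similarity to the real weight standard deviation (about 10.49) is partly a coincidence of the height range, which is worth keeping in mind when reusing this approach on other data.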

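from random import randint

new_X = []
new_Y = []
for i in range(0, 1000000):
    # Pick a random generated height; randint includes both endpoints
    index = randint(0, len(X) - 1)
    new_X.append(X[index])
    # Draw the weight independently of the height
    new_Y.append(np.random.normal(mean, std))
plt.plot(new_X, new_Y, 'g^')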
plt.plot(male.hgt, male.wgt, 'yo')
plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()

[Figure: A Million Data points!]

Well, that's no good. To be fair, given an infinite number of samples, it is highly likely that for each point on this chart there would be someone who matches that height and weight, but that is like using a shotgun to fish. The points do not follow the regression line of the real data, which means the dataset is not useful: a deep learning model trained on it won't learn anything.
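The underlying reason is that each weight is drawn without looking at the height it is paired with, so the two columns are statistically independent. A small sketch that measures this with the Pearson correlation, assuming a uniform height range and the mean and standard deviation printed earlier (rounded):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Heights drawn uniformly over an assumed range (cm)
heights = rng.uniform(157.0, 198.0, size=n)
# Weights drawn without looking at the height, using the values printed above
weights = rng.normal(85.15, 11.8, size=n)

# Pearson correlation between the two columns
corr = np.corrcoef(heights, weights)[0, 1]
print(corr)
```

A correlation near zero means there is no height-to-weight relationship left for a model to learn.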
So how can we fix this?
Let's perform some rejections using the concept of banding: if an observation falls outside the bands, it is rejected and a new one is drawn. The bands set an upper and a lower limit that every prediction has to fall within. Forming these limits requires expert knowledge of the observed phenomenon, especially when there are only two observations. Luckily for us, we have more than two observations, so we can define our limits based on the full real dataset.

# Use upper and lower limits to reject samples
def make_sample(lower, upper, mean, std):
    # Keep drawing until a sample falls strictly inside the limits
    while True:
        sample = np.random.normal(mean, std)
        if lower < sample < upper:
            return sample

# Define bands for each interval
# The more bands the finer the level of rejection
# Each item in the array is defined as
# [band lower, band upper, lower limit, upper limit]
band1 = [0, 155, 50, 70]
band2 = [156,160, 55, 70]
band3 = [161, 165, 56, 75]
band4 = [166, 170, 57, 80]
band5 = [171, 175, 60, 88]
band6 = [176, 180, 60, 94]
band7 = [181, 185, 60, 100]
band8 = [186, 190, 63, 105]
band9 = [191, 195, 64, 110]
band10 = [196, 299, 65, 110]
# Put all the bands into a single list for easy use
bands = [band1, band2, band3, band4, band5,
         band6, band7, band8, band9, band10]

new_X = []
new_Y = []
for i in range(0, 1000000):
    index = randint(0, len(X) - 1)
    # Heights that fall between two bands (e.g. 160.5) match no band and are skipped
    for band in bands:
        if band[0] <= X[index] <= band[1]:
            new_X.append(X[index])
            new_Y.append(make_sample(band[2], band[3], mean, std))

plt.plot(new_X, new_Y, 'g^')
plt.plot(male.hgt, male.wgt, 'yo')
plt.plot(large_hgt, large_pred, 'go')
plt.plot(gen_height, prediction, 'ro')
plt.plot(old_min_hgt, old_min_wgt, 'bo')
plt.plot(old_max_hgt, old_max_wgt, 'bo')
plt.show()

Which gives us this!

[Figure: Banded Data]
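Rejection sampling has a cost: the narrower a band is relative to the normal distribution being sampled, the more draws are thrown away. The sketch below estimates the chance that a single draw is accepted, using the mean and standard deviation printed earlier and the limits of the first and last band. The helper names `normal_cdf` and `acceptance_rate` are introduced here for illustration.

```python
import math

def normal_cdf(x, mean, std):
    # Cumulative distribution function of a normal distribution
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def acceptance_rate(lower, upper, mean, std):
    # Probability that a single draw lands between the limits
    return normal_cdf(upper, mean, std) - normal_cdf(lower, mean, std)

mean, std = 85.15, 11.806813  # values printed earlier in the post

p_band1 = acceptance_rate(50, 70, mean, std)    # limits of band1
p_band10 = acceptance_rate(65, 110, mean, std)  # limits of band10
print(p_band1, p_band10)
```

For the lowest band only about one draw in ten is accepted, while the highest band accepts most draws. Centring the distribution on each band instead of using one global mean would likely reduce the waste.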

There are still up to a million points (heights that fall in the gaps between bands are skipped), and some may be repeated. But the general flow is now far more similar to the real data, which makes it much better suited for training a deep learning model.
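Because the generated heights are floats, bands defined with integer endpoints such as 160 and 161 leave small gaps between them. One possible way to avoid this, sketched here with the same limits as the bands above, is to store only the upper edge of each band and look the band up with `bisect`. The `edges`, `limits`, and `weight_limits` names are introduced for this example.

```python
from bisect import bisect_left

# Upper edge (cm) of every band except the last; bands are contiguous
edges = [155, 160, 165, 170, 175, 180, 185, 190, 195]
# (lower limit, upper limit) in kg for each band, same values as band1..band10
limits = [(50, 70), (55, 70), (56, 75), (57, 80), (60, 88),
          (60, 94), (60, 100), (63, 105), (64, 110), (65, 110)]

def weight_limits(height):
    # bisect_left puts a height equal to an edge into the band ending at that edge
    return limits[bisect_left(edges, height)]

print(weight_limits(155), weight_limits(160.5), weight_limits(198.1))
```

With this lookup every height maps to exactly one band, so no samples are silently dropped.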

Conclusion

In this blog, we saw how to take only two observations, the minimum and maximum, and create a fully synthetic dataset that can be used for deep learning.
The main idea when building a fully synthetic dataset is to ensure it is statistically and logically similar to the observed/real dataset. This gives the benefit of creating a large training dataset and then using the real data as a testing set. This can give very good results when creating a deep learning model, as you won't have to train the model on the very limited (and precious) real data that can be very difficult to capture or collect.

This approach can be improved significantly, especially in the banding section. Adding a larger number of bands, smoothing out the lower and upper limits, or even using more complex algorithms such as a random walk can improve the final results. But this method still needs to be vetted before use in different models and/or real-world applications. The next step would be to model more independent variables and other phenomena, and to improve the generation steps.
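As one possible reading of "smoothing out the lower and upper limits", the limits could follow the regression line itself instead of stepping between hand-written bands. This is only a sketch under stated assumptions: the spread `sigma` and the band `half_width` are hypothetical values, the height range is assumed, and the `sample_weight` helper is not part of the method described in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Intercept and slope printed for the two-point model
alpha, beta = -45.75941320293397, 0.66259169
# Hypothetical settings: spread around the line and band half-width in kg
sigma, half_width = 10.0, 20.0

def sample_weight(height):
    # Centre the distribution on the line, reject draws outside the smooth band
    centre = alpha + beta * height
    while True:
        w = rng.normal(centre, sigma)
        if abs(w - centre) < half_width:
            return w

heights = rng.uniform(157.0, 198.0, size=1000)
weights = np.array([sample_weight(h) for h in heights])
print(np.corrcoef(heights, weights)[0, 1])
```

Because the band moves continuously with height, the generated weights keep a positive correlation with height without any visible steps at band boundaries.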

References:

Heinz G, Peterson LJ, Johnson RW, Kerk CJ. 2003. Exploring Relationships in Body Dimensions. Journal of Statistics Education 11(2).