Human string-recognition test

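The experiment below suggests that when people read written language, and not only English, they do not read letter by letter; instead they seem to pick up the visual pattern of whole words or groups of words.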

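I could read it with no trouble at all, which surprised me quite a bit. I wanted to try this for myself, so I wrote a script in Ruby. It leaves the first and last letters of each word in place and randomly shuffles the letters in between.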

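class String
  # Return a copy with the first and last characters kept in place and
  # the characters in between shuffled. Strings of 3 characters or
  # fewer are returned unchanged.
  def shake
    unless self.size <= 3
      str = ""
      # Earlier version, replaced by index access (see the note at the end):
#      begin_letter = split(//).shift
#      end_letter = split(//).pop
      # .chr is needed on Ruby 1.8, where self[0] is a character code;
      # on later versions it is harmless.
      begin_letter = self[0].chr
      end_letter = self[self.size-1].chr
      # Append the middle characters to str in random order.
      self.split(//)[1,self.size - 2].rand_each{|c|
        str.concat(c)
      }
      begin_letter + str + end_letter
    else
      self
    end
  end
end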

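class Array
  # Yield every element exactly once, in random order.
  # Works on a copy so that the receiver is not emptied.
  def rand_each
    pool = dup
    while pool.size > 0
      r = rand(pool.size)
      yield(pool.delete_at(r))
    end
  end
end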

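while line = gets
  line.chomp!
  line.split.each{|word|
    # Optional punctuation, a run of letters, optional punctuation.
    # The pattern is not anchored, so anything outside the first match
    # (the "'s" in "system's", the "-defined" in "pre-defined") is dropped.
    if word =~ /(["\.,?!;]?)([a-zA-Z]+)(["\.,?!;]?)/
      print $1, $2.shake, $3, " "
    else
      # No letters at all (e.g. "42"): $2 would be nil, so print the token as is.
      print word, " "
    end
  }
  print "\n"
end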
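
As an aside, one way to avoid splitting each word into "punctuation, letters, punctuation" would be to scramble every run of letters in place and leave all other characters alone. The following is only a sketch of that idea, not what the script above does; `shake_word` and `shake_line` are names made up for this example, and it uses the built-in Array#shuffle instead of `rand_each`.

```ruby
# Hypothetical helper (not from the original script): keep the first and
# last letters, shuffle the letters in between.
def shake_word(word)
  return word if word.size <= 3
  word[0] + word[1..-2].chars.shuffle.join + word[-1]
end

# Scramble every run of ASCII letters in a line. Everything else
# (spaces, apostrophes, hyphens, punctuation) stays where it was.
def shake_line(line)
  line.gsub(/[a-zA-Z]+/) { |w| shake_word(w) }
end

puts shake_line("The system's pre-defined order, even if \"quoted\".")
```

With this approach "pre-defined" is treated as two letter runs, "pre" and "defined", and the hyphen between them survives.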

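It would be simpler to ignore commas, periods and other symbols, but I wrote a regular expression that at least covers the common ones. Running the first paragraph shown by "man man" through my script gives the following (the original first, then the scrambled version).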

man is the system's manual pager. Each page argument given to man is normally the name of a program, utility or function. The manual page associated with each of these arguments is then found and displayed. A section, if provided, will direct man to look only in that section of the manual. The default action is to search in all of the available sections, following a pre-defined order and to show only the first page found, even if page exists in several sections.

man is the sytsem manaul pegar. Each page amuegrnt gevin to man is nallmory the name of a poargrm, uiitlty or ftuoicnn. The muanal pgae aeaocitssd wtih each of tehse aetgrmuns is then fnuod and dieaslpyd. A scetion, if povdierd, will decrit man to look only in taht seocitn of the mnuaal. The duealft actoin is to sercah in all of the avlalaibe sonitecs, folniowlg a pre oderr and to show olny the fisrt page fuond, eevn if pgae eistxs in sveaerl snctioes.

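The result is what I intended, but the converted text is somehow hard to read. Most of it is fine, but when "sections" turns into "sonitecs" or "snctioes", it gets pretty tough. Apparently a completely random shuffle does not work. I suspect it would make a big difference to preserve things like the order in which vowels and consonants appear, or frequently occurring letter combinations. Alternatively, a text should be easy to read if its meaning is very simple and its vocabulary falls within what the reader expects. I have not used man much lately, so maybe that is why the word "section" is hard. Either way, the story is not that simple, and I was fooled by the image above.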
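
The vowel/consonant idea could be tried with something like the following. This is only a rough sketch: `VOWELS` and `shake_keep_pattern` are names made up here, "vowel" simply means a, e, i, o, u, and whether the output really is easier to read has not been tested.

```ruby
VOWELS = "aeiouAEIOU"

# Hypothetical variant: shuffle the interior letters, but keep every
# vowel slot a vowel and every consonant slot a consonant.
def shake_keep_pattern(word)
  return word if word.size <= 3
  inner = word[1..-2].chars
  vowels, consonants = inner.partition { |c| VOWELS.include?(c) }
  vowels.shuffle!
  consonants.shuffle!
  # Refill each slot from the shuffled pool of the same class.
  scrambled = inner.map { |c| VOWELS.include?(c) ? vowels.shift : consonants.shift }
  word[0] + scrambled.join + word[-1]
end

puts shake_keep_pattern("sections")
```

For "sections" the interior "ection" keeps its vowel/consonant layout, so only e, i, o swap among themselves and c, t, n among themselves.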

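At first I used "str.split(//).shift" to take the first character of the string and "str.split(//).pop" to take the last one, but I figured plain index access would be better after all and rewrote it. A quick measurement showed index access to be about three times faster. But the difference only becomes measurable after about a million loop iterations, so there is not much point in being stingy about it. Besides, semantically "pop" is easier to understand.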

end_letter = split(//).pop
end_letter = self[self.size-1].chr
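
A comparison like this could be reproduced with the standard Benchmark library, roughly as follows. This is a sketch under assumptions, not the measurement actually used for the post: the test word, the method name `compare_last_letter` and the iteration count are all made up here, and the timings will depend on the Ruby version and the machine.

```ruby
require 'benchmark'

# Hypothetical helper: time both ways of taking the last character.
def compare_last_letter(word, n)
  Benchmark.bm(10) do |x|
    x.report("split/pop") { n.times { word.split(//).pop } }
    x.report("index")     { n.times { word[word.size - 1].chr } }
  end
end

# The post mentions about a million iterations; a smaller count keeps
# this example quick. Raise it to see a clearer difference.
compare_last_letter("sections", 100_000)
```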