
[openpgp] Crowdsourcing Base 2^14

2015-04-29 09:15:31
On Wed, Apr 29, 2015 at 6:08 AM, Neal H. Walfield
<neal@walfield.org> wrote:

> I wonder if less is not more.
>
> If you look at the diceware list, it has "easy to remember words"
> like "aaaa", "abner" and "adair".  And this list is just 7776 words
> long.  These are not only hard for native speakers to memorize, but
> also for those who speak English as a second language.
>
> If we are going to make a new word list, I would recommend using
> something based on the Voice of America simple word list.  This
> includes 1500 simple words, which all English speakers with basic
> proficiency are familiar with.
>
> Alternatively, there is the PGP Biometric word list [1], whose words
> aren't as simple, but are phonetically distinct.
>
> [1] https://en.wikipedia.org/wiki/Biometric_word_list


The larger the alphabet, the shorter the fingerprint. Since there is
no need to keep the images/words on the device, the size of the
dictionary is not that critical.

Fingerprints with the PGP biometric list are rather too long. Looking
at the options, it seems like somewhere between 13 and 16 bits per
glyph (inclusive) is the sweet spot. Above 64K entries, curating the
list is just too hard.

Back in 1995, memory constraints were very different.


I would very much like to keep the size of the fingerprint within the
7+/-2 working memory limit and provide at least 100 effective bits.
That requires each glyph to encode at least 14 bits.
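
For a quick back-of-the-envelope check of that 13-16 bit range against
the 100-bit target (a sketch, not part of the proposal):

    import math

    TARGET_BITS = 100  # desired effective fingerprint strength

    # For each candidate glyph width: how big the curated list must be,
    # and how many glyphs a 100-bit fingerprint then needs.
    for bits in range(13, 17):
        glyphs = math.ceil(TARGET_BITS / bits)
        print(f"{bits} bits/glyph: {2 ** bits:>6} list entries, {glyphs} glyphs")

At 14 bits per glyph that comes to eight glyphs, which sits inside the
7+/-2 band.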

Presenting images in two sets of four seems to work quite well on an
Apple Watch. And a smartphone seems to be able to present eight at
once without too much hassle.

The big advantage of 14 bits is that it then allows a direct mapping
to the CJK Unified Ideographs in Unicode.
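
For illustration, a minimal sketch of such a mapping, assuming the
block base U+4E00 (the exact base and ordering here are assumptions
for the sketch, not a spec):

    # Map a 14-bit value directly onto the CJK Unified Ideographs block
    # (U+4E00..U+9FFF holds 20,992 code points, so all 2**14 values fit).
    CJK_BASE = 0x4E00  # assumed starting code point

    def glyph_for(value: int) -> str:
        """Return the CJK character standing in for one 14-bit symbol."""
        if not 0 <= value < 2 ** 14:
            raise ValueError("value must fit in 14 bits")
        return chr(CJK_BASE + value)

    print(glyph_for(0), glyph_for(2 ** 14 - 1))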



This looks to me to be an excellent opportunity to engage the wider
community and to crowdsource parts of the process. There are hundreds
of people willing to help. Give each person a part of the image space
to curate and we can have the process done pretty quickly.

So let's say someone has 'road motor transport' for 256 entries. She
then breaks that down into 'cars', 'trucks', 'buses' and
'motorcycles', and then within each category finds 64 distinctly
different examples. Someone else does the same for 'unpowered
transport', 'marine transport', etc.
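
Purely as an illustration of the bookkeeping, with the slice and block
sizes taken from the example above:

    # Carve the 2**14 index space into 256-entry slices, one per curator,
    # each split into four 64-entry sub-categories.
    TOTAL, SLICE, SUB = 2 ** 14, 256, 64

    def slice_for(curator: int):
        """Index ranges making up one curator's 256-entry slice."""
        base = curator * SLICE
        assert base + SLICE <= TOTAL
        return [(base + i * SUB, base + (i + 1) * SUB)
                for i in range(SLICE // SUB)]

    # Curator 0 might own 'road motor transport': four blocks of 64
    # for cars, trucks, buses and motorcycles.
    print(slice_for(0))  # [(0, 64), (64, 128), (128, 192), (192, 256)]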

A wiki is probably sufficient for the necessary collaboration.


The purpose of this isn't just to get the best result. Engage the
community and they become advocates and early adopters. And we need
advocates who are not from the crypto community.


For the word lists, I am thinking that the best approach is to start
off with a fairly large dictionary and filter it by putting it through
Google Translate and seeing which distinct words survive translation
from English to French and back.
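
A sketch of that round-trip filter; the translate() helper is
hypothetical and stands in for whichever translation service ends up
being used:

    def translate(word: str, src: str, dst: str) -> str:
        # Hypothetical helper: plug in a real machine-translation service.
        raise NotImplementedError

    def survives_round_trip(word: str) -> bool:
        """Keep a word only if English -> French -> English returns it unchanged."""
        french = translate(word, "en", "fr")
        back = translate(french, "fr", "en")
        return back.lower() == word.lower()

    def filter_dictionary(words):
        """Distinct words that survive the round trip."""
        return sorted({w for w in words if survives_round_trip(w)})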

Then take the dictionary, machine translate it into sixteen-odd
different languages as a starting point, and compute Merkle trees over
each individual corpus.
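
A sketch of the per-corpus Merkle tree; SHA-256 and duplicating an odd
trailing node are assumptions, not part of the proposal:

    import hashlib

    def sha256(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(corpus) -> bytes:
        """Merkle root over one translated word list."""
        level = [sha256(word.encode("utf-8")) for word in corpus]
        if not level:
            return sha256(b"")
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])  # duplicate an odd trailing node
            level = [sha256(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    # One root per corpus, e.g. merkle_root(english_words)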


Probably the thing to do is begin with a Base 2^14 scheme, which could
be expanded to 2^16 if desired.
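
As a rough sketch of the framing, assuming the fingerprint arrives as
raw bytes and is simply chopped into 14-bit groups, zero-padded at the
end (the packing details are assumptions, not a spec):

    BITS = 14  # switch to 16 for the expanded scheme

    def base_2e14_encode(fingerprint: bytes):
        """Return the 14-bit symbol values for a fingerprint."""
        total_bits = len(fingerprint) * 8
        nsymbols = -(-total_bits // BITS)        # ceiling division
        value = int.from_bytes(fingerprint, "big")
        value <<= nsymbols * BITS - total_bits   # zero-pad the final group
        return [(value >> (BITS * i)) & (2 ** BITS - 1)
                for i in reversed(range(nsymbols))]

    # A 13-byte (104-bit) fingerprint becomes 8 symbols.
    print(len(base_2e14_encode(bytes(13))))  # 8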

