QNLP

Quantum Natural Language Processing

Example of parameter analysis

Effect of pre-processing parameters on encoding patterns

%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from mpl_toolkits.mplot3d import Axes3D

import pandas as pd

Pre-processing our data is a necessary step in determining how language tokens are mapped to quantum states for encoding. We can choose from a variety of methods to associate tokens with one another, with varying degrees of control. While it is possible to fully saturate the encoding using the entire available pattern set, fine-tuning the problem space during pre-processing can be beneficial.

Here we explore the differences observed in the number of available sentences, and subsequently in the number of unique encoding patterns, by controlling the following set of variables:

  • Number of basis elements for noun and verb sets (NUM_BASIS_NOUN and NUM_BASIS_VERB)
  • Basis-to-composite token cutoff distance for association (BASIS_NOUN_DIST_CUTOFF and BASIS_VERB_DIST_CUTOFF)
  • Composite verb to composite noun cutoff distance for association (VERB_NOUN_DIST_CUTOFF)

As measurable outputs we observe the number of composite-token sentences the pre-processing steps create, as well as the number of unique encoding patterns, obtained by tensoring the basis-element sets of the composite tokens.

While encoding a large number of patterns may be possible, with the upper limit determined by the number of encoding patterns available for the chosen schema (number of noun patterns^2 * number of verb patterns), it is often more instructive to examine a sparsely occupied set of states.
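As a quick check, the following minimal sketch computes this upper bound, under the assumption that the number of available patterns per token type equals the chosen basis size; the helper name is purely illustrative:

def max_encoding_patterns(num_noun_patterns, num_verb_patterns):
    #Each pattern combines a subject noun, a verb, and an object noun,
    #giving noun_patterns^2 * verb_patterns combinations.
    return num_noun_patterns**2 * num_verb_patterns

#With 10 noun and 10 verb patterns this gives 1000, consistent with the
#saturation visible in the final rows of the dataframe loaded below.
print(max_encoding_patterns(10, 10))  # 1000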

Assuming the pre-processing stage has successfully chosen the token-order mapping for the given sets of basis elements, encoding few tokens relative to the total available token-space still allows the method to be observed and compared. As such, we opt for parameters that follow this approach. It should also allow easier comparison with data that has not been encoded, where meaning is still preserved due to the way the procedure is set up.

"""
Here we use a file in which each line is a dictionary of the parameters

NUM_BASIS_NOUN, NUM_BASIS_VERB, BASIS_NOUN_DIST_CUTOFF, BASIS_VERB_DIST_CUTOFF, VERB_NOUN_DIST_CUTOFF

where each parameter above is a key whose value is the setting used for that run.

Additionally, the result values 
sentences, patterns

hold the number of observed values for both quantities from the run.
"""
dict_file = "path_to_file.out"
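For reference, a single line of such a file would look like the following (these values match the first row of the dataframe shown below):

{'sentences': 3, 'patterns': 3, 'NUM_BASIS_NOUN': 2, 'NUM_BASIS_VERB': 2, 'BASIS_NOUN_DIST_CUTOFF': 1, 'BASIS_VERB_DIST_CUTOFF': 1, 'VERB_NOUN_DIST_CUTOFF': 1}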

We begin by loading the data file line by line into a pandas dataframe object, which lets us easily filter and select the data.

import ast

#Parse each printed dict safely with ast.literal_eval rather than eval, and
#build the dataframe in a single pass instead of appending row by row
#(DataFrame.append is deprecated in recent pandas versions).
records = []
with open(dict_file) as file:
    for line in file:
        line = line.strip()
        if line:  #skip any blank lines
            records.append(ast.literal_eval(line))
df = pd.DataFrame(records)
df

sentences patterns NUM_BASIS_NOUN NUM_BASIS_VERB BASIS_NOUN_DIST_CUTOFF BASIS_VERB_DIST_CUTOFF VERB_NOUN_DIST_CUTOFF
0 3 3 2 2 1 1 1
1 14 7 2 2 1 1 2
2 54 8 2 2 1 1 3
3 76 8 2 2 1 1 4
4 111 8 2 2 1 1 5
... ... ... ... ... ... ... ...
3120 116 384 10 10 5 5 1
3121 1872 967 10 10 5 5 2
3122 7534 1000 10 10 5 5 3
3123 14149 1000 10 10 5 5 4
3124 23424 1000 10 10 5 5 5

3125 rows × 7 columns

With the data loaded, we may now slice and view it to observe the relationships between the parameters and the recorded sentences and patterns. Given we have a 5D parameter space, we can project onto a chosen subset of axes, flattening over the non-observed parameters, as below:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "sentences"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])

or, to be more clever, encode additional information into the size and colour of the scatter points, like so:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "sentences"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    s=df["BASIS_VERB_DIST_CUTOFF"]*10,
    c=df[plot_order[3]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[-1], rotation=90)

As with the sentences above, we can mirror the procedure for unique patterns:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "VERB_NOUN_DIST_CUTOFF", "patterns"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    s=df[plot_order[3]]*10,
    c=df[plot_order[4]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[-1], rotation=90)

As a quick and naive analysis, we can observe that the spread of patterns tends to become much larger as the number of available basis tokens in each space increases. This indicates that the expressiveness of the problem grows with larger bases, which makes sense. However, with this increase we also require additional work to encode more patterns, which can quickly become infeasible given the dependence on the number of gate calls in both NISQ devices and simulators in general.

Therefore, having many basis elements while maintaining a small set of unique patterns offers a good compromise, allowing us to demonstrate the proposed method’s ability to encode, compare, and establish similarity of sentences.

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "patterns"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])


To aid the eye, drawing connecting lines between points can help in determining where points lie relative to the grid, and to one another.

max_patterns = 30
df_sub = df.where(df["patterns"] < max_patterns).dropna()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "patterns", "BASIS_VERB_DIST_CUTOFF"]
p3d = ax.scatter3D(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    c = df_sub[plot_order[3]],    
    s = df_sub[plot_order[4]]*10,
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[3], rotation=90)
ax.plot(
    df_sub[plot_order[0]], 
    df_sub[plot_order[1]], 
    df_sub[plot_order[2]],
    'k--',
    alpha=0.8,
    linewidth=0.5
)

As discussed earlier, choosing a sparse set of data patterns to encode is often best, and so we can determine the optimal set of parameters within a given range as follows:

max_patterns = 30
min_patterns = 25

#Filter within range, and ensure patterns is less than sentences
df_sub = df.where(df["patterns"] < max_patterns)
df_sub = df_sub.where(df["patterns"] > min_patterns)
df_sub = df_sub.where(df["patterns"] < df["sentences"])
df_sub = df_sub.dropna()
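
#Equivalently, a single boolean mask selects the same subset in one step,
#avoiding the intermediate NaN-filled frames from the chained where() calls:
#    mask = (df["patterns"] > min_patterns) & (df["patterns"] < max_patterns) \
#           & (df["patterns"] < df["sentences"])
#    df_sub = df[mask]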

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

#Use only integer values
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.zaxis.set_major_locator(MaxNLocator(integer=True))


ax.set_proj_type('ortho')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "patterns", "BASIS_VERB_DIST_CUTOFF"]
p3d = ax.scatter3D(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    c = df_sub[plot_order[3]],    
    s = df_sub[plot_order[4]]*10,
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])

cb = plt.colorbar(p3d)
cb.set_label(plot_order[3], rotation=90)
cb.set_ticks(range(min_patterns, max_patterns+1))

ax.plot(
    df_sub[plot_order[0]], 
    df_sub[plot_order[1]], 
    df_sub[plot_order[2]],
    'k--',
    alpha=0.8,
    linewidth=0.5
)

We can now get the list of parameters that match the constraints given above:

df_sub

sentences patterns NUM_BASIS_NOUN NUM_BASIS_VERB BASIS_NOUN_DIST_CUTOFF BASIS_VERB_DIST_CUTOFF VERB_NOUN_DIST_CUTOFF
377 133.0 28.0 2.0 8.0 1.0 1.0 3.0
381 60.0 26.0 2.0 8.0 1.0 2.0 2.0
386 66.0 29.0 2.0 8.0 1.0 3.0 2.0
391 69.0 29.0 2.0 8.0 1.0 4.0 2.0
401 222.0 26.0 2.0 8.0 2.0 1.0 2.0
406 254.0 29.0 2.0 8.0 2.0 2.0 2.0
490 86.0 26.0 2.0 8.0 5.0 4.0 1.0
495 86.0 26.0 2.0 8.0 5.0 5.0 1.0
501 56.0 29.0 2.0 10.0 1.0 1.0 2.0
610 86.0 26.0 2.0 10.0 5.0 3.0 1.0
615 86.0 28.0 2.0 10.0 5.0 4.0 1.0
620 86.0 28.0 2.0 10.0 5.0 5.0 1.0
629 146.0 28.0 4.0 2.0 1.0 1.0 5.0
632 151.0 29.0 4.0 2.0 1.0 2.0 3.0
636 71.0 26.0 4.0 2.0 1.0 3.0 2.0
637 168.0 29.0 4.0 2.0 1.0 3.0 3.0
641 84.0 28.0 4.0 2.0 1.0 4.0 2.0
642 193.0 29.0 4.0 2.0 1.0 4.0 3.0
646 90.0 28.0 4.0 2.0 1.0 5.0 2.0
652 480.0 28.0 4.0 2.0 2.0 1.0 3.0
690 65.0 27.0 4.0 2.0 3.0 4.0 1.0
695 65.0 27.0 4.0 2.0 3.0 5.0 1.0
701 560.0 28.0 4.0 2.0 4.0 1.0 2.0
715 86.0 29.0 4.0 2.0 4.0 4.0 1.0
720 86.0 29.0 4.0 2.0 4.0 5.0 1.0
740 99.0 29.0 4.0 2.0 5.0 4.0 1.0
745 99.0 29.0 4.0 2.0 5.0 5.0 1.0
751 41.0 26.0 4.0 4.0 1.0 1.0 2.0
805 64.0 27.0 4.0 4.0 3.0 2.0 1.0
830 82.0 29.0 4.0 4.0 4.0 2.0 1.0
925 52.0 28.0 4.0 6.0 3.0 1.0 1.0
1050 52.0 28.0 4.0 8.0 3.0 1.0 1.0
1310 60.0 26.0 6.0 2.0 3.0 3.0 1.0
1335 81.0 28.0 6.0 2.0 4.0 3.0 1.0
1355 88.0 26.0 6.0 2.0 5.0 2.0 1.0
1450 71.0 29.0 6.0 4.0 4.0 1.0 1.0
1925 44.0 28.0 8.0 2.0 3.0 1.0 1.0
1950 56.0 29.0 8.0 2.0 4.0 1.0 1.0
2155 29.0 28.0 8.0 6.0 2.0 2.0 1.0
2540 32.0 29.0 10.0 2.0 2.0 4.0 1.0
2545 32.0 29.0 10.0 2.0 2.0 5.0 1.0
2550 44.0 28.0 10.0 2.0 3.0 1.0 1.0
2655 31.0 27.0 10.0 4.0 2.0 2.0 1.0
3025 28.0 27.0 10.0 10.0 2.0 1.0 1.0

If we wish to maintain a list of parameters that maximises the number of available nouns, we can sub-select the rows where NUM_BASIS_NOUN=10 as:

df_sub.where(df_sub["NUM_BASIS_NOUN"] == 10).dropna()

sentences patterns NUM_BASIS_NOUN NUM_BASIS_VERB BASIS_NOUN_DIST_CUTOFF BASIS_VERB_DIST_CUTOFF VERB_NOUN_DIST_CUTOFF
2540 32.0 29.0 10.0 2.0 2.0 4.0 1.0
2545 32.0 29.0 10.0 2.0 2.0 5.0 1.0
2550 44.0 28.0 10.0 2.0 3.0 1.0 1.0
2655 31.0 27.0 10.0 4.0 2.0 2.0 1.0
3025 28.0 27.0 10.0 10.0 2.0 1.0 1.0

Selecting the row with index 2655, we can encode 27 unique patterns using the following parameters:

df.iloc[2655]
sentences                 31
patterns                  27
NUM_BASIS_NOUN            10
NUM_BASIS_VERB             4
BASIS_NOUN_DIST_CUTOFF     2
BASIS_VERB_DIST_CUTOFF     2
VERB_NOUN_DIST_CUTOFF      1
Name: 2655, dtype: int64
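
Should these values be fed back into the pre-processing pipeline, a minimal sketch for extracting just the tunable parameters from the selected row could look as follows; the params name is purely illustrative:

#Keep only the pre-processing parameters, dropping the measured outputs
params = df.iloc[2655].drop(["sentences", "patterns"]).to_dict()
print(params)
#{'NUM_BASIS_NOUN': 10, 'NUM_BASIS_VERB': 4, 'BASIS_NOUN_DIST_CUTOFF': 2,
# 'BASIS_VERB_DIST_CUTOFF': 2, 'VERB_NOUN_DIST_CUTOFF': 1}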

Aside: the submission script used to generate the above job output.

scr="""
#!/bin/bash
#SBATCH -J qubits_full
#SBATCH -N 1

#SBATCH -p GpuQ
#SBATCH -t 18:00:00
#SBATCH -A "ichec001"
#SBATCH --mail-user=lee.oriordan@ichec.ie
#SBATCH --mail-type=ALL

#no extra settings
cd /ichec/work/ichec001/loriordan_scratch/intel-qnlp-iqs2/build
module load intel/2019u5 gcc cmake3
source ../load_env.sh

for nbn in $(seq 2 2 10);do
for nbv in $(seq 2 2 10);do
for bndc in $(seq 1 1 5);do
for bvdc in $(seq 1 1 5);do
for vndc in $(seq 1 1 5);do

echo "NUM_BASIS_NOUN=${nbn} NUM_BASIS_VERB=${nbv} BASIS_NOUN_DIST_CUTOFF=${bndc} BASIS_VERB_DIST_CUTOFF=${bvdc} VERB_NOUN_DIST_CUTOFF=${vndc}"
NUM_BASIS_NOUN=${nbn} NUM_BASIS_VERB=${nbv} BASIS_NOUN_DIST_CUTOFF=${bndc} BASIS_VERB_DIST_CUTOFF=${bvdc} VERB_NOUN_DIST_CUTOFF=${vndc} srun --ntasks 32 -c 1 --cpu-bind=cores -m plane=32 python ../modules/py/scripts/QNLP_EndToEnd_MPI.py
echo ""

done
done
done
done
done
"""

note="""
The ../modules/py/scripts/QNLP_EndToEnd_MPI.py file was modified to exit upon calculating the 
vec_to_encode (i.e. the patterns) list, and prints the following:

    d ={"sentences" : len(sentences),
        "patterns" : len(vec_to_encode),
        "NUM_BASIS_NOUN" : NUM_BASIS_NOUN,
        "NUM_BASIS_VERB" : NUM_BASIS_VERB,
        "BASIS_NOUN_DIST_CUTOFF" : BASIS_NOUN_DIST_CUTOFF,
        "BASIS_VERB_DIST_CUTOFF" : BASIS_VERB_DIST_CUTOFF,
        "VERB_NOUN_DIST_CUTOFF" : VERB_NOUN_DIST_CUTOFF
    }
    print(d)

"""