Example of parameter analysis
Effect of pre-processing parameters on encoding patterns
%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from matplotlib.ticker import MaxNLocator
Pre-processing our data is a necessary step in determining how language tokens are mapped to quantum states for encoding. We can choose from a variety of methods to associate tokens with one another, with varying degrees of control. While it is possible to fully saturate and encode data using the entire available pattern set, fine-tuning the problem space during pre-processing can be beneficial.
Here we explore the differences observed in available sentences, and subsequently in unique encoding patterns, by controlling the following set of variables:

- Number of basis elements for the noun and verb sets (NUM_BASIS_NOUN and NUM_BASIS_VERB)
- Basis-to-composite token cutoff distance for association (BASIS_NOUN_DIST_CUTOFF and BASIS_VERB_DIST_CUTOFF)
- Composite-verb to composite-noun cutoff distance for association (VERB_NOUN_DIST_CUTOFF)
As measurable outputs, we can observe the number of sentences of composite tokens that the pre-processing steps create, as well as the number of unique encoding patterns, obtained by tensoring the composite tokens' basis element sets.
While encoding a large number of patterns may be possible, with an upper limit determined by the number of encoding patterns available for the chosen schema ((number of noun patterns)^2 × (number of verb patterns)), it is often more instructive to examine a sparsely occupied set of states.
Assuming the pre-processing stage has successfully chosen the token-order mapping for the given sets of basis elements, encoding few tokens relative to the total available token-space still allows the same observation of the method for comparison. As such, we opt to choose parameters that follow this approach. It should also make comparisons easier with data that has not been encoded, where meaning is still preserved due to the methods proposed to set up the procedure.
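To make the upper limit concrete, here is a quick sketch with illustrative basis sizes (these particular values are not taken from any specific run below):

# Upper limit of encoding patterns for the noun-verb-noun schema:
# (number of noun patterns)^2 * (number of verb patterns)
num_noun_patterns = 10  # illustrative
num_verb_patterns = 4   # illustrative

print(num_noun_patterns**2 * num_verb_patterns)  # 400 available patterns

A sparse encoding in this space would then occupy only a few tens of those 400 slots.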
"""
Here we use a file in which each line is a dictionary of the parameters
NUM_BASIS_NOUN, NUM_BASIS_VERB, BASIS_NOUN_DIST_CUTOFF, BASIS_VERB_DIST_CUTOFF, VERB_NOUN_DIST_CUTOFF,
where each parameter above is a key whose value is the setting used for that run.
Additionally, the result keys
sentences, patterns
hold the observed counts of both quantities from the run.
"""
dict_file = "path_to_file.out"
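For concreteness, a single line of such a file might look like the following (the values here are illustrative, matching one of the runs summarised later):

{'sentences': 31, 'patterns': 27, 'NUM_BASIS_NOUN': 10, 'NUM_BASIS_VERB': 4, 'BASIS_NOUN_DIST_CUTOFF': 2, 'BASIS_VERB_DIST_CUTOFF': 2, 'VERB_NOUN_DIST_CUTOFF': 1}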
We begin by loading the data file line by line and encoding it into a pandas DataFrame object, which lets us easily filter and select the data.
import ast

# Read each line as a dict literal; ast.literal_eval is safer than eval,
# and building the frame once avoids the deprecated DataFrame.append
rows = []
with open(dict_file) as file:
    for line in file:
        rows.append(ast.literal_eval(line))
df = pd.DataFrame(rows)
df
| | sentences | patterns | NUM_BASIS_NOUN | NUM_BASIS_VERB | BASIS_NOUN_DIST_CUTOFF | BASIS_VERB_DIST_CUTOFF | VERB_NOUN_DIST_CUTOFF |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 2 | 2 | 1 | 1 | 1 |
| 1 | 14 | 7 | 2 | 2 | 1 | 1 | 2 |
| 2 | 54 | 8 | 2 | 2 | 1 | 1 | 3 |
| 3 | 76 | 8 | 2 | 2 | 1 | 1 | 4 |
| 4 | 111 | 8 | 2 | 2 | 1 | 1 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3120 | 116 | 384 | 10 | 10 | 5 | 5 | 1 |
| 3121 | 1872 | 967 | 10 | 10 | 5 | 5 | 2 |
| 3122 | 7534 | 1000 | 10 | 10 | 5 | 5 | 3 |
| 3123 | 14149 | 1000 | 10 | 10 | 5 | 5 | 4 |
| 3124 | 23424 | 1000 | 10 | 10 | 5 | 5 | 5 |

3125 rows × 7 columns
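Before any plotting, a quick aggregate view can already hint at the trends; a minimal sketch using a groupby over the basis-size parameters:

# Mean sentences and patterns for each (noun, verb) basis size,
# averaged over the three cutoff parameters
summary = df.groupby(["NUM_BASIS_NOUN", "NUM_BASIS_VERB"])[["sentences", "patterns"]].mean()
print(summary.head())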
With the data loaded, we may now slice and view it in any way we choose to observe the relationships between the parameters and the recorded sentences and patterns. Given that we have a 5D parameter space, we can view the data flattened over the parameters not plotted, as below:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "sentences"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    edgecolors='k'  # note: a cmap only takes effect with a colour array, so none is given here
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
[Interactive 3D scatter output: sentences against NUM_BASIS_NOUN and NUM_BASIS_VERB]
Or we can be more clever and encode additional information into the size and colour of the scatter-plot markers, like so:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "sentences"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    s=df["BASIS_VERB_DIST_CUTOFF"]*10,  # marker size encodes BASIS_VERB_DIST_CUTOFF
    c=df[plot_order[3]],                # colour encodes sentences
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[-1], rotation=90)
[Interactive 3D scatter output: sentences (colour) against NUM_BASIS_NOUN, NUM_BASIS_VERB and BASIS_NOUN_DIST_CUTOFF, with marker size from BASIS_VERB_DIST_CUTOFF]
As above for sentences, we can mirror the procedure for the unique patterns:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "VERB_NOUN_DIST_CUTOFF", "patterns"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    s=df[plot_order[3]]*10,
    c=df[plot_order[4]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[-1], rotation=90)
[Interactive 3D scatter output: patterns (colour) against NUM_BASIS_NOUN, NUM_BASIS_VERB and BASIS_NOUN_DIST_CUTOFF, with marker size from VERB_NOUN_DIST_CUTOFF]
As a quick and naive analysis, we can observe that the spread of patterns tends to become much larger as the number of available basis tokens in each space increases. This indicates that the expressiveness of the problem grows with larger bases, which makes sense. However, with this increase we also require additional work to encode more patterns, which can quickly become infeasible when considering the dependence on the number of gate calls in both NISQ devices and simulators in general.
Therefore, having many basis elements while maintaining a small set of unique patterns offers a good compromise, allowing us to demonstrate the proposed method's ability to encode, compare, and demonstrate similarity of sentences.
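To make the resource argument concrete, the sketch below estimates register sizes implied by the noun ⊗ verb ⊗ noun tensoring suggested by the pattern-count formula above; this is a rough illustration under that assumption, not the exact resource model of the implementation.

import math

def estimate_register_qubits(num_basis_noun, num_basis_verb):
    """Rough qubit-count estimate, assuming one register per token slot
    in a noun (x) verb (x) noun layout (an assumption for illustration)."""
    noun_qubits = math.ceil(math.log2(num_basis_noun))
    verb_qubits = math.ceil(math.log2(num_basis_verb))
    return 2 * noun_qubits + verb_qubits

# The register grows only logarithmically with basis size, but each
# additional encoded pattern still costs further state-preparation gates.
print(estimate_register_qubits(4, 4))    # 6
print(estimate_register_qubits(10, 10))  # 12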
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "patterns"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    edgecolors='k'  # as before, no colour array is given, so no cmap is needed
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
[Interactive 3D scatter output: patterns against NUM_BASIS_NOUN and NUM_BASIS_VERB]
To aid the eye, drawing connecting lines between points can help in determining where points lie relative to the grid, and to one another.
max_patterns = 30
df_sub = df.where(df["patterns"] < max_patterns).dropna()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "patterns", "BASIS_VERB_DIST_CUTOFF"]
p3d = ax.scatter3D(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    c=df_sub[plot_order[3]],
    s=df_sub[plot_order[4]]*10,
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[3], rotation=90)
# Dashed guide lines connecting the points in index order
ax.plot(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    'k--',
    alpha=0.8,
    linewidth=0.5
)
[Interactive 3D scatter output with connecting guide lines, filtered to patterns < 30]
As discussed earlier, choosing a sparse set of data patterns to encode can be best, and so we can determine the optimal set of parameters within a given range as follows:
max_patterns = 30
min_patterns = 25

# Filter within range, and ensure patterns is less than sentences
df_sub = df.where(df["patterns"] < max_patterns)
df_sub = df_sub.where(df["patterns"] > min_patterns)
df_sub = df_sub.where(df["patterns"] < df["sentences"])
df_sub = df_sub.dropna()

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Use only integer values on the axes
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.zaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_proj_type('ortho')

plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "patterns", "BASIS_VERB_DIST_CUTOFF"]
p3d = ax.scatter3D(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    c=df_sub[plot_order[3]],
    s=df_sub[plot_order[4]]*10,
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[3], rotation=90)
cb.set_ticks(range(min_patterns, max_patterns+1))
ax.plot(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    'k--',
    alpha=0.8,
    linewidth=0.5
)
[Interactive 3D scatter output with connecting guide lines, filtered to 25 < patterns < 30]
We can now get the list of parameters that match the constraints given above:
df_sub
| | sentences | patterns | NUM_BASIS_NOUN | NUM_BASIS_VERB | BASIS_NOUN_DIST_CUTOFF | BASIS_VERB_DIST_CUTOFF | VERB_NOUN_DIST_CUTOFF |
|---|---|---|---|---|---|---|---|
| 377 | 133.0 | 28.0 | 2.0 | 8.0 | 1.0 | 1.0 | 3.0 |
| 381 | 60.0 | 26.0 | 2.0 | 8.0 | 1.0 | 2.0 | 2.0 |
| 386 | 66.0 | 29.0 | 2.0 | 8.0 | 1.0 | 3.0 | 2.0 |
| 391 | 69.0 | 29.0 | 2.0 | 8.0 | 1.0 | 4.0 | 2.0 |
| 401 | 222.0 | 26.0 | 2.0 | 8.0 | 2.0 | 1.0 | 2.0 |
| 406 | 254.0 | 29.0 | 2.0 | 8.0 | 2.0 | 2.0 | 2.0 |
| 490 | 86.0 | 26.0 | 2.0 | 8.0 | 5.0 | 4.0 | 1.0 |
| 495 | 86.0 | 26.0 | 2.0 | 8.0 | 5.0 | 5.0 | 1.0 |
| 501 | 56.0 | 29.0 | 2.0 | 10.0 | 1.0 | 1.0 | 2.0 |
| 610 | 86.0 | 26.0 | 2.0 | 10.0 | 5.0 | 3.0 | 1.0 |
| 615 | 86.0 | 28.0 | 2.0 | 10.0 | 5.0 | 4.0 | 1.0 |
| 620 | 86.0 | 28.0 | 2.0 | 10.0 | 5.0 | 5.0 | 1.0 |
| 629 | 146.0 | 28.0 | 4.0 | 2.0 | 1.0 | 1.0 | 5.0 |
| 632 | 151.0 | 29.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 |
| 636 | 71.0 | 26.0 | 4.0 | 2.0 | 1.0 | 3.0 | 2.0 |
| 637 | 168.0 | 29.0 | 4.0 | 2.0 | 1.0 | 3.0 | 3.0 |
| 641 | 84.0 | 28.0 | 4.0 | 2.0 | 1.0 | 4.0 | 2.0 |
| 642 | 193.0 | 29.0 | 4.0 | 2.0 | 1.0 | 4.0 | 3.0 |
| 646 | 90.0 | 28.0 | 4.0 | 2.0 | 1.0 | 5.0 | 2.0 |
| 652 | 480.0 | 28.0 | 4.0 | 2.0 | 2.0 | 1.0 | 3.0 |
| 690 | 65.0 | 27.0 | 4.0 | 2.0 | 3.0 | 4.0 | 1.0 |
| 695 | 65.0 | 27.0 | 4.0 | 2.0 | 3.0 | 5.0 | 1.0 |
| 701 | 560.0 | 28.0 | 4.0 | 2.0 | 4.0 | 1.0 | 2.0 |
| 715 | 86.0 | 29.0 | 4.0 | 2.0 | 4.0 | 4.0 | 1.0 |
| 720 | 86.0 | 29.0 | 4.0 | 2.0 | 4.0 | 5.0 | 1.0 |
| 740 | 99.0 | 29.0 | 4.0 | 2.0 | 5.0 | 4.0 | 1.0 |
| 745 | 99.0 | 29.0 | 4.0 | 2.0 | 5.0 | 5.0 | 1.0 |
| 751 | 41.0 | 26.0 | 4.0 | 4.0 | 1.0 | 1.0 | 2.0 |
| 805 | 64.0 | 27.0 | 4.0 | 4.0 | 3.0 | 2.0 | 1.0 |
| 830 | 82.0 | 29.0 | 4.0 | 4.0 | 4.0 | 2.0 | 1.0 |
| 925 | 52.0 | 28.0 | 4.0 | 6.0 | 3.0 | 1.0 | 1.0 |
| 1050 | 52.0 | 28.0 | 4.0 | 8.0 | 3.0 | 1.0 | 1.0 |
| 1310 | 60.0 | 26.0 | 6.0 | 2.0 | 3.0 | 3.0 | 1.0 |
| 1335 | 81.0 | 28.0 | 6.0 | 2.0 | 4.0 | 3.0 | 1.0 |
| 1355 | 88.0 | 26.0 | 6.0 | 2.0 | 5.0 | 2.0 | 1.0 |
| 1450 | 71.0 | 29.0 | 6.0 | 4.0 | 4.0 | 1.0 | 1.0 |
| 1925 | 44.0 | 28.0 | 8.0 | 2.0 | 3.0 | 1.0 | 1.0 |
| 1950 | 56.0 | 29.0 | 8.0 | 2.0 | 4.0 | 1.0 | 1.0 |
| 2155 | 29.0 | 28.0 | 8.0 | 6.0 | 2.0 | 2.0 | 1.0 |
| 2540 | 32.0 | 29.0 | 10.0 | 2.0 | 2.0 | 4.0 | 1.0 |
| 2545 | 32.0 | 29.0 | 10.0 | 2.0 | 2.0 | 5.0 | 1.0 |
| 2550 | 44.0 | 28.0 | 10.0 | 2.0 | 3.0 | 1.0 | 1.0 |
| 2655 | 31.0 | 27.0 | 10.0 | 4.0 | 2.0 | 2.0 | 1.0 |
| 3025 | 28.0 | 27.0 | 10.0 | 10.0 | 2.0 | 1.0 | 1.0 |
If we wish to maintain a set of parameters that maximises the number of nouns available, we can sub-select the rows where NUM_BASIS_NOUN=10 as:
df_sub.where(df_sub["NUM_BASIS_NOUN"] == 10).dropna()
| | sentences | patterns | NUM_BASIS_NOUN | NUM_BASIS_VERB | BASIS_NOUN_DIST_CUTOFF | BASIS_VERB_DIST_CUTOFF | VERB_NOUN_DIST_CUTOFF |
|---|---|---|---|---|---|---|---|
| 2540 | 32.0 | 29.0 | 10.0 | 2.0 | 2.0 | 4.0 | 1.0 |
| 2545 | 32.0 | 29.0 | 10.0 | 2.0 | 2.0 | 5.0 | 1.0 |
| 2550 | 44.0 | 28.0 | 10.0 | 2.0 | 3.0 | 1.0 | 1.0 |
| 2655 | 31.0 | 27.0 | 10.0 | 4.0 | 2.0 | 2.0 | 1.0 |
| 3025 | 28.0 | 27.0 | 10.0 | 10.0 | 2.0 | 1.0 | 1.0 |
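The selection can also be done programmatically rather than by reading the table; a minimal sketch using idxmax (note that the row chosen below, 2655, was instead picked by inspection, balancing sentences against patterns):

candidates = df_sub.where(df_sub["NUM_BASIS_NOUN"] == 10).dropna()
best_idx = candidates["patterns"].idxmax()  # first row with the most patterns
candidates.loc[best_idx]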
Selecting the id=2655 row, we can encode 27 unique patterns, using the following parameters:
df.iloc[2655]
sentences 31
patterns 27
NUM_BASIS_NOUN 10
NUM_BASIS_VERB 4
BASIS_NOUN_DIST_CUTOFF 2
BASIS_VERB_DIST_CUTOFF 2
VERB_NOUN_DIST_CUTOFF 1
Name: 2655, dtype: int64
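Since the submission script in the aside below passes these parameters as environment variables, a small helper along the following lines could regenerate the corresponding assignment string (a sketch, assuming the same variable names):

# Drop the result columns, keeping only the run parameters
params = df.iloc[2655].drop(["sentences", "patterns"])
env_string = " ".join(f"{k}={int(v)}" for k, v in params.items())
print(env_string)
# NUM_BASIS_NOUN=10 NUM_BASIS_VERB=4 BASIS_NOUN_DIST_CUTOFF=2 BASIS_VERB_DIST_CUTOFF=2 VERB_NOUN_DIST_CUTOFF=1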
Aside: submission script for the above job output.
scr="""
#!/bin/bash
#SBATCH -J qubits_full
#SBATCH -N 1
#SBATCH -p GpuQ
#SBATCH -t 18:00:00
#SBATCH -A "ichec001"
#SBATCH --mail-user=lee.oriordan@ichec.ie
#SBATCH --mail-type=ALL

# no extra settings
cd /ichec/work/ichec001/loriordan_scratch/intel-qnlp-iqs2/build
module load intel/2019u5 gcc cmake3
source ../load_env.sh

for nbn in $(seq 2 2 10); do
    for nbv in $(seq 2 2 10); do
        for bndc in $(seq 1 1 5); do
            for bvdc in $(seq 1 1 5); do
                for vndc in $(seq 1 1 5); do
                    echo "NUM_BASIS_NOUN=${nbn} NUM_BASIS_VERB=${nbv} BASIS_NOUN_DIST_CUTOFF=${bndc} BASIS_VERB_DIST_CUTOFF=${bvdc} VERB_NOUN_DIST_CUTOFF=${vndc}"
                    NUM_BASIS_NOUN=${nbn} NUM_BASIS_VERB=${nbv} BASIS_NOUN_DIST_CUTOFF=${bndc} BASIS_VERB_DIST_CUTOFF=${bvdc} VERB_NOUN_DIST_CUTOFF=${vndc} srun --ntasks 32 -c 1 --cpu-bind=cores -m plane=32 python ../modules/py/scripts/QNLP_EndToEnd_MPI.py
                    echo ""
                done
            done
        done
    done
done
"""
note="""
The ../modules/py/scripts/QNLP_EndToEnd_MPI.py file was modified to exit upon calculating the
vec_to_encode (i.e. the patterns) list, and to print the following:

d = {"sentences" : len(sentences),
     "patterns" : len(vec_to_encode),
     "NUM_BASIS_NOUN" : NUM_BASIS_NOUN,
     "NUM_BASIS_VERB" : NUM_BASIS_VERB,
     "BASIS_NOUN_DIST_CUTOFF" : BASIS_NOUN_DIST_CUTOFF,
     "BASIS_VERB_DIST_CUTOFF" : BASIS_VERB_DIST_CUTOFF,
     "VERB_NOUN_DIST_CUTOFF" : VERB_NOUN_DIST_CUTOFF
}
print(d)
"""