QNLP

Quantum Natural Language Processing

Example of parameter analysis

Effect of pre-processing parameters on encoding patterns

%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from mpl_toolkits.mplot3d import Axes3D

import pandas as pd

Pre-processing our data is a necessary step in determining how language tokens are mapped to quantum states for encoding. We can choose from a variety of methods to associate tokens with one another, with varying degrees of control. While it is possible to fully saturate the encoding using the entire available pattern set, fine-tuning the problem space during pre-processing can be beneficial.

Here we explore the differences observed in the number of available sentences, and subsequently in the number of unique encoding patterns, by controlling the following set of variables:

  • Number of basis elements for noun and verb sets (NUM_BASIS_NOUN and NUM_BASIS_VERB)
  • Basis-to-composite token cutoff distance for association (BASIS_NOUN_DIST_CUTOFF and BASIS_VERB_DIST_CUTOFF)
  • Composite verb to composite noun cutoff distance for association (VERB_NOUN_DIST_CUTOFF)

As measurable outputs we observe the number of composite-token sentences the pre-processing steps create, as well as the number of unique encoding patterns, obtained by tensoring the basis-element sets of the composite tokens.

While encoding a large number of patterns may be possible, with the upper limit determined by the number of encoding patterns available for the chosen schema (number of noun patterns^2 * number of verb patterns), it is often more instructive to examine a sparsely occupied set of states.
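As a quick check, the following minimal sketch computes this upper bound, under the assumption that the number of available patterns per token type equals the chosen basis size; the helper name is purely illustrative:

def max_encoding_patterns(num_noun_patterns, num_verb_patterns):
    #Each pattern combines a subject noun, a verb, and an object noun,
    #giving noun_patterns^2 * verb_patterns combinations.
    return num_noun_patterns**2 * num_verb_patterns

#With 10 noun and 10 verb patterns this gives 1000, consistent with the
#saturation visible in the final rows of the dataframe loaded below.
print(max_encoding_patterns(10, 10))  # 1000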

Assuming the pre-processing stage has successfully chosen the token-order mapping for the given sets of basis elements, encoding few tokens relative to the total available token-space still allows the method to be observed and compared. As such, we opt for parameters that follow this approach. It should also allow easier comparison with data that has not been encoded, where meaning is still preserved due to the way the procedure is set up.

"""
Here we use a file in which each line is a dictionary of the parameters

NUM_BASIS_NOUN, NUM_BASIS_VERB, BASIS_NOUN_DIST_CUTOFF, BASIS_VERB_DIST_CUTOFF, VERB_NOUN_DIST_CUTOFF

where each parameter above is a key whose value is the setting used for that run.

Additionally, the result values 
sentences, patterns

hold the number of observed values for both quantities from the run.
"""
dict_file = "path_to_file.out"
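For reference, a single line of such a file would look like the following (these values match the first row of the dataframe shown below):

{'sentences': 3, 'patterns': 3, 'NUM_BASIS_NOUN': 2, 'NUM_BASIS_VERB': 2, 'BASIS_NOUN_DIST_CUTOFF': 1, 'BASIS_VERB_DIST_CUTOFF': 1, 'VERB_NOUN_DIST_CUTOFF': 1}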

We begin by loading the data file line by line into a pandas dataframe object, which lets us easily filter and select the data.

import ast

#Parse each printed dict safely with ast.literal_eval rather than eval, and
#build the dataframe in a single pass instead of appending row by row
#(DataFrame.append is deprecated in recent pandas versions).
records = []
with open(dict_file) as file:
    for line in file:
        line = line.strip()
        if line:  #skip any blank lines
            records.append(ast.literal_eval(line))
df = pd.DataFrame(records)
df

sentences patterns NUM_BASIS_NOUN NUM_BASIS_VERB BASIS_NOUN_DIST_CUTOFF BASIS_VERB_DIST_CUTOFF VERB_NOUN_DIST_CUTOFF
0 3 3 2 2 1 1 1
1 14 7 2 2 1 1 2
2 54 8 2 2 1 1 3
3 76 8 2 2 1 1 4
4 111 8 2 2 1 1 5
... ... ... ... ... ... ... ...
3120 116 384 10 10 5 5 1
3121 1872 967 10 10 5 5 2
3122 7534 1000 10 10 5 5 3
3123 14149 1000 10 10 5 5 4
3124 23424 1000 10 10 5 5 5

3125 rows × 7 columns

With the data loaded, we may now slice and view it to observe the relationships between the parameters and the recorded sentences and patterns. Given we have a 5D parameter space, we can project onto a chosen subset of axes, flattening over the non-observed parameters, as below:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "sentences"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])

or, to be more clever, encode additional information into the size and colour of the scatter points, like so:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "sentences"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    s=df["BASIS_VERB_DIST_CUTOFF"]*10,
    c=df[plot_order[3]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[-1], rotation=90)

As with the sentences above, we can mirror the procedure for unique patterns:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "VERB_NOUN_DIST_CUTOFF", "patterns"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    s=df[plot_order[3]]*10,
    c=df[plot_order[4]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[-1], rotation=90)

As a quick and naive analysis, we can observe that the spread of patterns tends to become much larger as the number of available basis tokens in each space increases. This indicates that the expressiveness of the problem grows with larger bases, which makes sense. However, with this increase we also require additional work to encode more patterns, which can quickly become infeasible given the dependence on the number of gate calls in both NISQ devices and simulators in general.

Therefore, having many basis elements while maintaining a small set of unique patterns offers a good compromise, allowing us to demonstrate the proposed method’s ability to encode, compare, and establish similarity of sentences.

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "patterns"]
p3d = ax.scatter3D(
    df[plot_order[0]],
    df[plot_order[1]],
    df[plot_order[2]],
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])


To aid the eye, drawing connecting lines between points can help in determining where points lie relative to the grid, and to one another.

max_patterns = 30
df_sub = df.where(df["patterns"] < max_patterns).dropna()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "patterns", "BASIS_VERB_DIST_CUTOFF"]
p3d = ax.scatter3D(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    c = df_sub[plot_order[3]],    
    s = df_sub[plot_order[4]]*10,
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])
cb = plt.colorbar(p3d)
cb.set_label(plot_order[3], rotation=90)
ax.plot(
    df_sub[plot_order[0]], 
    df_sub[plot_order[1]], 
    df_sub[plot_order[2]],
    'k--',
    alpha=0.8,
    linewidth=0.5
)

As discussed earlier, choosing a sparse set of data patterns to encode is often best, and so we can determine the optimal set of parameters within a given range as follows:

max_patterns = 30
min_patterns = 25

#Filter within range, and ensure patterns is less than sentences
df_sub = df.where(df["patterns"] < max_patterns)
df_sub = df_sub.where(df["patterns"] > min_patterns)
df_sub = df_sub.where(df["patterns"] < df["sentences"])
df_sub = df_sub.dropna()
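
#Equivalently, a single boolean mask selects the same subset in one step,
#avoiding the intermediate NaN-filled frames from the chained where() calls:
#    mask = (df["patterns"] > min_patterns) & (df["patterns"] < max_patterns) \
#           & (df["patterns"] < df["sentences"])
#    df_sub = df[mask]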

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

#Use only integer values
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.zaxis.set_major_locator(MaxNLocator(integer=True))


ax.set_proj_type('ortho')
plot_order = ["NUM_BASIS_NOUN", "NUM_BASIS_VERB", "BASIS_NOUN_DIST_CUTOFF", "patterns", "BASIS_VERB_DIST_CUTOFF"]
p3d = ax.scatter3D(
    df_sub[plot_order[0]],
    df_sub[plot_order[1]],
    df_sub[plot_order[2]],
    c = df_sub[plot_order[3]],    
    s = df_sub[plot_order[4]]*10,
    cmap="RdBu_r",
    edgecolors='k'
)
ax.set_xlabel(plot_order[0])
ax.set_ylabel(plot_order[1])
ax.set_zlabel(plot_order[2])

cb = plt.colorbar(p3d)
cb.set_label(plot_order[3], rotation=90)
cb.set_ticks(range(min_patterns, max_patterns+1))

ax.plot(
    df_sub[plot_order[0]], 
    df_sub[plot_order[1]], 
    df_sub[plot_order[2]],
    'k--',
    alpha=0.8,
    linewidth=0.5
)

We can now get the list of parameters that match the constraints given above:

df_sub

sentences patterns NUM_BASIS_NOUN NUM_BASIS_VERB BASIS_NOUN_DIST_CUTOFF BASIS_VERB_DIST_CUTOFF VERB_NOUN_DIST_CUTOFF
377 133.0 28.0 2.0 8.0 1.0 1.0 3.0
381 60.0 26.0 2.0 8.0 1.0 2.0 2.0
386 66.0 29.0 2.0 8.0 1.0 3.0 2.0
391 69.0 29.0 2.0 8.0 1.0 4.0 2.0
401 222.0 26.0 2.0 8.0 2.0 1.0 2.0
406 254.0 29.0 2.0 8.0 2.0 2.0 2.0
490 86.0 26.0 2.0 8.0 5.0 4.0 1.0
495 86.0 26.0 2.0 8.0 5.0 5.0 1.0
501 56.0 29.0 2.0 10.0 1.0 1.0 2.0
610 86.0 26.0 2.0 10.0 5.0 3.0 1.0
615 86.0 28.0 2.0 10.0 5.0 4.0 1.0
620 86.0 28.0 2.0 10.0 5.0 5.0 1.0
629 146.0 28.0 4.0 2.0 1.0 1.0 5.0
632 151.0 29.0 4.0 2.0 1.0 2.0 3.0
636 71.0 26.0 4.0 2.0 1.0 3.0 2.0
637 168.0 29.0 4.0 2.0 1.0 3.0 3.0
641 84.0 28.0 4.0 2.0 1.0 4.0 2.0
642 193.0 29.0 4.0 2.0 1.0 4.0 3.0
646 90.0 28.0 4.0 2.0 1.0 5.0 2.0
652 480.0 28.0 4.0 2.0 2.0 1.0 3.0
690 65.0 27.0 4.0 2.0 3.0 4.0 1.0
695 65.0 27.0 4.0 2.0 3.0 5.0 1.0
701 560.0 28.0 4.0 2.0 4.0 1.0 2.0
715 86.0 29.0 4.0 2.0 4.0 4.0 1.0
720 86.0 29.0 4.0 2.0 4.0 5.0 1.0
740 99.0 29.0 4.0 2.0 5.0 4.0 1.0
745 99.0 29.0 4.0 2.0 5.0 5.0 1.0
751 41.0 26.0 4.0 4.0 1.0 1.0 2.0
805 64.0 27.0 4.0 4.0 3.0 2.0 1.0
830 82.0 29.0 4.0 4.0 4.0 2.0 1.0
925 52.0 28.0 4.0 6.0 3.0 1.0 1.0
1050 52.0 28.0 4.0 8.0 3.0 1.0 1.0
1310 60.0 26.0 6.0 2.0 3.0 3.0 1.0
1335 81.0 28.0 6.0 2.0 4.0 3.0 1.0
1355 88.0 26.0 6.0 2.0 5.0 2.0 1.0
1450 71.0 29.0 6.0 4.0 4.0 1.0 1.0
1925 44.0 28.0 8.0 2.0 3.0 1.0 1.0
1950 56.0 29.0 8.0 2.0 4.0 1.0 1.0
2155 29.0 28.0 8.0 6.0 2.0 2.0 1.0
2540 32.0 29.0 10.0 2.0 2.0 4.0 1.0
2545 32.0 29.0 10.0 2.0 2.0 5.0 1.0
2550 44.0 28.0 10.0 2.0 3.0 1.0 1.0
2655 31.0 27.0 10.0 4.0 2.0 2.0 1.0
3025 28.0 27.0 10.0 10.0 2.0 1.0 1.0

If we wish to maintain a list of parameters that maximises the number of available nouns, we can sub-select the rows where NUM_BASIS_NOUN=10 as:

df_sub.where(df_sub["NUM_BASIS_NOUN"] == 10).dropna()

sentences patterns NUM_BASIS_NOUN NUM_BASIS_VERB BASIS_NOUN_DIST_CUTOFF BASIS_VERB_DIST_CUTOFF VERB_NOUN_DIST_CUTOFF
2540 32.0 29.0 10.0 2.0 2.0 4.0 1.0
2545 32.0 29.0 10.0 2.0 2.0 5.0 1.0
2550 44.0 28.0 10.0 2.0 3.0 1.0 1.0
2655 31.0 27.0 10.0 4.0 2.0 2.0 1.0
3025 28.0 27.0 10.0 10.0 2.0 1.0 1.0

Selecting the row with index 2655, we can encode 27 unique patterns using the following parameters:

df.iloc[2655]
sentences                 31
patterns                  27
NUM_BASIS_NOUN            10
NUM_BASIS_VERB             4
BASIS_NOUN_DIST_CUTOFF     2
BASIS_VERB_DIST_CUTOFF     2
VERB_NOUN_DIST_CUTOFF      1
Name: 2655, dtype: int64
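
Should these values be fed back into the pre-processing pipeline, a minimal sketch for extracting just the tunable parameters from the selected row could look as follows; the params name is purely illustrative:

#Keep only the pre-processing parameters, dropping the measured outputs
params = df.iloc[2655].drop(["sentences", "patterns"]).to_dict()
print(params)
#{'NUM_BASIS_NOUN': 10, 'NUM_BASIS_VERB': 4, 'BASIS_NOUN_DIST_CUTOFF': 2,
# 'BASIS_VERB_DIST_CUTOFF': 2, 'VERB_NOUN_DIST_CUTOFF': 1}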

Aside: the submission script used to generate the above job output.

scr="""
#!/bin/bash
#SBATCH -J qubits_full
#SBATCH -N 1

#SBATCH -p GpuQ
#SBATCH -t 18:00:00
#SBATCH -A "ichec001"
#SBATCH --mail-user=lee.oriordan@ichec.ie
#SBATCH --mail-type=ALL

#no extra settings
cd /ichec/work/ichec001/loriordan_scratch/intel-qnlp-iqs2/build
module load intel/2019u5 gcc cmake3
source ../load_env.sh

for nbn in $(seq 2 2 10);do
for nbv in $(seq 2 2 10);do
for bndc in $(seq 1 1 5);do
for bvdc in $(seq 1 1 5);do
for vndc in $(seq 1 1 5);do

echo "NUM_BASIS_NOUN=${nbn} NUM_BASIS_VERB=${nbv} BASIS_NOUN_DIST_CUTOFF=${bndc} BASIS_VERB_DIST_CUTOFF=${bvdc} VERB_NOUN_DIST_CUTOFF=${vndc}"
NUM_BASIS_NOUN=${nbn} NUM_BASIS_VERB=${nbv} BASIS_NOUN_DIST_CUTOFF=${bndc} BASIS_VERB_DIST_CUTOFF=${bvdc} VERB_NOUN_DIST_CUTOFF=${vndc} srun --ntasks 32 -c 1 --cpu-bind=cores -m plane=32 python ../modules/py/scripts/QNLP_EndToEnd_MPI.py
echo ""

done
done
done
done
done
"""

note="""
The ../modules/py/scripts/QNLP_EndToEnd_MPI.py file was modified to exit upon calculating the 
vec_to_encode (i.e. the patterns) list, and prints the following:

    d ={"sentences" : len(sentences),
        "patterns" : len(vec_to_encode),
        "NUM_BASIS_NOUN" : NUM_BASIS_NOUN,
        "NUM_BASIS_VERB" : NUM_BASIS_VERB,
        "BASIS_NOUN_DIST_CUTOFF" : BASIS_NOUN_DIST_CUTOFF,
        "BASIS_VERB_DIST_CUTOFF" : BASIS_VERB_DIST_CUTOFF,
        "VERB_NOUN_DIST_CUTOFF" : VERB_NOUN_DIST_CUTOFF
    }
    print(d)

"""