QNLP  v1.0
QNLP.proc.VectorSpaceModel.VectorSpaceModel Class Reference
Collaboration diagram for QNLP.proc.VectorSpaceModel.VectorSpaceModel:
Collaboration graph

Public Member Functions

def __init__ (self, corpus_path="", mode=0, stop_words=True, encoder=enc.gray.GrayEncoder(), use_spacy=False)
 
def load_tokens (self, corpus_path, mode=0, stop_words=True, use_spacy=False)
 
def sort_basis_helper (self, token_type, num_elems)
 
def define_basis (self, num_basis={"verbs":8, "nouns":8})
 
def sort_tokens_by_dist (self, tokens_type, graph_type=nx.DiGraph, dist_metric=lambda x, np.abs(x[:, np.newaxis] - y) y)
 
def sort_basis_tokens_by_dist (self, tokens_type, graph_type=nx.DiGraph, dist_metric=lambda x, np.abs(x[:, np.newaxis] - y) y, num_basis=16, ham_cycle=True)
 
def assign_indexing (self, token_type)
 
def calc_diff_matrix (self)
 
def getPathLength (self, token_type)
 

Data Fields

 pc
 
 tokens
 
 encoder
 
 distance_dictionary
 
 encoded_tokens
 
 ordered_tokens
 

Private Member Functions

def _create_token_graph (self, token_dist_pairs, graph_type=nx.DiGraph)
 
def _get_ordered_tokens (self, nx.DiGraph token_graph, ham_cycle=True)
 
def _tsp_token_solver (self, nx.DiGraph token_graph)
 
def _calc_token_order_distance (self, token_order_list, token_type)
 

Detailed Description

Use vector space model of meaning to determine relative order of tokens in basis (see disco papers).
Plan:

-   1. Populate set of tokens of type t in corpus; label as T; O(n)
-   2. Choose n top most occurring tokens from T, and define as basis elements of type t; 2 <= n <= |T|; O(n)
-   3. Find relative (pairwise) distances between these tokens, and define as metric; n(n-1)/2 -> O(n^2)
-   4. Sort tokens by distance metric; O(n*log(n))
-   5. Assign tokens integer ID using Gray code mapping of sorted index; O(n)

After the aboves steps the elements are readily available for encoding using the respective ID. Tokens that appear
relatively close together will be closer in the sorted list, and as a result have a smaller number of bit flips of difference
which can be used in the Hamming distance calculation later for similarity of meanings.

Definition at line 108 of file VectorSpaceModel.py.

Constructor & Destructor Documentation

◆ __init__()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.__init__ (   self,
  corpus_path = "",
  mode = 0,
  stop_words = True,
  encoder = enc.gray.GrayEncoder(),
  use_spacy = False 
)

Definition at line 123 of file VectorSpaceModel.py.

123  def __init__(self, corpus_path="", mode=0, stop_words=True, encoder=enc.gray.GrayEncoder(), use_spacy=False):
124  self.pc = VSM_pc()
125  self.tokens = self.load_tokens(corpus_path, mode, stop_words, use_spacy)
126  self.encoder = encoder
127  self.distance_dictionary = {}
128  self.encoded_tokens = {}
129  self.ordered_tokens = {}
130 

Member Function Documentation

◆ _calc_token_order_distance()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel._calc_token_order_distance (   self,
  token_order_list,
  token_type 
)
private

Definition at line 345 of file VectorSpaceModel.py.

345  def _calc_token_order_distance(self, token_order_list, token_type):
346  sum_total = []
347  for idx in range(1, len(token_order_list)):
348  # May be ordered either way
349  v0 = self.distance_dictionary.get(token_type).get( ( token_order_list[idx-1], token_order_list[idx] ) )
350  v1 = self.distance_dictionary.get(token_type).get( ( token_order_list[idx], token_order_list[idx-1] ) )
351  if v0 == None:
352  sum_total.append( np.min(v1) )
353  else:
354  sum_total.append( np.min(v0) )
355 
356  return (np.sum(sum_total), sum_total)
357 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel.distance_dictionary.

◆ _create_token_graph()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel._create_token_graph (   self,
  token_dist_pairs,
  graph_type = nx.DiGraph 
)
private
Creates graph using the (basis) tokens as nodes, and the pairwise 
distances between them as weighted edges. Used to determine optimal 
ordering of token adjacency for later encoding.

Definition at line 239 of file VectorSpaceModel.py.

239  def _create_token_graph(self, token_dist_pairs, graph_type=nx.DiGraph):
240  """
241  Creates graph using the (basis) tokens as nodes, and the pairwise
242  distances between them as weighted edges. Used to determine optimal
243  ordering of token adjacency for later encoding.
244  """
245  # Following: https://www.datacamp.com/community/tutorials/networkx-python-graph-tutorial
246 
247  assert( callable(graph_type) )
248 
249  token_graph = graph_type()
250 
251  for tokens_tuple, distances in token_dist_pairs.items():
252  # Prevent multiple adds. Should be no prob in any case
253  if not token_graph.has_node(tokens_tuple[0]):
254  token_graph.add_node(tokens_tuple[0])
255  if not token_graph.has_node(tokens_tuple[1]):
256  token_graph.add_node(tokens_tuple[1])
257 
258  # If multigraph allowed, add multiple edges between nodes.
259  # If not, use the min distance.
260  if graph_type == nx.MultiGraph:
261 
262  if not isinstance(distances, int):
263  for d in distances:
264  token_graph.add_edge( tokens_tuple[0], tokens_tuple[1], weight=d )
265  else:
266  token_graph.add_edge( tokens_tuple[0], tokens_tuple[1], weight=distances )
267 
268  else:
269  d_val = np.min(distances) if not isinstance(distances, int) else distances
270  token_graph.add_edge( tokens_tuple[0], tokens_tuple[1], weight=d_val )
271 
272  if graph_type == nx.DiGraph:
273  token_graph.add_edge( tokens_tuple[1], tokens_tuple[0], weight=d_val )
274 
275  return token_graph
276 

Referenced by QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_basis_tokens_by_dist(), and QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_tokens_by_dist().

Here is the caller graph for this function:

◆ _get_ordered_tokens()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel._get_ordered_tokens (   self,
nx.DiGraph  token_graph,
  ham_cycle = True 
)
private
Solves the Hamiltonian cycle problem to define the ordering of the basis tokens.
If a cycle is not required, can solve the TSP problem instead (ham_cycle = False).

Definition at line 279 of file VectorSpaceModel.py.

279  def _get_ordered_tokens(self, token_graph : nx.DiGraph, ham_cycle = True):
280  """
281  Solves the Hamiltonian cycle problem to define the ordering of the basis tokens.
282  If a cycle is not required, can solve the TSP problem instead (ham_cycle = False).
283  """
284  #Must be a directed graph
285  assert( isinstance(token_graph, nx.DiGraph) )
286  #Must be fully connected
287  assert( nx.tournament.is_strongly_connected(token_graph) )
288  if ham_cycle:
289  return nx.tournament.hamiltonian_path(token_graph)
290  else:
291  token_order, dist_total = self._tsp_token_solver(token_graph)
292  return token_order[:]
293 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel._tsp_token_solver().

Referenced by QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_basis_tokens_by_dist(), and QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_tokens_by_dist().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ _tsp_token_solver()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel._tsp_token_solver (   self,
nx.DiGraph  token_graph 
)
private
Using or-tools to solve TSP of token_graph.
Adapted from or-tools examples on TSP

Definition at line 296 of file VectorSpaceModel.py.

296  def _tsp_token_solver(self, token_graph : nx.DiGraph):
297  """
298  Using or-tools to solve TSP of token_graph.
299  Adapted from or-tools examples on TSP
300  """
301  dist_dat = {}
302  dist_dat['distance_matrix'] = nx.adjacency_matrix(token_graph).todense().tolist() #Doesn't seem to like np matrices
303  dist_dat['num_vehicles'] = 1 #Single route
304  dist_dat['depot'] = 0 #Starting position at index 0; considering trying alternative options
305 
306  mgr = pywrapcp.RoutingIndexManager(
307  len(dist_dat['distance_matrix']),
308  dist_dat['num_vehicles'], dist_dat['depot']
309  )
310  routing = pywrapcp.RoutingModel(mgr)
311 
312  def dist_matrix_val(idx_0, idx_1):
313  """Returns value from distance matrix between nodes"""
314  n0 = mgr.IndexToNode(idx_0)
315  n1 = mgr.IndexToNode(idx_1)
316  return dist_dat['distance_matrix'][n0][n1]
317 
318  transit_callback_index = routing.RegisterTransitCallback(dist_matrix_val)
319 
320  routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
321 
322  # Setting first solution heuristic.
323  search_parameters = pywrapcp.DefaultRoutingSearchParameters()
324  search_parameters.first_solution_strategy = (
325  routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
326 
327  # Solve the problem.
328  assignment = routing.SolveWithParameters(search_parameters)
329 
330  dist_total = assignment.ObjectiveValue()
331 
332  index = routing.Start(0)
333  plan_output = ''
334  token_order = []
335  token_list = list(token_graph)
336 
337  while not routing.IsEnd(index):
338  token_order.append(token_list[mgr.IndexToNode(index)])
339  previous_index = index
340  index = assignment.Value(routing.NextVar(index))
341  return token_order, dist_total
342 

Referenced by QNLP.proc.VectorSpaceModel.VectorSpaceModel._get_ordered_tokens().

Here is the caller graph for this function:

◆ assign_indexing()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.assign_indexing (   self,
  token_type 
)
5. Encode the ordered tokens using a code based on indexed 
location. Values close together will have fewer bit flips.

Definition at line 360 of file VectorSpaceModel.py.

360  def assign_indexing(self, token_type):
361  """ 5. Encode the ordered tokens using a code based on indexed
362  location. Values close together will have fewer bit flips.
363  """
364  t_dict = {}
365 
366  for idx,token in enumerate( self.ordered_tokens[token_type]):
367  t_dict.update({token : self.encoder.encode(idx, token_type) })
368 
369  self.encoded_tokens.update( {token_type : t_dict })
370 
371  return self.encoded_tokens
372 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel.encoded_tokens, QNLP.proc.VectorSpaceModel.VectorSpaceModel.encoder, and QNLP.proc.VectorSpaceModel.VectorSpaceModel.ordered_tokens.

◆ calc_diff_matrix()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.calc_diff_matrix (   self)

Definition at line 375 of file VectorSpaceModel.py.

375  def calc_diff_matrix(self):
376  "WIP"
377  return
378  X = np.zeros([len(a.tokens['verbs'])]*2)
379 
380  # Add forward and inverse mapping in dictionary
381  tup_map_f = {k:i for i, k in enumerate(self.tokens['verbs'].keys()) }
382  tup_map_i = {i:k for i, k in enumerate(self.tokens['verbs'].keys()) }
383 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel.tokens.

◆ define_basis()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.define_basis (   self,
  num_basis = {"verbs": 8, "nouns":8} 
)
2. Specify the number of basis elements in each space. 
Dict holds keys of type and values of number of elements to request.

Definition at line 156 of file VectorSpaceModel.py.

156  def define_basis(self, num_basis = {"verbs": 8, "nouns":8}):
157  """ 2. Specify the number of basis elements in each space.
158  Dict holds keys of type and values of number of elements to request."""
159  if self.tokens == None:
160  return
161  basis = {}
162  for t in ("nouns", "verbs"):
163  basis.update( self.sort_basis_helper(t, num_basis[t]) )
164 
165  return basis
166 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_basis_helper(), and QNLP.proc.VectorSpaceModel.VectorSpaceModel.tokens.

Referenced by QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_basis_tokens_by_dist().

Here is the call graph for this function:
Here is the caller graph for this function:

◆ getPathLength()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.getPathLength (   self,
  token_type 
)

Definition at line 384 of file VectorSpaceModel.py.

384  def getPathLength(self, token_type):
385  "Calculate cumulative path length of resulting basis ordering"
386  total = 0
387  for i in range(len(self.ordered_tokens)-1):
388  try:
389  total += self.distance_dictionary[token_type][( self.ordered_tokens[token_type][ i ], self.ordered_tokens[token_type][ i+1 ])]
390  except:
391  total += self.distance_dictionary[token_type][( self.ordered_tokens[token_type][ i+1 ], self.ordered_tokens[token_type][ i ])]
392  return total

References QNLP.proc.VectorSpaceModel.VectorSpaceModel.distance_dictionary, and QNLP.proc.VectorSpaceModel.VectorSpaceModel.ordered_tokens.

◆ load_tokens()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.load_tokens (   self,
  corpus_path,
  mode = 0,
  stop_words = True,
  use_spacy = False 
)

Definition at line 134 of file VectorSpaceModel.py.

134  def load_tokens(self, corpus_path, mode=0, stop_words=True, use_spacy=False):
135  " 1. Wraps the calls from process_corpus.py to tokenize. Returns None if path is "
136  if path.exists(corpus_path):
137  return self.pc.tokenize_corpus( pc.load_corpus(corpus_path), mode, stop_words, use_spacy)
138  else:
139  raise Exception("No corpus found at the given path.")
140 
def tokenize_corpus(corpus, proc_mode=0, stop_words=True)

References QNLP.proc.VectorSpaceModel.VSM_pc.pc, QNLP.proc.VectorSpaceModel.VectorSpaceModel.pc, and QNLP.proc.process_corpus.tokenize_corpus().

Here is the call graph for this function:

◆ sort_basis_helper()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_basis_helper (   self,
  token_type,
  num_elems 
)

Definition at line 144 of file VectorSpaceModel.py.

144  def sort_basis_helper(self, token_type, num_elems):
145  basis_dict = {token_type : {} }
146  #consider Counter.most_common(num_elems)
147  for counter, elem in enumerate(sorted(self.tokens[token_type].items(), key=lambda x : x[1][0], reverse = True)):
148  if counter >= int(num_elems):
149  break
150  basis_dict[token_type].update({elem[0] : elem[1]})
151  return basis_dict
152 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel.tokens.

Referenced by QNLP.proc.VectorSpaceModel.VectorSpaceModel.define_basis().

Here is the caller graph for this function:

◆ sort_basis_tokens_by_dist()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_basis_tokens_by_dist (   self,
  tokens_type,
  graph_type = nx.DiGraph,
  dist_metric = lambda x,
np.abs(x[:, np.newaxis] - y)  y,
  num_basis = 16,
  ham_cycle = True 
)

Definition at line 202 of file VectorSpaceModel.py.

202  def sort_basis_tokens_by_dist(self, tokens_type, graph_type = nx.DiGraph, dist_metric = lambda x,y : np.abs(x[:, np.newaxis] - y), num_basis = 16, ham_cycle = True):
203  " 3. & 4."
204  basis_tk_list = list(self.define_basis(num_basis={"verbs":num_basis, "nouns" : num_basis})[tokens_type].keys())
205 
206  basis_dist_dict = {}
207  # pairwise distance calc
208  if len(basis_tk_list) > 1:
209  for c0,k0 in enumerate(basis_tk_list[0:]):
210  for k1 in basis_tk_list[c0:]:
211  if k0 != k1:
212  a = self.tokens[tokens_type][k0][1]
213  b = self.tokens[tokens_type][k1][1]
214 
215  basis_dist_dict[(k0,k1)] = np.min(dist_metric(a,b)) # list(map(np.unique, dist_metric(a,b)))
216  #sorted([ dist_metric(i,j) for i in self.tokens[tokens_type][k0][1] for j in self.tokens[tokens_type][k1][1] ])
217  else:
218  basis_dist_dict.update({ (basis_tk_list[0],basis_tk_list[0]) : 0})
219 
220  self.distance_dictionary.update( {tokens_type : basis_dist_dict } )
221 
222  """ Maps the tokens into a fully connected digraph, where each token
223  is a node, and the weighted edge between them holds their
224  respective distance. In the event of multiple distances between
225  two node, assumes the minimum of the list.
226  """
227  token_graph = self._create_token_graph(basis_dist_dict, graph_type)
228  """ Using the token graph, a Hamiltonian path (cycle if [0]
229  connected to [-1]) is found for the graph, wherein the ordering gives
230  the shortest path connecting each node, visited once. This gives a
231  sufficiently ordered list for the encoding values.
232  """
233 
234  self.ordered_tokens.update( {tokens_type : self._get_ordered_tokens(token_graph, ham_cycle) } )
235  return self.ordered_tokens
236 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel._create_token_graph(), QNLP.proc.VectorSpaceModel.VectorSpaceModel._get_ordered_tokens(), QNLP.proc.VectorSpaceModel.VectorSpaceModel.define_basis(), QNLP.proc.VectorSpaceModel.VectorSpaceModel.distance_dictionary, QNLP.proc.VectorSpaceModel.VectorSpaceModel.ordered_tokens, and QNLP.proc.VectorSpaceModel.VectorSpaceModel.tokens.

Here is the call graph for this function:

◆ sort_tokens_by_dist()

def QNLP.proc.VectorSpaceModel.VectorSpaceModel.sort_tokens_by_dist (   self,
  tokens_type,
  graph_type = nx.DiGraph,
  dist_metric = lambda x,
np.abs(x[:, np.newaxis] - y)   y 
)

Definition at line 171 of file VectorSpaceModel.py.

171  def sort_tokens_by_dist(self, tokens_type, graph_type = nx.DiGraph, dist_metric = lambda x,y : np.abs(x[:, np.newaxis] - y) ):
172  " 3. & 4."
173  tk_list = list(self.tokens[tokens_type].keys())
174  dist_dict = {}
175  # pairwise distance calc
176  for c0,k0 in enumerate(tk_list[0:-1]):
177  for k1 in tk_list[c0:]:
178  if k0 != k1:
179  a = self.tokens[tokens_type][k0][1]
180  b = self.tokens[tokens_type][k1][1]
181 
182  dist_dict[(k0,k1)] = np.min(dist_metric(a,b)) # list(map(np.unique, dist_metric(a,b)))
183  #sorted([ dist_metric(i,j) for i in self.tokens[tokens_type][k0][1] for j in self.tokens[tokens_type][k1][1] ])
184 
185  self.distance_dictionary.update( { tokens_type : dist_dict} )
186 
187  """ Maps the tokens into a fully connected digraph, where each token
188  is a node, and the weighted edge between them holds their
189  respective distance. In the event of multiple distances between
190  two node, assumes the minimum of the list.
191  """
192  token_graph = self._create_token_graph(dist_dict, graph_type)
193  """ Using the token graph, a Hamiltonian path (cycle if [0]
194  connected to [-1]) is found for the graph, wherein the ordering gives
195  the shortest path connecting each node, visited once. This gives a
196  sufficiently ordered list for the encoding values.
197  """
198 
199  self.ordered_tokens.update( { tokens_type : self._get_ordered_tokens(token_graph) } )
200  return self.ordered_tokens
201 

References QNLP.proc.VectorSpaceModel.VectorSpaceModel._create_token_graph(), QNLP.proc.VectorSpaceModel.VectorSpaceModel._get_ordered_tokens(), QNLP.proc.VectorSpaceModel.VectorSpaceModel.distance_dictionary, QNLP.proc.VectorSpaceModel.VectorSpaceModel.ordered_tokens, and QNLP.proc.VectorSpaceModel.VectorSpaceModel.tokens.

Here is the call graph for this function:

Field Documentation

◆ distance_dictionary

◆ encoded_tokens

QNLP.proc.VectorSpaceModel.VectorSpaceModel.encoded_tokens

◆ encoder

QNLP.proc.VectorSpaceModel.VectorSpaceModel.encoder

◆ ordered_tokens

◆ pc

QNLP.proc.VectorSpaceModel.VectorSpaceModel.pc

◆ tokens


The documentation for this class was generated from the following file: