My Progress

15th May

To start off, I created a Python notebook to demonstrate some basic functionality of the NLTK library and the use of WordNet synsets that come along with it.

For an input sentence

"""At eight o'clock on Thursday morning, Arthur didn't feel very good."""

the basic functionality of the NLTK Tokenizer was demonstrated.

After tokenizing the sentence, and tagging them on the basis of their positions in the sentence, entities can be generated.

Then I practiced with the WordNet corpus, with the help of CorpusReader available in the NLTK library. I wanted to extract synsets from the corpus, in order to see how they work and also use the definitions that are available for every lemma and use it to create my dataset to train the definition encoder later on.

17th May

I have been looking into practical implementations for ideas related to learning concept representations and entity, that go beyond word embeddings. That brought me to this paper.

Here, they have talked about two types of evaluation tasks that I will work on to get a better understanding. The Analogical Reasoning and Concept Learning.

Typically, analogies take the form "a is to b, is same as c is to _?", where a, b and c are present in the model. Here, a similarity function can be used to answer these analogical reasoning questions.

d = arg max Sim( vec(d), vec(b) - vec(a) + vec(c) )

Here, d can be predicted by using Sim(), which is the similarity function.

21st May

The Google Analogy Task dataset contains 19,544 questions divided into semantic analogies and syntactic analogies. The questions file contains these pairs, divided into various categories.

London England Berlin Germany

So, I will be using this file to evaluate the pre-trained embeddings on these question-answer pairs. To do this, I use the first three words, "London England Berlin" to predict the word and check if it matches the answer, the fourth word "Germany". This file only contains single words, no multi-word phrases. Thus, there is still a need for a better dataset to evaluate wiki embeddings. I am planning to extract the wiki embeddings and train the model on them by using anchor text.

24th May

I pushed the scripts to the repository along with the updated README.md and the requirements.txt.

It currently includes the scripts to train a FastText model. Usage of the scripts has been explained in the README.md file.

Pre-Training and Evaluating Analogy

The script, pre-train.py takes the following arguments:

Files

Input File: Clean wiki dump
Output File: Saved model

Model Hyperparameters

Vector Size, - s: Defines the Embedding Size
Skipgram, - sg: Decides whether model uses skipgram or CBOW
Loss Function, - hs: Loss function used is Hierarchal Softmax or Negative Sampling
Epochs, - e: Number of epochs

$ mkdir model

$ python src/pre-train.py -i data/fil9 -o model/pre_wiki -s 300 -sg 1 -hs 1 -e 5

Next, we try to see how these pre-trained embeddings perform on the Google Analogy Task. For this, we have the analogy.py.

$ python analogy.py -i data/questions-words.txt -m model/pre_wiki

Question: high is to higher as great is to?

Answer: greater

Predicted: greater

Question: glendale is to arizona as akron is to ?

Answer: ohio

Predicted: ohio

Question: ethical is to unethical as comfortable is to ?

Answer: uncomfortable

Predicted: comfortably

Question: netherlands is to dutch as brazil is to ?

Answer: brazilian

Predicted: brazilian

Question: free is to freely as happy is to ?

Answer: happily

Predicted: happily

Question: luanda is to angola as monrovia is to ?

Answer: liberia

Predicted: liberia

27th May

This week I am working on the most efficient way to use the dump, as well as extract entities from it. The benchmark model will train on the word embeddings and the possible multi-phrase entity embeddings the same. This paper has some great insights on how to represent complex characteristics of a word. It also provides potential evaluation tasks that can be used for our purpose.

30th May

The current task is to extract entities from the corpus, the entity descriptions, and the WordNet synsets. These are the nouns, verbs, adjectives, and adverbs grouped into sets of cognitive systems, each expressing a distinct concept. These concepts are known to exhibit semantic and lexical relations.

>>> for sen in wn.all_synsets():
. . . print(sen.lemmas()[0].name(), ":", sen.definition())
. . . break
able : (usually followed by `to') having the necessary means or skill or know-how or authority to do something

Next up, I need to improve the Mention Extraction of the entities from the corpus. For this, I will be looking into the JSON files available. These wiki dumps seem to have all the possible surface forms of a given entity.

1st June

The JSON dump containing all Wikipedia entities was downloaded from here. The files are more than 20 GBs in size and will just be used to create the dictionary for the entities along with their possible surface forms in the corpus. These surface forms can be found in the JSON dump according to the format in which all these entities are stored in the form of a single JSON array. The surface forms are extracted from the labels and aliases in this JSON array.

{
"labels": {
"en": {
"language": "en",
"value": "New York City"
}
},
"aliases": {
"en": [
{
"language": "en",
"value": "NYC"
},
{
"language": "en",
"value": "New York"
},
]
}
}

Thus, for an entity like "New_York", I will create a dictionary that saves all the surface forms for this entity and then performs Mention Extraction on the corpus.

Any mention of "New York City", "New York" or "NYC" is detected and changed to "entity/New_York".

4th June

Working with Surface Forms introduces a new problem. Whenever we encounter an instance of an entity that can represent different entities in different contexts, it is difficult to simply replace that instance with the entity. Here is an example that explores this problem in detail:

Let us say that we are dealing with the surface form "rock", then we have the following possible entities that are present in the dictionary:

Rock Music

Rock ( Geology )

Dwayne 'The Rock' Johnson and many more.

So, this kind of ambiguity in aliases leads to a major problem in identifying the particular entity that the surface form represents. Thus, simply replacing the surface form in the corpus will lead to unpredictable results while evaluating the entity embeddings. As we can see, we need to perform a specific task here known as "Disambiguation". A single surface form could essentially represent a place, a person, a genre of music or simply a substance. The only thing we can use to distinguish between them is the context in which the surface form is present.

So, the next thing we need to look into is an NER tool.

6th June

The blog post by OpenAI explains their system that can be used for discovering types for Entity Disambiguation. This aligns with our plan, that makes use of the description of entities from Wikipedia articles. I will follow the same approach as used in deeptype to extract entities along with the best-suited type for its surface form depending upon the context and the description that can be obtained from the abstract of the Wikipedia article.

8th June

Using a similar approach to last year, I am going to tag all the surface forms with their respective entities.

After the mentions have been extracted, a simple tag can be added to distinguish an entity from the rest of the corpus.

11th June

The scripts work with the labels, descriptions, and aliases in the following way:

Each surface form stores the entity that it represents as a key.
dictionary[surface form] = entity

For example, the dictionary may contain the following entries:

dictionary['NYC'] = 'New_York_City'

The surface forms are extracted from the values of JSON dictionary entry "aliases", and the entity is obtained from the value of the entry "labels".

Finally, the dictionary containing all the entities have the descriptions of those entities as the keys.

dictionary['New_York_City'] = 'largest city in New York and the United States of America'

The applicability of such descriptions in our model is fairly straightforward. It helps to identify the relations between different entities. This means that the semantics can be established for the entities as:

New_York_City instance_of City part_of United_States. They are the key properties and a mapping from all Wikipedia article names to a Wikidata ID.

15th June

I evaluated the model using the Analogy Task to establish a benchmark score and the results for the evaluation are as follows:

Questions: 19,544

Semantic: 8,869

Syntactic: 10,675

i) Model Hyperparameters ( FastText )

Vector Size: 200

Epochs: 3

Skipgram using Hierarchal Softmax

Final Accuracy: 51.14%

ii) Model Hyperparameters ( FastText )
Vector Size: 300
Epochs: 5
Skipgram using Hierarchal Softmax

Final Accuracy: 54.63%

iii) Model Hyperparameters ( Word2Vec )
Vector Size: 300
Epochs: 5
Skipgram using Hierarchal Softmax

Final Accuracy: 55.78%

The FastText model using character tri-grams to generate embeddings performs better on the syntactic questions, as compared to the semantic country: capital questions. This makes sense because breaking the words into character n-grams do not preserve the semantic information of those words in the embeddings.

17th June

This week, my work is focused on incorporating the deeptype system into the WikiDetector.py script. There are no pre-trained models available so I will have to train it myself. Here, the procedure followed is the same as they described in the repository.
Originally, the surface forms are loaded from the dictionary

20th June

The JSON dump will be processed by the script entity.py, and it will work in a similar fashion as MakeDictionary.py. Along with this, I have also begun working on the evaluation task of Named Entity Recognition, in which I will use the CoNLL-2003 dataset.
For now, I will focus my work on the current pipeline. The evaluations for all the baselines will be dealt with during the later weeks. I shall progress as mentioned in my proposal, working out the model, and the hyperparameter tuning.

22nd June

The first thing needed for the current pipeline is the hash map between Wikipedia article titles and their abstracts. I will create this dictionary and store it in a JSON file. The abstracts can be obtained from this dump. The format of this XML dump is quite straightforward.

<feed>
<doc>
<title>Wikipedia: Anarchism</title>
<url>https://en.wikipedia.org/wiki/Anarchism</url>
<abstract>Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies,"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies.</abstract>

Using some simple regular expressions, I will store the mapping as:
dictionary['title'] = 'abstract'
While training the model, the target variable will be the entity embedding, "anarchism", and the input will be its abstract. After training the model against these pre-trained entity embeddings, the model will be able to predict entity embeddings on the go for out of vocabulary entities with the help of their abstracts.
x_train contains all the vectors for abstracts.
y_train contains the vector embedding for the target entity.

23rd June

I finished training all the baselines discussed above. The current corpus contains 243,425 articles extracted using the extract.py script.

venv ❯ python src/package/extract.py ../data/enwik9 -l -o ../data/output

Once the dump was cleaned, I trained it using Word2Vec as well as FastText to compare the results. These constituted the baselines with entities injected into the corpus.

24th June

Currently, I am trying to figure out which format of descriptions will be best suited for the LSTM. The aim is to train the model so that it predicts the output entity embedding given the input embeddings of its description. The two possible dictionaries that can be used are:
-> Using abstracts from Wikipedia dump with all Wiki articles and their abstracts.
-> Other possibility is to use the same corpus to generate this dictionary in which the entities were already injected.
The second approach appears to be better in terms of preserving the semantic relations of a given entity with the other entities present in the description of the said entity.

25th June

The baselines are on the cloud. The pre-trained embeddings, the processed corpus and the descriptions extracted in both aforementioned methods have been pushed to Dropbox. I will share the link along with the information regarding each and every file.

After the XML dump has been processed and the extracted text is in the output directory, we need to further find the surface forms in the corpus and then inject entities wherever we find these surface forms. This is done with the help of three scripts: MakeDictionary.py, CheckPerson.py and WikiDetector.py
venv ❯ python src/MakeDictionary.py ../data/output/
venv ❯ python src/CheckPerson.py ../data/text
venv ❯ python src/WikiDetector.py ../data/output/
Next up, the scripts description.py and combine.py are used to extract all the abstracts from the dump and combine the files generated by the Wiki extractor respectively. combine.py uses the first line of the resource article as the description.
venv ❯ python src/description.py ../data/enwiki/enwiki-latest-abstract.xml ../data/abstracts.json
venv ❯ python src/combine.py ../data/output ../data/filedocs9 ../data/descriptions.json

27th June

29th June

The current progress regarding the pipeline that was mentioned in the original proposal can be reviewed in the open pull request. The PR is a work in progress. I am training the model with different parameters to see what works and what doesn't. The heuristic remains the same:
I first created the dictionary that contained the entity title along with its description. This description was then encoded into its embedding form which is to be used by the Definition Encoder to output a possible entity for the target.

1st July

The mappings once loaded from the descriptions are variable in length. However, the input sequences to the network must have the same length. To deal with this, I have used padding where I am padding the encoded description with zeros in order to make all the descriptions of equal length.

3rd July

As of now, after training the model with different parameters and architectures, I have found that the training on descriptions is most effective with a Recurrent Neural Network. The LSTM Network works as good as the RNN but the outputs contain resources that are more related to the n-grams and less related to the semantic of the target entity.

10th July

As of this week, we have commenced with the second evaluation for the program. I have been working towards cleaning the code. Code in the repo was styled and proper documentation was added wherever necessary. All Python code written is in compliance with PEP-8 styling guide. The README for this repo was updated and all the changes made in the repo can still be viewed in the open PR to the main repository.
Next and final phase is to write an evaluation scheme. The problem with that is to identify the task that will help us accomplish the required evaluations of the system.

18th July

The encoder is working and the next step is to design an evaluation scheme for the predicted embeddings of the all the entities based on their abstracts. Once we have the database for all the embeddings, we can run the simple implementation of whether the output vector of a given entity shows any similarity with the vector that represents the class to which that entity belongs. Here is an example to show what I mean.
In this example, I have taken the entity "agatha_christie".
For this entity, the following abstract was encoded, "dame agatha mary clarissa christie lady mallowan dame_commander_of_the_order_of_the_british_empire née miller 15 september 1890 – 12 january 1976 was an english writer she is known for her 66 detective novels and 14 short story collections particularly those revolving around her fictional detectives hercule_poirot and miss_marple christie also wrote the worlds longest-running play a murder mystery the_mousetrap and six romance_novel under the name mary westmacott in 1971 she was elevated to dame_commander_of_the_order_of_the_british_empire dbe for her contribution to literature".
Finally, when this description of the entity was provided to the model as input, an output vector was obtained. This output vector was then fed to the model containing the pre-trained embeddings trained on FastText, and the most similar vectors were printed along with the similarity score, which in this case was the simple cosine distance between the two vectors.
"[(‘writer’, 0.7674428224563599), (‘wallace_stevens’, 0.7610886693000793), (‘tragicomedy’, 0.7577036619186401), (‘robert_silverberg’, 0.7564967274665833), (‘fredric_jameson’, 0.7519606351852417), (‘sangster’, 0.7509233355522156), (‘novelist’, 0.750686526298523), (‘atlas_shrugged’, 0.7377546429634094), (‘lester_del_rey’, 0.7359971404075623), (‘writer-editor’, 0.7347295880317688)]".
Our approach to evaluate such embeddings is quite straightforward. A simple Vanilla scoring function that checks the similarity between the predicted vector for the entity and the vector representing the ontological class that contains this entity. Considering this example, we can see that the vector predicted for "agatha_christie" shows some similarity with the vector that represents the entity "writer" which in this case is also the class that encloses entities such as the one we predicted.
A vanilla scoring function would collect all the ontological classes for the DBpedia entities, map those entities to their respective ontological classes and finally the predicted embeddings will be evaluated based on the similarity between the two vectors. I will set a threshold that shall give us the results with some confidence.

20th July

In order to compute some evaluations, I am going to generate the test set using DBpedia dumps and sparql queries. AS we know, DBpedia stores different kinds of information about each and every resource in its knowledge base. So, if we select a few properties from the DBpedia graph, we can compare the results of the model, that is the predicted vector of a resource which is the encoded description of the said resource, with the embedding of the selected property from the graph. In these days, that is what I am exploring. I will experiment with a few different classes and properties that will enable me to write a simple scoring function for checking the extent to which the semantics of the description are being encoded into the final vector. Resources:

A simple example would be exploring the information available on the resource page for Agatha Christie and see what all properties are available. We have the label, type, and the url.

For this example, we can try to relate the predicted embedding with:

dbo:occupation

|-dbr:short_story_writer

|-dbr:novelist

|-dbr:playwright

rdf:Type

|-owl:Thing

|-foaf:Person

|-dbo:Person

|-dbo:Agent

|-dbo:Writer

24th July

I wrote the following sparql queries in order to fetch these resources and their respective properties:

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?s ?abstract ?url ?p

WHERE {

?s a/rdfs:subClassOf* dbo:Organisation .

?s foaf:isPrimaryTopicOf ?url ;

a ?p .

?p a owl:Class .

?s dbo:abstract ?abstract .

filter(langMatches(lang(?abstract), "en"))

}

With this, I was able to generate the test set for resources that had the parent class type as dbo:Organisation in the DBpedia ontology tree.

My experiments with the evaluations can be bifurcated as following:

Changing the pre-trained embeddings between FastText, Word2Vec and GloVe.
Using text with and without the surface forms being replaced in them.
Changing the attention between just the description and between description and the respective Class of the resource.
Finally, running the evaluations on different Classes with different sets of weights.

Finally, the weights, or the state_dict of the model, can be used to encode descriptions using the model when we downstream some common NLP tasks. This shall give us an advantage of not having to use the standard approach of replacing every unknown resource in the text with <unk> token, and instead, be able to use the embedding generated on the fly with the model.

27th July

The dictionaries were stored as mentioned above. Apart from the entities belonging to the class Organisation. First and foremost, the evaluation metric I want to use is plotting the entities on a graph with the help of a dimensionality reduction technique. In this project, I am going with t-SNE. For the purpose of plotting the entities, I used the dictionary person.json, which you can find here. I will share the insights drawn from the sub-plots of the graphs in the next post.

As discussed, firstly I ran the scoring function to check whether the encoded vectors resembled with the vectors of their respective classes. I ran two tests and got the following result:

I measured the cosine similarity between the entity vector and the vector for 'organisation'. This, when averaged over the number of instances and the hits, gave the score of 0.52

I also ran the same test for the cosine similarity between the entity vector and 'company', and the averaged score was 'company'. Since more instances in the dataset belonged to the sub-class of 'company', the score improved when I made the comparison with that.

30th July

As discussed earlier, I plotted the entities from the person dataset and obtained the following result.

With the help of these sub-plots, I want to point out the fact that the entities responding to persons that belong to the same field or domain are situated nearby. For example, in these plots, the entities gabriel_crouch, katja_eichinger, filip_remunda, rula_jebreal and armand_mastroianni are all a part of the movie business and thus relate with one another because of their common similar field of work which include movie actors, producers, director and writer. Similar patterns were observed for the entities belonging to the domains of politics and sports as well.

2nd August

While the plots provided some evidence as to whether or not the abstracts were encoded into meaningful embeddings or not, I am looking into more evaluation metrics that can be used to measure the meaningfulness of the embeddings and more importantly, measure them against some of the vanilla implementations for computing embeddings for OOV entities. I am thus looking into the measure of similarity between the entity vector, for the OOV entity, and the averaged vector of all the resources present in the annotated abstract for that OOV entity. I will be using this as the final metric and try to implement as many vanilla implementations as possible. Currently, I am looking into the use of Euclidean distance as the way to compute these baseline embeddings.

4th August

Here are the approaches that I have used to compare the embeddings generated by the abstract encoder:

Abstract vector

Vector predicted by the abstract encoder for that entity.

Zero vector

Vector for the OOV entity was initialized with zero.

Random vector

Vector for the OOV entity was initialized with random values.

Title vector

Another method to compute an embedding is to find the weighted mean of the words in the label of the entity.

Distanced mean vector

Similar to averaged vector, except that the distances between the entity and the words in the abstract were also taken into account.

Averaged weighted vector

Using the vectors for the words present in the abstract, I used the mean of those vectors as the entity vector used in the embedding.

6th August

Before closing the project and completing the final submission and evaluation, all the links were updated. Here are the links that you can refer to, to get all the information and the code for this project. I spent some time with @thiago merging the pull requests with the base repo.

Have a look at these links:

DBpedia embeddings database that I generated using the abstract encoder. It tagged over 350,000 entities from the dataset that were not present in the training data. Apart from these, I have also uploaded two small subsets of the database for the entities of classes organisation and person.
You can get the code here.
Link to the base repo.
I pushed my work with the help of two Pull Requests. The first PR included all the modules used for fetching data, obtaining the pre-trained embeddings for that data, pre-processing the text data and training the LSTM model on that data. The second PR is for the evaluation branch with code for creating the database of embeddings when provided with an input dictionary of abstracts, generating the graph for such a database, and finally running the evaluation on that dictionary with abstracts for OOV entities.