Summary

Full Summary - Text Retrieval and Mining (6013B0801Y) || UvA BSc Econometrics & Data Science

Pages: 1
Uploaded on: 20-04-2025
Academic year: 2024/2025

This is a summary of everything you need to know for the final of Text Retrieval and Mining in the third year of the BSc Econometrics and Data Science or BSc Business Analytics. I studied this and got an 8.4 on the final, with only 2 days of studying.


Preview of the content

Vocabulary / Dictionary: all unique words appearing in the corpus. (size V = number of unique words)
Token: the smallest unit of text. (word, punctuation, etc.)
Corpus: the full collection of documents.
Corpus Frequency: the number of times a word appears across all documents.
Term Frequency (TF): the number of times a word appears in one document.
Document Frequency (DF): the number of documents in the corpus in which a word appears.
IDF: the inverse of document frequency.
TF-IDF: Term Frequency * Inverse Document Frequency. High: the word appears often in the document but rarely in the rest of the corpus. Low: the word appears in the document but is also common across the whole corpus.
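The TF-IDF definition above can be sketched in a few lines. This is a minimal version using raw counts for TF and idf(t) = log(N / df(t)); library implementations such as scikit-learn's TfidfVectorizer use smoothed variants, so exact numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a TF-IDF score for every (document, term) pair.

    TF = raw count of the term in the document.
    IDF = log(N / DF), where DF counts documents containing the term.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "dog" appears in only one document, so it scores highest there.
```

Note how this matches the high/low intuition above: a term in every document gets IDF = log(1) = 0, no matter how frequent it is locally.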
Stopping: removing stopwords.
Filter by Token Pattern: excludes tokens based on patterns.
Stemming: removing plurals and conjugations by stripping suffixes.
Lemmatization: reducing words to their lemma (an actual dictionary word).
Document Frequency Filtering: removing tokens that appear too frequently (not useful) or too rarely (typos, names, numbers).
N-Grams: groups of N consecutive tokens; counting which N-grams repeat reveals common phrases.
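Extracting N-grams is a one-liner over a sliding window; a short sketch:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is in new york state".split()
bigrams = ngrams(tokens, 2)
# ('new', 'york') occurs twice, so it stands out as a repeating bigram.
```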
Part-of-Speech (POS) Tagging: assigns grammatical categories (tags) to the words in a sentence. (noun, verb, pronoun, etc.)
Constituency (/Dependency) Parsing: analyzing the grammatical structure of a sentence; defines relations between words.
Named Entity Recognition (NER): identifies specific entities in text and classifies them into categories. Labeling is context dependent; the training material consists of annotated samples: full text + spans + categories.
Entity span: N consecutive words forming an entity. (1 entity = 1 span = N consecutive words)
Knowledge graph: shows the relationships between entities.
Entity Recognition: which things are named. (people / places / names)
Named Entity Linking: which people / places / names they are, i.e. linking each mention to a known entity.
Distributional Semantics: the meaning of a word can be inferred from the contexts in which it appears.
Word Context: a word's meaning can be inferred from the words around it.
Context window: the (N-1)/2 words before and after the target word. (window size 2: 2 + 1 + 2 = 5-gram)
Word co-occurrence matrix: a table showing how often words appear together within a context window. 1 cell = the count of how often the TARGET word appears in the window of the CONTEXT word.
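The context-window and co-occurrence-matrix definitions above can be sketched directly: for each target token, count every token within `window` positions to either side.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Build a word co-occurrence matrix as nested counts:
    counts[target][context] = how often `context` appears within
    `window` positions of `target` (the token itself is excluded)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
m = cooccurrence(tokens, window=2)
# The row m["cat"] is the co-occurrence word vector for "cat".
```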
Word embedding: a numerical vector representation of a word that captures its meaning, context, and relationships with other words.
Word vector (co-occurrence): the row vector for a word in the co-occurrence matrix, used as a word embedding.
Word2Vec: given a word, we build a vector that models the probability distribution of the words around it, as observed in our corpus.
Word vector (Word2Vec): a vector that models the probability distribution of the words around the given word.
Document embedding: a document vector composed from the word vectors of the words it contains (e.g. their average).
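One simple way to compose a document embedding from word vectors, as defined above, is to average them and compare documents by cosine similarity. The 2-dimensional word vectors below are hypothetical toy values, not real embeddings.

```python
import math

def average_embedding(tokens, word_vectors):
    """Compose a document vector as the mean of its word vectors."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 2-dimensional word vectors, for illustration only.
word_vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "stock": [0.0, 1.0]}
pets = average_embedding(["cat", "dog"], word_vectors)
finance = average_embedding(["stock"], word_vectors)
# `pets` stays close to "cat"/"dog" and far from `finance`.
```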
Encoder architecture: the role of the encoder is to transform the document into a vector representation. The final vector c is the document embedding: encoder(doc) = vectors.
Transformer: encodes each token; a token's encoding is influenced by ALL other tokens.
Masked Language Model: drop some words at random and ask the model to predict the missing word; the [MASK] token marks the missing word.
Attention: the mechanism by which the transformer focuses on the most important parts of the input. Input and output: N vectors of dimension d.
CLS (classification) token: a special token added to the beginning of an input sentence: [CLS] 'sentence' [SEP]/[STOP].
Fine-Tuning: pre-trained (already powerful) models can be adapted to specific tasks. The model weights get updated and an additional HEAD layer is trained; supervised, done multiple times.
Pre-Training: language model training; self-supervised, done once.
Transformer Architecture: the class, i.e. the structure of the model.
Transformer Trained on a Task: the weights, i.e. the learned parameters of the model.
Decoder: the part of a transformer model that transforms numerical vectors back into text. Generation starts with a BOS token and ends with a STOP token; text is predicted token by token.
Topic modeling: an unsupervised learning technique used to discover hidden topics in a collection of documents.
Topic: weighted words; keywords per topic rated with TF-IDF.
Cluster: in clustering-based topic modeling, each cluster of documents corresponds to one topic.
Cluster-Based Corpus (CBC): a corpus made of documents that were clustered and concatenated.
BERT: a pre-trained transformer-based language model designed to understand the meaning of words in context. Each word is represented by a contextual embedding; every token's embedding is influenced by all other tokens, which is possible because of the attention mechanism.
BERTopic: a topic modeling method built on a language model; it learns topics automatically from text via embedding + clustering.
Large Language Model (LLM): computes the probability of a sequence of words.
N-Gram Language Model: predicts the next word based on the previous N-1 words.
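An N-gram language model with N = 2 (a bigram model) can be trained just by counting: the next word depends only on the previous one. A minimal sketch:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Estimate P(next | previous) by counting bigram occurrences.
    "<s>" is a start-of-sentence marker so first words are counted too."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Greedy prediction: the most frequent continuation of `word`."""
    return counts[word].most_common(1)[0][0]

corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = train_bigram_lm(corpus)
# "cat" follows "the" twice, "dog" only once, so "cat" is predicted.
```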
Greedy Generation: at each step, select the single most likely next token (the one with the highest conditional probability given the text so far).
Prompt Completion: the first generated token is conditioned on the prompt, the second on the prompt + the first token, etc.
Top-K Sampling: instead of greedily picking the single most probable token, sample from the K most probable tokens.
Temperature (T): controls randomness. Low T = more predictable: more weight goes to high-probability tokens.
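Top-K sampling and temperature combine naturally: keep the K highest-scoring tokens, divide their logits by T, softmax, then sample. The logits below are hypothetical values for illustration.

```python
import math
import random

def sample_next(logits, k=3, temperature=1.0, rng=random):
    """Top-K sampling with temperature.

    Keep only the k highest-scoring tokens, scale their logits by 1/T,
    softmax, then draw one token from the resulting distribution.
    Low T sharpens the distribution (closer to greedy); high T flattens it.
    """
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scaled = [score / temperature for _, score in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(weights)
    probs = [w / total for w in weights]
    tokens = [t for t, _ in top]
    return rng.choices(tokens, weights=probs)[0]

# Hypothetical next-token logits, for illustration only.
logits = {"cat": 3.0, "dog": 2.5, "car": 0.1, "the": -1.0}
token = sample_next(logits, k=2, temperature=0.5)
# With k=2, only "cat" and "dog" can ever be sampled;
# with k=1 the method reduces to greedy generation.
```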
Emergent Abilities: larger language models, when trained on more data, can exhibit performance levels that are not seen in smaller models.
Retrieval Augmented Generation (RAG): retrieving relevant documents based on a user query and then using a language model to generate an answer grounded in those documents.
SystemMessage: provides instructions to the model.
HumanMessage: represents the user input in the conversation.
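The RAG pipeline and the two message types above can be sketched end to end. This toy version ranks documents by word overlap with the query (real RAG systems use embedding similarity) and represents the system/human messages as plain dicts rather than the SystemMessage/HumanMessage classes from a framework like LangChain.

```python
def retrieve(query, documents, top_n=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:top_n]

def build_messages(query, documents):
    """Assemble the messages an LLM would receive: retrieved context
    goes into the system message, the user query into the human message."""
    context = "\n".join(retrieve(query, documents))
    return [
        {"role": "system",
         "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": query},
    ]

docs = [
    "TF-IDF weighs terms by frequency and rarity.",
    "BERT produces contextual embeddings.",
    "Temperature controls sampling randomness.",
]
messages = build_messages("what does temperature control", docs)
# The relevant document about temperature ends up in the system message.
```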
