Summary Information Retrieval Entire Course

Uploaded on: 21-03-2022

Written in: 2021/2022

Everything you need to know for the Information Retrieval Exam!

Information Retrieval Summary
§1 Introduction to IR

Information retrieval (IR): finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers)

Information Retrieval (IR) versus Databases (DB):
• IR: unstructured data, loose semantics, partial query matching, probabilistic
• DB: structured data, well-defined queries, exact query matching, deterministic

Criteria for good search engines:
• How fast does it search? Latency as a function of index size
• How fast does it index? Number of documents per hour
• Expressiveness of the query language: ability to express complex information needs
• Disk space requirements
• User happiness

Search speed: latency of common operations (fastest to slowest)
• Main memory reference (read random byte from memory)
• Compress 1K of data (compress 1000 bytes in memory)
• SSD random read (read random byte from solid-state drive)
• Round trip within the same datacenter (send one byte to another computer in the
same fast datacenter network and back)
• Hard disk seek (read random byte from hard disk)
• Send one byte from the Netherlands to California and back

§2 Indexing and Boolean Retrieval

Term-Document Incidence Matrix: a matrix with the terms on the rows and the documents on
the columns. An entry is 1 if the document contains the term and 0 otherwise. The matrix
is sparse (mostly zeros).
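
As a minimal sketch of the incidence matrix, the snippet below builds one for a toy collection. The three documents and the helper name incidence_matrix are illustrative assumptions, not taken from the course material.

```python
# Toy collection; the documents and their contents are illustrative assumptions.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

def incidence_matrix(docs):
    """Return (terms, doc_ids, matrix) where matrix[i][j] is 1 if term i occurs in doc j."""
    doc_ids = sorted(docs)
    tokens = {d: set(text.lower().split()) for d, text in docs.items()}
    terms = sorted(set().union(*tokens.values()))
    matrix = [[1 if t in tokens[d] else 0 for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix

terms, doc_ids, matrix = incidence_matrix(docs)
row = dict(zip(terms, matrix))
print(row["july"])  # [0, 1, 1]
```

A dense matrix like this only works for tiny collections; because almost all entries are 0, real systems store only the 1 entries, which leads to the inverted index.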

Inverted Index:
• For each term (word) t we need to store a list of all documents that contain t
• Identify each document by a document ID (doc ID)
• We need variable-size lists (postings lists)
  o On disk: a contiguous run of postings is best
  o In memory: linked lists or variable-length arrays

Inverted Index Construction: tokenizer > normalization > indexer
• Indexer Step 1: token sequence: remove capitals and punctuation marks and make a
table of all (term, doc ID) pairs
• Indexer Step 2: sort by term (and then by doc ID)
• Indexer Step 3: split into dictionary (terms) and postings (doc IDs); merge multiple
entries of the same term and add the document frequency
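
The three indexer steps can be sketched as follows. This is a simplified in-memory version; the function names tokenize and build_index and the two example documents are assumptions made for illustration.

```python
import re
from itertools import groupby

def tokenize(text):
    """Lower-case the text and keep only alphanumeric runs (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to (document frequency, sorted list of doc IDs)."""
    # Step 1: table of (term, doc ID) pairs
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Step 2: sort by term, then by doc ID
    pairs.sort()
    # Step 3: merge entries of the same term and record the document frequency
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index

docs = {1: "I did enact Julius Caesar.", 2: "So let it be with Caesar."}
index = build_index(docs)
print(index["caesar"])  # (2, [1, 2])
```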

For AND queries we need to intersect ("merge") two postings lists. One way to do this: walk
through the two lists simultaneously, looking only at the head element of each. If the two
doc IDs are the same, add that doc ID to the result list and move forward in both lists;
otherwise move forward in the list with the smaller doc ID. This works because postings lists
are sorted by doc ID, and it takes time linear in the total number of postings.
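
A minimal sketch of this merge, assuming both postings lists are Python lists of doc IDs in increasing order (the example lists are made up for illustration):

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by doc ID."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same doc ID at both heads: it matches the AND query
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # discard the smaller head of p1
        else:
            j += 1  # discard the smaller head of p2
    return result

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```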

The Boolean retrieval model allows for queries that are Boolean expressions:
• Each document is viewed as a set of words
• AND, OR, and NOT can be used to join query terms
• It is precise: every document either matches the condition or it does not
• Query optimization: process terms in order of increasing document frequency, so start
with the smallest postings list (this is why we store the document frequency)
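
The query optimization can be sketched as below. The function name intersect_many and the example postings are assumptions; for brevity the sketch uses the length of each postings list as the document frequency and a set lookup instead of the linear merge.

```python
def intersect_many(index, terms):
    """AND together several terms, starting with the rarest one.

    index maps a term to its sorted postings list; a missing term has no postings.
    """
    if not terms:
        return []
    # Process terms in order of increasing document frequency
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # the intermediate result is already empty, nothing can match
        other = set(index.get(term, []))
        result = [doc_id for doc_id in result if doc_id in other]
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_many(index, ["brutus", "caesar", "calpurnia"]))  # [2]
```

Starting with the smallest list keeps every intermediate result at most as large as that list, so the later intersections have less work to do.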

‘to sleep perchance to dream’
• Token: an instance of a sequence of characters in some particular document that are
grouped together as a useful semantic unit for processing
  o = 5: to, sleep, perchance, to, dream
• Type: the class of all tokens consisting of exactly the same character sequence
  o = 4: to, sleep, perchance, dream
• Term: a (normalized) type that is included in the IR system’s dictionary
  o = 3: sleep, perchance, dream (“to” is omitted as a stop word)
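
The three counts can be reproduced with a few lines. The stop-word list containing only "to" is an assumption that matches the example above.

```python
text = "to sleep perchance to dream"
stop_words = {"to"}  # assumption: "to" is the only stop word in this example

tokens = text.split()                               # every occurrence counts
types = set(tokens)                                 # distinct character sequences
terms = {t for t in types if t not in stop_words}   # types kept in the dictionary

print(len(tokens), len(types), len(terms))  # 5 4 3
```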

An index for a small collection (like Reuters) fits into memory, but indexes for more realistic
collections like Common Crawl or Google do not; they have to be stored on disk or SSD. The
bottleneck is sorting: once the data no longer fits in memory, an in-memory sort is far too
slow. Therefore, we need an external sorting algorithm:
• Blocked Sort-Based Indexing (BSBI): define blocks, sort each block in memory and
then write it back to disk; in the end merge all blocks
• It is most efficient to do an n-way merge, where we read from all (or many) blocks
simultaneously
• Load and write sorted blocks as buffered streams from and to disk
• With a proper buffer size, disk access is minimized
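
A small sketch of the BSBI idea, under several assumptions: blocks are stored as tab-separated text files in a temporary directory, the helper names are hypothetical, and the merged result is collected in a dictionary only so it can be inspected (a real system would stream it back to disk).

```python
import heapq
import os
import tempfile

def write_block(pairs, directory):
    """Sort one block in memory and write it to disk as 'term<TAB>doc_id' lines."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".blk", text=True)
    with os.fdopen(fd, "w") as f:
        for term, doc_id in sorted(pairs):
            f.write(f"{term}\t{doc_id}\n")
    return path

def read_block(path):
    """Stream the (term, doc_id) pairs of a sorted block back from disk."""
    with open(path) as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)

def bsbi(pair_stream, block_size):
    """Build term -> sorted doc IDs using sorted blocks and an n-way merge."""
    index = {}
    with tempfile.TemporaryDirectory() as directory:
        paths, block = [], []
        for pair in pair_stream:
            block.append(pair)
            if len(block) == block_size:
                paths.append(write_block(block, directory))
                block = []
        if block:
            paths.append(write_block(block, directory))
        # n-way merge: read from all sorted blocks simultaneously
        for term, doc_id in heapq.merge(*(read_block(p) for p in paths)):
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

pairs = [("b", 2), ("a", 1), ("a", 3), ("b", 1), ("a", 1), ("c", 2)]
print(bsbi(pairs, block_size=2))  # {'a': [1, 3], 'b': [1, 2], 'c': [2]}
```

Python's file objects are buffered by default, which plays the role of the buffered streams mentioned above.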

Distributed Indexing with MapReduce: see picture slide 37

§3 What to Index?

We want to be able to answer queries like “information retrieval” as a phrase, such that
“there is no information about the retrieval of raw materials” is not a match. How?

Solution 1 is biword indexes: index every consecutive pair of terms as a phrase:
• “Friends, Romans, Countrymen” would create the biwords:
  o friends romans
  o romans countrymen
• Each of these biwords is now a dictionary term

Problems of biword indexes:
• False positives in the answer set (for queries with more than 2 words)
• In particular for phrases with frequent words like “beer of the month”
• Index blow-up due to the bigger dictionary
• Infeasible for anything longer than biwords
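
The sketch below builds a small biword index and shows a false positive for a three-word phrase. The second document is constructed for illustration so that it contains both biwords but not the phrase itself; the function names are assumptions.

```python
import re

def tokenize(text):
    """Lower-case the text and keep only alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def build_biword_index(docs):
    """Map each pair of consecutive terms to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for first, second in zip(tokens, tokens[1:]):
            index.setdefault(f"{first} {second}", set()).add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Documents that contain every biword of the phrase (may include false positives)."""
    tokens = tokenize(phrase)
    biwords = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

docs = {
    1: "Friends, Romans, countrymen",
    2: "Romans countrymen and friends romans",  # both biwords, but not the phrase
}
index = build_biword_index(docs)
print(sorted(phrase_candidates(index, "friends romans countrymen")))  # [1, 2]
```

Document 2 is returned although it never contains the three words in sequence, which is exactly the false-positive problem listed above.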

Solution 2 is positional indexes: store in the postings for each term the position(s) at which
tokens of it appear.

General form:
<term: number-of-docs-containing-term;
Doc1: position1, position2, … ;
Doc2: position1, position2, … ;
…>

Example:
<be: 993427;
1: 3, 149;
2: 17, 191, 291, 430, 434;
…>

• A positional index requires substantially more space in memory and on disk, but it is
the standard solution due to the power and usefulness of phrase queries
• A positional index is 2-4 times larger than a non-positional index
• The size of a positional index is 35-50% of the volume of the original text

Only document 2 can contain “to be or not to be”
Processing a phrase query: extract the inverted index entries for each distinct term: to, be, or,
not. Merge their document position lists to enumerate all positions with “to be or not to be”.
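
A minimal sketch of phrase matching with a positional index. It assumes whitespace tokenization, 1-based positions, and hypothetical function names; a production system would merge the sorted position lists instead of using membership tests.

```python
def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} with 1-based token positions."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return the doc IDs in which the terms of the phrase occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The k-th term of the phrase must appear at position start + k
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "there is no information about the retrieval of raw materials",
    2: "an introduction to information retrieval",
}
index = build_positional_index(docs)
print(phrase_query(index, "information retrieval"))  # [2]
```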

The two approaches (biword and positional indexes) can also be combined. This requires 26%
more space than a positional index alone, but a query can be executed in ¼ of the time.

Normalization:
• How to tokenize/normalize tokens like Den Haag, or 20/12/21?
• Use compound splitting for Dutch nouns, e.g. ziektekostenverzekering
• Case folding: reduce all letters to lower case (possible exception: upper case mid-sentence)
• Tokenization and normalization may depend on the language and so are intertwined
with language detection
• Stop words: create a list of the most common words, such as “of” and “the”, and
remove them from the token list before the indexing step.
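
Case folding and stop-word removal can be sketched as follows. The stop-word list is a small illustrative assumption, not a standard list.

```python
import re

# Illustrative stop-word list; real lists are longer and language dependent.
STOP_WORDS = {"of", "the", "a", "and", "to", "in"}

def normalize(text, stop_words=STOP_WORDS):
    """Case-fold the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(normalize("Beer of the Month"))  # ['beer', 'month']
```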

Lemmatization: reduce inflectional / variant forms to base form. This implies doing proper
linguistic reduction to dictionary headword form
• Am, are, is, were, being -> be
• Car, cars, car’s, cars’ -> car

Stemming: reduce terms to their roots before indexing; this is crude affix chopping and is
language dependent
• Automate, automates, automatic, automation -> automat
• Porter’s algorithm is the most common for stemming English; its conventions are
applied in 5 phases of reduction, and each phase has a set of rules
  o *sses -> *ss
  o *ational -> *ate
  o *tional -> *tion
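
The three example rules can be applied with a simple suffix rewrite. This is only a sketch of the rule format, not Porter's algorithm: the real algorithm groups rules into phases and adds conditions on the remaining stem, which are omitted here. The function name apply_first_rule is an assumption.

```python
# (suffix, replacement) pairs; longer suffixes are listed before shorter ones
# so that "ational" is tried before "tional".
RULES = [("sses", "ss"), ("ational", "ate"), ("tional", "tion")]

def apply_first_rule(word, rules=RULES):
    """Rewrite the word with the first rule whose suffix matches, if any."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "relational", "conditional", "dream"]:
    print(word, "->", apply_first_rule(word))
# caresses -> caress
# relational -> relate
# conditional -> condition
# dream -> dream
```

Without the stem conditions this sketch over-stems short words (for example, it would turn "rational" into "rate"), which is one reason the full algorithm checks the length of the stem first.
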
Spelling correction: used to correct documents being indexed and used to correct user
queries to retrieve right answers.

Written for

Institution: Vrije Universiteit Amsterdam (VU)
Study: Business Analytics
Course: Information Retrieval

Document information

Uploaded on: 21 March 2022
Number of pages: 17
Written in: 2021/2022
Type: Summary

Seller: femkestokkink, Vrije Universiteit Amsterdam
