BUAL5660 EXAM 2 QUESTIONS AND ANSWERS WITH
COMPLETE SOLUTIONS
Terms in this set (67)
Primary data Researchers collect their own data; driven by theory/hypothesis; examples --
surveys, interviews, experiments, and direct observations; complete control on
data collection; can control for external conditions; mostly confirmatory analysis;
traditional statistics methods
Secondary data Somebody else already collected the data; driven by broader topic of study;
examples -- customer data, census, etc.; not in control; cannot control additional
variables; mostly exploratory analysis; data analytics techniques
API An application programming interface is a set of programming instructions and
standards for accessing a Web-based software application or web tool; a
software company releases its API to the public so that other software developers
can design products that are powered by its service; software-to-software
interface, not a user interface; ex: payment for a movie ticket
REST API (Representational State Transfer) A software architectural style for implementing
web services; historical data; need authorization from the company
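A minimal sketch of what a REST request with authorization could look like, using only the Python standard library. The endpoint `https://api.example.com/v1/movies`, the key `YOUR_API_KEY`, and the bearer-token header are assumptions for illustration; a real service documents its own URL and authorization scheme. Nothing is sent over the network unless `fetch_json` is called.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and key, for illustration only.
BASE_URL = "https://api.example.com/v1/movies"
API_KEY = "YOUR_API_KEY"


def build_url(base_url, params):
    """Append URL-encoded query parameters to the endpoint."""
    return base_url + "?" + urllib.parse.urlencode(params)


def build_request(url, api_key):
    """Create a GET request that carries the key in an Authorization header."""
    headers = {"Authorization": "Bearer " + api_key}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(request):
    """Send the request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url(BASE_URL, {"title": "Inception", "year": 2010})
request = build_request(url, API_KEY)
```

The software-to-software idea shows up in `fetch_json`: the caller gets structured data back rather than a page meant for a person.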
Streaming API Access to live inputs and automatically receive new info without requesting this
again.
Public REST API Google Maps, Flickr, YouTube, Amazon Product Advertising, Wikipedia,
LinkedIn, Facebook
Web Crawling or Spidering "beautiful soup"; a search engine employs special software robots,
called spiders, to build lists of the words found on Web sites. When a spider is building
its lists, the process is called Web crawling. The spider will begin with a popular site,
indexing the words on its pages and following every link found within the site. The
spidering system quickly begins to travel, spreading out across the most widely
used portions of the Web. Words occurring in the title, subtitles, meta tags and
other positions of relative importance are noted for special consideration during
a subsequent user search.
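The crawl-and-index loop described above can be sketched as follows. This is a toy under stated assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web (a real spider would download each page), and the standard-library `html.parser` is used in place of Beautiful Soup so the sketch has no dependencies.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch pages.
PAGES = {
    "/home": '<title>Home</title><a href="/news">news</a> <a href="/about">about</a>',
    "/news": '<title>News</title><p>text mining news</p><a href="/home">home</a>',
    "/about": '<title>About</title><p>about text analytics</p>',
}


class PageParser(HTMLParser):
    """Collects outgoing links and lowercase words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))


def crawl(start):
    """Breadth-first crawl: index each page's words, then follow its links."""
    index = {}       # word -> set of URLs containing it
    visited = set()  # pages already indexed, so loops are not re-crawled
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        parser = PageParser()
        parser.feed(PAGES[url])
        parser.close()
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        queue.extend(parser.links)
    return index, visited


index, visited = crawl("/home")
```

Looking up `index["mining"]` then plays the role of the later user search.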
Problems with web scraping Your IP may get banned by the website; denial of service attacks;
data behind the login wall
RSS Feed In KNIME: Excel Reader --> RSS Feed Reader
Supervised learning Target variable/Dependent Variable known and present
Unsupervised learning Target variable/Dependent Variable NOT known;
goal is to extract relationships between variables; clusters
Text Analytics information retrieval + information extraction + data mining + web mining
"Knowledge discovery in textual data"
85-90% of all corporate data is in some kind of unstructured form (e.g., text);
unstructured corporate data is doubling in size every 18 months; tapping into
these information sources is not an option, but a need to stay competitive
Text Mining A semi-automated process of extracting knowledge from unstructured data
sources, aka text data mining or knowledge discovery in textual databases
Benefits of text mining are obvious, especially in text-rich data environments [law
(court orders), academic research (research articles), finance (quarterly reports),
medicine (discharge summaries), biology (molecular interactions), technology
(patent files), marketing (customer comments)]
Electronic communication records (spam filtering, email prioritization and
categorization, automatic response generation)
information extraction, topic tracking, summarization, categorization, clustering,
concept linking, question answering
Data Mining vs. Text Mining Both seek novel and useful patterns;
both are semi-automated processes
structured data in databases
unstructured data Word documents, PDF files, text excerpts, XML files, and so on
document unit of analysis
corpus The collection of documents, required for text analysis
terms words that you analyze in the document
concepts the collection of words that you analyze
stemming In keyword searching, word endings are automatically removed ("lines" becomes
"line")
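A rough illustration of stemming as suffix stripping. This is a deliberately naive sketch, not the Porter stemmer or any library's algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen for this example.

```python
# Assumed suffix list for illustration; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["lines", "line", "searching", "searched"]]
# stems == ["line", "line", "search", "search"]
```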
stop words In database searching, "stop words" are small and frequently occurring words like
and, or, in, of that are often ignored when keyed as search terms. Sometimes
putting them in quotes " " will allow you to search them. Words that you do not
need for your analysis
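A minimal sketch of stop-word removal, assuming the text has already been split into tokens. The stop list here holds only the four examples named above; practical lists are much longer and depend on the analysis.

```python
# Stop list limited to the four examples in the definition above.
STOP_WORDS = {"and", "or", "in", "of"}


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


kept = remove_stop_words(
    ["benefits", "of", "text", "mining", "in", "law", "and", "finance"]
)
# kept == ["benefits", "text", "mining", "law", "finance"]
```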
synonyms words that have similar meanings
polysemes Words with the same spelling and related meanings, e.g. "foot" at the bottom of
your leg and "foot" of a mountain
tokenization the process of breaking up a given text into units called tokens
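One simple way to tokenize is with a regular expression. The sketch below assumes tokens are runs of letters or digits, optionally followed by an apostrophe suffix (so "isn't" stays whole); other tokenizers draw the boundaries differently.

```python
import re

# Assumed token pattern: letters/digits, with an optional apostrophe suffix.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Break text into word tokens, dropping punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Text mining isn't data mining, but both seek patterns.")
# tokens == ["Text", "mining", "isn't", "data", "mining",
#            "but", "both", "seek", "patterns"]
```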
lemmatization Removes inflectional endings only and returns the base or dictionary form of a
word
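Unlike stemming, lemmatization maps a word to a dictionary form, so it needs a lexicon. The sketch below uses a tiny hand-written lookup table, `LEMMAS`, which is a hypothetical stand-in for a real lexical resource; unknown words are returned unchanged (lowercased).

```python
# Hypothetical mini-lexicon: inflected form -> dictionary form (lemma).
LEMMAS = {"lines": "line", "feet": "foot", "mice": "mouse", "was": "be"}


def lemmatize(token):
    """Return the dictionary form if known, otherwise the lowercased token."""
    lower = token.lower()
    return LEMMAS.get(lower, lower)


lemmas = [lemmatize(t) for t in ["Feet", "lines", "was", "corpus"]]
# lemmas == ["foot", "line", "be", "corpus"]
```

Note that "feet" becomes "foot", which suffix stripping alone could not produce.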
term dictionary Collection of terms specific to a narrow field that can be used to restrict the
extracted terms within a corpus
word frequency The frequency with which a word appears in a language
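In practice, word frequency is usually estimated by counting over a corpus. The sketch below assumes a two-document sample corpus (made up for illustration) stands in for the language, and counts lowercase alphabetic words.

```python
from collections import Counter
import re

# Made-up two-document corpus, for illustration only.
corpus = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in databases.",
]


def word_frequencies(documents):
    """Count how often each lowercase word occurs across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(re.findall(r"[a-z]+", document.lower()))
    return counts


frequencies = word_frequencies(corpus)
# frequencies["mining"] == 2, frequencies["data"] == 1
```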
part-of-speech tagging The process of marking up the words in a text as corresponding to a
particular part of speech based on a word's definition and context of its use