Natural Language Generation types
1) Data-to-text: the input may look like English, but it is structured data; the output is natural language
o Summary of data: input is data, output is text
2) Input is a query, output is text
o The same architecture can be used for both scenarios
• Questions for these examples
o What is the model’s output based on? What is it grounded in?
o Can we handle these cases with the same architecture?
o What questions should we ask when evaluating it?
• Natural language generation: tasks of generating text from some input in any natural language. Settings:
o Text-to-text, e.g. classical tasks: summarization, machine translation
o Data-to-text, e.g. summarizing tables (sports/weather data), summarizing patient data
o Media-to-text, e.g. captioning images, describing videos
o Open-ended (creative) generation, e.g. generating (fictional) stories, poetry based on prompts
▪ Deep neural networks (transformers) offer a unified framework that handles all of
these kinds of language generation, so a single model can perform all these tasks
• What to say and how to say it - Thompson
o Strategic choices: what the system/human chooses to say
▪ Based on: input, additional knowledge and target language
▪ E.g. in the street organ example: street, organ, people
o Tactical choices: how to say it
▪ Highly dependent on language
▪ E.g. in the street organ example: a street organ on a city street.
• History
o Difficult to stop hallucinations from happening
o Extrinsic evaluation of generated text, e.g. smoking-cessation letters
o Learning without parameter updates: learn by showing examples
• Dimensions when generating text
o Language: fluency, variation, style & coherence
o World: accuracy with relation to input, faithfulness to input &
truthfulness
o Interpersonal (pragmatics, sociolinguistics): alignment to
communicative intent, avoidance of harm
L2 Subtasks Involved in Generating Text
Modular vs end-to-end
• Modular architecture: breaks down the main task into subtasks, modelling each one separately. Dominant
approach in ‘classical’ (pre-neural) NLG systems
• End-to-end models: no or fewer explicit subtasks. Less attention is paid to designing the steps in between,
but attention is paid to designing a learning framework. Contemporary models are trained end-to-end.
o Start from pairings of inputs and outputs; the system needs to find the pathway itself
o Harder to figure out where the choices are being made
• Various tasks can be grouped into a three-stage pipeline: starting from the input, text is generated via a
series of intermediate steps
o The architecture represents a ‘consensus’ view
Reiter’s pipeline architecture, highly modular
• Document planner: picks what information to convey and how it is organised. About the what
o Domain and task related things
• Microplanner: about the how, i.e. how the information is conveyed
o More language specific (words we use depend on which language is generated)
• Surface realiser: turns it into actual text
• The diagram doesn’t show knowledge sources! Such as domain knowledge (e.g. information about the
weather), lexical/grammatical knowledge, and a model of the user
• Strategic tasks
o Selecting the messages to be included
o Rhetorical structuring: relating messages with rhetorical relations, e.g. a contrast relation
o Ordering
o Segmentation
• Tactical tasks (microplanning)
o Lexicalisation: the words that we choose (use warm or hot?)
o Referring Expression Generation: how do we refer to things in the domain?
o Aggregation: how do we merge to get more fluid sentences?
• Tactical tasks (realisation)
o Choosing syntactic structures
o Applying morphological rules
o Rendering the text as a string
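The three-stage pipeline above can be sketched in code. This is a toy illustration, not any real system: the message format, function names, and choice rules are all invented for the example.

```python
# Toy sketch of Reiter's pipeline: document planner -> microplanner -> realiser.
# Messages are dicts with invented fields (subject, value, time, important).

def document_planner(data):
    """Strategic choices: select the messages to convey and order them."""
    msgs = [m for m in data if m["important"]]          # content selection
    return sorted(msgs, key=lambda m: m["time"])        # information ordering

def microplanner(messages):
    """Tactical choices: lexicalise each message as a simple clause."""
    return [f"{m['subject']} was {m['value']}" for m in messages]

def surface_realiser(clauses):
    """Render the clauses as an actual text string."""
    return ". ".join(clauses) + "."

def generate(data):
    return surface_realiser(microplanner(document_planner(data)))

data = [
    {"subject": "heart rate", "value": "stable", "time": 1, "important": True},
    {"subject": "sensor 3", "value": "noisy", "time": 0, "important": False},
]
print(generate(data))  # heart rate was stable.
```

Each stage is a separate function, which is exactly the modularity the notes describe: every choice point is explicit and inspectable, unlike in an end-to-end trained model.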
• When dealing with raw, unstructured data, steps have to be taken before generating text. Data has to be
analysed in order to:
o 1) Identify the important things and filter out noise
o 2) Map the data to appropriate input representations
o 3) Perform some reasoning on these representations
• Extension of original architecture pipeline to handle data pre-processing – Reiter (2007)
o Signal analysis: to extract patterns and trends from unstructured input data
o Data interpretation: to perform reasoning on the results
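These two pre-processing steps can be sketched on a toy heart-rate series. The trend rules and the message format below are invented for illustration; they are not from Reiter (2007).

```python
# Toy sketch of the two pre-processing stages added by Reiter (2007):
# signal analysis (find patterns in raw data) and data interpretation
# (reason about those patterns before document planning).

def signal_analysis(readings):
    """Detect a simple trend in a numeric time series."""
    diffs = [b - a for a, b in zip(readings, readings[1:])]
    if all(d > 0 for d in diffs):
        return "rising"
    if all(d < 0 for d in diffs):
        return "falling"
    return "stable"

def data_interpretation(trend, threshold_crossed):
    """Turn the detected pattern into a message for the document planner."""
    if trend == "rising" and threshold_crossed:
        return {"event": "alarm", "cause": "upward trend past threshold"}
    return {"event": "normal", "cause": trend}

hr = [92, 97, 104, 113]
trend = signal_analysis(hr)
print(data_interpretation(trend, max(hr) > 110))
```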
• Example: BabyTalk (a data-to-text system)
• Document planning/content selection
o Main tasks: content selection & information ordering
o Typical output is document plan
▪ Tree whose leaves are messages
▪ Nonterminals indicate rhetorical relations between messages (e.g. justify, part-
of/includes, cause, sequence)
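One way to encode such a document plan as a data structure, following the description above (the class names and example messages are assumptions for illustration):

```python
# A document plan as a tree: leaves are messages, internal nodes carry a
# rhetorical relation between their children (e.g. "cause", "sequence").
from dataclasses import dataclass, field

@dataclass
class Message:
    content: str

@dataclass
class RhetoricalNode:
    relation: str                       # e.g. "justify", "cause", "sequence"
    children: list = field(default_factory=list)

def leaves(node):
    """Collect the plan's messages in left-to-right order."""
    if isinstance(node, Message):
        return [node.content]
    return [c for child in node.children for c in leaves(child)]

plan = RhetoricalNode("cause", [
    Message("HR dropped to 50"),
    RhetoricalNode("sequence", [
        Message("alarm sounded"),
        Message("nurse intervened"),
    ]),
])
print(leaves(plan))  # ['HR dropped to 50', 'alarm sounded', 'nurse intervened']
```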
• Lexicalisation: events in a sequence can be described in many ways to express the same thing
o SEQUENCE(x,y,z)
▪ x happened, then y, then z
▪ x happened, followed by y and z
▪ x,y,z happened
▪ there was a sequence of x,y,z
o With enough data, this variation can be learned
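Template-based lexicalisation of SEQUENCE(x, y, z) can be sketched as choosing among the verbalisations listed above:

```python
# Sketch of template-based lexicalisation: one message, several surface forms.
import random

TEMPLATES = [
    "{0} happened, then {1}, then {2}",
    "{0} happened, followed by {1} and {2}",
    "{0}, {1}, {2} happened",
    "there was a sequence of {0}, {1}, {2}",
]

def lexicalise_sequence(x, y, z, rng=random):
    """Pick one verbalisation of SEQUENCE(x, y, z)."""
    return rng.choice(TEMPLATES).format(x, y, z)

print(lexicalise_sequence("the alarm", "a bradycardia", "an intervention"))
```

Here the variation is hand-listed; the point in the notes is that, with enough data, a statistical model can learn this choice instead.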
• Aggregation: given 2 or more messages, identify ways in which they could be merged into one, more concise
message
o e.g. be(HR, stable) + be(HR, normal)
▪ (No aggregation) HR is currently stable. HR is within the normal range
▪ (conjunction) HR is currently stable and HR is within the normal range
▪ (adjunction) HR is currently stable within the normal range
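A single toy aggregation rule (adjunction), assuming messages of the form be(entity, property) represented as tuples:

```python
# Sketch of one aggregation rule: merge two be(entity, property) messages
# about the same entity into one more concise clause (adjunction).

def aggregate(m1, m2):
    (e1, p1), (e2, p2) = m1, m2
    if e1 == e2:
        return f"{e1} is currently {p1} {p2}"   # adjunction
    return f"{e1} is {p1}. {e2} is {p2}"        # no aggregation possible

print(aggregate(("HR", "stable"), ("HR", "within the normal range")))
# HR is currently stable within the normal range
```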
• Referring expressions: given an entity, identify the best way to refer to it unambiguously, e.g. bradycardia: a
bradycardia, the bradycardia, it, the previous one.
o Depends on discourse context: pronouns only make sense if entity has been referred to before
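That discourse-context rule can be sketched as a toy heuristic (real referring-expression generation algorithms are far more involved; this only captures the pronoun-after-first-mention idea):

```python
# Toy referring-expression generation: use a pronoun only if the entity
# has already been mentioned in the discourse.

def refer(entity, mentioned):
    if entity in mentioned:
        return "it"                 # previously mentioned -> pronoun is safe
    mentioned.add(entity)
    return f"a {entity}"            # first mention -> indefinite description

mentioned = set()
print(refer("bradycardia", mentioned))  # a bradycardia
print(refer("bradycardia", mentioned))  # it
```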
• Syntactic planning: sentence form can vary and still express the same thing
o Realisation subtasks
▪ 1) Map the output of microplanning to a syntactic structure
▪ 2) Identify the best form, given the input representation (Which is the best alternative?
Very hard to model in a rule-based fashion. Statistical approaches provide a solution.)
▪ 3) Apply inflectional morphology (plural, past tense etc) and then linearise as text string
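Steps 1–3 can be sketched with toy English morphology. The inflection rules below are deliberately simplistic (they ignore consonant doubling and irregular verbs), to show only the shape of the realisation step:

```python
# Sketch of surface realisation: apply inflectional morphology to a simple
# subject-verb-object structure, then linearise it as a text string.

def inflect(verb, tense):
    """Toy inflectional morphology for regular English verbs."""
    if tense == "past":
        return verb + "d" if verb.endswith("e") else verb + "ed"
    return verb + "s"   # 3rd person singular present

def realise(subject, verb, obj, tense="present"):
    """Linearise the syntactic structure as a sentence string."""
    return f"{subject.capitalize()} {inflect(verb, tense)} {obj}."

print(realise("the nurse", "increase", "the oxygen", tense="past"))
# The nurse increased the oxygen.
```

Writing such rules by hand quickly becomes unmanageable, which is why the notes point to statistical approaches for choosing the best form.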
• Key takeaways
o Text generation involves a series of choices
o Strategic choices (what) → content selection and microplanning
o Tactical choices (how) → microplanning and realisation
o Classic systems
▪ Heavily engineered
▪ Often modular
▪ Full control of choice behaviour
▪ Limited fluency and variation
o Contemporary models
▪ Trained (neural)
▪ Choice behaviour is stochastic, and learned from data
▪ Harder to control
▪ Much more fluent, broader variation
o Generating meaningful text is really hard
Image captioning - Modular and Data-Driven approaches
• The general setup is similar to the data-to-text scenario, only the input is now a picture
• Kulkarni et al (2011)
o Key contribution: map from object/attribute detections to generated sentences
▪ Blue = objects
▪ Orange = spatial relations
▪ Green = other attributes
o “This is a photograph of one person and one brown sofa and one dog. The person is against the
brown sofa. And the dog is near the person and beside the brown sofa.”
o Modular, step-by-step pipeline (illustrated with the dog example above)
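A Kulkarni-style template filler can be sketched as follows. The sentence frame and the detection format are illustrative approximations of the example above, not the paper's actual method:

```python
# Sketch of template-based captioning from detections: objects and spatial
# relations are slotted into a fixed sentence frame.

def caption(objects, relations):
    objs = " and ".join(f"one {o}" for o in objects)
    rels = " ".join(f"The {a} is {rel} the {b}." for a, rel, b in relations)
    return f"This is a photograph of {objs}." + (" " + rels if rels else "")

print(caption(
    ["person", "brown sofa", "dog"],
    [("person", "against", "brown sofa"), ("dog", "near", "person")],
))
```

The stilted output ("one person and one brown sofa and one dog") shows the limited fluency of such modular, template-driven systems, in line with the key takeaways above.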
• Mitchell et al (2012)
o Key contribution: exploit corpus-based knowledge for generation
o Finds the most likely way to relate the words that describe the image, though this can be wrong