

Text and Speech Analytics
Theory Digital Assignment
Abstract:
In the realm of natural language processing, text generation stands as a pivotal challenge, especially in
generating contextually coherent and extended textual content. This project addresses the difficulties
associated with automated text generation, which is crucial for enhancing user interactions in applications
ranging from automated chatbots to creative writing aids. The proposed solution utilizes a Long Short-Term
Memory (LSTM) network, known for its efficacy in capturing long-term dependencies within data sequences,
to generate text character by character. This model is built with an architecture that includes 128 LSTM units
followed by a dense output layer activated by a softmax function, designed to predict the probability
distribution of the next character. Training on a subset of text allows the model to learn nuanced language
patterns and generate text with varying degrees of randomness and creativity, controlled by a temperature
parameter during the generation process. Initial results showcase the model's ability to produce text that is
not only coherent but also contextually appropriate at different levels of creativity, demonstrated by
generating text at temperatures ranging from 0.2 to 1.0. This flexibility highlights the model's potential in
adapting to various stylistic requirements of text-based applications.
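For orientation, the following is a minimal sketch of the architecture the abstract describes, assuming a 40-character context window and a placeholder vocabulary size; the full program later in this report derives both values from the training corpus.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Activation

SEQ_LENGTH = 40   # characters of context used to predict the next character
VOCAB_SIZE = 60   # placeholder; the real value is the number of distinct characters in the corpus

sketch = Sequential()
sketch.add(Input(shape=(SEQ_LENGTH, VOCAB_SIZE)))  # one-hot encoded character window
sketch.add(LSTM(128))                              # 128 LSTM units model long-range context
sketch.add(Dense(VOCAB_SIZE))
sketch.add(Activation('softmax'))                  # probability distribution over the next character
sketch.compile(loss='categorical_crossentropy', optimizer='rmsprop')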
Text Generation in Academic Environments:
Research and Writing Assistance: Text generation tools can assist in drafting academic papers, reports, and
literature reviews by suggesting sentence completions, paraphrasing ideas, or even generating entire
sections of content based on initial inputs. This can help streamline the writing process and improve
productivity.
Educational Material Production: For educators, generating customized content for courses, quizzes, and
explanatory notes can be made easier and faster. These tools can adapt content to different learning levels
and styles, enhancing educational accessibility and personalization.
Language Learning Tools:
Text generation can be used to create practice exercises for language learning, such as forming sentences,
grammar tests, and essay questions. It can provide instant feedback to students, enabling more effective and
interactive learning experiences.
Simulation of Historical Texts:
In the study of literature or history, text generators can simulate writing styles of historical periods or specific
authors. This can be used for educational purposes, such as understanding linguistic and stylistic elements
characteristic of different eras or writers.
Data Augmentation in Research:
In computational linguistics and data science, generating synthetic text can help in augmenting datasets
where textual data may be scarce or imbalanced. This is useful in training more robust models for language
understanding tasks.
Academic Integrity and Plagiarism Education:
Text generation models can be used to demonstrate examples of plagiarism or how to avoid it by
paraphrasing content correctly. They can also be used to develop better plagiarism detection tools by
creating diverse examples of text manipulation.
Enhancing Accessibility:
Generated text can be used to create summaries of long academic texts, making them more accessible to
students with disabilities or those who face challenges with large volumes of reading material.
Conference and Seminar Management:
Text generators can assist in creating abstracts, notifications, and promotional materials for academic events,
streamlining the administrative workload and improving communication effectiveness.
Python Code
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation, Input
from tensorflow.keras.optimizers import RMSprop
# Importing the corpus
filepath = r'D:\AK\ed3book.txt'
text = open(filepath, 'rb').read().decode(encoding='utf-8').lower()
text = text[300000:800000]
text  # notebook-style display of the raw slice
import re

def clean_text(text):
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespaces
    text = ' '.join(text.split())
    # Remove numbers (\d+ matches one or more digits)
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    text = text.lower()
    return text
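# Illustrative check (not part of the original assignment): punctuation is
# stripped and the remaining text is lowercased.
assert clean_text("Hello, World!") == "hello world"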
# Read text from your file (assuming filepath is set correctly)
with open(filepath, 'r', encoding='utf-8') as f:
    text = f.read()
# Clean the text
text = clean_text(text)
text = text[300000:800000]
# Creating a set that contains all characters in the selected text
characters = sorted(set(text))
# Converting characters to numeric values
char_to_index = dict((c, i) for i, c in enumerate(characters))
# Numeric values to characters
index_to_char = dict((i, c) for i, c in enumerate(characters))
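# Illustrative example (hypothetical toy corpus): if the text contained only
# the characters "abc ", the mappings would be
#   char_to_index == {' ': 0, 'a': 1, 'b': 2, 'c': 3}
#   index_to_char == {0: ' ', 1: 'a', 2: 'b', 3: 'c'}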
SEQ_LENGTH = 40  # use 40 characters to predict the next character
STEP_SIZE = 3
# Gathering sentences and their next character as training data
sentences = []  # features
next_characters = []  # target
for i in range(0, len(text) - SEQ_LENGTH, STEP_SIZE):
    sentences.append(text[i:i + SEQ_LENGTH])
    next_characters.append(text[i + SEQ_LENGTH])
# Converting these to NumPy arrays (one-hot encoding)
x = np.zeros((len(sentences), SEQ_LENGTH, len(characters)), dtype=np.bool_)
y = np.zeros((len(sentences), len(characters)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, character in enumerate(sentence):
        x[i, t, char_to_index[character]] = 1
    y[i, char_to_index[next_characters[i]]] = 1
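# Optional sanity check (an illustrative addition, not part of the original
# assignment): each example is a SEQ_LENGTH x vocabulary one-hot matrix, and
# each target is a one-hot vector over the vocabulary.
print(x.shape)  # (number of sentences, SEQ_LENGTH, len(characters))
print(y.shape)  # (number of sentences, len(characters))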
# Building the neural network
model = Sequential()
model.add(Input(shape=(SEQ_LENGTH, len(characters))))  # corrected input shape usage
model.add(LSTM(128))  # 128 neurons
model.add(Dense(len(characters)))
model.add(Activation('softmax'))  # scales output so that all values add up to 1
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(learning_rate=0.01))
model.fit(x, y, batch_size=256, epochs=4)
model.save('textgenerator.keras')
model = tf.keras.models.load_model('textgenerator.keras')
# Helper function: sample an index from the predicted probability distribution,
# rescaled by the temperature parameter
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
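# Toy demonstration (not in the original assignment) of how temperature reshapes
# the distribution before sampling: low temperature sharpens it toward the most
# likely character, while temperature 1.0 leaves it unchanged.
toy_preds = np.array([0.6, 0.3, 0.1])
for T in (0.2, 1.0):
    scaled = np.exp(np.log(toy_preds) / T)
    scaled /= scaled.sum()
    print(T, np.round(scaled, 3))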
# Text generation function: start from a random seed and repeatedly predict the next character
def generate_text(length, temperature):
    start_index = random.randint(0, len(text) - SEQ_LENGTH - 1)
    generated = ''
    sentence = text[start_index:start_index + SEQ_LENGTH]
    generated += sentence
    for i in range(length):
        x = np.zeros((1, SEQ_LENGTH, len(characters)))
        for t, character in enumerate(sentence):
            x[0, t, char_to_index[character]] = 1
        predictions = model.predict(x, verbose=0)[0]
        next_index = sample(predictions, temperature)
        next_character = index_to_char[next_index]
        generated += next_character
        sentence = sentence[1:] + next_character
    return generated
print("Temperature = 0.2")
print(generate_text(300,0.2))
print("Temperature = 0.4")
print(generate_text(300,0.4))
print("Temperature = 0.6")
print(generate_text(300,0.6))
print("Temperature = 0.5")
print(generate_text(300,0.5))
print("Temperature = 0.8")
print(generate_text(300,0.8))
print("Temperature = 1.0")
print(generate_text(300,1.0))
Temperature = 0.2
chrf limitations while automatic character to the sentence of the sequence of the
entire context the tasks are the hidden layer of the context to a sentence to the
same sentence of the input to the sentence in the sentence in the same of the
context to the entire sequence of the sentence of the language models the most
and the model and t
Temperature = 0.4
ng the best hidden state sequence for the context and presented the entire well
are the entire to the most and context and the posse value the vector of the
previous of the answer the encoder to the model to compute language model and are
the next the most the context to a bit the context and the and algorithm the most
and the but context
Temperature = 0.5
aking use of gated units can be unrolled as and reculling are accoward to assign
of the prediction of different in the because the some probability of the sample
with the loss in the sentence in used on the document we called a sentence of the
context of the representation of the input span between the sentence of the
system in for each w
Temperature = 0.6
e hidden state at time t represents information for and more becaully aroter
based hundon low the language models of use the context the input to the states
is of with edf text to the will from the same a more on the word for beam vector
that of the squrcheided for each p as a of the input to also partofspeech context
to be the cat obe we
Temperature = 0.8
ve might provide improved performance on the sentence of is the semant
predictionar each smhated of auts is document fact and obe de showoghauding a
weige pech for the sparse language mode samplifical for a probability derivation
algorithm for sambiged seed the mpeses lisagual and earlier mt a final during
and distraind to english trans
Temperature = 1.0
pect to each of the parameters in logisty is gebectives differention r a mative
of recogpreems sime of the recogn the sentence and useful urars fuly for ondigna
i to the input lext chunget occurs are hipter contto assolinglt functions and l n
d
i ranger approach fulter chante of the lers of language more e uny x figure
probability he