Text and Speech Analytics Theory Digital Assignment

Abstract:
In the realm of natural language processing, text generation stands as a pivotal challenge, especially when the goal is contextually coherent, extended textual content. This project addresses the difficulties of automated text generation, which is central to improving user interactions in applications ranging from chatbots to creative writing aids. The proposed solution uses a Long Short-Term Memory (LSTM) network, known for its ability to capture long-term dependencies in sequential data, to generate text character by character. The model consists of 128 LSTM units followed by a dense output layer with a softmax activation, which predicts the probability distribution of the next character. Training on a subset of text allows the model to learn character-level language patterns and to generate text with varying degrees of randomness, controlled by a temperature parameter during generation; a short numerical illustration of this temperature effect appears after the applications list below. Initial results show that the model can produce locally coherent text whose character varies with temperature, demonstrated by samples generated at temperatures from 0.2 to 1.0. This flexibility highlights the model's potential to adapt to different stylistic requirements of text-based applications.

Text Generation in Academic Environments:

Research and Writing Assistance: Text generation tools can assist in drafting academic papers, reports, and literature reviews by suggesting sentence completions, paraphrasing ideas, or even generating entire sections of content from initial inputs. This can streamline the writing process and improve productivity.

Educational Material Production: For educators, generating customized content for courses, quizzes, and explanatory notes becomes easier and faster. These tools can adapt content to different learning levels and styles, enhancing educational accessibility and personalization.

Language Learning Tools: Text generation can create practice exercises for language learning, such as sentence formation, grammar tests, and essay questions. It can provide instant feedback to students, enabling more effective and interactive learning experiences.

Simulation of Historical Texts: In the study of literature or history, text generators can simulate the writing styles of historical periods or specific authors. This can serve educational purposes, such as understanding the linguistic and stylistic elements characteristic of different eras or writers.

Data Augmentation in Research: In computational linguistics and data science, generating synthetic text can augment datasets where textual data is scarce or imbalanced. This is useful for training more robust models for language understanding tasks.

Academic Integrity and Plagiarism Education: Text generation models can demonstrate examples of plagiarism, or show how to avoid it by paraphrasing content correctly. They can also support the development of better plagiarism detection tools by producing diverse examples of text manipulation.

Enhancing Accessibility: Generated text can summarize long academic texts, making them more accessible to students with disabilities or to those who face challenges with large volumes of reading material.

Conference and Seminar Management: Text generators can assist in creating abstracts, notifications, and promotional materials for academic events, streamlining the administrative workload and improving communication effectiveness.
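Since the abstract leans on the temperature parameter, a minimal numerical sketch may help to show how temperature reshapes a predicted distribution before sampling. The probabilities below are toy values, not actual model output; the reweighting is the same one performed by the sample() helper in the code that follows.

import numpy as np

def apply_temperature(probs, temperature):
    # Rescale a probability distribution: T < 1 sharpens it toward the
    # most likely outcome, T > 1 flattens it toward uniform.
    logits = np.log(np.asarray(probs, dtype=np.float64))
    scaled = np.exp(logits / temperature)
    return scaled / scaled.sum()

p = [0.6, 0.3, 0.1]                      # toy next-character distribution
print(apply_temperature(p, 0.2))         # ~[0.97, 0.03, 0.00]: near-greedy
print(apply_temperature(p, 1.0))         # unchanged: samples raw distribution

At temperature 0.2 the model almost always picks its top prediction, which explains the repetitive but safe output shown later; at 1.0 it samples from the unmodified distribution, trading predictability for variety.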
Python Code:

import random
import re

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation, Input
from tensorflow.keras.optimizers import RMSprop

# Importing the corpus
filepath = r'D:\AK\ed3book.txt'

def clean_text(text):
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    # Remove numbers (\d+ matches one or more digits)
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    return text.lower()

# Read and clean the text (assuming filepath is set correctly)
with open(filepath, 'r', encoding='utf-8') as f:
    text = f.read()
text = clean_text(text)
text = text[300000:800000]  # work on a 500k-character slice

# Set of all characters appearing in the selected text
characters = sorted(set(text))

# Character-to-index and index-to-character lookups
char_to_index = {c: i for i, c in enumerate(characters)}
index_to_char = {i: c for i, c in enumerate(characters)}

SEQ_LENGTH = 40  # use 40 characters to predict the next character
STEP_SIZE = 3

# Gathering sequences (features) and their next character (target)
sentences = []
next_characters = []
for i in range(0, len(text) - SEQ_LENGTH, STEP_SIZE):
    sentences.append(text[i:i + SEQ_LENGTH])
    next_characters.append(text[i + SEQ_LENGTH])

# One-hot encoding the training data into NumPy arrays
x = np.zeros((len(sentences), SEQ_LENGTH, len(characters)), dtype=np.bool_)
y = np.zeros((len(sentences), len(characters)), dtype=np.bool_)
for i, sentence in enumerate(sentences):
    for t, character in enumerate(sentence):
        x[i, t, char_to_index[character]] = 1
    y[i, char_to_index[next_characters[i]]] = 1

# Building the neural network
model = Sequential()
model.add(Input(shape=(SEQ_LENGTH, len(characters))))
model.add(LSTM(128))                 # 128 LSTM units
model.add(Dense(len(characters)))
model.add(Activation('softmax'))     # scales outputs so they sum to 1

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(learning_rate=0.01))
model.fit(x, y, batch_size=256, epochs=4)

model.save('textgenerator.keras')
model = tf.keras.models.load_model('textgenerator.keras')

# Helper function: sample an index from a probability array,
# reweighted by temperature
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# Text generation: seed with a random corpus slice, then predict
# one character at a time
def generate_text(length, temperature):
    start_index = random.randint(0, len(text) - SEQ_LENGTH - 1)
    sentence = text[start_index:start_index + SEQ_LENGTH]
    generated = sentence
    for _ in range(length):
        x_pred = np.zeros((1, SEQ_LENGTH, len(characters)))
        for t, character in enumerate(sentence):
            x_pred[0, t, char_to_index[character]] = 1
        predictions = model.predict(x_pred, verbose=0)[0]
        next_character = index_to_char[sample(predictions, temperature)]
        generated += next_character
        sentence = sentence[1:] + next_character
    return generated

for temperature in (0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    print(f"Temperature = {temperature}")
    print(generate_text(300, temperature))
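Note that generate_text seeds generation from a random slice of the training corpus. For interactive use, a variant that accepts a caller-supplied seed can be sketched as below. generate_from_seed is a hypothetical helper, not part of the script above; it reuses model, characters, char_to_index, index_to_char, SEQ_LENGTH, and sample() from the script, drops characters absent from the training vocabulary, and pads with spaces (which the cleaned corpus is assumed to contain).

def generate_from_seed(seed, length, temperature):
    # Hypothetical variant of generate_text with a user-supplied seed.
    sentence = ''.join(c for c in seed.lower() if c in char_to_index)
    sentence = sentence[-SEQ_LENGTH:].rjust(SEQ_LENGTH)  # pad/trim to window
    generated = sentence
    for _ in range(length):
        x_pred = np.zeros((1, SEQ_LENGTH, len(characters)))
        for t, character in enumerate(sentence):
            x_pred[0, t, char_to_index[character]] = 1
        predictions = model.predict(x_pred, verbose=0)[0]
        next_character = index_to_char[sample(predictions, temperature)]
        generated += next_character
        sentence = sentence[1:] + next_character
    return generated

print(generate_from_seed("the hidden state of the network", 200, 0.5))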
Sample Output:

Temperature = 0.2
chrf limitations while automatic character to the sentence of the sequence of the entire context the tasks are the hidden layer of the context to a sentence to the same sentence of the input to the sentence in the sentence in the same of the context to the entire sequence of the sentence of the language models the most and the model and t

Temperature = 0.4
ng the best hidden state sequence for the context and presented the entire well are the entire to the most and context and the posse value the vector of the previous of the answer the encoder to the model to compute language model and are the next the most the context to a bit the context and the and algorithm the most and the but context

Temperature = 0.5
aking use of gated units can be unrolled as and reculling are accoward to assign of the prediction of different in the because the some probability of the sample with the loss in the sentence in used on the document we called a sentence of the context of the representation of the input span between the sentence of the system in for each w

Temperature = 0.6
e hidden state at time t represents information for and more becaully aroter based hundon low the language models of use the context the input to the states is of with edf text to the will from the same a more on the word for beam vector that of the squrcheided for each p as a of the input to also partofspeech context to be the cat obe we

Temperature = 0.8
ve might provide improved performance on the sentence of is the semant predictionar each smhated of auts is document fact and obe de showoghauding a weige pech for the sparse language mode samplifical for a probability derivation algorithm for sambiged seed the mpeses lisagual and earlier mt a final during and distraind to english trans

Temperature = 1.0
pect to each of the parameters in logisty is gebectives differention r a mative of recogpreems sime of the recogn the sentence and useful urars fuly for ondigna i to the input lext chunget occurs are hipter contto assolinglt functions and l n d i ranger approach fulter chante of the lers of language more e uny x figure probability he

As the samples illustrate, low temperatures produce repetitive but more predictable phrasing built from frequent corpus patterns, while higher temperatures introduce more variety at the cost of coherence and spelling.
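One way to complement these qualitative samples with a number is per-character perplexity, exp of the cross-entropy loss. A minimal sketch follows, reusing the x and y arrays built above; note this is illustrative only, since the original script trains on all windows, so the tail slice below is not a true held-out set (a real evaluation would reserve it before fitting).

split = int(0.9 * len(x))  # last 10% of windows as a pseudo-validation set
val_loss = model.evaluate(x[split:], y[split:], verbose=0)
print("cross-entropy per character:", val_loss)
print("per-character perplexity:", np.exp(val_loss))

Lower perplexity indicates the model assigns higher probability to the actual next characters, giving a single score to track across training epochs or model variants.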