{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Minería de texto: Modelo de tópicos LDA\n",
    "En esta clase vamos a ver el modelo de tópicos __Latent Dirichlet Allocation (LDA)__.\n",
    "\n",
    "Cubriremos los siguientes temas:\n",
    "* Tratamiento de textos\n",
    "* Preprocesamiento de textos\n",
    "* WordClouds\n",
    "* Document-Term Matrix\n",
    "* LDA\n",
    "* Interpretación de resultados\n",
    "    * Palabras más importantes\n",
    "    * Distribución de textos por tópicos\n",
    "* Selección de modelo\n",
    "* Visualización (LDAvis)\n",
    "\n",
    "Para estudiar LDA, a parte del paper original, les recomendamos el siguiente video: https://www.youtube.com/watch?v=3mHy4OSyRf0&t=6s\n",
    "\n",
    "La base de datos que trabajaremos será una muestra de noticias de [Reuters](https://www.reuters.com/) obtenida a través de webscrapping por Germán González.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd \n",
    "import re # El paquete para tratar texto. Expresiones regulares\n",
    "from sklearn.feature_extraction.text import CountVectorizer # Vectorizador de palabras y DTM\n",
    "from sklearn.decomposition import LatentDirichletAllocation # Modelo de LDA\n",
    "from scipy.sparse import csr_matrix # Para tratar Sparse Matrix\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data=pd.read_csv('reuters.csv') # CArgo los datos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.Noticia.iloc[25] # Exploro una noticia"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocesamiento\n",
    "* Tokenizar: Separar el texto en párrafos, frases, etc...\n",
    "* Limpieza: Minúsculas, quito puntuación, remuevo palabras de 3 caracteres.\n",
    "* Stopwords\n",
    "* Lematizar: cambio de tiempos verbales\n",
    "* Stemmed: enviar palabras a sus raíces"
   ]
  },
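  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch with NLTK (an assumed extra dependency, not otherwise used in this notebook) of what lemmatization and stemming each do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only, assuming NLTK is installed (pip install nltk)\n",
    "import nltk\n",
    "nltk.download('wordnet', quiet=True) # Resource needed by the WordNet lemmatizer\n",
    "from nltk.stem import WordNetLemmatizer, PorterStemmer\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "stemmer = PorterStemmer()\n",
    "print(lemmatizer.lemmatize('companies')) # 'company': a real dictionary form\n",
    "print(stemmer.stem('trading')) # 'trade': stems are truncated and may not be real words"
   ]
  },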
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Limpieza básica"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(len(data.Noticia)):\n",
    "    data.Noticia.iloc[i]=re.sub('[,\\.!?\\-!?\\n\\)\\(]', '',data.Noticia.iloc[i]) # Borro Puntuaciones\n",
    "    data.Noticia.iloc[i]=re.sub('[0-9]', '',data.Noticia.iloc[i])\n",
    "    data.Noticia.iloc[i]=re.sub('reuters', '',data.Noticia.iloc[i])\n",
    "    data.Noticia.iloc[i]=re.sub('said', '',data.Noticia.iloc[i])\n",
    "data.Noticia=data.Noticia.str.lower() # Convierto minúsculas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.Noticia.iloc[25] # Volvemos a ver la misma noticia"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ahora construiremos la matriz término-documento\n",
    "n_vocab=1500 # máximo tamaño de vocabulario\n",
    "tf_vectorizer = CountVectorizer(max_df=0.8, min_df=2, max_features=n_vocab, stop_words='english') # Al igual que un modelo, defino el objeto que construirá la matriz\n",
    "tf = tf_vectorizer.fit_transform(data.Noticia) # Aplico el objeto a un conjunto de textos\n",
    "tf_feature_names = tf_vectorizer.get_feature_names() # Veo el vocabulario"
   ]
  },
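  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see exactly what `CountVectorizer` produces, here is a minimal sketch on a made-up three-document corpus (purely illustrative, not part of the Reuters data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy document-term matrix\n",
    "toy_corpus = ['the cat sat', 'the cat ran', 'dogs ran fast']\n",
    "toy_vec = CountVectorizer()\n",
    "toy_tf = toy_vec.fit_transform(toy_corpus)\n",
    "print(toy_vec.get_feature_names()) # Vocabulary, alphabetically sorted\n",
    "print(toy_tf.toarray()) # One row per document, one column per term; entries are counts"
   ]
  },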
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TF_detallada=pd.DataFrame(csr_matrix(tf).todense(), columns=tf_feature_names) # Vuelvo de sparse a densa para explorarla\n",
    "TF_detallada.head() #Veo las primeras 5 filas\n",
    "print(TF_detallada.shape) # Veo las dimensiones, a qué corresponden?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TF_detallada.head() # Exploramos la matriz término-documento"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.Noticia.iloc[0] # Cuántas veces aparece years?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ¿Que tal si estudiamos las frecuencias de las palabras?\n",
    "frecuencias=pd.DataFrame(TF_detallada.sum(), index=tf_feature_names, columns=['Freq'])\n",
    "frecuencias.sort_values(by=['Freq'], ascending=False, inplace=True)\n",
    "frecuencias.head(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "frecuencias.head(30).plot(kind='bar', figsize=(12,6))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install wordcloud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from wordcloud import WordCloud #importo la función"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exploremos los stopwords\n",
    "tf_vectorizer.get_stop_words()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cloud=WordCloud(background_color='white', width=700, height=700, max_words=100, max_font_size=300, stopwords=tf_vectorizer.get_stop_words(), colormap='Reds',random_state=23) # Construyo el generador de la nube\n",
    "cloud.generate('.'.join(list(data.Noticia))) # Genero la nube\n",
    "cloud.to_image() # Despliego la imagen de la nube"
   ]
  },
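  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we already computed term frequencies from the DTM, an alternative sketch is to build the cloud directly from them with `generate_from_frequencies`, skipping re-tokenization of the raw text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Alternative: feed the DTM term counts straight into the cloud\n",
    "cloud_freq = WordCloud(background_color='white', width=700, height=700, max_words=100, colormap='Reds', random_state=23)\n",
    "cloud_freq.generate_from_frequencies(frecuencias['Freq'].to_dict()) # dict of term -> count\n",
    "cloud_freq.to_image()"
   ]
  },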
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modelo LDA"
   ]
  },
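  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick recap of the generative model, to connect the code to the theory: each topic $k$ has a word distribution $\\beta_k \\sim \\mathrm{Dir}(\\eta)$, and each document $d$ has topic proportions $\\theta_d \\sim \\mathrm{Dir}(\\alpha)$. Each word of $d$ is generated by first drawing a topic $z \\sim \\mathrm{Multinomial}(\\theta_d)$ and then a word $w \\sim \\mathrm{Multinomial}(\\beta_z)$. In scikit-learn, `doc_topic_prior` is $\\alpha$ and `topic_word_prior` is $\\eta$."
   ]
  },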
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_topics=10 # Cuántos tópicos deseo\n",
    "lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10,doc_topic_prior=0.1, topic_word_prior=0.1, n_jobs=-1,random_state=23, verbose=1) # Construyo el objeto que es el modelo\n",
    "lda.fit(tf) # Estimo el LDA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(lda.components_.shape) # De que tma~no es el resultado?\n",
    "lda.components_ # Exploremos el resultado"
   ]
  },
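  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`components_` holds unnormalized pseudo-counts of words per topic. To read each row as a probability distribution over the vocabulary ($\\beta_k$), normalize it; a small sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalize each topic's row so it sums to 1: the topic-word distribution\n",
    "topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]\n",
    "print(topic_word.sum(axis=1)) # Each row now sums to 1"
   ]
  },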
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Construyo la función que me ayuda a ver las palabras más importantes de cada tópico\n",
    "def print_topics(model, count_vectorizer, n_top_words):\n",
    "    words = count_vectorizer.get_feature_names()\n",
    "    for topic_idx, topic in enumerate(model.components_):\n",
    "        print(\"\\nTopic #%d:\" % topic_idx)\n",
    "        print(\" \".join([words[i]\n",
    "                        for i in topic.argsort()[:-n_top_words - 1:-1]]))"
   ]
  },
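  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The slice `argsort()[:-n_top_words - 1:-1]` is easy to misread, so here is a tiny check on a made-up array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# argsort returns positions from smallest to largest; the reversed slice picks the top ones\n",
    "arr = np.array([0.1, 0.7, 0.3, 0.9])\n",
    "print(arr.argsort()) # [0 2 1 3]: positions sorted by value, ascending\n",
    "print(arr.argsort()[:-3:-1]) # [3 1]: positions of the 2 largest values, descending"
   ]
  },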
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print_topics(lda, tf_vectorizer, 15) # Veo las 15 palabras más importantes de cada tópico"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Como se ven los documentos?\n",
    "lda_output=lda.transform(tf) # transformo la matrix de término-documento en tópico-documento\n",
    "print(lda_output.shape) # Qué indican las dimensiones?\n",
    "docs=['doc'+str(i) for i in range(lda_output.shape[0])] # Nombres de filas\n",
    "topics=['topics'+str(i) for i in range(lda_output.shape[1])] # Nombres de columnas\n",
    "lda_output=pd.DataFrame(lda_output, index=docs, columns=topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exploremos la salida desde el punto de vista de documentos\n",
    "lda_output.head().sum(axis=1) # Porque las filas suman 1?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cómo se distribuye el documento promedio?\n",
    "lda_output.head().mean(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creemos la pertenencia al tópicos\n",
    "topico_dominante = np.argmax(lda_output.values, axis=1) \n",
    "lda_output['Topico_dominante']=topico_dominante\n",
    "lda_output.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lda_output.Topico_dominante.hist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selección de modelo\n",
    "Al ser análisis no supervisado no es nada fácil escoger el mejor modelo, y es aún más retador cuando es texto. Tenemos una aproximación, la máxima verosimilitud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time\n",
    "# Juguemos con un hiper parámetro\n",
    "likelihood=[]\n",
    "values=[i for i in range(2,21)]\n",
    "for i in values:\n",
    "    modelo = LatentDirichletAllocation(n_components=i, max_iter=10,doc_topic_prior=0.1, topic_word_prior=0.1, n_jobs=-1,random_state=23) # Construyo el objeto que es el modelo\n",
    "    modelo.fit(tf)\n",
    "    likelihood.append(modelo.score(tf))\n",
    "    print(i)"
   ]
  },
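  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another common diagnostic is perplexity (lower is better), which scikit-learn provides out of the box; a quick check on the 10-topic model fitted above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perplexity of the 10-topic model on the training DTM\n",
    "print(lda.perplexity(tf))"
   ]
  },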
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualizamos\n",
    "plt.figure(figsize=(6,6))\n",
    "plt.plot(values, likelihood)\n",
    "plt.xlabel('Número de tópicos')\n",
    "plt.ylabel('log-likelihood')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualización del LDA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pyLDAvis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyLDAvis # Paquete que crea la visualización\n",
    "from pyLDAvis import sklearn as sklearnlda\n",
    "import pickle # Paquete para manejar .pkl\n",
    "import os # paquete para navegar por el pc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LDAvis_prepared=sklearnlda.prepare(lda, tf, tf_vectorizer ) # Preparo el modelo y sus resultados para la visualización\n",
    "pyLDAvis.save_html(LDAvis_prepared, 'LDA.html') # Guardo la visualización como html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pyLDAvis.display(LDAvis_prepared) # Lo visualizo dentro del notebook"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
