Good evening, friends. We are going to start this session, and the topic for today is text mining. Text documents are the main input for qualitative research, so today we will study the basics of text mining. We will use the software Orange for the exercise, and later on we will do the same exercise in NVivo, but first of all we will cover the Orange part.

Whenever you collect data, it comes in the form of PDF files, Word files, and text files, because all these files are readable; sometimes you have the text in an Excel file, and that can also be imported. Even research papers are considered text data, and an interview transcript is also text data. So PDF files, news items, text files: all are considered text data. We do some operations on this text data: we need to clean it, and then we can do the analysis on the text. There are many algorithms available for text data, so we will see some of the algorithms in Orange and some in NVivo.

Text analytics is the basic and most important part of qualitative research. We believe that words are important, and there are many algorithms which work naturally on words; there is a complete science called NLP, natural language processing. It is a big area, and in it we examine words. Some words occur near each other: whenever you say "online", it may be "online work" or "online teaching", so "online" and "teaching" come together. Also, if some word is very important, you speak it many times, so the frequency of a word is one of the important criteria for deciding whether the word is important or not. Some words always come close to each other, and some words come after other words, so we can count frequencies and do clustering on words.

Words also have another important property: sometimes we need to identify which words come before a particular word, and we are also interested to know which words come after it. For example, if the word is "success", what are the different words which come before it in the sentence, and what are the different words which come after it? There are many algorithms available to find the words before a particular word and the words which come after it.

Another aspect of words is that some words are positive in nature, like good, best, excellent, happy; the sense of these words is positive. And some words we use in a negative sense, like bad. If we divide our dictionary into three categories, positive words, negative words, and neutral words, then we can perform sentiment analysis on the words. In many softwares, like NVivo, even the positive words can be divided into two categories, highly positive and moderately positive, and similarly the negative words into highly negative and moderately negative. So the dictionary is divided into categories of words, and the different algorithms we use count how many words are highly positive, how many are moderately positive, how many are negative, and on the basis of these counts we can develop a sentiment index. This is also part of text mining.
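To make the idea concrete, here is a minimal sketch of lexicon-based sentiment counting in Python. The word lists are tiny toy examples, not the actual dictionary NVivo or Orange uses, and the index shown is just one simple way of combining the counts.

```python
# Toy sentiment lexicons (illustrative only, not a real software dictionary)
positive = {"good", "best", "excellent", "happy"}
negative = {"bad", "poor", "worst", "unhappy"}

def sentiment_counts(text):
    """Count positive, negative and neutral words in a text."""
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return pos, neg, len(words) - pos - neg

pos, neg, neu = sentiment_counts(
    "the best teacher gives good and excellent lectures never bad ones")
# One simple sentiment index: ratio of positive to negative words
index = pos / neg if neg else float("inf")
print(pos, neg, neu, index)   # 3 1 7 3.0
```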
A word cloud is a graph which represents the different important words along with their frequency: the most frequent words appear larger and the less frequent words appear smaller. We can also look at cluster analysis and correspondence analysis, so you can see there are different methods that we can apply to text, and we will do all these methods.

Okay, now in this slide there are two word clouds: the first word cloud is made from the raw data and the second from the cleaned data. If I ask you to differentiate the two graphs, look carefully and you can see the word "the". This word does not carry any meaning here, and similarly there are many words which do not have any meaning; you can see the dot, "to", "of", "on". Some of the words are not important at all, and we are not using these words, so there are many things which we actually do not require in the analysis. Cleaning is required. Sometimes we have numbers, like 19, and such a number is actually noise in the text data. So when we get the raw data we need to clean it, and there are many techniques available for data cleaning. The question is: what are the different methods that we have to apply for cleaning text data? Let us see.

Okay, so now you can see some of the methods which are available for cleaning text. Number one, digits: you cannot apply mathematics to alphabets, you cannot say a plus a equals b, so when you have numeric data the alphabets are not required, and similarly, when you have text data, the numbers are not required. First of all we have to remove the numbers from the text data. Second, there are many special characters, like the dollar sign and the at sign (@); these special characters are also not required, and we need to remove them. Punctuation, meaning full stops, commas, and inverted commas, is also not required; all these things are considered noise in the text data. Then we have to convert all the words either to lower case or to upper case, because normally when we write, some letters come in capitals and some in small letters, so we have to convert every word to one case. We also have the problem of white space: when you delete some words, white space is left between the remaining words, so you need to remove the extra white space. Finally, we have to deal with stop words, stemming, and the bag of words, so let me explain these concepts one by one.
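Before we come to stop words, here is a minimal sketch of these basic cleaning steps in Python, using only the standard library. The stop word list here is a tiny illustrative one; in practice you would load your own list from a file.

```python
import re

# A tiny illustrative stop word list; in practice, load your own file
STOP_WORDS = {"the", "a", "an", "to", "of", "on", "is", "and", "was", "in"}

def clean(text):
    text = text.lower()                       # lower case
    text = re.sub(r"\d+", " ", text)          # remove digits like 19
    text = re.sub(r"[^a-z\s]", " ", text)     # remove special characters and punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse the leftover white space
    return [w for w in text.split() if w not in STOP_WORDS]

print(clean("In 2019, online teaching was good, the BEST option!"))
# ['online', 'teaching', 'good', 'best', 'option']
```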
So what is the concept of stop words? This is one of the most important concepts in NLP and text mining, so let us try to understand it. There are many words in an English sentence which do not carry any meaning, like "to" and "from"; all the words you can see in this slide are basically stop words. These words do not have any meaning, so we need to remove them. Almost every software gives you the facility to declare these stop words and remove them; we have this facility in Orange and we have this facility in NVivo. So first of all you need to declare the stop words and then remove them. Remember that stop words are a major source of noise which we have to remove, so stop words need to be taken care of.

Sometimes, suppose somebody is giving a lecture in English; normally there are many words which are originally in Hindi but which we use while speaking English, like "bhai", which is a Hindi word. If you are interviewing an office person and this word comes up, you need to declare it as a stop word, because it does not carry any meaning. Similarly, in a PM speech we normally hear words like "bhaiyo" and "bahno" (brothers and sisters), so we have to declare these words as stop words too. Stop words are thus one of the main ways of removing noise in text mining.

The next thing is the concept of stemming. If you look at a tree, the tree has a stem and branches; similarly, in the English language we have this concept of stemming. For example, "consult" is a stem, and by adding endings it becomes consultant, consulting, consultative. Now the question is: can we say all these words are similar to "consult", or are they all different? NVivo always asks you whether you want to consider all these words the same or keep them as different. For example, in one of our studies we were doing research on online education, and we had words like education, educator, educating, educate. The question was whether we could consider all these words the same or treat them as different. In that study we declared that all these words are different, because an educator is a person, education is a different thing, educating is different, and online education is different again. So sometimes the words are different, and sometimes the words are the same: for example, connect, connected, connection, connections all represent the same thing. So sometimes we need to declare that all these words are the same, and sometimes we need to declare that they are different.

Similar to stemming there is another issue, which is called synonymy. Sometimes words have synonyms, and some softwares also give us the facility to declare synonyms as the same word. Take "talk" and "speak": somebody is speaking, somebody is talking, so talking and speaking are synonyms, and you have the facility to declare that these two words are the same. While doing text mining we have to keep all these things in mind: whether the words share a stem, are different words, or are synonymous words.
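Here is a minimal stemming sketch, assuming the NLTK package is installed (pip install nltk). A stemmer cuts words back to a common stem, which is exactly the choice the software offers when it asks whether connect, connected, and connection should be treated as one word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["connect", "connected", "connection", "connections", "connecting"]:
    print(word, "->", stemmer.stem(word))   # all reduce to the stem "connect"

# The flip side: a stemmer would also merge these, which is why, in the
# online-education study mentioned above, we chose to keep them separate.
for word in ["education", "educator", "educating"]:
    print(word, "->", stemmer.stem(word))   # all reduce to one common stem
```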
After removing all these problems, we are left with some important words, and these important words are known as the bag of words. Once we remove the stop words, the digits, the special characters, and the punctuation, whatever is left are the important words, and this set is the bag of words. "Bag of words" is a technical term which we use in different softwares; it represents all the important words after cleaning the data. After the bag of words you can apply the algorithms. There are many algorithms, which we will learn, and all of them should be applied after the cleaning process; if you apply the algorithms before cleaning, you will not get correct results. Therefore cleaning is very important for text data.

Okay, now this is the program which we are going to learn today; for your ease I have put the diagram in the PPT, because you need to practice it, and it will give you results like the word cloud. This one is called a word tree; it is an NVivo output, but I will show you how we can do it in Orange also. The word tree is an output where we identify an important word, a keyword, like "successful", and it shows the five words which come before this word and the five different words which come after it. This technique is very useful for doing a literature review, because whenever you are doing a literature review you have some theme in your mind, so you have some important keywords, especially when you are talking about constructs and items. If you type the construct name, you can identify the different words which come before it and the words which come after it. So this technique is very useful in a systematic literature review or a thematic review; we will see how it can be done in Orange today. And then this is a mind map, which we can make in NVivo.

So after covering the basics, the rest of the work I will do in the Orange software, and I hope you will like it. I have already opened Orange, and I will give you the data in a data folder; let me show you the data on which we will work today. First we have this data set, and from it you can understand in what format we can keep the data. This is the original data set, and you can also do prediction on it, but prediction we will do another day; today I am showing you how to keep the data. There are five folders here: the first folder has information about business, and the others have information about entertainment, food, medicine, and sport. If you go into any of the folders, you can see that we have text documents; there are about 50 text documents per folder in this data set.

The other data set I am showing you is the transcripts. These are interview transcripts, and all interview transcripts are considered text data. When you open one, you can see that we have question and answer, question and answer, and this is the format you have to maintain: this is the question and this is the answer. Then we have one PDF file, a research paper, so we can also take PDF files for text mining; those who are doing a literature review can load all their files into the software and perform the analysis. The only difference is that if you run it on large data it will take more time; otherwise there is no problem, everything is the same. Whatever sequence I show you, whatever model I form, you can also run on 155,000 files; the only difference is that it will take time. So we have three kinds of text data files: a text data set, a Word file, and a PDF, and all three are considered text data. I will share them on Google Drive so you can perform the analysis later on.
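As a side note, here is a minimal Python sketch of what importing a folder of documents amounts to, similar in spirit to what Orange's Import Documents widget does for us. The folder path is hypothetical; the point is that sub-folder names act as category labels.

```python
from pathlib import Path

corpus = []
for path in Path("data/text_dataset").rglob("*.txt"):   # hypothetical folder
    category = path.parent.name        # e.g. business, entertainment, sport
    text = path.read_text(encoding="utf-8", errors="ignore")
    corpus.append((category, path.name, text))

categories = {cat for cat, _, _ in corpus}
print(len(corpus), "documents in", len(categories), "categories")
```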
And one more thing: this is a stop words file, because whenever you are doing text mining you must have such a file. If I open it, you can see that there are many words written here, and slowly you can improve this file. Whatever words you mention here are considered stop words, and you can attach this file in the analysis to remove them. Such a stop words text file is also available on Google, because many softwares provide stop word lists; this particular list I created just for the exercise.

Okay, so this is my text data, and now you understand that text data comes in three formats: text file, Word file, and PDF file. Even if you are working on news, most news can be put into a Word file or a PDF file, and if you are working on research papers, most of them are PDF files; all of this is considered text data. Now, first of all we need to import this data. I go to the Text Mining section and take this widget, Import Documents. Click on it, then click on this folder icon, and let us say I select the interview transcripts. Normally all the files should be kept in one folder, because we have to declare the folder; this is the default folder on which we are going to work. So I am showing you the folder, then Select Folder. When you select a folder, the software imports all the files in it, and you can see that there are seven files; by default you also have options like lemmas and POS, which I will come to later. The software is telling you: seven documents. If instead I select, say, the larger data set and Select Folder, the software imports the data and tells you that there are 249 documents divided into five categories. The software reads the data automatically; why is it considered five categories? Because there are five folders containing different types of files, so if you have many folders this kind of information comes up. But I am not working on the 249 documents because it will take more time; just to save time you can select any one folder, say food or entertainment, and Select Folder, and the software says you have 50 documents. In this way we can import text data. I am coming back to the interview transcripts, because there are only seven documents and the processing will go fast.

Okay, so after taking the data, note that these small arcs on the widgets are known as channels: this one is the output channel, and Import Documents has no input channel. Let me connect it to the Corpus Viewer. Corpus Viewer is a widget which has an input channel, because it takes input from the first widget and provides output from the other side. Corpus Viewer means that if you want to see the files, you can just click on them; there are seven documents, and if you click on any document, the complete document opens here so you can read it. Name, path, and content are taken automatically by the software: the name of the file, its location, and the content to show.

Now I am making the word cloud, and this is a word cloud without any pre-processing, without any cleaning. If I click on it, let me show it to you in a bigger size: you can see that the word cloud comes, and in this word cloud there are many problems. You can see inverted commas, commas, digits, and stop words like "who", "is", "his", "and"; all the unnecessary words are coming.
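If you want to reproduce this step outside Orange, here is a minimal sketch assuming the third-party wordcloud package is installed (pip install wordcloud); the file name below is a placeholder for one of your own documents. With the stop word set left empty, you get exactly this kind of raw, uncleaned cloud.

```python
from wordcloud import WordCloud

# Hypothetical file name; use any of your own text documents
raw_text = open("transcript1.txt", encoding="utf-8").read()

wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=set())    # empty set: no cleaning, so "the", "to" dominate
wc.generate(raw_text)
wc.to_file("raw_wordcloud.png")    # saves the cloud as an image
```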
You can also see that the words are tilted, and you can adjust the degree of tilt: here we are increasing it to 60 percent, and if you take it to zero percent then all the words come out straight. So you have this facility to improve the word cloud. But because I made the word cloud without any cleaning, it contains a lot of problems and we should not use it. If you look at the words and their weights, all the words are coming: punctuation like the comma and the full stop appears hundreds of times, "the" comes 219 times, "to" comes many times, and all these words are useless. So we should not do this; we should rather go to Preprocess Text.

Pre-processing means that now I am going to clean the data, so let me place this widget here, because I will get the main content after the pre-processing. Let me explain the pre-processing. If you click on the Preprocess Text widget, the software asks: do you want to lowercase? Yes. Do you want to remove accents? Yes. If you ask me what an accent is: sometimes you see a letter like "a" with a hat on it (â), and all such marks will be removed. Parse HTML: sometimes we have HTML and URLs in the data, especially when you are collecting data from Twitter, and this removes the HTML tags and URLs.

Then comes tokenization. Tokenization splits the text into tokens and removes things like punctuation; you can choose Word Punctuation, or Whitespace, Sentence, or Tweet, which are the different options, but I suggest you go for Regexp. Regexp is based on Python regular expressions, and you can see that it removes all the stars, dollars, slashes, dashes, and dots automatically; it does not allow any of these symbols into the text. So rather than clicking on Word Punctuation or Whitespace or anything else, you should click on Regexp.

After that we have the filtering step. For filtering, the most important item is the stop words, so you have to go to this folder icon and declare the stop words file, because this is very important: click on it and open the file, and now the stop word list is loaded. Whatever is written in this file will not appear in the word cloud or in the bag of words, so for data cleaning this stop words file matters very much. Then there is the lexicon. A lexicon means the technical words which are specific to an area; for example, physics has some technical words of its own, so basically a lexicon is the dictionary of a particular subject, and this filter can keep only those words. Filter numbers: we do not want numbers, or, if you do want to include numbers, you can toggle this option. Document frequency controls in how many documents a word must appear; rather than that, we can click on Most Frequent Tokens, which sets how many important words you want to keep. You can keep a thousand words or identify the top 100 words; the number of words is your choice. And the POS tagger: it is an algorithm which identifies whether a word is a noun or a verb, and this matters because nouns and verbs are the words considered in the analysis, so you can switch on the POS tagger.

These are the steps for cleaning the data, and after doing this process, if I make the word cloud again, you can see the difference: now there are no problems, and this word cloud is far better than the previous word cloud, because the previous one had all the problems.
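Here is a minimal sketch of the same filtering pipeline done with scikit-learn rather than Orange's own code: lowercasing, a regular-expression tokenizer that keeps only alphabetic tokens, a stop word file, and a cap on the number of kept tokens. The file name and the `documents` list are placeholders for your own data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Load your own stop word file (one word per line); the name is hypothetical
with open("stopwords.txt", encoding="utf-8") as f:
    stop_words = [line.strip() for line in f if line.strip()]

vectorizer = CountVectorizer(
    lowercase=True,                 # lower case everything
    token_pattern=r"[a-zA-Z]{2,}",  # regexp tokenization: letters only, no digits
    stop_words=stop_words,          # drop the declared stop words
    max_features=100,               # keep only the 100 most frequent tokens
)
X = vectorizer.fit_transform(documents)  # `documents` is a list of strings
print(X.shape)                           # (number of documents, kept words)
```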
Now I am showing you one thing: suppose I want to remove the word "every" from this cloud. The process is to go back to the stop words file, type "every" there, and save it; saving is important. After that you come back, click on the widget, and here you can see Refresh; just refresh, and after the refresh the word "every" disappears. In this way you can remove any unnecessary word from the word cloud. Normally this takes time, because there can be many words, like "making", "need", "makes", which you may not want in the word cloud, but you can remove them all by the same process: type the word in the stop word list, refresh, and the word will not be there.

Okay, so those are the basic exercises. Now let us go to a widget called Concordance, and connect the corpus to it. Let me show you what concordance means; in the PPT I showed you a word tree, and this is the same idea. Suppose, looking at the word cloud, I want to do research on the word "success": I type s-u-c-c-e-s-s here, and let me show you the complete picture. This is the output of the concordance: "success" comes in the middle, and the question is how many words you want before it and after it. If I say I want only three words before "success" and three after, I get fragments like "have to achieve success an individual" and "have to achieve success your client", and I cannot get the meaning, so I increase the number of words. If you set the number of words to five, it means five words come before the keyword and five words after it. This is known as concordance: if you are studying a particular word, you can get all the sentences containing that word, but you may have to increase the number of context words, up to a maximum of ten. Between three and ten you can put any number, and it is believed that ten words before a particular word will usually cover the whole sentence. Now you can read the full sentences: "there are three qualities an individual must have to achieve success", and so on. Reading all these sentences will take time, but after reading them you will have everything that is written about success in all the documents. This is a fantastic algorithm which you can run in Orange. You can also do it in NVivo, but in NVivo the window is fixed at five: you can only have five words before the keyword and five after it, whereas in Orange we have the facility to choose any number between 3 and 10.
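Here is a minimal concordance sketch in Python, assuming NLTK is installed (pip install nltk, plus a one-time nltk.download("punkt") for the tokenizer); the file name is a placeholder. Like Orange's Concordance widget, it prints the keyword centred in its surrounding context.

```python
import nltk
from nltk.text import Text

raw = open("transcript1.txt", encoding="utf-8").read()  # hypothetical file
tokens = nltk.word_tokenize(raw)

# Show up to 10 occurrences of "success", each centred in ~80 characters
Text(tokens).concordance("success", width=80, lines=10)
```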
The exercise I am showing you is very helpful for a literature review or a thematic review, especially when you are talking about constructs and variables, so I hope you will like it.

Okay, so that was the concordance. After the concordance you can go further and look at the Data Table or the Corpus Viewer, but let us do some other exercises. The next thing I am showing you is the bag of words. The bag of words holds all the important words that remain after the cleaning process, so it is quite important, and from the bag of words you can again go to the word cloud. This message here is nothing, just a warning, not a problem; you will still get everything.

Now suppose I want to see which words are closer to each other and which words are far from each other. For that I apply the Distances widget. Distance means the distance between words: for example, if I say "online teaching", then "teaching" comes just after "online", but if I say "online teaching is very good", then "good" comes four words after "online". How close or how far words are from each other is the basic input for clustering, and I am interested in clustering the words. If you click on Distances, the distance can be computed by different measures, like Euclidean distance, Manhattan distance, and cosine distance; these are the different options available for the cluster analysis. Suppose I select cosine distance and send the result to Hierarchical Clustering. Clustering is a method which forms different clusters on the basis of the distances between items. If you click on it, you can see that this is one cluster and this is another cluster; because we have only seven files, only a small number of clusters is coming. Here you can increase or decrease the cut: if I set the height ratio to 65.6 percent, the software says we can form two bigger clusters, or, if you want to see three clusters, then this is cluster number one, this is cluster number two, and this is cluster number three. You can declare the number of clusters as per your choice: two, three, four, five, six. Suppose I go with three clusters. Let me put this clustering output here, and after that I want to see which files come in which cluster, so I open the Corpus Viewer and put it side by side with the clustering. If I click on one cluster, the software shows that three files are similar to each other, meaning they use similar words; if I click on another, those three files are grouped together. So the files can be divided: we are working on seven files, and the seven files fall into three clusters, cluster one, cluster two, and cluster three, based on the words they contain; on the basis of similarity we do the clustering. And if you have 249 files, you can divide them into different clusters in the same way.
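Here is a minimal sketch of this distances-plus-clustering step using SciPy, applied to a bag-of-words matrix X such as the one produced by the CountVectorizer sketch earlier. The three-cluster cut mirrors what we did in the widget.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

dist = pdist(X.toarray(), metric="cosine")       # cosine distance between files
Z = linkage(dist, method="average")              # build the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into three clusters
print(labels)   # e.g. [1 1 1 2 2 2 3] shows which file falls in which cluster
```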
Now I am showing you another thing: many scholars are interested in doing sentiment analysis, so let me show you the sentiment analysis that we can do on text documents. If you look carefully at this icon, there are two faces, one sad and one happy; it means sentiment analysis can be done on the documents, and the words can be divided into two categories, positive words and negative words. Let me connect it to the Data Table. The data table tells you that in the first file the positive words are 15 percent, the negative words are 3 percent, and the neutral words are 81 percent; similarly, in the third file the positive words are 22 percent and the negative words are only 2 percent. I have seen many papers which compare the speeches given by different leaders, finding that whenever a particular leader gives a speech, this much percent of the words are positive and this much percent are negative; you can compare the speeches of different leaders in this way. This is called sentiment analysis, and you can see the percentage of positive words, the percentage of negative words, the neutral words, and the compound score. Similarly, when you are analysing financial documents, you can identify how many of the words used by the company are negative or positive. So sentiment analysis is quite important and helpful in the analysis.

Let us say I want to save this data set. You have to select all the files first, because only whatever you select will be saved; then go to Save Data, and I am saving this into the documents folder. The software says the file already exists, overwrite it? Yes. Now I am showing you the format in which the result is saved: this is the file I saved, and if I open it, you can see it comes out like this. You can then make a graph out of it: with Insert you can make different graphs, and you can also compute an index. Basically, we convert these data into a sentiment index, which can be calculated as the percentage of positive words divided by the percentage of negative words. I have calculated this index, and it can be used as a dependent variable or an independent variable, so you can make further use of the data coming out of the sentiment analysis.
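Here is a minimal sentiment sketch using NLTK's VADER analyser (pip install nltk, plus a one-time nltk.download("vader_lexicon")). This is a stand-in, not necessarily the exact method Orange uses, but its output has the same shape as the table above: shares of positive, negative, and neutral words plus a compound score.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The course was excellent, but the pace was bad.")
print(scores)   # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# The simple sentiment index described above: positive share over negative share
index = scores["pos"] / scores["neg"] if scores["neg"] else float("inf")
print(round(index, 2))
```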
After this exercise, I am also running one more widget, Extract Keywords, so you can identify the important keywords; the processing is going on, so meanwhile I am showing you one more thing. After the pre-processing we developed a word cloud, and we can convert this word cloud into a data table. If you click on this channel, the software asks whether you need the selected words, the corpus, or the word count; sometimes we need the word count, and the word frequency comes out here.

Okay, and now you can see the most frequent words: the word "online" comes with this weight, the word "success" like this. Sometimes we need this table; this value is called the weight, and the weight means the count of the word divided by the total number of words (for example, a word appearing 20 times among 500 words has a weight of 0.04). We made this table after the bag of words; if you make it right after pre-processing you get a different output. If I instead connect the word count to the Data Table, a pure word count comes out: "online" comes 20 times, "success" comes 27 times, "product" comes 22 times, and so on. You can also choose how many words you need: let me check how many words there are; we have 100 words, and if I say I need only 25 words, the software takes the top 25 automatically, and now you see only the top 25 words.

So this is the text mining process that I have shown you, and it is just one simple exercise; we can do many more things in Orange, which I will demonstrate in the coming sessions. And now you can see the keywords: these are the keywords calculated by the software through different algorithms, here the top 16 words, each with its weight, so you can see which word occurs more compared to the others (a small TF-IDF sketch of this kind of weighting follows below). There are many options here; I just click on one of them and I get the words and their weights.

So this is the simple exercise in Orange on text mining, and I decided to demonstrate up to this point. In the coming sessions we will go to more advanced exercises, like the similarity hashing method, topic modelling, document embedding, and corpus to network; there are many advanced things which we will discuss next week. I hope that for today this is sufficient. Do the practice, come up with your queries, and then we will discuss them. That's it for today. Thank you very much; give me a minute.
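For practice, here is a minimal keyword-weighting sketch using TF-IDF in scikit-learn. TF-IDF is one of several algorithms a keyword extractor can use, so treat this as an illustration rather than Orange's exact implementation; `documents` again stands for your own list of text strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(documents)   # rows: documents, columns: words

# Average TF-IDF weight of each word across all documents, top 25 first
weights = X.mean(axis=0).A1
terms = vectorizer.get_feature_names_out()
top = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)[:25]
for term, weight in top:
    print(f"{term:15s} {weight:.4f}")
```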