About Beenish Zaidi
Expertise
I can answer questions related to Core Java (no Swing) and Spring, and to some extent I can help with problems related to JSPs and Servlets as well. I have hands-on experience with Jakarta Struts and the Struts Validator framework. Any questions within this domain are welcome. I have just started with design patterns, so questions in that category are welcome too.

Experience
2 and a half years.

Education/Credentials
Bachelors in Computer Sciences

 
   


Topic: Java



Expert: Beenish Zaidi
Date: 6/30/2008
Subject: java to read / parse /find text recursively in .doc files/folder/dir of .doc files

Question
QUESTION: Algorithm of what I want to achieve in Java with Word files, using POI or anything else that might be helpful.

I have a use case name (text).

I look up the use case (also known as UC) in the BPD doc, which is a Word doc.
I scope my search to this use case's paragraph in the BPD Word doc.

In this scoped text, I search for the string "see usecase<".

When I find it, I copy/store the text (text1) inside the angle brackets in a report.doc/txt.


Recursion starts
(same steps again)

Now I look up this text1 UC in the BPD doc,
or look it up in the other BPD docs in the directory of docs.
Open the BPD doc.
I scope my search to this text1 use case paragraph.

In this scoped text, I search for the string "see usecase<".
When I find it, I copy/store the text inside the angle brackets in
report.doc.
If not found, come out of this BPD
and go back to the original UC where we found the first instance of the
string "see usecase<".
Then look for the 2nd instance of the string "see usecase<".

And so on until all instances of the string "see usecase<" in the UC are found
and stored.




ANSWER: Hi,

Please provide a detailed explanation of the problem.
The verbal text used in the email is not enough to
describe the problem.

Thanks,
Ben

---------- FOLLOW-UP ----------

QUESTION: Sorry for the confusion, Ben. I have tried to be more elaborate here.
Hope this helps.

It would be great if you could take the time to help me make headway on the solution I am looking for, or at least clearly validate that it is in fact achievable entirely with Java-POI programming.

A clear indication from your side regarding the feasibility of Java-POI, or equivalent programming, for solving this problem would be very helpful.


Use Java to parse/read a .doc file, then search/find text in the file using recursion.

Task description is given below


Regarding "Chained Use Cases Report Generation" (the name I gave it)

i/p = a design folder
containing business process design documents

The business process design document has
•   use case diagrams  ,  
•   screenshots (images) ,
•   text (Use cases),
•   special characters


o/p = a report
which displays the tree map of use cases (aka UC) for a business process design (aka BPD)

So if I have a BPD doc having 2 UCs

The report should contain

<<BPD1 name>>
  |
  +  <<UC1 name>>   regexed from the BPD1 doc
  |    |
  |    +  <<called UC1.1 name>>   regexed from BPD1 doc
  |    |    |
  |    |    •  <<called UC 1.1.1 name>>   regexed from BPD1 or BPD<<x>> doc (see note)
  |    |
  |    +  <<called UC1.2 name>>
  |
  +  <<UC2 name>>   regexed from the BPD1 doc
       |
       •  <<called UC2.1 name>>   regexed from BPD1 doc
       +  <<called UC2.2 name>>   regexed from BPD1 doc
            |
            •  <<called UC 2.2.1 name>>   regexed from BPD1 or BPD<<x>> doc

Note: for "BPD1 or BPD<<x>> doc" we are hunting in the design folder (a tuned search would go through only the TOC of the other BPD docs in the design folder, or through the use case catalogue from these BPD docs).


Such a tree view with collapsible and expandable items needs to be rendered.

But we can skip the tree view and keep a static print of the chained and dependent use cases for starters.

Mostly I believe this report generation is a static generation unless the BPD-UC has modifications, which is when a fresh report will need to be generated.


Algorithm:

1.   Ask the user which BPD report is required
2.   Read the user i/p, which is a BPD file name
3.   Pick up the BPD file from the design folder (store the BPD name in the report object)
4.   Loop and parse the BPD for all its UCs

Loop starts:
5.   Parse the BPD doc to reach the first BPD UC = UC1 (store the UC name in the report object in the hierarchy)

Recursive program starts

6.   Parse the doc for the search string “see Use Case <xxx>”, scoped until another BPD UC is found
7.   If the search string “see Use Case <xxx>” is found => an embedded UC 1.1 exists (store the UC name in the report object in the hierarchy);
else go to step 14
8.   Search for this UC in the current BPD
9.   If found in the current BPD doc, do the recursive task
10.   If not found, search the design folder for this UC name
11.   Pick up the matching BPD file from the design folder
12.   Parse that BPD doc to reach the searched UC = UC1.1 (store the UC name in the report object in the hierarchy)
13.   Do the recursive task
14.   If another BPD UC is found => no more embedded UCs

End of recursion

15.   Parse the BPD doc from the current position (last UC encountered in the selected BPD doc)
16.   Do the recursive task

Loop ends

17.   Parse the BPD doc to cover all the UCs
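
The recursive part of the steps above (6 to 14) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the text of every use case has already been extracted from the .doc files and split per use case into a map (name -> paragraph text), and the class and method names (UseCaseChainReport, build, walk) are made up for illustration. It also adds a guard against use cases that refer back to each other, which the steps above do not mention. It needs Java 11 or later for String.repeat.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds an indented report of chained use cases.
class UseCaseChainReport {
    // Matches "see Use Case <xxx>" or "see usecase<xxx>", ignoring case.
    private static final Pattern REF =
            Pattern.compile("see\\s+use\\s*case\\s*<([^<>]+)>", Pattern.CASE_INSENSITIVE);

    // useCases: use case name -> its paragraph text (assumed already extracted).
    static List<String> build(String root, Map<String, String> useCases) {
        List<String> lines = new ArrayList<>();
        walk(root, useCases, 0, new LinkedHashSet<>(), lines);
        return lines;
    }

    private static void walk(String name, Map<String, String> useCases, int depth,
                             Set<String> path, List<String> out) {
        String indent = "  ".repeat(depth);
        if (!path.add(name)) {
            // Already on the current chain: stop, otherwise we would recurse forever.
            out.add(indent + name + " (cycle)");
            return;
        }
        String body = useCases.get(name);
        if (body == null) {
            out.add(indent + name + " (not found)");
        } else {
            out.add(indent + name);
            Matcher m = REF.matcher(body);
            while (m.find()) {
                // Each referenced use case is resolved one level deeper.
                walk(m.group(1).trim(), useCases, depth + 1, path, out);
            }
        }
        path.remove(name);
    }

    public static void main(String[] args) {
        Map<String, String> useCases = Map.of(
                "UC1", "Do something. See Use Case <UC1.1>.",
                "UC1.1", "Nothing chained here.");
        build("UC1", useCases).forEach(System.out::println);
    }
}
```

A static, indented print like this matches the "skip the tree view for starters" idea; a collapsible tree could be rendered later from the same data.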

Regex text
The text to match looks like this:
•   See Use Case <Maintain Customer>

I mean we can employ regex here to locate this relevant text.
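
As an illustration of the regex idea, here is a small sketch that pulls the names out of the angle brackets. It assumes the document text is already available as a plain String; the class name UseCaseRefFinder and the exact pattern are my own choices, not something taken from the BPD docs. The pattern is deliberately tolerant: it ignores case and accepts both "Use Case <" and "usecase<".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the names inside "See Use Case <...>" references.
class UseCaseRefFinder {
    // "see", then "use case" or "usecase", then optional spaces, then <name>.
    private static final Pattern REF =
            Pattern.compile("see\\s+use\\s*case\\s*<([^<>]+)>", Pattern.CASE_INSENSITIVE);

    static List<String> findReferences(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = REF.matcher(text);
        while (m.find()) {
            names.add(m.group(1).trim()); // group 1 is the text between < and >
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "Step 3: See Use Case <Maintain Customer>. "
                + "Later, see usecase<Create Order>.";
        System.out.println(findReferences(sample));
        // prints [Maintain Customer, Create Order]
    }
}
```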




1.   Business rules: In a particular BPD document, there is no chaining of business rules, as the final definition of the business rule is specified in another document.
2.   UC – there is chaining of UCs spanning over multiple documents

I am not sure whether all of this is achievable with Java-POI.
It might be a time-consuming activity even for a Java programmer.
Personally, I have only worked with elementary Java applications and have never done Java-POI programming.

Probably you will now understand my previous algorithm better.

Having said that, I can still do some R&D to see if we can do part of it.
Let me know your views.

If you feel it would be faster to discuss this over email, I can share my email ID with you.

Thanks
DD  

Answer
Hi dd,

Seems like a complex problem at hand.

Well, one thing is that you can't read a Word file directly
in Java the way you do with normal .txt documents. For
Microsoft-based file formats like

.doc
.xls

Apache has made a special API, known as Apache POI.
To parse a Word document, you first need to read it, and that
can be done through the Apache POI (POI-HWPF) API, which is made
specifically for reading the .doc file format. Note, though, that it is
still in the early stages of development, and you won't find many
references on POI-HWPF. Making this solution work perfectly
is not easy to implement completely in Java; you will need
to look into third-party APIs as well. For now,
I have suggested a starting point to look at. If you find something
interesting, do let me know as well.
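
To give an idea of the folder-search part, here is a rough sketch that walks a design folder and reports which .doc files mention a given use case name. It uses only the standard library; the text extraction is passed in as a function (extractText, a name I made up), so that a POI-HWPF based reader (for example, one built on its WordExtractor class and getText() method) could be plugged in later. That POI part is not shown or tested here, and the main method below just reads files as plain text as a stand-in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds .doc files under a folder whose text mentions a use case.
class DesignFolderSearch {
    static List<Path> findDocsMentioning(Path folder, String useCaseName,
                                         Function<Path, String> extractText) throws IOException {
        try (Stream<Path> paths = Files.walk(folder)) { // recurses into sub-folders
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".doc"))
                    .filter(p -> extractText.apply(p).contains(useCaseName))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: DesignFolderSearch <folder> <use case name>");
            return;
        }
        // Stand-in extractor: reads the file as plain text. A real .doc file
        // would need a proper extractor such as one based on POI-HWPF.
        Function<Path, String> plainText = p -> {
            try {
                return new String(Files.readAllBytes(p));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        findDocsMentioning(Paths.get(args[0]), args[1], plainText)
                .forEach(System.out::println);
    }
}
```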

Hope this helps.
Ben
