How to put a PDF cleanly into Word or into your TM tool using really (really!) simple skills

I have often observed that while many translators and project managers may be skilled users of a number of sophisticated software tools, they sometimes lack some really simple skills in Word, such as knowing how to find and replace tabs, paragraph markers or line breaks…

“But why would we ever want to do that?” they might ask.

In this post we look at how such simple skills can be used to solve some awkward problems. As an illustration, we’ll look at how to drop a PDF file into Word (and from there into a TM tool, if required).

So – what’s the problem with PDF files?

Many translators are dismayed when they discover that the source text is in PDF format – and for good reason. Getting it into an editable format or getting it into a TM tool is not always straightforward. A quick search on Google will turn up a variety of different “PDF converters”. Some TM software tools will also convert PDFs into an editable format. However, in my experience, it is very rare that the converted text is without some mucky problems. Many translators just give up on trying to extract text from PDFs.

Alejandro Moreno-Ramos has the best possible solution:

However, if Alejandro’s method fails (and you really want editable text), then…

This is what you can do:

If you are able to select the text in a PDF with your mouse then you will be able to copy it and paste it directly into Word (if not, you should quickly abandon all hope!). Copying and pasting the text will not transfer the document properties (e.g. margins, columns etc), but you’ll get the text with most of its formatting properties (fonts, text size, bold and italics etc):

You’ve now got editable text… But whoa! In the illustration above, you can see that there is a paragraph marker at the end of every line! The text doesn’t wrap properly in the Word document.

Hopeless?

Not at all! It’s easy enough to get rid of the paragraph markers (as we’ll see) using simple find & replace. But this would make the whole document one huge paragraph. We need to retain one critical piece of information – where the real paragraphs start and end!

We need to get rid of the surplus paragraph markers (shown in the red circles below) – but we need to keep the ones marked in blue. These mark the end of the real paragraphs.

There are a few steps involved in doing this – but they are all very simple – the only skills required are to know how to copy, paste and find & replace! Here’s how to do it:

1 Get the text from the PDF into Word

Select the text in the PDF. Copy it and paste it into Word.

(Some care needs to be taken when selecting text in a PDF. You may find that the PDF won’t allow you to select paragraphs in the correct order. You may need to copy & paste several individual sections, one at a time, to ensure you get the text flowing in the right sequence.)

2 Make sure that you have Word’s “Show/Hide” button switched to “Show”.

Toggling this button to “show” displays the document’s formatting marks (tabs, paragraph marks, picture anchors etc.) [1]. I’ve noted that many young translators (and some older ones too) try to work in Word with this button switched to “Hide”. The usual excuse is that seeing the formatting marks is aesthetically unappealing or distracting. My usual response is “Get over it!” (I don’t usually get a good reaction to that advice!) But my opinion is that working on a document with the formatting marks turned off is like groping around in a dark room – you unwittingly bump into, trip over and break things. Has your formatting ever gone unexpectedly haywire? Maybe you too like to keep this button in “Hide” mode? However, being able to see all the formatting marks helps you understand the structure of the document and lets you see if the original author has made any silly formatting mistakes. Walking blind into the client’s problems can spoil your day! Try it… It really doesn’t hurt (much)!

3 Identify where the real paragraphs end

Now, this requires a few minutes of manual work – it requires hitting the [Enter] key a few times on every page of the document to mark the end of each paragraph. If you really want editable text – it’s worth the small effort required.

Look for the spot where each paragraph ends, insert your cursor and then hit [Enter] to create an empty line. In many cases it is clear to the eye where each paragraph should end – but not always! So keep an eye on the original text. It only takes a few minutes to do this for an average-sized document.

Now you have double paragraph markers (aka “a blank line”) which indicate where the paragraphs are supposed to end:

4 Preserve these paragraph breaks

Our ultimate task is to get rid of all the surplus paragraph markers at the end of the lines. This is easy to do – we just replace them with spaces using Word’s Find & Replace function. 

But!

If we replace all the paragraph markers with spaces, we’ll lose the paragraph breaks we’ve just marked with a blank line! They’ll just turn into two consecutive spaces (there may be lots of other double spaces hiding in the document too!). So we need to temporarily mark the paragraph breaks with something else before we can get rid of the unnecessary markers at the end of each line. You can use pretty much any sequence of characters you like – you just need to be sure that whatever you use is unlikely to occur in the text. You might like to make up something like “@#$%” or some such. I always use “[para]” as a placeholder.
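Checking that a placeholder really is "unlikely to occur in the text" can be done by eye with Word's Find box, but the idea is easy to sketch in code too. The snippet below is only an illustration in Python, working on plain text rather than on a Word document; the function name `pick_placeholder` and the third candidate marker are made up for this example.

```python
def pick_placeholder(text, candidates=("[para]", "@#$%", "[[PARA-BREAK]]")):
    """Return the first candidate marker that does not already occur in the text.

    The candidates are illustrative; any sequence of characters will do,
    provided it never appears in the document itself.
    """
    for marker in candidates:
        if marker not in text:
            return marker
    raise ValueError("every candidate placeholder already occurs in the text")


# This sample already contains "[para]", so that marker would be unsafe here.
sample = "See the [para] tag in the manual.\nPrices are quoted in @ and #."
print(pick_placeholder(sample))
```

If the text already contains your usual marker, replacing it in step 6 would create paragraph breaks you never asked for, which is why it pays to check first.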

The task now is to replace all instances of two consecutive paragraph markers with the temporary placeholder. Word uses the characters ^p to represent a paragraph marker (or ^p^p for two of them), so:

  • Type “^p^p” into the Find what box; and
  • Type “[para]” into the Replace with box; then
  • Click Replace All:

This is how the document should change once the blank lines have gone:

5 Now get rid of all the redundant paragraph marks

We are now going to search for all the extra paragraph markers and replace them with spaces. (Look at the paragraph markers – if there is already a space in front of them, then you need to replace them with “nothing” – i.e. you leave the Replace with box blank.)

  • Type ^p into the Find what box;
  • Put your cursor into the Replace with box and hit the space bar. (If you don’t need spaces, use your mouse to select and delete any invisible spaces which might be lurking there); then
  • Hit Replace All:

Your document should now be a complete mess and look something like this (paragraph breaks highlighted):

6 Now reinstate the paragraph breaks

This is where the magic really happens and the mess instantly becomes a nicely formatted document. We now need to get rid of the temporary “[para]” placeholders and replace them with real paragraph breaks.

  • Type “[para]” into the Find what box;
  • Type “^p” into the Replace with box [2];
  • Hit “Replace All”:

If all has gone to plan, then you should have a nice, clean, plainly formatted document that you can edit or import into your favourite TM tool!
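For readers who like to see the logic written down, steps 4 to 6 can be sketched as three plain find-and-replace operations. This is a rough Python illustration, not something Word runs: it assumes the pasted text has been saved as plain text, so that each of Word's ^p markers is a newline character, and that the real paragraph ends have already been marked with a blank line as in step 3. The function name `rewrap_paragraphs` is invented for the example.

```python
def rewrap_paragraphs(text, placeholder="[para]"):
    """Mimic steps 4-6 on plain text in which a blank line marks each real paragraph end."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text; choose another")

    # Normalise line endings and trim the ends of the text and of every line,
    # so each rejoined line gets exactly one space (the step 5 note about
    # spaces already sitting in front of the paragraph markers).
    lines = [line.rstrip() for line in text.replace("\r\n", "\n").strip().split("\n")]
    text = "\n".join(lines)

    # Step 4: two consecutive paragraph markers -> temporary placeholder.
    text = text.replace("\n\n", placeholder)
    # Step 5: every remaining paragraph marker -> a single space.
    text = text.replace("\n", " ")
    # Step 6: placeholder -> a real paragraph break.
    return text.replace(placeholder, "\n")


pasted = (
    "The quick brown fox \n"
    "jumps over the lazy dog.\n"
    "\n"
    "A second paragraph\n"
    "starts here.\n"
)
print(rewrap_paragraphs(pasted))
```

The order matters for the same reason it does in Word: if step 5 ran first, the blank lines would simply turn into double spaces and the paragraph boundaries would be lost.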

Postscript

  • Don’t try to do this with tables (the subject of a future post maybe).
  • If you use a PDF-to-Word converter, then these same Find & Replace techniques can often be used to fix up poorly converted text.
  • Rather than going through all these steps every time, they can be automated by recording a macro and putting a button to do the job on the toolbar. One click and the job is done! (Again this could be the subject of another post!)
  • Qabiria.com have an excellent, detailed article on using PDF-to-Word converters here: http://bit.ly/9TqbGH
  • The examples in this post were illustrated using Microsoft Word 2007 and Adobe Reader X.

[1] You can control which formatting marks you would like to have displayed when you switch the “Show” button on. Click Office Button|Word options|Display. Because translators usually work on documents that other people have created and formatted, I recommend that they select “Show all formatting marks” so that they can always see (and work around) formatting mistakes made by others.
[2] If you want to control paragraph spacing with a blank line, then you’ll want to use two paragraph markers (i.e. “^p^p”).


31 Responses to How to put a PDF cleanly into Word or into your TM tool using really (really!) simple skills

  1. Martin says:

    >Rather than going through all these steps every time, they can be automated by recording a macro and putting a button to do the job on the toolbar. One click and the job is done! (Again this could be the subject of another post!)

    I really ought to do this … every day (almost, anyway) I get new jobs in PDF format and convert them using ABBYY Finereader (best program I’ve found for the job to date), and then do a series of manual search-and-replaces to set up my preferred formatting structure. The number of times I must have done this will be in the thousands by now, yet I remain macro-shy. Something I will have to get over, evidently!

    (Speaking of getting over it, I am one of the “older” – I guess, by now – translators who keeps the formatting marks hidden, calling them into view only when actually required.)

    One thing users of OCR software need to get used to is the notion that the program *won’t* do everything automatically – text in columns needs to be selected column-by-column to avoid having everything run together if the program doesn’t recognise all the column breaks, for instance. One of my worst experiences was a very large Korean document that had started out in life as multi-column text but, before it came to me, all the column breaks (and the returns within the columns) had been lost, so I had to separate the text out before even starting the translation. Still gives me the heeby-jeebies to think about it.

  2. Pingback: Easy! Easy! How to copy a table from a PDF into Word… for beginners #xl8 #t9n | The translation business

  3. Pingback: “I want to translate a web page… but using Word!” Converting HTML to Word format | The translation business

  4. ISO 9001 says:

    Very good post, I was really searching for this topic, as I wanted this topic to understand completely and it is also very rare in internet, that is why it was very difficult to understand.

    Thank you for sharing this.

    regards:
    ISO 9001

  5. Paul says:

    Great article. I found that replacing “.^p” with “.[para]” means I don’t have to find each proper paragraph break individually. Although if your paragraphs don’t all end with a full stop, you’ll need to take this into account.
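Paul's shortcut amounts to treating any line break that follows a full stop as a real paragraph end and every other line break as a wrapped line. A minimal Python sketch of that heuristic, again assuming plain text in which each paragraph marker is a newline (the function name `rewrap_by_full_stop` is hypothetical):

```python
import re


def rewrap_by_full_stop(text):
    """Keep line breaks that follow a full stop; turn all other line breaks into spaces."""
    # Drop blank lines and trailing spaces so only the line endings themselves are judged.
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    joined = "\n".join(lines)
    # A newline NOT preceded by "." is a wrapped line, so it becomes a space.
    return re.sub(r"(?<!\.)\n", " ", joined)


pasted = (
    "First sentence. It wraps\n"
    "onto a second line.\n"
    "New paragraph here\n"
    "and it ends.\n"
)
print(rewrap_by_full_stop(pasted))
```

As Paul notes, this is only a heuristic: a wrapped line that happens to end in a full stop will be split off as its own paragraph, and a paragraph that ends without one (a heading, say) will be merged into the next, so the result still needs a quick check against the original.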

  6. Kemal says:

    I have just translated a PDF file badly rendered into Word using someone else’s conversion package and it was a nightmare. I will try the tips but I think next time round, I will just charge triple rate. Most clients don’t care a toss about wasting the translator’s time and patience. The only solution is to make customers pay for their ignorance and disdain.

  7. Isabelita says:

    Great post… but it did not solve my problem! Yes, I’m a translator and need to type over a Word page. I converted my PDF doc online and it was emailed back to me with the text wrapping borders. After I read this post I copied and pasted the PDF doc into Word and it again shows the text wrapping borders. I tried all the choices, back or front of text… nothing! Please help. Thanks, Issy

  8. bernagora says:

    Great article. It is well advisable to learn all the special formatting symbols and how to use them in Word replace. Often I have to remove soft line breaks (^l) – but first converting double soft line breaks to real line breaks (^l^l to ^p).

  9. HakanE. says:

    Great post. Thanks for sharing this.

  10. lindawonders says:

    Thank you so much for sharing this. I had to copy at least 50 articles from PDF into Word and just about drove myself nuts. This has saved me so much time and my sanity. 🙂

  11. Vin Jat says:

    Thank You so much!!! My problem just got sorted.

  12. Joey H says:

    I don’t know if this will help you all but for me it made it super-easy. It doesn’t make the words in the sentences connect together as far as I can tell, but it takes the original document (I was converting DOS text letters to PDF files) and when you paste it into a program, such as Gmail or I’m sure any other program like wordpad or Word, it preserves the formatting of the spaces between the paragraphs. Here’s how I did it: Using Foxit Reader choose the option in the VIEW tab and then TEXT VIEWER, and then copy and paste your text. I noticed this was all I needed since I was just taking text from a DOS program and wanting to get it into an email. Hopefully this will help someone else. I’m not sure if Adobe Reader has the same option but I didn’t see it.

  13. willem says:

    Just saved me so much work… Thank you very much for making the world a better place!

  14. Per Hansen says:

    Wonderful post and so very very helpful. Definitely solved my problem and thanks to the other people who commented as well.

  15. Claire says:

    Immensely helpful – so simple and intelligent. Thanks!

  16. Gabrielle says:

    I was so thrilled to find your post, thank you. I use the GIRDAC PDF Converter to convert PDFs from magazine scans. It works brilliantly, however I end up with article columns all over my Word document which make it maddening when using Trados… Is there any way of adapting your method so that I have simple paragraphs without the columns?

    • Hi Gabrielle. There are several things you can do. First, try selecting the whole Word document and then from the main menu select “Format” and then “Columns…”. In the “Columns” dialog box, set columns to “One” (or “Number of columns” to 1). This should get rid of all the columns which have been imported from the magazine article.

      If other aspects of formatting such as bolds and italics etc are not important, then try saving the whole document as a plain text file. Close the document and re-open the text file in Word. You will have a nice clean document with all the formatting (except the division into paragraphs) stripped away.

  17. Hello! I have converted my Pdf to Word, but it won’t let me select all, (only parts of the text in columns/boxes,etc). I am going mad trying to find how to select the whole lot. I know there’s a simple way, because I’ve done it before, but I cannot for the life of me work out how to do it. Any help you can give would be greatly appreciated. Thanks!

    • It’s possible that some of the text is in text boxes and some of it not. In Word you could first “Save as” a “text file” (i.e. as “plain text”). Close the document and then open the plain text file you have just saved. Saving it as “plain text” will strip out all columns, text boxes and other formatting leaving you with just the text (and paragraph marks). You can now reformat the document from scratch.

  18. Katrina J. says:

    Hello there! Just wanted to let you know about a free and useful toolkit to convert any pdf file to text, mobi or epub formats, when needed. Simply upload files and the tool will convert them quickly: http://kitpdf.com/. Thanks!

  19. sahil malhotra says:

    thanks sir

  20. This trick is obvious, I used it for years, so thanks for nothing.. The problem is, when copy-pasting text from almost any PDF document, the copied text h a s r a n dom s pa ces between characters. I haven’t yet found a solution for this.. No converter can get it right.

  22. Luca says:

    You know what, damn. Thanks. Usually when searching for a solution on the internet for a little problem with formatting or some other organisational issue I find the information and quickly close the tab and continue on my way.

    But, this solution, whilst not overly elegant, works so damn well.

    Thanks!

  23. Neil Richard Innis says:

    Thanks that really helped me a lot!
    Also, I wanted to know if you’ve already written something on
    “Rather than going through all these steps every time, they can be automated by recording a macro and putting a button to do the job on the toolbar. One click and the job is done!..”
    Because if I can save the pain of going through all those replacements every time it will be bliss.

  24. Thanks for sharing useful Information. I appreciate your content.
