Contents
- Preface
- Reading Word content
- NPOI
- NPOI installation
- NPOI extracts Word content
- Package DLLs with Costura.Fody
- python-docx
- Reading PDF content
- python-docx automatically generates Word
- Global font
- The content type
- Cell merge
- Conclusion
Preface
Word is one of those things that is hard to use and ugly, yet you have to use it, and in that respect it outdoes even Windows. (After all, Gates got access to the Macintosh and to Xerox's ideas by writing Office for Apple. Windows can be replaced 1000% by macOS plus a PlayStation, no problem at all (smirk), but Office can't.) It's not that nothing better than Office exists. It's a vestige of history, like those puzzling legacy fields in a CPU from the "toothpaste factory".
In short, Word and PDF processing can be automated with libraries such as python-docx for Python, and NPOI and PDFBox for C#.
Finally, if you want to build custom features, learn from the official documentation, not from blogs. This matters most when you only have Baidu and not Google: the outdated posts on the first few pages of search results no longer work. To be honest, Baidu is often worse than a blog site's own site search. I recently found that Toutiao's web-wide search gives good results, so if you don't have Google you can try Toutiao. Failing that, even Bing is a better choice.
Reading Word content
All right, let's cut to the chase and look directly at getting content out of a Word file. This can be done with NPOI in C# or with python-docx in Python.
NPOI
NPOI installation
Let's start with the wiki definition. Apache POI is an open source library from the Apache Software Foundation that provides APIs for Java programs to read and write Microsoft Office files. .NET developers can use NPOI (POI for .NET) to access the same functionality.
In fact, in recent years Microsoft has brought C# back to life a little by launching cross-platform application development frameworks such as .NET Core. I don't like Microsoft's products much, but I'm a big fan of this strategy; it has even put Android on hardware such as the Surface Duo. Although I've used C# in some of my earlier Unity games, this is the first time I've used it from a software development perspective, and I have to say that NuGet is impressive and easy to use.
This assumes that you already have VS2019 or an older version installed. Note that I only cover developing .NET Framework applications on Windows, because I haven't touched Mono yet; if you are interested, you can try it out.
- Create a .NET Framework console application:
- Then type nuget in the search box and click Manage NuGet Packages:
- Then search for NPOI, click Install, and you're done. This is perhaps two more steps than brew install on macOS, apt-get install on Linux, or pip3 install for Python, but I'm happy with it. It's far more modern (this is 2019, after all) than hunting for DLLs and copying them around.
- After installation, you can see the added libraries under References in the solution panel on the right:
NPOI extracts Word content
In fact, NPOI is powerful enough to do almost everything related to Word, but here I only use it to extract the content of a Word file. That's because the much lighter python-docx library is coming up next: it needs neither Visual Studio nor Windows to handle .docx files.
The source code is as follows:
using NPOI.XWPF.UserModel;
using System.IO;
using System.Text;

namespace getWord
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the input (.docx) and output (.txt) paths from the console.
            string in_path = System.Console.ReadLine();
            string out_path = System.Console.ReadLine();

            string text = "";
            string tmp_text;
            using (Stream stream = File.OpenRead(in_path))
            {
                XWPFDocument doc = new XWPFDocument(stream);
                // Collect every non-empty paragraph, one per line.
                foreach (var para in doc.Paragraphs)
                {
                    tmp_text = para.ParagraphText;
                    if (tmp_text.Trim() != "")
                        text += tmp_text + "\n";
                }
            }

            // Write the collected text to the output file encoded as GB2312.
            using (StreamWriter writer = new StreamWriter(out_path, false, Encoding.GetEncoding("gb2312")))
            {
                writer.Write(text);
            }
        }
    }
}
I read the input and output paths from the console, loop over the paragraphs of the Word file to accumulate their text in memory, and finally write the result to the output file encoded as GB2312.
For anything beyond this, I recommend going to the NPOI website.
Package DLLs with Costura.Fody
If you ship a Release build like this, with loose DLLs next to the EXE, your boss will yell at you. It's unprofessional. At the very least you should package the DLLs into the EXE or into a single DLL.
You could embed the DLLs as resource files by hand, but that would be inelegant and cheesy. Again, let's do it the way it should be done in 2019.
Search for Costura.Fody in NuGet and install it. After that, a Release build packages the referenced DLLs into a single EXE file.
python-docx
Okay, now that we're in Python we can all relax and forget the gigabytes, or tens of gigabytes, of Visual Studio we installed to write C#. After all, as Gates reportedly said, '640K is more memory than anyone will ever need.' Now all you need is:
pip3 install python-docx
The official documentation is also pretty good, and I found that a Chinese python-docx article on Zuoyebuluo ("Homework Tribe"; yes, the home of the Cmd Markdown editor I recommended in my post on what's on my macOS) is pretty much a Chinese version of the official docs. I basically relied on these two plus Google to learn the library. As you will find, the hard parts are table handling and style modification.
import docx

doc = docx.Document('./t.docx')
doc_text = ''
doc_table_text = ''
# Collect the text of every paragraph.
for paragraph in doc.paragraphs:
    doc_text += paragraph.text + '\n'
# Collect the text of every table cell.
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            doc_table_text += cell.text + '\n'
with open('./tt.txt', 'w') as f:
    f.write(doc_text)
    f.write(doc_table_text)
# doc.save('./tt.docx')
The code is quite easy to understand. Beyond the official documentation, I will share some of my own experience with generating Word documents automatically later on. Of course, that is where most of the problems show up (shrug).
Reading PDF content
This time I'm again working in C#, with a library called PDFBox. PDFBox is actually a Java library made by the Apache PDFBox team; the .NET version is generated from it.
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
using System.IO;
using System.Text;

namespace getPDFCon
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the input (.pdf) and output (.txt) paths from the console.
            string in_path = System.Console.ReadLine();
            string out_path = System.Console.ReadLine();

            // Load the PDF and extract all of its text.
            PDDocument doc = PDDocument.load(in_path);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            string text = pdfStripper.getText(doc);
            doc.close();
            // Console.WriteLine(Utf8ToGB2312(text));
            // Console.ReadKey();

            // Write the text to the output file encoded as GB2312.
            StreamWriter swPdfChange = new StreamWriter(out_path, false, Encoding.GetEncoding("gb2312"));
            swPdfChange.Write(text);
            swPdfChange.Close();
        }
    }
}
It's almost the same idea as reading Word above, so I won't say more.
python-docx automatically generates Word
Here I'll go through some python-docx operations, starting with the difficult parts: style modification and merging table cells. New pitfalls will be added as I run into them.
Global font
First, you can set the global font.
doc.styles['Normal'].font.name = u'宋体'
doc.styles['Normal'].font.size = Pt(9)
doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
Note that for a Chinese font the third line is required; otherwise the font does not take effect for Chinese text. The second line sets the font size. You need to import Pt from docx.shared and qn from docx.oxml.ns. Alternatively, you can import the whole docx package and write docx.shared.Pt.
The content type
If you want to change the font of just one piece of content without affecting the whole document, the previous approach will not work.
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Pt

p = doc.add_paragraph()
font = p.add_run('title').font
font.bold = True
font.size = Pt(14)
p.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
Here you see p.add_run. It adds a Run instance to the current Paragraph instance, and settings made on the run's font apply only to that run. Compare this with setting properties through the Paragraph instance's style directly.
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Pt

doc = Document()
# Heading 1: styled through a run.
p = doc.add_paragraph()
font = p.add_run('heading 1').font
font.bold = True
font.size = Pt(14)
p.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
# Headings 2 and 3: styled through the paragraph's style.
p2 = doc.add_paragraph()
p2.text = 'heading 2'
p2.style.font.size = Pt(20)
p3 = doc.add_paragraph()
p3.text = 'heading 3'
p3.style.font.size = Pt(40)
doc.save('a.docx')
Unfortunately, the font setting for heading 3 overwrites the font setting for heading 2, because p2.style and p3.style are the same shared paragraph style. Heading 1, which is set via the run object, is not affected.
Thus, to properly style a single piece of content you must use a run; otherwise the change leaks into other content. Changing the code to use add_run for every heading confirms this.
Good. That’s what we want. The same is true in the table content, which will not be repeated.
One more thing to note: if you have already assigned the text through p.text and then also call p.add_run('title').font, you will get two copies of the text. So when you add styled content with add_run, do not assign the text field as well.
Cell merge
Let's say I create a table and try merging cells. You will find that the merged cell keeps the content of both original cells, which is fine if that is what you need. If it is not, you need to think about a strategy for the merged content, because you cannot fix it up cell by cell afterwards. A good strategy is to save the content you want in a temporary variable and then overwrite the merged cell's content after the merge.
Conclusion
In fact, both NPOI and python-docx are excellent libraries for helping developers generate Word documents automatically. If you don't think so, let me give you a counterexample. Microsoft.Office.Interop.Word is Microsoft's COM component. To use it you have to install Windows and Office, and reference the COM component version that matches your Office: 15.x for Office 2013, 12.x for Office 2007. Then you write the code, and every time it runs it has to start Word. You can do that in the background, but Word is still launched, so it is very inefficient.
Of course, this post will keep being updated as I use these libraries, until I feel there is nothing left to update. If you like it, you can give it a thumbs up. If you have any comments or suggestions, see you in the comments section.