使用 Weasyprint 將 HTML 轉成 PDF

Posted by blueskyson on August 15, 2021

在幫 tldr page 修改 pdf 渲染器時偶然發現 Weasyprint 這個好用的函式庫,以下介紹三種使用方法。首先透過 pip 安裝:

$ pip3 install weasyprint

讀取 HTML 檔轉換成自動換頁的 PDF

首先我找了一個 HTML 範例 如下:

mystyle.css

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
body {
    padding-left: 11em;
    font-family: Georgia, "Times New Roman", Times, serif;
    color: purple;
    background-color: #d8da3d
}

ul.navbar {
    list-style-type: none;
    padding: 0;
    margin: 0;
    position: absolute;
    top: 2em;
    left: 1em;
    width: 9em
}

h1 {
    font-family: Helvetica, Geneva, Arial, SunSans-Regular, sans-serif
}

ul.navbar li {
    background: white;
    margin: 0.5em 0;
    padding: 0.3em;
    border-right: 1em solid black
}

ul.navbar a {
    text-decoration: none
}

a:link {
    color: blue
}

a:visited {
    color: purple
}

address {
    margin-top: 1em;
    padding-top: 1em;
    border-top: thin dotted
}

sample.html

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>

<head>
    <title>My first styled page</title>
    <link rel="stylesheet" href="mystyle.css">
</head>

<body>

    <!-- Site navigation menu -->
    <ul class="navbar">
        <li><a href="index.html">Home page</a></li>
        <li><a href="musings.html">Musings</a></li>
        <li><a href="town.html">My town</a></li>
        <li><a href="links.html">Links</a></li>
    </ul>

    <!-- Main content -->
    <h1>My first styled page</h1>
    <p>Welcome to my styled page!</p>
    <p>It lacks images, but at least it has style. And it has links, even if they don't go anywhere&hellip;</p>
    <p>There should be more here, but I don't know what yet.</p>

    <!-- Sign and date the page, it's only polite! -->
    <address>Made 5 April 2004<br>
  by myself.</address>

</body>

</html>

sample.htmlmystyle.css 放在同一目錄下即可在瀏覽器中產生以下預覽:

接著我們在當前目錄寫一個小小的 python 腳本:

convert.py

1
2
from weasyprint import HTML
HTML("sample.html").write_pdf("sample.pdf")

執行腳本後便會得到一個漂亮的 PDF 檔了

$ python convert.py

將 HTML 字串轉換成自動換頁的 PDF

邏輯和讀取 HTML 檔大同小異,但是在轉換時需要將所有 css 檔放入 list 裡,然後作為參數丟給 weasyprint,一樣使用上面的 sample.htmlmystyle.css 作為範例。

convert_string.py

1
2
3
4
5
6
7
8
9
10
from weasyprint import HTML, CSS

# Append as many style sheets as you want
csslist = []
csslist.append(CSS("mystyle.css"))

with open('sample.html', 'r') as file:
    # convert HTML string to PDF file 
    data = file.read()
    HTML(string=data).write_pdf("sample.pdf", stylesheets=csslist)

手動換頁

PDF 與文字文件最大的差異就是 PDF 有頁數,因此將很長的網頁轉換成 PDF 就需要考慮排版的問題。我們可以利用

<p style="page-break-before: always" ></p>

來強制把此行 HTML 以下的內容放到下一頁。每當需要手動換頁時就加上此行即可達到效果。我提供了另一個文檔範例

sample.html

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
    <title>Hello World</title>
</head>
<body>
    <!-- Main content -->
    <h1>What is GitHub?</h1>
    <p>GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.</p>
    <p>This tutorial teaches you GitHub essentials like repositories, branches, commits, and Pull Requests. You’ll create your own Hello World repository and learn GitHub’s Pull Request workflow, a popular way to create and review code.</p>
    <h2>No coding necessary</h2>
    <p>To complete this tutorial, you need a GitHub.com account and Internet access. You don’t need to know how to code, use the command line, or install Git (the version control software GitHub is built on).</p>

    <h1>Step 1. Create a Repository</h1>
    <p>A repository is usually used to organize a single project. Repositories can contain folders and files, images, videos, spreadsheets, and data sets – anything your project needs. We recommend including a README, or a file with information about your
        project. GitHub makes it easy to add one at the same time you create your new repository. It also offers other common options such as a license file.</p>
    <p>Your hello-world repository can be a place where you store ideas, resources, or even share and discuss things with others.</p>
    <h2>To create a new repository</h2>
    <ol>
        <li>In the upper right corner, next to your avatar or identicon, click and then select New repository.</li>
        <li>Name your repository hello-world.</li>
        <li>Write a short description.</li>
        <li>Select Initialize this repository with a README.</li>
    </ol>
</body>
</html>

透過 convert.py 轉換後會像這樣

一班來說我們會希望大標題在每一頁的開頭,於是我將 <p style="page-break-before: always" ></p> 放在 <h1>Step 1. Create a Repository</h1> 前:

1
2
3
4
5
6
7
    ...use the command line, or install Git (the version control software GitHub is built on).</p>
    
    <!-- the folowing will be in next page -->
    <p style="page-break-before: always" ></p>
    
    <h1>Step 1. Create a Repository</h1>
    <p>A repository is usually used to organize a single project...

透過 convert.py 轉換後便能分成兩頁