Writing a mobile photo album import tool in Python

I haven't been working on Spring Cloud lately; instead, Python and I have fallen in love. Admittedly I underestimated Python at first, and it took me almost two weeks to produce anything presentable.

How it started

Recently my little phone keeps warning me about insufficient storage, and sometimes it even fails to take photos. As a seasoned user, I used to find a few rarely used apps and delete them to free up space. Lately, though, that no longer seems to be enough.

Later I checked again and found that the album was simply too big: more than 80 GB in total.

Alas, with high-resolution cameras, photos keep piling up and every one of them is hard to part with.

There was no way around it: since I couldn't bring myself to delete them, I would export them to the computer. First up was my main desktop (with a second-generation i3 processor), running a fine Linux operating system (no need to hunt around for an activation code) and Shotwell, a popular desktop photo manager that handles photo import.

Shotwell automatically imports photos into the specified directory, organized by the date they were taken. As long as it can connect to your phone, this is pretty easy. Connecting to the phone, however, turned out to be the hard part, and it failed with an error message.

Since photos couldn't be imported directly from the device, I tried importing them from a folder instead. After a long wait, the error popped up again.

(Pretend there is a picture here)

Oh, OK. Maybe my little phone doesn't get along with cold, aloof Linux, so why not try the more down-to-earth Windows? I took out my old laptop, connected the phone, and selected photo import. Looking at the familiar file copy window, I couldn't help but sigh: Windows really does understand everyday life.

Huh? What, another error! Looks like big-vendor software isn't so great either.

Whether the phone is being pretentious or the computer is being prudish, I don't care to find out. In short, the two just don't get along.

Isn't this just copying files? Why is it so hard? If it has to be done, I'll do it myself.

Doing it myself

As usual, look at the final result first; if you think it looks OK, we'll go into the details afterwards (after all, there's a lot of rambling ahead).

This is a Python script that runs in a terminal. Why Python? Because I figured that a little photo import feature is just a recursive copy, isn't it? Python is more than enough!

Original plan

The program is actually quite simple. Photo import follows these steps:

1. Recursively traverse the specified directory.
2. If a file of a specified type is encountered, compute its MD5 value. Otherwise, skip the file.
3. Look up the MD5 value in the database. If it is found, skip the file and go back to step 2.
4. If it is not found, the file is new: get its metadata (resolution, geographic information, etc.).
5. Copy the file to the specified directory.
6. Save the metadata to the database along with the MD5 value.
7. Repeat from step 2 until the traversal is complete.

That looks like a lot of steps, but the actual code is short and simple. The traversal is done with os.walk().
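The steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the extension list, the function names md5_of and import_photos, the in-memory dict standing in for the database, and naming copied files by their checksum are all assumptions made for the example.

```python
import hashlib
import os
import shutil

PHOTO_TYPES = {".jpg", ".jpeg", ".png", ".heic", ".mp4"}  # assumed extension list


def md5_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large videos are never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_photos(source_dir, target_dir, seen):
    """Copy every new photo under source_dir into target_dir.

    `seen` maps MD5 -> copied path and stands in for the database.
    Returns the list of files copied during this run.
    """
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            extension = os.path.splitext(name)[1].lower()
            if extension not in PHOTO_TYPES:
                continue  # not a photo or video: skip
            path = os.path.join(root, name)
            checksum = md5_of(path)
            if checksum in seen:
                continue  # already imported: skip
            # Assumption: name the copy by checksum so same-named files never collide
            destination = os.path.join(target_dir, checksum + extension)
            shutil.copy2(path, destination)
            seen[checksum] = destination
            copied.append(destination)
    return copied
```

Calling import_photos a second time with the same `seen` dict copies nothing, which is the deduplication behaviour the MD5 lookup is meant to provide.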

Huh? How do you get image metadata? That isn't hard either: the PIL library can read all kinds of photo metadata. However, there are some unexpected cases where the image contains no metadata (it may have been stripped, or the image may never have had any), in which case other useful information (such as the file's creation time) has to be taken from the file properties. See the code here:

from support.utils.file.base import MediaFile
from PIL import Image
from PIL.ExifTags import TAGS
from support.ui.console import Log

class Picture(MediaFile):
    KB = 1024
    KEY_WIDTH = ['ImageWidth', 'ExifImageWidth']
    KEY_HEIGHT = ['ImageLength', 'ExifImageHeight']
    KEY_GPS = 'GPSInfo'
    KEY_DATE = 'DateTimeOriginal'
    WIDTH = 'Width'
    HEIGHT = 'Height'
    DATE = 'Date'
    LOAC = 'Local'
    HASH = 'Hash'

    def __calcDMS(self, value):
        # A rational is stored as (numerator, denominator)
        return value if len(value)==1 else value[0]/value[1]

    def __transDMS(self, value):
        # GPSInfo keys are integers: 1/2 = latitude ref/value, 3/4 = longitude ref/value
        if(2 not in value.keys()):
            return ""
        latitudinal = self.__calcDMS(value[2][0]) + self.__calcDMS(value[2][1]) / 60.0 + self.__calcDMS(value[2][2]) / 3600.0
        longitude = self.__calcDMS(value[4][0]) + self.__calcDMS(value[4][1]) / 60.0 + self.__calcDMS(value[4][2]) / 3600.0
        return "{},{}".format(longitude if value[3].upper() == 'E' else -longitude, latitudinal if value[1].upper() == 'N' else -latitudinal)

    def __hasTag(self, filtermap, key):
        if(filtermap is None):
            return True
        for filter in filtermap:
            if(isinstance(filter, str) and key == filter):
                return True
            if(isinstance(filter, list) and key in filter):
                return True
        return False

    def __transKey(self, key):
        if(key == self.KEY_DATE):
            return self.DATE
        if(key == self.KEY_GPS):
            return self.LOAC
        if(key in self.KEY_HEIGHT):
            return self.HEIGHT
        if(key in self.KEY_WIDTH):
            return self.WIDTH
        return key

    def getMetaData(self, path, filtermap=None):
        metaData = {}
        metaData[self.HASH] = self.calHash(path)
        if(self.isVideo(path)):
            metaData[self.DATE] = self.getCreateDate(path)
        else:
            try:
                file = Image.open(path)
                info = file._getexif()
                if(info):
                    for (tag, value) in info.items():
                        key = TAGS.get(tag,tag)
                        if(self.__hasTag(filtermap, key)):
                            metaData[self.__transKey(key)] = self.__transDMS(value) if(key == self.KEY_GPS) else value
            except Exception as e:
                Log.e(__file__, type(e), str(e))
            finally:
                # No EXIF date available: fall back to the file's creation date
                if(self.DATE not in metaData.keys()):
                    metaData[self.DATE] = self.getCreateDate(path)
        return metaData

    def getData(self, meta, key, default=""):
        if(key not in meta.keys()):
            return default
        return meta[key]
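The GPS handling in the class above boils down to converting degrees, minutes and seconds into signed decimal degrees. As a standalone sketch of that arithmetic (the function name dms_to_decimal and the sample coordinates are made up for illustration and do not come from the tool):

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # North and East are positive; South and West are negative
    return value if ref.upper() in ("N", "E") else -value


# Hypothetical coordinates, only to exercise the function
latitude = dms_to_decimal(30, 39, 36, "N")
longitude = dms_to_decimal(104, 3, 36, "e")
southern = dms_to_decimal(33, 45, 0, "S")
print(latitude, longitude, southern)
```

One minute is 1/60 of a degree and one second is 1/3600 of a degree, which is where the two divisors come from.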

Note: metadata isn't strictly necessary for importing photos, but when I started I hadn't yet decided on the rule for automatically classifying photos: by date? Or by location?

As the original plan shows, copying depends on the main thread processing files one by one. That isn't a problem for a small amount of data, but I have over 80 GB of photos! Copying them one at a time would take forever.

This plan is just toying with me! No, it has to be optimized!

Second-generation plan

Since processing one file at a time is too slow, let's do it in parallel with multiple threads.

Threads are a good thing, and the performance gain is obvious, but creating threads consumes system resources. With 80 GB of small files, one thread per file... hehe.

While improving efficiency, we also have to watch system resource usage, so a thread pool manages the threads. Fortunately, Python already ships a mature thread pool, ThreadPoolExecutor, that can be used directly. The plan becomes:

1. [new] Initialize a thread pool of the specified size.
2. Recursively traverse the specified directory.
3. If a file of a specified type is encountered, compute its MD5 value. Otherwise, skip the file.
4. Look up the MD5 value in the database. If it is found, skip the file and go back to step 3.
5. If it is not found, the file is new: get its metadata (resolution, geographic information, etc.).
6. [new] Hand the file copy to a thread from the thread pool.
7. If the number of threads in the execution queue reaches the threshold, wait for at least one thread to finish copying before continuing.
8. Save the metadata to the database along with the MD5 value.
9. Repeat from step 3 until the traversal is complete.
10. [new] After the traversal, wait for all threads in the execution queue to finish.

self.threadPool = ThreadPoolExecutor(maxqueue)
self._lock = RLock()
self.dashboard = dashboard
self.idleQueue = []
self.runningQueue = []

The design uses two queues to hold tasks, and it still has room for optimization.

The original idea was that the main thread would first push tasks into the idle queue. Once the number of tasks in the idle queue reached a threshold, it would take threads from the thread pool to execute them one by one and push the allocated threads into the execution queue. This was meant to prevent the main thread from constantly requesting threads from the pool and causing it to create far more threads than expected.

Then, when the number of threads in the execution queue reached twice the threshold, the main thread was forced to wait for threads to finish, in the hope of keeping the pool's threads as busy as possible.

However, this design has inherent flaws. First, a thread returns directly to the pool after it finishes, and my execution queue cannot sense this change, so a thread may actually be idle while still sitting in the execution queue. Second, threads only really start executing once the number of tasks in the idle queue reaches the threshold, so they are not put to use during the period before that. Finally, because of the queue limits, tasks actually run in batches rather than in a rolling fashion.

So I changed it to submit tasks first and then decide whether to wait for an idle thread. Traces of the previous design are still visible in the code (to be honest, I couldn't be bothered to clean them up).

def execJobs(self, force=False):
    # "with" releases the lock even if submitting a task raises
    with self._lock:
        # Submit everything queued so far; the pool starts running it right away
        for task in self.idleQueue:
            self.runningQueue.append(self.threadPool.submit(task['func'], task['args']))
        self.idleQueue.clear()
        # Too many tasks in flight: block until at least one of them completes
        while(len(self.runningQueue) >= self.maxqueue):
            Log.i(Task.TAG, "waiting for all running thread completed", len(self.runningQueue))
            for thread in as_completed(self.runningQueue):
                self.runningQueue.remove(thread)
                break
        # Final call after the traversal: wait for every remaining task
        if(force):
            for thread in as_completed(self.runningQueue):
                Log.i(Task.TAG, thread)
            self.runningQueue.clear()
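For comparison, the "submit first, then wait if too many are in flight" idea can also be sketched with concurrent.futures.wait, without a hand-maintained running queue. This is an alternative illustration under assumed names (run_rolling, max_pending), not the code the tool uses:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_rolling(jobs, worker, max_workers=4, max_pending=8):
    """Run worker(job) for every job, keeping at most max_pending futures in flight."""
    results = []
    pending = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for job in jobs:
            pending.add(pool.submit(worker, job))
            if len(pending) >= max_pending:
                # Block until at least one task finishes, then keep submitting
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(future.result() for future in done)
        # Traversal finished: drain whatever is still running
        done, _ = wait(pending)
        results.extend(future.result() for future in done)
    return results


squares = run_rolling(range(20), lambda n: n * n, max_workers=3, max_pending=5)
```

Because wait returns the finished and unfinished futures as two sets, completed tasks leave the pending set as soon as they are noticed, which gives rolling rather than batch execution. Results arrive in completion order, not submission order.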

Now that there are threads and the speed is up, I tried it with my little phone: all 80-plus GB of photos were copied without a hitch, and they were saved into separate folders named by date.
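How a date-named folder can be derived is sketched below. The function name date_folder and the YYYY-MM-DD folder format are assumptions for illustration; the only fixed fact is that EXIF stores DateTimeOriginal as a "YYYY:MM:DD HH:MM:SS" string.

```python
import os
from datetime import datetime


def date_folder(target_dir, fallback_timestamp, exif_date=None):
    """Return target_dir/YYYY-MM-DD for a photo.

    exif_date is the EXIF DateTimeOriginal string ("YYYY:MM:DD HH:MM:SS");
    fallback_timestamp is a POSIX timestamp such as os.path.getmtime(path).
    """
    taken = None
    if exif_date:
        try:
            taken = datetime.strptime(exif_date.strip(), "%Y:%m:%d %H:%M:%S")
        except ValueError:
            taken = None  # malformed EXIF date: fall through to the timestamp
    if taken is None:
        taken = datetime.fromtimestamp(fallback_timestamp)
    return os.path.join(target_dir, taken.strftime("%Y-%m-%d"))
```

The fallback mirrors the metadata code earlier: when the photo carries no usable date, the file's own timestamp decides the folder.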

It would have been great if things had ended right there. But after finishing, I had the idea that a progress bar showing the copy progress would be nice, and so I fell into a big pit.

Rich

Rich is a Python library that provides rich text and elegant formatting in the terminal. The Rich API makes it easy to add colors and styles to terminal output. Rich can also render pretty tables, progress bars, Markdown, syntax-highlighted source code, tracebacks, and more.

The above is all taken from Rich's README. Undeniably the result really is pretty, and the library is very well packaged. There is just too little documentation. The Chinese-language material is either copied from the README or copied from one and the same article. Alas, perhaps that is simply the state of tech blogs here.

Below is a record of the pitfalls I hit with Rich. Everything is my personal understanding; if it conflicts with the official Rich documentation, the official documentation takes precedence.

Layout

Now that I'm using Rich, I naturally want to make good use of its powerful display features. I want not only progress bars but also status, logs, and other related information on screen. That calls for Layout, which splits the screen.

The Layout API is pretty simple. It splits the screen into rows or columns, and each resulting area can be split again the same way. You can also specify each area's fixed size or its ratio. For example, this is how I split the whole screen.

layout = Layout(name="root")
layout.split(
    Layout(name="header", size=3),
    Layout(name="main", ratio=1),
    Layout(name="copyright", size=9),
)
layout["main"].split_row(
    Layout(name="info", ratio=2),
    Layout(name="progress", ratio=1)
)
layout["progress"].split(
     Layout(name="overall", size=3),
     Layout(name="jobs", ratio=1)
)

Once the screen is split, each area needs to be filled with the control it should display. Rich supports a variety of controls; I mainly use progress bars and text panels.

Progress bars: Progress

Style

You first need to define the display style for Progress. Rich ships several common column types that can be combined as needed. The style is specified when initializing Progress:

self.overallProgress = SmartProgress(
    TextColumn("[progress.description]{task.description}"),
    CountColumn(),
    BarColumn(),
    TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
)
self.jobsProgress = SmartProgress(
    TextColumn("{task.description}", justify="left"),
    "|",
    DownloadColumn(),
    BarColumn(bar_width=None),
    "|",
    TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
    "|",
    TimeRemainingColumn(),
)

SmartProgress here is my own wrapper around Progress, but its constructor takes the same parameters as Progress. The parameters look complicated at first glance, so let's take them apart.

Take overallProgress first. It is given TextColumn(), CountColumn(), BarColumn() and TextColumn(). Together these parameters define the display style of overallProgress, and the columns are displayed in the order in which they are passed. The result is:

Text | counter | progress bar | text

What do the first and second text columns actually show? Let's keep taking the parameters apart.

The first TextColumn argument is “[progress.description]{task.description}”. The bracketed [progress.description] part is Rich markup that applies the built-in style of that name, and {task.description} is replaced with the task's description property.

The second TextColumn is “[progress.percentage]{task.percentage:>3.0f}%”. It is styled with progress.percentage and displays the task's percentage property, right-aligned to three characters with no decimals and followed by a percent sign.
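The curly-brace parts of these templates are ordinary Python format fields. The sketch below fills them in with str.format, using a SimpleNamespace as a stand-in for a Rich task (an assumption for illustration; the bracketed style markup is left out because plain str.format does not interpret it):

```python
from types import SimpleNamespace

# Stand-in for a Rich task; only the attributes used by the templates exist here
task = SimpleNamespace(description="IMG_0001.jpg", percentage=42.6)

description_template = "{task.description}"
percentage_template = "{task.percentage:>3.0f}%"

description_text = description_template.format(task=task)
percentage_text = percentage_template.format(task=task)

print(description_text)        # IMG_0001.jpg
print(repr(percentage_text))   # ' 43%'
```

The >3.0f spec rounds to zero decimals and right-aligns within three characters, so the column keeps the same width from 0% to 100%.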

Looking back at overallProgress, its display is therefore:

Description | counter | progress bar | current percentage

Taking jobsProgress apart the same way shows that its display style is:

Task description | download count, progress bar | percentage | estimated time remaining

Column

Column is Progress's term for one element of a progress bar, and when several bars are shown together, Rich keeps their columns aligned. A column therefore only needs to care about its own content. The CountColumn that appears in the Progress parameters is a custom column of mine, with the following code:

from rich.progress import BarColumn, DownloadColumn, Progress, TextColumn, TimeRemainingColumn, ProgressColumn
from typing import Optional
from rich.text import Text, TextType
from rich.table import Column

class CountColumn(ProgressColumn):

    def __init__(
        self, binary_units: bool = False, table_column: Optional[Column] = None
    ) -> None:
        self.binary_units = binary_units
        super().__init__(table_column=table_column)

    def render(self, task: "Task") -> Text:
        if(task.total is None):
            count_status = f"{task.completed}/?"
        else:
            total = int(task.total)
            count_status = f"{task.completed}/{total}"
        count_text = Text(count_status, style="progress.download")
        return count_text

As you can see, a column's main job is to return what should be displayed at that moment.

Controlling progress bars

With multiple progress bars, Rich makes sure the columns are aligned. But haven't we initialized only two progress bars? Not quite. Take a look at the API that Progress provides and you'll see.

A Progress is not one specific progress bar on screen but a collection of bars that share the same style. add_task adds a bar that will actually be displayed. Each call returns a task ID, which can then be used to operate on that specific bar.

Progress lets you update what a bar displays, for example:

description: the description of the progress bar
total: the total length of the progress bar (default 100)
completed: the length completed so far
advance: an increment added to the current completed value, whereas setting completed directly overwrites it
custom fields

Beyond that, you can also get all current tasks, iterate over the state of each task, and so on. See progress.py in Rich's source code for details.
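A small sketch of these operations, assuming the rich package is installed. The file names and totals are invented, and the Progress is never started, so nothing is drawn; only the task state changes.

```python
from rich.progress import BarColumn, Progress, TextColumn

progress = Progress(
    TextColumn("{task.description}"),
    BarColumn(),
    TextColumn("{task.percentage:>3.0f}%"),
)

# Each add_task call creates one bar and returns its ID
first = progress.add_task("IMG_0001.jpg", total=200)
second = progress.add_task("IMG_0002.jpg", total=50)

progress.update(first, advance=50)      # completed: 0 -> 50
progress.update(first, advance=50)      # completed: 50 -> 100
progress.update(second, completed=25)   # completed set to 25
progress.update(second, completed=40)   # completed overwritten with 40

# Iterate over every task and read its current state
summary = {task.description: (task.completed, task.percentage) for task in progress.tasks}
```

Here advance accumulates across calls, while completed replaces the previous value, which is the difference described in the list above.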

Text display panel

Strictly speaking, a Panel is more like a “container” that wraps another control, usually a Table.grid or the Progress mentioned above. Here is an example that puts a grid into a Panel:

from rich.panel import Panel
from rich.table import Table
from rich.text import Text

class CopyrightPanel:
    def __init__(self, title="[ Copyright ]", style="bright_blue"):
        self.title = title
        self.style = style

    def __rich__(self) -> Panel:
        sponsor_message = Table.grid(padding=1)
        sponsor_message.add_column(style="green", justify="right")
        sponsor_message.add_column(justify="left")
        sponsor_message.add_row(
            "Gitee",
            "[u blue link=https://gitee.com/ray0728/multimedia-file-synchronizer/tree/release-mfs]https://gitee.com/ray0728/multimedia-file-synchronizer/tree/release-mfs"
        )
        sponsor_message.add_row(
            "Blog",
            "[u blue link=https://www.ray0728.cn/]https://www.ray0728.cn/"
        )
        sponsor_message.add_row(
            "Email",
            "[u blue link=mailto://[email protected]][email protected]"
        )
        intro_message = Text.from_markup("""Please provide more comments and bugs!
or buy me a coffee to say good job.
- Ray

The UI is developed based on the [bold magenta]Rich[/] component.
[u blue link=https://github.com/sponsors/willmcgugan]https://github.com/sponsors/willmcgugan[/]
""")
        message = Table.grid(expand=True)
        message.add_column()
        message.add_column()
        message.add_row(intro_message, sponsor_message)
        message_panel = Panel(
            message,
            title=self.title,
            border_style=self.style,
        )
        return message_panel

This is a wrapped panel that displays static copyright information.

Loading

Once the Progress and Panel objects are ready, they can be loaded into the Layout.

self.logRedirect = LogPanel()
# Log.logRedirect = NullDev()
Log.logRedirect = self.logRedirect
layout["info"].update(self.logRedirect)
layout["copyright"].update(CopyrightPanel())
layout["header"].update(ClockPanel(title, "white on blue"))
layout["overall"].update(Panel(self.overallProgress, title="Overall Progress", border_style="orange4"))
layout["jobs"].update(Panel(self.jobsProgress, title="[ Jobs Progress ]", border_style="green"))

Refreshing with Live

Rich uses Live to refresh controls such as Progress. The controls loaded above are redrawn periodically through Live.

How a Live refresh works

A Live refresh starts from the root control, visits its children one by one, and redraws each according to its current data and state.

This logic is encapsulated inside Live. When auto-refresh is enabled, Live starts a timer thread that triggers the redraw periodically.

Rich ultimately writes to stdout or to a custom file. If several Live instances refreshed the screen at the same time, the output would inevitably be scrambled, so Rich does not allow more than one Live to be active at once.

I also found that Rich behaves differently across operating systems: on Linux the output is richer and smoother than on Windows.

Manual Live refresh

Back to my photo copying: traversing the files is the main thread's job, copying them is the worker threads' job, and refreshing is the Live thread's job. The main thing displayed on screen is the copy progress.

Because Live and the copy threads know nothing about each other, copy progress cannot be pushed to Live so that the interface refreshes in real time. Yet giving the copy threads the ability to trigger a Live refresh would clash with Live's timed refresh mechanism.

Fortunately, Live has a switch to turn off the timed refresh (auto_refresh=False). So without invasive changes to Rich's source code, simply wrapping the relevant Live operations lets you refresh the interface manually, without the timer thread.

def show(self):
    # auto_refresh=False: no timer thread, every redraw is triggered by update()
    self.live = Live(self.layout, refresh_per_second=1, screen=True, auto_refresh=False)
    #self.live = Live(self.layout, refresh_per_second=4, screen=True)

    self.live.start(True)

def update(self):
    with self._lock:
        if(self.live is not None):
            # With auto_refresh off, update() only redraws when refresh=True
            self.live.update(self.layout, refresh=True)

def hiden(self):
    if(self.live is not None):
        self.live.stop()

When to refresh

In principle the refresh can be called at any time, but to make the calls tidier I wrapped the refresh logic in the thread module task.py. In short, a worker thread can call back into the Live refresh interface whenever necessary.
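The callback idea can be sketched as follows, assuming the rich package is installed. The class name ManualDisplay, its methods, and the in-memory console are made up for the demo and are not the tool's task.py; the point is only that worker threads trigger the redraw themselves while a lock serializes them.

```python
import threading
from io import StringIO

from rich.console import Console
from rich.live import Live
from rich.text import Text


class ManualDisplay:
    """Wraps a Live whose timer thread is disabled; callers trigger every redraw."""

    def __init__(self, console):
        self._lock = threading.Lock()
        self._copied = 0
        self._live = Live(
            Text("0 files copied"),
            console=console,
            auto_refresh=False,      # no timed refresh thread
            redirect_stdout=False,
            redirect_stderr=False,
        )

    def start(self):
        self._live.start(refresh=True)

    def file_copied(self):
        # Called from worker threads: the lock keeps count and redraw in step
        with self._lock:
            self._copied += 1
            self._live.update(Text(f"{self._copied} files copied"), refresh=True)

    def stop(self):
        self._live.stop()


# In-memory console so the demo leaves the real terminal alone
console = Console(file=StringIO(), force_terminal=True, width=40)
display = ManualDisplay(console)
display.start()
workers = [threading.Thread(target=display.file_copied) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
display.stop()
output = console.file.getvalue()
```

In a real run the console would be the terminal, and each copy thread would call file_copied (or an equivalent callback) when it finishes a file.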

The final result

Although it looks fine, the code still has some bugs: the refresh is not smooth enough, and the progress bar sometimes jumps. If any kind soul finds bugs, modifications are welcome.

The source code is open source at Gitee