Scan Text [Date/Amount] - no results

Using several different statements (utility, bank, investments, etc.), I’ve attempted to use Scan Text [Date or Amount] with a string qualifier followed by an asterisk to capture the appropriate information, and it’s always been unsuccessful. The only time that I’ve managed to get any scanned information is when I use Scan Text [String], and that usually results in all of the text on the page.

For instance, using the included image of my energy bill, if I use Scan Text [Date] with Due Date:*, I get nothing. The same applies to Total Amount Due.

Are documents of this type simply too complicated for the scanner to work on? Any other suggestions to obtain what I seek?

A screenshot of the smart rule would be useful. In addition, did you check the plain text of the document (e.g. after using Data > Convert)? Maybe the text layer is not the expected one.

@cgrunenberg, the document was supposedly converted to “PDF+Text”; however, when I attempt a Find in the document for any text, I get no results. When I attempt Convert -> to searchable PDF, I receive the prompt “Are you sure you want to convert this searchable PDF again?”

Thanks for your assistance, as usual! I appreciate you & Jim.


Just as an additional check: what do you get when you convert this file to plain text?

A bunch of garbage. No readable text.

99A A ?
! "# $ "% ! # & % '( ! " ) $ * $ %
%# $
$* $%
5* & 6% )
%#% &*$ (!# 5 *&6 %)
+

! #1 *-
*$ "& 1 *- * 5! #%:
; ; ; 3 6% 0 !
;;; 3 6% 0!

[... roughly 300 more lines of similarly unreadable characters ...]

Well, that seems to indicate that OCR was not really successful. I suppose that you still have the original statement that you received from PG&E (and shouldn’t they be bloody big enough to send out PDFs with a text layer?). In that case, you could try to open it in Preview and select some arbitrary text. If that works, try feeding this document to your rule…
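For anyone who wants to check a text layer without eyeballing it, here is a minimal Python sketch of a readability heuristic. It is not part of DEVONthink; the function name `looks_like_readable_text` and the 0.5 threshold are assumptions chosen for illustration. It simply asks what share of the tokens in the plain text contain a run of at least three letters.

```python
import re


def looks_like_readable_text(text: str, min_word_ratio: float = 0.5) -> bool:
    """Return True if enough whitespace-separated tokens look like words.

    Hypothetical helper: the threshold is a guess, not a tested constant.
    """
    tokens = text.split()
    if not tokens:
        return False
    # A token counts as word-like if it contains three or more letters in a row.
    wordlike = sum(1 for token in tokens if re.search(r"[A-Za-z]{3,}", token))
    return wordlike / len(tokens) >= min_word_ratio


# A few lines of the failed conversion versus a few lines of a good one.
garbled = '! "# $ "% ! # & % \'( ! " ) $ * $ %\n%# $\n$* $%'
readable = "ENERGY STATEMENT\nAccount No: Statement Date: Due Date:"

print(looks_like_readable_text(garbled))
print(looks_like_readable_text(readable))
```

If the check fails on the plain text of a record, the document probably needs to be OCR'd again before any Scan Text rule can find anything in it.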


They’re a completely incompetent company, so no surprises about them not sending out layered PDFs.

I agree that the PDF wasn’t OCR’d properly, and will try it again, and see if that works any better.

Cheers!

What do you have set for primary and secondary languages in Preferences > OCR?

Hey Jim, I have English set as primary, and nothing as secondary. I just re-OCR’d it, and now a Find returns matching text. I’ll see what the Rule does.

So when I vary the Rule between String and Date, using the same search string, I either get nothing or I get everything:

Using String with "Due Date:*"

Using Date with "Due Date:*"

Using String with "Total Amount Due by*"

Using Date with "Total Amount Due by*"

So now apparently there is text in the OCRd file to be found.
If so, it should be possible to find the amount with a script accessing the plaintext property of the record.

However, I don’t see an easy way to figure out which amount you’re interested in given the amount of amounts in this document.

I usually go for the biggest number and that works in most cases. But here, you’ll need some other way to decide what you need.
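To make the difference between the two strategies concrete, here is a small Python sketch that works on a plain string. It is not a DEVONthink script and does not read the record's plaintext property; the sample text is a hand-made stand-in for such plain text, and the names `find_amounts` and `amount_after_label` are hypothetical. It compares "take the biggest number" with "take the first amount that follows a label".

```python
import re
from decimal import Decimal
from typing import List, Optional

# Optional minus sign, optional dollar sign, digits with optional thousands
# separators, and exactly two decimals. Dates and phone numbers do not match.
AMOUNT = re.compile(r"(-?)\$?(\d[\d,]*\.\d{2})")


def find_amounts(text: str) -> List[Decimal]:
    """Return every amount in the text, in reading order."""
    return [Decimal(sign + digits.replace(",", ""))
            for sign, digits in AMOUNT.findall(text)]


def amount_after_label(text: str, label: str) -> Optional[Decimal]:
    """Return the first amount that appears after the label, or None."""
    start = text.find(label)
    if start == -1:
        return None
    match = AMOUNT.search(text, start + len(label))
    if match is None:
        return None
    sign, digits = match.groups()
    return Decimal(sign + digits.replace(",", ""))


# Stand-in for the plain text of a statement (assumed layout).
sample = """Account No: Statement Date: Due Date:
XXXXXXXXXXX-X 10/18/2021 11/08/2021
$197.69 -197.69
$0.00
$119.88 -17.20 33.63 14.02
Account Number: Due Date:
XXXXXXXXXXX-X 11/08/2021
Total Amount Due:
$150.33
Amount Enclosed: $
"""

print(max(find_amounts(sample)))                        # biggest number
print(amount_after_label(sample, "Total Amount Due:"))  # amount after label
```

On a statement like this one, the biggest number is the previous statement's balance rather than the amount now due, so anchoring on a label is the safer choice when the layout is known.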

How does the complete plain text conversion of the document look now?

@cgrunenberg, it’s much better as it’s real, readable text. Seeing this, I now may be able to format my query so that it obtains the needed information.

ENERGY STATEMENT
Account No: Statement Date: Due Date:
XXXXXXXXXXX-X 10/18/2021 11/08/2021
$197.69 -197.69
$0.00
$119.88 -17.20 33.63 14.02
Daily Usage Comparison
3 www.pge.com/MyEnergy Service For:
Questions about your bill?
Monday-Friday 7 a.m.-9 p.m. Saturday 8 a.m.-6 p.m. Phone: 1-800-743-5000 www.pge.com/MyEnergy
Ways To Pay www.pge.com/waystopay
Your Account Summary
Please return this portion with your payment. No staples or paper clips. Do not fold. Thank you.

Account Number: Due Date:
XXXXXXXXXXX-X 11/08/2021
Total Amount Due:
$150.33
Amount Enclosed: $
Amount Due on Previous Statement Payment(s) Received Since Last Statement
Previous Unpaid Balance
Current PG&E Electric Delivery Charges
Electric Adjustments
Peninsula Clean Energy Electric Generation Charges Current Gas Charges
Total Amount Due by 11/08/2021

I’ve played around with the “Scan Text” after looking at the plain text conversion, and have had a little luck. If I do Scan Text [Date] "XXXXXXXXXXX-X*", I’ll get a hit for the date 10/18/2021.

However, if I do Scan Text [Amount] "Total Amount Due:*", I get a hit on “33.63” which occurs on the 6th line of text and not the 19th like I’d hoped, which has “$150.33”

Insert a line break (by pressing Alt-Return) between Total Amount Due: and *. Then it should work, as scanning for amounts/dates is limited to lines.
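The idea of a line-limited scan can be sketched in Python as follows. This is only an illustration of the principle, not DEVONthink's implementation, and it does not reproduce the 33.63 hit reported earlier in the thread; the function names, the regular expressions, and the sample text are all assumptions. One function looks for a value on the same line as the label, the other on the line that follows it.

```python
import re
from typing import Optional, Pattern

AMOUNT = re.compile(r"-?\$?\d[\d,]*\.\d{2}")
DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def scan_same_line(text: str, label: str, pattern: Pattern[str]) -> Optional[str]:
    """Return the first match that follows the label on the same line."""
    for line in text.splitlines():
        index = line.find(label)
        if index >= 0:
            match = pattern.search(line, index + len(label))
            if match:
                return match.group(0)
    return None


def scan_next_line(text: str, label: str, pattern: Pattern[str]) -> Optional[str]:
    """Return the first match on the line directly after a line with the label."""
    lines = text.splitlines()
    for line, following in zip(lines, lines[1:]):
        if label in line:
            match = pattern.search(following)
            if match:
                return match.group(0)
    return None


# Stand-in for part of the converted statement (assumed layout).
sample = """Account Number: Due Date:
XXXXXXXXXXX-X 11/08/2021
Total Amount Due:
$150.33
Amount Enclosed: $
Total Amount Due by 11/08/2021
"""

print(scan_same_line(sample, "Total Amount Due:", AMOUNT))   # nothing on that line
print(scan_next_line(sample, "Total Amount Due:", AMOUNT))   # value on next line
print(scan_same_line(sample, "Total Amount Due by", DATE))   # value on same line
```

The label and its value sit on different lines in this statement, which is why a pattern that spans the line break is needed for the amount, while the "Total Amount Due by" date can be picked up from a single line.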

This worked like a charm! Thank you so very much!
